
Copyright © 1992 by the Genetics Society of America

Analysis of Molecular Variance Inferred From Metric Distances Among DNA Haplotypes: Application to Human Mitochondrial DNA Restriction Data

Laurent Excoffier,*,† Peter E. Smouse* and Joseph M. Quattro*,‡

*Center for Theoretical and Applied Genetics (CTAG), Cook College, Rutgers University, New Brunswick, New Jersey 08903-0231, †Department of Anthropology and Ecology, University of Geneva, 1227 Carouge, Switzerland, and ‡Department of Biological Sciences, Hopkins Marine Station, Stanford University, Pacific Grove, California 93950
Manuscript received November 1, 1991
Accepted for publication February 10, 1992

We present here a framework for the study of molecular variation within a single species. Information on DNA haplotype divergence is incorporated into an analysis of variance format, derived from a matrix of squared-distances among all pairs of haplotypes. This analysis of molecular variance (AMOVA) produces estimates of variance components and F-statistic analogs, designated here as Φ-statistics, reflecting the correlation of haplotypic diversity at different levels of hierarchical subdivision. The method is flexible enough to accommodate several alternative input matrices, corresponding to different types of molecular data, as well as different types of evolutionary assumptions, without modifying the basic structure of the analysis. The significance of the variance components and Φ-statistics is tested using a permutational approach, eliminating the normality assumption that is conventional for analysis of variance but inappropriate for molecular data. Application of AMOVA to human mitochondrial DNA haplotype data shows that population subdivisions are better resolved when some measure of molecular differences among haplotypes is introduced into the analysis. At the intraspecific level, however, the additional information provided by knowing the exact phylogenetic relations among haplotypes or by a nonlinear translation of restriction-site change into nucleotide diversity does not significantly modify the inferred population genetic structure. Monte Carlo studies show that site sampling does not fundamentally affect the significance of the molecular variance components. The AMOVA treatment is easily extended in several different directions and it constitutes a coherent and flexible framework for the statistical analysis of molecular data.

OUR knowledge of population genetic diversity has improved considerably over the last decade, with the application of molecular techniques to evolutionary studies. Quantitative resolution has improved as larger numbers of haplotypic markers are defined within each sample. Moreover, information on the degree of divergence between alleles/restriction haplotypes/DNA sequences has become available. Whenever we can make mutational or recombinational assumptions about the relationships among haplotypes, phylogenetic reconstruction algorithms are available to characterize evolutionary relationships more precisely (see reviews by FELSENSTEIN 1988; SWOFFORD and OLSEN 1990).

Although no precise analytic model for the full population distribution of molecular differences among a set of interconnected haplotypes is known, the expected mean number of site differences between sets of panmictic (WATTERSON 1975) and subdivided (SLATKIN 1987) populations has been derived under simple assumptions. When a species exhibits subdivision, we expect both increased haplotypic diversity and a larger number of segregating sites for genomes sampled from different demes (SLATKIN 1987). The use of information on the molecular connection of DNA haplotypes should be valuable in population genetic analyses.

Population genetic structure within a species has traditionally been studied using departures of allele frequencies from panmictic expectations. Several estimation procedures related to WRIGHT's (1951, 1965) F-statistics have been proposed for the treatment of polymorphic systems (COCKERHAM 1969, 1973; NEI 1977; WEIR and COCKERHAM 1984; LONG 1986). A few studies have tried to translate information on DNA restriction endonuclease haplotypes into estimates of the magnitude of intraspecific subdivision. LYNCH and CREASE (1990), using a phylogeny of haplotypes, provide estimates of the variance of nucleotide diversity for different sampling processes. TAKAHATA and PALUMBI (1985) compute the fraction of nucleotide diversity due to interpopulation genetic differences and provide an analogue of NEI's (1973) coefficient of gene differentiation ($G_{ST}$). Both methods involve nonlinear transformation of the original data set into estimates of genetic diversity. Several
Genetics 131: 479-491 (June, 1992)
480 L. Excoffier, P. E. Smouse and J. M. Quattro
assumptions on the underlying evolution of the molecule are required, assumptions that are neither always met nor generally verifiable. We need a more general methodology that does not depend so critically on the specific assumptions.

Our purpose here is to design an alternative methodology that makes use of the available molecular information gathered in population surveys, while remaining flexible enough to accommodate different types of assumptions about the evolution of the genetic system. We propose to extend the work of COCKERHAM (1973), LONG (1986) and LONG, SMOUSE and WOOD (1987) on allelic correlations among demes to a comparable analysis of haplotypic diversity. Using the fact that a conventional sum of squares (SS) may be written as the sum of squared differences between all pairs of observations (LI 1976), we construct a hierarchical analysis of molecular variance directly from the matrix of squared-distances between all pairs of haplotypes. Beyond its clear relation to an analysis of variance, the method has the additional advantage that several different assumptions can be imposed on the haplotype differentiation process, each of which translates into a different distance matrix, with no change in the structure of the subsequent analysis. When all interhaplotypic distances are presumed equal, the analysis is tantamount to a multiallelic (multivariate) analysis of variance (see WEIR and COCKERHAM [...] 1987). Alternatively, we can use the mean number of restriction site differences, patristic distances along a given network, or nucleotide diversity as measures of interhaplotypic distances.

We illustrate with an analysis of human mitochondrial DNA (mtDNA) restriction site data, performing a nested analysis of molecular variance on five regional collections, each represented by two different populations. The hierarchical model employs "Within Populations" (WP), "Among Populations/Within Groups" (AP/WG), and "Among Groups" (AG) components of diversity. To illustrate the impact of different sets of assumptions concerning the origins of the haplotypic variants, we employ alternative distance metrics to examine the amount and pattern of genetic subdivision. We use permutational procedures on the original interindividual squared distance matrix to provide significance tests for each of the hierarchical variance components and related F-statistic analogues. We also study the importance of site choice on the significance of the different statistics, using resampling techniques (EFRON 1982).

FIGURE 1. Example of the steps involved in the computations of the boolean vectors $^*\mathbf{p}_j$ of mutational events. Each haplotype ($h_j$) is first translated into a boolean vector ($\mathbf{p}_j$) of presence or absence of restriction sites. The second step involves the construction of a phylogenetic network, where each haplotype is linked by a single or a series of mutational events to all the other haplotypes through a unique pathway. The final step is the coding of each haplotype ($h_j$) as a boolean vector ($^*\mathbf{p}_j$) of occurrences of mutational events ($m_i$) from a given haplotype chosen as a reference to $h_j$. In this example, haplotype 2 has been chosen as the reference.

[...] haplotypes: We assume that restriction analysis has been performed on a non-recombining DNA segment. For N individuals assayed with a standard set of restriction enzymes, S polymorphic restriction sites are identified. A restriction haplotype ($h$), defined as the combination of presence or absence of the various restriction sites, may be considered as an S-dimensional boolean vector of the form

$$\mathbf{p}' = [p_1\; p_2\; p_3\; p_4\; \cdots\; p_S], \tag{1}$$

where $p_s = 1$ if $h$ is cut at site $s$, and $p_s = 0$ if it is not (upper right, Figure 1). The difference between two haplotypes $h_j$ and $h_k$ is then defined as $(\mathbf{p}_j - \mathbf{p}_k)$,

$$(\mathbf{p}_j - \mathbf{p}_k)' = [(p_{1j} - p_{1k})\;(p_{2j} - p_{2k})\;\cdots\;(p_{Sj} - p_{Sk})]. \tag{2}$$

Each polymorphic site contributes additional information, without necessarily being evolutionarily independent. We define a Euclidean distance metric ($\delta_{jk}^2$) between haplotypes $h_j$ and $h_k$ as

$$\delta_{jk}^2 = (\mathbf{p}_j - \mathbf{p}_k)'\,\mathbf{W}\,(\mathbf{p}_j - \mathbf{p}_k), \tag{3a}$$

where $\mathbf{W}$ is a matrix of differential weights for the various sites. The weight matrix $\mathbf{W}$ takes any of several forms, depending upon how we wish to use ancillary information. If all sites are assumed independent and equally informative, $\mathbf{W} = \mathbf{I}$, the identity matrix, and the distance metric is equal to the number of restriction-site differences. This Euclidean metric is commonly employed for population differences (NEI and TAJIMA 1981), but it may be used just as easily for differences between single haplotypes. In the case where $\mathbf{W}$ is diagonal, $\mathbf{W} = \mathrm{diag}\{w_s^2\}$, weighting sites differentially but treating them as independent, Equation 3a can be rewritten as

$$\delta_{jk}^2 = \sum_{s=1}^{S} w_s^2\,(p_{sj} - p_{sk})^2. \tag{3b}$$

The rest of the analysis does not depend on which particular form of $\mathbf{W}$ has been chosen; we will assume that the weight matrix has been set in advance, returning to the definition of metrics and the choice of $\mathbf{W}$ for the human illustration.

Evolutionary distances between haplotypes: The DNA haplotypes can sometimes be related mutationally and arranged into a network (see Figure 1). We may then use the number of mutations along the network as a measure of evolutionary divergence between any two haplotypes. Network distance does not generally equal phenetic distance, either because of homoplasy (convergent site changes or reverse mutations) or because the translation from the changes we see to those we infer is nonlinear (e.g., TAKAHATA and PALUMBI 1985; LYNCH and CREASE 1990). We can always modify the definition of the $\mathbf{p}$'s and $\mathbf{W}$ in such a way that we can apply Equation 3a to provide an evolutionary distance. We define a given haplotype as a vector ($^*\mathbf{p}$) of independent and unique mutational events, described sequentially from any fixed position in the network (lower left, Figure 1), rather than as a vector of restriction-site presence or absence indicators (upper right, Figure 1). If M mutational events are recognized, each haplotype is defined as a vector of dimension $M \ge S$. Our evolutionary distance metric becomes

$$^*\delta_{jk}^2 = ({}^*\mathbf{p}_j - {}^*\mathbf{p}_k)'\,\mathbf{W}\,({}^*\mathbf{p}_j - {}^*\mathbf{p}_k). \tag{3c}$$

In the absence of homoplasy and keeping $\mathbf{W}$ constant, $^*\delta_{jk}^2$ is the same as $\delta_{jk}^2$. The important point is that once a metric has been set, the following is general.

Partitioning a distance matrix into hierarchical components: Our application will concern mtDNA data. Consider a haploid genetic system where interhaplotypic distances are identical to distances between individuals. We can arrange a set of N individuals from I populations into a distance matrix, $\mathbf{D}^2$, partitioned into a series of submatrices corresponding to particular subdivisions:

$$\mathbf{D}^2 = \begin{bmatrix} \mathbf{D}_{11}^2 & \mathbf{D}_{12}^2 & \cdots & \mathbf{D}_{1I}^2 \\ \mathbf{D}_{21}^2 & \mathbf{D}_{22}^2 & \cdots & \mathbf{D}_{2I}^2 \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{D}_{I1}^2 & \mathbf{D}_{I2}^2 & \cdots & \mathbf{D}_{II}^2 \end{bmatrix}, \tag{4}$$

where the elements of the block-diagonal submatrices $\mathbf{D}_{ii}^2$ contain pairwise squared-distances ($\delta_{jk}^2$) between individuals of the same (ith) population, and those of the off-diagonal matrix blocks $\mathbf{D}_{ii'}^2$ contain pairwise squared-distances between individuals, one from the ith and the other from the i'th population. Individuals may also be grouped at higher levels, according to such non-genetic criteria as geography, ecological environment, or language.

A conventional sum of squares [SS(Total)] may be written, barring a constant (2N), as the sum of squared differences between all pairs of N items (LI 1976). In the multidimensional case, using vectors instead of scalars, the conventional sum of squares becomes a sum of squared deviations (SSD) from the centroid of a multidimensional space. Thus,

$$\mathrm{SSD(Total)} = \sum_{j=1}^{N} (\mathbf{x}_j - \bar{\mathbf{x}})'\,\mathbf{W}\,(\mathbf{x}_j - \bar{\mathbf{x}}) = \frac{1}{N}\sum_{j=1}^{N-1}\sum_{k>j} (\mathbf{x}_j - \mathbf{x}_k)'\,\mathbf{W}\,(\mathbf{x}_j - \mathbf{x}_k), \tag{5a}$$

or

$$\mathrm{SSD(Total)} = \frac{1}{N}\sum_{j=1}^{N-1}\sum_{k>j} \delta_{jk}^2 = \frac{1}{2N}\sum_{j=1}^{N}\sum_{k=1}^{N} \delta_{jk}^2, \tag{5b}$$

because $\delta_{jj}^2 = 0$ for all haplotypes $h_j$. This transformation applies equally to the total array of individuals in the data set, to those within each population separately (within the diagonal blocks, $\mathbf{D}_{ii}^2$), and to those belonging to a particular subdivision (within the diagonal blocks, $\mathbf{D}_{11}^2$, $\mathbf{D}_{12}^2$, $\mathbf{D}_{21}^2$, and $\mathbf{D}_{22}^2$).

Where individuals are arranged into populations and populations nested within groups defined a priori on nongenetic criteria, we employed a linear model on the pattern first described by COCKERHAM (1969, 1973) and refined upon by others (WEIR and COCKERHAM 1984; LONG 1986)

$$\mathbf{p}_{jig} = \mathbf{p} + \mathbf{a}_g + \mathbf{b}_{ig} + \mathbf{c}_{jig}, \tag{6}$$

where $\mathbf{p}_{jig}$ indexes the jth chromosome, here equivalent to the jth individual ($j = 1, \ldots, N_{ig}$) in the ith population ($i = 1, \ldots, I_g$) in the gth group ($g = 1, \ldots, G$), and $\mathbf{p}$ is the unknown expectation of $\mathbf{p}_{jig}$, averaged over the whole study. The effects are $\mathbf{a}$ for group, $\mathbf{b}$ for populations and $\mathbf{c}$ for individuals within populations. The effects are assumed to be additive, random, uncorrelated, and to have the associated variance components (expected squared deviations) $\sigma_a^2$, $\sigma_b^2$, and $\sigma_c^2$, respectively.

Relying on the standard decomposition, we note that for any choice of hierarchical partition of the N individuals into strata, we can write

$$\mathrm{SSD(Total)} = \mathrm{SSD(Among\ Strata)} + \mathrm{SSD(Within\ Strata)}, \tag{7}$$

placing us in the traditional analysis of variance framework, designated here as Analysis of Molecular Variance, AMOVA (Table 1). For illustration, we shall partition the total sum of squared deviations, SSD(Total), into components for variation within populations, SSD(WP), variation among populations within regional groups, SSD(AP/WG), and variation among regional groups, SSD(AG)

$$\mathrm{SSD(WP)} = \sum_{g=1}^{G}\sum_{i=1}^{I_g} \frac{\sum_{j=1}^{N_{ig}}\sum_{k=1}^{N_{ig}} \delta_{jk}^2}{2N_{ig}}, \tag{8a}$$

$$\mathrm{SSD(AP/WG)} = \sum_{g=1}^{G}\left[\frac{\sum_{j=1}^{N_g}\sum_{k=1}^{N_g} \delta_{jk}^2}{2N_g} - \sum_{i=1}^{I_g} \frac{\sum_{j=1}^{N_{ig}}\sum_{k=1}^{N_{ig}} \delta_{jk}^2}{2N_{ig}}\right], \tag{8b}$$

$$\mathrm{SSD(AG)} = \frac{\sum_{j=1}^{N}\sum_{k=1}^{N} \delta_{jk}^2}{2N} - \sum_{g=1}^{G} \frac{\sum_{j=1}^{N_g}\sum_{k=1}^{N_g} \delta_{jk}^2}{2N_g}, \tag{8c}$$

where $N_g$ is the number of individuals in the gth group. The corresponding mean squared deviations (MSD) are obtained by dividing each SSD by the appropriate degrees of freedom, as reported in Table 1. The n-coefficients in Table 1 represent the average sample sizes of particular hierarchical levels, allowing for unequal sample sizes [...].

The variance components ($\sigma^2$'s) of each hierarchical level are extracted by equating the mean squares (MSDs) to their expectations. The structure of the analysis is that described for F-statistics (COCKERHAM 1969, 1973), but it allows for the haploid transmission of mitochondrial genomes. It may also be useful to employ an analogous array of haplotypic correlation measures, which we shall term Φ-statistics to avoid confusion. Following COCKERHAM's lead, we have

$$\sigma_c^2 = (1 - \Phi_{ST})\,\sigma^2, \qquad \sigma_b^2 = (\Phi_{ST} - \Phi_{CT})\,\sigma^2, \qquad \sigma_a^2 = \Phi_{CT}\,\sigma^2, \tag{10a}$$

where $\sigma^2 = \sigma_a^2 + \sigma_b^2 + \sigma_c^2$. $\Phi_{ST}$ is viewed as the correlation of random haplotypes within populations, relative to that of random pairs of haplotypes drawn from the whole species; $\Phi_{CT}$ as the correlation of random haplotypes within a group of populations, relative to that of random pairs of haplotypes drawn from the whole species; and $\Phi_{SC}$ as the correlation of the molecular diversity of random haplotypes within populations, relative to that of random pairs of haplotypes drawn from the region. Still following the analogy, we rewrite the equations (10a) in terms of the Φ-statistics

$$\Phi_{ST} = \frac{\sigma_a^2 + \sigma_b^2}{\sigma^2}, \qquad \Phi_{CT} = \frac{\sigma_a^2}{\sigma^2}, \qquad \Phi_{SC} = \frac{\sigma_b^2}{\sigma_b^2 + \sigma_c^2}. \tag{10b}$$

We shall not require it for mtDNA, but for the case of diploid genetic systems, the procedure employs within-individual haplotypic diversity as an additional level, following COCKERHAM (1973) and LONG (1986) exactly. The only difficulty is that DNA haplotype diversity within nuclear genes is often assayed from homozygous individuals, to avoid confusion over linkage phase in multisite heterozygotes. If one cannot avoid the resulting sampling biases, one should probably avoid the within-individual level of the hierarchy.

The precise assumptions of the F-statistics treatment (random sampling to create the initial subdivisions at each level, pure drift and no migration) are almost never met in natural populations. The same comments apply to the Φ-statistics. Proper caution is necessary when interpreting these coefficients, but we may nevertheless view them as convenient summarizations of the packaging of genetic information within and among populations, being one for one with the variance components.

Testing significance of the variance components and Φ-statistics: Considerable discussion has emerged over which method to use for testing the significance of the variance-components (WEIR and COCKERHAM 1984; LONG 1986; ZHIVOTOVSKY 1988). The method requiring the fewest assumptions is permutational analysis of the null distribution for each variance-component. Under the null hypothesis, samples are considered as drawn from a global population, with variation due to random sampling in the construction of populations. To obtain a null distribution, we allo-
TABLE 1
General design for hierarchical analysis of molecular variance (AMOVA)

Source of variation                      d.f.                MSD           Expected MSD
Among regions                            $G - 1$             MSD(AG)       $\sigma_c^2 + n'\sigma_b^2 + n''\sigma_a^2$
Among populations within regions         $\sum_g I_g - G$    MSD(AP/WG)    $\sigma_c^2 + n\sigma_b^2$
Among individuals within populations     $N - \sum_g I_g$    MSD(WP)       $\sigma_c^2$
Total                                    $N - 1$
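
The design of Table 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' program: the function names `ssd` and `amova_components` are hypothetical, population labels are assumed to be unique across groups, and the n-coefficients are the usual average sample sizes of an unbalanced nested design, which is an assumption because their formulas are not legible in this copy. The toy data are invented scalar measurements, for which the squared distance is simply the squared difference.

```python
import numpy as np

def ssd(block):
    """SSD of one stratum: sum over all ordered pairs divided by 2 * size."""
    return block.sum() / (2.0 * block.shape[0])

def amova_components(d2, pop, grp):
    """Variance components from an interindividual squared-distance matrix.

    d2  : (N, N) symmetric matrix of squared distances
    pop : population label per individual (assumed unique across groups)
    grp : regional-group label per individual
    """
    d2, pop, grp = np.asarray(d2, float), np.asarray(pop), np.asarray(grp)
    N, pops, grps = len(pop), np.unique(pop), np.unique(grp)
    P, G = len(pops), len(grps)

    def within(labels, values):
        # Sum of the SSDs of the diagonal blocks defined by `labels`.
        return sum(ssd(d2[np.ix_(labels == v, labels == v)]) for v in values)

    ssd_wp = within(pop, pops)          # within populations
    ssd_wg = within(grp, grps)          # within groups
    ssd_apwg = ssd_wg - ssd_wp          # among populations / within groups
    ssd_ag = ssd(d2) - ssd_wg           # among groups

    msd_ag = ssd_ag / (G - 1)
    msd_apwg = ssd_apwg / (P - G)
    msd_wp = ssd_wp / (N - P)

    # ASSUMPTION: standard n-coefficients for an unbalanced nested design.
    n_pop = np.array([(pop == p).sum() for p in pops], float)
    n_grp = np.array([(grp == g).sum() for g in grps], float)
    grp_size_of_pop = np.array(
        [(grp == grp[pop == p][0]).sum() for p in pops], float)
    s = (n_pop ** 2 / grp_size_of_pop).sum()
    n = (N - s) / (P - G)
    n1 = (s - (n_pop ** 2).sum() / N) / (G - 1)
    n2 = (N - (n_grp ** 2).sum() / N) / (G - 1)

    # Equate each MSD to its expectation and solve from the bottom up.
    sigma2_c = msd_wp
    sigma2_b = (msd_apwg - sigma2_c) / n
    sigma2_a = (msd_ag - sigma2_c - n1 * sigma2_b) / n2
    return {"sigma2_a": sigma2_a, "sigma2_b": sigma2_b, "sigma2_c": sigma2_c}

# Toy example: 2 groups x 2 populations x 2 individuals, scalar data.
x = np.array([0.0, 1.0, 2.0, 3.0, 10.0, 11.0, 12.0, 13.0])
d2 = (x[:, None] - x[None, :]) ** 2
comp = amova_components(d2, pop=[0, 0, 1, 1, 2, 2, 3, 3],
                        grp=[0, 0, 0, 0, 1, 1, 1, 1])
print(comp)
```

In the balanced toy design the three n-coefficients reduce to the population size (2), the population size (2) and the group size (4), so the result can be checked by hand against an ordinary nested analysis of variance.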

cate each individual to a randomly chosen population, while holding sample sizes constant at the realized values. This amounts to random permutation of the rows (and corresponding columns) of the squared-distance matrix (MANTEL 1967). The variance-components are estimated from each of a large number (say 1000) of permuted matrices. We use this procedure to obtain the null distribution and to test for the significance of $\Phi_{ST}$ and $\sigma_c^2$.

Two other permutation schemes are useful. The first assumes that the regions are real but that the populations within them are not, permuting individuals within regional groups without regard to population, a procedure used to obtain the null distributions of $\Phi_{SC}$ and $\sigma_b^2$. The second assumes that while the populations are real, the regional groupings are artificial, permuting whole populations across groups. In this case, the sizes of the groups (but not those of the populations) vary with each permutational run. This randomization scheme is used to obtain the null distribution of $\Phi_{CT}$ and $\sigma_a^2$.

Restriction site sampling: The sampling of nucleotides has been shown to be a major source of variability for estimates of molecular diversity (LYNCH and CREASE 1990). One can legitimately ask whether the results of our study depend on the particular array of restriction sites employed. We examine the influence of site sampling on the genetic structure of the populations, using a site resampling plan similar to the bootstrap used by EFRON (1982). Under the assumption that the observed 62 sites are representative of all potential mtDNA sites, we obtain the distribution of the variance components and associated Φ-statistics by Monte Carlo simulation, using 500 random collections of sites. For each collection, the procedure is as follows: (a) Draw a given number of sites from the observed array of 62 sites, at random and with replacement. Given the choice of sites, the haplotype of each individual is then taken as the combination of the original states of those randomly chosen sites; (b) compute interhaplotypic distances on the basis of the newly defined haplotypes and perform an AMOVA analysis. The distances are simply computed from Equation 3b, with all $w_s^2$ equal to 1; and (c) permute the matrix 500 times, and test the significance of the different statistics with the previously described procedures.

ILLUSTRATION WITH HUMAN mtDNA HAPLOTYPES

Due to its high relative mutation rate (BROWN, GEORGE and WILSON 1979; BROWN et al. 1982), mtDNA presents many distinct haplotypes in different demes. Prevailing maternal transmission in mammals (GILES et al. 1980; GYLLENSTEN et al. 1991) favors higher levels of population subdivision than is true for nuclear DNA markers (BIRKY, MARUYAMA, and ...). Barring migration, these two effects should produce increasingly non-overlapping sets of restriction haplotypes as divergence time between populations increases (WATTERSON 1985). Both of these features are evident in human mtDNA, which is small (16,569 bp, ANDERSON et al. 1981), rapidly evolving, and apparently free of recombination.

Restriction haplotypes of human mtDNA have been sampled from a substantial number of populations (for a review of the two main data sets, see EXCOFFIER 1990; STONEKING et al. 1990). Our purpose is to illustrate the methodology described above, rather than to reopen the question of human origins raised elsewhere (CANN, STONEKING and WILSON 1987; EXCOFFIER and LANGANEY 1989; EXCOFFIER 1990; STONEKING et al. 1990). We consider here ten populations for which ample data are available in the literature (Table 2). These particular populations were chosen to represent five "regional groups" of two populations each (Figure 2). The samples have also been analyzed for polymorphism with the five restriction enzymes most commonly used in human studies, BamHI: GGATCC, HpaI: GTTAAC, HaeII: (A/G)GCGC(T/C), AvaII: GG(T/A)CC, and MspI: CCGG. Among the 672 mtDNAs assayed from these ten populations, 34 of 62 recognizable sites were found to be polymorphic.

In a sample of 672, we cannot expect to see all $2^{34}$ possible haplotypes, but sample size considerations aside, the absence of recombination practically guarantees large amounts of disequilibrium among the 34
TABLE 2
Haplotypic composition of the population samples by region

Sample No.  Sample  Reference  Sample size  Haplotype frequencies(a)

1 89 13 28 47 48 4950515253 54
1 Tharu BREGA et al. (1986) 91 4 8 2 5 2 3 2 2 2 1 1 1 2 1 1
61 12
13 27 28 29
2 Oriental JOHNSON et al. (1983) 46 3 2 1 2 4 2 2 1 1 1
West Africa
1 27 10 273952 64 65 6667 68 71
3 Wolof SCOZZARI et al. (1988) 110 23 3 9 2 9 2 2 5 2 2 I I 1 2 I
1 2 6 8 34
39 69
4 Peul SCOZZARI et al. (1988) 47 11 1 9
2 2 1 1 1
1 639 46
... GARRISON and KNOWLER (1985) 63 59 2 1 1
1 47 95
6 Maya SCHURR et al. (1990) 37 30 4 3
1611 18 21 38478283
7 Finnish VILKKI, SAVONTAUS and ... (1988) 110 8 7 2 4 3 8 2 2 1 1
126 18 21 23 34 42 47 56577273 75 7677
8 Sicilian SEMINO et al. (1989) 90 5 0 3 9 1 1 1 1 1 1 5 1 2 1 1 1 1 1
1611 17 22 36 373839
9 Israeli Jews BONNÉ-TAMIR et al. (1986) 39 1514 1 1 4 1 1 1 1
4041 42 43 44 45
10 Israeli Arabs BONNÉ-TAMIR et al. (1986) 39 2 2 1 1 1 6 2 1 1 1 1 1 1

(a) For each population, haplotype numbers are reported on the first line and their absolute frequencies are shown in italic on the second line.
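
A composition table of this kind (haplotype numbers with absolute frequencies per sample) is what feeds the interindividual squared-distance matrix. The sketch below shows one way to go from haplotype site vectors and per-population counts to that matrix, using Equation 3b with unit weights by default. The function names, the four site vectors and the counts are hypothetical and chosen only for illustration; they are not taken from the table.

```python
import numpy as np

def haplotype_d2(sites, weights=None):
    """Squared distances between haplotypes: sum_s w_s^2 (p_sj - p_sk)^2."""
    p = np.asarray(sites, dtype=float)            # shape (H, S), entries 0/1
    if weights is None:
        w2 = np.ones(p.shape[1])                  # W = I: count site differences
    else:
        w2 = np.asarray(weights, dtype=float) ** 2
    diff = p[:, None, :] - p[None, :, :]          # shape (H, H, S)
    return (w2 * diff ** 2).sum(axis=2)

def expand_to_individuals(hap_d2, samples):
    """Build the individual-level matrix from per-population haplotype counts.

    samples : list of dicts {haplotype index: count}, one dict per population.
    Returns the (N, N) squared-distance matrix and a population label per row.
    """
    hap_index, pop_label = [], []
    for pop, counts in enumerate(samples):
        for hap, count in counts.items():
            hap_index += [hap] * count
            pop_label += [pop] * count
    idx = np.array(hap_index)
    return hap_d2[np.ix_(idx, idx)], np.array(pop_label)

# Hypothetical data: 4 haplotypes scored at 3 sites, 2 populations.
sites = [[0, 1, 1],
         [0, 1, 0],
         [1, 1, 1],
         [1, 1, 0]]
hap_d2 = haplotype_d2(sites)
samples = [{0: 2, 1: 1}, {2: 1, 3: 2}]
d2, pop_of = expand_to_individuals(hap_d2, samples)
print(d2.shape, pop_of)
```

Because the system is haploid, individuals carrying the same haplotype are at distance zero, and the distance between two individuals is just the distance between their haplotypes.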

sites. The treatment we have developed above does not require independence of the restriction sites. Only 35 haplotypes would be observed if each site had been the subject of a single mutational event; there is a high level of homoplasy. Nevertheless, all 56 haplotypes may be linked by single mutational events in a parsimonious network (Figure 3), with only two missing intermediates. Neither of these missing haplotypes (probably representing extinct intermediates, rather than sampling holes) has been found in human studies to date. These 56 haplotypes are a subset of a much larger world-wide collection reviewed in EXCOFFIER (1990). The network presented in Figure 3 is a minimum spanning tree (PRIM 1957), obtained by the algorithm found in the NTSYS package (ROHLF 1990). The procedure is similar to that producing Wagner trees (FARRIS 1970), but differs by using the observed haplotypes as the nodes of the network, rather than as branch tips of the tree. Wagner trees and Prim networks are alternative ways of viewing the same data, but the network better conveys the connections between the haplotypes.

FIGURE 2. Geographic location of the population samples.

Haplotypic diversity among samples is pictured in Figure 4, where the darkened circles indicate presence of a given haplotype in a particular population sample. A common feature of each sample is the presence of type 1 (the large central circle) in substantial frequencies. Other, less common haplotypes (2, 6, 7, 11, 39) are found in samples from different geographic regions. Each sample also possesses a series of private haplotypes, restricted to a single sample and not found elsewhere. Populations within a region tend to occupy similar portions of the network, sharing more than one haplotype, and differing by small mutational

steps. Populations in different regions tend to occupy different (although partially overlapping) parts of the network. Regional diversification represents both haplotype frequency changes and some degree of phyletic radiation, probably smoothed by gene flow.

FIGURE 3. Minimum spanning network of 56 haplotypes found among 10 populations. Each link between haplotypes represents a unique mutational event. Two haplotypes marked with asterisks have not been found among sampled human populations. The designation of each haplotype follows that of the publication where it was first described (listed in Table 2). The universal haplotype 1 has been enlarged for easy recognition.

Alternative definitions of the distance metric: We have performed hierarchical analysis of variance on four different matrices of inter-haplotypic squared distances, computed from different assumptions about the evolutionary process that produced the mtDNA haplotypes. The four matrices are: D1 (a standard Euclidean metric counting the differences among haplotypes), D2 (an equidistant metric based on the idea that haplotypes are merely distinguishable), D3 (a distance measured along the network, but also incorporating additional geographic and probabilistic information), and D4 (a matrix allowing for nonlinearity of changes along the network).

D1: This first input matrix is based on a phenetic distance metric, amounting to a simple count of the number of restriction-site differences between two haplotypes. One would choose this type of metric when the identities of restriction-sites are well defined and some haplotypes are clearly more different than others but where no network connecting the haplotypes is available. The results of our hierarchical partition are reported in Table 3 under D1. The proportion of the "among regions" variance component is large (21.12%), but the "among populations/within region" percentage is low (3.49%), relative to the "within populations" variance component. All three variance components are highly significant. We present the null distributions of $\sigma_a^2$, $\sigma_b^2$ and $\sigma_c^2$ in Figure 5, obtained by the three different permutation procedures described above. The null distributions of Φ-statistics are highly correlated with those of the associated variance components [Corr($\sigma_a^2$, $\Phi_{CT}$) > 0.99; Corr($\sigma_b^2$, $\Phi_{SC}$) > 0.99; Corr($\sigma_c^2$, $\Phi_{ST}$) < -0.99] and would thus have virtually identical shapes. For the permutation of whole populations across regions, testing $\sigma_a^2$ and $\Phi_{CT}$, there are 945 possible ways of allocating 10 populations to five groups of two populations each ($10!/(5!\,2^5)$). Only one combination of populations was found to give a slightly larger value than the observed $\sigma_a^2$. As shown in Figure 5a, the $\sigma_a^2$ null distribution is clearly bimodal. A certain number of other combinations also give $\sigma_a^2$ values that are almost as large as our observed value. Interestingly, all combinations of this higher peak show the two African populations grouped together in a single region. On the contrary, each time Peul and Wolof populations are separated in different regions, $\sigma_a^2$ values are small and found in the lower peak around zero. Large regional diversity may then be attributed to differences between the African group and all other regions, the composition of which is of no real importance. This fact would not have emerged from a standard F-ratio test. In the combination giving a maximum $\sigma_a^2$ value, the Asiatic, Middle-Eastern, and Western African groups are preserved, but the Pima are grouped with Finns and the Maya with Sicilians. Although clearly significant, our arbitrarily chosen geographic groupings are not optimum for maximizing the "among region" diversity.

The AMOVA treatment on the input distance matrix D1 has close connections with TAKAHATA and PALUMBI's (1985) technique, which leads to a $G_{ST}$ analog, after a nonlinear transformation of restriction-site changes into nucleotide diversity estimates. Without entering into much detail, we would merely point out that TAKAHATA and PALUMBI's equation (17), defining an affinity measure within populations ($f$), may be modified as an affinity measure between any two haplotypes j and k ($f_{jk}$) by letting TAKAHATA and PALUMBI's variable $l$ be the total number of restriction-sites present in the whole collection of haplotypes, rather than that for the specific pair of haplotypes j and k. Following the analogy, we also need to define an affinity measure between an "individual and itself." The most convenient definition is the number of restriction sites for that individual, the definition most in keeping with the spirit of TAKAHATA and PALUMBI (1985). These simple changes preserve the Euclidean closure of the inter-haplotypic distance measure if we use $\delta_{jk}^2 = f_{jj} + f_{kk} - 2f_{jk}$, which turns out to be identical to our phenetic distance $\delta_{jk}^2$, defined in (3a).

D2: This second input matrix assumes that all haplotypes are equidistant. The evolutionary relations

[Figure 4 panels: Tharu, Oriental, Wolof, Peul, Pima, Maya, Finnish, Sicilian, Israeli Jew, Israeli Arab]

FIGURE 4. Haplotypic diversity of each of the 10 population samples. The positions of the haplotypes are identical for each population and are homologous to those of Figure 3. The haplotypes found within each population sample are shown as black circles.

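TABLE 3
Hierarchical analysis of variance on four different square matrices of distances between haplotypes

D1 (haplotypic), observed partition
Variance component                            Variance   % total   P(a)       Φ-statistics
Among regions, $\sigma_a^2$                   0.134      21.12     0.002      $\Phi_{CT}$ = 0.211
Among populations/regions, $\sigma_b^2$       0.022       3.49     <0.0001    $\Phi_{SC}$ = 0.044
Within populations, $\sigma_c^2$              0.478      75.39     <0.0001    $\Phi_{ST}$ = 0.246

D2 (multiallelic), observed partition
Variance component                            Variance   % total   P(a)       Φ-statistics
Among regions, $\sigma_a^2$                   0.055      15.73     0.008      $\Phi_{CT}$ = 0.157
Among populations/regions, $\sigma_b^2$       0.013       3.59     <0.0001    $\Phi_{SC}$ = 0.043
Within populations, $\sigma_c^2$              0.281      80.68     <0.0001    $\Phi_{ST}$ = 0.193

D3 (Prim network), observed partition
Variance component                            Variance   % total   P(a)       Φ-statistics
Among regions, $\sigma_a^2$                   0.142      21.99     0.002      $\Phi_{CT}$ = 0.220
Among populations/regions, $\sigma_b^2$       0.021       3.29     <0.0001    $\Phi_{SC}$ = 0.042
Within populations, $\sigma_c^2$              0.484      74.72     <0.0001    $\Phi_{ST}$ = 0.253

D4 (nonlinear), observed partition [variances carry a power-of-ten factor whose exponent is illegible in this copy]
Variance component                            Variance   % total   P(a)       Φ-statistics
Among regions, $\sigma_a^2$                   0.127      21.30     0.002      $\Phi_{CT}$ = 0.213
Among populations/regions, $\sigma_b^2$       0.020       3.31     <0.0001    $\Phi_{SC}$ = 0.042
Within populations, $\sigma_c^2$              0.449      75.39     <0.0001    $\Phi_{ST}$ = 0.246

(a) Probability of having a more extreme variance component and Φ-statistic than the observed values by chance alone. $\Phi_{CT}$ and $\sigma_a^2$ are tested under random permutations of whole populations across regions. $\Phi_{SC}$ and $\sigma_b^2$ are tested under random permutation of individuals across populations but within the same region. $\Phi_{ST}$ and $\sigma_c^2$ are tested under random permutation of individuals across populations without regard to either their original populations or regions.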
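
As a quick arithmetic check on how the columns of Table 3 relate to one another, the sketch below recomputes the percentages and Φ-statistics of the D1 partition from its three variance components. The function name `phi_statistics` is hypothetical, and the ratios used are the ones implied by the relations between variance components and Φ-statistics given in the text; because the tabulated components are rounded to three decimals, agreement is expected only to within rounding.

```python
def phi_statistics(sigma2_a, sigma2_b, sigma2_c):
    """Percent of total variance and Phi-statistics from variance components."""
    total = sigma2_a + sigma2_b + sigma2_c
    return {
        "percent": tuple(100.0 * s / total
                         for s in (sigma2_a, sigma2_b, sigma2_c)),
        "phi_ct": sigma2_a / total,                  # among regions
        "phi_sc": sigma2_b / (sigma2_b + sigma2_c),  # populations within regions
        "phi_st": (sigma2_a + sigma2_b) / total,     # populations within species
    }

# D1 components as tabulated (rounded to three decimals).
d1 = phi_statistics(0.134, 0.022, 0.478)
print({k: (tuple(round(x, 2) for x in v) if k == "percent" else round(v, 3))
       for k, v in d1.items()})
```

The recomputed Φ-statistics round to the tabulated 0.211, 0.044 and 0.246, while the percentages differ from the table in the second decimal, as expected from rounded inputs.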

between distinguishable haplotypes are assumed to be unknown, a standard treatment for allozymes or other protein systems (see, however, RICHARDSON and SMOUSE 1976; RICHARDSON, SMOUSE and RICHARDSON 1977). This treatment is also applicable to antigenic systems, or even to molecular fingerprint analysis, where the banding pattern of two individuals either matches or does not. The Φ-statistics become the usual multiallelic F-statistics (LONG 1986). The results of our hierarchical analysis are presented in Table 3 under D2. Most of the haplotype diversity (80.68%) is found within each population, but an appreciable amount still (15.73%) separates regions. The differences among populations within regions are small (3.59%). For the two procedures that involve permutation of individuals across populations, testing $\sigma_b^2$, $\sigma_c^2$, $\Phi_{ST}$ and $\Phi_{SC}$, our observed variance components showed extreme values in all cases. Seven permutations of whole populations across regions were found to yield greater $\sigma_a^2$ (and $\Phi_{CT}$) than our observed value. Although the result is still significant, we clearly lose geographic resolution with this metric.

D3: Our third matrix is based on a distance metric computed along the evolutionarily parsimonious network shown in Figure 3. When several connections of equal length are possible for a particular haplotype, two additional rules are used to make a choice (EXCOFFIER and LANGANEY 1989). The first is a probability criterion; a link between two rare (<5%) haplotypes is less likely than a link between rare and frequent (>5%) haplotypes. The second criterion is geographic; links between haplotypes that are found
within the same population or within the same region are favored over links between types from different regions. These distances differ from those of $D_1$ whenever we have homoplasic mutations along the network (23 cases out of 57). As exemplified in Figure 1 for the differences between haplotypes 3 and 5, single-site changes from (+) to (-) to (+) or from (-) to (+) to (-) along the network, scored as a distance of zero for $D_1$, are scored as a distance of 2 for $D_3$. The results are presented in Table 3, labeled as $D_3$. The observed $\sigma_a^2$ value accounts for a slightly larger fraction of the total genetic variability (21.99%) than is the case for $D_1$, but there is again one regional combination of populations which produces a larger $\sigma_a^2$ than that observed, and it is the same one described before (Pima + Finns and Maya + Sicilians). The handling of homoplasies does not modify the outcome.

FIGURE 5.-Null distributions of the molecular variance components obtained through different random permutations of the large matrix of squared interindividual distances $D_1$ of dimension 672 (see text). (a) Distribution of $\sigma_a^2$; (b) distribution of $\sigma_b^2$; (c) distribution of $\sigma_c^2$.

$D_4$: This fourth input matrix is made up of weighted evolutionary distances, measured along the PRIM network shown in Figure 3. The weighting matrix (W) is now a diagonal matrix of dimension M = 55 (total number of haplotypes - 1), where each $w_{mm}$ is equal to the nucleotide diversity ($d_{xy}$) between adjacent types x and y in the network, so that W = diag($d_m$). Different methods have been used to estimate nucleotide diversity ($d_{xy}$) from restriction-site data (ENGELS 1981; EWENS, SPIELMAN and HARRIS 1981; NEI and TAJIMA 1981, 1983; KAPLAN 1983; NEI and MILLER 1990). For simplicity, we have used Equation 4 from NEI and MILLER (1990), which yields results very close to the maximum-likelihood estimates of NEI and TAJIMA (1983). For each adjacent pair of haplotypes x and y on the network, we estimate the nucleotide diversity $d_{xy}$ by

$$d_{xy} = \frac{\sum_{e=1}^{E} S_e r_e d_{xy(e)}}{\sum_{e=1}^{E} S_e r_e}, \qquad (11)$$

where E is the number of enzyme classes examined, $S_e$ is the mean number of restriction sites present in haplotypes x and y for the enzyme class e, $r_e$ is the length of the recognition sequence of the e-th enzyme class (for our enzymes $r_e$ = 4, 14/3, 16/3 or 6), and $d_{xy(e)}$ is the fraction of nucleotide substitutions per site between sequences x and y, estimated for the enzyme class e. The computation of (11) is quite simple in our case, because adjacent haplotypes are separated by single mutation changes in most cases, so the numerator involves only one term. Substituting (11) in (3), we have

$${}^{*}\delta_{jk}^2 = \left(({}^{*}\mathbf{p}_j - {}^{*}\mathbf{p}_k)' W^{1/2}\right)\left(({}^{*}\mathbf{p}_j - {}^{*}\mathbf{p}_k)' W^{1/2}\right)' = ({}^{*}\mathbf{p}_j - {}^{*}\mathbf{p}_k)' W ({}^{*}\mathbf{p}_j - {}^{*}\mathbf{p}_k) = \sum_{m=1}^{M} d_m ({}^{*}p_{mj} - {}^{*}p_{mk})^2, \qquad (12)$$

where M is the total number of mutational events or links between haplotypes in the minimum spanning network, as defined above. This analysis is analogous to that done for $D_3$, but here the branches linking each pair of adjacent haplotypes are of length $d_{xy}$ instead of 1. This weighting scheme enables us to incorporate the nucleotide diversity in a Euclidean framework, and to perform an analysis very similar to that developed in LYNCH and CREASE (1990), but with considerably less computation. Using this strategy, we only need to compute nucleotide diversity with (11) between the M adjacent pairs of haplotypes on the network. The nucleotide diversity between a pair of nonadjacent haplotypes is the sum of the stepwise nucleotide diversities along the path joining these haplotypes. The results of the AMOVA are again reported in Table 3, now labeled as $D_4$. The figures for both variance component fractions and $\Phi$-statistics are essentially similar to those obtained for $D_1$ and $D_3$, with an important fraction of molecular diversity separating regions (21.3%). Careful examination of the input distance matrix (not shown) generated by (12) shows that the amount of nucleotide diversity emerging from single restriction-site changes is very

similar for different enzymes. Branch-lengths between adjacent haplotypes on the network are virtually identical, except for the two cases where more than one restriction-site change is involved.
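The quadratic form in (12) can be illustrated on a toy network. The sketch below is only an illustration under stated assumptions: each haplotype is represented by a 0/1 vector over the M links marking the links on its path from an arbitrarily chosen root, which is one way to obtain path lengths on a tree and is not necessarily the coding used by the authors. The five-haplotype tree, the link weights and the function names are hypothetical.

```python
# Hypothetical 5-haplotype tree: parent[h] is the haplotype that h is linked to
# (None for the arbitrarily chosen root).
parent = [None, 0, 1, 1, 0]

# Link m joins haplotype m + 1 to its parent. The weights are illustrative
# nucleotide diversities, not values from the paper.
link_diversity = [0.002, 0.003, 0.002, 0.004]

def path_vector(h):
    """0/1 vector over the M links, marking the links on the path from the root to h."""
    p = [0] * (len(parent) - 1)
    while parent[h] is not None:
        p[h - 1] = 1
        h = parent[h]
    return p

def squared_distance(j, k, weights):
    """(p_j - p_k)' W (p_j - p_k) for a diagonal W = diag(weights)."""
    pj, pk = path_vector(j), path_vector(k)
    return sum(w * (a - b) ** 2 for w, a, b in zip(weights, pj, pk))

unit_weights = [1] * len(link_diversity)
print(squared_distance(2, 3, unit_weights))    # number of links between haplotypes 2 and 3
print(squared_distance(2, 3, link_diversity))  # summed link weights along the same path
```

Links shared by both root paths cancel in the difference, so with unit weights the result is the number of steps along the network (the $D_3$ style of distance), and with diag($d_m$) it is the sum of the stepwise diversities along the path joining the two haplotypes (the $D_4$ style).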
FIGURE 6.-Percentage of significant variance components as a function of haplotype size (in number of restriction-sites). A given number of sites is drawn at random with replacement from the original 62 restriction sites and variance component significance is tested by 500 permutations of the original matrix of squared interindividual distances (see text). This process is repeated 500 times to find the percentage of significant outcomes at the level $\alpha$ = 0.05. $\Phi$-statistics curves are almost identical to corresponding variance components and are not reported on the graph.

Genetic structure and DNA site sampling: We evaluated the sensitivity to site sampling by examining the $D_1$ partition for a random sample of sites, with the number of sites ranging from 5 to 62. We report the percentages of significant values ($\alpha$ < 0.05) for the variance components in Figure 6. These three power curves are indistinguishable from those for the $\Phi$-statistics, which are suppressed. As anticipated, the percentage of significant results increases with the number of sampled sites for all statistics; $\sigma_c^2$ and $\Phi_{ST}$ approach 100% significant outcomes when as few as 40 sites are taken into account. When 62 sites are randomly sampled, $\sigma_b^2$ and $\Phi_{SC}$ are significant in 99.8% of all replicates, whereas $\sigma_a^2$ and $\Phi_{CT}$ are significant 94.8% of the time. The component of molecular variance among regions exhibits least power and requires the largest number of restriction sites, suggesting that differences among regions are due to specific sites and mutations. On the whole, however, these
high levels of significance show that the inferred genetic structure of our sampled populations is not a sampling artifact and that reliable inference does not require an inordinately large number of sites. We have not carried the analysis to more than 62 sites, because an increase in the number of sampled sites would mean the occurrence of new haplotypes, the distribution of which among populations is unknown from our data.

That conclusion is subject, however, to the assumption that the 62 sites observed are representative of all sites of the mtDNA molecule. Our sites, sampled from an empiric set, are, however, not entirely random. As a practical matter, restriction enzymes that do not generate restriction site variation are usually discarded from the assay battery. The enzymes used here are used routinely in human work precisely because they do exhibit substantial polymorphism. They almost surely do not provide a random representation of the human mtDNA genome, and our collection of sites is certainly biased towards excess polymorphism. The fact that the variation encountered is also geographically structured was not used as a criterion of choice. Indeed, a recent work (STONEKING et al. 1990) using additional enzymes revealing even greater polymorphism shows as much geographic structure as we have demonstrated here. It seems probable that a truly random sample of sites (or nucleotides), a larger fraction of which would be monomorphic, would be required to demonstrate the same level of infra-specific structure we have described here. The question of whether our chosen genetic markers are a representative set is one more often dealt with by assumption than proof. Empirically, we see no alternative but testing of the data we have.

LYNCH and CREASE (1990) studied nucleotide sampling analytically, showing that it constituted a major source of variance in estimating diversity at the nucleotide level. Our results are somewhat at odds with theirs. In our case, however, the unit studied for its diversity is not the nucleotide but the haplotype, which is itself a collection of sites. The variance of haplotypic diversity due to site sampling appears to be lower than the variance of nucleotide diversity due to the same sampling process. When the number of sites per haplotype is reduced, site sampling becomes increasingly important, as shown in Figure 6. For a haplotype with only 5 sites, $\sigma_c^2$ is significant in 73% of all replicates, $\sigma_b^2$ in 44.4%, and $\sigma_a^2$ in only 30.8%, showing the importance of site sampling in this case.

Human population radiation: Hierarchical analysis of human mtDNA variability shows substantial subdivision among human populations, but with a large fraction of the variation found within populations (>74%). A similar value (69%) has been derived using a $G_{ST}$ approach on another human mtDNA data set (STONEKING et al. 1990). Our rather contrived regional groups exhibit a high level of divergence. Populations within regions were shown to be significantly (but minimally) differentiated. Our results suggest that extensive studies within each of the regions are needed to determine whether the much greater divergence observed "among regions" than "among populations/within regions" is an artifact of our arbitrary choice of populations, a sampling consequence of isolation-by-distance, or whether there are steep boundary zones of limited genetic exchange between regions. Such zones have come under increasing scrutiny of late (BARBUJANI, ODEN and SOKAL 1989; BARBUJANI and SOKAL 1990, 1991), and a generic answer to the "boundary question" will only be available from a study of more evenly spaced samples.

Regional differentiation is more apparent when the degree of difference between haplotypes is taken into account, in keeping with the observation that molecular distances are larger for pairs of haplotypes drawn from different regions than from the same region (Figure 4). This suggests that a substantial fraction of the mtDNA variability among regions is due to divergent arrays of haplotypes, ultimately attributable to the occurrence of new mutations along the path to regional radiation. It is initially surprising that computing distances along the network only slightly enhances the regional differences in our data set. On further reflection, however, the results make sense. Homoplasies due to recurrent mutations mainly affect low frequency haplotypes that are located at the tips of the network. Both the low frequency of such haplotypes and their network placement will minimally affect the hierarchical partition of variation. The computation of evolutionary distances along a network should yield greater additional resolution for taxonomic assemblages of greater internal radiation, where extinction of intermediates would lead to homoplasic mutations of higher frequency and of more central position.

Nonlinear transformation of restriction-site differences into estimates of nucleotide diversity between haplotypes also does not substantially affect the haplotypic variance partition. We attribute this result to the low divergence between adjacent haplotypes on the network. As most of the links between adjacent haplotypes involve unique restriction-site changes, taking into account the fact that a particular site involves four-, five- or six-base recognition sequences does not matter much here. Thus, the additional assumptions involved in the nonlinear translation, such as uniform substitution rates at different sites and identical substitution probabilities for the four nucleotides, may not be necessary in delineating the internal genetic structure of a single species. However, such nonlinear transformations could be useful if the analysis included individuals from different species with larger interhaplotypic differences.

These conclusions may depend on the choice of the network presented in Figure 3, which was built before the AMOVA analyses were performed. Its basic structure had already been determined in previous publications (JOHNSON et al. 1983; EXCOFFIER and LANGANEY 1989). When a high level of homoplasy is present in the data, as it is here, the parsimony criterion does not lead to a unique network, as is also the case for most phylogeny reconstruction algorithms, and a large number of equally parsimonious networks could have been imposed. The question of how to choose among equally parsimonious networks (or trees) is a problem that cannot be settled here. Our contention is merely that given a minimum spanning (parsimonious) network, buttressed by frequency and geographic criteria, an eminently "sensible" network, one can use the methods developed here for a useful partition of the variation. For the example at hand, the additional wrinkle of measuring distance along the network does not provide any additional resolution. Whether we could do better with a different network, and how to choose such a network, we will leave for a later paper.

Our analysis of regional differences shows that the geographic criterion used to define regional groups is quite reasonable as a first approximation. Slightly greater regional divergence was found with an alternative partition of the populations. The European region contains the most internal diversity, whereas the Amerindian region contains the least. The two "alternative" regions Sicily + Maya and Pima + Finland present intermediate "within region" diversities, which slightly lower the total "within region" variability and increase the "among region" variance component. One might consider that the among-region component could itself be useful as a criterion for defining supra-population groups. This situation also shows that we need to examine more closely the extent to which each region or each population contributes to the total molecular diversity, as the variance components or $\Phi$-statistics do not bring us much detail of the patterning of the species variability. As has already been done for the multiallelic case (LONG, SMOUSE and WOOD 1987), our analysis framework could be extended to a partitioning of the among-population variability into pairwise population distance components.

Methodological considerations: We have introduced an analytical method for studying the genetic structure of populations that permits use of as much (or as little) of the available information on the molecular nature of DNA haplotypes as is desired. It extends procedures that explicitly use an analysis of variance format (COCKERHAM 1969, 1973; WEIR and COCKERHAM 1984; LONG 1986; LONG, SMOUSE and WOOD 1987) to estimate the degree of intra-specific genetic subdivision. If we can legitimately assume that populations become differentiated by drift alone, then we can expect a linear relation between divergence time and allelic correlation for short periods (REYNOLDS, WEIR and COCKERHAM 1983). In our case, population differences in restriction pattern have clearly arisen from genetic drift of existing variants, from the introduction of new mutations, and from some degree of gene flow, so we will not extrapolate our results as far as a divergence-time interpretation.

The point of the current exercise is neither to estimate unknown population parameters from our variance components nor to define exactly how or at what rate these population differences have developed. Our purpose here is to demonstrate how to delineate the extent of genetic differentiation within and among populations. The approach is general enough to deal with any organism and to study any type of structure (hierarchical or otherwise) that one might wish to consider. The underlying (distance matrix) structure of the analysis permits flexible exploration of a given data set. Several different distance matrices, one for each particular set of assumptions, may be taken as alternate inputs and their influence on the outcome evaluated. The relation to F-statistics is straightforward, though subject to the usual limitations. More important is the realization that the whole array of least-squares methods (analysis of variance, analysis of covariance, regression, correlation, principal coordinates analysis, factor analysis, etc.) is accessible from this same distance matrix. We have tapped only a small portion of the available repertoire here.

Significance testing with permutation procedures is both easy and essentially assumption free; in particular, we are freed from the testing limitations of normal theory, so useful in analysis of variance but so inappropriate here. We can address several questions with the same data set. We might even wish to test the difference between outcomes formally, based on different squared-distance matrices. As the computation of the variance components involves only manipulation of the original input distance metrics, the outcome will only be as different as the inputs. Squared-distance matrices may be compared using a normalized Mantel test (SMOUSE, LONG and SOKAL 1986).

If one wishes to translate restriction site differences into estimates of the fraction of nucleotide differences between pairs of haplotypes ($r_{jk}$), several procedures are available (ENGELS 1981; EWENS, SPIELMAN and HARRIS 1981; NEI and TAJIMA 1981, 1983; KAPLAN 1983; NEI and MILLER 1990), any one of which can be used to modify the interhaplotypic squared distances in our technique. Additional translation may permit linearization of these estimates with divergence time. Such transformations have the additional advantage of being independent of the number of restriction sites surveyed. We have seen, however, that this process does not fundamentally alter our estimates of the variance components. Extension of this methodology to DNA sequence data is straightforward and can be achieved through a redefinition of the interchromosomal distance metric. As several methods are already available for this purpose in the literature (SWOFFORD and OLSEN 1990), one is free to choose. We will content ourselves here with the observation that the use of a Euclidean metric has some natural advantages, not the least of which is that a matrix of such distances can be used for other purposes than phylogenetic analysis. The considerable variety of data types made available by molecular biology needs a statistical analysis framework that is coherent but also sufficiently flexible to accommodate the different types of questions inherent in each particular situation. The AMOVA treatment presented here is intended to serve as the beginning of just such a framework.

The authors thank OSCAR GAGGIOTTI and ANDRE LANGANEY for their comments on the manuscript, as well as MICHAEL LYNCH and another (anonymous) reviewer for their suggestions. L.E. was funded by FNRS Switzerland 32-28784.90 and 32-27845.89, and INSERM France 900 814, P.E.S. by NJAES/USDA-32102, J.M.Q. by the Roosevelt Fund, American Museum of Natural History and by the Leathem-Steinetz-Stauber Fund, Rutgers University. An analysis of molecular variance program (AMOVA), including the permutational testing procedures, is available on request from L.E.

LITERATURE CITED

ANDERSON, S., A. T. BANKIER, B. G. BARREL, M. H. L. DE BRUIJN, A. R. COULSON, J. DROUIN, I. C. EPERON, D. P. NIERLICH, B. and I. G. YOUNG, 1981 Sequence and organization of the human mitochondrial genome. Nature 290: 457-465.
BARBUJANI, G., N. L. ODEN and R. R. SOKAL, 1989 Detecting areas of abrupt change in maps of biological variables. Syst. Zool. 38: 376-389.
BARBUJANI, G., and R. R. SOKAL, 1990 The zones of sharp genetic change in Europe are also language boundaries. Proc. Natl. Acad. Sci. USA 87: 1816-1819.
BARBUJANI, G., and R. R. SOKAL, 1991 Genetic population structure of Italy. II. Physical and cultural barriers to gene flow. Am. J. Hum. Genet. 48: 398-411.
BIRKY, C. W., P. FUERST and T. MARUYAMA, 1989 Organelle gene diversity under migration, and drift: equilibrium expectations, approach to equilibrium, effects of heteroplasmic cells, and comparison to nuclear genes. Genetics 121: 613-627.
BIRKY, C. W., T. MARUYAMA and P. FUERST, 1983 An approach to population and evolutionary genetic theory for genes in mitochondria and chloroplasts, and some results. Genetics 103: 513-527.
BONNE-TAMIR, B., M. J. JOHNSON, A. NATALI, D. C. WALLACE and L. L. CAVALLI-SFORZA, 1986 Human mitochondrial DNA types in two Israeli populations-a comparative study at the DNA level. Am. J. Hum. Genet. 38: 341-351.
BREGA, A., R. GARDELLA, O. SEMINO, G. MORPURGO, G. B. ASTALDI RICOTTI, D. C. WALLACE and A. S. SANTACHIARA-BERENECETTI, 1986 Genetic studies on the Tharu population of Nepal: restriction endonuclease polymorphisms of mitochondrial DNA. Am. J. Hum. Genet. 39: 502-512.
BROWN, W. M., M. GEORGE, JR. and A. C. WILSON, 1979 Rapid evolution of animal mitochondrial DNA. Proc. Natl. Acad. Sci. USA 76: 1967-1971.
1982 Mitochondrial DNA sequences of primates: tempo and mode of evolution. J. Mol. Evol. 18: 225-239.
CANN, L., Mitochondrial DNA and human evolution. Nature 325: 31-36.
COCKERHAM, C. C., 1969 Variance of gene frequencies. Evolution 23: 72-84.

COCKERHAM, C. C., 1973 Analyses of gene frequencies. Genetics 74: 679-700.
EFRON, B., 1982 The Jacknife, the Bootstrap and Other Resampling Plans. Regional Conference Series in Applied Mathematics, Vol 38. Society for Industrial and Applied Mathematics, Philadelphia.
ENGELS, W. R., 1981 Estimating genetic divergence and genetic variation with restriction endonucleases. Proc. Natl. Acad. Sci. USA 78: 6329-6333.
EWENS, W. J., R. S. SPIELMAN and H. HARRIS, 1981 Estimation of genetic variation at the DNA level from restriction endonuclease data. Proc. Natl. Acad. Sci. USA 78: 3748-3750.
EXCOFFIER, L., 1990 Evolution of human mitochondrial DNA: evidence for departure from a pure neutral model of populations at equilibrium. J. Mol. Evol. 30: 125-139.
EXCOFFIER, L., and A. LANGANEY, 1989 Origin and differentiation of human mitochondrial DNA. Am. J. Hum. Genet. 44:
FARRIS, J. S., 1970 Methods for computing Wagner trees. Syst. Zool. 19: 83-92.
FELSENSTEIN, J., 1988 Phylogenies from molecular sequences: inference and reliability. Annu. Rev. Genet. 22: 521-565.
1980 Maternal inheritance of human mitochondrial DNA. Proc. Natl. Acad. Sci. USA 77: 6715-6719.
GYLLENSTEN, U., D. WHARTON, A. JOSEFSSON and A. C. WILSON, 1991 Paternal inheritance of mitochondrial DNA in mice. Nature 352: 255-257.
JOHNSON, M. J., D. C. WALLACE, S. D. FERRIS, M. C. RATTAZZI and L. L. CAVALLI-SFORZA, 1983 Radiation of human mitochondrial DNA types analyzed by restriction endonuclease cleavage patterns. J. Mol. Evol. 19: 255-271.
KAPLAN, N., 1983 Statistical analysis of restriction enzyme map data and nucleotide sequence data, pp. 75-106 in Statistical Analysis of DNA Sequence Data, edited by B. S. WEIR. Marcel Dekker, New York.
LI, C. C., 1976 Population Genetics. Boxwood, Pacific Grove, Calif.
LONG, J. C., 1986 The allelic correlation structure of Gainj- and Kalam-speaking people. I. The estimation and interpretation of Wright's F-statistics. Genetics 112: 629-647.
LONG, J. C., P. E. SMOUSE and J. W. WOOD, 1987 The allelic correlation structure of Gainj- and Kalam-speaking people. II. The genetic distance between population subdivisions. Genetics 117: 273-283.
LYNCH, M., and T. J. CREASE, 1990 The analysis of population survey data on DNA sequence variation. Mol. Biol. Evol. 7: 377-394.
MANTEL, N., 1967 The detection of disease clustering and a generalized regression approach. Cancer Res. 27: 209-220.
NEI, M., 1973 Analysis of gene diversity in subdivided populations. Proc. Natl. Acad. Sci. USA 70: 3321-3323.
NEI, M., 1977 F-statistics and the analysis of gene diversity in subdivided populations. Ann. Hum. Genet. 41: 225-233.
NEI, M., and J. C. MILLER, 1990 A simple method for estimating average number of nucleotide substitutions within and between populations from restriction data. Genetics 125: 873-879.
NEI, M., and F. TAJIMA, 1981 DNA polymorphism detectable by restriction endonucleases. Genetics 97: 145-163.
NEI, M., and F. TAJIMA, 1983 Maximum likelihood estimation of the number of nucleotide substitutions from restriction sites data. Genetics 105: 207-217.
PRIM, R. C., 1957 Shortest connection networks and some generalizations. Bell Syst. Tech. J. 36: 1389-1401.
REYNOLDS, J., B. S. WEIR and C. C. COCKERHAM, 1983 Estimation of the coancestry coefficient: Basis for a short term genetic distance. Genetics 105: 767-779.
RICHARDSON, R. R., and P. E. SMOUSE, 1976 Patterns of electrophoretic mobility. I. Interspecific comparisons in the Drosophila mulleri complex. Biochem. Genet. 14: 447-466.
RICHARDSON, R. 1977 Patterns of molecular variation. II. Associations of electrophoretic mobility and larval substrate within species of the Drosophila mulleri complex. Genetics 85: 141-154.
ROHLF, F. J., 1990 NTSYS. Numerical Taxonomy and Multivariate Analysis System. Ver. 1.60. Exeter Publ. Ltd., Setauket, N.Y.
SCHURR, T. G., S. W. BALLINGER, Y.-Y. GAN, J. A. HODGE, D. A. MERRIWEATHER, D. N. LAWRENCE, W. C. KNOWLER, K. M. WEISS and D. C. WALLACE, 1990 Amerindian mitochondrial DNAs have rare Asian mutations at high frequencies, suggesting they derived from four primary maternal lineages. Am. J. Hum. Genet. 46: 613-623.
A. S. SANTACHIARA-BERENECETTI, 1988 Genetic studies on the Senegal population. I. Mitochondrial DNA polymorphisms. Am. J. Hum. Genet. 43: 534-544.
SEMINO, O., A. TORRONI, R. SCOZZARI, A. BREGA, G. DE BENEDIC- Mitochondrial DNA polymorphisms in Italy. III. Population data from Sicily: a possible quantitation of African ancestry. Ann. Hum. Biol. 53: 193-202.
SLATKIN, M., 1987 The average number of sites separating DNA sequences drawn from a subdivided population. Theor. Popul. Biol. 32: 42-49.
SMOUSE, P. E., J. C. LONG and R. R. SOKAL, 1986 Multiple regression and correlation extensions of the Mantel test of matrix correspondence. Syst. Zool. 35: 627-632.
1990 Geographic variation in human mitochondrial DNA from Papua New Guinea. Genetics 124: 717-733.
SWOFFORD, D. L., and G. J. OLSEN, 1990 Phylogeny reconstruction, pp. 411-501 in Molecular Systematics, edited by D. M. HILLIS and C. MORITZ. Sinauer Associates, New York.
TAKAHATA, N., and S. R. PALUMBI, 1985 Extranuclear differentiation and gene flow in the finite island model. Genetics 109: 441-457.
VILKKI, J., M.-L. SAVONTAUS and E. V. NIKOSKELAINEN, 1988 Human mitochondrial types in Finland. Hum. Genet. 80: 317-321.
WALLACE, D. C., K. GARRISON and W. C. KNOWLER, 1985 Dramatic founder effect in Amerindian mitochondrial DNAs. Am. J. Phys. Anthrop. 68: 149-155.
WATTERSON, G. A., 1975 On the number of segregating sites in genetical models without recombination. Theor. Popul. Biol. 7: 256-276.
WATTERSON, G. A., 1985 The genetic divergence of two populations. Theor. Popul. Biol. 27: 298-317.
WEIR, B. S., and C. C. COCKERHAM, 1984 Estimating F-statistics for the analysis of population structure. Evolution 38: 1358-1370.
WRIGHT, S., 1951 The genetical structure of populations. Ann. Eugen. 1: 323-334.
WRIGHT, S., 1965 The interpretation of population structure by F-statistics with special regards to systems of mating. Evolution 19: 395-420.
ZHIVOTOVSKY, L. A., 1988 Some methods of analysis of correlated characters, pp. 423-432 in Proceedings of the II International Conference on Quantitative Genetics, edited by B. S. WEIR, G. EISEN, M. M. GOODMAN, and G. NAMKOONG. Sinauer Associates, Sunderland, Mass.

Communicating editor: E. THOMPSON