• Embed Doc
  • Readcast
  • Collections
  • CommentGo Back
Download
 
Copyright 
Ó
2007 by the Genetics Society of AmericaDOI: 10.1534/genetics.106.067355
Genetic Similarities Within and Between Human Populations
D. J. Witherspoon,* S. Wooding,
 A. R. Rogers,
E. E. Marchani,* W. S. Watkins,*M. A. Batzer 
§
and L. B. Jorde*
,1
*
 Department of Human Genetics, University of Utah Health Sciences Center, Salt Lake City, Utah 84112,
 Department of Anthropology,University of Utah, Salt Lake City, Utah 84112,
McDermott Center for Human Growth and Development, University of  Texas Southwestern Medical Center, Dallas, Texas 75390 and 
§
 Department of Biological Sciences,Louisiana State University, Baton Rouge, Louisiana 70803 
Manuscript received October 25, 2006 Accepted for publication February 5, 2007 ABSTRACTThe proportion of human genetic variation due to differences between populations is modest, andindividuals from different populations can be genetically more similar than individuals from the samepopulation. Yet sufficient genetic data can permit accurate classification of individuals into populations.Both findings can be obtained from the same data set, using the same number of polymorphic loci. Thisarticle explains why. Our analysis focuses on the frequency,
v
, with which a pair of random individualsfrom two different populations is genetically more similar than a pair of individuals randomly selectedfrom any single population. We compare
v
to the error rates of several classification methods, using datasets that vary in number of loci, average allele frequency, populations sampled, and polymorphismascertainment strategy. We demonstrate that classification methods achieve higher discriminatory powerthan
v
because of their use of aggregate properties of populations. The number of loci analyzed is themost critical variable: with 100 polymorphisms, accurate classification is possible, but 
v
remains sizable,even when using populations as distinct as sub-Saharan Africans and Europeans. Phenotypes controlled by a dozen or fewer loci can therefore be expected to show substantial overlap between human populations.This provides empirical justification for caution when using population labels in biomedical settings, withbroad implications for personalized medicine, pharmacogenetics, and the meaning of race.
D
ISCUSSIONS of genetic differences between ma- jor human populations have long been domi-nated by two facts: (a) Such differences account foronly a small fraction of variance in allele frequencies,but nonetheless (b) multilocus statistics assign most individuals to the correct population. This is widely understood to reflect the increased discriminatory power of multilocus statistics. Yet B
amshad
et al 
. (2004)showed, using multilocus statistics and nearly 400 poly-morphic loci, that (c) pairs of individuals from differ-ent populations are often more similar than pairs fromthe same population. If multilocus statistics are sopowerful, then how are we to understand this finding? All three of the claims listed above appear in disputesover the significance of human population variation and‘‘race.’’ In particular, the A 
merican
nthropological
 A 
ssociation
(1997, p. 1) stated that ‘‘data also show that any two individuals within a particular populationare as different genetically as any two people selectedfrom any two populations in the world’(subsequentlamended to ‘about as different’’). Similarly, educa-tional material distributed by the H
uman
G
enome
P
roject
(2001, p. 812) states that ‘‘two random indi- viduals from any one group are almost as different [genetically] as any two random individuals from theentire world.’’ Previously, one might have judged thesestatementstobeessentiallycorrectforsingle-locuschar-acters,butnotformultilocusones.However,thefindingof B
amshad
et al 
. (2004) suggests that an empirical in- vestigation of these claims is warranted.In what follows, we use several collections of locigenotyped in various human populations to examinetherelationship betweenclaims a,b,andcabove.Thesedata sets vary in the numbers of polymorphic loci geno-typed, population sampling strategies, polymorphismascertainment methods, and average allele frequencies.To assess claim c, we define
v
as the frequency with which a pair of individuals from different populations isgenetically more similar than a pair from the samepopulation. We show that claim c, the observation of high
v
,holdswithsmallcollectionsofloci.Itholdseven with hundreds of loci, especially if the populationssampled have not been isolated from each other forlong. It breaks down, however, with data sets compris-ing thousands of loci genotyped in geographically distinct populations: In such cases,
v
becomes zero.
1
Correspondingauthor: 
DepartmentofHumanGenetics,EcclesInstituteof Human Genetics, University of Utah, 15 N. 2030 E., Room 7225, Salt Lake City, UT 84112-5330. E-mail: lbj@genetics.utah.edu
Genetics
176:
351–359 (May 2007)
 
Classification methods similarly yield high error rates withfewlociandalmostnoerrorswiththousandsofloci.Unlike
v
, however, classification statistics make use of aggregate properties of populations, so they can ap-proach 100% accuracy with as few as 100 loci.
MATERIALS AND METHODS
Data sets:
Threedatasetswereused.Lociorindividualswith
.
10% missing data were not included in any data set (loci were pruned first and then individuals). The first data set (‘‘insertions’’) consists of 175 polymorphic transposable ele-ment insertion loci (100
Alu 
and 75
L1
) previously genotypedin 259 individuals. The population sample consists of 104individuals from sub-Saharan Africa, 54 East Asians, 61 indi- vidualsofnorthernEuropeanancestry,and40individualsfrom Andhra Pradesh, India (W 
atkins
et al 
. 2005; W 
itherspoon
et al 
. 2006). The second data set (‘‘microarray’’) consists of 9922 biallelic single-nucleotide polymorphism (SNP) locigenotyped in 278 individuals (55 Africans, 42 African Amer-icans, 40 Native Americans, 22 Indians, 20 East Asians, 62Europeans, 18 Hispano–Latinos from Puerto Rico, and 19individuals from New Guinea). This data set is derived fromthatofS
hriver 
et al 
. (2005).The thirddataset(‘‘resequenced’)is derived from the 10 ENCODE regions of the HapMapproject, release 16c.1 of phase I, June 2005 (I
nternational
H
ap
M
ap
C
onsortium
2005). These regions were rese-quenced in 48 individuals to identify SNPs without ascertain-mentbiasinfavoroflociwithcommonpolymorphisms.TheseSNPs were then genotyped in 209 unrelated individuals: 60 Yoruba in Ibadan, Nigeria (YRI); 60 Utah residents withancestry from northern and western Europe (CEU, from theCEPH diversity panel); and 89 Japanese in Tokyo, Japan, plusHan Chinese in Beijing, China (CHB
1
JPT). Our subset consists of 14,258 SNPs. All markers in all three data sets arebiallelic. The proportions of missing genotypes are 2.4, 2.1,and 0.36%, respectively.
Data subsampling:
To examine the effect of populationsampling (
i.e 
., the effects of comparing relatively isolatedpopulations
vs.
more closely related or admixed ones), twosubsets were constructed from each of the insertions andmicroarraymaindatasets:oneconsistingoftheentiredataset, with all its labeled populations, and another consisting of East  Asian, European, and sub-Saharan African population groupsonly. The resequenced data set consists only of the latter threepopulation groups.To investigate the effect of allele frequency, these five datasubsetsweresubdividedaccordingtothreefurthertreatments:loci with common polymorphisms (with
inor
llele
f  
re-quency, MAF,
.
0.1); loci with rare polymorphisms (MAF
,
0.1); and all polymorphic loci, regardless of frequency.Henceforth we refer to these classes of loci as rare polymor-phisms, common polymorphisms, or all polymorphisms. Forthisclassification,allelefrequencieswerecomputedacrosstheentiresampleintheparentdataset.Toinvestigatetheeffectof incrementally increasing the number of loci used, loci fromeach of these 15 data subsets were sampled (without re-placement) to produce 200 independent data sets with num-bers of loci varying in 21 steps on a logarithmic scale from 10to the maximum.
Pairwise genetic distance:
We use the ‘‘shared alleles’’ ge-netic distance (C
hakraborty
and J
in
1993; B
owcock 
et al 
.1994; M
ountain
and C
avalli
-S
forza
1997), which definesthe distance between two individuals at a locus as one minushalf the number of alleles they share. The genetic distancebetween individuals is the average of their per-locus distances.Pairs of individuals are classified as ‘‘within population’’ or‘‘between population’’ according to whether the individuals were sampled from the same or different groups of popula-tions as defined above.
Dissimilarity fraction
ˆ
v
:
Let 
v
be the probability that a pairof individuals randomly chosen from different populations isgenetically more similar than an independent pair chosenfrom any single population. We compute all possible pair- wise genetic distances, classify them as within- or between-population distances (the sets
 W 
or
B
, respectively), and thencalculate the frequency with which
 W 
.
B
(that is, a within-population pair is more dissimilar than a between-populationpair).Thisfraction,ˆ
v
,isanestimatorof 
v
.Theexpectedvalueof ˆ
v
ranges from 0 to 0.5 (regardless of the number of pop-ulations). At ˆ
v
¼
0, individuals are always more similar tomembers of their own population than to members of otherpopulations; at ˆ
v
¼
0.5, individuals are as likely to be moresimilar to members of other populations as to members of their own. The distributions of pairwise genetic distances im-plied here resemble the common ancestry profiles proposedby M
ountain
and
amakrishnan
(2005), who use adifferent measure of genetic distance. The shared-alleles distance usedhere generally yields slightly lower values of ˆ
v
.
Centroid misclassification rate
C
:
The centroid classifica-tion method is also based on pairwise genetic distances, withone critical difference: Every individual is compared to thecentroid of each population, rather than to every other indi- vidual. The centroid is the genetic average of a population, anindividual whose pseudogenotypes at each locus are thefrequencies of the genotypes in that population (not includ-ing the individual being compared to the centroid). Thisgenetic distance is equivalent to the average of the geneticdistances from an individual to all other individuals in thetarget population. Each individual is then assigned to thepopulation with the closest centroid, as in C
ornuet
et al.
(1999). These assignments are compared to the known popu-lations of origin, and the proportion of individuals misclassi-fied is reported as
C
. The expected classification error forrandom assignment of individuals to populations is 1
À
1/
, where
is the number of populations.
Population trait value misclassification rate
T
:
Our defi-nition of 
T
is implicit in the theoretical illustrations of R 
isch
et al.
(2002) and E
dwards
(2003). These authors used sim-plified models to show how modest differences between pop-ulations can nonetheless enable accurate classification. Inboth cases, population membership is treated as an additivequantitative genetic trait controlled by many loci of equaleffect, and individuals are divided into populations on thebasis of their trait values.This method is inherently limited to dividing individualsinto just two clusters using only biallelic loci, so we limit ourdefinitions to that situation. Consider individuals sampledfrom two populations, A and B, and genotyped at many bi-allelic loci. At each locus, we identify the allele whose fre-quencyishigherinpopulationAandassignitavalueof0.Theother allele (more frequent in B than in A) is assigned a valueof 1. Let 
ij 
represent the genotype of individual
at locus
,defined as the average of the assigned values of the two allelescarried by that individual at that locus. Now define
as theaverage of 
ij 
over all loci
(so
is a polygenic quantitativegenetic trait). Given these definitions, if populations A and Bare typified by even slightly different allele frequencies at many loci, then
will usually be smaller for a member of population A than for a member of population B. Thus the value of thetrait 
indicates membership in one population or the other,so we call
the ‘‘population trait’’ value of individual
.IndividualsareassignedtopopulationAorBdependingon whether their population trait value
falls below or above352 D. J. Witherspoon
et al.
 
some dividing criterion
C
, respectively. In the case of just twopopulations, these assignments are compared to the knownorigins of the individuals, and the proportion misclassified isreported as
T
. The classification criterion
C
is chosen asfollows. Let 
 A 
be the mean of 
taken over all individuals inpopulation A, and define
B
similarly for population B. If thedistributions of 
for individuals from the two populations aresymmetric with equal variance, then letting
C
¼
(
 A 
1
B
)/2minimizes misclassification (
cf.
isch
et al 
. 2002; E
dwards
2003). To better account for unequal variances, we generalizeslightly and solve for a criterion
C
such that 
(
C
)
¼
(
C
) and
 A 
,
C
,
B
, where
and
are normal probability density functions with means and variances estimated from the dis-tributions of 
for populations A and B, respectively.To extend this inherently pairwise approach to more thantwo populations, assignments for each individual are initially computed with reference to eachpossible pair of populations.The values (0 or 1) assigned to particular alleles, the criterion
C
, and all
are calculated anew for each pair of populations.Individuals are finally assigned to a population only if they  were assigned to it in all pairwise comparisons involving that population. The proportion of individuals misclassified (ornotclassified, since thismethodcan failtoclassify individuals)is reported as
T
. For comparison, a ‘‘single-locus’’ classifica-tion error rate is computed by using this method to classify individuals using each locus singly and then averaging theresults over all loci.RESULTS
Distributions of distances:
Thestatisticsˆ
v
,
C
,and
T
are closely related by design. To illustrate the relation-ships between them, the distributions of the geneticmeasures that underlie them are shown in Figure 1. Forsimplicity, only two populations (Europeans and sub-Saharan Africans) and 50 typical loci randomly chosenfrom the insertions data set are used. The distributionsof pairwise genetic distances for within- and between-population pairs of individuals (Figure 1A) overlapconsiderably even for these geographically isolatedpopulations. The dissimilarity fraction,ˆ
v
, is 20%, in-dicating thatbetween-population pairs are more similarthan within-population pairs one-fifth of the time. Incontrast, the distributions of individuals’ distances tothe centroids of their own or different populations(Figure 1B) show much less overlap, resulting in
C
¼
4.2%. The population trait value distributions for Afri-cans and Europeans overlap for just three individuals, yielding
T
¼
1.8%. Classifications using model-basedmethods such as Structure (P
ritchard
et a
. 2000)achieve 90% accuracy or better using the same data(B
amshad
et al 
. 2003; W 
itherspoon
et al 
. 2006).The variances of the distributions are much greaterfortheindividual-to-individualcomparisons(Figure1A)thanforthecentroid-to-individual comparisons (Figure1B). The distribution means are nearly identical, how-ever,sothedistributionsoverlapmoreinFigure1Athanin 1B, and thusˆ
v
.
C
. The difference in variances isdue to the fact that each genetic distance to a centroid(each datum in Figure 1B) is equivalent to the averageof a sizable subset of pairwise genetic distances re-presented in Figure 1A (see
materials and methods
).That averaging step eliminates considerable variationand produces the narrower distributions of Figure 1B.The simplifications introduced by R 
isch
et al.
(2002)and E
dwards
(2003) allow an alternative view, repre-sented in Figure 1C. Here, each individual
is assigneda unidimensional genetic location
(the individual’spopulation trait value; see
materials and methods
).The trait distance between any two individuals
and
isnow just the horizontal distance between them,
j
 y 
j
.This simplification is possible only in the two-popula-tion case and requires a population-specific coding of allele states, so the trait distance is not equivalent to thegeneticdistances represented in Figure 1, A and B.None-theless, it is instructive to consider the analogy usingFigure 1C as a guide. For example, an African individ-ual
with
¼
0.52 will be more similar to a European
 with
 y 
¼
0.60 than to another African
 with
¼
0.4. Yet that individual
will still be closer to the populationmean trait value for Africans (
 A 
0.48, the Africancentroid) than to the mean value of Europeans (
B
0.68). It follows that many individuals like this one willbe correctly classified (yielding low 
C
and
T
) eventhough they are often more similar to individuals of theother population than to members of their own pop-ulation (yielding highˆ
v
).To empirically and quantitatively understand therelationships and contrasts betweenˆ
v
and the mis-classificationrates
C
and
T
,we examinethree primary factorsthatinfluencethem:thenumberofpolymorphicloci used, the allele frequencies at those loci, and thedegree of differentiation between the populationsexamined.
Data subset statistics:
Three data sets, labeled inser-tions, microarray, and resequenced, were used, and 15subsets were constructed from these to examine the ef-fectsofdifferentdatacollectionstrategies(see
materialsand methods
). Table 1 lists the 15 data subsets and re-portsˆ
v
,
C
,and
T
(eachcomputed over all loci ineachdata subset) as well as the expected value of 
T
whenonly a single locus is used. Table 1 also gives values of five descriptive statistics for each data subset: the pro-portion of genetic variance explained by interpopula-tion differences (
 F 
ST
); the observed proportions of heterozygotes (% het); the absolute differences in allelefrequencies between population pairs (averaged),
d
;the fraction of polymorphisms that are rare (MAF
,
0.1) in at least one population and at the same timecommon (MAF
.
0.1) in another population (% rareand common); and the fraction of loci that are mono-morphic in at least one population and common poly-morphisms in another (% fixed and common). The values observed are typical of human population ge-netic data sets (N
ei
1973; D
ean
et al 
. 1994; I
nterna-tional
H
ap
M
ap
C
onsortium
2005; S
hriver 
et al 
.2005; W 
itherspoon
et al 
. 2006).
Dependency of 
ˆ
v
on number of loci:
Figure 2 showsthe dependency of ˆ
v
,
C
, and
T
on the number of loci
Similarity 
vs 
. Classication 353
of 00

Leave a Comment

You must be to leave a comment.
Submit
Characters: ...
You must be to leave a comment.
Submit
Characters: ...