Edwards -- Human Genetic Diversity- Lewontin's Fallacy

Human genetic diversity: Lewontin’s fallacy
A.W.F. Edwards

Summary
In popular articles that play down the genetical differences among human populations, it is often stated that about 85% of the total genetical variation is due to individual differences within populations and only 15% to differences between populations or ethnic groups. It has therefore been proposed that the division of Homo sapiens into these groups is not justified by the genetic data. This conclusion, due to R.C. Lewontin in 1972, is unwarranted because the argument ignores the fact that most of the information that distinguishes populations is hidden in the correlation structure of the data and not simply in the variation of the individual factors. The underlying logic, which was discussed in the early years of the last century, is here discussed using a simple genetical example. BioEssays 25:798–801, 2003. © 2003 Wiley Periodicals, Inc.
‘‘When a large number of individuals [of any kind of organism] are measured in respect of physical dimensions, weight, colour, density, etc., it is possible to describe with some accuracy the population of which our experience may be regarded as a sample. By this means it may be possible to distinguish it from other populations differing in their genetic origin, or in environmental circumstances. Thus local races may be very different as populations, although individuals may overlap in all characters; . . .’’ R.A. Fisher (1925). See Ref. 1.

‘‘It is clear that our perception of relatively large differences between human races and subgroups, as compared to the variation within these groups, is indeed a biased perception and that, based on randomly chosen genetic differences, human races and populations are remarkably similar to each other, with the largest part by far of human variation being accounted for by the differences between individuals. Human racial classification is of no social value and is positively destructive of social and human relations. Since such racial classification is now seen to be of virtually no genetic or taxonomic significance either, no justification can be offered for its continuance’’. R.C. Lewontin (1972). See Ref. 2.

‘‘The study of genetic variations in Homo sapiens shows that there is more genetic variation within populations than between populations. This means that two random individuals from any one group are almost as different as any two random individuals from the entire world. Although it may be easy to observe distinct external differences between groups of people, it is more difficult to distinguish such groups genetically, since most genetic variation is found within all groups.’’ Nature (2001). See Ref. 3.
Introduction
In popular articles that play down the genetical differences among human populations it is often stated, usually without any reference, that about 85% of the total genetical variation is due to individual differences within populations and only 15% to differences between populations or ethnic groups. It has therefore been suggested that the division of Homo sapiens into these groups is not justified by the genetic data. People the world over are much more similar genetically than appearances might suggest.

Thus an article in New Scientist(4) reported that in 1972 Richard Lewontin of Harvard University ‘‘found that nearly 85 per cent of humanity’s genetic diversity occurs among individuals within a single population.’’ ‘‘In other words, two individuals are different because they are individuals, not because they belong to different races.’’ In 2001, the Human Genome edition of Nature(3) came with a compact disc containing a similar statement, quoted above.

Such statements seem all to trace back to a 1972 paper by Lewontin in the annual review Evolutionary Biology.(2) Lewontin analysed data from 17 polymorphic loci, including the major blood-groups, and 7 ‘races’ (Caucasian, African, Mongoloid, S. Asian Aborigines, Amerinds, Oceanians, Australian Aborigines). The gene frequencies were given for the 7 races but not for the individual populations comprising them, although the final analysis did quote the within-population variability. ‘‘The results are quite remarkable. The mean proportion of the total species diversity that is contained within populations is 85.4% . . . . Less than 15% of all human genetic diversity is accounted for by differences between human groups! Moreover, the difference between populations within a race accounts for an additional 8.3%, so that only 6.3% is accounted for by racial classification.’’
Gonville and Caius College, Cambridge, CB2 1TA, UK. E-mail: awfe@cam.ac.uk
DOI 10.1002/bies.10315
Published online in Wiley InterScience (www.interscience.wiley.com).
Problems and paradigms
 
Lewontin concluded ‘‘Since . . . racial classification is now seen to be of virtually no genetic or taxonomic significance . . . , no justification can be offered for its continuance’’ (full quotation given above).

Lewontin included similar remarks in his 1974 book The Genetic Basis of Evolutionary Change:(5) ‘‘The taxonomic division of the human species into races places a completely disproportionate emphasis on a very small fraction of the total of human diversity. That scientists as well as nonscientists nevertheless continue to emphasize these genetically minor differences and find new ‘scientific’ justifications for doing so is an indication of the power of socioeconomically based ideology over the supposed objectivity of knowledge.’’
The fallacy
These conclusions are based on the old statistical fallacy of analysing data on the assumption that it contains no information beyond that revealed on a locus-by-locus analysis, and then drawing conclusions solely on the results of such an analysis. The ‘taxonomic significance’ of genetic data in fact often arises from correlations amongst the different loci, for it is these that may contain the information which enables a stable classification to be uncovered.

Cavalli-Sforza and Piazza(6) coined the word ‘treeness’ to describe the extent to which a tree-like structure was hidden amongst the correlations in gene-frequency data. Lewontin’s superficial analysis ignores this aspect of the structure of the data and leads inevitably to the conclusion that the data do not possess such structure. The argument is circular. A contrasting analysis to Lewontin’s, using very similar data, was presented by Cavalli-Sforza and Edwards at the 1963 International Congress of Genetics.(7) Making no prior assumptions about the form of the tree, they derived a convincing evolutionary tree for the 15 populations that they studied. Lewontin,(2,5) though he participated in the Congress, did not refer to this analysis.

The statistical problem has been understood at least since the discussions surrounding Pearson’s ‘coefficient of racial likeness’(8) in the 1920s. It is mentioned in all editions of Fisher’s Statistical Methods for Research Workers(1) from 1925 (quoted above). A useful review is that by Gower(9) in a 1972 conference volume The Assessment of Population Affinities in Man. As he pointed out, ‘‘. . . the human mind distinguishes between different groups because there are correlated characters within the postulated groups.’’

The original discussions involved anthropometric data, but the fallacy may equally be exposed using modern genetic terminology. Consider two haploid populations each of size
$n$. In population 1 the frequency of a gene, say ‘+’ as opposed to ‘−’, at a single diallelic locus is $p$ and in population 2 it is $q$, where $p + q = 1$. (The symmetry is deliberate.) Each population manifests simple binomial variability, and the overall variability is augmented by the difference in the means. The natural way to analyse this variability is the analysis of variance, from which it will be found that the ratio of the within-population sum of squares to the total sum of squares is simply $4pq$. Taking $p = 0.3$ and $q = 0.7$, this ratio is 0.84; 84% of the variability is within-population, corresponding closely to Lewontin’s figure. The probability of misclassifying an individual based on his gene is $p$, in this case 0.3. The genes at a single locus are hardly informative about the population to which their bearer belongs.

Now suppose there are $k$ similar loci, all with gene frequency $p$ in population 1 and $q$ in population 2. The ratio of the within-to-total variability is still 84% at each locus. The total number of ‘+’ genes in an individual will be binomial with mean $kp$ in population 1 and $kq$ in population 2, with variance $kpq$ in both cases. Continuing with the former gene frequencies and taking $k = 100$ loci (say), the mean numbers are 30 and 70 respectively, with variances 21 and thus standard deviations of 4.58. With a difference between the means of 40 and a common standard deviation of less than 4.6, there is virtually no overlap between the distributions, and the probability of misclassification is infinitesimal, simply on the basis of counting the number of ‘+’ genes. Fig. 1 shows how the probability falls off for up to 20 loci.

One way of looking at this result is to appreciate that the total number of ‘+’ genes is like the first principal component in a principal component analysis (Box 1). For this component the between-population sum of squares is very much greater than the within-population sum of squares. For the other components the reverse will hold, so that overall the between-population sum of squares is only a small proportion (in this example 16%) of the total. But this must not beguile one into thinking that the two populations are not separable, which they clearly are.

Each additional locus contributes equally to the within-population and between-population sums of squares, whose proportions therefore remain unchanged but, at the same time, it contributes information about classification which is cumulative over loci because their gene frequencies are correlated.

Figure 1. Graph showing how the probability of misclassification falls off as the number of gene loci increases, for the first example given in the text. The proportion of the variability within groups remains at 84% as in Lewontin’s data, but the probability of misclassification rapidly becomes negligible.
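
The two-population example can be checked numerically. The following is a minimal sketch in Python, not the author's own computation: the function names are hypothetical, and the classification rule (assign to population 1 when fewer than half the loci carry ‘+’, split ties evenly) is an assumption, since the text does not spell out the rule behind Fig. 1.

```python
from math import comb


def within_to_total_ratio(p):
    """Within-population share of the total sum of squares at one locus.

    For the symmetric example (frequency p in population 1, q = 1 - p in
    population 2) the text gives this ratio as 4pq.
    """
    q = 1.0 - p
    return 4.0 * p * q


def misclassification_probability(p, k):
    """Probability of assigning a population-1 individual to population 2.

    The individual's count of '+' genes over k loci is Binomial(k, p).
    ASSUMED rule (not stated in the text): a count above k/2 is assigned
    to population 2, and a count of exactly k/2 is split evenly.
    """
    q = 1.0 - p
    prob = 0.0
    for x in range(k + 1):
        px = comb(k, x) * p**x * q ** (k - x)  # binomial probability of x
        if 2 * x > k:
            prob += px
        elif 2 * x == k:
            prob += 0.5 * px
    return prob


if __name__ == "__main__":
    p = 0.3
    print("within/total ratio:", within_to_total_ratio(p))
    for k in (1, 5, 10, 20, 100):
        print(k, misclassification_probability(p, k))
```

With a single locus the rule reduces to classifying on the gene itself, so the function returns $p$; by the symmetry of the example the same probability applies to a population-2 individual. The within-to-total ratio does not depend on $k$, which is the point of the argument.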
Classification
It might be supposed, though it would be wrong, that this example is prejudiced by the assumptions that membership of the two populations is known in advance and that, at each locus, it is the same population that has the higher frequency of the ‘+’ gene. In fact the only advantage of the latter simplifying assumption was that it made it obvious that the total number of ‘+’ genes is the best discriminant between the two populations.

To dispel these concerns, consider the same example but with ‘+’ and ‘−’ interchanged at each locus with probability ½, and suppose that there is no prior information as to which population each individual belongs. Clearly, the total number of ‘+’ genes an individual contains is no longer a discriminant, for the expected number is now the same in each group. A cluster analysis will be necessary in order to uncover the groups, and a convenient criterion is again based on the analysis of variance as in the method introduced by Edwards and Cavalli-Sforza.(10) Here the preferred division into two clusters maximises the between-clusters sum of squares or, what is the same thing, minimises the sum of the within-clusters sums of squares.

As pointed out by these authors, it is extremely easy to compute these sums for binary data, for all the information is contained in the half-matrix of pairwise distances between the individuals, and at each locus this distance is simply 0 for a match and 1 for a mismatch of the genes. Since interchanging ‘+’ and ‘−’ makes no difference to the numbers of matches and mismatches, it is clear that the random changes introduced above are irrelevant. Continuing the symmetrical example, the probability of a match is $p^2 + q^2$ if the two individuals are from the same population and $2pq$ if they are from different populations. With $k$ loci, therefore, the number of matches between two individuals from the same population (that is, $k$ minus the distance between them) will be binomial with mean $k(p^2 + q^2)$ and variance $k(p^2 + q^2)(1 - p^2 - q^2)$, and if from different populations binomial with mean $2kpq$ and variance $2kpq(1 - 2pq)$. These variances are, of course, the same.

Taking $p = 0.3$, $q = 0.7$ and $k = 100$ as before, the means are 58 and 42 respectively, a difference of 16, the variances are 24.36 and the standard deviations both 4.936. The means are thus more than 3 standard deviations apart (3.2415). The entries of the half-matrix of pairwise distances will therefore divide into two groups with very little overlap, and it will be possible to identify the two clusters with a risk of misclassification which tends to zero as the number of loci increases.

By analogy with the above example, it is likely that a count of the four DNA base frequencies in homologous tracts of a genome would prove quite a powerful statistical discriminant for classifying people into population groups.
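
The figures quoted for the pairwise comparison follow directly from the two match probabilities. A short Python sketch of that arithmetic is given here; the function names are hypothetical, and the code counts matches over $k$ loci (the pairwise distance is $k$ minus that count, with the same variance).

```python
from math import sqrt


def match_count_moments(p, k):
    """Means and variances of the number of matching loci for a pair.

    Returns (mean_same, mean_diff, var_same, var_diff), where 'same' is a
    pair drawn from one population and 'diff' a pair drawn one from each.
    Match probabilities are those stated in the text: p^2 + q^2 and 2pq.
    """
    q = 1.0 - p
    same = p * p + q * q  # match probability, same population
    diff = 2.0 * p * q    # match probability, different populations
    return (k * same, k * diff,
            k * same * (1.0 - same), k * diff * (1.0 - diff))


def separation_in_sd(p, k):
    """Difference between the two means in units of the common s.d."""
    mean_same, mean_diff, var_same, _ = match_count_moments(p, k)
    return (mean_same - mean_diff) / sqrt(var_same)


if __name__ == "__main__":
    print(match_count_moments(0.3, 100))
    print(separation_in_sd(0.3, 100))
```

Because $1 - (p^2 + q^2) = 2pq$ when $p + q = 1$, the two variances are the same product taken in the opposite order, which is why the text can treat them as equal. The separation grows in proportion to $\sqrt{k}$, so adding loci pulls the two groups of entries apart.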
Conclusion
There is nothing wrong with Lewontin’s statistical analysis of variation, only with the belief that it is relevant to classification.
Box 1. Principal component analysis
Principal Component Analysis (PCA) is a way of teasing out the more important information in multivariate data, where the high dimensionality renders simple graphical presentation impossible. The procedure can easily be understood even with just two variates, though its use might then be unnecessary. Taking an example from anthropometry where PCA originated, we might have data on the lengths and breadths of a number of human skulls. Each skull can be represented by a point in a diagram whose two axes are length and breadth. Since length and breadth will almost certainly be associated to some extent, the points will tend to be spread out preferentially in a certain direction, stretching from short length and breadth (a small skull) to long length and breadth (a large skull).

PCA defines this direction precisely as that of the line for which the sum of the squares of the perpendicular distances from the points to the line is a minimum. This line passes through the centre of gravity of the points, and a simple application of Pythagoras’s Theorem shows that the one-dimensional array of the points defined by the feet of the perpendiculars from the points to the line then has the maximum possible sum of squares. In other words, the variability of the data has been partitioned into two components, one of which, along this line, is known as the (First) Principal Component because it encapsulates as much of the variability as can be represented in one dimension. The Second Component, at right angles to the First, encapsulates the remainder, which is, of course, a minimum.

These two components can be used as replacement axes on the graph. Sometimes the First Component will have an obvious meaning, as would be the case with the skulls, where it is clear that it corresponds in a general way to ‘size’. Similarly the Second Component corresponds in some sense to ‘shape’, because a skull whose data-point is far from the line of the First Component will either be longer and narrower than the norm, or shorter and broader.

The procedure generalises to any number of variates, and the successive First, Second, Third, . . . Components are then mutually-orthogonal directions partitioning the total variability into ever-decreasing amounts. A graph of the first two components will represent as much of the information as is possible using only two dimensions.
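
For two variates the construction in Box 1 can be carried out in closed form. The sketch below, in Python, is illustrative only: the function names are hypothetical and the skull measurements are made-up numbers, not data from the paper. It finds the direction through the centre of gravity that maximises the sum of squares of the projected points, then splits the total sum of squares into the First and Second Components.

```python
from math import atan2, cos, sin


def first_principal_direction(points):
    """Centre of gravity and unit vector of the First Principal Component.

    The sum of squares along a direction at angle t is
    sxx*cos(t)^2 + 2*sxy*sin(t)*cos(t) + syy*sin(t)^2,
    which is maximised at t = 0.5 * atan2(2*sxy, sxx - syy).
    """
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    syy = sum((y - my) ** 2 for _, y in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    t = 0.5 * atan2(2.0 * sxy, sxx - syy)
    return (mx, my), (cos(t), sin(t))


def partition_sum_of_squares(points):
    """Sum of squares along the First Component and at right angles to it."""
    (mx, my), (ux, uy) = first_principal_direction(points)
    first = sum(((x - mx) * ux + (y - my) * uy) ** 2 for x, y in points)
    second = sum((-(x - mx) * uy + (y - my) * ux) ** 2 for x, y in points)
    return first, second


if __name__ == "__main__":
    # Hypothetical (length, breadth) pairs in millimetres, for illustration.
    skulls = [(170, 130), (175, 134), (180, 139),
              (185, 141), (190, 147), (195, 150)]
    print(first_principal_direction(skulls))
    print(partition_sum_of_squares(skulls))
```

The second value returned by `partition_sum_of_squares` is the sum of squared perpendicular distances to the line, so the two values add up to the total sum of squares about the centre of gravity, as the Pythagoras argument in the box requires.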