
Published by Nemester

Sufficiently large gene samples make it very easy (100% efficient) to sort by race in principal component analysis space.


Human genetic diversity: Lewontin’s fallacy

A.W.F. Edwards

Summary

In popular articles that play down the genetical differences among human populations, it is often stated that about 85% of the total genetical variation is due to individual differences within populations and only 15% to differences between populations or ethnic groups. It has therefore been proposed that the division of Homo sapiens into these groups is not justified by the genetic data. This conclusion, due to R.C. Lewontin in 1972, is unwarranted because the argument ignores the fact that most of the information that distinguishes populations is hidden in the correlation structure of the data and not simply in the variation of the individual factors. The underlying logic, which was discussed in the early years of the last century, is here discussed using a simple genetical example.

BioEssays 25:798–801, 2003. © 2003 Wiley Periodicals, Inc.

“When a large number of individuals [of any kind of organism] are measured in respect of physical dimensions, weight, colour, density, etc., it is possible to describe with some accuracy the population of which our experience may be regarded as a sample. By this means it may be possible to distinguish it from other populations differing in their genetic origin, or in environmental circumstances. Thus local races may be very different as populations, although individuals may overlap in all characters; . . .” R.A. Fisher (1925). See Ref. 1.

“It is clear that our perception of relatively large differences between human races and subgroups, as compared to the variation within these groups, is indeed a biased perception and that, based on randomly chosen genetic differences, human races and populations are remarkably similar to each other, with the largest part by far of human variation being accounted for by the differences between individuals. Human racial classification is of no social value and is positively destructive of social and human relations. Since such racial classification is now seen to be of virtually no genetic or taxonomic significance either, no justification can be offered for its continuance”. R.C. Lewontin (1972). See Ref. 2.

“The study of genetic variations in Homo sapiens shows that there is more genetic variation within populations than between populations. This means that two random individuals from any one group are almost as different as any two random individuals from the entire world. Although it may be easy to observe distinct external differences between groups of people, it is more difficult to distinguish such groups genetically, since most genetic variation is found within all groups.” Nature (2001). See Ref. 3.

Introduction

In popular articles that play down the genetical differences among human populations it is often stated, usually without any reference, that about 85% of the total genetical variation is due to individual differences within populations and only 15% to differences between populations or ethnic groups. It has therefore been suggested that the division of Homo sapiens into these groups is not justified by the genetic data. People the world over are much more similar genetically than appearances might suggest.

Thus an article in New Scientist (4) reported that in 1972 Richard Lewontin of Harvard University “found that nearly 85 per cent of humanity’s genetic diversity occurs among individuals within a single population.” “In other words, two individuals are different because they are individuals, not because they belong to different races.” In 2001, the Human Genome edition of Nature (3) came with a compact disc containing a similar statement, quoted above.

Such statements seem all to trace back to a 1972 paper by Lewontin in the annual review Evolutionary Biology. (2) Lewontin analysed data from 17 polymorphic loci, including the major blood-groups, and 7 ‘races’ (Caucasian, African, Mongoloid, S. Asian Aborigines, Amerinds, Oceanians, Australian Aborigines). The gene frequencies were given for the 7 races but not for the individual populations comprising them, although the final analysis did quote the within-population variability. “The results are quite remarkable. The mean proportion of the total species diversity that is contained within populations is 85.4% . . . . Less than 15% of all human genetic diversity is accounted for by differences between human groups! Moreover, the difference between populations within a race accounts for an additional 8.3%, so that only 6.3% is accounted for by racial classification.”


BioEssays 25.8


Gonville and Caius College, Cambridge, CB2 1TA, UK. E-mail: awfe@cam.ac.uk. DOI 10.1002/bies.10315. Published online in Wiley InterScience (www.interscience.wiley.com).

Problems and paradigms

Lewontin concluded “Since . . . racial classification is now seen to be of virtually no genetic or taxonomic significance . . . , no justification can be offered for its continuance” (full quotation given above).

Lewontin included similar remarks in his 1974 book The Genetic Basis of Evolutionary Change (5) “The taxonomic division of the human species into races places a completely disproportionate emphasis on a very small fraction of the total of human diversity. That scientists as well as nonscientists nevertheless continue to emphasize these genetically minor differences and find new ‘scientific’ justifications for doing so is an indication of the power of socioeconomically based ideology over the supposed objectivity of knowledge.”

The fallacy

These conclusions are based on the old statistical fallacy of analysing data on the assumption that it contains no information beyond that revealed on a locus-by-locus analysis, and then drawing conclusions solely on the results of such an analysis. The ‘taxonomic significance’ of genetic data in fact often arises from correlations amongst the different loci, for it is these that may contain the information which enables a stable classification to be uncovered.

Cavalli-Sforza and Piazza (6) coined the word ‘treeness’ to describe the extent to which a tree-like structure was hidden amongst the correlations in gene-frequency data. Lewontin’s superficial analysis ignores this aspect of the structure of the data and leads inevitably to the conclusion that the data do not possess such structure. The argument is circular. A contrasting analysis to Lewontin’s, using very similar data, was presented by Cavalli-Sforza and Edwards at the 1963 International Congress of Genetics. (7) Making no prior assumptions about the form of the tree, they derived a convincing evolutionary tree for the 15 populations that they studied. Lewontin, (2,5) though he participated in the Congress, did not refer to this analysis.

The statistical problem has been understood at least since the discussions surrounding Pearson’s ‘coefficient of racial likeness’ (8) in the 1920s. It is mentioned in all editions of Fisher’s Statistical Methods for Research Workers (1) from 1925 (quoted above). A useful review is that by Gower (9) in a 1972 conference volume The Assessment of Population Affinities in Man. As he pointed out, “. . . the human mind distinguishes between different groups because there are correlated characters within the postulated groups.”

The original discussions involved anthropometric data, but the fallacy may equally be exposed using modern genetic terminology. Consider two haploid populations each of size n. In population 1 the frequency of a gene, say ‘+’ as opposed to ‘−’, at a single diallelic locus is p and in population 2 it is q, where $p + q = 1$. (The symmetry is deliberate.) Each population manifests simple binomial variability, and the overall variability is augmented by the difference in the means. The natural way to analyse this variability is the analysis of variance, from which it will be found that the ratio of the within-population sum of squares to the total sum of squares is simply $4pq$. Taking $p = 0.3$ and $q = 0.7$, this ratio is 0.84; 84% of the variability is within-population, corresponding closely to Lewontin’s figure. The probability of misclassifying an individual based on his gene is p, in this case 0.3. The genes at a single locus are hardly informative about the population to which their bearer belongs.

Now suppose there are k similar loci, all with gene frequency p in population 1 and q in population 2. The ratio of the within-to-total variability is still 84% at each locus. The total number of ‘+’ genes in an individual will be binomial with mean $kp$ in population 1 and $kq$ in population 2, with variance $kpq$ in both cases. Continuing with the former gene frequencies and taking $k = 100$ loci (say), the mean numbers are 30 and 70 respectively, with variances 21 and thus standard deviations of 4.58. With a difference between the means of 40 and a common standard deviation of less than 4.6, there is virtually no overlap between the distributions, and the probability of misclassification is infinitesimal, simply on the basis of counting the number of ‘+’ genes. Fig. 1 shows how the probability falls off for up to 20 loci.

One way of looking at this result is to appreciate that the total number of ‘+’ genes is like the first principal component in a principal component analysis (Box 1). For this component the between-population sum of squares is very much greater than the within-population sum of squares. For the other components the reverse will hold, so that overall the between-population sum of squares is only a small proportion (in this example 16%) of the total. But this must not beguile one into thinking that the two populations are not separable, which they clearly are.

Each additional locus contributes equally to the within-population and between-population sums of squares, whose proportions therefore remain unchanged but, at the same time, it contributes information about classification which is cumulative over loci because their gene frequencies are correlated.

Figure 1. Graph showing how the probability of misclassification falls off as the number of gene loci increases, for the first example given in the text. The proportion of the variability within groups remains at 84% as in Lewontin’s data, but the probability of misclassification rapidly becomes negligible.
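The arithmetic of this example can be checked with a short Python sketch. It is an illustration only, not the computation behind Fig. 1: the function names are hypothetical, and the classification rule (assign an individual to population 2 when more than half of its k genes are ‘+’, with ties split evenly) is an assumption made here for the case p < q.

```python
from math import comb


def within_to_total_ratio(p):
    """Within-population share of the total sum of squares at one locus.

    For two equal-sized populations with '+' frequencies p and q = 1 - p,
    the within sum of squares is 2npq and the total is n/2, giving 4pq.
    """
    q = 1.0 - p
    return 4.0 * p * q


def misclassification_probability(k, p):
    """Chance of assigning a population-1 individual to population 2.

    The count of '+' genes over k loci is binomial(k, p) in population 1.
    Assumed rule: a count above k/2 means 'population 2'; a count of
    exactly k/2 is decided by a fair coin. By symmetry the same error
    rate applies to individuals from population 2.
    """
    q = 1.0 - p
    error = 0.0
    for x in range(k + 1):
        prob = comb(k, x) * p**x * q**(k - x)
        if 2 * x > k:
            error += prob
        elif 2 * x == k:
            error += 0.5 * prob
    return error


# The within-population share does not depend on the number of loci,
# while the misclassification probability shrinks as loci are added.
print(f"within/total ratio at each locus: {within_to_total_ratio(0.3):.2f}")
for k in (1, 5, 10, 20, 100):
    print(f"k = {k:3d}  P(misclassify) = {misclassification_probability(k, 0.3):.2e}")
```

With a single locus the function returns p itself, matching the 0.3 quoted in the text; the per-locus ratio stays at $4pq$ however many loci are counted.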

Classification

It might be supposed, though it would be wrong, that this example is prejudiced by the assumptions that membership of the two populations is known in advance and that, at each locus, it is the same population that has the higher frequency of the ‘+’ gene. In fact the only advantage of the latter simplifying assumption was that it made it obvious that the total number of ‘+’ genes is the best discriminant between the two populations.

To dispel these concerns, consider the same example but with ‘+’ and ‘−’ interchanged at each locus with probability ½, and suppose that there is no prior information as to which population each individual belongs. Clearly, the total number of ‘+’ genes an individual contains is no longer a discriminant, for the expected number is now the same in each group. A cluster analysis will be necessary in order to uncover the groups, and a convenient criterion is again based on the analysis of variance as in the method introduced by Edwards and Cavalli-Sforza. (10) Here the preferred division into two clusters maximises the between-clusters sum of squares or, what is the same thing, minimises the sum of the within-clusters sums of squares.

As pointed out by these authors, it is extremely easy to compute these sums for binary data, for all the information is contained in the half-matrix of pairwise distances between the individuals, and at each locus this distance is simply 0 for a match and 1 for a mismatch of the genes. Since interchanging ‘+’ and ‘−’ makes no difference to the numbers of matches and mismatches, it is clear that the random changes introduced above are irrelevant. Continuing the symmetrical example, the probability of a match is $p^2 + q^2$ if the two individuals are from the same population and $2pq$ if they are from different populations. With k loci, therefore, the number of matches between two individuals from the same population will be binomial with mean $k(p^2 + q^2)$ and variance $k(p^2 + q^2)(1 - p^2 - q^2)$, and if from different populations binomial with mean $2kpq$ and variance $2kpq(1 - 2pq)$. These variances are, of course, the same. The distance, being k minus the number of matches, has the same variance with the two means interchanged.

Taking $p = 0.3$, $q = 0.7$ and $k = 100$ as before, the means are 58 and 42 respectively, a difference of 16, the variances are 24.36 and the standard deviations both 4.936. The means are thus more than 3 standard deviations apart (about 3.24). The entries of the half-matrix of pairwise distances will therefore divide into two groups with very little overlap, and it will be possible to identify the two clusters with a risk of misclassification which tends to zero as the number of loci increases.

By analogy with the above example, it is likely that a count of the four DNA base frequencies in homologous tracts of a genome would prove quite a powerful statistical discriminant for classifying people into population groups.
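The clustering argument can be tried on simulated data. The Python sketch below is a minimal illustration under stated assumptions, not the procedure of Edwards and Cavalli-Sforza: the function names, the seed, the sample of 6 individuals per population and the choice of 300 loci (more than the 100 used in the text, so that a tiny sample separates cleanly) are all hypothetical, and the best split is found by brute-force enumeration. It relies on the standard identity that a cluster's sum of squares about its centroid equals the sum of its pairwise squared distances divided by the cluster size; for binary data the squared distance is the number of mismatches.

```python
import random
from itertools import combinations


def simulate(n_per_pop, k, p, rng):
    """Draw binary genotypes for two populations, unlabelled in effect:
    '+' and '-' are interchanged at each locus with probability 1/2."""
    flips = [rng.random() < 0.5 for _ in range(k)]
    people = []
    for freq in (p, 1.0 - p):  # population 1, then population 2
        for _ in range(n_per_pop):
            people.append([(rng.random() < freq) != flip for flip in flips])
    return people


def mismatch_matrix(people):
    """Pairwise distances: the number of loci at which two individuals differ."""
    n = len(people)
    d = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d[i][j] = d[j][i] = sum(a != b for a, b in zip(people[i], people[j]))
    return d


def within_ss(cluster, d):
    """Within-cluster sum of squares computed from pairwise distances alone."""
    return sum(d[i][j] for i, j in combinations(cluster, 2)) / len(cluster)


def best_split(d):
    """Enumerate every two-cluster split and keep the one that minimises
    the sum of the within-cluster sums of squares."""
    n = len(d)
    best = None
    # Individual 0 always stays in the first cluster, so each split is seen once.
    for mask in range(1, 2 ** (n - 1)):
        second = [i for i in range(1, n) if (mask >> (i - 1)) & 1]
        first = [i for i in range(n) if i not in second]
        total = within_ss(first, d) + within_ss(second, d)
        if best is None or total < best[0]:
            best = (total, first, second)
    return best


rng = random.Random(2003)  # arbitrary fixed seed for repeatability
people = simulate(n_per_pop=6, k=300, p=0.3, rng=rng)
d = mismatch_matrix(people)
total, first, second = best_split(d)
print("cluster 1:", first)
print("cluster 2:", second)
```

Individuals 0 to 5 were drawn from population 1 and 6 to 11 from population 2, so a split that reproduces those two index ranges has recovered the populations without using the labels or the direction of the ‘+’/‘−’ coding. Exhaustive enumeration is only practical for very small samples.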

Conclusion

There is nothing wrong with Lewontin’s statistical analysis of variation, only with the belief that it is relevant to classification.

Box 1. Principal component analysis

Principal Component Analysis (PCA) is a way of teasing out the more important information in multivariate data, where the high dimensionality renders simple graphical presentation impossible. The procedure can easily be understood even with just two variates, though its use might then be unnecessary. Taking an example from anthropometry where PCA originated, we might have data on the lengths and breadths of a number of human skulls. Each skull can be represented by a point in a diagram whose two axes are length and breadth. Since length and breadth will almost certainly be associated to some extent, the points will tend to be spread out preferentially in a certain direction, stretching from short length and breadth (a small skull) to long length and breadth (a large skull).

PCA defines this direction precisely as that of the line for which the sum of the squares of the perpendicular distances from the points to the line is a minimum. This line passes through the centre of gravity of the points, and a simple application of Pythagoras’s Theorem shows that the one-dimensional array of the points defined by the feet of the perpendiculars from the points to the line then has the maximum possible sum of squares. In other words, the variability of the data has been partitioned into two components, one of which, along this line, is known as the (First) Principal Component because it encapsulates as much of the variability as can be represented in one dimension. The Second Component, at right angles to the First, encapsulates the remainder, which is, of course, a minimum.

These two components can be used as replacement axes on the graph. Sometimes the First Component will have an obvious meaning, as would be the case with the skulls, where it is clear that it corresponds in a general way to ‘size’. Similarly the Second Component corresponds in some sense to ‘shape’, because a skull whose data-point is far from the line of the First Component will either be longer and narrower than the norm, or shorter and broader.

The procedure generalises to any number of variates, and the successive First, Second, Third, . . . Components are then mutually-orthogonal directions partitioning the total variability into ever-decreasing amounts. A graph of the first two components will represent as much of the information as is possible using only two dimensions.
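The two-variate case described in this box can be worked through directly. The Python sketch below is illustrative only: the skull measurements are invented for the purpose, and the function names are hypothetical. It uses the closed-form direction of the first component for two variates, the angle $\theta = \tfrac{1}{2}\operatorname{atan2}(2S_{xy},\, S_{xx} - S_{yy})$, where $S_{xx}$, $S_{yy}$ and $S_{xy}$ are the sums of squares and products about the centre of gravity.

```python
import math

# Hypothetical skull measurements (length, breadth) in millimetres.
skulls = [(170, 130), (175, 134), (180, 136), (185, 141),
          (190, 142), (195, 148), (182, 133), (178, 140)]


def centroid(points):
    """Centre of gravity of the points."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def projected_ss(points, angle):
    """Sum of squares of the points projected onto the line through the
    centroid that makes the given angle with the length axis."""
    mx, my = centroid(points)
    c, s = math.cos(angle), math.sin(angle)
    return sum(((x - mx) * c + (y - my) * s) ** 2 for x, y in points)


def principal_components_2d(points):
    """Return (angle of First Component, SS along First, SS along Second)."""
    mx, my = centroid(points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    syy = sum((y - my) ** 2 for _, y in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    # Direction that maximises the projected sum of squares, equivalently
    # minimises the sum of squared perpendicular distances to the line.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    ss_first = projected_ss(points, theta)
    ss_second = projected_ss(points, theta + math.pi / 2.0)
    return theta, ss_first, ss_second


theta, ss_first, ss_second = principal_components_2d(skulls)
share = ss_first / (ss_first + ss_second)
print(f"First Component direction: {math.degrees(theta):.1f} degrees from the length axis")
print(f"share of total sum of squares on the First Component: {share:.1%}")
```

The two sums of squares add up to the total about the centre of gravity, which is the Pythagoras step in the text, and no other line through the centroid gives a larger projected sum of squares than the First Component.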
