Professional Documents
Culture Documents
CONTENTS
Glossary
Project Organisation and DNA samples
1. Project organisation
2. DNA samples
SNP Discovery, SNP Selection and Genotyping
1. Genome-wide SNP discovery
2. SNP selection for inclusion in Phase I
3. SNP genotyping protocols and methods
Phase I Data Set
1. Phase I data set description
2. Data coordination and distribution
3. Quality control and quality assessment analysis
Population Genetic Data Analysis
1. SNP ascertainment features
2. Constructing a simulated Phase I HapMap for the ENCODE regions
3. Comparison of pairwise summaries of LD in ENCODE, HapMap, and previous studies
4. Selection of tag SNPs
5. Detecting cryptic relatedness of samples
6. Estimating recombination rates and detecting recombination hotspots
7. Nearest-neighbour analyses of haplotype structure
8. Estimation of FST
9. Identification of regions of unusual genetic variation
10. Tests of natural selection
11. Tests of transmission distortion
Supplementary Tables
Supplementary Figure Legends
References
Figures (provided as individual files)
GLOSSARY
Allele: One of several forms of a gene; at the DNA sequence level it refers to one of several
(usually, 2) nucleotide sequences at a particular position in the genome.
Genotype: The two specific alleles present in an individual; called a homozygote or
heterozygote depending on whether the two alleles are identical or different.
Polymorphism: The occurrence of multiple alleles at a specific site in the DNA sequence.
Classically, a site has been called polymorphic if the rarer of the two alleles, called the minor
allele, has a frequency above 1% in the population.
SNP (single nucleotide polymorphism): Polymorphism where multiple (usually, 2) bases
(alleles) exist at a specific genomic sequence site within a population, such as A and G. In
individuals, the possible combinations (genotypes) may be homozygous (AA or GG) or
heterozygous (AG).
Heterozygosity: The frequency of heterozygotes in the population.
Haplotype: A combination of polymorphic alleles on a chromosome delineating a specific
pattern that occurs in a population. The term is short for haploid genotype and has been used
classically to describe the patterns of variation in a small segment of the genome where genetic
recombination is rare, such as the HLA locus. However, when described as a haploid genotype it
can refer to the specific arrangement of alleles along an entire chromosome observed in an
individual, or in a specific region of a chromosome. For two SNPs with alleles A and G, and C
and G, the possible haplotypes are AC, AG, GC and GG.
Linkage phase: The specific arrangement of alleles in the haplotypes. For an individual who is
heterozygous at two SNPs, AG and CG (see above), the two haplotypes are either AC and GG, or
AG and GC. These arrangements are referred to as the phases of the genotypes.
Linkage disequilibrium (LD): The statistical association between alleles at two or more sites
(SNPs) along the genome in a population. Irrespective of the starting genetic composition of a
population, over time, the frequencies of the four possible haplotypes AC, AG, GC and GG are
expected to become the numerical products of the constituent allele frequencies, that is, reach an
equilibrium state. Any departure from this state is called disequilibrium and defined as D =
P(AC)P(GG) P(AG)P(GC) (using the above example) where P(.) refers to the frequency of that
haplotype. LD is commonly measured by the statistic D, which is the absolute value of D
divided by the maximum value that D could take given the allele frequencies; D ranges between
0 (no LD) and 1 (complete LD). LD decays depending on the rate of recombination between the
SNPs. Thus, the patterns of genomic recombination, and the occurrence of recombination
hotspots and coldspots, affect the decay of LD and its local patterns. When two SNPs are in
strong linkage disequilibrium, one or two of the four possible haplotypes may be missing.
Another way of measuring LD is by the coefficient of determination between the two alleles of
the two SNPs, a statistic called r2. The value of r2 (the square of the correlation coefficient) lies
between 0 and 1 and its maximum possible value depends on the MAFs of the two SNPs. It has
been used because its theoretical properties have been well studied and, most importantly,
because it measures how well one SNP can act as a surrogate (proxy) for another.
Tag SNPs (or tags): The set of SNPs selected for genotyping in a disease study. Given the
considerable extent of LD in local genomic regions, the choice of these SNPs for genotyping in a
disease association study is critical, as long as the cost of genotyping is still substantial. The
extensive correlation among neighbouring SNPs implies that not all of them need to be genotyped
since they provide (to some degree) redundant information. Tag SNP selection can be performed
using a variety of methods, with a common goal to capture efficiently the variation in the
genomic region of interest.
Demographic history: Extant human groups have populated the world after a founding group
emerged Out of Africa ~150,000 years ago. The changes in the demography (population size,
mating behaviour, migration, etc.) of this ancestral population, and the descendant ones, have
shaped the quantity and patterns of genetic variation in the human genome. Demographic history
is important for understanding the patterns of both benign and disease-related variation.
October of 2005, in which genotyping will be attempted in an additional 4.6 million SNPs, for a
final density of 1 SNP per kb. A Phase III component will assess the adequacy of the tag SNPs in
samples from additional populations in the ENCODE regions.
A complete accounting of SNPs genotyped for the Phase I data set by the HapMap Project by
chromosome, genotyping centre, genotyping technology, and analysis panel is provided in
Supplementary Table 11.
http://locus.umdnj.edu/nigms/nigms_cgi/sample.cgi?MOR50002
(iii) The Celera human genome sequencing effort used four samples 5:
HuAA.ref
HuAA.var
HuCC.ref
HuCC.var
HuDD.ref
HuDD.var
HuFF.ref
HuFF.var
(iv) Some sequences came from the following fosmid ends:
G248.ref
G248.var
NA15510
http://locus.umdnj.edu/nigms/nigms_cgi/sample.cgi?SEQVAR
(v) BCMWGS_S213.ref
BCMWGS_S213.var
The SNP reads are from a pool of 8 unrelated adult African-Americans, 4 female and 4 male,
from Houston, TX. The 8 samples were from the Baylor Polymorphism Resource, which
includes more than 500 ethnically diverse samples.
(vi) NIH24.ref
NIH24.var
The SNP Consortium used the Polymorphism Discovery Resource panel of 24 ethnically diverse
individuals6 for SNP discovery in a pooled form7.
(vii) WGSA.ref
WGSA.var
This is a mosaic single haploid, i.e., the Celera assembly, as submitted to GenBank under
accession #AADD00000000
http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&val=42795668
(viii) CLONE.ref
CLONE.var
All human sequence clone data from GenBank compared to the reference genome sequence.
(ix) EST.ref
EST.var
All EST sequence compared to the reference sequence. Not used for SNP discovery but for
double hit totals.
(x) Additional SNP discovery was performed for the ENCODE Project. (See
http://www.hapmap.org/downloads/encode1.html.en.) The DNA from the HapMap samples was
obtained from the Coriell Institute.
For the 5 ENCODE regions resequenced at Baylor, the DNA was amplified in segments
averaging 600 bases in length, using PCR primers designed with local custom software.
Amplified fragments were multiplexed up to six-fold to reduce the burden of subsequent
purifications using alkaline phosphatase treatments. Each PCR primer included a segment
corresponding to a DNA sequencing primer. Sequencing used standard fluorescent di-deoxy
chemistry. Base differences were identified using the SNP Detector software 8. Sequence traces
were submitted to the NCBI trace archive.
For the 5 ENCODE regions resequenced at the Broad Institute, PCR amplicons were designed to
tile across each region, with a target length of 750 bases per amplicon and 150 bases of overlap
between amplicons. PCR and clean-up were performed according to standard methods and
sequence traces were generated on ABI 3730 DNA Analyzers. All 488,747 sequence traces
generated are publicly available at the NCBI trace archive (http://www.ncbi.nlm.nih.gov/,
CENTER_NAME='WIBR' AND STRATEGY='ENCODE'). SNPs were discovered in a fully
SNPs were polymorphic, respectively. The degree of completeness of SNPs in each analysis
panel are provided in Supplementary Table 1; on average, data completeness was 99.34%; 93%
of SNPs exceeded 95% completeness.
Supplementary Table 11 shows the numbers of SNPs genotyped by each centre and platform for
each chromosome in each analysis panel. These are the data that passed the QC filters for all
three analysis panels and that were polymorphic in at least one analysis panel; this is the Phase I
data set. Supplementary Figure 18 shows the MAF distribution by analysis panel.
Supplementary Figure 19 shows the distribution of inter-SNP distances for each analysis panel,
by chromosome or region as done by each centre.
All data on genotyping assays were deposited at the DCC for distribution in the next release,
together with notes as to how the releases differ. The analyses of the data in this paper are all
based on release 16 unless otherwise noted. The analyses of genome-wide phased data are based
on release 16a. Unless noted, all other analyses, including the analyses of ENCODE phased data,
are based on release 16c1. These data include the local genomic sequence, sequence of primers
used for PCR amplification and genotyping, a detailed protocol for performing genotyping
assays, individual genotypes for each sample attempted, and, for samples that failed quality
filters, a code indicating the mode(s) of failure. Unmapped SNPs either have no map position in
the current human sequence build 34, or map to more than one location.
2. Data coordination and distribution
The Data Coordination Center (DCC) was responsible for distributing SNP allocations to
genotyping centres, integrating genotype data, applying quality filters, tracking progress,
distributing data through the project website, and managing the transient HapMap click-wrap
license agreement.
The HapMap web site provides interactive browsing using two open source tools, the GBrowse
genome browser and BioMart. GBrowse provides web-based graphical access to all of the
project data, relevant genome annotations, and additional annotations of each SNP. Researchers
can search GBrowse for a gene of interest, browse the region for genotyped SNPs, highlight
SNPs that match specific genetic criteria, view allele frequencies in the region, and inspect the
patterns of LD. The web site supports BioMart to extract SNPs that meet criteria such as distance
from a gene of interest, extent of LD with another SNP, minor allele frequency, or the availability
of an assay on a particular platform. GBrowse also provides tools for downloading data on the
SNPs in a selected region and for generating tag SNP sets according to a flexible set of criteria.
GBrowse can directly launch the linkage disequilibrium (LD) analysis software Haploview
(http://www.broad.mit.edu/mpg/haploview/), aiding LD analyses and tag SNP selection.
GBrowse provides data sharing so that users can upload their own genotype data for viewing in
the context of HapMap data, as well as superimposing annotation tracks from the UCSC Genome
Browser and EnsEMBL. The source code, configuration files and ancillary utilities for the
HapMap web site are available from the DCC. All methods produced by the project are also
available from the DCC.
3. Quality control and quality assessment analysis
Our aim was to release data of the highest quality, and at the same time, to ensure that all data,
including failures, be made available. Information on failed assays is important to other
researchers who intend to genotype the same SNPs or who want to improve the performance of
the platforms. In addition, SNPs that fail quality standards (whether through technical failures in
assay design or execution or because they exhibit an excess of missing genotypes, non-mendelian
segregation within families, or deviations from Hardy-Weinberg equilibrium) can help identify
interesting biological phenomena including the presence of nearby polymorphisms under
genotyping primers, polymorphic insertions/deletions containing a SNP, paralogous loci, and
natural selection. The project developed quality control filters to identify high-quality data, but
required that all data on all attempted genotyping assays be deposited and made available through
the Data Coordination Center (DCC).
A series of three QA exercises was carried out within the project to assess the quality of the
genotype data. The first exercise was a calibration exercise that benchmarked the different
platforms and laboratory protocols. This exercise provided operational insights and resulted in
consensus genotypes for a validated set of SNPs that can be used to evaluate any genotyping
platform. The second exercise was a blind quality progress check that monitored ongoing data
production, and was designed to evaluate all centres and platforms. From both tests it was clear
that the overall data quality was extremely high, exceeding initial expectations. The final exercise
was a fully blinded analysis of a random sample of all data incorporated into the HapMap. In
addition, a number of experiments of nature occurred, in which a set of SNPs was either
inadvertently or deliberately genotyped more than once during the course of the project or in
other experiments. These included a set of almost 22,000 SNPs on chromosome 2p already
genotyped in Phase I that were re-genotyped by Perlegen Sciences as part of the pilot for the
Phase II HapMap and a SNP set genotyped by Perlegen Sciences in their recent project 4 that
included 9 CEU samples also genotyped by the HapMap Project. All of these exercises,
including comparisons to internal and external duplicate data, documented that in the overall data
generated by the centres, SNPs passing QC filters had an average completeness exceeding 99%
and accuracy of 99.7%, with no centre having an error rate of greater than 1%.
The Calibration Exercise (Supp. Table 2) was initiated at the beginning of the project to
establish the baseline performance of each of the chosen platforms, and to identify types of errors
that might be found in the final data. It also aimed to test the data-flow protocols, quality filters,
and to establish a calibrated standard SNP set for others to use in future studies of genotyping
accuracy. In total, 1,500 variants were randomly chosen from dbSNP build 110 entries and were
provided to each of the centres for genotyping on the HapMap samples then available (CEU).
The genotyping data were returned to the DCC and used to build a consensus for 80,229
genotypes at 892 polymorphic SNPs, each called successfully by three or more centres. The
group consensus was then compared to each centres submission. The Calibration Exercise data
showed a range of accuracies from 97.20% to 99.95%, but, overall, demonstrated remarkably
high data quality from each centre, suggesting an error rate of ~ 5/1,000 genotypes across the
whole project.
Remarkably, about 95% of SNPs with a unique map position could be converted to a useful assay
by at least one platform, but fewer than 20% were successfully genotyped by all. This shows that
nearly all SNPs in the public database can be converted to a working assay, but that any
individual platform can succeed for only a subset of the assayable SNPs.
The existence of platform-to-platform differences was not regarded as limiting because all
functioned well. However, this exercise was not blind and was explicitly designed to catalyze
improvements at all the genotyping centres. The observation of loss of information from some
specific allele/platform combinations prompted follow-up DNA sequencing of a small number of
samples and provided limited evidence that variation that occurred at the site of binding of PCR
primers could influence genotype calls for some assays. The genotypes selected for examination
through re-sequencing led us to identify a class of SNPs that were truly polymorphic but appeared
monomorphic due to a consistent failure to recover one allele. Such allelic dropout was
estimated to occur in about 6% of the SNPs for each platform (about 20% of the SNPs flagged as
monomorphic), but had little consequence for the analyses since they were performed on SNPs
found to be polymorphic.
The second Quality Progress Check Exercise (Supp. Table 3) was performed when
approximately 50% of the CEU data had accumulated, and aimed to check the performance of
each platform at each centre. A set of 1,496 SNPs that were already genotyped was selected,
comprising 136 SNPs randomly selected from submissions by each of the eleven unique
centre/platform combinations. All 1,496 SNPs were attempted for genotyping by each of the
submitting centres and platforms. (The Broad Institute checked using only the Sequenom
platform.) In this exercise the genotypes had been deposited by the responsible centre prior to
their selection as part of the exercise; this was a blind exercise on the part of the production
genotyping centres.
The new genotype data were compiled, requiring three or more independent submitters to agree
on a consensus genotype. This consensus was then compared to each centres original
submission. The Quality Progress Check Exercise again demonstrated that the overall quality of
the data was very high. It was particularly noteworthy that a small number of individual SNPs
were responsible for the majority of the observed errors. At this point in the project there was an
increase in the overall fraction of SNPs that yielded high quality genotypes compared with the
first exercise, which may have been due to improvements in the individual platforms, but more
likely reflected the fact that SNPs successfully genotyped by one platform are more likely to
succeed on another.
The Final Phase I Quality Control Exercise (Supp. Tables 12, 13) was aimed at evaluating the
overall quality of the Phase I map that is, not balanced according to centre and platform, but
rather a random sample from the complete Phase I data set. Thus, 1,000 successfully genotyped
SNPs were selected randomly from the map and retested on three platforms (RIKEN Third Wave,
Sanger Illumina, and Broad Sequenom). In addition, 100 SNPs from each of five error classes
were selected and re-genotyped. To ensure completeness in this data set, 35 SNPs that did not
return data from at least two platforms were also attempted on a different platform (UCSF FPTDI); 33 worked. Consensus genotypes were obtained based on agreement among the multiple
determinations. Comparison with the original submission showed the overall accuracy of data in
the map was about 99.75%. In contrast, we estimate that the accuracy for assays failing the QC
filters is always less than 95%, whether the failure was due to an excess of missing genotypes,
Mendelian inconsistencies, or deviations from Hardy-Weinberg equilibrium resulting in a deficit
or excess of heterozygotes. The QC filters were applied on a per-analysis panel basis. When one
analysis panel failed the QC filters for a SNP, we estimate the accuracy for the other analysis
panels passing the QC filters for the same SNP to be about 99%.
The inadvertent duplicate genotyping performed (experiments of nature) provided an
independent check on the accuracy of genotyping. Irrespective of the analysis unit examined, and
based on total numbers of genotypes in excess of those generated as a part of the QA exercises, it
was amply clear that genotyping error rates were under 0.3% (Supp. Table 14.). These data also
emphasize that the most common genotyping error is the inability to identify both distinct alleles
in a heterozygote (~67% of all errors; 1 allele discrepant) rather than the miscalling of a
homozygote as a heterozygote or misidentification of alleles as complements of one another
(allele flipping; 2 alleles discrepant). Additional evidence of the high quality of the HapMap
genotyping arose from the 99.18% concordance between the Perlegen and HapMap genotyping in
the ENCODE regions for the CEU samples.
10
11
number of SNPs are ascertained from a small number of individuals 5,19. Some SNP discovery
projects7 included DNA from many individuals each sequenced to very low coverage; as a
consequence, the likelihood that two nearby SNPs would be correlated because of their shared
discovery from a given individual would be limited. However, other large SNP discovery efforts
exploited data, such as that collected by Celera 5, (which included only a few individuals,
including 3x coverage of donor B) or by mining sets of SNPs from long BAC overlaps in
genome sequencing projects. These designs have the disadvantageous property that regional sets
of SNPs may show elevated correlation simply because certain branches of the genealogical tree
relating human sequences are being disproportionately sampled.
Luckily, this subtle bias in the HapMap data is transient. In recent years, contributions to dbSNP
have come from an increasingly large number of DNA sources and this fact in conjunction with
the much more complete typing of SNPs ongoing in Phase II should provide a more complete and
even sampling of variation.
4. Selection of tag SNPs
We used the computer program Tagger to pick tag SNPs20. Tagger operates in two modes: (1) by
a greedy pairwise approach, in which the SNPs of interest are captured at a given minimal r2 by a
single marker (that is, a single tag), or (2) by aggressively searching for specific multi-marker
haplotype tests to capture the SNPs of interest. The latter is achieved by iteratively replacing a
tag SNP from pairwise tagging with a specific multi-marker test (based on the remaining tag
SNPs). That predictor will be accepted only if it can capture the alleles that were captured by the
discarded tag at the required r2; otherwise, that tag SNP is considered indispensable and retained.
To minimize the risk of overfitting, tag SNPs within a specified multi-marker test are forced to be
in strong LD (here defined as LOD score > 3) with one another and with the predicted allele.
Importantly, this multi-marker approach essentially performs an identical set of 1 d.f. tests of
association, only now using certain specific haplotypes as surrogates for single tag SNPs, thereby
requiring fewer tag SNPs for genotyping. Tagger is available as a web server
http://www.broad.mit.edu/mpg/tagger/ and as a stand-alone version in Haploview 21.
5. Detecting cryptic relatedness of samples
To identify related pairs of individuals we used the RELPAIR 22 and GRR23 software packages, as
well as a rapid algorithm for global IBD estimation (the Sham-Purcell method), described briefly
below. After identifying pairs of possibly related individuals, we constructed a series of dummy
pedigrees each describing a possible relationship between each pair of individuals and including
their genotyped spouse and offspring (for the YRI and CEU analysis panels). We then calculated
the multipoint likelihood for each dummy pedigree based on a subset of 10,000 SNPs evenly
distributed throughout the genome 24.
To verify results, an additional method, referred to as global IBD estimation (Sham, P. & Purcell,
S., personal communication) was used. This method assumes that the sample is homogeneous
and from the same population so that estimated allele frequencies are valid for each individual in
the sample. For a given pair of individuals, the observed number of SNPs that share 0, 1, or 2
alleles IBS is tallied. The expected number of SNPs that have 0, 1 or 2 alleles IBS, given that the
pair are unrelated (IBD = 0), parent-offspring (IBD = 1) and monozygotic twins (IBD = 2) is
calculated based on the estimated allele frequencies of the SNPs. Then the 3 observed IBS counts
are equated to 3 expected IBS counts, where each expected count is a weighted average of the 3
conditional expectations given the 3 possible IBD levels, with the weights being the unknown
probabilities. Solving the resulting set of simple equations allows the unknown IBD probabilities
to be estimated. To estimate an inbreeding coefficient for each individual, we maximized the
following pseudo-likelihood:
12
fpij (1 f ) pij2
P (Gi | f , pi )
2(1 f ) pij pik
In the likelihood above, f denotes the estimated inbreeding coefficient, pij the estimated allele
frequency for allele j at marker i, and Gi the observed genotype for marker i. The likelihood is
only approximate because it ignores linkage disequilibrium between markers. We obtained
similar results by simply examining the excess homozygosity for each individual.
Analysis of the genotype data revealed unreported relationships among the samples (Supp. Table
15). Three sets of relatives were found across trios in the Yoruba samples: one pair of first degree
relatives (individual NA19238 is likely the mother of NA18913, who is a parent in another trio),
one pair of second degree relatives (individual NA19192 is likely an uncle of NA19130), and one
pair of individuals who share about 1/8th of their genomic sequence (NA19092 and NA19101 are
likely first cousins). We also identified three pairs of individuals who shared approximately
1/16th of their genomic sequence across the CEPH trios and two Japanese individuals who show
an above average degree of cryptic relatedness. These individuals are all included in the data and
analyses presented in this paper. As the total level of sharing is not great, it is unlikely to affect
any of our genomic analyses substantially. We repeated some of the analyses after removing the
most closely related individuals and observed no major differences (on average, estimated
pairwise r2 coefficients differed by less than 0.002 for SNPs separated by less than 100 kb when
closely related individuals were removed). Nevertheless, we recommend that one individual from
each of these pairs be excluded when picking tag SNPs or examining LD for rare haplotypes,
since those applications may be more sensitive to duplicated chromosomes.
6. Estimating recombination rates and detecting recombination hotspots
We used the HapMap data to provide a genome-wide map of recombination rates and identify the
locations of recombination hotspots. We used the methods published in McVean et al. 25 to
estimate recombination rates (LDhat) and identify recombination hotspots (LDhot).
To estimate recombination rates on the autosomes (LDhat), we first broke the data into chunks of
2,000 SNPs (overlapping 200 SNPs) and then ran the method for 10 million iterations with a
burn-in of 100,000 iterations for each chunk. The method used the phased haplotype data, and
was run on the data for the unrelated parents (YRI and CEU, separately) and the full sample
(CHB+JPT combined) to produce three sets of recombination rates across the genome. To
convert these rates (in units of 4Ner per kb) into units of cM Mb-1 (centiMorgan per Megabase)
for each of the three sets of rates, we estimated N e separately by taking the total estimated
distance (4Ner units) across the whole genome, and comparing this to the genomic total distance
(cM units) as calculated from the deCODE map 26, summing rates across chromosomal segments
where both HapMap data and deCODE SNP positions allowed estimates of rates. This gave
Ne=15,459 (YRI), 10,699 (CEU), and 12,491 (CHB+JPT). We constructed our genomic
recombination map by averaging the three normalised rates, interpolating where necessary. For
the pseudoautosomal region of the X chromosome, we proceeded exactly as for the autosomes,
while for the non pseudo-autosomal portion, within which the number of chromosomes was
reduced relative to the rest of the genome, we re-estimated N e values separately using the same
procedure.
13
To detect recombination hotspots (LDhot), we analysed the same analysis panels separately from
each other. We tested for hotspots in 2 kb windows, slid 1 kb at a time, across the genome. We
compared the recombination rate in this window to that in the surrounding 200 kb (50 kb for the
densely resequenced ENCODE regions). We approximated dbSNP ascertainment by assuming
ascertainment in a panel of 12 individuals with a Poisson number of chromosomes (mean 1)
sampled from this panel, using a single hit ascertainment scheme. This scheme was chosen to
match the mean number of chromosomes, and average sharing between two ascertainment sets,
observed in the data at genotyped SNPs. For the ENCODE regions, we approximated SNP
ascertainment as being conducted by re-sequencing 16 individuals in each analysis panel. This
allowed us to obtain 3 sets of p-values across the genome, and we combined these p-values to call
hotspots, requiring that two of the three analysis panels show some evidence of a hotspot (p <
0.05) and at least one analysis panel show stronger evidence for a hotspot (p < 0.01). Hotspot
centres were estimated at those locations where distinct recombination rate estimate peaks (with
at least a factor of two separation between peaks) occurred, within the low p-value intervals.
7. Nearest-neighbour analyses of haplotype structure
We use the hidden-Markov methodology (HMM) of Li and Stephens 27 to model the conditional
distribution of the nth haplotype, using estimated recombination rates. For each of the 418
haplotypes in turn we calculate the posterior probability that it is most closely related to each of
the other 417 haplotypes for every SNP along the sequence using the forward and backward
algorithms. To describe the relationship between relatedness and panel-of-origin, we sum
posterior probabilities over haplotypes. For each SNP we can therefore represent the information
about panel of origin in colour (green=YRI posterior probability, orange=CEU,
purple=CHB+JPT). With no information about analysis panel origin, all haplotypes would be
brown (Supp. Fig. 5).
We can also use the HMM methods to calculate, for any given SNP position, the expected length
of the segment over which the local nearest-neighbour (which other haplotype it is most closely
related to) extends. Specifically, we can construct a forward matrix
R f (k ; i 1) q kk (i ) (i ) R f (k , i )
where k indexes the other haplotypes, i indexes the SNP, (i) is the physical or recombination
distance between SNP i and SNP i+1 and qkk(i) is the posterior probability that no recombination
occurred between the ith and the i+1th SNP conditional on k being the nearest-neighbour at SNP
i+1 (obtained from the standard forward and backward matrices). Similarly, we can construct a
reverse matrix
R b (k ; i ) q kk (i ) (i ) R b (k , i 1)
E ( R; i ) p k (i )[ R f (k , i ) R b ( k , i )]
k 1
where pk(i) is the posterior probability that haplotype k is the nearest-neighbour at position i
(obtained from the standard forward and backward matrices).
8. Estimation of FST:FST was estimated from the average pairwise differences between
chromosomes in each analysis panel compared to the combined samples
14
FST 1
nj
2
i
nij
nij 1
2
i
nj
xij (1 xij ) /
ni
ni 1
xi (1 xi )
(proportion) of the minor allele at SNP i in population j, nij is the number of genotyped
chromosomes at that position, and nj is the number of chromosomes analysed in that population.
The lack of the j subscript in the denominator indicates that statistics ni and xi are calculated
across the combined data sets. Note that alternative formulas for Fst, for example that do not
weight by sample size, or that use variance component estimation, will give slightly different
results.
9. Identification of regions of unusual genetic variation
From the phased haplotype data combined across all populations, in which missing data has been
imputed, we identified all haplotypes of at least 2 SNPs with a frequency of 0.05 or greater. Of
this set, any haplotype that was a subset of another one was removed to create a list of nonredundant haplotypes. Very long haplotypes are identified as those of 1 Mb or greater and consist
of at least 500 SNPs.
Candidates for balancing selection are identified, again from the phased haplotypes with imputed
missing data, from large clusters of SNPs in complete association. Unusual regions are identified
as those with more than 25 SNPs in complete association.
To identify SNPs showing unusually high levels of between-population differentiation we
calculated a likelihood ratio test statistic for heterogeneity in allele frequency across populations
X ij ln xij
ij
ln xi
where Xij is the count of allele i in population j, and xij is the estimated frequency of allele i in
population j (the denominator without the j subscript indicates that estimates and counts are
obtained from the combined population data). We chose to calibrate the method so as to identify
all SNPs that show as extreme a pattern of differentiation as rs12075, a non-synonymous SNP in
the Duffy (FY) gene, for which geographically restricted selection is known to be important. This
has a likelihood ratio test statistic of 150.
Alternative approaches to summarising levels of differentiation were also considered, including
FST and the beta-binomial model of population differentiation 28. However, neither alternative
identified rs12075 as being such as strong outlier.
15
2.2% (1.5% for the X chromosome); only SNPs that were polymorphic in at least one
analysis panel were included. 10,283 windows on the autosomes and 425 windows on the
X chromosome were accepted for analysis, covering a total of 2.57 Gb on the autosomes
and 129 Mb on the X.
The pexcess statistic was the fraction of SNPs with pexcess 0.6 where
pexcess
panc )
( p
1 panc
panc )
( p
pexcess
)
( panc p
p
anc
)
( panc p
is the estimated allele frequency in the target analysis panel and panc is the
where p
estimated ancestral allele frequency29 We estimated panc by a weighted mean of the
observed frequencies
in the other two analysis panels. Weights were obtained by
regressing all allele frequencies for the target analysis panel against those in the other two
analysis panels. pexcess was calculated only for SNPs with predicted and observed minor
allele frequencies greater than 0,05; a minimum of 20 (12 for X) comparisons were
required for each window.
Heterozygosity was calculated from comparison of the random shotgun SNP
ascertainment reads to the public reference genome. Libraries were chosen from among
those available to match the ancestry of the HapMap samples as closely as possible.
Heterozygosity calculated for the YRI analysis panel was based on the Baylor pool of
eight African-American individuals (see earlier), for CEU analysis panel was based on
two libraries of European ancestry (Coriell 07340 and Celera individual A), and for the
CHB+JPT analysis panel was based on two libraries of Chinese ancestry (Coriell 11321
and Celera individual F). The test statistic was the ratio of observed to expected
heterozygosity, based on local divergence and recombination rate (the latter as estimated
by the fine-scale recombination map). The dependence of heterozygosity on divergence
was determined by linear regression for all windows of the genome. An additional
dependence on recombination, after correction for diversity, was calculated for bins of
recombination rate.
Candidate windows were chosen based on the empirical distributions of the test statistics,
specifically those that were in the extreme tails for two different measures, with the
requirement that one of the two measures be diversity. This requirement was adopted
because the allele frequency and diversity measures are only weakly correlated for
neutrally evolving sequence, but strongly correlated for real selection events. It also had
the advantage of requiring evidence to come from two different data sources, with
different potential artefacts. The threshold for candidate status was that one statistic was
in the extreme 1.5% of the distribution and the other in the extreme 0.05% (2% and 2%
for the X chromosome). These thresholds, along with all other analysis choices, were
16
based on comparison of simulated neutral and selected loci, using a previously validated
model30.
A second class of methods was used to detect selection based on the allele frequency
spectrum of the derived allele. Allele frequencies were estimated from the genotype data
in release 16c1 for all 3 analysis panels. Each of the two human alleles was compared to
the chimpanzee genome respective nucleotide based on blastZ local alignments available
at the UC Santa Cruz Genome Browser. We defined the ancestral allele as the human
allele that matched the chimpanzee allele and report the frequency of the derived allele;
the ancestral allele was inferred for 93% of the HapMap SNPs.
11. Tests of transmission distortion
Deviations from the expected 50:50 transmission from a heterozygous parent can indicate
alleles with strong differential influences on survival and bias the results of family-based
association analyses31. We systematically searched for deviations from the expected
50:50 transmission ratio from heterozygous parents across the 60 parent-offspring trios.
While a number of dramatically skewed ratios (20:1) were observed (and verified as not
being genotyping errors), none of these exceeded the most extreme deviations expected
due to chance alone (Supp. Table 16, Supp. Table 17). Since power is limited given only
60 trios, confirmation in larger samples is needed.
17
SUPPLEMENTARY TABLES
Supp. Table 1 Completeness for QC-passed non-redundant genotype data
Set of SNPs
All SNPs (non-redundant)
Monomorphic
Polymorphic
SNPs 80-95% complete
Monomorphic
Polymorphic
YRI
Number
1,076,392
156,290
920,102
Proportion
100%
15%
85%
27,858
2,818
25,040
3%
0.3%
2.3%
Analysis panel
CEU
Number
Proportion
1,104,980
100%
234,482
21%
870,498
79%
CHB+JPT
Number
Proportion
1,087,305
100%
268,325
25%
818,980
75%
26,128
4,744
21,384
29,698
4,400
25,298
2%
0.4%
1.9%
18
3%
0.4%
2.3%
97%
24%
73%
Beijing*
Broad
Perkin Elmer Sequenom
Submitted SNPs
Submitted genotypes
999
86,256
796
69,090
910
80,657
SNPs
Genotypes
Call rate
Per SNP comparison
SNPs in consensus
SNPs with mismatch
Per genotype
comparison
Genotypes in
consensus
Genotypes with
mismatch
Error rate
Missing genotypes
Consensus is
homozygote
Consensus is
heterozygote
Proportion of
heterozygotes
690
61,129
0.984
530
45,872
0.962
653
57,836
0.984
673
52
514
212
631
74
596
199
822
42
773
18
59,648
44,483
55,906
51,089
73,906
281
0.0047
1,228
0.0276
242
0.0043
356
0.0070
639
1,071
299
253
678
556
0.284
0.388
0.650
241
227
254
201
40
175
52
212
42
Monomorphic SNPs
Confirmed
monomorphic
Potential dropout
Sanger
Illumina
Shanghai
Illumina
UCSF-WU
Perkin Elmer
1,034
92,964
1,194
107,375
597
51,810
746
67,095
0.999
858
77,160
0.999
430
37,687
0.974
790
23
743
28
833
56
413
48
69,201
70,566
66,826
74,902
36,213
139
0.0019
31
0.0005
83
0.0012
48
0.0007
135
0.0018
97
0.0027
1,371
22
192
314
24
27
598
1,160
22
159
177
15
17
345
0.458
0.500
0.453
0.360
Monomorphic SNPs that passed QC filters
227
342
320
300
0.385
0.386
0.366
287
337
129
226
61
267
70
96
33
190
37
19
262
80
242
78
260
40
1,500 variants (1,406 SNPs) were selected at random from dbSNP build 110 and submitted for genotyping at ten participating HapMap centres.
Genotype calls were used to build consensus calls, based on a majority vote among the centres calling each genotype. A minimum of three identical
genotype calls were required before assigning a consensus genotype. The original genotypes were then compared to the consensus and an error rate
estimated from the number of observed differences. The table summarises the agreement between the genotypes submitted by each centre and the
consensus call among all centres. In total, 96% of SNPs were converted by at least one centre.
Monomorphic SNPs were classified as confirmed monomorphic if none of the ten centres submitted a polymorphic call. They were classified as
potential dropout if at least 1 centre submitted a polymorphic call. Among SNPs classified as potential instances of allele dropout, about half appeared to
be polymorphic in submissions by at least 2 centres.
*The FP platform used by the Beijing Genome Center for this exercise was later replaced by an Illumina instrument. Thus, the results in this table do not
reflect the quality of contributions by the Beijing Genome Center to the project.
20
Submitted SNPs
Submitted genotypes
SNPs
Call rate
Per SNP comparison
SNPs in consensus
SNPs with mismatch
Per genotype
comparison
Genotypes in consensus
Genotypes with mismatch
Error rate
Missing genotypes
Consensus is homozygote
Consensus is
heterozygote
Proportion of
heterozygotes
Monomorphic SNPs
Confirmed monomorphic
Potential allelic dropout
Baylor
Beijing
Broad
Broad
ParAllele
Illumina
Sequenom
Illumina
136
11,913
136
11,752
139
12,321
111
0.973
117
0.960
126
0.985
113
1.000
111
0.946
117
0.999
110
11
117
20
123
7
113
5
110
6
9,626
25
0.0026
10,111
29
0.0029
10,911
14
0.0013
10,165
21
0.0021
178
247
48
96
172
111
0.350
0.411
0.698
24
19
5
19
13
6
13
9
4
Sequenom
Illumina
Illumina
RIKEN
Sanger
Shanghai
ThirdWave
Illumina
Illumina
UCSFWU
Perkin
Elmer
136
12,118
136
12,062
136
12,112
132
11,520
120
0.995
116
0.990
121
0.985
124
0.989
121
0.972
116
4
120
1
115
2
121
2
124
3
112
7
9,366
8
0.0009
10,431
135
0.0129
10,746
4
0.0004
10,246
2
0.0002
10,729
7
0.0007
11,042
3
0.0003
9,812
9
0.0009
197
30
68
103
79
153
337
24
36
58
35
115
0.346
0.360
0.307
0.429
20
12
8
15
10
5
12
8
4
10
5
5
0.400
0.631
0.333
0.444
Monomorphic SNPs that passed QC filters
26
25
19
13
15
20
16
10
11
5
3
3
21
1,496 SNPs were selected from submitted HapMap data at the half-way point in the project. SNPs were selected so as to include 136 original
submissions from each of 11 centre/platform combinations and submitted for genotyping at ten participating HapMap centres, using 11 different
protocols. Polymorphic calls were used to build consensus genotype calls, based on a majority vote among the centres calling each genotype. A
minimum of three identical genotype calls were required before assigning a consensus genotype. Genotypes for the original submissions (i.e. those
made before the exercise) were then compared to the consensus and an error rate for released HapMap data estimated from the number of observed
differences. The table summarises the agreement between the genotypes originally submitted by each centre and the consensus call generated in this
exercise.
Monomorphic SNPs were classified as confirmed monomorphic if none of the ten centres submitted a polymorphic call. They were classified as
potential dropout if at least one centre submitted a polymorphic call. Among SNPs classified as potential instances of allele dropout, about half
appeared to be polymorphic in submissions by at least two centres.
Changes in dbSNP and some SNPs that were inadvertently genotyped by multiple centres resulted in slightly more (or slightly fewer) than 136 SNPs
evaluated for each centre.
Note: Error rates from the Illumina submissions are due to the presence of two SNPs with genotypes very different from consensus in the SNP set
selected for this exercise. These SNPs are expected to occur at a low frequency in submissions from all centres, and the data do not suggest a
significantly higher error rate for genotypes submitted by Illumina.
22
23
1
3
3
8
8
8
8
9
17
18
X
X
X
X
X
92507914 - 92671929
88024503 - 88166914
121935322 - 122378721
52775365 - 52860503
78501582 - 78538286
99946447 - 100124266
109416887 - 109491211
95744289 - 95840481
44184837 - 44768436
32788105 - 32924881
62314035 - 63425220
65002577 - 65254427
65003211 - 65301712
74735355 - 74898295
74735885 - 75007543
24
Frequency in
global sample
0.048
0.251
0.084
0.481
0.179
0.328
0.376
0.141
0.057
0.263
0.283
0.315
0.322
0.427
0.303
No. of
SNPs
26
28
50
32
30
28
27
33
66
26
64
39
29
26
36
1
1
1
2
2
2
3
4
4
4
4
4
5
5
5
5
6
6
6
6
6
7
7
8
9
9
9
9
9
10
10
10
11
11
11
11
14
14
15
15
16
16
34666831 - 36348889
119473878 - 143310281
172886417 - 174320697
107029550 - 109653172
127609744 - 129120502
134797057 - 139716185
103087318 - 104771492
12075501 - 13539560
32458790 - 34930073
74566404 - 76287602
106940317 - 108903852
161397235 - 163946479
27884759 - 29766957
54139172 - 55716137
57910979 - 59501073
119283943 - 121042296
28521470 - 32948854
66269208 - 68677244
83898727 - 87867319
89073196 - 90734671
121879827 - 123512551
78366236 - 79988524
93125169 - 94737818
81548121 - 82648870
21121559 - 22145709
22196987 - 23377019
72731231 - 73791613
100038689 - 101565903
122495401 - 124370907
21452373 - 23373427
24693591 - 25995778
95015067 - 96705330
3555391 - 5995893
76284587 - 78346969
83228274 - 85584561
108494513 - 110228640
43191391 - 46991264
61711113 - 63648365
45923522 - 47989726
69756592 - 72423810
29057931 - 46753964
66105887 - 69313481
Number of
haplotypes
1
57
1
8
1
12
1
1
8
1
1
1
1
1
1
1
3
1
2
1
1
1
1
1
1
1
1
1
1
2
1
1
2
2
1
1
2
1
1
3
2
1
25
17
18
18
19
20
21
21
X
X
X
X
X
X
X
X
X
X
X
X
X
19051578 - 21327987
20200126 - 21207668
26923199 - 27961662
40538716 - 44451909
24527613 - 34579062
29128255 - 31143473
39093741 - 39962217
9457691 - 11746281
18241221 - 21142333
32789701 - 37015221
46508235 - 67199078
68171808 - 69734950
81076594 - 83895636
94663879 - 98658508
106131921 - 113828213
119348319 - 120678884
123296012 - 126883240
127451609 - 132500308
133308455 - 136249976
145464059 - 148067734
2
1
1
2
10
1
1
5
39
57
72
1
1
14
100
1
4
18
4
13
26
Haploid
genomes
48
2
16
2 per individual,
4 individuals
2 per individual,
4 individuals
1
-48
128
Sequence
coverage
0.6X
0.3X
1.0X
1-5X
SNPs
submitted1
880,764
583,482
2,566,383
4,519,749
1.0X
2,946,198
1.0X
-ND
6-10X
2,538,812
1,000,000
425,000
14,035,388
Total (non-redundant)
9,209,337
Uniquely mapped SNPs from major SNP discovery sources in dbSNP build 124.
These include only single nucleotide polymorphism variations with a unique
placement on the reference human genome assembly version 34.3.
1
For each submission, the SNPs counted were those not already in the dbSNP
build existing at that time. Some SNPs were in multiple submissions deposited
at similar times; the total of the submissions includes redundant SNPs.
Thus the total number of non-redundant SNPs is smaller than the sum of the
submissions.
27
Supp. Table 8 Rates of monomorphic and rare variants in dbSNP estimated from the ten HapMap
ENCODE regions
dbSNP
build
36
52
54
76
79
80
83
86
87
88
89
92
94
96
98
100
101
102
103
105
106
107
Latest SNP
submission
date in build
1/1999
8/1999
9/1999
4/2000
6/2000
7/2000
7/2000
9/2000
10/2000
10/2000
11/2000
1/2001
1/2001
6/2001
8/2001
9/2001
10/2001
11/2001
2/2002
4/2002
6/2002
8/2002
No. SNPs
added in
build
(genomewide)
3,955
9,528
398
16,409
175,892
67,537
156,638
304,645
91,118
253,581
82,631
191,317
43,501
140,766
13,113
451,932
117,231
3,992
30,976
4,884
3,157
69,061
No. SNPs
added in
build
(ENCODE)
Polymorphic
SNPs
Monomorphic
SNPs (false
positives)
Cumulative
false
positive
rate (%)
Common
SNPs
(MAF >
0.1)
Rare
SNPs
(MAF
0.1)
Cumulative
rare SNPs
(%)
5
3
1
59
113
10
192
525
112
535
31
288
67
262
42
912
59
1
33
7
3
108
5
3
1
56
105
7
169
436
100
486
18
239
58
215
39
767
47
1
24
3
2
98
0
0
0
3
8
3
23
89
12
49
13
49
9
47
3
145
12
0
9
4
1
10
0.0
0.0
0.0
4.4
6.1
7.3
9.7
13.9
13.5
12.0
12.6
13.3
13.3
13.8
13.7
14.3
14.5
14.5
14.6
14.7
14.7
14.5
5
3
0
52
103
7
166
413
97
470
16
230
56
208
38
724
43
1
19
2
2
88
0
0
1
4
2
0
3
23
3
16
2
9
2
7
1
43
4
0
5
1
0
10
0.0
0.0
11.1
7.7
4.1
4.0
2.9
4.2
4.1
3.8
3.9
3.9
3.9
3.8
3.8
4.3
4.4
4.4
4.5
4.5
4.5
4.7
28
108
9/2002
117,692
102
73
29
14.9
65
8
4.9
110
10/2002
10,020
22
20
2
14.9
18
2
4.9
111
2/2003
473,396
690
600
90
14.6
564
36
5.1
113
3/2003
34,516
27
14
13
14.8
12
2
5.1
114
4/2003
220,322
5
4
1
14.8
3
1
5.2
116
7/2003 1,583,055
2,058
1,667
391
16.2
1,550
117
5.7
117
7/2003
13,411
3
1
2
16.2
1
0
5.7
119
11/2003
850,690
1,032
869
163
16.1
788
81
6.3
120
2/2004 1,614,032
2,967
2,451
516
16.5
2,191
260
7.5
121
3/2004
655,416
696
466
230
17.6
411
55
7.7
123
8/2004
411,158
11
10
1
17.5
10
0
7.7
125
3/2005
19
16
3
17.5
15
1
7.7
Total
8,245,425
11,000
9,070
1,930
17.5
8,371
699
7.7
Rates of monomorphic and rare variants in dbSNP from the ten HapMap ENCODE regions. The number of SNPs added genome-wide in each successive build,
and the number of SNPs added within ENCODE regions are shown. Counts of polymorphic/monomorphic SNPs and common/rare variants are presented only
for dbSNPs mapped to one of the ten HapMap ENCODE regions (novel SNPs from the resequencing project are not included). Allele frequencies are calculated
from genotypes of all unrelated individuals from the four HapMap population samples. (Note that a monomorphic SNP within the HapMap samples is not
necessarily a false positive; the SNP may be polymorphic in a non-HapMap sample.) dbSNP builds not containing new SNPs in the HapMap ENCODE regions
are not shown (however, the total count of genome-wide dbSNPs represents the cumulative count for all builds); the genome-wide count for the most recent build
(125) is not yet available. Over time, as the depth of resequencing increases, the cumulative rate of rare SNPs and false positives also increases.
29
30
Research group
Percent
genome (%)
Chromosomes
Genotyping
24.3%
David Bentley
Peter Donnelly
U. of Oxford
Analysis
Lon Cardon
U. of Oxford
Analysis
Thomas Hudson
McGill U. and
Gnome Qubec
Innovation Centre
Genotyping
Japan
Ichiro Matsuda
Canada
Role
RIKEN, U. of
Tokyo
Health Sciences
U. of Hokkaido,
Eubios Ethics
Inst., Shinshu U.
Sanger Inst.
Yusuke Nakamura
United
Kingdom
Institution
Japanese MEXT
Public
consult.,
Samples
Genotyping
23.7%
1, 6, 10, 13, 20
10.1%
2, 4p
China
9.5%
Huanming
Yang
The
Chinese
HapMap
Consortium
(CHMC)
Huanming
Yang
Yan Shen
Lap-Chee
Tsui
Wei Huang
Beijing Genomics
Inst.
Chinese National
Human Genome
Center at Beijing
U. of Hong Kong,
Hong Kong U. of
Sci. & Tech.,
Chinese U. of
Hong Kong
Chinese National
Human Genome
Center at
Shanghai
Funding agency
Genotyping
5.9%
Genotyping
2.5%
Genotyping
1.1%
31
3, 8p, 21
Wellcome Trust
Wellcome Trust,
Nuffield Found.,
Wolfson Found.,
TSC, US NIH
Wellcome Trust,
TSC, US NIH
Genome
Canada,
Gnome Qubec
Chinese MOST,
Chinese
Academy of
Sciences,
National Natural
Science Found.
of China,
Hong Kong
Innovation and
Technology
Commission,
University Grants
Committee of
Hong Kong
Houcan Zhang
Comm.
engage.
Chinese MOST
Mark Chee
Beijing Genomics
Inst.
Illumina
Genotyping
16.1%
David Altshuler
Broad Inst.
Genotyping
9.7%
David Altshuler
Broad Inst.
Baylor College of
Medicine,
ParAllele
UCSF,
Washington U.
Analysis
Changqing Zeng
Richard Gibbs
Pui-Yan Kwok
United
States
Beijing Normal U.
Samples
32.4%
Genotyping
4.6%
12
Genotyping
2.0%
7p
US NIH
Kelly Frazer
Perlegen
Sciences
Genotyping
Aravinda Chakravarti
Gonalo Abecasis
Johns Hopkins U.
U. of Michigan
Analysis
Analysis
U. of Utah
Comm.
engage.,
Samples
Mark Leppert
ENCODE regions
and data for SNP
selection
Charles Rotimi
5 Mb (ENCODE)
on 2, 4, 7, 8, 9,
12, 18 in CEU
W.M. Keck
Found.,
Delores Dore
Eccles Found.,
US NIH
Comm.
engage.,
Samples
Cold Spring
Data
Lincoln Stein
Harbor Lab., New Coordination
York
Center
Note: Percent genome is based on the proportion of the HapMappable genome.
Nigeria
Howard U., U. of
Ibadan
32
US NIH
TSC, US NIH
Supp. Table 11 Number of SNPs genotyped for each chromosome, by centre, platform, and analysis panel
Total number of successfully
genotyped SNPs
Chrom. Center
1
Sanger
Baylor
Illumina
RIKEN
UCSF
Broad
Affy/Broad
Illumina
Total chr 1
2
McGill
Baylor
Sanger
Perlegen
Illumina
Broad
RIKEN
UCSF
Affy/Broad
Illumina
Total chr 2
3
CHMC
Baylor
Sanger
Illumina
RIKEN
UCSF
Platform
Illumina
ParAllele
Illumina
Third Wave
Perkin Elmer
Illumina
Centurion Chip
Illumina 40k
Illumina
ParAllele
Illumina
Perlegen
Illumina
Sequenom
Third Wave
Perkin Elmer
Centurion Chip
Illumina 40k
Illumina
Sequenom
ParAllele
Illumina
Illumina
Third Wave
Perkin Elmer
YRI
60447
746
152
9
0
1
7877
2674
71906
72132
495
458
0
20
19
7
0
9086
3010
85227
61653
14821
667
657
18
12
0
MAF = 0
33
CEU CHB+JPT
10329
12526
144
138
32
32
9
12
1
0
0
0
857
1298
1
172
11373
14178
11222
14528
86
94
79
81
115
0
9
12
6
7
4
4
2
0
888
1449
0
216
12411
16391
7377
10844
2830
2579
127
89
108
98
3
3
4
9
1
0
MAF 0.5
YRI
48713
519
118
2
0
1
6285
2254
57892
60236
346
320
0
15
16
0
0
7225
2552
70710
52509
12381
493
485
14
0
0
CEU CHB+JPT
43931
40800
505
514
106
104
1
1
5
0
1
1
6224
5451
2643
2201
53416
49072
53813
49442
318
329
307
322
404
0
10
4
9
9
1
1
4
0
7373
6431
2976
2487
65215
59025
39357
43905
19335
10403
454
500
442
447
14
10
0
2
10
0
10262
4146
92236
25564
8076
14312
313
295
0
6
3
0
7503
2491
58563
46010
389
391
3
0
7248
2083
56124
51939
518
223
11
0
7059
2344
62094
24223
5993
8919
10375
4215
91550
25794
7961
14319
295
287
458
6
4
4
7508
2528
59164
46209
371
363
4
2
7324
2104
56377
51866
780
223
7
7
7131
2361
62375
24446
5815
9001
10290
4146
92162
25480
8173
14310
309
291
0
6
10
0
7512
2503
58594
46093
387
390
4
0
7242
2069
56185
51975
525
222
11
0
7038
2332
62103
24185
6042
8900
886
289
5972
1637
512
770
58
55
0
1
0
0
609
129
3771
3496
84
77
0
0
713
129
4499
4626
78
23
3
0
648
169
5547
2241
531
743
34
914
0
11364
4886
675
1832
60
64
63
1
3
1
849
2
8436
6796
65
71
0
0
668
0
7600
7613
106
27
7
2
513
0
8268
5203
545
917
1424
261
15307
6264
1132
2476
66
71
0
3
3
0
1320
143
11478
8358
54
66
0
0
1120
90
9688
9037
84
49
8
0
894
156
10228
6742
754
1159
1108
353
8610
2838
806
1378
49
47
0
1
2
0
868
211
6200
5029
34
38
0
0
871
162
6134
5856
70
33
7
0
877
227
7070
3122
556
856
1031
71
8000
3037
473
1248
41
39
54
1
0
0
821
53
5767
4164
43
40
0
0
758
53
5058
5033
78
38
0
0
643
49
5841
3022
341
732
1414
444
10695
3097
832
1707
42
33
0
1
5
0
997
243
6957
5075
63
64
1
0
1022
181
6406
6289
75
28
3
0
1023
257
7675
3037
637
864
8268
3504
77654
21089
6758
12164
206
193
0
4
1
0
6026
2151
48592
37485
271
276
3
0
5664
1792
45491
41457
370
167
1
0
5534
1948
49477
18856
4906
7320
8430
4144
72186
17871
6813
11239
194
184
341
4
1
3
5838
2473
44961
35249
263
252
4
2
5898
2051
43719
39219
596
158
0
5
5975
2312
48265
16217
4929
7352
7452
3441
66160
16119
6209
10127
201
187
0
2
2
0
5195
2117
40159
32660
270
260
3
0
5100
1798
40091
36648
366
145
0
0
5121
1919
44199
14402
4651
6877
Perlegen
Baylor
Sanger
RIKEN
McGill
Illumina
Affy/Broad
Illumina
Total chr 7
8
Illumina
CHMC
Perlegen
ParAllele
Illumina
Third Wave
Illumina
Illumina
Centurion Chip
Illumina 40k
Illumina
Illumina
Sequenom
Sanger
Illumina
Baylor
ParAllele
Perlegen Perlegen
McGill
Illumina
Broad
Sequenom
Illumina
RIKEN
Third Wave
UCSF
Perkin Elmer
Affy/Broad Centurion Chip
Illumina
Illumina 40k
Total chr 8
9
Illumina
Illumina
Baylor
ParAllele
Sanger
Illumina
McGill
Illumina
Perlegen Perlegen
Broad
Sequenom
RIKEN
Third Wave
UCSF
Perkin Elmer
Affy/Broad Centurion Chip
Illumina
Illumina 40k
Total chr 9
0
326
331
4
3
1
5809
1961
47570
46652
10833
314
254
218
0
54
46
1
8
0
6102
1372
65854
45710
314
276
76
0
24
6
0
4245
1072
51723
1114
317
311
2
3
1
5807
1984
48801
46596
8800
2355
240
213
470
54
26
1
4
1
6226
1378
66364
45665
305
263
77
195
23
1
7
4342
1075
51953
0
337
327
7
3
1
5804
1961
47567
46656
10836
314
247
225
0
54
46
1
15
0
6097
1371
65862
45702
321
277
76
0
24
6
0
4248
1073
51727
0
56
73
2
0
0
551
125
4322
2720
484
25
43
37
0
8
6
0
6
0
569
72
3970
3254
65
44
15
0
5
3
0
388
75
3849
35
207
58
59
2
0
0
475
1
7467
5418
1303
215
49
54
156
19
12
0
3
0
628
0
7857
5555
62
52
17
63
5
1
0
419
0
6174
0
64
62
2
0
0
833
131
9747
7320
1995
41
53
56
0
22
32
0
8
0
1017
77
10621
6821
58
49
25
0
8
2
0
646
69
7678
0
49
47
1
0
0
727
192
5550
4642
809
41
38
36
0
9
14
0
1
0
713
123
6426
4262
35
37
12
0
3
2
0
487
97
4935
84
35
44
0
0
1
641
42
4942
3517
975
225
29
26
47
7
6
0
0
0
658
27
5517
3619
47
38
16
17
3
0
0
375
22
4137
0
43
40
5
0
1
733
169
5529
4150
1157
54
34
32
0
5
6
0
6
0
799
118
6361
4594
44
41
7
0
1
4
0
625
108
5424
0
221
211
1
3
1
4531
1644
37694
39288
9540
248
173
145
0
37
26
1
1
0
4820
1177
55456
38194
214
195
49
0
16
1
0
3370
900
42939
823
224
208
0
3
0
4691
1941
36388
37659
6522
1915
162
133
267
28
8
1
1
1
4940
1351
52988
36491
196
173
44
115
15
0
7
3548
1053
41642
0
230
225
0
3
0
4238
1661
32287
35184
7684
219
160
137
0
27
8
1
1
0
4281
1176
48878
34287
219
187
44
0
15
0
0
2977
896
38625
10
Sanger
Baylor
Illumina
RIKEN
McGill
UCSF
Affy/Broad
Illumina
Total chr 10
11
Baylor
Illumina
RIKEN
Sanger
UCSF
Affy/Broad
Illumina
Total chr 11
12
Baylor
Illumina
McGill
Sanger
Perlegen
Broad
Illumina
ParAllele
Illumina
Third Wave
Illumina
Perkin Elmer
Centurion Chip
Illumina 40k
ParAllele
Illumina
Third Wave
Illumina
Perkin Elmer
Centurion Chip
Illumina 40k
ParAllele
Illumina
Illumina
Illumina
Perlegen
Sequenom
Illumina
RIKEN
Third Wave
CHMC
Sequenom
UCSF
Perkin Elmer
Affy/Broad Centurion Chip
Illumina
Illumina 40k
Total chr 12
13
Sanger
Illumina
Baylor
ParAllele
Illumina
Illumina
RIKEN
Third Wave
37994
386
184
2
1
0
5017
1800
45384
554
7
34627
478
0
4771
1584
42021
32320
1838
568
329
0
156
1
6
1
0
4642
1963
41824
27737
168
56
5
37877
384
184
1
1
1
5063
1836
45347
546
7
34735
452
1
4826
1611
42178
32123
1837
573
310
627
108
1
3
1
3
4693
1995
42274
27633
164
56
5
37978
395
184
1
1
0
5022
1803
45384
563
7
34674
474
0
4766
1589
42073
32324
1838
571
321
0
145
1
5
1
0
4628
1953
41787
27739
166
56
7
3014
89
21
0
0
0
538
141
3803
73
1
2555
92
0
377
96
3194
3178
170
136
73
0
48
0
2
0
0
511
152
4270
1765
14
4
1
36
5898
74
42
1
0
0
435
0
6450
112
0
5582
99
1
492
1
6287
4979
283
102
73
83
16
0
3
0
0
399
1
5939
4237
32
10
5
7295
75
54
0
0
0
741
121
8286
102
0
6532
108
0
730
99
7571
4572
379
115
78
0
25
0
3
1
0
712
127
6012
5348
29
16
3
4211
47
31
1
0
0
550
152
4992
59
2
3707
50
0
525
141
4484
3543
234
66
34
0
23
0
1
1
0
571
172
4645
2910
17
8
4
3627
41
18
0
0
0
535
50
4271
46
0
3222
50
0
469
43
3830
3032
151
105
29
116
19
0
0
1
0
532
52
4037
2660
11
6
0
4346
46
21
1
0
0
671
153
5238
67
0
3683
65
0
601
132
4548
5134
270
121
48
0
38
1
1
0
0
674
226
6513
3167
27
8
3
30769
250
132
1
1
0
3929
1507
36589
422
4
28365
336
0
3869
1347
34343
25598
1434
366
222
0
85
1
3
0
0
3560
1639
32908
23062
137
44
0
28352
269
124
0
1
1
4093
1786
34626
388
7
25931
303
0
3865
1567
32061
24111
1403
366
208
428
73
1
0
0
3
3762
1942
32297
20736
121
40
0
26337
274
109
0
1
0
3610
1529
31860
394
7
24459
301
0
3435
1358
29954
22617
1189
335
195
0
82
0
1
0
0
3242
1600
29261
19224
110
32
1
UCSF
Affy/Broad
Illumina
Total chr 13
14
RIKEN
Sanger
Baylor
Illumina
UCSF
Affy/Broad
Illumina
Total chr 14
15
RIKEN
Baylor
Sanger
Illumina
Affy/Broad
Illumina
Total chr 15
16
RIKEN
Sanger
Baylor
Illumina
McGill
Affy/Broad
Illumina
Total chr 16
17
RIKEN
Sanger
Baylor
Illumina
Affy/Broad
Illumina
Total chr 17
Perkin Elmer
Centurion Chip
Illumina 40k
Third Wave
Illumina
ParAllele
Illumina
Perkin Elmer
Centurion Chip
Illumina 40k
Third Wave
ParAllele
Illumina
Illumina
Centurion Chip
Illumina 40k
Third Wave
Illumina
ParAllele
Illumina
Illumina
Centurion Chip
Illumina 40k
Third Wave
Illumina
ParAllele
Illumina
Centurion Chip
Illumina 40k
0
4593
1430
33989
23099
269
229
3
0
3530
1588
28718
21110
229
234
1
2644
1282
25500
20259
260
201
3
1
2039
1072
23835
20476
402
304
1
1694
952
23829
2
4646
1451
33957
23213
252
219
3
1
3574
1604
28866
21178
225
222
1
2677
1298
25601
20321
245
195
3
1
2056
1089
23910
20555
374
295
1
1712
975
23912
0
4591
1420
33979
23152
269
236
3
0
3527
1583
28770
21131
228
223
1
2641
1281
25505
20296
254
207
3
1
2038
1068
23867
20511
401
305
1
1689
961
23868
0
361
72
2217
1749
53
24
0
0
340
91
2257
1704
37
46
0
245
74
2106
1603
73
48
0
0
182
79
1985
1888
75
44
0
149
81
2237
37
0
485
1
4770
3987
44
33
1
0
343
1
4409
4025
43
41
0
305
0
4414
3830
58
44
0
1
233
0
4166
3665
71
53
0
156
0
3945
0
729
84
6209
4574
41
32
1
0
530
96
5274
4376
32
24
0
418
64
4914
4845
58
50
0
1
331
67
5352
4483
59
53
0
248
79
4922
0
495
99
3533
2702
42
36
0
0
387
160
3327
2262
31
34
0
315
100
2742
2293
24
24
0
1
248
95
2685
2350
47
37
0
203
90
2727
0
527
44
3248
2314
21
18
0
0
351
30
2734
2343
23
25
0
290
40
2721
2184
28
29
0
0
230
18
2489
1777
36
28
0
136
20
1997
0
594
135
3934
2503
35
29
0
0
502
141
3210
2342
34
40
0
344
140
2900
2112
35
40
1
0
223
89
2500
2142
52
34
0
201
68
2497
0
3737
1259
28239
18648
174
169
3
0
2803
1337
23134
17144
161
154
1
2084
1108
20652
16363
163
129
3
0
1609
898
19165
16238
280
223
1
1342
781
18865
2
3634
1406
25939
16912
187
168
2
1
2880
1573
21723
14810
159
156
1
2082
1258
18466
14307
159
122
3
0
1593
1071
17255
15113
267
214
1
1420
955
17970
0
3268
1201
23836
16075
193
175
2
0
2495
1346
20286
14413
162
159
1
1879
1077
17691
13339
161
117
2
0
1484
912
16015
13886
290
218
1
1240
814
16449
18
Illumina
Broad
Illumina
Illumina
Sequenom
Perlegen Perlegen
Baylor
ParAllele
Sanger
Illumina
McGill
Illumina
RIKEN
Third Wave
Affy/Broad Centurion Chip
Illumina
Illumina 40k
Total chr 18
19
RIKEN
Third Wave
Sanger
Illumina
Baylor
ParAllele
Illumina
Illumina
UCSF
Perkin Elmer
Affy/Broad Centurion Chip
Illumina
Illumina 40k
Total chr 19
20
Sanger
Illumina
Baylor
ParAllele
Illumina
Illumina
RIKEN
Third Wave
Affy/Broad Centurion Chip
Illumina
Illumina 40k
Total chr 20
21
CHMC
Illumina
Sequenom
Sanger
Illumina
Baylor
ParAllele
Illumina
Illumina
RIKEN
Third Wave
Affy/Broad Centurion Chip
Illumina
Illumina 40k
29294
2247
999
0
160
152
55
4
3195
1067
37173
14479
539
388
1
0
582
758
16747
14681
184
100
1
1800
641
17407
15030
55
128
100
3
4
1677
634
29277
2272
976
475
149
145
56
2
3276
1067
37695
14527
502
385
1
1
591
769
16776
14803
180
100
0
1812
643
17538
14978
55
119
91
3
2
1716
638
29281
2251
994
0
159
150
55
8
3199
1065
37162
14491
535
398
1
0
577
760
16762
14676
193
100
0
1799
641
17409
15042
55
126
99
3
2
1679
637
1563
122
40
0
24
20
3
1
303
80
2156
1319
93
50
0
0
41
35
1538
1093
25
11
0
152
45
1326
985
12
21
12
0
2
120
45
38
4484
428
77
125
24
17
26
2
311
0
5494
2598
97
63
0
0
50
0
2808
1695
26
27
0
164
1
1913
1855
18
18
15
1
2
147
1
5744
534
130
0
29
20
28
3
536
60
7084
2889
81
61
0
0
64
36
3131
2449
33
35
0
283
40
2840
2360
26
20
18
1
1
219
35
2804
189
75
0
20
17
13
3
352
79
3552
1757
76
42
0
0
59
81
2015
1477
20
22
1
173
55
1748
1503
14
11
14
2
2
185
69
2472
251
43
35
17
21
3
0
373
19
3234
1580
68
45
0
0
73
23
1789
1007
24
14
0
181
21
1247
1029
9
9
9
0
0
166
12
3245
242
90
0
12
13
5
4
482
113
4206
1709
59
36
0
0
92
65
1961
1703
28
10
0
211
53
2005
1346
8
16
11
0
1
220
39
24926
1936
884
0
116
115
39
0
2540
908
31464
11403
370
296
1
0
482
642
13194
12111
139
67
0
1475
541
14333
12542
29
96
74
1
0
1372
520
22320
1593
856
315
108
107
27
0
2592
1048
28966
10349
337
277
1
1
468
746
12179
12101
130
59
0
1467
621
14378
12094
28
92
67
2
0
1403
625
20291
1475
774
0
118
117
22
1
2181
892
25871
9893
395
301
1
0
421
659
11670
10524
132
55
0
1305
548
12564
11336
21
90
70
2
0
1240
563
Total chr 21
22
Illumina
Illumina
Sanger
Illumina
Baylor
ParAllele
McGill
Illumina
RIKEN
Third Wave
UCSF
Perkin Elmer
Affy/Broad Centurion Chip
Illumina
Illumina 40k
Total chr 22
X
Illumina
Illumina
Sanger
Illumina
Baylor
ParAllele
RIKEN
Third Wave
McGill
Illumina
UCSF
Perkin Elmer
Affy/Broad Centurion Chip
Illumina
Illumina 40k
Total chr X
Y
Broad
Sequenom
Illumina
Illumina 40k
Total chr Y
mtDNA Broad
Sequenom
Total all chromosomes
17631
17602
14765
14750
261
233
147
142
6
6
4
2
0
3
641
666
660
661
16484
16463
44324
44306
20
33
36
25
1
2
0
0
0
1
1936
2057
1362
1364
47679
47788
13
12
0
1
13
13
168
168
1009531 1014331
17643 1197
2057
14772 1461
1603
248
59
32
152
26
17
6
0
0
5
1
2
0
0
0
656
86
64
660
69
0
16499 1702
1718
44337 3543 10291
68
11
27
48
11
20
2
0
2
1
0
0
0
0
0
1952
154
357
1362
100
0
47770 3819 10697
13
10
8
0
0
0
13
10
8
168
62
120
1009935 77456 146025
2680
1800 1234
2304
1578 1394
47
17
41
31
13
21
0
1
0
2
2
0
0
0
0
115
78
91
55
67
14
2554
1756 1561
13230
4889 4305
22
8
4
21
8
5
2
1
0
1
0
0
0
0
1
515
227
223
143
127
37
13934
5260 4575
8
0
1
0
0
0
8
0
1
105
41
19
186087 107013 93642
SNPs that were genotyped by more than one centre are counted for each centre, so the total here is slightly more than the actual
number of SNPs in the Phase I data set.
39
13322
10975
171
104
4
0
0
455
547
12256
26348
40
21
0
0
0
1186
1089
28684
3
0
3
27
708218
Analysis panel
CEU
CHB+JPT
40
Analysis panel
CEU
CHB+JPT
69
6,369
58
5,457
55
5
1
49
45
2
1
42
60
3,250
20
994
71
4,064
54
31
1
22
18
15
1
2
64
37
4
23
48
3,858
33
3,086
57
4,842
45
8
2
35
29
1
0
28
49
4
2
43
58
5,345
63
5,857
80
7,282
36
6
0
30
41
11
0
30
55
8
2
45
To evaluate the quality of genotyping, we selected 1,500 SNPs for repeat genotyping
among the 1,143,598 SNPs genotyped as of February 2005. Each of the SNPs was regenotyped in up to 4 different labs (RIKEN, Broad, Sanger, and Washington University)
and re-genotyping results were used to create a consensus call. The table summarises
results for a comparison of the original genotype with consensus genotype for 100 SNPs
each that were selected because they: a) exhibited an excess of Mendelian
inconsistencies, b) had >20% missing data, c) exhibited an excess of homozygotes or d)
exhibited an excess of heterozygotes.
41
Supp. Table 14 Genotyping error rates from duplicate data that passed QC filters
a) 5 internal duplicates, passed QC filters
SNPs compared
Genotypes compared from dup
samples
Successful in both
Number discrepant
1 allele discrepant
2 alleles discrepant
Total error rate
Analysis panel
CEU
1,157,650
YRI
1,123,296
5,616,480
5,051,963
1,967
1,944
23
0.04%
0.04%
0.00%
0.02%
5,788,250
5,150,411
2,454
2,414
40
0.60%
0.51%
0.09%
0.30%
21,841
1,965,690
1,917,620
10,885
8,015
2,870
436,179
3,925,611
3,849,881
17,717
14,029
3,688
42
CHB+JPT
1,134,726
0.05%
0.05%
0.00%
0.02%
0.47%
0.35%
0.12%
0.24%
0.57%
0.42%
0.15%
0.28%
0.46%
0.36%
0.10%
0.23%
5,673,630
5,100,801
2,821
2,699
122
50,101
4,709,494
4,476,142
24,319
20,328
3,991
21,787
1,939,043
1,894,347
8,981
7,303
1,678
0.06%
0.05%
0.00%
0.03%
0.54%
0.45%
0.09%
0.27%
0.47%
0.39%
0.09%
0.24%
Duplicate SNPs were identified from inadvertent genotyping of the same SNP by two centres
within the project, from the addition of data from the Illumina 40K and the Affymetrix Centurion
products, and from comparison to external published studies and the pilot Phase II data on
chromosome 2p. The discrepancy and genotype error rates are estimated from a considerable
number of genotypes from all 3 analysis panels.
43
44
Position (bp)
227,470,743
175,426,257
16,417,587
84,986,086
205,204,106
125,833,675
179,093,969
185,812,239
185,812,239
37,941,903
36,710,212
36,714,465
42,773,204
138,812,597
19,121,947
122,847,845
42,691,906
3,118,422
3,120,428
106,320,198
81,489,046
5,928,062
19,046,338
19,046,338
134,151,957
16,798,123
29,552,643
34,186,812
105,197,455
105,197,455
124,020,475
124,020,475
80,152,804
1,824,502
SNP
rs2046614
rs10798601
rs1429405
rs1192372
rs1559930
rs4678160
rs9877019
rs7660649
rs7660649
rs1423436
rs6457940
rs6906101
rs6908950
rs93241662
rs2390085
rs6971297
rs7825957
rs985648
rs657877
rs1412427
rs6584721
rs7125355
rs1503503
rs1503503
rs4614466
rs1471943
rs2216858
rs11832550
rs2279759
rs2279759
rs4622332
rs4622332
rs6574676
rs171162
Ratio
3
24
2
23
16 0
4
26
2
21
26 4
0
16
2
22
18 1
25 3
19 1
2
21
27 4
15 0
1
22
1
18
22 2
26 4
2
21
16 0
18 0
2
24
26 2
19 1
2
21
22 2
2
22
24 3
4
27
3
23
3
25
1
18
23 3
1
18
p value
4.9 x 10-5
1.9 x 10-5
3.1 x 10-5
5.9 x 10-5
6.6 x 10-5
5.9 x 10-5
3.1 x 10-5
3.6 x 10-5
7.6 x 10-5
2.7 x 10-5
4.0 x 10-5
6.6 x 10-5
3.4 x 10-5
6.1 x 10-5
5.7 x 10-6
7.6 x 10-5
3.6 x 10-5
5.9 x 10-5
6.6 x 10-5
3.1 x 10-5
7.6 x 10-6
1.0 x 10-5
3.0 x 10-6
4.0 x 10-5
6.6 x 10-5
3.6 x 10-5
3.6 x 10-5
4.9 x 10-5
3.4 x 10-5
8.8 x 10-5
2.7 x 10-5
7.6 x 10-5
8.8 x 10-5
7.6 x 10-5
CEU
8
11
1
4
13
0
12
8
16
1
11
19
9
3
3
10
5
12
5
5
0
0
6
12
6
12
6
13
5
17
5
Ratio
6
6
3
7
7
0
5
9
10
8
1
16
18
4
1
16
4
14
9
7
3
0
3
15
6
12
4
13
5
12
6
p value
.7905
.3323
.6250
.5488
.2632
1.000
.1435
1.000
.3269
.0391
.0063
.7359
.1221
1.000
.6250
.3269
1.000
.8450
.4240
.7744
.2500
1.000
.5078
.7011
1.000
1.000
.7539
1.000
1.000
.4583
1.000
Gene
ITB5
ENPP6
ENPP6
Ectonucleotide pyrophosphatase
MRGX2
MRGX2
ARG99
b
Transmissions to female offspring only
Transmissions to male offspring only
d
Transmissions from female parents only
Transmissions from male parents only
e
Multiple SNPs in this region showed identical evidence for transmission distortion
c
45
Position (bp)
55,193,325
55,219,072
16,585,452
50,383,194
SNP
rs10496036
rs17046594
rs4685345
rs2236953
Analysis panel
CEU
YRI
Ratio
p value
Ratio
p value
3 24
4.9 x 10-5
8
8
1.000
26
4
5.9 x 10-5
9
7
.8036
17
0
1.5 x 10-5
5
2
.4531
19
1
4.0 x 10-5
0
0
1.000
3e
3e
3c
50,384,189
50,408,074
132,649,868
rs2236954
rs2236964
rs1393555
1
18
19
20
1
1
2.1 x 10-5
7.6 x 10-5
4.0 x 10-5
17
1
13
0
.5847
1.000
3b,e
3b,e
5
5
7
7
7
9a
12
13
15e
20
167,273,210
167,385,449
24,968,096
31,772,165
37,713,607
100,757,589
143,504,006
6,157,017
104,118,222
36,566,892
50,847,852
61,700,086
rs9855808
rs10936485
rs7711040
rs17414142
rs2598108
rs13236236
rs929288
rs2381413
rs3794233
rs1924181
rs2440359
rs1321353
1
1
18
19
23
1
3
16
16
2
18
3
19
18
1
1
3
18
24
0
0
21
1
23
4.0 x 10-5
7.6 x 10-5
7.6 x 10-5
4.0 x 10-5
8.8 x 10-5
7.6 x 10-5
4.9 x 10-5
3.1 x 10-5
3.1 x 10-5
6.6 x 10-5
7.6 x 10-5
8.8 x 10-5
10
3
8
5
5
6
16
4
12
12
1
18
2
13
3
9
14
10
2
11
10
4
.1849
1.000
.3833
.7266
.4240
.1153
.3269
.6875
1.000
.8318
.3750
Gene
RTN4
RTN4
CACNA2D2
Voltage-gated calcium
channel
CACNA2D2
CACNA2D2
CPNE4
UCC1
EMID2
Cell adhesion
Collagen precursor
DIP13B
b
Transmissions to female offspring only
Transmissions to male offspring only
d
Transmissions from female parents only
Transmissions from male parents only
e
Multiple SNPs in this region showed identical evidence for transmission distortion
c
46
Calcium dependent
membrane binding protein
48
Supplementary Figure 14 Length of LD spans. This figure shows a simple model for
the decay of linkage disequilibrium 32 in windows of 1 million bases distributed
throughout the genome. The results of model fitting are summarised by plotting the
fitted r2 value for SNPs separated by 30 kb. (The results for the CHB+JPT analysis
panel are in Fig. 15.)
Supplementary Figure 15 The joint distribution of relative local genetic diversity
and excess of rare alleles. The y-axis on this figure represents sequence diversity
measured as heterozygosity (derived from whole genome shotgun sequencing),
normalised by human-chimpanzee divergence. The x-axis is a relative measure of
allele frequency skew, calculated as the proportion of all SNPs with MAF < 0.20. In the
YRI panel, diversity around the HBB gene is highlighted by the red points. In the CEU
panel, diversity within the LCT gene region is highlighted.
Supplementary Figure 16 The distribution of derived allele frequencies across
the genome by functional class. The colours in this figure represent the genomic
annotation for each set of SNPs: exons (dark blue), conserved non-genic regions (red),
promoters (yellow), rest of genome (light blue). The bins represent allele frequency
bins. The y axis represents the fractions of all SNPs that are in each frequency bin.
Supplementary Figure 17 Total cumulative numbers of SNPs attempted in each
analysis panel.
Supplementary Figure 18 The distribution of minor allele frequencies, by analysis
panel.
Supplementary Figure 19 The distribution of inter-SNP distances, by centre and
chromosome.
49
REFERENCES
1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
11.
12.
13.
14.
15.
16.
17.
18.
19.
20.
21.
22.
23.
24.
25.
26.
27.
28.
29.
30.
31.
32.
Abecasis, G.R., Cherny, S.S., Cookson, W.O. & Cardon, L.R. GRR: graphical
representation of relationship errors. Bioinformatics 17, 742-743 (2001).
Abecasis, G.R., Cherny, S.S., Cookson, W.O. & Cardon, L.R. Merlinrapid analysis of
dense genetic maps using sparse gene flow trees. Nature Genet. 30, 97-101 (2002).
McVean, G.A. et al. The fine-scale structure of recombination rate variation in the human
genome. Science 304, 581-584 (2004).
Kong, A. et al. A high-resolution recombination map of the human genome. Nature
Genet. 31, 241-247 (2002).
Li, N. & Stephens, M. Modeling linkage disequilibrium and identifying recombination
hotspots using single-nucleotide polymorphism data. Genetics 165, 2213-2233 (2003).
Marchini, J., Cardon, L.R., Phillips, M.S. & Donnelly, P. The effects of human population
structure on large genetic association studies. Nature Genet. 36, 512-517 (2004).
Bersaglieri, T. et al. Genetic signatures of strong recent positive selection at the lactase
gene. Am. J. Hum. Genet. 74, 1111-1120 (2004).
Schaffner, S. et al. Calibrating a coalescent simulation of human genome sequence
variation. Genome Res. 15, 1576-1583 (2005).
Eaves, I.A. et al. Transmission ratio distortion at the INS-IGF2 VNTR. Nature Genet. 22,
324-325 (1999).
Hill, W.G. & Weir, B.S. Maximum-likelihood estimation of gene location by linkage
disequilibrium. Am. J. Hum. Genet. 54, 705-714 (1994).
51