The ENCODE Project: ENCyclopedia Of DNA Elements

Overview
Consortium Membership
Data Release Policy
Accessing ENCODE Data
Common Consortium Resources
Target Selection Process and Target Regions
Comparative Sequence Analysis
Coordination with HapMap
Meeting Reports
Request for Application (RFA)
Press Releases and Publications
Program Staff

Researchers Expand Efforts to Explore Functional Landscape of the Human Genome
ENCODE Overview
The National Human Genome Research Institute (NHGRI) launched a public research
consortium named ENCODE, the Encyclopedia Of DNA Elements, in September
2003, to carry out a project to identify all functional elements in the human genome
sequence. The project is being conducted in three phases: a pilot project phase, a
technology development phase and a planned production phase.
The pilot phase is testing and comparing existing methods to rigorously analyze a
defined portion of the human genome sequence. It is organized as an open
consortium (See: ENCODE Participants and Projects) and brings together
investigators with diverse backgrounds and expertise to evaluate the relative merits
of each of a diverse set of techniques, technologies and strategies. The concurrent
technology development phase of the project aims to develop new high throughput
methods to identify functional elements. The goal of the first two phases of the
ENCODE project is to identify a suite of approaches that will allow the comprehensive
identification of all the functional elements in the human genome. Through the
ENCODE pilot, NHGRI expects to assess the abilities of different approaches to be
scaled up for an effort to analyze the entire human genome and to find gaps in our
ability to identify functional elements in genomic sequence.
The ENCODE Pilot Project process involves close interactions between computational
and experimental scientists to evaluate a number of methods for annotating the
human genome. A set of regions (See: Target Selection Process and Target Regions)
representing approximately 1 percent (30 Mb) of the human genome has been
selected as the target for this pilot project and is currently being analyzed by all
ENCODE Consortium investigators. All data generated by ENCODE participants on
these regions will be rapidly released into public databases (See: Accessing ENCODE
Data). By initially concentrating on a limited portion of the human genome, the
NHGRI hopes that all of those who have experience and insight into the problem will
be willing to participate, whether or not their approaches are proprietary or have
already generated proprietary data. The ENCODE Consortium is open to all academic,
government and private sector scientists interested in participating in an open
process to facilitate the comprehensive interpretation of the human genome
sequence and who agree to the criteria for participation (See: Criteria for
Participation) for the project. The activities of the ENCODE Consortium will be
influential in helping to guide the planning for a complete public elucidation of
functional elements within the entire human genome.

Read about the ENCODE Project's Background


ENCODE Consortium Membership


The ENCODE Consortium is composed primarily of scientists who were funded under
RFAs released by NHGRI to initiate the pilot and technology development phases of
the ENCODE Project. Other participants have been identified and brought into the
Consortium as appropriate. The Consortium is open to any investigator willing to
abide by the criteria for participation established for the ENCODE Project by NHGRI.
The ENCODE External Consultants Panel oversees the activities of the Consortium
and provides advice and feedback on the Consortium's goals, progress and
membership.
Those interested in applying for membership to the ENCODE Consortium should
review the criteria for participation and contact Elise Feingold, Ph.D., or Peter Good,
Ph.D. (See: Program Staff).

Criteria for Participation


External Consultants Panel
ENCODE Participants and Projects


ENCODE Data Release Policy


The NHGRI is committed to the principle of rapid data release to the scientific
community. This principle was initially implemented during the Human Genome
Project and has been recognized as leading to one of the most effective ways of
promoting the use of the human genome sequence to advance scientific knowledge.

ENCODE Data Release Policy


ENCODE as a Data User


Accessing ENCODE Data


The data produced by ENCODE Consortium members are deposited to public
databases and are available for all to use without restriction. Data linked to the
genomic sequence are stored and visualized on the University of California, Santa Cruz
browser at ENCODE Project at UCSC [genome.ucsc.edu]. Other, non-sequence based
data, like that from microarray studies, are available on public databases such as the
Gene Expression Omnibus (GEO) [ncbi.nlm.nih.gov] and ArrayExpress [ebi.ac.uk].
The NHGRI Division of Intramural Research will be developing a "portal" that will
function as a single point of entry from which users can search and retrieve data
from the ENCODE Project. Data users should abide by the ENCODE Data Release
Policy when accessing data produced by ENCODE Consortium members.

ENCODE Common Consortium Resources


Common reagents and resources have been identified by the Consortium to aid in
the comparison of data produced by ENCODE participants using different platforms
and experimental approaches. Common resources for ENCODE include the pilot
project target sequences, BAC Clones for ENCODE targets, cell lines, and antibodies
to DNA-binding proteins.

ENCODE Common Consortium Resources


ENCODE Target Selection Process and Regions


For use in the ENCODE Pilot Project, defined regions of the human genome, corresponding to 30 Mb (roughly 1 percent of the total human genome), have been
selected. These regions serve as the foundation on which to test and evaluate the
effectiveness and efficiency of a diverse set of methods and technologies for finding
various functional elements in human DNA.

ENCODE Project Target Selection Process and Target Regions


ENCODE Comparative Sequence Analysis


A component of ENCODE data production involves the generation of sequencing
information from a number of different genomes in order to extract the maximum
amount of information about the human genome through comparative analyses.
Efforts are already underway at the NHGRI, University of British Columbia and the
NIH Intramural Sequencing Center to identify, map and sequence, respectively, BAC
clones for regions syntenic to the human ENCODE targets in additional
mammalian species. In addition to these ENCODE-directed efforts, sequence data
generated through whole genome sequencing projects will be used in comparative
analyses to help scientists better understand the human sequence. ENCODE
Participants intend to abide by the Fort Lauderdale recommendations on "Sharing
Data from Large-scale Biological Research Projects" when using unpublished
sequence data in Project analyses.

View the ENCODE regions in multiple species [ensembl.org]

Link to NISC ENCODE comparative sequencing site [nisc.nih.gov]


ENCODE as a Data User


ENCODE-HapMap Coordination
The International HapMap Project has decided to focus on 10 of the ENCODE random
regions for comprehensive genotyping as part of an in-depth study of human genetic
variation. The regions were chosen to represent a range of conservation with the
mouse genome and of gene density according to the strata identified during the
ENCODE target selection process.
The 10 HapMap-ENCODE regions were resequenced in 48 unrelated individuals (16
Yoruba, 16 CEPH, 8 Han Chinese, and 8 Japanese) using a PCR-based method.
30,000 single nucleotide polymorphisms (SNPs) were identified in the HapMap-ENCODE regions. Some of these were already represented in dbSNP, a database of
SNP data that is managed by the National Center for Biotechnology Information
(NCBI), while others were discovered during the resequencing. The newly-discovered
SNPs were added to dbSNP and the sequence data from the 48 individuals are stored
in NCBI's Trace Archive.
Of the 30,000 SNPs identified in the HapMap-ENCODE regions, 10,000 were not
analyzed because of failed design or failed genotyping. Genotype data were obtained
from the remaining 20,000 SNPs in the HapMap-ENCODE regions of all 270 samples
used for the HapMap Project (90 CEPH, 90 Yoruba, 45 Han Chinese, and 45
Japanese). This genotyping was done at the Broad Institute of Harvard and MIT,
Illumina, Baylor College of Medicine, McGill University & Genome Quebec Innovation
Centre, and the University of California, San Francisco.
The ENCODE-HapMap genotyping data set is considered to be a "gold standard" data
set because of the high density of SNP coverage. The genotype data from these
regions will be used to determine the best way to choose tag SNPs and to assess the
adequacy of the entire HapMap for many analyses, such as coverage, linkage
disequilibrium (LD) measures, and haplotype inference.
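As a minimal illustration of the LD measures mentioned above, the sketch below computes D, D' and r² for a pair of biallelic SNPs from phased haplotypes. The function name `pairwise_ld` and the toy haplotypes are hypothetical; this is not HapMap data or HapMap tooling.

```python
def pairwise_ld(haplotypes):
    """Return (D, D_prime, r2) for two biallelic SNPs.

    haplotypes: list of 2-character strings such as "AG", one per phased
    chromosome. Character 0 is the allele at SNP 1, character 1 at SNP 2.
    The alleles on the first haplotype are used as the reference alleles.
    """
    n = len(haplotypes)
    a_ref, b_ref = haplotypes[0][0], haplotypes[0][1]

    # Allele and haplotype frequencies.
    p_a = sum(h[0] == a_ref for h in haplotypes) / n
    p_b = sum(h[1] == b_ref for h in haplotypes) / n
    p_ab = sum(h == a_ref + b_ref for h in haplotypes) / n

    # D is the deviation of the haplotype frequency from independence.
    d = p_ab - p_a * p_b

    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic in the sample")
    r2 = d * d / denom

    # D' rescales |D| by its maximum possible value given the frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max
    return d, d_prime, r2
```

Two SNPs with r² close to 1 carry nearly the same information, which is why one of them can serve as a tag for the other.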
For more information on coordination between the HapMap and ENCODE Projects,
please visit http://www.hapmap.org/downloads/encode1.html.en.

ENCODE Meeting Reports

ENCODE Pilot Project Launch Meeting March 7, 2003

July 23-24, 2002: Workshop on the Comprehensive Extraction of Biological
Information From Genomic Sequence
A workshop to discuss a proposal to create a highly interactive public research
consortium to carry out a pilot project for testing and comparing existing and new
methods to identify functional sequences in DNA. The workshop participants
resoundingly supported the concept of a pilot project and made a number of
recommendations about the project's goals, organization and implementation.


ENCODE Project Request For Application (RFA)


New ENCODE RFAs

RFA-HG-07-010 [grants1.nih.gov]: A Data Analysis Center for the
Encyclopedia of DNA Elements (ENCODE) Project (U01)
Application Receipt Date(s): September 06, 2007

Past ENCODE RFAs


The pilot and technology development phases of the ENCODE project were initiated
simultaneously in 2003 when NHGRI released Requests For Application (RFAs) for
each of these phases. The first RFA for the pilot phase, RFA HG-03-003, entitled
Determination of all functional elements in human DNA, solicited applications from
those interested in participating in a research network to conduct a pilot project to
test and compare existing methods for identifying all of the functional elements in a
limited (~1%) region of the human genome. The second RFA, RFA HG-03-004,
entitled Technologies to find functional elements in DNA, solicited applications to
develop new and improved technologies for the efficient, comprehensive, high-throughput identification and verification of all types of sequence-based functional
elements, particularly those other than coding sequences, for which adequate
methods do not currently exist.
NHGRI re-released the technology development RFA in 2004 and 2006. RFA HG-04-001, issued in 2004, solicited additional applications with an added emphasis on
high-risk, high-payoff projects and on technologies that might be applied to model
organism genomes. RFAs HG-07-028 and HG-07-029, issued in 2006, had an added
emphasis on methods to identify functional elements in repetitive sequences and on
methods that can be used to validate the identity of functional elements using
methods independent of the primary mode of detection.
As the initial phase of the ENCODE Project will be completed in September 2007,
NHGRI issued RFAs in November 2006 to solicit applications for research projects to
continue the ENCODE-based analysis of the human genome at both pilot and whole-genome
scales. RFA HG-07-030, entitled Creating the Encyclopedia of DNA Elements
(ENCODE) in the Human Genome (U01 and U54), solicited applications for research
projects to identify functional elements in the entire human genome sequence (for
whole-genome scale projects) or in the 1% of the genome targeted during the
ENCODE pilot phase (for pilot-scale projects). RFA HG-07-031, entitled A Data
Coordination Center for the Encyclopedia of DNA Elements (ENCODE) Project (U41)
solicited applications to develop, house, and maintain databases to track, store, and
provide access to the different types of data generated as part of the ENCODE
Project.
In 2006, NHGRI released RFAs to begin an ENCODE-like project in the model
organisms Caenorhabditis elegans and Drosophila melanogaster. This effort, called
modENCODE, was initiated through the funding of grants submitted in response to
RFA HG-06-006, entitled Identification of all functional elements in selected model
organism genomes and RFA HG-06-007, entitled A Data Coordination Center for the
model organism ENCODE Project (modENCODE). The modENCODE Project will exploit
the experimental advantages of working with the genomes of these two well-studied
model organisms both to identify sequence-based functional elements and to
promote an understanding of the functional elements on the basis of experiments
that might not be possible to do for those working with the human genome.

RFA HG-03-003 [grants.nih.gov]: Determination of All Functional Elements in
Human DNA (Expired)
RFA HG-03-004 [grants.nih.gov]: Technologies to Find Functional Elements in
Genomic DNA (Expired)
RFA-HG-04-001 [grants1.nih.gov]: Technologies to Find Functional Elements
in Genomic DNA. (Expired)
RFA-HG-07-028 [grants.nih.gov]: Technology Development for the
Comprehensive Determination of Functional Elements in Eukaryotic Genomes
(R21) (Expired)
RFA HG-07-029 [grants.nih.gov]: Technology Development for the
Comprehensive Determination of Functional Elements in Eukaryotic Genomes
(R01) (Expired)
RFA-HG-07-030 [grants.nih.gov]: Creating the Encyclopedia of DNA Elements
(ENCODE) in the Human Genome (U01 and U54) (Expired)
o NOT-07-007: Clarification and Additional Information to HG-07-030 and HG-07-031
o Slides from Applicant Information Meeting - HG-07-030
RFA-HG-07-031 [grants.nih.gov]: A Data Coordination Center for the
Encyclopedia of DNA Elements (ENCODE) Project (U41) (Expired)
o NOT-07-007: Clarification and Additional Information to HG-07-030 and HG-07-031
o Slides from Applicant Information Meeting - HG-07-031
RFA-HG-06-006 [grants.nih.gov]: Identification of All Functional Elements in
Selected Model Organism Genomes (Expired)
RFA HG-06-007 [grants.nih.gov]: A Data Coordination Center for the Model
Organism ENCODE Project (modENCODE) (Expired)


ENCODE Press Releases and Publications


ENCODE Press Releases

Researchers Expand Efforts to Explore Functional Landscape of the Human Genome
October 9, 2007
New Findings Challenge Established Views on Human Genome
June 13, 2007
ENCODE Consortium Publishes Scientific Strategy
October 21, 2004
Beyond Genes: Scientists Venture Deeper Into the Human Genome: ENCODE
Project Seeks to Identify All Functional Elements in Human DNA
October 9, 2003
Launch of Pilot Project to Identify All Functional Elements in Human DNA
March 4, 2003

ENCODE Publications

Identification and analysis of functional elements in 1% of the human genome
by the ENCODE pilot project
Nature, June 13, 2007
ENCODE Web Focus
Related articles on ENCODE from Nature, June 2007
Special Issue on ENCODE from Genome Research
June 2007
The ENCODE (ENCyclopedia Of DNA Elements) Project. [sciencemag.org] (Full
Text)
Science, Vol. 306, Issue 5696, 636-640, 22 October 2004.


ENCODE Program Staff


Program Directors
Elise Feingold, Ph.D.
E-mail: feingole@exchange.nih.gov
Peter Good, Ph.D.
E-mail: goodp@mail.nih.gov

Program Analysts
Laura Liefer
E-mail: lliefer@mail.nih.gov
Kris Wetterstrand, MS
E-mail: wettersk@mail.nih.gov
Address
National Human Genome Research Institute
National Institutes of Health
5635 Fishers Lane
Suite 4076, MSC 9305
Bethesda, MD 20892-9305
Phone:(301) 496-7531
Fax:(301) 480-2770

The HapMap ENCODE resequencing and genotyping project aims to produce a dense set of genotypes across ...
genomic regions. Ten 500-kilobase regions of the genome were resequenced in 48 unrelated DNA samples (16
Yoruba, 8 Japanese, 8 Han Chinese, and 16 CEPH). All SNPs identified, along with SNPs in dbSNP, were genotyped in
the 269 HapMap DNA samples (90 Yoruba, 44 Japanese, 45 Han Chinese, and 90 CEPH). The new SNPs
were deposited in dbSNP and all genotype data were sent to the HapMap Data Coordination Center. In addition,
Perlegen will genotype all SNPs in the remaining 34 ENCODE regions in all of the HapMap DNA samples as part of its
genotyping for the HapMap Project. This study will provide dense genotype data to allow the development and
assessment of methods of analysis. A second plate of samples was collected from each population in order to allow
studies of how general the results are from the first plate of samples. Of the 16 Yoruba samples resequenced, 8 are
on the first plate and 8 are on the second plate; of the 8 Han Chinese samples resequenced, 7 are on the first plate
and 1 is on the second plate; the 8 Japanese samples and 16 CEPH samples that were resequenced are on the first
plates. A complete list of the sample ID's can be found here.

ENCODE Regions SNP Information

Region name     Chromosome   Genomic interval               Genotype SNPs                 Genot...
                band         (NCBI B36)                     CEU       JPT+CHB   YRI
ENr112          2p16.3       Chr2:51512208..52012208        2,601     2,573     2,608     McGill-G..., Perlege...
ENr131          2q37.1       Chr2:234156563..234656627      2,214     2,107     2,129     McGill-G..., Perlege...
ENr113          4q26         Chr4:118466103..118966103      2,538     2,401     2,405     Broad, P...
ENm010          7p15.2       Chr7:26924045..27424045        1,830     1,787     1,742     UCSF-W..., Perlege...
ENm013(500Kb)   7q21.13      Chr7:89621624..90121624        1,770     1,678     1,680     Broad, P...
ENm014(500Kb)   7q31.33      chr7:126368183..126865324      3,343     3,239     3,232     Broad, P...
ENr321          8q24.11      Chr8:118882220..119382220      2,12...   2,100     2,09...   Illumina
ENr232          9q34.11      Chr9:130725122..131225122      1,909     1,828     1,808     Illumina
ENr123          12q12        Chr12:38626477..39126476       2,189     2,181     2,035     BCM, P...
ENr213          18q12.1      Chr18:23719231..24219231       1,990     1,969     1,966     Illumina
Total                                                       22,512    21,863    21,697

Population descriptors:
YRI: Yoruba in Ibadan, Nigeria
JPT+CHB: Japanese in Tokyo, Japan + Han Chinese in Beijing, China (combined on one p...
CEU: CEPH (Utah residents with ancestry from northern and western Europe)
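
The ENr321 row is cut off in this copy of the table: two of its three SNP counts read only "2,12" and "2,09". If the Total row is assumed to be the plain column sum of the ten regions, the cut-off values can be cross-checked by subtraction. The sketch below does only that arithmetic. The column order (CEU, JPT+CHB, YRI) is an assumption read off the damaged header, and the function name is hypothetical.

```python
# SNP counts for the nine rows that are complete in the table.
# Assumed column order: (CEU, JPT+CHB, YRI).
complete_rows = {
    "ENr112": (2601, 2573, 2608),
    "ENr131": (2214, 2107, 2129),
    "ENr113": (2538, 2401, 2405),
    "ENm010": (1830, 1787, 1742),
    "ENm013": (1770, 1678, 1680),
    "ENm014": (3343, 3239, 3232),
    "ENr232": (1909, 1828, 1808),
    "ENr123": (2189, 2181, 2035),
    "ENr213": (1990, 1969, 1966),
}

# The Total row as printed.
totals = (22512, 21863, 21697)


def implied_row(rows, column_totals):
    """Subtract the known rows from the column totals, column by column."""
    column_sums = [sum(column) for column in zip(*rows.values())]
    return tuple(t - s for t, s in zip(column_totals, column_sums))


enr321 = implied_row(complete_rows, totals)
print(enr321)  # (2128, 2100, 2092)
```

The middle value, 2,100, matches the one ENr321 count that survives intact, which supports the assumption that the totals are simple sums. The other two results, 2,128 and 2,092, are consistent with the truncated "2,12" and "2,09" but remain inferences rather than values read from the source.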

Generated Fri Apr 13 13:44:05 EDT 2007

HapMap ENCODE Resequencing Project


Groups

David Altshuler and Stacey Gabriel, Broad Institute of Harvard and MIT
Richard Gibbs and George Weinstock, Baylor College of Medicine

ENCODE Regions

Each group resequenced five 500kb regions.


These regions were chosen by the Analysis Group from among the
ENCODE regions; they include a range of chromosomes, recombination
rates, gene density, and values of non-transcribed conservation with
mouse. For more information on the ENCODE Project see
http://www.genome.gov/10005107.
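
For readers who want to work with the genomic intervals given for these regions, here is a small sketch that parses the `ChrN:start..end` notation and reports the span. The helper names `parse_interval` and `span_bp` are hypothetical, and the source does not say whether the coordinates are inclusive, so the span is reported simply as end minus start.

```python
import re

# Matches strings such as "Chr2:51512208..52012208" (case-insensitive "chr").
_INTERVAL = re.compile(r"^chr(\w+):(\d+)\.\.(\d+)$", re.IGNORECASE)


def parse_interval(text):
    """Split 'ChrN:start..end' into (chromosome, start, end)."""
    match = _INTERVAL.match(text.strip())
    if match is None:
        raise ValueError(f"not a recognised interval: {text!r}")
    chrom = match.group(1)
    start, end = int(match.group(2)), int(match.group(3))
    if end < start:
        raise ValueError(f"end precedes start in {text!r}")
    return chrom, start, end


def span_bp(text):
    """Return end - start for an interval string."""
    _, start, end = parse_interval(text)
    return end - start
```

For example, `span_bp("Chr2:51512208..52012208")` gives 500000, in line with the 500kb region size stated above.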

Samples

Resequencing was done for 16 CEPH, 16 Yoruba, 8 Japanese, and 8 Han
Chinese samples. Please click here to view the Coriell Catalog ID for each
DNA sample.
The samples are currently available and may be ordered from the Coriell
Institute.

Strategy

PCR-based sequencing across the regions for each sample.

Data Release

All SNPs found were deposited in dbSNP
(http://www.ncbi.nlm.nih.gov/projects/SNP/).

HapMap ENCODE Genotyping Project


Groups

David Altshuler and Stacey Gabriel, Broad Institute of Harvard and MIT
Mark Chee, Illumina
Richard Gibbs and John Belmont, Baylor College of Medicine
Thomas Hudson, McGill University & Génome Québec Innovation Centre
Pui Kwok, UCSF

ENCODE Regions

The regions are the same ten regions being resequenced.


Each group genotyped all known SNPs (with rs# in dbSNP) and newly
discovered SNPs in the 500kb ENCODE regions in their assigned
chromosomes.

Samples

The samples genotyped are the same 270 (plus the 5 duplicates for each
plate) used for the HapMap Project:
o 90 CEPH samples: including the 16 that were resequenced.
o 90 Yoruba samples: including the 8 that were resequenced.
o 44 Japanese samples: including the 8 that were resequenced.
o 45 Han Chinese samples: including the 7 that were resequenced.
All of the samples listed above may be ordered from the Coriell Institute.
The 8 Yoruba samples and 1 Chinese sample that were not genotyped on
the first plates, are included on the second plates.

Strategy

Initially the SNPs currently in dbSNP build 121 were genotyped.


All SNPs found from the resequencing project and other sources were
also genotyped.

Data Release

The genotype data were sent to the DCC and distributed in the same way
as the other HapMap genotype data.

Perlegen Genotyping Component


Groups

Perlegen

ENCODE regions

Initially, the ten 500kb regions.


Perlegen will genotype all SNPs in the remaining 34 ENCODE regions.

Samples

90 HapMap CEPH samples (plus 5 duplicates) in the ten 500kb regions.


All samples (90 Yoruba, 45 Japanese, 45 Han Chinese, and 90 CEPH
samples) in the remaining 34 ENCODE regions as part of its genotyping
for the HapMap Project.

Strategy

In the CEPH samples, Perlegen genotyped the ten 500kb regions for all
the SNPs in dbSNP and for the SNPs it had.
Perlegen will genotype all SNPs in the remaining 34 ENCODE regions in
all 270 samples.

Data Release

Perlegen sent its SNPs in the ten 500kb regions to dbSNP (as new SNPs
or as validation of ones in dbSNP).
The genotype data were sent to the DCC and distributed in the same way
as the other HapMap genotype data.
The data for the remaining 34 ENCODE regions will be sent to the DCC
when they become available.

ENCODE - Project Information


Following the completion of the Human genome sequence, the next major task is to
understand the information contained therein. Although considerable progress has
been made in identifying the genes that code for functional proteins (For instance
see Collins et al 2003, 2004), identifying elements in DNA sequence which control
gene expression and DNA replication at a genome wide level is far from trivial.
Therefore NHGRI has established a pilot project (ENCODE) to explore computational
and experimental methods to develop an encyclopedia of DNA elements in the
human genome. Initially the pilot project has funded a collection of different groups
who will target 1% of the genome chosen according to the criteria outlined in the
ENCODE RFA.
The Sanger Institute has two groups involved in the ENCODE project.

Detecting Human Functional Sequences with Microarrays

Ian Dunham PI
David Vetrie Co-PI
Nigel Carter Co-PI

We were inspired by recent work in our laboratories using microarrays to study DNA
copy number (Fiegler et al 2003), replication timing and chromatin modifications in a
variety of genomic situations from 400bp resolution in a ~200 kb pilot region,
through ~75 kb resolution across the q arm of chromosome 22, to 1Mb resolution
across the human genome. We aim to contribute microarray-based approaches to
the ENCODE consortium to provide experimental evidence of DNA elements involved
in gene regulation and replication, as well as the status of chromatin, across the pilot
1% of the genome. Specifically we are:
1. Developing two sets of genomic microarrays covering the 1% of the genome
targeted in the ENCODE project. The first is a low resolution genomic clone
(predominantly BACs, but also PACs, cosmids, fosmids) based microarray
using the clones from the genomic sequence tile path. The second is an array
of 22,000 1.25kb PCR fragments designed from the DNA sequence covering
~85% of the targeted regions - viewable here.
2. Using these microarrays to assay DNA samples enriched for sequences
involved in specific biological processes and functions by methods including
flow-sorting, pulse-labeling and chromatin immunoprecipitation (ChIP) so as
to develop high resolution maps, at genomic clone and 1.25kb resolution, of
the following:
o Replication timing,
o Replication origins,
o DNA methylation,
o Modified histones/active and inactive chromatin,
o Transcription factor binding sites.

We will correlate these maps with genomic DNA features including C+G content,
genes/exons, repeat elements, and SNP density. In addition we will correlate the
elements we map with regions of conserved DNA sequence identified by comparative
sequencing across multiple species being undertaken in the laboratory of Eric Green
and maps of transcriptional activity as part of the consortium.
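
One of the features named above, C+G content, is simple to compute per window. The sketch below assumes a plain DNA string and fixed, non-overlapping windows. The 1,250 bp default only echoes the 1.25kb fragment size mentioned above, and the function name is hypothetical; it is not a description of the group's own pipeline.

```python
def gc_content_windows(sequence, window=1250):
    """Return the C+G fraction of each non-overlapping window.

    Bases other than A, C, G and T (for example N) are ignored. A window
    with no called bases yields None instead of a fraction.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    seq = sequence.upper()
    fractions = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        called = sum(chunk.count(base) for base in "ACGT")
        if called == 0:
            fractions.append(None)
        else:
            gc = chunk.count("G") + chunk.count("C")
            fractions.append(gc / called)
    return fractions
```

The resulting per-window values can then be lined up against any other per-window measurement, such as an array signal, for correlation.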

Refs
Reevaluating human gene annotation: a second-generation analysis of chromosome 22.
Collins JE, Goward ME, Cole CG, Smink LJ, Huckle EJ, Knowles S, Bye JM, Beare DM, Dunham I
Genome Res. 2003;13;27-36. PMID: 12529303
A genome annotation-driven approach to cloning the human ORFeome.
Collins JE, Wright CL, Edwards CA, Davis MP, Grinham JA, Cole CG, Goward ME, Aguado B,
Mallya M, Mokrab Y, Huckle EJ, Beare DM, Dunham I
Genome Biol. 2004;5;R84. PMID: 15461802
DNA microarrays for comparative genomic hybridization based on DOP-PCR amplification
of BAC and PAC clones.
Fiegler H, Carr P, Douglas EJ, Burford DC, Hunt S, Scott CE, Smith J, Vetrie D, Gorman P,
Tomlinson IP, Carter NP
Genes Chromosomes Cancer. 2003;36;361-74. PMID: 12619160

Identification of functionally variable regulatory regions in the human genome

Manolis Dermitzakis PI
Panos Deloukas Co-PI
Stylianos E. Antonarakis, University of Geneva Co-PI
Andrew G. Clark, Cornell University Co-PI

One of the main reasons to annotate the human genome is to interpret the phenotypic
consequences of genetic variation within functional genomic regions. We are using a
novel approach for the selective
identification of functionally variable regulatory sequences of the human genome. We
are detecting correlations between variation in gene expression and nucleotide
polymorphisms near those genes to identify regulatory regions and their variants
that contribute to gene expression variation. This approach uses naturally occurring
genomic variation (nucleotide polymorphism) and phenotypic variation (transcript
levels) to detect significant associations (Figure 1). Polymorphisms associated with
phenotypic variation will likely be in linkage disequilibrium with functional regulatory
polymorphisms nearby, thereby identifying segments of the genome containing
sequences that regulate gene expression.
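
The association described above can be sketched as a regression of transcript level on genotype. This is a minimal illustration only: it assumes genotypes coded as allele counts (0, 1 or 2) and an ordinary least-squares fit, the function name is hypothetical, and the statistical method the project actually used is not described here.

```python
from statistics import mean


def genotype_expression_fit(genotypes, expression):
    """Fit expression = intercept + slope * genotype by least squares.

    genotypes: allele counts (0, 1 or 2), one per individual.
    expression: transcript levels for the same individuals, same order.
    Returns (slope, intercept, r_squared).
    """
    if len(genotypes) != len(expression) or len(genotypes) < 3:
        raise ValueError("need matched lists with at least 3 individuals")

    g_mean, e_mean = mean(genotypes), mean(expression)
    sxx = sum((g - g_mean) ** 2 for g in genotypes)
    syy = sum((e - e_mean) ** 2 for e in expression)
    if sxx == 0 or syy == 0:
        raise ValueError("genotype and expression must both vary")
    sxy = sum((g - g_mean) * (e - e_mean)
              for g, e in zip(genotypes, expression))

    slope = sxy / sxx
    intercept = e_mean - slope * g_mean
    r_squared = sxy * sxy / (sxx * syy)
    return slope, intercept, r_squared
```

A real analysis would add a significance test and a correction for the number of SNP and gene pairs examined; neither is shown here.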

Our experimental design is to use the Illumina technology to screen for gene
expression variation as well as to genotype relevant SNPs for the association
analysis. We have designed an Illumina bead array that contains approximately 350
genes from the ENCODE regions, all the human chromosome 21 genes and 100 genes from
a 10 Mb genomic region of human chromosome 20. An example of a hybridized array is
shown in Figure 2. The technology is highly sensitive and accurate. In Figure 3a we
show the
regression of two replicates from the same RNA pool and in Figure 3b the regression
of two different individuals. Note the wider spread of Figure 3b as a result of
differences in transcript levels between the two individuals.
We view this project as readily scalable to a whole human genome screen for gene
expression variation and association with nucleotide polymorphism.
It will provide 4 different types of information:
1. Genomic regions that contain variable regulatory polymorphisms;
2. Structure of regulatory variation in the human genome and determination of how it is
associated with disease susceptibility;
3. Large dataset of genes that exhibit variation of expression within populations, in a
manner similar to the way the HapMap project will provide the haplotype structure of the
human genome.

webmaster@sanger.ac.uk
Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK Tel:+44 (0)1223 834244
Last Modified Wed Apr 4 14:51:33 2007
Registered charity number 210183

International HapMap Project


The International HapMap Project is a partnership of scientists and funding
agencies from Canada, China, Japan, Nigeria, the United Kingdom and the
United States to develop a public resource that will help researchers find genes
associated with human disease and response to pharmaceuticals. See "About the
International HapMap Project" for more information.
News

2007-08-13: Phased haplotypes in NCBI b36 coordinates
Phased haplotypes for release #22 have begun to be released for bulk download.
Data will be made available as it is processed and revised.

2007-07-17: HapMap Tutorial: Working with the HapMap Website
ASHG Annual Meeting, San Diego, California, USA
October 25th, 2007 at 18:30 PST
San Diego Marriott Hotel and Marina, Rancho Las
Palmas, 4th Level, South Tower
The HapMap Data Coordination Center is pleased to
present a one-hour tutorial during the 2007 ASHG
Annual Meeting.
The tutorial will provide an overview of the International
HapMap Project, a comprehensive tour of the HapMap
website, a live demo of new tools and resources, and
Q&A session.
Registration is limited to the first 40 people. You must
register for this tutorial by October 14th, 2005, in order
to participate. There is no additional fee, but
participants must first be registered for the ASHG
Meeting.
To register or make inquiries, please e-mail help@hapmap.org

2006-06-14: Wellcome Trust Course: Working with the HapMap
Wellcome Trust Genome Campus, Hinxton, Cambridge,
UK
November 16-19th, 2007
The Wellcome Trust Course: Working with the HapMap
will be held on November 16-19th, 2007 at the
Wellcome Trust Genome Campus, Hinxton, Cambridge,
UK. The deadline for application is 10th August 2007.
Further information can be found at:
http://www.wellcome.ac.uk/doc_WTX038039.html with
details of how to apply.

2007-06-04: Predicted OMIM associations available in GBrowse
The OMIM associations track presents data from the
MutaGeneSys database, which links genotype data from
HapMap and whole genome association studies with the
known disease variants reported by the OMIM database.

Example of a region with multiple OMIM associations: Chr1:194923128..194933127

2007-05-29: Newly phased haplotypes available for non-par segment of ChrX
Genotyping data for phase I+II (rel #21) was rephased
for the non-pseudoautosomal (non-par) region of
chromosome X. Data is currently available for bulk
download.


Developing a Haplotype Map of the Human Genome for Finding Genes Related to
Health and Disease
Washington, D.C.
July 18-19, 2001

Introduction
So far about 2.4 million DNA sequence variants (single nucleotide polymorphisms or
SNPs) have been discovered in the human genome, and millions more exist. These
variants will be most useful for discovering genes related to health and disease if
their organization along chromosomes, the haplotype structure, is known.
Technology is just reaching the point that haplotype maps of blocks of SNPs along
chromosomes can be developed.
On July 18-19, 2001, the National Institutes of Health (NIH) held a meeting in
Washington, D.C., to discuss how haplotype maps could be used for finding genes
contributing to disease; the methods for constructing such maps; the data about
haplotype structure in populations; the types of populations and samples that might
be considered for a map; the ethical issues, including those related to studying
genetic variation in identified populations; and how such a project could be
organized. The goal was to resolve some issues and to set up procedures for
resolving others.
There were 165 attendees, including human geneticists, population geneticists,
anthropologists, pharmaceutical and biotech industry scientists, social scientists,
ethicists, representatives from various communities and disease groups,
administrators from many NIH institutes and international funding agencies and
journalists.

Background: Genetic Variation and Its Use for Mapping Genes Contributing to
Disease
Recently technology has become available to study the extent and pattern of human
genetic variation on a large scale, and to use this variation to find the genes that
contribute to disease. The information summarized here was not presented at the
meeting but provides the background for understanding the importance and use of a
haplotype map.
Rationale for finding genes contributing to disease
The goal of much genetic research is to find genes that contribute to disease. Finding
these genes should allow an understanding of the disease process, so that methods
for preventing and treating the disease can be developed. For diseases with a
relatively straightforward genetic basis, the single-gene disorders, current methods
are usually sufficient to find the genes involved. Most people, however, do not have
single-gene disorders, but develop common diseases such as heart disease, stroke,
diabetes, cancers or psychiatric disorders, which are affected by many genes and
environmental factors. The genetic contribution to these diseases is not clear, but
many researchers consider common variants to be important; this is the
Common-Disease/Common-Variant theory.
Definition of a Single Nucleotide Polymorphism, SNP
A SNP is a site in the DNA where different chromosomes differ in the base they have.
For example, 30 percent of the chromosomes may have an A, and 70 percent may
have a G. These two forms, A and G, are called variants or alleles of that SNP. An
individual may have a genotype for that SNP that is AA, AG, or GG.
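As a small illustration of the definition above, allele frequencies can be computed from genotype counts. The sketch below is in Python; the function name and the genotype counts are hypothetical, chosen only so that the result matches the 30 percent A and 70 percent G example.

```python
def allele_frequencies(genotype_counts):
    """Return {allele: frequency} from counts of two-letter genotypes.

    Each individual carries two chromosomes, so each genotype
    contributes two alleles to the total.
    """
    allele_counts = {}
    for genotype, n_individuals in genotype_counts.items():
        for allele in genotype:
            allele_counts[allele] = allele_counts.get(allele, 0) + n_individuals
    total = sum(allele_counts.values())
    return {allele: count / total for allele, count in allele_counts.items()}


# Hypothetical sample of 100 individuals (200 chromosomes).
counts = {"AA": 9, "AG": 42, "GG": 49}
freqs = allele_frequencies(counts)  # A is on 60 of 200 chromosomes, G on 140
```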
Number of SNPs
When chromosomes from two random people are compared, they differ at about one
in 1000 DNA sites. Thus when two random haploid genomes are compared, or all the
paired chromosomes of one person are compared, there are about three million
differences. When more people are considered, they will differ at additional sites. The
number of DNA sites that are variable (SNPs) in humans is unknown, but there are
probably between 10 and 30 million SNPs, about one every 100 to 300 bases. Of
these SNPs, perhaps four million are common SNPs, with both alleles of each SNP
having a frequency above 20 percent.
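The arithmetic behind these figures can be checked in a few lines. This is a rough sketch that assumes a haploid genome of about three billion bases, a figure the text does not state explicitly.

```python
# Assumed haploid genome size in bases (not stated in the text above).
GENOME_SIZE_BP = 3_000_000_000

# Two random haploid genomes differ at about one site in 1000.
pairwise_differences = GENOME_SIZE_BP // 1000

# Average spacing implied by 10 to 30 million SNPs across the genome.
spacing_if_10_million = GENOME_SIZE_BP // 10_000_000  # bases per SNP
spacing_if_30_million = GENOME_SIZE_BP // 30_000_000  # bases per SNP
```

Under that assumption the numbers agree with the text: about three million pairwise differences, and one SNP every 100 to 300 bases.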
How SNPs are used to find genes contributing to disease
Some SNP alleles are the actual functional variants that contribute to the risk of
getting a disease. Individuals with such a SNP allele have a higher risk for that
disease than do individuals without that SNP allele. Most SNPs are not these
functional variants, but are useful as markers for finding them. To find the regions
with genes that contribute to a disease, the frequencies of many SNP alleles are
compared in individuals with and without the disease. When a particular region has
SNP alleles that are more frequent in individuals with the disease than in individuals
without the disease, those SNPs and their alleles are associated with the disease.
These associations between a SNP and a disease indicate that there may be genes in
that region that contribute to the disease.
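The comparison of allele frequencies described above is often made with a test on a two-by-two table of allele counts. The following is a minimal sketch, not the method of any particular study: it computes a Pearson chi-square statistic (one degree of freedom) for one SNP, and the counts are invented for illustration.

```python
def allele_chi_square(case_a, case_b, control_a, control_b):
    """Pearson chi-square statistic for a 2x2 table of allele counts.

    Rows are individuals with and without the disease; columns are
    the two alleles of one SNP.
    """
    n = case_a + case_b + control_a + control_b
    row_cases = case_a + case_b
    row_controls = control_a + control_b
    col_a = case_a + control_a
    col_b = case_b + control_b
    denominator = row_cases * row_controls * col_a * col_b
    if denominator == 0:
        raise ValueError("every row and column must contain at least one allele")
    return n * (case_a * control_b - case_b * control_a) ** 2 / denominator


# Hypothetical counts: allele A on 120 of 200 case chromosomes
# and on 90 of 200 control chromosomes.
statistic = allele_chi_square(120, 80, 90, 110)
```

A statistic above about 3.84 corresponds to p < 0.05 for a single test. A study that examines many SNPs would need a much stricter threshold to allow for multiple comparisons.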
The use of haplotypes

A haplotype is the set of SNP alleles along a region
of a chromosome. Theoretically there could be
many haplotypes in a chromosome region, but
recent studies are typically finding only a few
common haplotypes. Consider the example below,
of a region where six SNPs have been studied; the
DNA bases that are the same in all individuals are
not shown. The three common haplotypes are
shown, along with their frequencies in the
population. The first SNP has alleles A and G; the
second SNP has alleles C and T. The four possible haplotypes for these two SNPs are
AC, AT, GC, and GT. However, only AC and GT are common; these SNPs are said to
be highly associated with each other.
The cost of genotyping is currently too high for whole-genome association studies
that would look at millions of SNPs across the entire genome to see which SNPs are
associated with disease. If a region has only a few haplotypes, then only a few SNPs
need to be typed to determine which haplotype a chromosome has and whether the
region is associated with a disease. In the example below, typing two SNPs is all that
is needed to distinguish among the three common haplotypes. The two SNPs
indicated by arrows are one pair of several possible pairs of SNPs that distinguish
among the three common haplotypes.
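The idea that a few SNPs can identify a haplotype can be sketched as a small search. The original figure is not reproduced here, so the three six-SNP haplotypes below are hypothetical; they only follow the description that the first SNP has alleles A and G, the second has C and T, and AC and GT are the common combinations.

```python
from itertools import combinations


def distinguishing_snp_sets(haplotypes, set_size):
    """Return every set of SNP positions (0-based) of the given size
    whose alleles are enough to tell all the haplotypes apart."""
    n_snps = len(haplotypes[0])
    result = []
    for positions in combinations(range(n_snps), set_size):
        signatures = {tuple(h[p] for p in positions) for h in haplotypes}
        if len(signatures) == len(haplotypes):
            result.append(positions)
    return result


# Hypothetical common haplotypes over six SNPs.
common_haplotypes = ["ACGTAC", "GTGCAT", "GTATCC"]

singles = distinguishing_snp_sets(common_haplotypes, 1)
pairs = distinguishing_snp_sets(common_haplotypes, 2)
```

With these invented haplotypes no single SNP separates all three, because each SNP has only two alleles, while several different pairs do. The pair made of the first and second SNPs fails because those two SNPs are completely associated and carry the same information.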
Most SNP variation is within all groups
For most SNPs, any population has individuals of all possible genotypes for a SNP,
but populations differ in the frequencies of individuals with each of the different
genotypes. About 85 percent of human SNP variation is within all populations, and
about 15 percent is between populations, as shown in the figure below. Thus two
random individuals within a village are almost as different in their SNP alleles as any
two random individuals from anywhere in the world. Although a small proportion of
SNPs have alleles that are common in some groups but rare in others, most SNP
alleles that are common in one group will be common in other groups. Under the
Common-Disease/Common-Variant theory, common variants that contribute to a
disease in one group will also contribute to the disease in other groups, although the
amount of the contribution may vary.

The Meeting
At the meeting there was discussion of recent data related to haplotype maps, and
what a haplotype map might look like. Since haplotypes and associations of SNPs
with disease are population phenomena, some of the discussion focused on
complexities in sampling human populations. Much of the discussion concerned the
information that could be gained by identifying the populations contributing samples
for a haplotype map, and the risks and benefits to populations of such identification.
The meeting ended with discussion of various aspects of a haplotype map project,
including what issues would need more discussion in working groups after the
meeting. The main points of the discussions are summarized here.

The Pattern of Genetic Variation and Association Among Genes

Factors that affect the frequencies of alleles and haplotypes in populations:

Biological factors: Haplotype and allele frequencies are affected by
cellular-level processes such as mutation, recombination, and gene conversion, as
well as by population-level processes such as natural selection against alleles
that contribute to disease. When genes are close together and associated,
then selection that changes the frequency of an allele at one gene results in
similar changes in the frequencies of alleles at other genes on the same
haplotype.
Recombination is the major process that breaks down the associations
between SNPs. It is unclear whether haplotype block boundaries are due to
recombination hotspots, or are simply the result of recombination events that
happened to occur there. If the blocks are due to hotspots, then perhaps they
will be common across populations. If the blocks are due to regular
recombination events, then populations may or may not share them,
depending on how long ago the recombination events occurred. When large
chromosomal regions are examined, the regions with high association have
less recombination and less genetic variation.

Demographic and social factors: Haplotype and allele frequencies are also
affected by population history factors such as population size, bottlenecks or
expansions of population size, founder effects, isolation of a population or
admixture between populations, and patterns of mate choice.
Large variances: The many influences on haplotype frequencies result in
large statistical variances for associations among different SNPs; all such
studies find that associations vary a lot around the mean. Neighboring SNPs
may not be associated, while distant SNPs may be associated, despite the
average association declining with longer distances between SNPs. This
variance means that more SNPs are needed to study associations than simply
counting the blocks might indicate.

Extent of association among SNPs differs by chromosome region and by allele
frequency
Some studies show that a measure of association, D', falls to half its possible
maximum value at a distance between SNPs of about 50 to 80 kb, averaged over
gene regions in European-derived populations. Some regions have strong
associations over as much as one megabase. Among different chromosome regions
there is about a fourfold range in the extent of associations. Rare SNP alleles are
generally of more recent origin than common SNP alleles; recombination has had
less time to break down associations around them so that rare alleles generally have
associations over longer distances than do common alleles.
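For readers unfamiliar with D', the sketch below shows one common way to compute it for a pair of two-allele SNPs. It uses the standard normalization: the disequilibrium coefficient D (observed haplotype frequency minus the product of the allele frequencies) is divided by the largest magnitude D could have given those allele frequencies. The function name and example frequencies are illustrative, and the function returns the absolute value of D'.

```python
def d_prime(p_ab, p_a, p_b):
    """Absolute value of D' for two SNPs.

    p_ab: frequency of the haplotype carrying allele a at the first SNP
          and allele b at the second SNP
    p_a:  frequency of allele a at the first SNP
    p_b:  frequency of allele b at the second SNP
    """
    d = p_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    if d_max == 0:
        return 0.0  # one SNP does not vary, so no association can be measured
    return abs(d) / d_max


complete = d_prime(0.3, 0.3, 0.3)     # only two haplotypes exist, as with AC and GT
partial = d_prime(0.375, 0.5, 0.5)    # association half of its possible maximum
independent = d_prime(0.25, 0.5, 0.5) # haplotype frequency equals the product
```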
Extent of association among SNPs differs by population
Many studies show that the chromosomal distances that SNP associations extend are
generally shorter for African populations, intermediate for European and Asian
populations, and longer for American Indian populations, although there is variation
among populations in the same geographic region. When groups of people from
populations that differ in some allele frequencies marry and reproduce with each
other, as has often happened with African-Americans and with Hispanics in the
United States, associations are generated over longer chromosomal distances in the
admixed group than in either parental group. Recently formed populations such as
the Mennonites and Acadians also may have associations over longer chromosomal
distances.
Common haplotypes are in all populations
The pattern of variation within and among populations for haplotype structure is just
starting to be studied on a large scale. Recent studies show that the common
haplotypes are found in all populations studied, and that the population-specific
haplotypes are generally rare. African populations generally have more haplotypes
than other populations, which generally have subsets of the African ones, due to the
origins of other populations from ones that spread out of Africa.

Haplotype Block Structure as the Rationale for a Haplotype Map


Block pattern of haplotypes
Some recent studies found that haplotypes occur in a block pattern: the chromosome
region of a block has just a few common haplotypes, followed by another block
region also with just a few common haplotypes, with the longer-distance haplotypes
showing a mixing of the haplotypes in the two blocks. Another description of this
pattern is that the SNPs in a block are strongly associated with each other, but much
less associated with other SNPs. Blocks range in size from about three kb to more
than 150 kb. The majority of SNPs are organized in these blocks. Some recent data
show that the blocks in a Yoruban population from Nigeria are generally the same
ones as, but shorter than, those in two European-derived populations, although the
data are limited and these conclusions are preliminary.
Using haplotype blocks to find chromosome regions associated with disease
Where blocks exist, they can be tested for association with a disease, using just a
few SNPs per block. If the blocks are large, then a few SNPs in a region will indicate
whether that region has genes related to a disease. If the blocks are small, then
many SNPs will be needed to cover a region. Typing more SNPs than needed is a
waste of resources; typing too few SNPs means that a disease association could
easily be missed.
A haplotype map
A haplotype map would show the haplotype blocks and the SNPs that define them. A
haplotype map thus would serve as a resource to increase the efficiency and
comprehensiveness of the many other studies that will be done to relate genes to
diseases.
Haplotype maps of different populations
To the extent that populations differ in their haplotype structure, it may be useful to
study different populations during different stages of the process of finding disease
genes. Studying populations with large haplotype blocks will be useful for initial
association studies over the entire genome to find chromosome regions affecting a
disease. Once these chromosome regions have been found, they can be studied in
populations with small haplotype blocks in those regions, so the particular genes can
be found more easily by being localized in small regions.

Sampling Human Haplotype Variation


Possible schemes to sample human haplotype variation

Population sampling: Samples are chosen from particular identified
populations, defined by ethnicity and geography.
Grid sampling: Samples are chosen from particular geographic regions on a
world grid.
Proportional sampling: Samples are chosen from identified populations so
that the entire sample has a known distribution, but the population identities
of the individual samples are not kept. This scheme was used for the DNA
Polymorphism Discovery Resource.

Studying one or multiple populations


Studying just one population would reveal the common haplotypes that are in all
populations, and so the resulting haplotype map would be useful for all populations.
Including only one, non-disadvantaged, population would also avoid some of the
ethical issues raised by identifying populations. However, this approach would raise
serious issues of justice, since only that population could receive the
population-specific advantages of the haplotype map. There are also scientific
reasons to include
more than one population in a haplotype map: to add haplotypes that are not as
common or that are more variable in frequency among populations; and to reveal
regions that are similar or different in haplotype structure among populations. After
the first few populations are included in a haplotype map, additional populations
could still be added. The haplotype map should be developed so that it would be
useful for mapping genes in any population.
Designating which population an individual belongs to, when choosing
which individuals to sample
There are many ways that individuals could define which populations they belong to,
such as their cultural affiliations or the geographic origins of their grandparents. Most
populations have blurred boundaries, some more than others. Some people define
themselves as members of several populations. Individuals or communities may
emphasize some aspects of ancestry more than others, based on factors such as
pride, shame, history of discrimination, or extent of knowledge. For a haplotype map,
the purpose of designating an individual as belonging to a particular population is
simply to make sure that most of that person's bi-parental lineages come from that
particular population. Occasional differences between the population designations of
individuals and their actual lineages would have little effect on a haplotype map,
since the blocks are defined by the common haplotypes in a population contributing
samples. The complexity in designating individuals as belonging to particular
populations underscores the need for involving experts in the social sciences when
developing a haplotype map.
Population consultation

Only a few populations would need to be included for the haplotype map to become a
useful resource for individuals in all populations; there is no reason why any
particular population should have to participate for the project to succeed. For any
population that might be included in a haplotype map, there must be a process of
community consultation to explain the purpose of the map and identify issues of
concern to that population. This process would take time but would be necessary to
educate both the population and the researchers. Particular populations may be
sensitive to being exploited or to being left out. Issues may arise that require
modifications to the consent process, the research protocol, the procedures of the
sample repository, or the database. American Indian and Alaskan Native tribes are
sovereign nations and have procedures for formally granting or withholding consent
to research, which by law must be followed. Other populations are less well
organized, making formal population consent unobtainable, but community
consultation would still be needed, keeping in mind the multiple geographic scales
and other complexities that characterize many populations. Procedures for consulting
communities are outlined in an emerging literature and are under discussion at NIH.

Issues Associated with Identifying Samples by Population


Risks in identifying populations
Identifying the populations that contribute samples for a haplotype map could raise
ethical risks. One risk is that any racial or ethnic identifiers used for the map would
come to be reified as biological constructs, fostering a genetic essentialism in the
way the map is interpreted and the categories understood. This essentialism could
obscure the fluid nature of the "boundaries" between groups and the common
genetic variation within all groups. Although the haplotype map would not have any
individual medical information, another risk to the groups that participate could arise
from later studies that use the haplotype map to find genes contributing to diseases;
the participating groups could become more intensely studied, leading to the
perception that their members are at high risk for diseases.
Benefits in identifying populations
Identifying individual samples as contributed by members of particular populations
would be most useful scientifically. For each population, it would allow multiple
sources of biomedical information to be combined. The contributing populations
would gain the general benefits of the haplotype map as well as any additional
benefits from studies of those particular populations. However, it is an open question
how much less useful a haplotype map would be if population identifiers were
omitted. Additional studies of population differences in haplotypes are needed to
resolve this issue.
Describing the contributing populations
Regardless of whether the individual samples would be identified by their population
of origin, any populations that contribute to a haplotype map must be described in a
way that does not reinforce the mistaken perception that populations are genetically
distinct, well-defined groups. Because people take in information most readily when
it confirms their stereotypes, terms related to race and ethnicity must be used with
precision, sensitivity, and care. Populations should be described as specifically as
possible; for example, if a group of Chinese-Americans in Hawaii were studied, the
population should not be labeled simply "Chinese." This specificity of description is
crucial to minimize the risk of essentialist definitions of race, which assume that all
individuals of a race are genetically similar.

Other Issues Raised by a Haplotype Map Project


Health priorities
Some communities lack even basic health care, so a haplotype map may be a low
priority for them. Groups may feel that even if they participate in a haplotype
project, not much attention would be paid to genetic diseases that primarily affect
some of them but not members of other groups.
Sampling in the developed vs. the developing world
Including samples from developed countries and regions, such as the United States,
Canada, Europe and Japan, might raise fewer human subjects concerns than would
samples from developing countries without good IRB systems for overseeing
research or strong biomedical research infrastructures.
Why not obtain the medical phenotypes of the sampled individuals when
developing a haplotype map?
No phenotypic data, such as medical information, would be collected along with the
samples. The haplotype map would be a resource for researchers trying to relate
genetic variation to a wide range of disorders and traits. Only about 50 samples from
each population would be needed to develop a haplotype map. Such a small sample
would not be adequate to evaluate the many genetic and environmental factors that
affect a disease. However, if the haplotype structure of the genome and the
identifying SNPs were known, then researchers could use those SNPs in studies of
individuals affected and not affected by a disease, matched to control for
environmental factors, to track down the genes that contribute to the disease.

Elements of a Research Plan


The goal of a haplotype map should be medical
A haplotype map could be set up in many ways, to support various types of medical
and biological research. It should be set up to best facilitate its use for relating
genetic variation to disease.
How should a haplotype block be defined?
To compare studies, it will be important to develop a standard definition of a block,
including the minimum frequencies of alleles for SNPs used, how similar haplotypes
must be to be considered the "same" haplotype when figuring out which haplotypes
are common, and how much of a drop in association defines the boundaries of
blocks. Descriptions of the block structure of the genome would include distributions
of the lengths of blocks, measures of the variability of blocks, the amount of
coverage of the genome by blocks, and the proportion of haplotypes in common
blocks; there are tradeoffs among these measures depending on the values of the
defining parameters. Care will be needed when comparing studies using SNPs with
different allele frequencies, with SNPs ascertained in different ways, and with
different sample sizes for estimating associations.
Pilot projects, to help decide whether population identifiers are needed
Population differentiation for allele frequency is about 7 to 10 percent. However,
currently little is known about how populations differ in their haplotypes. It will be
important to find out whether the haplotype blocks are the same in different
populations. Are differences in the extent of associations due to differences in block
lengths, or to differences in the associations among neighboring blocks? How
different are different populations from the same geographic region? How much
information would be lost by removing population identifiers from samples? The first
step would be to get more data, by sampling a small set of populations with different
geographic origins. If these populations were similar, then it may be possible not to
identify populations and still get a haplotype map that can be broadly useful. If the
populations were different enough, then it might be necessary to identify the
populations that contribute the samples. Projects already underway might be used to
answer this question, or some pilot projects might be needed.
Number of populations
It was suggested that about 3 to 6 populations would be included in a haplotype
map. The goal is to produce a tool that is broadly useful.
Samples from real populations
To obtain the most representative samples, it is important not to use samples of
convenience, but to choose samples from real populations. The populations that
contribute samples should be chosen based on the goals of the haplotype map, and
the samples should be collected with appropriate population consultation and
informed consent.
Common samples
Having a common set of samples that could be used by all research groups would
allow comparisons among the results of different studies. Combining information
across studies produces much more informative results than simply the sum of the
results of separate studies.
SNP allele frequencies
Inclusion of SNPs spanning the range of SNP allele frequencies would be important.
The length of associations among alleles may differ depending on the frequencies of
the alleles. Also, a SNP allele provides the most power for an association study when
its frequency matches that of a nearby allele contributing to a disease, and such
alleles can be expected to span the range of frequencies.
A hierarchical approach for SNP density

It would be needlessly expensive to genotype all the individuals in a sample with a
dense set of SNPs. A hierarchical approach makes more sense: start with a density of
SNPs of perhaps one every 50 kb. For regions where such adjacent markers are
strongly associated, these SNPs are sufficient and should be able to define blocks.
For other regions, SNPs with a density of perhaps one every 10 kb could be
examined, and so on until only regions with no block structure are left. Another type
of hierarchical approach would be to start in the regions around genes.
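The hierarchical approach described above can be sketched as a simple refinement loop. This is an illustration under stated assumptions, not a procedure proposed at the meeting: `is_strongly_associated` stands in for whatever association test would actually be used, and `toy_association` with its single "hotspot" position is an invented stand-in for real genotype data.

```python
def hierarchical_marker_plan(region_start, region_end, spacings,
                             is_strongly_associated):
    """Place markers coarsely, then add denser markers only where
    adjacent markers are not strongly associated.

    Returns (sorted marker positions, intervals still unresolved
    after the finest spacing).
    """
    markers = set()
    unresolved = [(region_start, region_end)]
    for spacing in spacings:
        next_unresolved = []
        for start, end in unresolved:
            positions = list(range(start, end, spacing)) + [end]
            markers.update(positions)
            for left, right in zip(positions, positions[1:]):
                if not is_strongly_associated(left, right):
                    next_unresolved.append((left, right))
        unresolved = next_unresolved
    return sorted(markers), unresolved


def toy_association(left, right, hotspots=(123_000,)):
    """Hypothetical test: markers are associated unless an assumed
    recombination hotspot lies strictly between them."""
    return not any(left < h < right for h in hotspots)


# One marker every 50 kb first, then every 10 kb where needed.
markers, unresolved = hierarchical_marker_plan(
    0, 200_000, [50_000, 10_000], toy_association)
```

In this toy region the plan uses 9 markers, where a uniform 10 kb grid would use 21, and it leaves one 10 kb interval that would need still denser typing.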
What would define the endpoint of the project?
The goal of a haplotype map is to have sufficient SNPs so that researchers doing
association studies could be sure that regions containing disease alleles have been
found, and that regions not containing disease alleles can be excluded from further
consideration. The map could be considered complete when more SNPs provide no
more information about block structure, or when all common SNPs are included in
the map or are highly associated with ones included.
Methods
Many technical issues still need to be worked out: the method for determining
haplotypes; the types of samples needed, such as single chromosomes, individuals
or families with a certain number of children; the number of samples for each
population; and quality measures. The processes for consulting with populations and
obtaining informed consent need to be developed.
Data analysis
Dealing with thousands to millions of SNPs, haplotypes, and haplotype blocks
requires the development of better statistical methods of analysis to delineate blocks
and to associate them with diseases. Better analytical methods are needed to model
and understand the chromosomal and population processes that lead to the block
structure observed.
Open data-sharing policy
Just as was done for the sequence produced by the Human Genome Project,
providing rapid and complete data release to appropriate public databases would
allow maximal benefit to be gained from haplotype data by allowing all researchers
quick access to the data.
Coordination of data producers
A haplotype project would need coordination among the data producers, both large
and small. The project should be international and open to all interested researchers.

Process for Planning


International project
An international steering committee should be formed. So far there is interest from
the United States, Canada, the United Kingdom, France, Germany and Japan.

Two working groups


Some issues could be considered by one group; others could be considered by both.

Population and ELSI Group: To identify the risks associated with a
haplotype map project, including those associated with identifying
populations; to consider how to minimize those risks; and to consider which
types of populations should be considered for inclusion in a haplotype map.

Methods Group: To consider the types of samples needed and how to create
a haplotype map.

Name of the project


The public would have a hard time understanding and supporting a project named
anything like The Haplotype Linkage Disequilibrium Association Map. A more
understandable name is needed, as well as better ways to explain the project. It will
also be important to communicate clearly what is understood about the complex
relationships among genetics, culture, race and ethnicity.


HAPLOTYPE MAPPING
20/3/03. BY RICHARD TWYMAN

Haplotypes, groups of closely linked alleles that tend to be
inherited together, can be used to map human disease genes
very accurately.
All our chromosomes come in pairs, one in each pair inherited from
each parent. While each chromosome of a pair contains the same
genes in the same order, the sequences are not identical. For
example, there are single nucleotide polymorphisms (SNPs)
approximately every 1000 nucleotides. It is therefore possible to
distinguish sequence variants that come from our mother and our
father. These are termed maternal and paternal alleles.
The ability to distinguish between maternal and paternal alleles allows
human disease genes to be mapped by linkage analysis. In germ
cells, which produce eggs or sperm, the maternal and paternal
chromosomes pair up and exchange segments of DNA, a process
called recombination. After recombination, the chromosomes contain
a mixture of alleles from each parent. Recombination will occur
frequently between DNA sequences that are a long way apart but only
rarely between sequences that are close together. Therefore, by
measuring the frequency of recombination between the disease gene
and other DNA sequences whose location is already known, the
position of the disease gene can be established.
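The measurement described above comes down to counting. In the sketch below the recombination frequency between the disease gene and each marker is the share of observed offspring in which the two were separated by recombination; the marker names and counts are hypothetical.

```python
def recombination_fraction(n_recombinant, n_total):
    """Share of observed offspring that are recombinant between two loci."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return n_recombinant / n_total


# Hypothetical counts: (recombinant offspring, total offspring) for three
# markers of known position, each scored against the disease gene.
observations = {"M1": (18, 100), "M2": (4, 100), "M3": (27, 100)}

fractions = {name: recombination_fraction(*counts)
             for name, counts in observations.items()}

# The marker with the lowest recombination frequency lies closest
# to the disease gene.
closest_marker = min(fractions, key=fractions.get)
```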

Haplotype mapping: A new mutation (X) arises in the proximity of
six single nucleotide polymorphisms, with the ancestral haplotype
signature TATCAT. Over several generations, the haplotype
signature may be eroded by recombination. For example,
contemporary haplotype 1 was produced by recombination between
the first and second SNPs. The new alleles are shown in pink.
However, the smallest conserved haplotype signature in all patients
carrying the disease allele places the disease between SNPs 3 and
4. This technique provides a candidate region of about 10 000 bp,
which is smaller than most human genes.

Another consequence of recombination is that blocks of sequences on
the same chromosome tend to be inherited together, a phenomenon
known as linkage disequilibrium. Such groups of alleles, which are
rarely separated by recombination, are known as haplotypes. In the
human genome, haplotypes tend to be approximately 60 000 bp in
size and therefore contain up to 60 SNPs that travel as a group.
Haplotypes can be exploited for the fine mapping of disease genes.
The principle of haplotype mapping is shown in the Figure. A new
mutation responsible for a genetic disease always enters the
population within an existing haplotype, which is termed the ancestral
haplotype.
Over several generations, recombination events may occur within the
haplotype but the disease allele and the closest SNPs still tend to be
inherited as a group. If this haplotype can be identified in a group of
patients with the disease, typing the alleles within the haplotype
allows a conserved region to be identified, which pinpoints the
mutation responsible for the disease. Due to the abundance of SNPs,
this technique has the potential to map genes very accurately. There
is therefore much interest in developing a haplotype map of the entire
human genome.
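The search for a conserved region can be sketched as follows. The ancestral signature TATCAT comes from the figure caption; the three patient haplotypes are hypothetical, invented to show recombination eroding the signature from both ends.

```python
def conserved_interval(ancestral, patient_haplotypes):
    """Return (first, last) 1-based SNP numbers of the longest contiguous
    run of SNPs at which every patient carries the ancestral allele,
    or None if no SNP is shared by all patients."""
    shared = [all(h[i] == allele for h in patient_haplotypes)
              for i, allele in enumerate(ancestral)]
    best = None
    run_start = None
    # A trailing False closes any run that reaches the last SNP.
    for i, is_shared in enumerate(shared + [False]):
        if is_shared and run_start is None:
            run_start = i
        elif not is_shared and run_start is not None:
            if best is None or i - run_start > best[1] - best[0] + 1:
                best = (run_start, i - 1)
            run_start = None
    if best is None:
        return None
    return (best[0] + 1, best[1] + 1)


ancestral = "TATCAT"
# Hypothetical haplotypes from three patients carrying the disease allele.
patients = ["CATCAT",   # recombination between SNPs 1 and 2
            "TATCGC",   # recombination between SNPs 4 and 5
            "CGTCAT"]   # recombination between SNPs 2 and 3
region = conserved_interval(ancestral, patients)
```

For these invented patients the only alleles shared by everyone are at SNPs 3 and 4, which matches the candidate region named in the figure caption.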
