
How was the human genome mapped?

• The official HGP was planned to span 15 years (1990-2005).
• Progress was faster than planned, because genetic maps were
completed ahead of schedule and techniques such as
automated fluorescence-based DNA sequencing
added momentum.
• The first human genetic map (1987) was based on
RFLPs, but it had limitations: the average spacing
between markers was considerable and RFLPs are
difficult to type. Later maps were based on
microsatellite markers.
History of the Human Genome Project
1988 HUGO was established
1990 Official start of HGP with $3 billion and a 15-year horizon.
1991 Genome Database (GDB) was established
1992 Genethon published map based on microsatellites.
1995 Lander et al.: detailed map based on STSs
1998 Comprehensive map based on gene markers.
1999 Sanger Centre publishes chromosome 22
2001 First draft of Human Genome published
2003 Gaps and more details of Human Genome

Strachan and Read, HMG


Most of the sequencing came from 5 centers

• Wellcome Trust Sanger Inst


• Whitehead Inst/MIT
• Washington University
• DoE Joint Genomic Institute
• Baylor College of Medicine
The HGP’s primary aims
• The main aims of the Human Genome
Project (HGP) were to:
– Construct maps of the genome (genetic and
physical)
– Identify all the genes (now known to be about
30,000)
– Determine the entire DNA sequence
(3,000,000,000 bp)
Other aims of HGP
• As well as the genome sequence, the
aims were:
• Technology development
• Model organism genome projects (E. coli,
yeast, mouse, fruit fly, C. elegans)
• Ethical, legal and societal implications
(ELSI)
• Personalized medicine
Personalized medicine
• Anticipated medical benefits: for single-gene disorders,
comprehensive prenatal and presymptomatic diagnosis.
• Similarly, the biological basis of diseases can be studied
and suitable therapies can be designed.
• Wide-scale screening for mutations has already brought
about a radical change in the approach to medical care.
• Drug response also varies between individuals. The
genetic causes underlying this variation are being mapped, so
that an individual's genome can be tested before a drug is
prescribed.
The linkage map
• The map was built by linkage studies in 60 large families with
grandparents and large numbers of children, collected by the University
of Utah and the Centre d'Étude du Polymorphisme Humain (CEPH),
Paris
• Families were typed with over 5000 polymorphic DNA sequences: 60%
were microsatellite repeats (mostly dinucleotide (CA) repeats, also
some tri- and tetra-nucleotides). Only about 400 of them were actual
genes
• Construction of the genetic map:
– Obtain genotypes of all markers on all family members (PCR and
gel electrophoresis, using robots and automated gel apparatus)
– Calculation of recombination fractions between markers
– Observe crossovers between closely linked markers, use this
information to confirm order of markers
• Construction of the linkage map is a very big problem; sophisticated
software was used to work out the "best fit" map of all the markers, with
advanced statistical methods and algorithms
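As a rough illustration of the "calculation of recombination fractions" step, the sketch below estimates the recombination fraction between two markers from counts of recombinant and non-recombinant meioses and computes a two-point LOD score. It assumes phase-known, fully informative meioses. The function names and the counts are hypothetical, and the actual map construction relied on far more elaborate multipoint software.

```python
import math

def recombination_fraction(recombinants, total):
    """Fraction of informative meioses in which the two markers recombined."""
    if total <= 0:
        raise ValueError("need at least one informative meiosis")
    return recombinants / total

def lod_score(recombinants, total, theta):
    """Two-point LOD: log10 of L(theta) / L(0.5) for phase-known meioses."""
    if not 0 < theta <= 0.5:
        raise ValueError("theta must be in (0, 0.5]")
    nonrecombinants = total - recombinants
    log_l_theta = (recombinants * math.log10(theta)
                   + nonrecombinants * math.log10(1 - theta))
    log_l_null = total * math.log10(0.5)  # no linkage: theta = 0.5
    return log_l_theta - log_l_null

# Hypothetical counts: 4 recombinants among 40 informative meioses.
theta_hat = recombination_fraction(4, 40)
print(theta_hat, round(lod_score(4, 40, theta_hat), 2))
```

By convention, a LOD score of 3 or more is usually taken as evidence of linkage between two markers.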
STSs and ESTs
• Sequence tagged sites (STSs) are specific loci in the genome, for which
enough DNA sequence is available to make PCR primers to amplify the
locus (usually as a fragment of a few 100bp). These include
microsatellites (e.g. CA repeats) that can be used for linkage studies.

• The information required to use an STS is just the sequences of the
PCR primers; therefore it is very easy to make databases of STSs that
can be used by anyone. No actual bits of DNA need change hands. This
is crucial in allowing genome projects to proceed as international
collaborations, with many laboratories participating in a co-ordinated
way.

• ESTs act as specific tags for each human gene, since they are derived
by sequencing cDNA clones which came from mRNA and therefore
represent the actual transcribed sequences (as opposed to STSs, which
can be derived from anywhere in the genome and are mostly non-
coding). They allow rapid access to the actual genes, ignoring introns
and “junk” DNA
ESTs can be 3' or 5' depending on which end of the cDNA was sequenced.
1. Because of the methods used to make cDNA libraries, parts of the 5' end
of the gene are often lost during cloning, whereas the 3' end is more reliable.
2. This is shown on the diagram by the white boxes representing cDNA clones
being of different lengths.
3. Another complication is due to alternative splicing.
4. On the left is shown the genomic structure of a gene, with the exons as
boxes; the red one is subject to alternative splicing.
• Genome: the total set of different DNA molecules. Humans have 25
different DNA molecules: a single mitochondrial DNA and 24
nuclear DNA molecules.
• It is loosely divided into the nuclear and mitochondrial genomes.
• The approximately 16.5 kb mitochondrial genome was
published in 1981; the primary goal of the HGP was to sequence the
3000 Mb nuclear genome.
Human gene and DNA segment nomenclature
• The nomenclature is decided by the HUGO
nomenclature committee. Genes are allocated
symbols of usually 2-6 characters.
• For anonymous DNA sequences the convention is to
use D (DNA) followed by 1-22, X or Y to denote
chromosomal location, then S for a unique
segment, Z for a chromosome-specific repetitive
DNA family or F for a multilocus DNA family, and
finally a serial number. The letter E following the
number for an anonymous DNA sequence indicates
that the sequence is known to be expressed.
Symbol      Interpretation
CRYB1       Gene for crystallin beta polypeptide 1
B3P42       Breakpoint number 42 on chromosome 3
DYS29       Unique DNA segment number 29 on the Y chromosome
D3S2550E    Unique DNA segment number 2550 on chromosome 3, known to be expressed
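The D-number convention for anonymous DNA segments is regular enough to be checked mechanically. The following is a minimal sketch that decodes such symbols with a regular expression; the function name `parse_d_symbol` and the returned field names are hypothetical, and gene symbols such as CRYB1 are deliberately not handled.

```python
import re

# D + chromosome (1-22, X or Y) + S/Z/F + serial number + optional E.
D_SYMBOL = re.compile(
    r"^D(?P<chrom>1[0-9]|2[0-2]|[1-9]|X|Y)"
    r"(?P<kind>[SZF])(?P<serial>\d+)(?P<expressed>E?)$"
)

KINDS = {
    "S": "unique segment",
    "Z": "chromosome-specific repetitive DNA family",
    "F": "multilocus DNA family",
}

def parse_d_symbol(symbol):
    """Decode an anonymous DNA segment symbol; return None if it does not fit."""
    match = D_SYMBOL.match(symbol)
    if match is None:
        return None
    return {
        "chromosome": match["chrom"],
        "kind": KINDS[match["kind"]],
        "serial": int(match["serial"]),
        "expressed": match["expressed"] == "E",
    }

print(parse_d_symbol("D3S2550E"))
print(parse_d_symbol("DYS29"))
```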
• Although different types of markers have been used to construct human
genetic maps, a common principle was followed. All markers
were typed in members of a variety of multigeneration families,
and the data were fed into a computer to check for markers with
co-segregating alleles.
• The first physical map was based on chromosome
banding. Although the resolution is coarse, it still provides a very
useful framework for ordering human DNA sequences by
in situ hybridization and cytogenetic breakpoints.
• Long-range restriction maps were also made using rare
cutters (NotI restriction map).
• The first high-resolution physical map of the human genome
was made possible by making “libraries of genomic DNA
clones”.
• Once available, the libraries can be screened to identify
individual clones, which can then be grouped into sets of
clones, with inserts from the same chromosome and
Type of map                  Example/methodology                       Resolution
Cytogenetic                  Chromosome banding                        Average band has several Mb of DNA
Chromosome breakpoint maps   Somatic cell hybrid panels, RH maps       Distance often may be in Mb
Restriction maps             Rare-cutter restriction map (NotI)        Several hundred kb
Clone contig map             Overlapping YACs                          Several hundred kb
STS                          Requires prior info on sequence           Less than 1 kb possible
EST                          Requires cDNA sequencing, then mapping    Average resolution in tens of kb
                             these back to physical maps
DNA sequencing               Complete nucleotide sequence of           1 bp
                             chromosomal DNA
Early map of human gene distribution
• Most human genes are associated with CpG
islands
• A purified fraction of CpG islands was labelled with Texas
red and hybridized to human metaphase
chromosomes.
• Gene-dense regions could be seen by the red
fluorescence from the labelled CpG island fraction.
• For example, chromosome 22 was seen to be gene rich,
whereas chromosomes 4, 18, X and Y are gene poor.
Somatic cell hybrid mapping

[Figure: mitosis of hybrid cells, with random loss of human chromosomes]
Such hybrid cells are unstable and lose a few and retain some
of the human chromosomes
• Panels of hybrid cells can be used to map
a human gene or DNA sequence to a specific
human chromosome.
• It is most efficient to use panels of
monochromosomal hybrids (cells containing just
a single type of human chromosome) collectively
containing all 24 types of human chromosome.
• To make these, donor human cells are exposed to
colcemid, causing the chromosome set to be
partitioned into discrete subnuclear packets
(micronuclei).
• This is followed by centrifugation, resulting in
microcells, each consisting of a single micronucleus with a
thin rim of cytoplasm surrounded by an intact plasma
membrane.
• The microcells are fused with recipient
rodent cells (microcell fusion) to generate
hybrids, some containing a single human
chromosome.
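Mapping with a hybrid panel comes down to a concordance check: the marker should be present in exactly those hybrids that retain the chromosome carrying it. The sketch below shows the idea on a small, made-up panel. The hybrid names, chromosome contents and the function `assign_chromosome` are all hypothetical.

```python
def assign_chromosome(panel, marker_hits):
    """Score each human chromosome by how many hybrids agree with the marker.

    panel: {hybrid name: set of human chromosomes retained}
    marker_hits: {hybrid name: True if the marker was detected}
    Returns (best-scoring chromosomes, number of concordant hybrids).
    """
    chromosomes = set().union(*panel.values())
    scores = {}
    for chrom in chromosomes:
        # A hybrid is concordant if chromosome and marker are both
        # present or both absent.
        scores[chrom] = sum(
            (chrom in panel[hybrid]) == marker_hits[hybrid] for hybrid in panel
        )
    best = max(scores.values())
    return sorted(c for c, s in scores.items() if s == best), best

# Hypothetical five-hybrid panel and PCR results for one marker.
panel = {
    "H1": {"1", "7", "22"},
    "H2": {"7", "X"},
    "H3": {"1", "22"},
    "H4": {"3", "X"},
    "H5": {"7", "3"},
}
marker_hits = {"H1": True, "H2": True, "H3": False, "H4": False, "H5": True}

print(assign_chromosome(panel, marker_hits))
```

With monochromosomal hybrids the same check becomes trivial, since each hybrid carries only one human chromosome.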
To aid human genome mapping

• Enriching the starting DNA: instead of using whole
genomic DNA, individual chromosomes were
purified by flow cytometry, using the same principles
that are used to fractionate cells in a FACS sorter.
• By collecting a sufficient number of a particular type
of chromosome, chromosome-specific DNA
libraries were generated. Additional chromosome
microdissection procedures enabled DNA libraries
to be made from DNA isolated from specific
subchromosomal regions.
Radiation hybrid panel.
What are they?
How are they made?
Radiation hybrid & HGP
• Two radiation hybrid panels have been particularly important
in human genome mapping. The Genebridge 4 panel
consists of 93 human-hamster radiation hybrids with an
average human fragment size of 25 Mb and 32% retention of
human sequence in each hybrid.
• Labs can map any unknown STS by scoring the 93
Genebridge hybrids and comparing the pattern with patterns
of previously mapped markers held on a central server.
• A second human-hamster panel, the Stanford G3 panel,
was made using a higher dose of radiation, so that the
average human fragment size is smaller. The 83 hybrids in
G3 averaged 16% retention of the human genome, with an
average fragment size of 2.4 Mb.
• Thus G3 could be used for finer mapping
(www.ncbi.nlm.nih.gov/genemap98/; Deloukas et al., 1998).
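The pattern comparison described above can be sketched as follows: each marker is a vector of presence/absence calls across the hybrids, and an unknown STS is placed near the mapped markers whose vectors agree with it most often. This toy example uses 8 hybrids rather than 93. The marker names and the simple concordance score are assumptions for illustration; real RH mapping servers used statistical models of radiation-induced breakage.

```python
def concordance(a, b):
    """Fraction of hybrids in which two markers are both present or both absent."""
    if len(a) != len(b):
        raise ValueError("retention patterns must cover the same hybrids")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def closest_markers(query, mapped, top=3):
    """Rank previously mapped markers by concordance with the query pattern."""
    ranked = sorted(
        mapped.items(), key=lambda item: concordance(query, item[1]), reverse=True
    )
    return [(name, concordance(query, pattern)) for name, pattern in ranked[:top]]

# Hypothetical retention patterns: '1' = STS amplified in that hybrid.
mapped = {
    "STS_A": "10110010",
    "STS_B": "10100010",
    "STS_C": "01001101",
}
unknown_sts = "10110010"

print(closest_markers(unknown_sts, mapped, top=2))
```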
[Figure: making radiation hybrids. Normal human fibroblast; lethal dose of radiation; cells with fragmented chromosomes; fuse with TK- hamster cells; TK+ hybrid cells]

RH mapping in peripheral lab
[Figure: select the STS to be mapped; PCR amplify the STS in each of 100-200 hybrids, each retaining 25-30% of the human genome in small fragments; match the pattern against the database on the central server to obtain the location]
A YAC-based physical map of human genome
• At the official beginning of the HGP in 1990, the available
genomic DNA libraries contained inserts up to 40 kb in length
(cosmids). Because of the large size of the human genome, a
library with an average insert of 40 kb would need several hundred
thousand different clones to ensure a high probability of
representing the whole genome. Screening these
individual clones and organizing them would be a daunting
task.
• To circumvent this problem, novel methods for making
artificial eukaryotic chromosomes were developed. It was
known that only small regions of yeast chromosomal
sequence are enough for a DNA molecule to function as an
independent chromosome. By purifying these sequences and combining
them with large human DNA fragments, it was possible to make
hybrid molecules containing megabase-sized inserts.
• YAC libraries with an average insert size of 1 Mb would
need 12,000-15,000 clones to reasonably represent the
human genome, and would have the advantage of enabling
large genes to be retained in individual clones. The first
reasonably detailed map using YACs was published by Cohen et al., 1993.
• An updated YAC map covering 75% of the human genome,
consisting of 225 contigs with an average size of 10 Mb, was
subsequently published (Chumakov, 1995).
• The underlying principle in YAC maps (and all other
clone-based physical maps) is to order the clones in the library on
the basis of the subchromosomal region of origin of the
insert DNA.
• This means that the relevant subchromosomal region is
represented by a linear array of partially overlapping clones
without any gaps. Such a contiguous set of cloned
DNA sequences is called a clone contig.
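The library sizes quoted above (several hundred thousand cosmids versus 12,000-15,000 YACs) can be checked with the standard estimate usually attributed to Clarke and Carbon, N = ln(1 - P) / ln(1 - f), where f is the insert size as a fraction of the genome and P is the desired probability that any given sequence is present. The slides do not state P; the value of 0.99 below is an assumption.

```python
import math

def clones_needed(genome_bp, insert_bp, probability):
    """Clarke-Carbon estimate: N = ln(1 - P) / ln(1 - insert/genome)."""
    fraction = insert_bp / genome_bp
    return math.ceil(math.log(1 - probability) / math.log(1 - fraction))

GENOME_BP = 3_000_000_000  # 3000 Mb nuclear genome
P = 0.99                   # assumed coverage probability, not from the slides

cosmid_clones = clones_needed(GENOME_BP, 40_000, P)   # 40 kb inserts
yac_clones = clones_needed(GENOME_BP, 1_000_000, P)   # 1 Mb inserts

print(cosmid_clones, yac_clones)
```

Under this assumption the cosmid library needs a few hundred thousand clones and the YAC library falls within the 12,000-15,000 range given in the text.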
A high resolution STS sequence map of the human
genome

• The accuracy of clone contig maps is crucially dependent on
the extent to which the clone insert DNA is a true representation
of the original genomic sequence.
• However, in the YAC-based map a considerable portion of the
genome was not represented, and the major limitation was that it
was not a faithful representation of the genomic DNA. The
large YAC inserts are prone to rearrangements, including loss
of internal sequence, and there was a problem with chimerism
(where a single transformed cell contains two or more pieces
of human DNA from non-contiguous portions of the genome).
• To guard against this infidelity of clone
inserts, the HGP had to emphasize the need to develop maps
based on sequence tagged sites (STSs). With a
sufficiently high density of STS landmarks, the insert stability
issue could be sidestepped.
• The map can be rebuilt by typing the large number of STSs on
other kinds of clones (BACs, etc.) (Whitehead Institute, Lander's
group; Hudson et al., 1995).
• The human STS map was an integrated physical map in
which STSs had been used to type (a) a panel of human
radiation hybrids and (b) the CEPH YAC library. STS
markers were of two types: polymorphic and non-polymorphic
(obtained by sequencing genomic DNA
clones at random and then developing PCR primers for
non-repetitive regions, or selected from cDNA (ESTs)).
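STS-content mapping can be pictured as grouping clones that test positive for a shared STS, since two clones carrying the same landmark must overlap. The sketch below builds contigs this way from a small, made-up table of clone contents. The clone and STS names and the function `build_contigs` are hypothetical.

```python
def build_contigs(clone_sts):
    """Group clones into contigs; clones sharing at least one STS are taken to overlap.

    clone_sts: {clone name: set of STSs that amplify from that clone}
    Returns a list of (set of clones, set of STSs) pairs.
    """
    contigs = []
    for clone, sts in clone_sts.items():
        merged_clones, merged_sts = {clone}, set(sts)
        remaining = []
        for clones, markers in contigs:
            if markers & merged_sts:       # shared landmark: join the contigs
                merged_clones |= clones
                merged_sts |= markers
            else:
                remaining.append((clones, markers))
        remaining.append((merged_clones, merged_sts))
        contigs = remaining
    return contigs

# Hypothetical STS content of four clones.
clone_sts = {
    "YAC1": {"sts1", "sts2"},
    "YAC2": {"sts2", "sts3"},
    "YAC3": {"sts5"},
    "BAC4": {"sts3", "sts4"},
}

contigs = build_contigs(clone_sts)
print(sorted(sorted(clones) for clones, _ in contigs))
```

A chimeric clone would carry STSs from two unrelated regions and would wrongly join their contigs in this scheme, which is one reason a dense, independently ordered STS map was needed.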
The final stage of the human Genome
project
• Sequencing strategy: because YACs
are not a faithful representation of
the original starting DNA,
second-generation contig
maps were built from BAC and PAC
libraries. These were selected
although their insert size is
smaller (100-250 kb).
• The sequencing strategy was
hierarchical shotgun
sequencing.
• Steps: sonication, end repair,
cloning into a vector, sequencing
of individual clones, and alignment.
Sequencing methodology
• The basic sequencing methodology was Sanger et al. dideoxy
sequencing.
• Many improvements were made: fluorescence-based
sequencing, and subsequently capillary sequencing, enabled
higher sequencing throughput.
• Sequence interpretation and assembly were carried out
by dedicated computer programmes: PHRED
(analyzes raw sequences and provides a quality score at
each base position to indicate the degree of confidence that
the assigned base call is correct) and PHRAP (assembles
raw sequences into sequence contigs by scanning for
overlapping sequences shared by two or more independent
shotgun clones).
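The quality score that PHRED attaches to each base is conventionally defined as Q = -10 log10(P), where P is the estimated probability that the base call is wrong. The sketch below converts between the two. The function names are hypothetical, and the code does not reproduce how PHRED itself estimates P from the trace data.

```python
import math

def phred_quality(error_probability):
    """Quality score for a given base-call error probability."""
    return -10 * math.log10(error_probability)

def error_probability(quality):
    """Base-call error probability implied by a quality score."""
    return 10 ** (-quality / 10)

for p in (0.1, 0.01, 0.001):
    print(p, round(phred_quality(p)))
```

On this scale Q20 corresponds to a 1 in 100 chance of a wrong call and Q30 to 1 in 1000.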
Problem with repetitive DNA
• Assembling individual clone sequences by identifying overlaps is
crucially dependent on the assumption that the
overlapping sequences are uniquely represented.
• However, a large fraction (almost 50%) of the genome
consists of repetitive DNA such as LINE-1 and Alu repeats. These are
usually avoided when looking for overlaps between
sequences of clones.
• Despite the above precautions, areas that were very rich in
known repetitive sequences would prove problematic.
Previously unidentified low copy number repeats were
another concern.
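The uniqueness assumption can be made concrete with a naive suffix-prefix overlap check. In the sketch below, a read that ends in a repeat overlaps equally well with two reads from different loci that both begin with that repeat, so the overlap alone cannot say which join is correct. The sequences are invented, and the `overlap` function is a simplification for illustration, not the PHRAP algorithm.

```python
def overlap(a, b, min_len=4):
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

# Unique sequence: only one sensible join.
print(overlap("ACGTTGCA", "TGCAGGTT"))

# Hypothetical repeat shared by two different loci.
repeat = "GGCCGGGCGC"
read_a = "ATTCAGT" + repeat   # ends inside the repeat
read_b = repeat + "TTACGA"    # locus 1, starts with the repeat
read_c = repeat + "CCATGT"    # locus 2, starts with the same repeat

# Both candidate joins look equally good, so the assembly is ambiguous.
print(overlap(read_a, read_b), overlap(read_a, read_c))
```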
Estimating the total number of genes
• Two essential criteria for annotating a sequence as a gene:
a. It is transcribed.
b. There is evidence of evolutionarily conserved sequence.
• However, this has its own practical issues:
a. Sometimes a gene is expressed at a very low level, or only in
an unusual cellular location or stage of
development, and hence may be missed in a cDNA library.
b. Genes encoding untranslated RNA may be difficult to identify
in the absence of a sizeable ORF.
• As a result, genes are often missed by experimental
methods.
• Hence in silico programmes (exon prediction,
homology searches) have helped in identifying genes.
• Homology searches against sequence databases make it easy to
identify genes which are conserved during evolution. A
fine example of comparative genomics.
• Exon prediction programmes: some programmes use only
information about the input sequence. One of the popular
programmes is GENSCAN. Cross-species comparison
has been a major contributor to gene identification.
• Integrated gene-finding software packages: various
packages combine general homology databases, gene-associated
motifs, and exon prediction programmes, e.g. NIX,
Genotator.
• However, sometimes this may lead to
overprediction or underprediction and should be followed by wet lab
experiments.
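To show what looking for "a sizeable ORF" means in practice, the sketch below scans the three forward reading frames of a sequence for stretches running from ATG to a stop codon. It is a deliberately naive illustration with a made-up sequence and a hypothetical function name. It ignores introns and the reverse strand, and it is not how GENSCAN or the integrated packages work; those rely on probabilistic models of gene structure.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=3):
    """Return (start, end) of ATG...stop ORFs in the three forward frames.

    'end' is exclusive and includes the stop codon; min_codons counts
    codons before the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs

# Made-up sequence containing one short ORF.
sequence = "CCATGAAATTTGGGTAACC"
for start, end in find_orfs(sequence):
    print(start, end, sequence[start:end])
```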
Major databases
• GenBank: NCBI
• EMBL
• Swiss-Prot
• TrEMBL: translation of coding sequences from the
EMBL database
• PIR: maintained collaboratively by the US National
Biomedical Research Foundation (NBRF), the
Japan International Protein Information Database
(JIPID) and the Munich Information Center for
Protein Sequences (MIPS)
Major sequencing centers and sources of information

1. Baylor College of Medicine Genome Sequencing Center hgsc.bcm.tcm.edu/
2. Celera www.celera.com
3. Washington University Genome Sequencing Center www.genome.wustl.edu
4. Wellcome Trust Sanger Institute www.sanger.ac.uk
5. Whitehead Institute/MIT Center for Genome Research www.-genome.wi.mit.edu

Ensembl genome annotator - www.ensembl.org
European Bioinformatics Institute - www.ebi.ac.uk
NCBI - www.ncbi.nlm.nih.gov
Gene Ontology Consortium www.genomeontology.org
Nature Genome Gateway http://www.nature.com/genomics/human/
National Center for Biotechnology Information http://www.ncbi.nlm.nih.gov/
HapMap Project Homepage http://www.hapmap.org/
Science Special Issue on “THE HUMAN GENOME” 16th Feb, 2001
Nature collection “HUMAN GENOME” 1st June, 2006
What does the draft human genome
sequence tell us?
By the Numbers
• The human genome contains 3 billion chemical nucleotide
bases (A, C, T, and G).
• The average gene consists of 3000 bases, but sizes vary
greatly, with the largest known human gene being dystrophin at
2.4 million bases.
• The total number of genes is estimated at around 24,000,
much lower than previous estimates of 80,000 to 140,000.
• Almost all (99.9%) nucleotide bases are exactly the same in
all people.
• The functions are unknown for over 50% of discovered genes.
What does the draft human
genome sequence tell us?
Variations and Mutations
• Scientists have identified more than 3 million locations
where single-base DNA differences, SNPs (Single
nucleotide polymorphisms) occur in humans.
This information promises to revolutionize the processes
of finding chromosomal locations for disease-
associated sequences and tracing human history.
HapMap
An NIH program to chart genetic variation within the human genome
(www.hapmap.org)
• Begun in 2002, the project is a 3-year effort to
construct a map of the patterns of SNPs (single
nucleotide polymorphisms) that occur across
populations in Africa, Asia, and the United States.
• Consortium of researchers from six countries
• Researchers hope that dramatically decreasing
the number of individual SNPs to be scanned
will provide a shortcut for identifying the DNA
regions associated with common complex
diseases
• Map may also be useful in understanding how
genetic variation contributes to responses to
environmental factors and complex diseases
Anticipated Benefits and downsides of HGR
Molecular Medicine
• Improve diagnosis of disease
• detect genetic predispositions to disease
• create drugs based on molecular information
• design “custom drugs” (pharmacogenomics) based on individual genetic profiles
DNA Identification (Forensics, species identification, evolution)
• identify potential suspects whose DNA may match evidence left at crime scenes
• exonerate persons wrongly accused of crimes
• identify crime and catastrophe victims
• establish paternity and other family relationships
• match organ donors with recipients in transplant programs
ELSI: Ethical, Legal, and Social Issues
• Privacy and confidentiality of genetic information.
• Fairness in the use of genetic information by insurers, employers, courts,
schools, adoption agencies, and the military, among others.
• Psychological impact, stigmatization, and discrimination due to an
individual’s genetic differences.
• Reproductive issues including adequate and informed consent and use of
genetic information in reproductive decision making.
“We are all at risk for something”

Francis Collins
Director, NHGRI
