
Lecture 22, Genomics

Genetic Analyses
(1) FORWARD analysis began with classical transmission methods, tracing the
inheritance of mutant alleles or chromosomes.
(2) REVERSE analysis begins with mutating a reported gene sequence
(knockout), tracing its inheritance and identifying its phenotypic effect.
"I find it sometimes very difficult to tell what someone means when they talk about
genes because we don't share the same definition," says developmental geneticist
William Gelbart of Harvard University in Cambridge, Massachusetts.
You have already struggled with multiple definitions of a gene, based on transmission
and cytological genetics. Now, using your understanding of molecular genetics, propose
a gene definition and explain why it would be consistent with the requirements of a
gene (see lecture 11), practical and comprehensive.

Increasingly, forward genetics, change-of-function insertions, and gene knockouts
or eventual replacement all depend on knowing something about the
sequence and location of a gene in a genome.

TODAY: Genomics and its derivatives

Genome: a (one) complete set of an organism's genetic information, or usually,
one complete set of chromosomes (monoploid); sometimes, the nuclear DNA content.
The original definition of Genomics included:
(1) Mapping chromosomes
(2) Sequencing chromosomes and identifying genes
(3) Analyzing the functions of entire genomes
Currently genomics is divided into several fields of study:
Structural genomics - the study of genome structure
Comparative genomics - the study of genome diversity/evolution
Bioinformatics - information from sequence structure
Functional genomics - the transcriptome (the complete set of RNAs
transcribed from a genome) and the proteome (the complete set
of proteins encoded by a genome).

But first, Sequencing the Human Genome

Clone by Clone method: Map first,
sequence later (publicly funded
Human Genome Project)

Shotgun Method: Sequence first,
map later (Celera Genomics)

Genetic maps of chromosomes are based on recombination frequency between markers:
Low density - about 1% recombination is a practical limit (set by the number of
individuals you have to score) for most visibly expressed genes in eukaryotic breeding studies.
Higher density genetic maps use restriction sites and gene
localization probes as landmarks.

Genetic maps of chromosomes start with ordinal distance based on recombination
frequency between visible markers.
Low density cytogenetic (cytological, ideogram) maps are based on the location
of markers within or near microscopically visible cytological features.

Low density genetic maps of chromosomes are based on recombination frequency
between visible markers.
Cytogenetic (cytological, ideogram) maps are based on the location of
markers within or near cytological features.

High density maps integrate cytological and physical maps.

Anchor markers correlate the cytogenetic maps to the physical maps. Chromosome fragments
can be identified by migration pattern (RFLP), or tagged using PCR to amplify short,
unique (200-500 bp) Sequence Tagged Sites (STS), short cDNA sequence probes
(Expressed Sequence Tags or ESTs), or Short (or Simple) Sequence Length
Polymorphisms (SSLPs, short repetitive elements). These tagged fragments of known
sequence can be related to the cytogenetic map by hybridizing a complementary probe.
Physical maps are measured in base pairs, kilobase pairs or megabase pairs. They often
show the location of overlapping genomic fragment clones (contigs) and unique
sequences (STS).

Human chromosome 1 fig 4-20


1996-7: Anchoring the physical map:
2335 microsatellite (SSLP) sites, 16,000 STS-marked
loci and RFLPs used to map 1600 human genes

Ordered Clone by Clone method (map first, sequence later):
Screen large clones (BACs) from a chromosome
library for known sites (restriction
sites, known genes, or other sequence)
to anchor the map.
HindIII digest, then agarose gel
electrophoresis of the fragmented clone:
stain, and characterize fragments by
migration distance.
Share a partial fragment? = overlap
(different clones share sequence).
Use the overlap to orient the clone
fragments into a map.

Physical maps are built by reconstructing the order of fragments cut by restriction enzymes. The
first cloning vector is usually a YAC. For example, five YACs were known to hybridize to one
chromosome band (17q2). A restriction enzyme cutting an 8-base palindromic sequence, which has a
low probability of occurring (on average once every 4^8 = 65,536 bases, roughly every 66,000 bases), was used to cut the chromosome
fragment. The fragments were denatured, and the 5 single-stranded radioactive YACs were
hybridized to blots of the digest to visualize the target chromosome fragments. The
autoradiogram is below. Order of band fragments?

The exposed photographic image of lanes (columns) corresponding to the same
chromosome DNA tested with 5 different YAC probes.



band map of
RFLP fragments
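
The cutting frequency quoted for the 8-base restriction site can be checked with a few lines of arithmetic. This is only a sketch: it assumes the four bases are equally frequent and independent, which real genomes only approximate, and the function names are made up for illustration.

```python
def expected_site_spacing(site_length: int, alphabet_size: int = 4) -> int:
    """Average number of bases between occurrences of one specific site.

    Assumption: every base is equally likely and positions are independent,
    so a given site of length n occurs once every 4**n bases on average.
    """
    return alphabet_size ** site_length


def expected_fragment_count(region_length_bp: int, site_length: int) -> float:
    """Rough number of cuts expected in a region of the given length."""
    return region_length_bp / expected_site_spacing(site_length)


if __name__ == "__main__":
    print(expected_site_spacing(8))   # 8-base cutter: 4**8 = 65536
    print(expected_site_spacing(6))   # 6-base cutter such as HindIII: 4**6 = 4096
```

A rare (8-base) cutter therefore gives a small number of large fragments, which is what makes the band pattern on the blot readable.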



Map first, sequence later: clone by clone,
or ordered clone approach (public).
Minimum tiling path: the fewest clones
necessary to get a complete sequence.
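
One way to picture the minimum tiling path is as an interval-covering problem. The sketch below is an illustration, not the procedure used by the Human Genome Project: it assumes each clone already has start and end coordinates on the physical map, and the clone names and coordinates are hypothetical.

```python
def minimum_tiling_path(clones, region_start, region_end):
    """Pick the fewest clones that cover [region_start, region_end].

    clones: list of (name, start, end) tuples in map coordinates.
    Greedy rule: among clones that start at or before the covered point,
    take the one that reaches furthest to the right.
    Raises ValueError if the clones leave a gap.
    """
    ordered = sorted(clones, key=lambda clone: clone[1])
    path = []
    covered_to = region_start
    i = 0
    while covered_to < region_end:
        best = None
        while i < len(ordered) and ordered[i][1] <= covered_to:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= covered_to:
            raise ValueError(f"gap in clone coverage at position {covered_to}")
        path.append(best[0])
        covered_to = best[2]
    return path


if __name__ == "__main__":
    # Hypothetical BAC clones with made-up coordinates (kb).
    bacs = [("A", 0, 100), ("B", 50, 120), ("C", 90, 200),
            ("D", 110, 180), ("E", 190, 300)]
    print(minimum_tiling_path(bacs, 0, 300))  # ['A', 'C', 'E']
```

Clones B and D are redundant here: they overlap their neighbours but add no new sequence, so they are left out of the path.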

Whole genome Shotgun method

Small insert clones are prepared directly from DNA and sequenced.
Overlap: the overlap is determined directly by sequence, not
indirectly by fragment pattern.
Most of the genome is sequenced many (10-15) times to get the
correct overlap.
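
The "10-15 times" figure is a fold coverage: total bases read divided by genome size. The sketch below shows that arithmetic. The second function rests on an added assumption that is not in the lecture: reads land at random, so the number of reads covering a base is roughly Poisson distributed and the fraction of bases never read is about e^(-coverage). The read count and read length are illustrative values.

```python
import math


def fold_coverage(n_reads: int, read_length_bp: int, genome_size_bp: int) -> float:
    """Average number of times each base is sequenced."""
    return n_reads * read_length_bp / genome_size_bp


def fraction_unsequenced(coverage: float) -> float:
    """Approximate fraction of bases hit by no read.

    Assumption (not from the lecture): reads are placed uniformly at random,
    giving a Poisson model with P(zero reads) = exp(-coverage).
    """
    return math.exp(-coverage)


if __name__ == "__main__":
    # Illustrative numbers: 58.2 million reads of 500 bp on a 2.91 Gbp genome.
    c = fold_coverage(58_200_000, 500, 2_910_000_000)
    print(c)                        # 10.0
    print(fraction_unsequenced(c))  # a very small fraction at 10-fold coverage
```

Under this model high coverage leaves very few random gaps, so the gaps that remain in practice tend to come from repeats and unclonable regions rather than from chance.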

Where do the fragments overlap ?

(2) Paired end reads: each clone insert is primed from
the two ends of the vector, whose sequence is known,
and the intervening sequence is amplified by PCR,
producing an end-tagged linear sequence.
End-tagged, multiple inserts may then be
overlapped to produce a sequence contig.



The overlap of sequences in regions of identity can be used to make
contiguous sequences (contigs).
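
Joining two reads through a region of identity can be sketched in a few lines. This is a toy version under stated assumptions: reads are error-free, both come from the same strand, and only exact suffix-to-prefix overlaps count. Real assemblers tolerate mismatches and handle both strands; the reads below are made up.

```python
def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 if no overlap of at least `min_overlap` bases exists
    (min_overlap is expected to be 1 or more).
    """
    longest_possible = min(len(left), len(right))
    for k in range(longest_possible, min_overlap - 1, -1):
        if left[-k:] == right[:k]:
            return k
    return 0


def merge_reads(left: str, right: str, min_overlap: int = 3):
    """Merge two reads into one contig, or return None if they do not overlap."""
    k = overlap_length(left, right, min_overlap)
    if k == 0:
        return None
    return left + right[k:]


if __name__ == "__main__":
    print(merge_reads("ATGCGTAC", "GTACCTTA"))  # ATGCGTACCTTA
```

A repeated sequence breaks this logic: if the same stretch occurs at two places in the genome, reads from both places overlap equally well, which is the repetitive-sequence problem noted on the next slide.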





Using clones containing
fragments of different sizes
(different restriction enzymes)
there will be overlap

Problem? Repetitive sequence gaps,



Problem? Repetitive sequence gaps,

assembles the
final draft

Sequencing DNA fragments

DNA SEQUENCING - Sanger Method: A cloned fragment of DNA
is sequenced by:
(1) using a specific primer (an oligonucleotide) to replicate the
DNA from a known, pre-defined starting point;
(2) adding a spike of radioactive dideoxynucleotides (ddATP, ddCTP,
ddGTP or ddTTP, one per reaction) to an excess of normal dATP, dCTP,
dGTP and dTTP plus DNA polymerase.
The ddNTPs are randomly
incorporated and terminate
strand elongation;
(3) electrophoresis to separate and visualize the DNA fragments;
(4) reading the fragment order (by size);
(5) reconstructing the complementary original sequence.
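
Steps (2) to (5) can be mimicked with an idealized simulation. It is a sketch under simplifying assumptions: termination happens at every position in at least one molecule, fragment lengths are counted from the end of the primer, and the template is written in the order the polymerase reads it. The function names and the example template are made up.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def sanger_lanes(template: str) -> dict:
    """Fragment lengths seen in each ddNTP reaction (one gel lane per base).

    `template` is given in the order the polymerase reads it; the new strand
    is its base-by-base complement. A fragment of length n ends with the
    ddNTP that was incorporated at position n of the new strand.
    """
    new_strand = "".join(COMPLEMENT[base] for base in template)
    lanes = {base: [] for base in "ACGT"}
    for length, base in enumerate(new_strand, start=1):
        lanes[base].append(length)
    return lanes


def read_gel(lanes: dict) -> str:
    """Read the new strand from shortest fragment to longest."""
    base_at_length = {n: base for base, lengths in lanes.items() for n in lengths}
    return "".join(base_at_length[n] for n in sorted(base_at_length))


def reconstruct_template(new_strand: str) -> str:
    """Step (5): complement the read-out to recover the original template."""
    return "".join(COMPLEMENT[base] for base in new_strand)


if __name__ == "__main__":
    lanes = sanger_lanes("TACGGT")
    print(lanes)                                  # which lengths appear in which lane
    print(read_gel(lanes))                        # ATGCCA
    print(reconstruct_template(read_gel(lanes)))  # TACGGT
```

The gel is read bottom to top (short to long), and the lane in which each band appears names the base at that position.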


Figure labels: cloned (PCR or recombinant), denatured template fragment;
part of the vector adjacent to where the DNA is ligated;
+ excess, normal

Automated Sequencing uses fluorescent tags for
each ddNTP reaction.

The sequencing reaction
can be done in a tube, and
it is read by a light

Pyrosequencing requires single-stranded DNA (template), a DNA
primer, DNA polymerase and dNTPs. Read the sequence by the
chemiluminescence, powered by ATP produced by ATP sulfurylase.
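
How a light signal turns into sequence can be sketched as follows. The sketch assumes the usual pyrosequencing setup, which the slide does not spell out: the four dNTPs are added one at a time in a fixed repeating order, and the light from each addition is proportional to the number of bases incorporated. The dispensing order and the example strand are illustrative.

```python
def flowgram(new_strand: str, order: str = "ACGT") -> list:
    """Simulate idealized light signals for the strand being synthesized.

    Returns a list of (dispensed_base, signal) pairs, where the signal is the
    number of identical bases incorporated in that flow (0 means no light).
    """
    if any(base not in order for base in new_strand):
        raise ValueError("strand contains a base that is never dispensed")
    flows = []
    position = 0
    flow_index = 0
    while position < len(new_strand):
        base = order[flow_index % len(order)]
        run = 0
        while position < len(new_strand) and new_strand[position] == base:
            run += 1
            position += 1
        flows.append((base, run))
        flow_index += 1
    return flows


def decode(flows: list) -> str:
    """Turn the flowgram back into sequence."""
    return "".join(base * signal for base, signal in flows)


if __name__ == "__main__":
    flows = flowgram("GGATC")
    print(flows)
    print(decode(flows))  # GGATC
```

A run of identical bases (GG here) gives one flow with a double-height signal rather than two separate peaks.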


The basic techniques for sequencing entire genomes:

(1) libraries (whole genome) (2) cloning vectors (3) PCR (4) DNA
sequencing machines (5) chromosome maps (6) computers


The DNA sequence is the basis for computer-assisted analyses

Structural genomics involves the analysis of gene
sequence, gene number, gene order and the
physical nature of chromosomes.
Comparative genomics - similarity and divergence among genes with a
similar function in different species.
Bioinformatics is the use of computer analysis for
structural or functional genomics.
Proteomics - the study of all the proteins of an organism.
Transcriptomics - transcript studies.
Functional genomics studies the function of genes:
gene expression, interactions between genes and proteins, and
interactions between proteins.



Genomics is the study of genomes in their entirety.

Most bacteria are now known by their sequence or
partial sequence; viruses can be sequenced in a day or two.
Over 100 eukaryotic genomes have been sequenced, including:
Several fungi
Malaria parasite
Poplar tree
Many other species have a great deal of cDNA and gene sequence available,
especially ESTs - partial sequences of cDNA clones.

Structural Genomics - structure of the human genome

The human genome analysis was first published
in Feb 2001 in the journal Science.
The human genome has:
2.91 giga base pairs (Gbp), that is 2,910 million bp
26,383 annotated genes
39,000 annotated and hypothetical genes
(sequences that resemble genes but are unverified)
42% of the annotated genes are of unknown function
48% of base pairs are foreign DNA, including transposons
Average gene size: 27 kb
% of base pairs spanned by exons: 1.1 to 1.4%
% of base pairs spanned by introns: 24.4 to 36.4%
% of base pairs in intergenic regions: 75 to 64%
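
The percentages above are easier to grasp as base-pair counts. The short calculation below uses only the figures quoted on this slide; the helper name is made up, and exact fractions are used so rounding does not blur the result.

```python
from fractions import Fraction

GENOME_BP = 2_910_000_000   # 2.91 Gbp
ANNOTATED_GENES = 26_383
AVERAGE_GENE_BP = 27_000    # 27 kb


def percent_to_bp(percent: str, genome_bp: int = GENOME_BP) -> int:
    """Convert a percentage of the genome (given as text, e.g. "1.1") to base pairs."""
    return int(genome_bp * Fraction(percent) / 100)


if __name__ == "__main__":
    print(percent_to_bp("1.1"), percent_to_bp("1.4"))    # exons: 32010000 to 40740000 bp
    print(percent_to_bp("24.4"), percent_to_bp("36.4"))  # introns, low and high estimate

    # Cross-check: gene count times average gene size, as a share of the genome.
    gene_span_bp = ANNOTATED_GENES * AVERAGE_GENE_BP
    print(gene_span_bp, round(100 * gene_span_bp / GENOME_BP, 1))  # 712341000 24.5
```

The cross-check gives about 24.5% of the genome inside annotated genes, close to the lower estimate for exons plus introns (1.1% + 24.4%), so the slide's figures hang together.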


Comparative genomics
Genes in different species have high sequence similarity.
E.g. alpha-tubulin amino acid sequence, compared between:
Human - Mouse
Human - Barley
Mouse - Barley
Barley - Maize
Barley - Arabidopsis
The gene sequence in one species can help find similar genes in
another species.
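
Sequence similarity between two species is usually reported as percent identity over an alignment. The sketch below assumes the two sequences are already aligned to the same length, with "-" marking gaps. The short peptide strings are toy examples, not real alpha-tubulin data.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of aligned (non-gap) positions where the two residues match."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for residue_a, residue_b in zip(aligned_a, aligned_b):
        if residue_a == "-" or residue_b == "-":
            continue  # skip gap columns
        compared += 1
        if residue_a == residue_b:
            matches += 1
    if compared == 0:
        raise ValueError("no aligned positions to compare")
    return 100 * matches / compared


if __name__ == "__main__":
    # Toy 10-residue peptides differing at one position.
    print(percent_identity("MRECISIHVG", "MREVISIHVG"))  # 90.0
```

A table like the one on this slide is built by running such a comparison for each species pair.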

93% gene similarity - many mutations are rearrangements


Bioinformatics: (broadly) computational challenges
in biology or (narrowly) the information content of
the genome. A first objective is the
identification of binding sites or functional
gene annotation.


Bioinformatics (the information content) collates multiple sources of
information, including comparative genomics (BLAST search), cDNA
sequence and open reading frames (ORFs), to annotate a candidate gene sequence.
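
Of the three evidence sources, the ORF scan is the simplest to sketch. The version below is deliberately minimal and makes assumptions worth stating: it searches only the forward strand, uses the standard stop codons, requires an ATG start, and ignores introns, so it suits bacterial or cDNA sequence better than raw eukaryotic genomic DNA. The test sequence is made up.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(sequence: str, min_codons: int = 3) -> list:
    """Return (start, end) index pairs of ATG...stop ORFs on the forward strand.

    `end` is exclusive and includes the stop codon. `min_codons` counts
    codons from the ATG up to, but not including, the stop codon.
    """
    sequence = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(sequence) - 2, 3):
            codon = sequence[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs


if __name__ == "__main__":
    dna = "CCATGAAATTTGGGTAACC"
    for start, end in find_orfs(dna):
        print(start, end, dna[start:end])  # 2 17 ATGAAATTTGGGTAA
```

In annotation, an ORF found this way is only a candidate; a matching cDNA or a BLAST hit to a known gene in another species is what turns it into an annotated gene.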



A coding transcriptome (transcriptomics) represents that small percentage of the
genetic code that is (apparently) transcribed into RNA molecules,
estimated to be less than 5% of the genome in humans (Adams J. (2008)
Nature Education 1:1).
It now appears that the majority of the human genome is transcribed
(introns and intervening sequence), and the vast majority of sequences are non-protein-coding (Frith et al., 2005). The proportion of transcribed sequences
that are non-protein-coding appears to be greater in mammals than in
nematodes or Drosophila.
Frith et al., 2005. E.J.H.G. 13:894-897

The ENCODE PROJECT: identify and map all the transcribed regions of the human
genome, including regulatory regions, replication origins, DNA methylation and histone
methylation sites, etc.
The pilot project found, among other interesting findings:
- On average, there are 5.4 different transcripts per coding region; over half showed transcription
from both strands (exact reverse-direction complements).
- 63% of the mouse genome is transcribed; 1-2% has recognizable exons
- 41% span introns, 22% span intergenic regions.
- The majority of the human genome is transcribed
- Genome: islands of protein-coding sequence, interwoven and overlapping
transcription units spanning the genome
Gene definition? MicroRNA - regulatory gene?


Proteomics: the study of the proteome, the sequence and
expression of all proteins
proteins ?

Unknown fraction

splicing, RNA
editing, or
initiation and


Functional Genomics - study of expression patterns

Genomic sequencing has made possible a new approach to genetics called
functional genomics, which focuses on genome-wide patterns of gene
expression and the mechanisms by which gene expression is coordinated.

DNA microarray (or chip) - a flat surface about the size of a postage stamp
with up to 100,000 distinct spots, each containing a different immobilized
oligonucleotide DNA sequence, representing all the known genes in a genome or all the known
cDNAs from a genome, suitable for hybridization with DNA or RNA isolated from cells
growing under different conditions.



Every yeast gene was cloned, and samples are
spotted onto glass slides.

Relate: which genes are
active in the tissue

Binding (color) intensity indicates RNA concentration (gene activity).
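
Spot intensities are usually compared as a ratio between two samples rather than read as absolute values. The sketch below assumes a two-colour experiment (a test sample and a reference sample hybridized to the same spot) and uses a log2 ratio, a common convention. The gene names, intensities, pseudocount and the two-fold threshold are all illustrative choices, not values from the lecture.

```python
import math


def log2_ratio(sample: float, reference: float, pseudocount: float = 1.0) -> float:
    """log2 of sample/reference intensity; the pseudocount avoids division by zero."""
    return math.log2((sample + pseudocount) / (reference + pseudocount))


def call_expression(ratio: float, threshold: float = 1.0) -> str:
    """Label a spot as up, down or unchanged at a two-fold (log2 = 1) cutoff."""
    if ratio >= threshold:
        return "up"
    if ratio <= -threshold:
        return "down"
    return "unchanged"


if __name__ == "__main__":
    # Hypothetical spots: (gene, sample intensity, reference intensity).
    spots = [("geneA", 799, 199), ("geneB", 199, 799), ("geneC", 499, 499)]
    for gene, sample, reference in spots:
        ratio = log2_ratio(sample, reference)
        print(gene, ratio, call_expression(ratio))
```

On the log2 scale a four-fold increase and a four-fold decrease come out as +2 and -2, so induction and repression are treated symmetrically.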


DNA chips use synthetic DNA (oligonucleotides) that
can be spotted at a density of 10^6 per cm^2, so fragments
from all the genes of humans or other eukaryotic organisms can be
annealed to the chip.

Regulation of ~2500 genes changes during the first 2.75 hours of C.
Abundance relative to a non-dividing
Time: minutes before (-2) to minutes after (165) the 4 cell stage in
development (gastrulation)


A glossary of types of DNA sequence:

1. Full length cDNA - complement of the mRNA
2. Full length (eukaryotic) gene clone (exons, introns, flanking regions).
3. Restriction fragments, Restriction Fragment Length Polymorphisms
4. SNPs - Single Nucleotide Polymorphism(s)
5. PCR clone - partial gene sequence
6. Large genomic clones: BACs or YACs may have the sequence of
many adjacent genes
7. Satellite DNA - mid and highly repetitive DNA including VNTRs,
mini and microsatellite DNA
8. STS (sequence tagged sites) - short unique sequences used to hybridize
to chromosomes
9. Expressed sequence tags (ESTs) - partial sequences of cDNAs used as
probes to ID chromosome locations, correlate RFLP and cytological maps
10. SSLP (short (simple) sequence length polymorphisms) - short
repetitive sequences used to anchor a map