
DNA sequencing: methods

I. Brief history of sequencing


II. Sanger dideoxy method for sequencing
III. Sequencing large pieces of DNA
IV. The “$1,000 genome”

On WebCT
-- “The $1000 genome”
-- review of new sequencing techniques by George Church
Why sequence DNA?

• All genes available for an organism to use -- a very
important tool for biologists
• Not just sequence of genes, but also positioning of
genes and sequences of regulatory regions

• New recombinant DNA constructs must be sequenced
to verify construction or positions of mutations
• Etc.
History of DNA sequencing

MC chapter 12
Methods of sequencing

A. Sanger dideoxy (primer extension/chain-termination) method:
most popular protocol for sequencing, very adaptable, scalable
to large sequencing projects

B. Maxam-Gilbert chemical cleavage method: DNA is labelled
and then chemically cleaved in a sequence-dependent manner.
This method is not easily scaled and is rather tedious

C. Pyrosequencing: measuring chain extension by pyrophosphate
monitoring
For dideoxy sequencing you need:

1) Single-stranded DNA template
2) A primer for DNA synthesis
3) DNA polymerase
4) Deoxynucleoside triphosphates (dNTPs) and
dideoxynucleoside triphosphates (ddNTPs)
Primers for DNA sequencing

• Oligonucleotide primers can be synthesized by
phosphoramidite chemistry--usually designed
manually and then purchased

• Sequence of the oligo must be complementary to
DNA flanking sequenced region

• Oligos are usually 15-30 nucleotides in length
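The primer rules above can be sketched in a few lines of Python. This is only an illustration: the function names, the 20-nt default length, and the Wallace rule-of-thumb melting temperature (2 °C per A/T, 4 °C per G/C, a rough estimate for short oligos) are assumptions added here, not part of the lecture.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    pairs = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(pairs[base] for base in reversed(seq.upper()))


def design_primer(template_flank: str, length: int = 20) -> str:
    """Primer (written 5'->3') that anneals to the 3' end of a template flank.

    template_flank is the single-stranded template written 5'->3'.
    The primer is the reverse complement of its last `length` bases.
    """
    if not 15 <= length <= 30:
        raise ValueError("primers are usually 15-30 nt long")
    if len(template_flank) < length:
        raise ValueError("flanking sequence is shorter than the primer")
    return reverse_complement(template_flank[-length:])


def wallace_tm(primer: str) -> int:
    """Rough melting temperature in deg C (Wallace rule, an approximation)."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


if __name__ == "__main__":
    flank = "AAAACCCCGGGGTTTTACGTACGT"  # hypothetical flanking sequence
    primer = design_primer(flank, length=15)
    print(primer, wallace_tm(primer))
```

In practice the primer would also be checked for self-complementarity and for secondary binding sites on the template; that is outside this sketch.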


DNA templates for sequencing:

• Single stranded DNA isolated from recombinant
M13 bacteriophage containing DNA of interest
• Double-stranded DNA that has been denatured
• Non-denatured double stranded DNA (cycle
sequencing)
One way for obtaining single-stranded DNA from a double
stranded source--magnets
Reagents for sequencing: DNA
polymerases

• Should be highly processive, and incorporate
ddNTPs efficiently

• Should lack exonuclease activity

• Thermostability required for “cycle
sequencing”
Sanger dideoxy sequencing--basic method

3’ Single stranded DNA 5’

5’ 3’

a) Anneal the primer


Sanger dideoxy sequencing: basic method
b) Extend the primer with DNA polymerase in the
presence of all four dNTPs, with a limited amount of a
dideoxy NTP (ddNTP)

[Figure: primer being extended 5’ to 3’, with the direction of DNA polymerase travel marked]

DNA polymerase incorporates ddNTP in a template-dependent
manner, but it works best if the DNA pol lacks
3’ to 5’ exonuclease (proofreading) activity
Sanger dideoxy sequencing: basic method

3’ T TT T 5’ (template)
5’ 3’ (primer)

ddATP in the reaction: anywhere there’s a T in the template strand,
occasionally a ddA will be added to the growing strand

[Figure: four extension products of different lengths, each ending in ddA]
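Chain termination can be mimicked in a short Python sketch. It assumes a toy template written 3’ to 5’ (as on the slide) with the primer annealed to its first few bases. The function names and the example template are hypothetical. Each ddNTP reaction yields fragments that end wherever the template carries the complementary base, and sorting all fragments by length reads out the new strand.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def termination_lengths(template_3to5: str, primer_len: int, ddntp: str) -> list[int]:
    """Lengths (primer included) of products terminated by one ddNTP.

    The polymerase reads the template 3'->5' starting just past the
    primer; a ddNTP can be incorporated wherever the template base is
    its complement.
    """
    target = COMPLEMENT[ddntp]
    return [
        pos + 1
        for pos in range(primer_len, len(template_3to5))
        if template_3to5[pos] == target
    ]


def read_ladder(template_3to5: str, primer_len: int) -> str:
    """Reconstruct the synthesized strand from four separate reactions."""
    bands = []
    for ddntp in "ACGT":
        for length in termination_lengths(template_3to5, primer_len, ddntp):
            bands.append((length, ddntp))
    # Shortest fragment first, as when reading a gel from the bottom up.
    return "".join(base for _, base in sorted(bands))


if __name__ == "__main__":
    template = "CCGGTACTTGAT"  # hypothetical, written 3'->5'
    print(termination_lengths(template, 4, "A"))  # ddA products
    print(read_ladder(template, 4))
```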
How to visualize DNA fragments?

• Radioactivity
– Radiolabeled primers (kinase with 32P)
– Radiolabelled dNTPs (alpha 35S or 32P)

• Fluorescence
– ddNTPs chemically synthesized to contain fluors
– Each ddNTP fluoresces at a different wavelength
allowing identification
Analysis of sequencing products:

Polyacrylamide gel electrophoresis--good resolution
of fragments differing by a single nucleotide
– Slab gels: as previously described
– Capillary gels: require only a tiny amount of
sample to be loaded, run much faster than slab
gels, best for high throughput sequencing
DNA sequencing gels: old school
Different ddNTP used in
separate reactions
Analyze sequencing
products by gel
electrophoresis,
autoradiography

Radioactively labelled primer or dNTP
in sequencing reaction
Cycle sequencing: denaturation occurs
during temperature cycles
94°C: DNA denatures

45°C: primer anneals

60-72°C: thermostable DNA
pol extends primer

Repeat 25-35 times

Advantages: don’t need a lot of
template DNA

Disadvantages: DNA pol may
incorporate ddNTPs poorly
Animation of cycle sequencing: see
http://www.dnai.org/

Click on:
“manipulation”
“techniques”
“sorting and sequencing”
An automated sequencer

The output
Current trends in sequencing:
It is rare for labs to do their own sequencing:
--costly, perishable reagents
--time consuming
--success rate varies

Instead most labs send out for sequencing:

--You prepare the DNA (usually plasmid, M13, or PCR product),
supply the primer, company or university sequencing center does the
rest

--The sequence is recorded by an automated sequencer as an
“electropherogram”
BREAK UP THE GENOME,
PUT IT BACK TOGETHER

[Figure: a ~160 kbp BAC is broken into ~1 kbp pieces; the pieces are sequenced and assembled by matching overlaps to give the BAC sequence; BAC overlaps then give the genome sequence]
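The idea of assembling by matching overlaps can be shown with a toy greedy merger in Python. This is a deliberately simplified sketch: the function names, the minimum overlap of 3 bases, and the example reads are all hypothetical. Real assemblers must also cope with sequencing errors, both strands, and repeats.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if the best overlap is shorter than `min_len`.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [
            c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)
        ] + [merged]


if __name__ == "__main__":
    reads = ["ATGGCGT", "GCGTACC", "TACCTTA"]  # hypothetical reads
    print(greedy_assemble(reads))
```

Greedy merging can go wrong when the same sequence occurs twice in the genome, which is one reason mapping information is used to order and check contigs.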


Sequencing large pieces of DNA:
the “shotgun” method
• Break DNA into small pieces (around 1000 base pairs is
typically preferable)
• Clone pieces of DNA into M13
• Sequence enough M13 clones to ensure complete coverage
(e.g., sequencing a 3 million base pair genome would require
reading 5x to 10x 3 million base pairs, i.e. 15 to 30 million
base pairs, to have a reliable representation of the genome)
• Assemble genome through overlap analysis using computer
algorithms, also “polish” sequences using mapping
information from individual clones, characterized genes, and
genetic markers
• This process is assisted by robotics
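The coverage arithmetic behind the shotgun method can be checked with a small Python sketch. The 500-base read length is an assumed figure for illustration. The fraction-covered formula is the standard Poisson (Lander-Waterman) approximation, which assumes reads land uniformly at random. It is added here as background and is not taken from the slides.

```python
import math


def reads_needed(genome_bp: int, read_bp: int, coverage: float) -> int:
    """Number of reads required to reach a given fold coverage."""
    return math.ceil(coverage * genome_bp / read_bp)


def expected_fraction_covered(coverage: float) -> float:
    """Expected fraction of bases hit at least once (Poisson approximation)."""
    return 1.0 - math.exp(-coverage)


if __name__ == "__main__":
    genome = 3_000_000  # the 3 million base pair example above
    for fold in (5, 8, 10):
        print(
            fold,
            reads_needed(genome, read_bp=500, coverage=fold),
            round(expected_fraction_covered(fold), 5),
        )
```

Under these assumptions even 5x coverage leaves a fraction of a percent of bases unsampled, which is why gaps remain and have to be closed separately.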
Sequencing done by TIGR (Maryland) and The Sanger
Institute (Cambridge, UK)

“Here we report an analysis of the genome sequence of P.
falciparum clone 3D7, including descriptions of chromosome
structure, gene content, functional classification of proteins,
metabolism and transport, and other features of parasite biology.”
Sequencing strategy
A whole chromosome shotgun sequencing strategy
was used to determine the genome sequence of P.
falciparum clone 3D7. This approach was taken because a
whole genome shotgun strategy was not feasible or cost-
effective with the technology that was available at the
beginning of the project. Also, high-quality large insert
libraries of (A + T)-rich P. falciparum DNA have never
been constructed in Escherichia coli, which ruled out a
clone-by-clone sequencing strategy. The chromosomes
were separated on pulsed field gels, and chromosomal
DNA was extracted…
The shotgun sequences were assembled into contiguous
DNA sequences (contigs), in some cases with low coverage
shotgun sequences of yeast artificial chromosome (YAC) clones
to assist in the ordering of contigs for closure. Sequence tagged
sites (STSs) [10], microsatellite markers [11,12] and HAPPY mapping [7]
were also used to place and orient contigs during the gap
closure process. The high (A + T) content of the genome made
gap closure extremely difficult [7–9].

Chromosomes 1–5, 9 and 12 were closed, whereas
chromosomes 6–8, 10, 11, 13 and 14 contained 3–37 gaps (most
less than 2.5 kb) per chromosome at the beginning of genome
annotation. Efforts to close the remaining gaps are continuing.
Methods: Sequencing, gap closure and annotation
The techniques used at each of the three participating centres
for sequencing, closure and annotation are described in the
accompanying Letters [7–9]. To ensure that each centre’s annotation
procedures produced roughly equivalent results, the Wellcome Trust
Sanger Institute (‘Sanger’) and the Institute for Genomic Research
(‘TIGR’) annotated the same 100-kb segment of chromosome 14. The
number of genes predicted in this sequence by the two centres was 22
and 23; the discrepancy being due to the merging of two single genes
by one centre. Of the 74 exons predicted by the two centres, 50 (68%)
were identical, 9 (12%) overlapped, 6 (8%) overlapped and shared one
boundary, and the remainder were predicted by one centre but not the
other. Thus 88% of the exons predicted by the two centres in the 100-
kb fragment were identical or overlapped.
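The percentages in the exon comparison can be recomputed from the counts quoted in the passage. This small Python check uses only those counts; the variable names are chosen here for illustration.

```python
total_exons = 74
counts = {
    "identical": 50,
    "overlapped": 9,
    "overlapped and shared one boundary": 6,
}

# Percentage of all predicted exons in each category, rounded as in the text.
percent = {label: round(100 * n / total_exons) for label, n in counts.items()}

# Exons predicted by one centre only, and the combined concordance.
only_one_centre = total_exons - sum(counts.values())
identical_or_overlapping = round(100 * sum(counts.values()) / total_exons)

if __name__ == "__main__":
    print(percent)
    print(only_one_centre, identical_or_overlapping)
```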
The $1000 genome
Venter Foundation (2003): The first group to produce a technology
capable of a $1000 human genome will win $500,000 …

X - Prize Foundation: no, $5 - 20 million …

National Institutes of Health (2004): $70 million grant program to
reach the $1000 genome
Previous sequencing techniques: one DNA molecule at a time
Needed: many DNA molecules at a time -- arrays

One of these: “pyrosequencing”

Cut a genome into DNA fragments 300 - 500 bases long

Immobilize single strands on a very small plastic bead (one piece of
DNA per bead)

Amplify the DNA on each bead so that copies cover the bead, boosting the signal

Separate each bead on a plate with up to 1.6 million wells

Sequence by DNA polymerase-dependent chain extension, one base
at a time in the presence of a reporter (luciferase)

Luciferase is an enzyme that emits light in response to the ATP that
ATP sulfurylase makes from the pyrophosphate (PPi) released upon
nucleotide addition by DNA polymerase

Flashes of light and their intensity are recorded

Extension with individual dNTPs gives a readout

The readout is recorded by a
detector that measures position
of light flashes and intensity of
light flashes

25 million bases in
about 4 hours

From www.454.com
APS = Adenosine phosphosulfate
Height of peak indicates the number of
dNTPs added

This sequence: TTTGGGGTTGCAGTT
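How peak heights translate into sequence can be sketched in Python. The sketch assumes an idealized, noise-free instrument and a repeating dispensation order of T, A, C, G. That order and the function names are assumptions for illustration, not the actual 454 protocol. Each flow gives a signal proportional to the length of the homopolymer run that the dispensed nucleotide extends.

```python
from itertools import cycle


def simulate_flows(synthesized: str, flow_order: str = "TACG") -> list[tuple[str, int]]:
    """Idealized (nucleotide, peak height) pairs for a synthesized strand."""
    if set(synthesized) - set(flow_order):
        raise ValueError("sequence contains a base that is never dispensed")
    signals = []
    pos = 0
    flows = cycle(flow_order)
    while pos < len(synthesized):
        base = next(flows)
        run = 0
        # One flow extends through the whole run of identical bases.
        while pos < len(synthesized) and synthesized[pos] == base:
            run += 1
            pos += 1
        signals.append((base, run))
    return signals


def call_bases(signals) -> str:
    """Turn peak heights back into sequence (height = number of bases added)."""
    return "".join(base * round(height) for base, height in signals)


if __name__ == "__main__":
    flows = simulate_flows("TTTGGGGTTGCAGTT")
    print(flows)
    print(call_bases(flows))
```

With real data the peak heights are noisy, so long homopolymer runs are the hardest part to call correctly.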


Introduction to bioinformatics
1) Making biological sense of DNA
sequences
2) Online databases: a brief survey
3) Database in depth: NCBI
4) What is BLAST?
5) Using BLAST for sequence analysis
6) “Biology workbench”, etc.

www.ncbi.nlm.nih.gov
www.tigr.org
http://workbench.sdsc.edu
There’s plenty of DNA to make sense of
http://www.genomesonline.org/

(2006)
Making sense of genome sequences:
1) Genes

a) Protein-coding
• Where are the open reading frames?
• What are the ORFs most similar to? (What is the
function/structure/evolution history?)

b) RNA

2) Non-genes

• Regulation: promoters and factor-binding sites
• Transactions: replication, repair, and segregation,
DNA packaging (nucleosomes)
Sequence output
Raw data

Computer calls
GNNTNNTGTGNCGGATACAATTCCCCTCTAGAAATAATTTTGTTTAACTTTAAGAAGGAGATATACATATGCACCACCAC
CACCACCACCCCATGGGTATGAATAAGCAAAAGGTTTGTCCTGCTTGTGAATCTGCGGAACTTATTTATGATCCAGAAAG
GGGGGAAATAGTCTGTGCCAAGTGCGGTTATGTAATAGAAGAGAACATAATTGATATGGGTCCTAAGTGGCGTGCTTTTG
ATGCTTCTCAAAGGGAACGCAGGTCTAGAACTGGTGCACCAGAAAGTATTCTTCTTCATGACAAGGGGCTTTCAACTGCA
ATTGGAATTGACAGATCGCTTTCCGGATTAATGAGAGAGAAGATGTACCGTTTGAGGAAGTGGCANTCCANATTANGAGT
TAGTGATGCAGCANANAGGAACCTAGCTTTTGCCCTAAGTGAGTTGGATAGAATTNCTGCTCAGTTAAAACTTCCNNGAC
ATGTAGAGGAAGAAGCTGCAANGCTGNACANAGANGCAGNGNGANAGGGACTTATTNGANGCAGATCTATTGAGAGCGTT
ATGGCGGCANGTGTTTACCCTGCTTGTAGGTTATTAAAAGNTCCCGGGACTCTGGATGAGATTGCTGATATTGCTAGAGC
atgttgtatttgtctgaagaaaataaatccgtatccactccttgccctcctg
ataagattatctttgatgcagagaggggggagtacatttgctctgaaact
ggagaagttttagaagataaaattatagatcaagggccagagtggagg
gccttcacgccagaggagaaagaaaagagaagcagagttggagggc
ctttaaacaatactattcacgataggggtttatccactcttatagactggaa
agataaggatgctatgggaagaactttagaccctaagagaagacttga
ggcattgagatggagaaagtggcaaattaga

What does this sequence do?

Could it encode a protein?


Looking for ORFs (Open Reading Frames)
using “DNA Strider”
ORF map
1) Where are the potential starts (ATG) and
stops (TAA, TAG, TGA)?
2) Which reading frame is correct?

[Figure legend: symbols mark ATG and stop codons in each reading frame]
Reading frame #1 appears to encode a protein


Cautions in ORF identification
• Not all genes initiate with ATG, particularly in certain
microbes (archaea)
• What is the shortest possible length of a real ORF? 50 amino
acids? 25 amino acids? Cut-off is somewhat arbitrary.
• In eukaryotes, ORFs can be difficult to identify because of
introns

• Are there other sequences surrounding the ORF that indicate
it might be functional?
– promoter sequences for RNA polymerase binding
– Shine-Dalgarno sequences for ribosome binding
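An ORF scan of the kind a program such as DNA Strider performs can be sketched in Python. This is not that program's algorithm; it is a minimal illustration with assumed names. It searches only the three forward frames, treats ATG as the only start codon, and applies an arbitrary minimum length, so it inherits every caution listed above.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 25) -> list[tuple[int, int, int]]:
    """Find ATG...stop ORFs in the three forward reading frames.

    Returns (frame, start, end) tuples: frame is 1-3, start is the
    0-based index of the A in ATG, end is one past the stop codon.
    `min_codons` counts codons from the ATG up to (not including) the stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG after the previous stop
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((frame + 1, start, i + 3))
                start = None
    return orfs


if __name__ == "__main__":
    # Hypothetical test sequence: 2-base offset, ATG, 30 codons, stop.
    toy = "CC" + "ATG" + "GCT" * 30 + "TAA" + "G"
    print(find_orfs(toy))
```

To cover all six frames, the same function can be run on the reverse complement of the sequence.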
What is the function of
the sequenced gene?
Classical methods:
-- mutate gene, characterize phenotype for clues to function
(genetics)
-- purify protein product, characterize in vitro (biochemistry)

Comparison to previously characterized genes:
-- gene sequences that have high sequence similarity
usually have similar functions
-- if your gene has been previously characterized (using
classical methods) by someone else, you want to know
right away! (avoid duplication of labor)
NCBI
NCBI home page --Go to www.ncbi.nlm.nih.gov for the following
pages

PubMed: search tool for literature--search by author, subject, title words,
etc.
All databases: “a retrieval system for searching several linked databases”
BLAST: Basic Local Alignment Search Tool
OMIM: Online Mendelian Inheritance in Man
Books: many online textbooks available
Tax Browser: A taxonomic organization of organisms and their genomes
Structure: Clearinghouse for solved molecular structures
What does BLAST do?
1) Searches chosen sequence database and
identifies sequences with similarity to test
sequence
2) Ranks similar sequences by the statistical
significance of the match (E value)
3) Illustrates alignment between test sequence
and similar sequences
Alignment of sequences:

The principle: two homologous sequences derived from the same
ancestral sequence will have at least some identical (similar)
amino acid residues

Fraction of identical amino acids is called “percent identity”

Similar amino acids: some amino acids have similar
physical/chemical properties, and are more likely to substitute for each
other--these give specific similarity scores in alignments

Gaps in similar/homologous sequences are rare, and are given
penalty scores
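The scoring ideas above (identity, similarity, gap penalties) can be made concrete with a Python sketch that scores two sequences that are already aligned. Everything numeric here is an assumption for illustration: the similarity groups and the +2 / +1 / -1 / -4 values are a toy scheme, not a real substitution matrix such as BLOSUM62. Percent identity is computed over columns without gaps, which is only one of several conventions.

```python
# Toy groups of amino acids with similar properties (illustrative only).
SIMILAR_GROUPS = [set("ILVM"), set("KRH"), set("DE"), set("ST"), set("FYW"), set("NQ")]


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of ungapped alignment columns holding identical residues."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b) if "-" not in (x, y)]
    matches = sum(x == y for x, y in columns)
    return 100.0 * matches / len(columns)


def alignment_score(aligned_a: str, aligned_b: str,
                    match: int = 2, similar: int = 1,
                    mismatch: int = -1, gap: int = -4) -> int:
    """Sum of per-column scores under the toy scheme described above."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    for x, y in zip(aligned_a, aligned_b):
        if "-" in (x, y):
            score += gap
        elif x == y:
            score += match
        elif any(x in group and y in group for group in SIMILAR_GROUPS):
            score += similar
        else:
            score += mismatch
    return score


if __name__ == "__main__":
    a, b = "MTKQKVCPVCGS", "MNKQKVCPACES"  # illustrative fragments
    print(percent_identity(a, b), alignment_score(a, b))
```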
Homology of proteins

Homology: similarity of biological structure, physiology,
development, and evolution, based on genetic inheritance

Homologous proteins: statistically similar sequence, therefore
similar functions (often, but not always…)

Alignment of TFB and TFIIB sequences
[Figure: multiple sequence alignment of Pho TFB1, Pab TFB, Pfu TFB1, Tko TFB1, Tko TFB2, Pfu TFB2, Pho TFB2, Sso TFB1, Sso TFB2 and Sce TFIIB, with a consensus line]
High sequence similarity correlates with functional similarity

enzymes

Non-enzymes

40-20% identity: fold can be predicted by similarity but precise
function cannot be predicted (the 40% rule)
Programs available for BLAST searches

Protein sequence (this is the best option)

blastp--compares an amino acid query sequence against a protein sequence
database

tblastn--compares a protein query sequence against a nucleotide sequence
database translated in all reading frames

DNA sequence
blastn--compares a nucleotide query sequence against a nucleotide sequence
database

blastx--compares a nucleotide query sequence translated in all reading frames
against a protein sequence database

tblastx--compares the six-frame translations of a nucleotide query sequence
against the six-frame translations of a nucleotide sequence database.
BLAST considers all possible combinations of
matches
mismatches
gaps
in any given alignment

Gives the “best” (highest scoring) alignment of sequences

Three scores
1) percent identity
2) similarity score
3) E-value--number of alignments with this score or better expected by
chance in a database of this size (lower number, higher probability
of evolutionary homology, higher probability of similar function)
What is the E-value?
The E value represents the chance that the similarity is random
and therefore insignificant. Essentially, the E value describes the
random background noise that exists for matches between
sequences. For example, an E value of 1 assigned to a hit can be
interpreted as meaning that in a database of the current size one
might expect to see 1 match with a similar score simply by
chance.

You can change the Expect value threshold on most main
BLAST search pages. When the Expect value is increased from
the default value of 10, a larger list with more low-scoring hits
can be reported.
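The dependence of the E value on score and database size can be illustrated with the standard relation between a normalized bit score S' and the expected number of chance hits, E = m × n × 2^(-S'). Here m is the query length and n the total database length. This Python sketch uses raw lengths, whereas BLAST itself applies edge-corrected "effective" lengths, so the numbers are approximate. The function names and example values are hypothetical.

```python
import math


def expect_value(bit_score: float, query_len: int, db_len: int) -> float:
    """Expected number of chance alignments scoring at least `bit_score`."""
    return query_len * db_len * 2.0 ** (-bit_score)


def p_value(e_value: float) -> float:
    """Probability of at least one chance hit, given the expected count."""
    return 1.0 - math.exp(-e_value)


if __name__ == "__main__":
    # Hypothetical search: 250-residue query against a 1e9-residue database.
    for score in (30, 40, 60):
        e = expect_value(score, query_len=250, db_len=10**9)
        print(score, e, p_value(e))
```

Doubling the database doubles E for the same alignment, which is why the same hit looks less significant as databases grow.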
E values (continued)

From the BLAST tutorial:

Although hits with E values much higher than 0.1 are
unlikely to reflect true sequence relatives, it is useful to
examine hits with lower significance (E values between
0.1 and 10) for short regions of similarity. In the absence
of longer similarities, these short regions may allow the
tentative assignment of biochemical activities to the ORF
in question. The significance of any such regions must be
assessed on a case by case basis.
Relationship between E-value and function

Single domain proteins

Multi-domain proteins

E value greater than 10^-10: similar structure but possibly different
functions
What does this sequence do? Cue up BLAST…..
Raw data

Computer calls
GNNTNNTGTGNCGGATACAATTCCCCTCTAGAAATAATTTTGTTTAACTTTAAGAAGGAGATATACATATGCACCACCAC
CACCACCACCCCATGGGTATGAATAAGCAAAAGGTTTGTCCTGCTTGTGAATCTGCGGAACTTATTTATGATCCAGAAAG
GGGGGAAATAGTCTGTGCCAAGTGCGGTTATGTAATAGAAGAGAACATAATTGATATGGGTCCTAAGTGGCGTGCTTTTG
ATGCTTCTCAAAGGGAACGCAGGTCTAGAACTGGTGCACCAGAAAGTATTCTTCTTCATGACAAGGGGCTTTCAACTGCA
ATTGGAATTGACAGATCGCTTTCCGGATTAATGAGAGAGAAGATGTACCGTTTGAGGAAGTGGCANTCCANATTANGAGT
TAGTGATGCAGCANANAGGAACCTAGCTTTTGCCCTAAGTGAGTTGGATAGAATTNCTGCTCAGTTAAAACTTCCNNGAC
ATGTAGAGGAAGAAGCTGCAANGCTGNACANAGANGCAGNGNGANAGGGACTTATTNGANGCAGATCTATTGAGAGCGTT
ATGGCGGCANGTGTTTACCCTGCTTGTAGGTTATTAAAAGNTCCCGGGACTCTGGATGAGATTGCTGATATTGCTAGAGC
Find the open reading frame(s)

Translate it:

MKCPYCKSRDLVYDRQHGEVFCKKCGSILATNLVDSEL
SRKTKTNDIPRYTKRIGEFTREKIYRLRKWQKKISSERNL
VLAMSELRRLSGMLKLPKYVEEEAAYLYREAAKRGLTR
RIPIETTVAACIYATCRLFKVPRTLNEIASYSKTEKKEIMK
AFRVIVRNLNLTPKMLLARPTDYVDKFADELELSERVRR
RTVDILRRANEEGITSGKNPLSLVAAALYIASLLEGERRS
QKEIARVTGVSEMTVRNRYKELA
BLAST against (go to genomes page):
-- Microbial genomes
-- environmental sequences (genomes)

Results:

1) Distribution of hits: query sequence and positions in sequence that
gave alignments

2) Sequences producing significant alignments
   a) Accession number (this takes you to the sequence that yielded
   the hit: gene or contig)
   b) Name of sequence (sometimes identifies the gene)
   c) Similarity score
   d) E-value

3) Alignments arranged by E value, with links to gene reports
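When BLAST is run locally with BLAST+ and tabular output (`-outfmt 6`), each hit is one tab-separated line with twelve default columns: query id, subject id, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, E value, and bit score. The Python sketch below filters and ranks such lines by E value. The hit names, numbers, and the 1e-10 cutoff are invented for illustration.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    subject: str
    percent_identity: float
    e_value: float
    bit_score: float


def parse_tabular(text: str) -> list[Hit]:
    """Parse BLAST+ `-outfmt 6` lines (default 12 columns)."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        hits.append(Hit(fields[1], float(fields[2]),
                        float(fields[10]), float(fields[11])))
    return hits


def significant(hits: list[Hit], max_e: float = 1e-10) -> list[Hit]:
    """Keep hits at or below the cutoff, best (lowest E value) first."""
    return sorted((h for h in hits if h.e_value <= max_e),
                  key=lambda h: h.e_value)


if __name__ == "__main__":
    # Hypothetical results for one query.
    rows = [
        ["query1", "hit_B", "41.0", "280", "150", "6", "5", "280", "9", "285", "3e-40", "160"],
        ["query1", "hit_A", "92.5", "300", "22", "1", "1", "300", "5", "304", "1e-150", "540"],
        ["query1", "hit_C", "28.0", "60", "40", "2", "10", "70", "3", "62", "2.5", "30"],
    ]
    report = "\n".join("\t".join(row) for row in rows)
    for hit in significant(parse_tabular(report)):
        print(hit)
```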


Two problems with BLAST
1) Homology? The function is only inferred (NOT known)
2) A large percentage of predicted proteins cannot be
assigned a function based on homology
For a current list of databases and bioinformatics tools
see: Nucleic Acids Research annual bioinformatics issue
(comes out every January).

List of all the databases described, by category:

http://www.oxfordjournals.org/nar/database/cap/

Guide to NCBI: see WebCT


Bioinformatics:
making sense of biological sequence

• New DNA sequences are analyzed for ORFs (Open
Reading Frames: protein)

• Any DNA or protein sequence can then be compared to
all other sequences in databases, and similar sequences
identified

• There is much more -- a great diversity of programs and
databases are available
Massively parallel measurements of gene
expression: microarrays

• Defining the “transcriptome”
• The northern blot revisited
• Detecting expression of many genes: arrays
• A typical array experiment
• What to do with all this data?

Brown and Botstein (1999) “Exploring the new world of
the genome with DNA microarrays” Nature Genetics 21,
p. 33-37.
genome (DNA): we have this
“transcriptome” (RNA) and “proteome” (protein): we want these
The value of DNA microarrays for
studying gene expression
1) Study all transcripts at same time

2) Transcript abundance usually correlates with level of gene expression--
much gene control is at level of transcription

3) Changes in transcription patterns often occur as a response to changing
environment--this can be detected with a microarray
Detection of mRNA transcripts

• Northern Blot -- immobilize mRNA on membrane,
detect specific sequence by hybridization with one
labeled probe--requires a separate blotting for each
probe

• DNA microarray -- immobilize many probes
(thousands) in an ordered array, hybridize (base pair)
with labelled mRNA or cDNA
Generating an array of probes
• Identify open reading frames (orfs)

1) PCR each orf (several for each orf), attach (spot)
each PCR product to a solid support in a specific
order (pioneered by Pat Brown’s lab, Stanford)

2) Chemically synthesize orf-specific oligonucleotide
probes directly on microchip (Affymetrix)
http://derisilab.ucsf.edu/microarray/
(Derisi Lab at UCSF)
The chip defines the
genes you are
measuring

The RNA comes
from the cells and
conditions you are
interested in

The hybridization
represents the
measurement
A print head for generating arrays of
probes

Print head travels from DNA probe
source (microtiter plate) to solid
support (treated glass slide)

Small amount of DNA probe is put on
a specific spot at a specific location

Each spot (DNA probe sequence) has
a specific “address”

Printing needles
A yeast array experiment
vegetative sporulating

Isolate mRNA

Prepare fluorescently
labeled cDNA with two
different-colored fluors

hybridize read-out
Example microarray data
Green: mRNA
more abundant in
vegetative cells

Yellow: equivalent
mRNA abundance in
vegetative and
sporulating cells

Red: mRNA more
abundant in
sporulating cells
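The colour coding can be expressed as a log ratio of the two fluorescence channels. In this Python sketch, red is taken to be the sporulating-cell label and green the vegetative-cell label, as in the example above. The two-fold threshold (log2 ratio of 1), the function names, and the intensity values are assumptions for illustration.

```python
import math


def log_ratio(red: float, green: float) -> float:
    """log2 of the red/green intensity ratio for one spot."""
    return math.log2(red / green)


def spot_colour(red: float, green: float, threshold: float = 1.0) -> str:
    """Classify a spot by its log2 ratio (threshold 1.0 = two-fold change)."""
    ratio = log_ratio(red, green)
    if ratio >= threshold:
        return "red"      # more abundant in sporulating cells
    if ratio <= -threshold:
        return "green"    # more abundant in vegetative cells
    return "yellow"       # roughly equal abundance


if __name__ == "__main__":
    # Hypothetical background-corrected intensities: (red, green).
    spots = {"geneA": (4000.0, 1000.0), "geneB": (900.0, 3600.0), "geneC": (1200.0, 1000.0)}
    for name, (red, green) in spots.items():
        print(name, round(log_ratio(red, green), 2), spot_colour(red, green))
```

Using a log scale makes a two-fold increase and a two-fold decrease symmetric around zero.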
What to do with all that data?

Overarching patterns may become apparent

1) Organize data by hierarchical clustering, profiling
to find patterns

2) Display data graphically to allow
assimilation/comprehension
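Hierarchical clustering of expression profiles can be sketched in plain Python. The sketch assumes a distance of 1 minus the Pearson correlation and average linkage, which are common choices for expression data but are assumptions here. The gene names and values are invented. Real analyses would normally use an established library rather than this naive loop.

```python
import math


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation of two equal-length profiles (non-constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def cluster(profiles: dict[str, list[float]], n_clusters: int) -> list[list[str]]:
    """Agglomerative clustering: merge the two closest clusters until done."""
    clusters = [[name] for name in profiles]

    def distance(c1: list[str], c2: list[str]) -> float:
        # Average linkage on (1 - correlation) between all cross pairs.
        pairs = [(a, b) for a in c1 for b in c2]
        total = sum(1.0 - pearson(profiles[a], profiles[b]) for a, b in pairs)
        return total / len(pairs)

    while len(clusters) > n_clusters:
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: distance(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters


if __name__ == "__main__":
    # Hypothetical log ratios at four time points.
    data = {
        "geneA": [0.1, 1.0, 2.1, 3.0],
        "geneB": [0.0, 0.9, 2.0, 3.2],
        "geneC": [3.0, 2.0, 1.1, 0.2],
        "geneD": [2.8, 2.1, 0.9, 0.0],
    }
    print(cluster(data, n_clusters=2))
```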
(Cell synchronization method)
All yeast cell cycle-
regulated genes

(phase in which
gene is expressed)

High mRNA
levels
low mRNA
levels
MIAME:
The Minimum Information About a Microarray Experiment

(#6 helps correct for variations in the quantity of starting
RNA, and for variable labelling and detection
efficiencies)
Analysis of the proteome: “proteomics”
• Which proteins are present and when?
• What are the proteins doing?
– What interacts with what?
• Protein-DNA interactions (chromatin
immunoprecipitation)
• Protein-protein interactions
– Functions of proteins?

Phizicky et al. (2003) “Protein analysis on a proteomic
scale” Nature 422, p. 208-215
Which proteins are expressed?
Classical method
– Detect presence of a specific protein
• Using antibodies or specific assay
• Measure changes in protein levels with
changing environment, in different tissues

– Very labor intensive, expensive to scale up to
proteome
Massively parallel detection and
identification of proteins
• 2D gel electrophoresis
– Separate proteins in a given organism or tissue type by migration in gel
electrophoresis
– Identify protein (cut out of gel, sequence or mass-spec)
– Pattern of spots like a barcode for high-throughput studies
• Mass spectrometry
– Separate peptides or proteins from the cell by mass-to-charge ratio; individual proteins
can be identified (but need genome sequence information for this)
• Microarrays: isolate things that bind proteins
2D gel electrophoresis
1) Separate proteins on the basis of isoelectric point

[Figure: isoelectric focusing gel with a pH gradient from 4 to 10]

This technique is usually done
on a long, narrow gel
2D gel
electrophoresis
Lay gel containing
isoelectrically focused
protein on SDS-PAGE gel,
separate on the basis of
size

E. coli protein profile

From Swiss-Prot database,
www.expasy.ch
Mass spectrometry for identifying proteins in a mixture

Liquid chromatography and
tandem mass spectrometry

Software for processing data

From J.R. Yates 1998 “Mass spectrometry and the age of the
proteome” J Mass Spec. 33, p 1-19
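One reason genome sequence information is needed is that observed peptide masses are matched against masses predicted from the encoded proteins. The Python sketch below shows that prediction step under simplifying assumptions: an idealized trypsin rule (cut after K or R, but not before P), no missed cleavages, no modifications, and standard monoisotopic residue masses. The function names are hypothetical and the example uses the first residues of the translated ORF shown earlier.

```python
# Monoisotopic residue masses in daltons (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the free termini


def tryptic_digest(protein: str) -> list[str]:
    """Cut after every K or R unless the next residue is P."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def peptide_mass(peptide: str) -> float:
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


if __name__ == "__main__":
    for pep in tryptic_digest("MKCPYCKSRDLVYDR"):
        print(pep, round(peptide_mass(pep), 4))
```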
Defining protein function
• Classical methods:
– Define activity of protein, develop an assay for activity
• Biochemistry: use assay to purify protein from cell,
characterize structure/function of protein in vitro
• Genetics: obtain mutants with change in activity,
characterize phenotype of mutant, obtain suppressors
to identify genes that interact with protein of interest
– Time intensive, expensive
Protein activity at the proteome level
• Protein-DNA interactions: identifying binding sites
for DNA-binding proteins: regulation of gene
expression

• Massively parallel screens for activity--protein
arrays
“chromatin immunoprecipitation” (ChIP)
1) Grow cells, add
formaldehyde to cross-link
everything to everything
(including DNA to protein)

2) Lyse cells, break up DNA by
shearing

3) Retrieve protein of interest
(and the DNA it is bound to)
using specific antibody to that
protein (immunoprecipitation)

4) Determine presence of
DNA by quantitative PCR
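The quantitative PCR step is usually reported as enrichment relative to a control. This Python sketch shows two common calculations from threshold cycle (Ct) values. It assumes perfect doubling per cycle (100% amplification efficiency), and the function names and Ct values are hypothetical.

```python
def percent_input(ct_input: float, ct_ip: float, input_fraction: float) -> float:
    """Immunoprecipitated DNA as a percentage of the starting chromatin.

    `input_fraction` is the share of chromatin saved as the input
    sample (for example 0.01 for 1%).
    """
    return 100.0 * input_fraction * 2.0 ** (ct_input - ct_ip)


def fold_enrichment(ct_ip: float, ct_mock: float) -> float:
    """Signal with the specific antibody relative to a mock (no-antibody) IP."""
    return 2.0 ** (ct_mock - ct_ip)


if __name__ == "__main__":
    # Hypothetical Ct values for one candidate binding site.
    print(percent_input(ct_input=25.0, ct_ip=27.0, input_fraction=0.01))
    print(fold_enrichment(ct_ip=27.0, ct_mock=31.0))
```

A lower Ct means more template, so a site bound by the protein shows a lower Ct in the specific IP than in the mock IP.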
V. Orlando (2000) TIBS 25, p. 99
Massively parallel ChIP

PCR, label with
fluorescent dyes
Protein arrays for function

Proteins immobilized,
usually by virtue of a tag
sequence (6x His tag, biotin,
etc.)

Probe all proteins at
once for a specific
activity
Example of a protein microarray
Proteins fused to GST with 6x
histidine tags, immobilized
on Ni2+ matrix

Anti-GST tells how much
protein is immobilized on
surface

Specific assays identify
proteins with specific
activities--calmodulin
binding, phosphoinositide
binding
