Human Genome Project: sequencing

Dec 12, 2000 Draft Finished

Outline
• Exon-intron structure of genes
• Models of gene grammar
Example: Genscan
• Models of exon-intron sequence
• Integrating intrinsic, extrinsic information
Example: GenomeScan
• The RNA splicing code

Central Dogma

DNA (1:1)      ACCGGACCGATGCGACTGCCCGAGGACTAGATAT
               TGGCCTGGCTACGCTGACGGGCTCCTGATCTATA
RNA (1:1)          GACCGAUGCGACUGCCCGAGGACUAGA
Protein (3:1)           M  R  L  P  E  D  *
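
The mapping on this slide can be checked with a short script. The following is a minimal Python sketch, not part of the original slides: the function names `transcribe` and `translate` are illustrative, the codon table is the standard genetic code, and translation is assumed to start at the first AUG and end at the first in-frame stop codon.

```python
from itertools import product

# Standard genetic code, written in the usual TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def transcribe(coding_strand):
    """DNA coding strand -> RNA (1:1, T becomes U)."""
    return coding_strand.upper().replace("T", "U")


def translate(rna):
    """RNA -> protein (3:1), from the first AUG to the first in-frame stop."""
    start = rna.find("AUG")
    if start < 0:
        return ""
    protein = []
    for i in range(start, len(rna) - 2, 3):
        codon = rna[i:i + 3].replace("U", "T")
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


# Sequences taken from the slide.
dna = "ACCGGACCGATGCGACTGCCCGAGGACTAGATAT"
rna = transcribe(dna[4:31])
print(rna)             # GACCGAUGCGACUGCCCGAGGACUAGA
print(translate(rna))  # MRLPED
```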

Pre-mRNA Splicing
[Figure: spliceosome assembly on a pre-mRNA. Labeled factors: U1 snRNP, U2 snRNP, U2AF65, U2AF35, SR proteins. Labeled interactions: intron definition, exon definition. Labeled sequence elements: 5' splice signal, branch signal, polyY, 3' splice signal, exonic enhancers, exonic repressor, intronic enhancers, intronic repressor.]

(assembly of spliceosome, catalysis)

Human Splice Signal Motifs
[Figure: sequence motifs of the 5' splice signal and the 3' splice signal]

C. Burge & S. Karlin, 1997, 1998

Genscan HSMM

Human Splice Signal Motifs
[Figure: sequence motifs of the 5' splice signal and the 3' splice signal]

http://genes.mit.edu/pictogram.html

Semi-Markov HMM Model

Genome Scale Gene Finding Strategies

Strategy                    Based on                               Examples
Ab initio prediction        Models of gene structure/composition   Genscan, GRAIL, GenLang, hmmgene
Microarray                  Hybridization                          Exon-scanning array
Gene inference              Homology                               GenomeScan
Genomic:genomic alignment   Homology                               ExoFish, GLASS/Rosetta
DNA:protein alignment       Homology                               GeneWise
cDNA sequencing             Sequencing                             RIKEN

C. Burge Nature Genet. 27, 5-7, 2001

ExoFish

Homo sapiens

Tetraodon nigroviridis

Roest Crollius et al., Nature Genet., 2000

GenomeScan Objectives
• Combine probabilistic 'extrinsic' information (BLAST hits) with a probabilistic model of gene structure/composition
• Make method efficient and reliable enough to run on an entire vertebrate genome without human supervision
• Focus on 'typical case' when homologous but not identical proteins are available.

http://genes.mit.edu/genomescan

Current Human Gene Annotation Efforts
• Ensembl [http://www.ensembl.org]
Genscan (ab initio) + BLAST (homology) + GeneWise (protein:DNA alignment)

• NCBI [http://ncbi.nlm.nih.gov]
acembly (cDNA, EST alignments)

• Burge lab [http://genes.mit.edu/genomescan]
GenomeScan (ab initio + protein sequence homology)

• Neomorphic/Affymetrix
Genie (ab initio + EST)

• Celera
Otto (???)

IGI (International Gene Index) / IPI (EBI)

5’ Splice Signal Scores
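
How the scores on this slide were computed is not stated in the extracted text. As a rough illustration only, the sketch below scores candidate 5' splice sites by counting matches to the consensus .AG/GTRAGt given elsewhere in these slides. This is much cruder than a probabilistic model of the signal. The pattern string `NAGGTRAGT` (treating "." as any base and the lowercase t as T), the function names, and the `min_score` threshold are all assumptions made for the example.

```python
# IUPAC codes used by the consensus: R = purine (A or G), N = any base.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "AG", "N": "ACGT"}

# Consensus .AG/GTRAGt from the slides: 3 exonic positions, then 6 intronic.
DONOR = "NAGGTRAGT"
EXON_LEN = 3  # the exon/intron boundary falls after the first 3 positions


def score_site(window, pattern=DONOR):
    """Number of positions in `window` that agree with the consensus."""
    return sum(base in IUPAC[p] for base, p in zip(window, pattern))


def donor_candidates(seq, min_score=7):
    """Return (intron_start, score) for windows with GT at the boundary."""
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - len(DONOR) + 1):
        window = seq[i:i + len(DONOR)]
        if window[EXON_LEN:EXON_LEN + 2] != "GT":
            continue  # require the near-invariant GT dinucleotide
        score = score_site(window)
        if score >= min_score:
            hits.append((i + EXON_LEN, score))
    return hits


# Hypothetical toy sequence with one perfect consensus match.
print(donor_candidates("CCCAAGGTAAGTCCC"))  # [(6, 9)]
```

The returned position is the 0-based index of the first intronic base (the G of GT), so a perfect match to all nine positions scores 9.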

Intron Length Distributions

Characterizing the sources of information used for splicing
• 5' splice signal (.AG/GTRAGt)
• 3' splice signal (…YYYYYY.YAG/)
• Branch signal (…CTGAC..)
• Intron length preference
• Intron composition

Splicing-verified Transcripts
Org     MBp      i-Tx    Introns   Int/iTx   %Short
Yeast   12       152     152       ~1        ~50
Worm    100      691     3,577     ~7        46
Fly     140      1,310   3,737     ~4        54
Arab    125      1,121   5,265     ~5        63
Human   3,000+   8,165   33,666    ~9        10

Data from Sep, 2000 GenBank release

Splice Signal Sequences

IntronScan Accuracy
              5'ss and 3'ss only    Complete model
Organism      Detect    Exact       Detect    Exact
Yeast         90        43          98        86
Elegans       95        92          97        95
Fly           92        88          96        94
Arabidopsis   82        68          96        92
Human         76        65          88        85

Fivefold cross-validated

Top Ten Intronic Pentamers
Arabidopsis   Drosophila   Human
TCTCT         ATATA        GTGGG
TTTTT         AAATA        CTGGG
TTTGT         TATAT        GAGGG
TCTTT         TGATT        CAGGG
TGTTT         ACTTA        TGGGG
TCTGT         ACATA        GCAGG
TTCTT         TTTGT        GGTGG
TGTGT         CATTT        GGAGG
CTTTT         TTAAA        GCGGG
TTTCT         TCATT        GCTGG

Top Ten Exonic Pentamers
Arabidopsis   Drosophila   Human
TGAAG         GGCGG        GATGA
CAAAG         CGAGG        CAGAA
AGAAG         CGCTG        GAAGA
TGCTG         AGGAG        CAGCA
TCTGA         TGGCC        CACCA
TGCAG         AGCTG        CTGAA
TGGAG         TGCTG        GTGGA
GGAAG         AGCAG        CAGGA
CGAAG         AGAAG        GAGGA
GAAGG         TGCAG        CTGGA
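
The slides do not say how these pentamers were ranked (raw frequency, enrichment relative to exons or introns, or something else). As a hedged illustration of the simplest option, the sketch below tallies overlapping 5-mers across a set of sequences and reports the most frequent ones. The function name `top_kmers` and the toy input sequences are hypothetical and not taken from the original analysis.

```python
from collections import Counter


def top_kmers(sequences, k=5, n=10):
    """Count overlapping k-mers across sequences; return the n most common."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Skip windows containing ambiguous bases such as N.
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    return counts.most_common(n)


# Hypothetical toy "introns"; real input would be annotated intron sequences.
toy_introns = ["GTAAGTTTTTTTCTCTAG", "GTGAGTCTCTTTTTTCAG"]
for kmer, count in top_kmers(toy_introns, k=5, n=3):
    print(kmer, count)
```

An enrichment-based ranking would divide each intronic count by the corresponding count in a background set (for example exons) before sorting.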

Summary
• Genes have a grammatical structure: probabilistic models of this structure are interesting and useful
• Computational methods interact with experimental methods in modern biology
• Introns also have a grammatical structure: sequence analysis may help us to deduce aspects of this structure
• There are many interesting related problems: finding RNA genes, identifying regulatory elements, understanding transcription, regulatory networks, etc.
