Bioinformatics 1 -- lecture 22

Gene finding in eukaryotes: intron/exon boundaries, splicing, alternative splicing

Finding genes in prokaryotes is easy.
Just translate the DNA sequence in all 6 reading frames. The ORFs of real genes (regions starting with ATG and ending in an in-frame stop codon) will typically be at least 300 bases in length, while random reading frames will be dotted with stop codons at the rate of about 3 stop codons every 64 codons (roughly one every 21 codons).

XXXXXXXXATG......(3N).....TGAXXXXX
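A minimal sketch of the six-frame ORF scan described above, assuming an uppercase DNA string containing only A, C, G and T. The function names (reverse_complement, find_orfs) and the min_len parameter are illustrative choices, not part of the lecture.

```python
STOPS = {"TAA", "TAG", "TGA"}

def reverse_complement(dna):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def find_orfs(dna, min_len=300):
    """Yield (strand, start, end) for ATG...stop ORFs of at least min_len bases.

    start/end are 0-based, end-exclusive offsets on the scanned strand;
    for strand '-', they refer to the reverse complement.
    """
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = None  # position of the first ATG since the last stop
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOPS:
                    if i + 3 - start >= min_len:
                        yield strand, start, i + 3
                    start = None

# Toy example (invented sequence): one 18-base ORF on the forward strand.
toy = "CC" + "ATG" + "GCT" * 4 + "TGA" + "CC"
toy_orfs = list(find_orfs(toy, min_len=18))
```

With the default min_len of 300, short chance ORFs are filtered out, which is the point of the length threshold on the slide.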

Finding genes in eukaryotes is harder.
•Genes are composed of coding regions (exons) and internal non-coding regions (introns).
•Genes are transcribed to pre-mRNA.
•Introns are removed from pre-mRNA by the spliceosome (a ribozyme).
•Proteins are translated from the mRNA after splicing.
•Different tissues may splice pre-mRNA differently!
[Figure: DNA -> pre-mRNA (intron bounded by GU ... AG) -> mRNA]

pre-mRNA structure
[Figure: pre-mRNA structure; labels: prokaryotic mRNA, +polyA tail]

Introns-early? Introns-late?
Did the common ancestor have introns?
[Figure: tree of eubacteria, archaea and eukaryotes; branch labels: "have introns", "don’t have introns"]

A generic gene sequence model for pre-mRNA
pre-gene region -> AUG -> exon -> intron (0|1|2) -> exon -> intron -> exon -> ... -> stop (UAA|UAG|UGA) -> post-gene region
Each intron runs from a 5’ splice site (GU.......) to a 3’ splice site (.......AG).
XXXXXXATG.XXGUX....XXAGXX....XXXGUX..XXAGXX.XXTAAXXX

Splicing mechanism
Spliceosome: Def: A ribonucleoprotein complex, containing RNA and small nuclear ribonucleoproteins (snRNPs), that is assembled during the splicing of the messenger RNA primary transcript to excise an intron.
[Figure: DNA -> pre-mRNA -> spliceosome. pre-mRNA drawn as exon1 GU ... A ... AG exon2, with the 5’ splice site ("the donor") at GU, the "branchpoint" at A, and the 3’ splice site ("the acceptor") at AG.]

Splicing mechanism: the "lariat"
(1) pre-mRNA: exon1 GU ... A ... AG exon2. Spliceosome (not shown) forms.
(2) Lariat loop forms.
(3) Exon 1 cleaved from lariat, leaving exon1 with a free 3’-OH.
(4) Exons 1 and 2 positioned to ligate.
(5) Liberated lariat + ligated exons. The lariat gets degraded; exon1-exon2 goes to the ribosome. Spliceosome (not shown) dissociates.

Splicing mechanism
http://neuromuscular.wustl.edu/pathol/diagrams/splicefunct.html (google: wustl neuromuscular splicefunct)
http://neuromuscular.wustl.edu/pathol/diagrams/splicemech.html (google: wustl neuromuscular splicemech)
Much thanks to T. Wilson, UCSC!

RNA binding proteins may selectively block splicing in some tissues.
For example: an RNA binding protein is expressed in response to a stimulus. It binds near the branchpoint (or one of the splice points). It blocks, in this case, the cyclizing step.
[Figure: exon1 GU ... A ... AG exon2, with a protein bound near the branchpoint A]

Frame of intron
...GU.....AG...
Spliceosome cuts before GU and after AG.
Frame 0: intron starts at codon boundary
AGU CUU AUC UUU UCA GUU GGG CCG UAG AAC CAC UCG UAA
Frame 1: intron starts one after codon boundary
AGU CUU AUC UUU UCA UGU GGG CCG UAA GAC CAC UCG UAA
Frame 2: intron starts two after codon boundary
AGU CUU AUC UUU UCA GGG UGG CCG UAG AGC CAC UCG UAA
The intron length must be a multiple of 3 if the intron is alternatively spliced. This is a constraint.
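The frame bookkeeping can be sketched in a few lines, using the Frame 0 and Frame 1 example sequences from this slide. The intron coordinates below (15..27 and 16..28) are my reading of where the GU...AG intron sits in each example, and the function names are illustrative.

```python
def intron_frame(exon_bases_before):
    """Frame 0, 1 or 2: how many bases past a codon boundary the intron starts."""
    return exon_bases_before % 3

def splice(pre_mrna, start, end):
    """Remove pre_mrna[start:end] after checking the GU...AG rule."""
    intron = pre_mrna[start:end]
    if not (intron.startswith("GU") and intron.endswith("AG")):
        raise ValueError("intron must start with GU and end with AG")
    return pre_mrna[:start] + pre_mrna[end:]

def preserves_reading_frame(start, end):
    """True if skipping or retaining this intron leaves downstream codons in frame."""
    return (end - start) % 3 == 0

# Example sequences copied from the slide, spaces removed.
frame0 = "AGU CUU AUC UUU UCA GUU GGG CCG UAG AAC CAC UCG UAA".replace(" ", "")
frame1 = "AGU CUU AUC UUU UCA UGU GGG CCG UAA GAC CAC UCG UAA".replace(" ", "")

spliced0 = splice(frame0, 15, 27)   # intron GUUGGGCCGUAG
spliced1 = splice(frame1, 16, 28)   # intron GUGGGCCGUAAG
```

Both introns are 12 bases long, so the codons after the intron (CAC UCG UAA) read the same whether or not the intron is removed.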

How to find splice points, using the protein sequence database.
(1) Translate the DNA in all 6 frames.
(2) Search the database of protein sequences using the translations.
(3) Using the complete protein sequence, align it to the translation and find the regions of (near) perfect identity. These will abruptly end at the intron start site.
(4) Find the 5’-GT or 3’-AG signal at the point where the identity matches abruptly end.
(5) If your translation has an insertion with nearly perfect matches on either side, you have an alternative splicing.
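Steps (3) and (4) can be sketched as follows. This is a simplified illustration, not the database procedure itself: it assumes the known protein starts at the beginning of the translated frame and uses exact residue matching in place of a real alignment. The function name candidate_donor, the window parameter and the example sequence are all invented for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(dna, frame=0):
    """Translate one forward frame; unknown codons become 'X'."""
    return "".join(CODONS.get(dna[i:i + 3], "X")
                   for i in range(frame, len(dna) - 2, 3))

def candidate_donor(dna, protein, frame=0, window=3):
    """Find where the translation stops matching the protein,
    then return the position of the nearest GT within +/- window bases."""
    translation = translate(dna, frame)
    matched = 0
    while (matched < min(len(translation), len(protein))
           and translation[matched] == protein[matched]):
        matched += 1
    break_nt = frame + 3 * matched          # first base past the matching codons
    lo = max(0, break_nt - window)
    hits = [i for i in range(lo, break_nt + window + 1) if dna[i:i + 2] == "GT"]
    return min(hits, key=lambda i: abs(i - break_nt)) if hits else None

# Invented example: exon (M A K) + intron GT...AG + exon (E W stop).
genomic = "ATGGCTAAA" + "GTAAGTCCTTTTCAG" + "GAATGGTAA"
donor = candidate_donor(genomic, "MAKEW")
```

The window is there because an intron in frame 1 or 2 splits a codon, so the identity can end a base or two away from the actual GT.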

In-class exercise: find the alternative spliced variants
Go to NCBI, search nucleotides for AKAP9 (you should get the sequence with accession number NM_005751.4, GI:197245395)
Select "BLAST sequence"
Select blastx (not tblastx)
Select the nr database (non-redundant protein sequences; blastx searches a protein database).
Organism: homo sapiens
Submit.
While waiting, do exercise on the next page.

A sure sign of alternative splicing in blastx output:
Score = 160 bits (404), Expect = 8e-37
Identities = 85/116 (73%), Positives = 86/116 (74%)
Frame = +2
Query: 76820 RSHENGFMEDLDKTWVRYQECDSRSNAPATLTFENMAGAFSFIHSRVGSPWXXXXXXXXX 76999
             +SHENGFMEDLDKTWVRYQECDSRSNAPATLTFENMA
Sbjct: 778   KSHENGFMEDLDKTWVRYQECDSRSNAPATLTFENMA----------------------- 814
Query: 77000 XXXXRHTGVFMLVAGGIVAGIFLIFIEIAYKRHKDARRKQMQLAFAAVNVWRKNLQ 77167
                    GVFMLVAGGIVAGIFLIFIEIAYKRHKDARRKQMQLAFAAVNVWRKNLQ
Sbjct: 815   -------GVFMLVAGGIVAGIFLIFIEIAYKRHKDARRKQMQLAFAAVNVWRKNLQ 863
Identical up to the insertion. Identical after the insertion. These are the same gene.
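The pattern in this alignment (a gap in the subject flanked by near-perfect identity) can be detected mechanically. The sketch below works on already-aligned query and subject strings; the function name find_insertions and the flank and min_identity thresholds are illustrative assumptions, and the two strings are the query and subject rows of the alignment above joined into single lines.

```python
def find_insertions(query, sbjct, flank=10, min_identity=0.9):
    """Return (start, end) alignment columns where sbjct is gapped
    and both flanking windows are (nearly) identical."""
    if len(query) != len(sbjct):
        raise ValueError("aligned strings must have equal length")

    def identity(a, b):
        return sum(x == y for x, y in zip(a, b)) / len(a) if a else 0.0

    hits, i, n = [], 0, len(sbjct)
    while i < n:
        if sbjct[i] != "-":
            i += 1
            continue
        j = i
        while j < n and sbjct[j] == "-":   # extend over the whole gap run
            j += 1
        left = identity(query[max(0, i - flank):i], sbjct[max(0, i - flank):i])
        right = identity(query[j:j + flank], sbjct[j:j + flank])
        if left >= min_identity and right >= min_identity:
            hits.append((i, j))
        i = j
    return hits

# Rows of the blastx alignment from the slide, joined across the line break.
LEFT = "SHENGFMEDLDKTWVRYQECDSRSNAPATLTFENMA"
RIGHT = "GVFMLVAGGIVAGIFLIFIEIAYKRHKDARRKQMQLAFAAVNVWRKNLQ"
query = "R" + LEFT + "GAFSFIHSRVGSPW" + "X" * 13 + "RHT" + RIGHT
sbjct = "K" + LEFT + "-" * 30 + RIGHT

insertions = find_insertions(query, sbjct)
identical = sum(q == s for q, s in zip(query, sbjct))
```

Here the gap spans 30 residues, and the count of identical columns reproduces the 85/116 reported by blastx.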

Which codons can come at the start/end of an alternative exon?
e = a base within the exon. i = a base within the intron. | = intron/exon boundary.
The un-spliced intron starts with GU and ends with AG.

Frame   intron start   intron end
0       |GUi           iAG|
1       e|GU           iiA.G|ee
2       ee|G.Uii       AG|e

Which amino acids can come at the start/end of an alternative exon?
e = a base within the exon. i = a base within the intron. | = intron/exon boundary.
The un-spliced intron starts with GU and ends with AG.
[...] = any one of the listed amino acids. {...} = any amino acid except those listed.

Frame   intron start          intron end
0       |V                    [QKE]|
1       [CRSG]                {FMNHYWCD} [VADEG]
2       {FINHYCD} [FLSYCW]    [SR]

What frame is the intron in the earlier slide?
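The amino acid sets on this slide follow directly from the standard genetic code, and can be checked with a short sketch. The pattern syntax (N for any base) and the function name residues are illustrative choices; stop codons are ignored.

```python
from itertools import product

# Standard genetic code, codons enumerated in UCAG order.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
ALL20 = set(AMINO) - {"*"}

def residues(pattern):
    """Amino acids encoded by codons matching a 3-letter pattern ('N' = any base)."""
    return {aa for codon, aa in CODON_TABLE.items()
            if aa != "*" and all(p in ("N", b) for p, b in zip(pattern, codon))}

frame0_start = residues("GUN")             # |GUi
frame0_end = residues("NAG")               # iAG|
frame1_start = residues("NGU")             # e|GU
frame1_end_excluded = ALL20 - residues("NNA")   # iiA.
frame1_end_next = residues("GNN")          # G|ee
frame2_start_excluded = ALL20 - residues("NNG")  # ee|G
frame2_start_next = residues("UNN")        # Uii
frame2_end = residues("AGN")               # AG|e
```

For example, frame0_end is the set Q, K, E because UAG, the fourth codon ending in AG, is a stop codon.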

Is that all there is to it?
Exon.GU...AG.Exon
GU occurs on average every 16 nucleotides. AG, too. If this were the only information, there would be too many splice sites. GU...AG is necessary, not sufficient, for splicing. What else is needed?
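The "every 16 nucleotides" figure comes from assuming the four bases are equally likely and independent, so a given dinucleotide has probability 1/4 x 1/4 = 1/16 at each position. A minimal sketch of that arithmetic, with illustrative function names and an invented toy sequence:

```python
def expected_dinucleotide_count(length, p_first=0.25, p_second=0.25):
    """Expected occurrences of one dinucleotide in a random sequence
    (independent bases; length - 1 possible start positions)."""
    return (length - 1) * p_first * p_second

def count_overlapping(seq, motif):
    """Count occurrences of motif in seq, allowing overlaps."""
    return sum(seq.startswith(motif, i) for i in range(len(seq) - len(motif) + 1))

def candidate_introns(seq, min_len=4):
    """Count (GU, AG) pairs that could delimit an intron of at least min_len bases."""
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GU"]
    acceptor_ends = [i + 2 for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    return sum(1 for d in donors for a in acceptor_ends if a - d >= min_len)

expected_gu = expected_dinucleotide_count(1601)   # 1600 positions / 16
toy_pairs = candidate_introns("GUAGGUAG")
```

Because every GU can in principle pair with every downstream AG, the number of candidate introns grows roughly with the square of the sequence length, which is why the GU...AG rule alone is not sufficient.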

What information is used to predict intron/exon boundaries?
•Introns almost always start with GU and end with AG (GT...AG in DNA)
•Introns can start in one of three “frames” (0|1|2) relative to the codon frame.
•Alternatively spliced introns (may be exons) must have a multiple of 3 nucleotides.
•Orthologs conserve intron/exon boundaries.
•3’ and 5’ intron sequence motifs
•branchpoint sequence motif
•Enhancer/silencer sequence motifs (ESEs, ESSs, ISEs, ISSs)
•Base composition in exons/introns.

Sequence composition method for genefinding
Most exons code for protein. Most introns do not.
Selective pressure on exons includes: (1) species-specific codon preferences (2) amino acid preferences (3) selection for “foldability” and function.
Not so specific.
A simple HMM for intron/exon base composition.
[Figure: two-state HMM. One state emits bases with P(G)=w P(C)=x P(T)=y P(A)=z, the other with P(G)=a P(C)=b P(T)=c P(A)=d.]
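A two-state base composition HMM like the one on this slide can be decoded with the Viterbi algorithm. The sketch below is only an illustration: the emission probabilities (GC-rich exons, AT-rich introns), the 0.9 stay probability and the equal start probabilities are made-up values, not trained parameters, and the names EMIT and viterbi are hypothetical.

```python
import math

STATES = ("E", "I")  # E = exon, I = intron

# Hypothetical emission probabilities (stand-ins for w, x, y, z and a, b, c, d).
EMIT = {
    "E": {"G": 0.3, "C": 0.3, "T": 0.2, "A": 0.2},
    "I": {"G": 0.2, "C": 0.2, "T": 0.3, "A": 0.3},
}

def viterbi(seq, emit, stay=0.9):
    """Return the most probable state string ('E'/'I') for a DNA sequence."""
    if not seq:
        return ""
    log = math.log
    switch = 1.0 - stay

    def trans(prev, cur):
        return log(stay if prev == cur else switch)

    # Log-probability of the best path ending in each state; equal start probabilities.
    score = {s: log(0.5) + log(emit[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_score, pointer = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: score[p] + trans(p, s))
            pointer[s] = best_prev
            new_score[s] = score[best_prev] + trans(best_prev, s) + log(emit[s][base])
        backpointers.append(pointer)
        score = new_score

    # Trace back from the best final state.
    state = max(STATES, key=score.get)
    path = [state]
    for pointer in reversed(backpointers):
        state = pointer[state]
        path.append(state)
    return "".join(reversed(path))

labels = viterbi("A" * 30 + "G" * 30 + "A" * 30, EMIT)
```

With these toy numbers a 30-base G-rich block is long enough to outweigh the cost of two state switches, so it is labeled as exon; a much shorter block would be absorbed into the surrounding intron state, which illustrates why composition alone is "not so specific".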

ESEs, ESSs, ISEs, ISSs
ESE = Exonic Splicing Enhancers: sequences in the exons that promote splicing
ESS = Exonic Splicing Silencers: sequences in the exons that inhibit splicing
ISE = Intronic Splicing Enhancers: sequences in the introns that promote splicing
ISS = Intronic Splicing Silencers: sequences in the introns that inhibit splicing

How were ESEs found?
(1) Training database was constructed of exonic mRNA (post-spliced) that was (a) constitutively spliced (not alternatively spliced), and (b) from an internal nonprotein-coding exon.
(2) Database of ‘control’, non-ESE sequences was constructed.
(3) The relative abundance of all “8-mers” was found.
(4) 8-mers with high relative abundance were tested by mutating the putative ESE 8-mers and determining the splicing efficiency by gel electrophoresis.
Zhang XHF, and Chasin LA. “Computational definition of sequence motifs governing constitutive exon splicing.” Genes & Development 18: 1241-1250 (2004)
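Step (3) can be illustrated with a simple k-mer frequency ratio between the exon set and the control set. This is only a sketch: the statistic used in the paper is not given on the slide, so the ratio with a pseudocount below is an assumption, as are the function names and the toy sequences (which use k=2 to stay readable).

```python
from collections import Counter

def kmer_counts(seqs, k=8):
    """Count all overlapping k-mers across a list of sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts

def relative_abundance(exon_seqs, control_seqs, k=8, pseudocount=1.0):
    """Map each observed k-mer to (frequency in exons) / (frequency in controls).

    The pseudocount avoids division by zero for k-mers absent from one set.
    """
    fg, bg = kmer_counts(exon_seqs, k), kmer_counts(control_seqs, k)
    kmers = set(fg) | set(bg)
    fg_total = sum(fg.values()) + pseudocount * len(kmers)
    bg_total = sum(bg.values()) + pseudocount * len(kmers)
    return {kmer: ((fg[kmer] + pseudocount) / fg_total)
                  / ((bg[kmer] + pseudocount) / bg_total)
            for kmer in kmers}

# Toy data: "GA" is enriched in the exon set, "CU" in the control set.
ratios = relative_abundance(["GAGAGA"], ["CUCUCU"], k=2)
top = max(ratios, key=ratios.get)
```

k-mers with a high ratio would be the putative ESEs to carry forward to the mutation experiments in step (4); those with a low ratio are candidate silencers.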

ESE and ESS motifs
[Figure: relative abundance of motifs; panels: putative ESSs, putative ESEs]
Some of the motifs found by Zhang & Chasin using relative abundance analysis of 8-mers, after clustering.

The nice thing about HMMs: they are modular.
Individual modules can be “trained” separately. HMMs can be connected by their begin and end states to make a super-HMM.
[Figure: several HMM modules, each with its own begin and end state]

A modular HMM for introns
[Figure: DSS (donor site = GU) leads either to Ishort (short variable length intron model) or to Ifixed followed by Igeo (fixed length + variable length intron model); both paths end at ASS (acceptor site = AG). Transition labels: p, 1-p, q, 1-q, 1.]
Stanke M, Steinkamp R, Waack S, Morgenstern B. “AUGUSTUS: a web server for gene finding in eukaryotes.” Nucleic Acids Res. 2004 Jul 1;32:W309-12.

Intron model for mammals
[Figure: donor motif (contains GU), branch site, poly-pyrimidine region, acceptor motif (contains AG)]
from: Blencowe BJ. “Exonic splicing enhancers: mechanism of action, diversity and role in human genetic diseases.” TIBS 25:106 (2000)

A genefinding HMM: GENSCAN
[Figure: internal exon model, intron models, initial exon model, terminal exon model, single exon model, intergenic regions; mirrored models for the reverse complement strand]

GENSCAN -- forward strand part
[Figure: forward strand part of the GENSCAN model]

Splicing fact sheet
Exons
  Average 145 nucleotides in length
  Contain regulatory elements:
    ESEs: Exonic splicing enhancers
    ESSs: Exonic splicing silencers
Introns
  Average more than 10x longer than exons
  Contain regulatory elements (bind regulatory complexes):
    ISEs: Intronic splicing enhancers
    ISSs: Intronic splicing silencers
Splice sites
  5' splice site
    Sequence: AGguragu (r = purine)
    U1 snRNP: Binds to 5' splice site
  3' splice site
    Sequence: yyyyyyy nagG (y = pyrimidine)
  Branch site
    Sequence: ynyuray (r = purine)
    U2 snRNP: Binds to branch site via RNA:RNA interactions between snRNA and pre-mRNA
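The consensus sequences on this fact sheet use degenerate base codes (r = purine, y = pyrimidine, n = any base), which translate naturally into regular expressions. A minimal sketch, with illustrative function names and invented test sequences; it treats upper and lower case alike and demands an exact consensus match, which is stricter than real splice sites, since those tolerate mismatches.

```python
import re

# Degenerate base codes used on the fact sheet, as regex character classes.
IUPAC = {"a": "A", "c": "C", "g": "G", "u": "U",
         "r": "[AG]", "y": "[CU]", "n": "[ACGU]"}

def consensus_to_regex(consensus):
    """Turn a consensus such as 'ynyuray' into a regex string (spaces ignored)."""
    return "".join(IUPAC[ch] for ch in consensus.lower() if not ch.isspace())

def find_sites(rna, consensus):
    """Return 0-based start offsets of all matches, including overlapping ones."""
    pattern = re.compile("(?=" + consensus_to_regex(consensus) + ")")
    return [m.start() for m in pattern.finditer(rna)]

donor_hits = find_sites("CCAGGUAAGUCC", "AGguragu")    # 5' splice site consensus
branch_hits = find_sites("GGUACUAACGG", "ynyuray")     # branch site consensus
```

Gene finders generally score sites with position weight matrices or HMM states rather than exact patterns, so this is best read as a way to make the notation concrete.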

Alternative splicing fact sheet
Alternative splicing
  Definition: Joining of different 5' and 3' splice sites
  ~80% of alternative splicing results in changes in the encoded protein
  Up to 59% of human genes express more than one mRNA by alternative splicing
Functional effects:
  Generates several forms of mRNA from single gene
  Allows functionally diverse protein isoforms to be expressed according to different regulatory programs
Structural effects:
  Insert or remove amino acids
  Shift reading frame
  Introduce termination codon
Gene expression effects:
  Removes or inserts regulatory elements controlling translation, mRNA stability, or localization
Regulation
  Splicing pathways modulated according to:
    Cell type
    Developmental stage
    Gender
    External stimuli
