
An Introduction to Multiple Sequence Alignments

Cédric Notredame

Copyright Cédric Notredame (2000-2003) All rights reserved

chite       ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat       --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr       KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse       -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP

chite       AATAKQNYIRALQEYERNGG-
wheat       ANKLKGEYNKAIAAYNKGESA
trybr       AEKDKERYKREM---------
mouse       AKDDRIRYDNEMKSWEEQMAE

Mangel M., Samaniego F.J., Abraham Wald's Work on Aircraft Survivability, J. American Statistical Association, 79, 259-270 (1984)

Our Scope
-How Can I Use My Alignment?
-How Does The Computer Align The Sequences?
-How Can I Assemble a Multiple Alignment?
-What are the Difficulties?

Outline
-Why Do We Need Multiple Sequence Alignment?
-The Progressive Alignment Algorithm
-A Possible Strategy…
-Potential Difficulties

Pre-requisite
-How Do Sequences Evolve?
-How can We COMPARE Sequences?
-How can We ALIGN Sequences?

Why Do We Need Multiple Sequence Alignment?

Sometimes Two Sequences Are Not Enough…
The man with TWO watches NEVER knows the time

What is A Multiple Sequence Alignment?
Image: alignment of chite, wheat, trybr and mouse (as on the title slide)
-Evolution Criteria: Residues are arranged so that those having the same ancestor end up in the same column.
-Structural Criteria: Residues are arranged so that those playing a similar role end up in the same column.

Phylogenetic Relation
Functional Relation

How Can I Use A Multiple Sequence Alignment?
Image: alignment of chite, wheat, trybr and an unknown sequence
Extrapolation
-Less Than 30 % id BUT Conserved where it MATTERS
-Extrapolation Beyond The Twilight Zone
-SwissProt vs Unknown Sequence: Homology?

How Can I Use A Multiple Sequence Alignment?
Image: alignment of chite, wheat, trybr and mouse
-Extrapolation
-Prosite Patterns

How Can I Use A Multiple Sequence Alignment?
Image: alignment of chite, wheat, trybr and mouse
-Extrapolation
-Prosite Patterns: P-K-R-[PA]-x(1)-[ST]…
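
A pattern of this kind can be tried directly on the sequences. The sketch below is only an illustration, not the PROSITE software: it translates the visible part of the pattern (the slide truncates it with "…", and the rest is not guessed here) into a Python regular expression. The function name prosite_to_regex and the ungapped sequence prefixes are assumptions made for the example.

```python
import re

def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a Python regular expression.

    Supported elements: a residue (K), a choice ([PA]), an exclusion ({PA}),
    the wildcard x, and an optional repeat count such as x(1) or x(2,4).
    """
    element_re = re.compile(
        r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?"
    )
    parts = []
    for element in pattern.split("-"):
        match = element_re.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":
            core = "."                       # x means any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # {PA} means anything but P or A
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)

# Only the part of the pattern that is visible on the slide.
PATTERN_PREFIX = "P-K-R-[PA]-x(1)-[ST]"
regex = re.compile(prosite_to_regex(PATTERN_PREFIX))

# Start of each sequence of the example alignment, with gaps removed.
sequences = {
    "chite": "ADKPKRPLSAYMLWLNSARE",
    "wheat": "DPNKPKRAPSAFFVFMGEFR",
    "trybr": "KKDSNAPKRAMTSFMFFSSD",
    "mouse": "KPKRPRSAYNIYVSESFQ",
}

def first_hit(sequence):
    """Return the 0-based start of the first match, or None."""
    match = regex.search(sequence)
    return match.start() if match else None

hits = {name: first_hit(seq) for name, seq in sequences.items()}
for name, start in hits.items():
    print(name, start)
```

All four sequences match, which is what one expects from a pattern derived from a conserved column block.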

How Can I Use A Multiple Sequence Alignment?
Image: alignment of chite, wheat, trybr and mouse
-Extrapolation
-Prosite Patterns: SwissProt, Uncharacterised Signature. Match?

How Can I Use A Multiple Sequence Alignment?
Image: alignment of chite, wheat, trybr and mouse, with a profile column over the amino acids (annotations: L?, K>R)
-Extrapolation
-Prosite Patterns
-Profiles And HMMs
•More Sensitive
•More Specific

A PROSITE PROFILE
A Substitution Cost For Every Amino Acid, At Every Position
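
The idea of "a cost for every amino acid, at every position" can be sketched as a small position-specific scoring matrix. This is a simplified illustration and not the procedure used to build real PROSITE profiles: the uniform background, the pseudocount of 1 and the names build_profile and score_segment are all assumptions.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(rows, pseudocount=1.0):
    """Return one {amino acid: log-odds score} dictionary per alignment column."""
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all rows must have the same length")
    background = 1.0 / len(AMINO_ACIDS)          # assumed uniform background
    profile = []
    for col in range(width):
        counts = Counter(row[col] for row in rows if row[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2((counts[aa] + pseudocount) / total / background)
            for aa in AMINO_ACIDS
        })
    return profile

def score_segment(profile, segment):
    """Sum the per-position scores of an ungapped segment."""
    if len(segment) != len(profile):
        raise ValueError("segment and profile must have the same length")
    return sum(column[aa] for column, aa in zip(profile, segment))

# The six columns starting at the conserved P-K-R of the example alignment.
block = ["PKRPLS", "PKRAPS", "PKRAMT", "PKRPRS"]
profile = build_profile(block)

print(round(score_segment(profile, "PKRPLS"), 2))
print(round(score_segment(profile, "GGGGGG"), 2))
```

A segment that resembles the block scores higher than an unrelated one, and unlike a pattern the profile still gives a graded score to near misses.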

How Can I Use A Multiple Sequence Alignment?
Image: alignment of chite, wheat, trybr and mouse, with a tree of the four sequences
-Extrapolation
-Motifs/Patterns
-Profiles
-Phylogeny
•Evolution
•Paralogy/Orthology

How Can I Use A Multiple Sequence Alignment?
Image: alignment of chite, wheat, trybr and mouse
-Extrapolation
-Motifs/Patterns
-Profiles
-Phylogeny
-Struc. Prediction
Column Constraint => Evolution Constraint => Structure Constraint

How Can I Use A Multiple Sequence Alignment?
Image: alignment of chite, wheat, trybr and mouse
-Extrapolation
-Motifs/Patterns
-Profiles
-Phylogeny
-Struc. Prediction
•PsiPred OR PhD For Secondary Structure Prediction: 75% Accurate.
•Threading: is improving but is not yet as good.

How Can I Use A Multiple Sequence Alignment?
Image: alignment of chite, wheat, trybr and mouse
Automatic Multiple Sequence Alignment methods are not always perfect…
You know better… With your big BRAIN

Why Is It Difficult To Compute A Multiple Sequence Alignment?
Image: alignment of chite, wheat, trybr and mouse
A CROSSROAD PROBLEM
-BIOLOGY: What is A Good Alignment
-COMPUTATION: What is THE Good Alignment

Why Is It Difficult To Compute A Multiple Sequence Alignment?
CIRCULAR PROBLEM
-BIOLOGY, COMPUTATION
-Good Sequences, Good Alignment

The Biological Problem. Same as PairWise Alignment Problem
-We do NOT know how Sequences Evolve.
-We do NOT understand the Relation Between Structures and Sequences.
-We would NOT recognize the Correct Alignment if we had it IN FRONT of our eyes…

The Biological Problem. The Charlie Chaplin Paradox

The Biological Problem. How to Evaluate an Alignment
-A nice set of Sequences
-Substitution Matrix (Blosum)
-Gap Penalties
-An Evaluation Function
Image: column A A A C C. Sums of Pairs: Cost=6
•Over-estimation of the Substitutions
•Easy to compute
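
One reading of the figure is a column holding three A and two C, scored with a cost of 1 for every mismatched pair. The sketch below follows that reading, which is an assumption, and the function name sum_of_pairs_cost is hypothetical.

```python
from itertools import combinations

def sum_of_pairs_cost(column, mismatch_cost=1):
    """Sum the mismatch cost over every pair of residues in one column."""
    return sum(mismatch_cost for a, b in combinations(column, 2) if a != b)

def alignment_cost(rows, mismatch_cost=1):
    """Sum-of-pairs cost of a whole gap-free alignment, column by column."""
    return sum(sum_of_pairs_cost(col, mismatch_cost) for col in zip(*rows))

print(sum_of_pairs_cost("AAACC"))  # 3 x 2 mismatched pairs
```

The over-estimation is visible here: a single A to C substitution on one branch of the tree is enough to produce this column, yet it is charged six times.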

The COMPUTATIONAL Problem. Producing the Alignment
-A nice set of Sequences
-Substitution Matrix (Blosum)
-Gap Penalties
-An Evaluation Function
-An Alignment Algorithm: GLOBAL Alignment
Will It Work?

HOW CAN I ALIGN MANY SEQUENCES
2 Globins => 1 Min
3 Globins => 2 hours
4 Globins => 10 days
5 Globins => 3 years
6 Globins => 300 years (DHEA Loaded!)
7 Globins => 30,000 years (Solidified Fossil. Old stuff)
8 Globins => 3 Million years
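
The explosion comes from the size of the dynamic programming lattice: an exact alignment of N sequences of length L fills (L+1)^N cells, and each cell looks at 2^N - 1 neighbours. The sketch below only counts these operations. It does not try to reproduce the timings above, and the length of 150 residues is an assumption chosen as a globin-like size.

```python
def dp_cells(length, n_sequences):
    """Number of cells in the N-dimensional dynamic programming lattice."""
    return (length + 1) ** n_sequences

def dp_steps(length, n_sequences):
    """Cells multiplied by the 2^N - 1 predecessors examined for each cell."""
    return dp_cells(length, n_sequences) * (2 ** n_sequences - 1)

LENGTH = 150  # assumed sequence length

for n in range(2, 9):
    print(n, dp_steps(LENGTH, n))
```

Each extra sequence multiplies the work by more than the sequence length, which is why exact methods stop being usable after a handful of sequences.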

The Progressive Multiple Alignment Algorithm (Clustal W)

Making An Alignment
-Any Exact Method would be TOO SLOW
-We will use a Heuristic Algorithm.
-Progressive Alignment Algorithm is the most Popular
•ClustalW
•Greedy Heuristic (No Guarantee)
•Fast

Progressive Alignment
Feng and Doolittle, 1988. Taylor, 1989.
Clustering

Progressive Alignment
Dynamic Programming Using A Substitution Matrix
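
The pairwise step can be sketched with the classic Needleman and Wunsch recursion. This is a minimal illustration, not ClustalW's implementation: it uses a match/mismatch score instead of a real substitution matrix and a single linear gap penalty instead of Gop/Gep, and all three values are assumptions.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of two sequences; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    # Fill the table: each cell is the best of substitution, or a gap in b or a.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))

best, row_a, row_b = needleman_wunsch("FAT", "FAST")
print(best)
print(row_a)
print(row_b)
```

In a progressive aligner the same recursion is reused to align a sequence to an alignment, or two alignments to each other, with columns scored instead of single residues.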

Progressive Alignment
-Depends on the CHOICE of the sequences.
-Depends on the ORDER of the sequences (Tree).
-Depends on the PARAMETERS:
•Substitution Matrix.
•Penalties (Gop, Gep).
•Sequence Weight.
•Tree making Algorithm.

Progressive Alignment: When Does It Work
-Works Well When Phylogeny is Dense
-No Outlier Sequence
Image: River Crossing

Progressive Alignment: When Doesn't It Work

CLUSTALW (Score=20, Gop=-1, Gep=0, M=1)
SeqA  GARFIELD THE LAST FA-T CAT
SeqB  GARFIELD THE FAST CA-T ---
SeqC  GARFIELD THE VERY FAST CAT
SeqD  -------- THE ---- FA-T CAT

CORRECT (Score=24)
SeqA  GARFIELD THE LAST FA-T CAT
SeqB  GARFIELD THE FAST ---- CAT
SeqC  GARFIELD THE VERY FAST CAT
SeqD  -------- THE ---- FA-T CAT

Sequences:
GARFIELD THE LAST FAT CAT
GARFIELD THE FAST CAT
GARFIELD THE VERY FAST CAT
THE FAT CAT

First pairwise alignments:
GARFIELD THE LAST FAT CAT
GARFIELD THE FAST CAT ---

GARFIELD THE VERY FAST CAT
-------- THE ---- FA-T CAT

Final alignment:
GARFIELD THE LAST FA-T CAT
GARFIELD THE FAST CA-T ---
GARFIELD THE VERY FAST CAT
-------- THE ---- FA-T CAT

Building the Right Multiple Sequence Alignment.

Recognizing The Right Sequences When you Meet Them…

Gathering Sequences: BLAST

Common Mistake: Sequences Too Closely Related

PRVA_MACFU  SMTDLLNAEDIKKAVGAFSAIDSFDHKKFFQMVGLKKKSADDVKKVFHILDKDKSGFIEE
PRVA_HUMAN  SMTDLLNAEDIKKAVGAFSATDSFDHKKFFQMVGLKKKSADDVKKVFHMLDKDKSGFIEE
PRVA_GERSP  SMTDLLSAEDIKKAIGAFAAADSFDHKKFFQMVGLKKKTPDDVKKVFHILDKDKSGFIEE
PRVA_MOUSE  SMTDVLSAEDIKKAIGAFAAADSFDHKKFFQMVGLKKKNPDEVKKVFHILDKDKSGFIEE
PRVA_RAT    SMTDLLSAEDIKKAIGAFTAADSFDHKKFFQMVGLKKKSADDVKKVFHILDKDKSGFIEE
PRVA_RABIT  AMTELLNAEDIKKAIGAFAAAESFDHKKFFQMVGLKKKSTEDVKKVFHILDKDKSGFIEE

PRVA_MACFU  DELGFILKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES
PRVA_HUMAN  DELGFILKGFSPDARDLSAKETKMLMAAGDKDGDGKIGVDEFSTLVAES
PRVA_GERSP  DELGFILKGFSSDARDLSAKETKTLLAAGDKDGDGKIGVEEFSTLVSES
PRVA_MOUSE  DELGSILKGFSSDARDLSAKETKTLLAAGDKDGDGKIGVEEFSTLVAES
PRVA_RAT    DELGSILKGFSSDARDLSAKETKTLMAAGDKDGDGKIGVEEFSTLVAES
PRVA_RABIT  EELGFILKGFSPDARDLSVKETKTLMAAGDKDGDGKIGADEFSTLVSES

-IDENTICAL SEQUENCES BRING NO INFORMATION FOR THE MULTIPLE SEQUENCE ALIGNMENT
-MULTIPLE SEQUENCE ALIGNMENTS THRIVE ON DIVERSITY…
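
Redundancy of this kind is easy to detect before aligning. The sketch below computes percent identity between aligned rows and keeps a sequence only if it is not too close to one already kept. The 90 % cut-off and the function names are assumptions, not values from the slides, and only the first 20 columns of three sequences are used to keep the example short.

```python
def percent_identity(row_a, row_b):
    """Percent of identical residues over the columns where neither row has a gap."""
    pairs = [(x, y) for x, y in zip(row_a, row_b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)

def drop_redundant(aligned, max_identity=90.0):
    """Greedy filter: keep a sequence only if no kept sequence exceeds the cut-off."""
    kept = {}
    for name, row in aligned.items():
        if all(percent_identity(row, other) <= max_identity
               for other in kept.values()):
            kept[name] = row
    return kept

# First 20 columns of three of the parvalbumin sequences shown above.
aligned = {
    "PRVA_MACFU": "SMTDLLNAEDIKKAVGAFSA",
    "PRVA_HUMAN": "SMTDLLNAEDIKKAVGAFSA",
    "PRVA_RABIT": "AMTELLNAEDIKKAIGAFAA",
}

kept = drop_redundant(aligned)
print(list(kept))
print(percent_identity(aligned["PRVA_MACFU"], aligned["PRVA_RABIT"]))
```

The result depends on the input order because the filter is greedy. That is acceptable for thinning a dataset by hand but it is not a clustering method.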

Sequence Weighting Within ClustalW

Selecting Diverse Sequences (Opus II)

Respect Information!

PRVA_MACFU  ------------------------------------------SMTDLLN----AEDIKKA
PRVA_HUMAN  ------------------------------------------SMTDLLN----AEDIKKA
PRVA_GERSP  ------------------------------------------SMTDLLS----AEDIKKA
PRVA_MOUSE  ------------------------------------------SMTDVLS----AEDIKKA
PRVA_RAT    ------------------------------------------SMTDLLS----AEDIKKA
PRVA_RABIT  ------------------------------------------AMTELLN----AEDIKKA
TPCC_MOUSE  MDDIYKAAVEQLTEEQKNEFKAAFDIFVLGAEDGCISTKELGKVMRMLGQNPTPEELQEM

PRVA_MACFU  VGAFSAIDS--FDHKKFFQMVG------LKKKSADDVKKVFHILDKDKSGFIEEDELGFI
PRVA_HUMAN  VGAFSATDS--FDHKKFFQMVG------LKKKSADDVKKVFHMLDKDKSGFIEEDELGFI
PRVA_GERSP  IGAFAAADS--FDHKKFFQMVG------LKKKTPDDVKKVFHILDKDKSGFIEEDELGFI
PRVA_MOUSE  IGAFAAADS--FDHKKFFQMVG------LKKKNPDEVKKVFHILDKDKSGFIEEDELGSI
PRVA_RAT    IGAFTAADS--FDHKKFFQMVG------LKKKSADDVKKVFHILDKDKSGFIEEDELGSI
PRVA_RABIT  IGAFAAAES--FDHKKFFQMVG------LKKKSTEDVKKVFHILDKDKSGFIEEEELGFI
TPCC_MOUSE  IDEVDEDGSGTVDFDEFLVMMVRCMKDDSKGKSEEELSDLFRMFDKNADGYIDLDELKMM

-This Alignment Is Not Informative about the relation Between TPCC_MOUSE and the rest of the sequences.
-A better Spread of the Sequences is needed

Selecting Diverse Sequences (Opus II)

PRVB_CYPCA  -AFAGVLNDADIAAALEACKAADSFNHKAFFAKVGLTSKSADDVKKAFAIIDQDKSGFIE
PRVB_BOACO  -AFAGILSDADIAAGLQSCQAADSFSCKTFFAKSGLHSKSKDQLTKVFGVIDRDKSGYIE
PRV1_SALSA  MACAHLCKEADIKTALEACKAADTFSFKTFFHTIGFASKSADDVKKAFKVIDQDASGFIE
PRVB_LATCH  -AVAKLLAAADVTAALEGCKADDSFNHKVFFQKTGLAKKSNEELEAIFKILDQDKSGFIE
PRVB_RANES  -SITDIVSEKDIDAALESVKAAGSFNYKIFFQKVGLAGKSAADAKKVFEILDRDKSGFIE
PRVA_MACFU  -SMTDLLNAEDIKKAVGAFSAIDSFDHKKFFQMVGLKKKSADDVKKVFHILDKDKSGFIE
PRVA_ESOLU  --AKDLLKADDIKKALDAVKAEGSFNHKKFFALVGLKAMSANDVKKVFKAIDADASGFIE

PRVB_CYPCA  EDELKLFLQNFKADARALTDGETKTFLKAGDSDGDGKIGVDEFTALVKA-
PRVB_BOACO  EDELKKFLQNFDGKARDLTDKETAEFLKEGDTDGDGKIGVEEFVVLVTKG
PRV1_SALSA  VEELKLFLQNFCPKARELTDAETKAFLKAGDADGDGMIGIDEFAVLVKQ-
PRVB_LATCH  DEELELFLQNFSAGARTLTKTETETFLKAGDSDGDGKIGVDEFQKLVKA-
PRVB_RANES  QDELGLFLQNFRASARVLSDAETSAFLKAGDSDGDGKIGVEEFQALVKA-
PRVA_MACFU  EDELGFILKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES
PRVA_ESOLU  EEELKFVLKSFAADGRDLTDAETKAFLKAADKDGDGKIGIDEFETLVHEA

-A REASONABLE Model Now Exists.
-Going Further: Remote Homologues.

Aligning Remote Homologues

PRVA_MACFU  ------------------------------------------SMTDLLNA----EDIKKA
PRVA_ESOLU  -------------------------------------------AKDLLKA----DDIKKA
PRVB_CYPCA  ------------------------------------------AFAGVLND----ADIAAA
PRVB_BOACO  ------------------------------------------AFAGILSD----ADIAAG
PRV1_SALSA  -----------------------------------------MACAHLCKE----ADIKTA
PRVB_LATCH  ------------------------------------------AVAKLLAA----ADVTAA
PRVB_RANES  ------------------------------------------SITDIVSE----KDIDAA
TPCS_RABIT  -TDQQAEARSYLSEEMIAEFKAAFDMFDADGG-GDISVKELGTVMRMLGQTPTKEELDAI
TPCS_PIG    -TDQQAEARSYLSEEMIAEFKAAFDMFDADGG-GDISVKELGTVMRMLGQTPTKEELDAI
TPCC_MOUSE  MDDIYKAAVEQLTEEQKNEFKAAFDIFVLGAEDGCISTKELGKVMRMLGQNPTPEELQEM

PRVA_MACFU  VGAFSAIDS--FDHKKFFQMVG------LKKKSADDVKKVFHILDKDKSGFIEEDELGFI
PRVA_ESOLU  LDAVKAEGS--FNHKKFFALVG------LKAMSANDVKKVFKAIDADASGFIEEEELKFV
PRVB_CYPCA  LEACKAADS--FNHKAFFAKVG------LTSKSADDVKKAFAIIDQDKSGFIEEDELKLF
PRVB_BOACO  LQSCQAADS--FSCKTFFAKSG------LHSKSKDQLTKVFGVIDRDKSGYIEEDELKKF
PRV1_SALSA  LEACKAADT--FSFKTFFHTIG------FASKSADDVKKAFKVIDQDASGFIEVEELKLF
PRVB_LATCH  LEGCKADDS--FNHKVFFQKTG------LAKKSNEELEAIFKILDQDKSGFIEDEELELF
PRVB_RANES  LESVKAAGS--FNYKIFFQKVG------LAGKSAADAKKVFEILDRDKSGFIEQDELGLF
TPCS_RABIT  IEEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECFRIFDRNADGYIDAEELAEI
TPCS_PIG    IEEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECFRIFDRNMDGYIDAEELAEI
TPCC_MOUSE  IDEVDEDGSGTVDFDEFLVMMVRCMKDDSKGKSEEELSDLFRMFDKNADGYIDLDELKMM

PRVA_MACFU  LKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES-
PRVA_ESOLU  LKSFAADGRDLTDAETKAFLKAADKDGDGKIGIDEFETLVHEA-
PRVB_CYPCA  LQNFKADARALTDGETKTFLKAGDSDGDGKIGVDEFTALVKA--
PRVB_BOACO  LQNFDGKARDLTDKETAEFLKEGDTDGDGKIGVEEFVVLVTKG-
PRV1_SALSA  LQNFCPKARELTDAETKAFLKAGDADGDGMIGIDEFAVLVKQ--
PRVB_LATCH  LQNFSAGARTLTKTETETFLKAGDSDGDGKIGVDEFQKLVKA--
PRVB_RANES  LQNFRASARVLSDAETSAFLKAGDSDGDGKIGVEEFQALVKA--
TPCS_RABIT  FR---ASGEHVTDEEIESLMKDGDKNNDGRIDFDEFLKMMEGVQ
TPCS_PIG    FR---ASGEHVTDEEIESIMKDGDKNNDGRIDFDEFLKMMEGVQ
TPCC_MOUSE  LQ---ATGETITEDDIEELMKDGDKNNDGRIDYDEFLEFMKGVE

Some Guidelines …

Do Not Use Too Many Sequences…

Reading Your Alignment

Going Further…

PRVA_MACFU  VGAFSAIDS--FDHKKFFQMVG------LKKKSADDVKKVFHILDKDKSGFIEEDELGFI
PRVB_BOACO  LQSCQAADS--FSCKTFFAKSG------LHSKSKDQLTKVFGVIDRDKSGYIEEDELKKF
PRV1_SALSA  LEACKAADT--FSFKTFFHTIG------FASKSADDVKKAFKVIDQDASGFIEVEELKLF
TPCS_RABIT  IEEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECFRIFDRNADGYIDAEELAEI
TPCS_PIG    IEEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECFRIFDRNMDGYIDAEELAEI
TPCC_MOUSE  IDEVDEDGSGTVDFDEFLVMMVRCMKDDSKGKSEEELSDLFRMFDKNADGYIDLDELKMM
TPC_PATYE   SDEMDEEATGRLNCDAWIQLFER---KLKEDLDERELKEAFRVLDKEKKGVIKVDVLRWI

PRVA_MACFU  LKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES--
PRVB_BOACO  LQNFDGKARDLTDKETAEFLKEGDTDGDGKIGVEEFVVLVTKG--
PRV1_SALSA  LQNFCPKARELTDAETKAFLKAGDADGDGMIGIDEFAVLVKQ---
TPCS_RABIT  FR---ASGEHVTDEEIESLMKDGDKNNDGRIDFDEFLKMMEGVQ-
TPCS_PIG    FR---ASGEHVTDEEIESIMKDGDKNNDGRIDFDEFLKMMEGVQ-
TPCC_MOUSE  LQ---ATGETITEDDIEELMKDGDKNNDGRIDYDEFLEFMKGVE-
TPC_PATYE   LS---SLGDELTEEEIENMIAETDTDGSGTVDYEEFKCLMMSSDA

WHAT MAKES A GOOD ALIGNMENT…
-THE MORE DIVERGENT THE SEQUENCES, THE BETTER
-THE FEWER INDELS, THE BETTER
-NICE UNGAPPED BLOCKS SEPARATED WITH INDELS
-DIFFERENT CLASSES OF RESIDUES WITHIN A BLOCK:
•Completely Conserved
•Conserved For Size and Hydropathy
•Conserved For Size or Hydropathy
-THE ULTIMATE EVALUATION IS A MATTER OF PERSONAL JUDGEMENT AND KNOWLEDGE.
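
Two of these criteria, ungapped blocks and completely conserved columns, can be read off an alignment mechanically. The sketch below applies them to the first block of the chite/wheat/trybr/mouse example. The minimum block length of 3 and the function names are assumptions, and the size/hydropathy classes are deliberately left out because they need a residue classification that the slides do not spell out.

```python
def ungapped_blocks(rows, min_length=3):
    """Return (start, end) column ranges, end exclusive, that contain no gap."""
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all rows must have the same length")
    blocks, start = [], None
    for col in range(width + 1):
        gap_free = col < width and all(row[col] != "-" for row in rows)
        if gap_free and start is None:
            start = col
        elif not gap_free and start is not None:
            if col - start >= min_length:
                blocks.append((start, col))
            start = None
    return blocks

def conserved_columns(rows):
    """Return the 0-based columns where every row carries the same residue."""
    return [col for col, residues in enumerate(zip(*rows))
            if "-" not in residues and len(set(residues)) == 1]

rows = [
    "---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD",  # chite
    "--DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE",  # wheat
    "KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP",  # trybr
    "-----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP",  # mouse
]

blocks = ungapped_blocks(rows)
conserved = conserved_columns(rows)
print(blocks)
print(conserved)
```

The alignment falls into three ungapped blocks separated by indels, and the conserved P-K-R motif sits at the start of the first one.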

Potential Difficulties

DO NOT OVERTUNE!!!
Image: alignment of chite, wheat, trybr and mouse (as on the title slide)
-DO NOT PLAY WITH PARAMETERS IF YOU KNOW THE ALIGNMENT YOU WANT: MAKE IT YOURSELF!

chite       ---ADKPKRPL-SAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat       --DPNKPKRAP-SAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr       KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse       -----KPKRPR-SAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP

chite       AATAKQNYIRALQEYERNGG-
wheat       ANKLKGEYNKAIAAYNKGESA
trybr       AEKDKERYKREM---------
mouse       AKDDRIRYDNEMKSWEEQMAE

TUNING or NOT TUNING!!!
-PARAMETERS TO TUNE USUALLY INCLUDE:
•GOP/GEP
•MATRIX
•SENSITIVITY Vs SPEED

Substitution Matrices (Etzold et al. 1993)
Image: plot with GOP and GEP axes
Gonnet    61.7 %
Blosum50  59.7 %
Pam250    59.2 %
-MOST METHODS ARE TUNED FOR WORKING WELL ON AVERAGE
-PARAMETER BEHAVIOUR DOES NOT NECESSARILY FOLLOW THE THEORY (e.g. Substitution Matrices).
-A GOOD ALIGNMENT IS USUALLY ROBUST (i.e. Changes little).
-TUNE IF YOU WANT TO CONVINCE YOURSELF.

KEEP A BIOLOGICAL PERSPECTIVE

chite       ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat       --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr       KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse       -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP

DIFFERENT PARAMETERS

chite       AD--K----PKR-PLYMLWLNS-ARESIKRENPDFK-VT-EVAKKGGELWRGL-
wheat       -DPNK----PKRAP-FFVFMGE-FREEFKQKNPKNKSVA-AVGKAAGERWKSLS
trybr       -K--KDSNAPKR-AMT-MFFSSDFR-S-KH-S-DLS-IV-EMSKAAGAAWKELG
mouse       ----K----PKR-PRYNIYVSESFQEA-K--D-D-S-AQGKL-KLVNEAWKNLS

WRONG ALIGNMENT !!!

REPEATS
-THERE IS A PROBLEM WHEN TWO SEQUENCES DO NOT CONTAIN THE SAME NUMBER OF REPEATS
-IT IS THEN BETTER TO MANUALLY EXTRACT THE REPEATS AND TO ALIGN THEM.
-INDIVIDUAL REPEATS CAN BE RECOGNIZED USING DOTTER
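
The principle behind a dot plot can be sketched in a few lines: compare a sequence with itself and look for diagonals away from the main one. This is not Dotter, which uses scoring matrices and a graphical display. The exact-match window of 4 residues, the function names and the toy sequence are all assumptions made for the illustration.

```python
from collections import Counter

def dot_plot(seq_a, seq_b, window=4):
    """Return the set of (i, j) positions where two windows are identical."""
    return {
        (i, j)
        for i in range(len(seq_a) - window + 1)
        for j in range(len(seq_b) - window + 1)
        if seq_a[i:i + window] == seq_b[j:j + window]
    }

def repeat_offsets(sequence, window=4):
    """Count the dots on each diagonal above the main diagonal of a self-comparison."""
    return Counter(j - i for i, j in dot_plot(sequence, sequence, window) if j > i)

# Toy sequence carrying the same six-residue unit twice (positions 3 and 15).
toy = "MKT" + "DKDGDG" + "AEELLK" + "DKDGDG" + "PQR"

offsets = repeat_offsets(toy)
print(offsets.most_common(1))
```

A diagonal with many dots at a given offset marks two copies of a repeat that far apart. Once located, the copies can be cut out and aligned with each other as recommended above.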

Naming Your Sequences The Right Way

What Are The Available Methods ???

Simultaneous Alignments: MSA
1) Set Bounds on each pair of sequences (Carrillo and Lipman)
2) Compute the Multiple Alignment within the Hyperspace
-Few Small Closely Related Sequences.
-Do Well When They Can Run.
-Memory and CPU hungry

Simultaneous Alignments: DCA
-Few Small Closely Related Sequences, but less limited than MSA
-Do Well When They Can Run.
-Memory and CPU hungry, but less than MSA

Dialign

Dialign II
1) Identify the best chain of segments on each pair of sequences. Assign a P-value to each Segment Pair.
2) Re-evaluate each segment pair according to its consistency with the others
3) Assemble the alignment according to the segment pairs.

Dialign II
-May Align Too Few Residues
-No Gap Penalty
-Does well with ESTs

Dialign II
bibiserv.techfak.uni-bielefeld.de/dialign/submission.html

Muscle

Iterative Methods
-HMMs, HMMER, SAM, MUSCLE
-Slow, Sometimes Inaccurate
-Good Profile Generators

MUSCLE
phylogenomics.berkeley.edu/cgi-bin/muscle/input_muscle.py

T-Coffee

Mixing Local and Global Alignments
-Local Alignment
-Global Alignment
-Extension
-Multiple Sequence Alignment

Mixing Heterogeneous Data With T-Coffee
-Local Alignment
-Global Alignment
-Multiple Alignment
-Specialist
-Structural
-Multiple Sequence Alignment

Mixing Sequences and Structures with T-Coffee
-Seq Vs Seq: Local, Global
-Seq Vs Struct: Thread
-Struct Vs Struct: Superpose
Evaluation on Homstrad

What is the Local Quality of my Alignment
Image: panels I and II

T-Coffee
igs-server.cnrs-mrs.fr/Tcoffee/

DBClustal

DBClustal
BlastP

Expasy BLAST
www.expasy.org/tools/blast/

Choosing the right method

Situation => Solution

Priority => Solution
Image: table of Method against Priority (Trees, Profile, 2D-Pred, 3D-Pred, Func-Pred, Accuracy, Speed)

Purpose => Solution

Conclusion

Multiple Alignment
-The BEST alignment Method: Your Brain, The Right Data
-The Best Evaluation Procedure: Experimental Data (SwissProt)
-Choosing The Sequences Well is Important
-Beware of repeated elements

Multiple Alignment
Know Your Problem: What do you want to do with your MSA?

Addresses
MAFFT   Progressive/Iterative     www.biophys.kyoto-u.jp/katoh
POA     Progressive/Simultaneous  www.bioinformatics.ucla.edu/poa
MUSCLE  Progressive/Iterative     www.drive5.com/muscle

BaliBase
What Is BaliBase
PROBLEM / Description
-Even Phylogenetic Spread
-One Outlier Sequence
-Two Distantly Related Groups
-Long Internal Indel
-Long Terminal Indel
Source: BaliBase, Thompson et al., NAR, 1999.

Which Method?
PROBLEM / Strategy
Strategies: ClustalW, T-Coffee, MSA, DCA, PrrP, Dialign
Source: BaliBase, Thompson et al., NAR, 1999.

Methods / Situations
1-Carrillo and Lipman:
-MSA, DCA.
-Few Small Closely Related Sequences.
-Do Well When They Can Run.
2-Segment Based:
-DIALIGN, MACAW.
-May Align Too Few Residues
-Good For Long Indels
3-Iterative:
-HMMs, HMMER, SAM.
-Slow, Sometimes Inaccurate
-Good Profile Generators
4-Progressive:
-ClustalW, Pileup, Multalign…
-Fast and Sensitive
