Diapositiva 2012tm-Coffee-130709120840-Phpapp02

“Homology-enhanced probabilistic consistency” multiple sequence alignment:
a case study on transmembrane proteins
Jia-Ming Chang
2013-July-09
Chang J-M, Di Tommaso P, Taly J-F, Notredame C: Accurate multiple sequence alignment of transmembrane proteins with PSI-Coffee. BMC Bioinformatics 2012, 13.
Transmembrane protein
Membrane proteins are likely to constitute 20-30% of all ORFs
contained in genomes.
Odorant receptors
Richard Benton, “Eppendorf winner. Evolution and revolution in odor detection,” Science (New York, N.Y.)
326, no. 5951 (October 16, 2009): 382-383.
Transmembrane protein multiple sequence alignment
• 1994: alignment of transmembrane proteins is first addressed
– Cserzo M, Bernassau JM, Simon I, Maigret B: New alignment strategy for transmembrane proteins. J Mol Biol 1994, 243(3):388-396.
• Few multiple sequence alignment programs so far: only 3
– Shafrir Y, Guy HR: STAM: simple transmembrane alignment method. Bioinformatics 2004, 20(5):758-769.
– Forrest LR, Tang CL, Honig B: On the accuracy of homology modeling and sequence alignment methods applied to membrane proteins. Biophys J 2006, 91(2):508-517.
– Pirovano W, Feenstra KA, Heringa J: PRALINETM: a strategy for improved multiple alignment of transmembrane proteins. Bioinformatics 2008, 24(4):492-497.
BAliBASE 2.0 reference 7
Pirovano W, Feenstra KA, Heringa J: PRALINETM: a strategy for improved multiple alignment of transmembrane proteins. Bioinformatics 2008, 24(4):492-497.
We need an accurate Transmembrane MSA!
Homology-extended
Simossis VA, Kleinjung J, Heringa J: Homology-extended sequence alignment. Nucleic Acids Res 2005, 33(3):816-824.
Pair-hidden Markov Model
Emission probabilities, which correspond to traditional substitution scores, are based on the BLOSUM62 matrix.
Do CB, Mahabhashyam MS, Brudno M, Batzoglou S: ProbCons: Probabilistic consistency-based multiple sequence alignment. Genome Res 2005, 15(2):330-340.
Probabilistic consistency
transformation
Homology-extended probabilistic consistency
The new emission probabilities are defined as

$$p'(x_i, y_j) = \sum_{m=1}^{20} \sum_{n=1}^{20} \alpha_m \, \beta_n \, p(\mathrm{A.A.}_m, \mathrm{A.A.}_n)$$

where α_m is the frequency with which residue m appears at position i, β_n is the frequency with which residue n appears at position j, and p(A.A._m, A.A._n) is the original emission probability in ProbCons.
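The emission formula above is a frequency-weighted average of the original pairwise emission probabilities over the two profile columns. A minimal Python sketch follows, assuming a hypothetical three-letter alphabet and invented joint probabilities in place of the 20 amino acids and the BLOSUM62-derived values; the names `PAIR_PROB` and `profile_emission` are illustrative only.

```python
# Hypothetical 3-letter alphabet (the real model uses the 20 amino acids).
ALPHABET = ["A", "L", "K"]

# Invented joint emission probabilities p(m, n): symmetric, summing to 1.
# These are NOT the BLOSUM62-derived values used by ProbCons.
PAIR_PROB = [
    [0.20, 0.05, 0.05],
    [0.05, 0.25, 0.05],
    [0.05, 0.05, 0.25],
]


def profile_emission(alpha, beta, pair_prob):
    """Return sum over m, n of alpha[m] * beta[n] * pair_prob[m][n]."""
    return sum(
        alpha[m] * beta[n] * pair_prob[m][n]
        for m in range(len(alpha))
        for n in range(len(beta))
    )


# Both columns contain only 'A': the formula reduces to p(A, A).
single = profile_emission([1.0, 0.0, 0.0], [1.0, 0.0, 0.0], PAIR_PROB)

# Column i is half 'A' and half 'L'; column j contains only 'L'.
mixed = profile_emission([0.5, 0.5, 0.0], [0.0, 1.0, 0.0], PAIR_PROB)

print(single, mixed)
```

When a profile column holds a single residue, the weighted sum collapses to the ordinary sequence-to-sequence emission probability, so the profile version can be read as a generalisation of the original one.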
Homology-extended probabilistic consistency

$$P(x_i \sim y_j \in a^* \mid x, y) \leftarrow \frac{1}{|S|} \sum_{z \in S} \sum_{z_k} \alpha_i \gamma_k \, P(x_i \sim z_k \in a^* \mid x, z) \cdot \beta_j \gamma_k \, P(z_k \sim y_j \in a^* \mid z, y)$$

where α_i, β_j, and γ_k are the profile frequencies.
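The transformation above re-estimates the posterior for a pair of positions by summing over every path through an intermediate sequence z. The sketch below shows one way it could be evaluated, assuming that each of the three profile terms is a single scalar weight per position (an interpretation, since the slide does not spell this out) and that the pairwise posterior matrices are already available. The function name and the toy numbers are hypothetical.

```python
def consistency_transform(p_xz_list, p_zy_list, alpha, beta, gammas):
    """Weighted consistency transformation for one sequence pair (x, y).

    p_xz_list[s][i][k]: posterior that x_i aligns with z_k, intermediate s
    p_zy_list[s][k][j]: posterior that z_k aligns with y_j, intermediate s
    alpha[i], beta[j], gammas[s][k]: per-position weights (assumed scalars)
    """
    n_x, n_y = len(alpha), len(beta)
    total = [[0.0] * n_y for _ in range(n_x)]
    for p_xz, p_zy, gamma in zip(p_xz_list, p_zy_list, gammas):
        for i in range(n_x):
            for j in range(n_y):
                total[i][j] += sum(
                    alpha[i] * gamma[k] * p_xz[i][k]
                    * beta[j] * gamma[k] * p_zy[k][j]
                    for k in range(len(gamma))
                )
    n_intermediates = len(p_xz_list)  # plays the role of |S|
    return [[value / n_intermediates for value in row] for row in total]


# Toy data (invented): one intermediate sequence z of length 2.
IDENTITY = [[1.0, 0.0], [0.0, 1.0]]  # x aligns perfectly with z
P_ZY = [[0.9, 0.1], [0.2, 0.8]]      # posteriors between z and y

# With all weights equal to 1 the result is the matrix product P_xz * P_zy.
unweighted = consistency_transform(
    [IDENTITY], [P_ZY], [1.0, 1.0], [1.0, 1.0], [[1.0, 1.0]]
)

# Down-weighting position 0 of z shrinks every path that goes through it.
weighted = consistency_transform(
    [IDENTITY], [P_ZY], [1.0, 1.0], [1.0, 1.0], [[0.5, 1.0]]
)
```

With all weights set to 1, the sketch reduces to the unweighted ProbCons-style transformation, which makes it easy to see what the profile terms add.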

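Homology-extended
Question 1: how to build a profile?
Question 2: how to score profiles?
Simossis VA, Kleinjung J, Heringa J: Homology-extended sequence alignment. Nucleic Acids Res 2005, 33(3):816-824.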
Question 1: how to build a profile?
• Database size
• Search parameters
– E-value: the most commonly used; other useful options:
1. Matrix file: -M
2. Filter the query sequence for low-complexity subsequences: -F
3. Neighborhood word threshold: -f
4. Truncate the report to a given number of alignments: -b
Word hit & Neighborhood
Searching parameters
• Fast, insensitive search
– High percent identity
– blastp -F "m S" -f 999 -M BLOSUM80 -G 9 -E 2 -e 1e-5
• Slow, sensitive search
– Increases sensitivity, decreases specificity
– blastp -F "m S" -f 9 -M BLOSUM45 -e 100 -b 10000 -v 10000
• Source: the book “BLAST”, pages 146-147
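The two search presets can be kept as data so that they are easy to switch between. The sketch below only assembles the argument lists and does not run anything. The option values are copied from the slide, and the program name `blastp` is kept as written there. The flags appear to be legacy NCBI BLAST options, and the query, database and output options are omitted because the slide does not show them. `build_command` is a hypothetical helper.

```python
import shlex

# Option sets copied from the slide.
FAST_INSENSITIVE = ["-F", "m S", "-f", "999", "-M", "BLOSUM80",
                    "-G", "9", "-E", "2", "-e", "1e-5"]
SLOW_SENSITIVE = ["-F", "m S", "-f", "9", "-M", "BLOSUM45",
                  "-e", "100", "-b", "10000", "-v", "10000"]


def build_command(preset, program="blastp"):
    """Return the argument list for one preset (hypothetical helper)."""
    return [program] + list(preset)


fast_cmd = build_command(FAST_INSENSITIVE)
slow_cmd = build_command(SLOW_SENSITIVE)

# shlex.join quotes the "m S" filter string so it stays a single argument.
print(shlex.join(fast_cmd))
print(shlex.join(slow_cmd))
```

Keeping the presets as lists rather than strings avoids quoting mistakes around the `-F "m S"` argument if the list is later handed to `subprocess.run`.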

Different databases
• NCBI non-redundant (NR)
• UniProt (release 15.15, 2010), also clustered as UniRef50, UniRef90 and UniRef100
• TM subsets of UniRef50, UniRef90, UniRef100 and UniProt, selected with keyword:"Transmembrane [KW-0812]"
Database Size
Databases: NCBI non-redundant (NR) and UniProt (release 15.15, 2010). TM subsets are selected with keyword:"Transmembrane [KW-0812]".

Data Set        No. of sequences
UniRef50-TM     87,989
UniRef90-TM     263,306
UniRef100-TM    613,015
UniProt-TM      818,635
UniRef50        3,077,464
UniRef90        6,544,144
UniRef100       9,865,668
UniProt         11,009,767
NCBI NR         10,565,004
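The relative sizes of the databases follow directly from the counts listed above. The short sketch below computes them. The numbers are taken from the slide, while the dictionary and function names are illustrative.

```python
# Sequence counts as listed on the slide.
DB_SIZES = {
    "UniRef50-TM": 87_989,
    "UniRef90-TM": 263_306,
    "UniRef100-TM": 613_015,
    "UniProt-TM": 818_635,
    "UniRef50": 3_077_464,
    "UniRef90": 6_544_144,
    "UniRef100": 9_865_668,
    "UniProt": 11_009_767,
    "NCBI NR": 10_565_004,
}


def fold_reduction(small, large, sizes=DB_SIZES):
    """How many times smaller `small` is than `large`."""
    return sizes[large] / sizes[small]


ratio = fold_reduction("UniRef50-TM", "UniProt")
tm_share = DB_SIZES["UniProt-TM"] / DB_SIZES["UniProt"]

print(f"UniProt is about {ratio:.0f} times larger than UniRef50-TM")
print(f"TM-annotated share of UniProt: {tm_share:.1%}")
```

The first ratio comes out at roughly 125, which is the order of magnitude behind the "about 100 times fewer sequences" figure quoted for UniRef50-TM.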
Performance comparison of different database sizes on BAliBASE2-ref7.
UniRef50-TM contains about 100 times fewer sequences than the full UniProt.
The level of accuracy is comparable, and even superior, to that achieved with the default PSI-Coffee, while the CPU time requirements decrease by a factor of 10.
10% more columns are correctly aligned than with PRALINETM.
The rows Pairs and Cols denote the total numbers of correctly aligned pairs and columns, respectively. The reference alignments contain 3,294,102 pairs and 1,781 columns.
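Counting correctly aligned pairs and columns amounts to comparing a test alignment with its reference. The following is a minimal sketch, assuming both alignments list the same sequences in the same order and use "-" for gaps. It scores every column and ignores any restriction to annotated regions, so it is not the exact benchmark procedure. All function names are hypothetical.

```python
from itertools import combinations


def columns(alignment):
    """Turn gapped rows into columns of residue indices (None marks a gap)."""
    counters = [0] * len(alignment)
    cols = []
    for c in range(len(alignment[0])):
        col = []
        for s, row in enumerate(alignment):
            if row[c] == "-":
                col.append(None)
            else:
                col.append(counters[s])
                counters[s] += 1
        cols.append(tuple(col))
    return cols


def aligned_pairs(cols):
    """Set of (seq_a, residue_a, seq_b, residue_b) placed in the same column."""
    pairs = set()
    for col in cols:
        for a, b in combinations(range(len(col)), 2):
            if col[a] is not None and col[b] is not None:
                pairs.add((a, col[a], b, col[b]))
    return pairs


def count_correct(reference, test):
    """Return (correct pairs, correct columns) of `test` against `reference`."""
    ref_cols, test_cols = columns(reference), columns(test)
    correct_pairs = len(aligned_pairs(ref_cols) & aligned_pairs(test_cols))
    test_set = set(test_cols)
    correct_cols = sum(1 for col in ref_cols if col in test_set)
    return correct_pairs, correct_cols


reference = ["AC-G", "ACTG"]
test = ["A-CG", "ACTG"]
print(count_correct(reference, test))       # → (2, 2)
print(count_correct(reference, reference))  # → (3, 4)
```

Summing the two counts over all reference alignments of a benchmark gives totals of the kind reported in the Pairs and Cols rows.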
BAliBASE 3.0
The performance figures for the other methods are from Rausch et al. The SP and TC scores of the full-length sequences are evaluated on the core blocks (as defined in the XML files).
Question 2: how to score profiles?
Edgar RC, Sjolander K: A comparison of scoring functions for protein sequence profile
alignment. Bioinformatics 2004, 20(8):1301-1308.
• Prediction mode: -template_file PSITM
• Output: -output tm_html
This output was obtained for Or94b of D. melanogaster and its orthologs in other Drosophila species.
Notably, the predicted topology of the Or94b set is consistent with the conclusion of Benton et al.
http://tcoffee.crg.cat/tmcoffee
Paolo Di Tommaso
