
Lecture 7: Multiple Sequence Alignment (MSA)

Outline:
Motivation for and challenge of MSA
Sum of Pairs (SP) method
Progressive MSA: ClustalW algorithm

What is Multiple Sequence Alignment?

Multiple sequence alignment is the alignment of N sequences (amino acids/nucleotides), where N > 2
Goal: to write each sequence along the others to express any similarity between the sequences
Each element of a sequence is placed alongside either a corresponding element in the other sequences or a gap character
Example: TGCG, AGCTG, and AGCG can be aligned as follows:

T-GC-G
-AGCTG
-AGC-G


Problems:
How do we efficiently find this alignment?
Can we find a better alignment?

Some of the notes are derived from slides by Dr. Donald R. Williamson at the University of Delaware. Some slides are adapted from slides created by Dr. Keith Dunker.

Motivation for Multiple Sequence Alignment


The addition of related sequences to a pair-wise alignment facilitates the identification of subsequences of high functional importance. Why?
Example: aligning genomic sequences of related organisms to identify conserved non-coding regulatory sequence elements

Species 1: -GGGACGATATGCAATATGAAATT-----------Gene
Species 2: -GACGCGATATGCTCCGATTAAGT-----------Gene
Conserved motif: CGATATGC

It is not always so easy to pick out conserved motifs. How about this example?

Species 1: --GGGACGATATGCAATATGAAATT----------Gene
Species 2: --AGGACCTTATATATTAGCAATGT----------Gene
Species 3: --TGGACGTTATCACAGTTTGTCCG----------Gene
Conserved motif: GGACSWTAT (IUPAC codes: S = C or G, W = A or T)

Motivation for Multiple Sequence Alignment (continued)


Multiple sequence alignment can find biologically important sequence similarities that may be widely dispersed or hidden in the sequences
Multiple sequence alignment can provide information about the evolutionary history of the respective sequences
Multiple sequence alignment can give insight into the basis for sequence similarities between homologous sequences

Example of a Multiple Sequence Alignment (MSA)

Figure: Baeyer-Villiger monooxygenases (BVMOs), taken from Fraaije et al. (2002) FEBS Letters 518:43-47

Efficiently Computing a MSA: The Complexity Problem

Adding sequences results in an exponential increase in the number of computations required to find the optimal alignment
For m-wise comparisons, even the dynamic programming methods quickly break down
Example: number of comparisons made to align m protein sequences, each 300 amino acids in length

m = 2: 90,000 comparisons
m = 3: 2.7 x 10^7 comparisons
m = 4: ~ 8 x 10^9 comparisons
m = 5: ~ 2.4 x 10^12 comparisons
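The comparison counts above follow from the size of the dynamic programming lattice: one cell per combination of positions, so 300^m cells for m sequences of length 300. A minimal sketch that reproduces the numbers, assuming all sequences have equal length (the function name `dp_comparisons` is hypothetical):

```python
def dp_comparisons(num_sequences, length=300):
    """Number of cells in the m-dimensional DP lattice: length ** m."""
    return length ** num_sequences


for m in range(2, 6):
    # Scientific notation makes the exponential growth easy to see.
    print(f"m = {m}: {dp_comparisons(m):.1e} comparisons")
```

Each added sequence multiplies the work by the sequence length, which is why exact dynamic programming becomes impractical beyond a handful of sequences.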

A Solution: Dynamic Programming (DP) and the Carrillo-Lipman Bound


In the pair-wise dynamic programming sequence alignment method, the solution path usually falls within a small area around the diagonal of the sequence vs. sequence matrix
If we extend this idea to MSA, we have a multi-dimensional figure (hypercube) instead of a planar (N x N) figure
The Carrillo-Lipman Bound is a procedure that provides a bound, in the form of a polyhedron, around the diagonal of the hypercube
This bound limits the search space for finding the optimal MSA of a set of sequences, leading to a large increase in search efficiency
Method: use DP for MSA, but limit the search space using the Carrillo-Lipman Bound

Dynamic Programming with Carrillo-Lipman Bound

[Figure: dynamic programming matrices for m = 2 and m = 3 sequences, with the Carrillo-Lipman Bound marked]

Scoring MSAs: Sum of Pairs Method

To identify the optimal multiple sequence alignment, we need a scoring method
The Sum of Pairs (SP) scoring method is as follows:
Given:
(1) A set of N aligned sequences, each of length L, in the form of an L x N MSA alignment matrix M
(2) A substitution matrix (PAM or BLOSUM) that gives the score s(x,y) for aligning two characters x, y
Then the SP score SP(m_i) for the i-th column of M (denoted m_i) is calculated using the formula:

$$SP(m_i) = \sum_{k<l} s(m_i^k, m_i^l)$$

where m_i^k is the k-th entry in the i-th column and m_i^l is the l-th entry in the i-th column

The SP score for the complete alignment M is the sum of the scores for each column m_i in the alignment:

$$SP(M) = \sum_i SP(m_i)$$

Sum of Pairs Method: DNA Example

Example: we wish to align the following three DNA sequences:

S1 = TGCG
S2 = AGCTG
S3 = AGCG

We wish to use the SP method to score the following alignments of these three sequences:

Alignment #1    Alignment #2
T-GC-G          TGC-G
-AGCTG          AGCTG
-AGC-G          AGC-G

Sum of Pairs: DNA Example


We will use the following simplified DNA substitution matrix:

s(x,y) = 1: when x = y [match]
s(x,y) = -1: when x ≠ y [mismatch]
s(x,-) = -2: [gap]
s(-,y) = -2: [gap]
s(-,-) = 0: to prevent double counting of gaps

Sum of Pairs: DNA Example (continued)

We will construct the following matrices M for each alignment
The SP score for each alignment is calculated by summing the individual scores for each column in the matrix

Alignment #1:
         m1   m2   m3   m4   m5   m6
S1       T    -    G    C    -    G
S2       -    A    G    C    T    G
S3       -    A    G    C    -    G
SP(mi)   -4   -3   3    3    -4   3      SP(M) = -2

For example: SP(m1) = s(T,-) + s(T,-) + s(-,-) = -2 + -2 + 0 = -4

Alignment #2:
         m1   m2   m3   m4   m5
S1       T    G    C    -    G
S2       A    G    C    T    G
S3       A    G    C    -    G
SP(mi)   -1   3    3    -4   3           SP(M) = 4

For example: SP(m1) = s(T,A) + s(T,A) + s(A,A) = -1 + -1 + 1 = -1

Using the simplified substitution matrix, the Sum of Pairs method ranks the second alignment as the higher-scoring alignment
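The column-by-column calculation can be sketched in a few lines of Python. This is a minimal illustration using the simplified DNA scores from the example (match 1, mismatch -1, gap -2, gap against gap 0); the function names `s`, `sp_column`, and `sp_score` are hypothetical.

```python
from itertools import combinations

GAP = "-"


def s(x, y):
    """Simplified DNA substitution score used in the example."""
    if x == GAP and y == GAP:
        return 0   # gap vs gap: not counted, to avoid double counting
    if x == GAP or y == GAP:
        return -2  # residue aligned with a gap
    return 1 if x == y else -1


def sp_column(column):
    """Sum of s over all unordered pairs (k < l) of entries in one column."""
    return sum(s(x, y) for x, y in combinations(column, 2))


def sp_score(alignment):
    """SP score of an alignment given as equal-length gapped strings."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned rows must have the same length")
    return sum(sp_column(col) for col in zip(*alignment))


alignment_1 = ["T-GC-G", "-AGCTG", "-AGC-G"]
alignment_2 = ["TGC-G", "AGCTG", "AGC-G"]
print(sp_score(alignment_1))  # -2
print(sp_score(alignment_2))  # 4
```

Swapping in a different scoring function only requires replacing `s`, since the pair enumeration does not depend on the alphabet.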

Sum of Pairs: Protein Example

Example: we wish to align the following four amino acid sequences:

S1 = AQPILLLV
S2 = ALRLL
S3 = AKILLL
S4 = DPPVLILV

We wish to use the SP method to score the following alignment of these four sequences:

AQPILLLV
ALR-LL--
AK-ILLL-
DPPVLILV

Use the BLOSUM62 scoring matrix for scoring matches/mismatches and a gap score of -2 [ s(x,-) = s(-,y) = -2 ]

Sum of Pairs: Protein Example (continued)

Using the BLOSUM62 scoring matrix, the score for the 4th column of the alignment M is:

SP(m4) = SP(I,-,I,V) = s(I,-) + s(I,I) + s(I,V) + s(-,I) + s(-,V) + s(I,V) = -2 + 4 + 3 + -2 + -2 + 3 = 4

What is the score for the first column SP(A,A,A,D)? (Note, for BLOSUM62: A,A = 4 and A,D = -2)

What is the score for the first column if we change the first letter of the last sequence from D to A, i.e. SP(A,A,A,A)?
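The column scores asked about here can be checked with a short sketch. It hard-codes only the four BLOSUM62 entries quoted in the example (A,A = 4; A,D = -2; I,I = 4; I,V = 3) rather than loading the full matrix, and it assumes s(-,-) = 0 as in the DNA example; the names `BLOSUM62_SUBSET`, `s`, and `sp_column` are hypothetical.

```python
from itertools import combinations

# Only the BLOSUM62 entries quoted in the example; a real scorer
# would load the complete matrix.
BLOSUM62_SUBSET = {
    ("A", "A"): 4,
    ("A", "D"): -2,
    ("I", "I"): 4,
    ("I", "V"): 3,
}
GAP_SCORE = -2


def s(x, y):
    """Pair score: gap penalty, or a symmetric lookup in the subset."""
    if x == "-" and y == "-":
        return 0  # assumption carried over from the DNA example
    if x == "-" or y == "-":
        return GAP_SCORE
    if (x, y) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(x, y)]
    return BLOSUM62_SUBSET[(y, x)]


def sp_column(column):
    """Sum of pair scores over all unordered pairs in one column."""
    return sum(s(x, y) for x, y in combinations(column, 2))


print(sp_column("I-IV"))  # 4
print(sp_column("AAAD"))  # 6
print(sp_column("AAAA"))  # 24
```

A column of four sequences has six pairs, so SP(A,A,A,A) is 6 x 4 = 24, while replacing one A with D turns three of those pairs into A,D pairs.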

Problems with the Sum of Pairs method

The SP method tends to overweight the effect of a single mutation in a sequence:

SP(A,A,A,A) = 24; SP(A,A,A,D) = 6

More troubling, the relative difference in score due to a single mutation decreases as the number of sequences in the alignment increases:

$$\frac{SP(A^N) - SP(A^{N-1}, D)}{SP(A^N)} = \frac{6(N-1)}{4N(N-1)/2} = \frac{3}{N}$$

We would expect the relative difference to increase the more evidence (sequences) we have for a conserved alanine residue

The SP method requires extensive search time: aligning N sequences of length L has a time complexity of $O(L^N 2^N N^2)$
Even after truncating this search space with the Carrillo-Lipman Bound, the SP method requires extensive search time for many or large sequences

CLUSTALW Method

CLUSTALW is a progressive method for multiple sequence alignment
A progressive MSA method starts by doing pair-wise alignments of all sequences to determine the most related sequences, and aligns these sequences
The progressive MSA method then progressively adds less related sequences or groups of sequences to the initial alignment
Progressive MSA is similar in concept to hierarchical clustering of microarray data
CLUSTAL comes in three versions:
CLUSTAL: gives equal weight to all sequences
CLUSTALW: includes weights for sequences
CLUSTALX: provides a GUI to CLUSTAL

The CLUSTALW Algorithm


Step 1: Determine all pair-wise alignments between sequences and determine the degree of similarity (or distance) between each pair
Step 2: Construct a similarity tree* (also known as a guide tree)
Step 3: Combine the alignments starting from the most closely related sequences/groups to the most distantly related sequences/groups

* The PILEUP program is similar to CLUSTALW, but uses a different method for producing the similarity tree

CLUSTALW Step 1.A


Use a pair-wise alignment method (dynamic programming, etc.) to compute pair-wise alignments among the sequences
The method used in CLUSTAL is implementation dependent
Using the pair-wise alignments, compute a distance between all pairs of sequences. A commonly used method is as follows:
Count the number of non-gapped positions in the pair-wise alignment
Count the number of mismatches between the two sequences
Distance = (number of mismatches)/(number of non-gapped positions)
Example:

seq(i)  NKL-EN
seq(j)  -MLNEN

Distance = d(i,j) = 1/4 = 0.25
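The distance calculation described above can be sketched as follows. This is a minimal illustration of the mismatch-fraction distance, not the code used by any CLUSTAL release; the function name `pairwise_distance` is hypothetical.

```python
def pairwise_distance(aligned_i, aligned_j, gap="-"):
    """Fraction of mismatches among columns where neither sequence has a gap."""
    if len(aligned_i) != len(aligned_j):
        raise ValueError("aligned sequences must have equal length")
    # Keep only the non-gapped columns of the pair-wise alignment.
    pairs = [(a, b) for a, b in zip(aligned_i, aligned_j)
             if a != gap and b != gap]
    if not pairs:
        raise ValueError("no non-gapped positions to compare")
    mismatches = sum(1 for a, b in pairs if a != b)
    return mismatches / len(pairs)


print(pairwise_distance("NKL-EN", "-MLNEN"))  # 0.25
```

In the example, four columns are free of gaps (K/M, L/L, E/E, N/N) and only K/M is a mismatch, giving 1/4.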

CLUSTALW Step 1.B


After computing the distance between all pairs of sequences, we put them into a matrix
Example:

            Hbb_Human   Hbb_Horse   Hba_Human   Hba_Horse   Myg_Phyca   Glb5_Petma  Lgb2_Luplu
Hbb_Human   -
Hbb_Horse   .17         -
Hba_Human   .59         .60         -
Hba_Horse   .59         .59         .13         -
Myg_Phyca   .77         .77         .75         .75         -
Glb5_Petma  .81         .82         .73         .74         .80         -
Lgb2_Luplu  .87         .86         .86         .88         .93         .90         -

Data from Thompson et al. (1994) Nucleic Acids Research 22:4673-4680

CLUSTALW Step 2

Construct a similarity tree (guide tree)
The CLUSTALW package uses the distance matrix and a technique called the Neighbor Joining method to construct the similarity tree
If two or more sequences share a branch, this may indicate an evolutionary relationship between the sequences
The length of each branch indicates the degree of sequence divergence

Example of Similarity Tree

[Figure: similarity (guide) tree for the seven example sequences. Generated from: http://mobyle.pasteur.fr/cgi-bin/portal.py?form=neighbor; visualized using FigTree (http://tree.bio.ed.ac.uk/software/figtree/)]

CLUSTALW Step 3

Combine the alignments starting from the most closely related sequences/groups to the most distantly related sequences/groups by following the similarity/guide tree (from tip to root of guide tree)
In the example we align the sequences in the following order:
(1) align Hbb_Human with Hbb_Horse (group 1)
(2) align Hba_Human with Hba_Horse (group 2)
(3) align group 1 with group 2 (group 3)
(4) align Myg_Phyca with group 3 (group 4)
(5) align Glb5_Petma with group 4 (group 5)
(6) align Lgb2_Luplu with group 5; we have reached the root of the tree
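To see how a merge order like this can fall out of the distance matrix, here is a rough sketch. Note the substitution: ClustalW builds its guide tree with Neighbor Joining, whereas this sketch uses plain average-linkage (UPGMA-style) clustering because it is short enough to read at a glance. The distances are the Thompson et al. (1994) values from the example; the names `dist`, `avg_distance`, and `merge_order` are hypothetical.

```python
from itertools import combinations

NAMES = ["Hbb_Human", "Hbb_Horse", "Hba_Human", "Hba_Horse",
         "Myg_Phyca", "Glb5_Petma", "Lgb2_Luplu"]

# Lower triangle of the distance matrix, one row per sequence.
LOWER = [
    [],
    [.17],
    [.59, .60],
    [.59, .59, .13],
    [.77, .77, .75, .75],
    [.81, .82, .73, .74, .80],
    [.87, .86, .86, .88, .93, .90],
]


def dist(a, b):
    """Look up the distance between two named sequences."""
    i, j = NAMES.index(a), NAMES.index(b)
    if i < j:
        i, j = j, i
    return LOWER[i][j]


def avg_distance(group_a, group_b):
    """Average of all pair-wise distances between two groups."""
    total = sum(dist(a, b) for a in group_a for b in group_b)
    return total / (len(group_a) * len(group_b))


def merge_order():
    """Repeatedly join the two closest groups; return the joins in order."""
    groups = [(name,) for name in NAMES]
    merges = []
    while len(groups) > 1:
        a, b = min(combinations(groups, 2),
                   key=lambda pair: avg_distance(*pair))
        groups.remove(a)
        groups.remove(b)
        groups.append(a + b)
        merges.append((a, b))
    return merges


for step, (a, b) in enumerate(merge_order(), start=1):
    print(step, "+".join(a), "with", "+".join(b))
```

On this matrix the sketch joins the two alpha chains first (distance .13) and the two beta chains second (.17), then the two pairs, then Myg_Phyca, Glb5_Petma, and Lgb2_Luplu in turn. That is the same grouping as the order listed above, with the first two independent steps swapped.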

Aligning groups of sequences in ClustalW


To align groups of sequences:
Build an alignment matrix using the weighted average of the pair-wise scores from the sequences in the groups we want to align
Then use dynamic programming on this matrix to obtain an alignment between the two groups
Sequence weights are based on the branch lengths of the guide tree (we won't go through the full method for calculating weights)

Image from Thompson et al., (1994) Nucleic Acids Research 22:4673-4680
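The weighted-average matrix for two groups might be filled in along the following lines. This is a sketch under stated assumptions, not ClustalW's implementation: the sequence weights are made-up illustrative values rather than values derived from guide-tree branch lengths, the normalization by the sum of weight products is one reasonable reading of "weighted average", and the simplified DNA scores stand in for a real substitution matrix. All function names are hypothetical.

```python
def dna_score(x, y):
    """Simplified DNA scores: match 1, mismatch -1, gap -2, gap/gap 0."""
    if x == "-" and y == "-":
        return 0
    if x == "-" or y == "-":
        return -2
    return 1 if x == y else -1


def group_column_score(col_a, weights_a, col_b, weights_b, score=dna_score):
    """Weighted average of pair scores between a column of each group."""
    total = 0.0
    weight_sum = 0.0
    for x, wx in zip(col_a, weights_a):
        for y, wy in zip(col_b, weights_b):
            total += wx * wy * score(x, y)
            weight_sum += wx * wy
    return total / weight_sum


def group_score_matrix(group_a, weights_a, group_b, weights_b):
    """Score every column of group A against every column of group B."""
    cols_a = list(zip(*group_a))
    cols_b = list(zip(*group_b))
    return [[group_column_score(ca, weights_a, cb, weights_b)
             for cb in cols_b]
            for ca in cols_a]


# Hypothetical groups and weights, for illustration only.
matrix = group_score_matrix(["TGC", "AGC"], (0.5, 1.0), ["AGCT"], (1.0,))
for row in matrix:
    print([round(value, 2) for value in row])
```

The resulting matrix plays the role that the substitution scores play in pair-wise alignment: dynamic programming over it, with gap penalties, yields the alignment between the two groups.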
