
# Computer Science 262: Computational Genomics, Lecture 12 (May 13th, 2003): Multiple Sequence Alignments

Scribed by Maciej F. Boni

We are given $N$ sequences $x^{(1)}, x^{(2)}, \ldots, x^{(N)}$, and our goal is to insert gaps in each sequence such that they all have length $L$. Such an insertion of gaps results in a multiple alignment $m$, which can be viewed as (i) a path through an $N$-dimensional hypercube (the way a simple alignment can be viewed as a path through a 2-dimensional Needleman-Wunsch matrix) or (ii) a sequence of states of an $N$-tuple HMM. The goal of a multiple-alignment algorithm is to find the multiple alignment $m$ with the maximum score. The trick is that the scoring function will be a little more complicated than the ones we learned for the Needleman-Wunsch (NW) algorithm. One method for assigning a score to $m$ is to use the sum-of-pairs score, defined as

$$S(m) = \sum_{1 \le k < l \le N} s\left(m^{(k)}, m^{(l)}\right)$$

where $m^{(k)}$ denotes the sequence $x^{(k)}$ with gap insertions, and $s$ is the NW scoring function. Note that calculating $S$ costs $O(N^2)$ evaluations of $s$. Another method of assigning a score to $m$ is to find a consensus sequence $m^{(*)}$ that maximizes

$$S_C(m) = \sum_{i=1}^{N} s\left(m^{(*)}, m^{(i)}\right).$$

It appears that maximizing $S_C$ will take $O(NL)$ operations. The real challenge, however, is finding the multiple alignment $m$.
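The sum-of-pairs score can be computed directly from its definition. The sketch below is only illustrative: the match, mismatch, and gap values are assumptions (the lecture does not fix them), gaps are scored linearly per column, and the names `pair_score` and `sum_of_pairs` are hypothetical.

```python
from itertools import combinations

# Hypothetical scoring parameters (not from the lecture).
MATCH, MISMATCH, GAP = 1, -1, -2

def pair_score(a: str, b: str) -> int:
    """Score two gapped rows of equal length, column by column (linear gaps)."""
    total = 0
    for p, q in zip(a, b):
        if p == "-" and q == "-":
            continue  # a column that is a gap in both rows contributes nothing
        elif p == "-" or q == "-":
            total += GAP
        elif p == q:
            total += MATCH
        else:
            total += MISMATCH
    return total

def sum_of_pairs(rows: list[str]) -> int:
    """S(m): sum of pair_score over all pairs of rows k < l."""
    assert len({len(r) for r in rows}) == 1, "all rows must have length L"
    return sum(pair_score(a, b) for a, b in combinations(rows, 2))

alignment = ["GAAGTT", "GAC-TT", "GA-CTT"]
print(sum_of_pairs(alignment))
```

The loop over `combinations(rows, 2)` makes the $O(N^2)$ count of pairwise evaluations explicit; each evaluation itself walks all $L$ columns.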

## Multiple Alignment Algorithms

We can naturally extend Needleman-Wunsch to many dimensions by lining up our $N$ sequences along $N$ one-dimensional edges of an $N$-dimensional hypercube, and writing down a scoring function $F(i_1, i_2, \ldots, i_N)$ which represents the maximum score of an alignment up to the cell $(i_1, i_2, \ldots, i_N)$. Unfortunately, each cell has $2^N - 1$ neighbors and the entire hypercube has on the order of $L^N$ cells. Computing the value of $F$ in the entire hypercube requires $O(2^N L^N)$ operations. The running time is therefore exponential in the number of sequences, and the underlying problem is NP-complete in the number of sequences and the sequence lengths. (For fixed sequence length $L$, it still seems it would be NP-complete in the number of sequences. Is it?)
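To make the $2^N - 1$ neighbors concrete, here is a minimal sketch of the $N$-dimensional recurrence. It assumes a sum-of-pairs column score with hypothetical match, mismatch, and linear gap values; the function names are illustrative, and the code is only practical for a few very short sequences.

```python
from functools import lru_cache
from itertools import combinations, product

MATCH, MISMATCH, GAP = 1, -1, -2  # hypothetical scoring parameters

def column_score(col):
    """Sum-of-pairs score of one alignment column (gap-gap pairs score 0)."""
    total = 0
    for p, q in combinations(col, 2):
        if p == "-" and q == "-":
            continue
        total += GAP if "-" in (p, q) else (MATCH if p == q else MISMATCH)
    return total

def msa_score(seqs):
    """Best sum-of-pairs score over all multiple alignments of seqs."""
    n = len(seqs)
    # Each move advances a non-empty subset of the sequences: 2^N - 1 choices.
    moves = [m for m in product((0, 1), repeat=n) if any(m)]

    @lru_cache(maxsize=None)
    def F(cell):
        if not any(cell):
            return 0  # origin of the hypercube
        best = None
        for move in moves:
            prev = tuple(i - d for i, d in zip(cell, move))
            if min(prev) < 0:
                continue  # this predecessor lies outside the hypercube
            # Sequences that advance emit a character; the others emit a gap.
            col = [seqs[k][cell[k] - 1] if move[k] else "-" for k in range(n)]
            cand = F(prev) + column_score(col)
            if best is None or cand > best:
                best = cand
        return best

    return F(tuple(len(s) for s in seqs))

print(msa_score(["GAAGTT", "GACTT", "GACTT"]))
```

The `moves` list is exactly the set of predecessor directions of a cell, and the memoized `F` visits each of the roughly $L^N$ cells once, which gives the $O(2^N L^N)$ bound up to the cost of scoring a column.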

### Progressive Alignment

The heuristic method we'll be looking at is known as progressive alignment. It follows a sequence of steps that resembles

1. align 2 sequences $x^{(i)}$ and $x^{(j)}$; call this alignment $m$
2. fix that alignment (i.e. declare it unbreakable)
3. align an unaligned sequence $x^{(k)}$ to $m$
4. repeat 2-3 until all sequences are aligned

An individual alignment (in step 3) takes $O(L^2)$ operations, and as there are $N$ successive alignments to do, the algorithm will take $O(N L^2)$ operations. Note that as the algorithm proceeds, we will need larger and larger scoring functions that will score the alignment between $m$ (an already fixed alignment) and a new sequence $x^{(k)}$.

When the phylogeny of the sequences is known, we can take advantage of it during a progressive alignment algorithm by beginning with the two sequences that are most closely related, and continuing at each step with the sequence that we believe is the most closely related to the ones we have already aligned. By aligning two sequences that we believe are closely related, we expect there to be a lot of matches, we expect to get a high alignment score, and we expect to have high confidence in the correctness of our alignment.

A particular version of progressive alignment is CLUSTALW (pronounced kluh-stahl-double-you), which follows these steps:

1. find all distances $d_{ij}$ between sequences $x^{(i)}$ and $x^{(j)}$ (should take $O(N^2 L^2)$ operations)
2. construct a phylogenetic tree based on the distances in 1 (neighbor joining, a hierarchical clustering method; number of operations?)
3. align nodes in order of decreasing similarity (progressive alignment, $O(N L^2)$)

### Fixing too early in Progressive Alignment

It is important to notice that in the progressive alignment algorithms, one of the limiting steps is fixing alignments between sequences such that they cannot be changed later. For example, the alignments

    GAAGTT
    GAC-TT

and

    GAAGTT
    GA-CTT

may be equivalent from a scoring function's point of view, but one may be more or less correct in light of the other sequences that will be aligned to it during progressive alignment. Once such an alignment is fixed during progressive alignment, it cannot be changed.

### Barton-Sternberg

The Barton-Sternberg algorithm is designed to address part of this problem. This algorithm proceeds as follows:

1. run a progressive alignment
2. for $j = 1$ to $N$, remove $x^{(j)}$ from $m$, and re-align $x^{(j)}$ to the alignment of $\{x^{(1)}, \ldots, x^{(j-1)}, x^{(j+1)}, \ldots, x^{(N)}\}$
3. repeat 2 until convergence

At each step 2, we are guaranteed not to decrease our score, since re-inserting $x^{(j)}$ into $m$ in the exact way it appeared before will yield the same score we had before. If we are lucky we will increase our score and improve our multiple alignment. This method will fix some but not all misalignments that are a result of the ordering we chose at the beginning. For an example of a case not handled well, see slide 15 of lecture 12 (.ppt file).

### Restricted MDP

A final way to correct for fixing alignments too early during a progressive alignment is through restricted multi-dimensional dynamic programming, or Restricted MDP. This requires obtaining an alignment $m$ from progressive alignment, then running a version of bounded Needleman-Wunsch, where we are allowed to deviate only a distance $R$ from the alignment $m$. This method will fix most problems like the one on slide 15 of lecture 12 and has running time $O(2^N R^{N-1} L)$. This may not be so bad when $R = 1$ and $N = 10$. It may even work with $R = 2$ or $R = 3$ if $N$ is small.

### MLAGAN

MLAGAN (pronounced em-law-gun) combines some of the previous methods and takes advantage of anchoring sequences in areas that are known to be homologous (i.e. areas that are known to align to one another). The algorithm has 3 main steps:

1. multi-anchor sequences in regions we know to be homologous (similar to the way LAGAN does it for pairs of sequences; see lecture 11 slides)
2. progressive alignment (using the phylogenetic tree, in order of decreasing similarity)
3. iterative refinement using LAGAN (remove sequence $x^{(j)}$, keep anchors, and re-align $x^{(j)}$ to the rest of the alignment using LAGAN and a larger scoring function)

### Cystic Fibrosis CFTR

Cystic fibrosis is a recessive genetic disease that affects the transport of sodium chloride within certain types of cells. It causes the individual's body to produce a thick mucus that can obstruct the pancreas and prevent certain enzymes from reaching the intestinal tract, where they are needed to help digest food. Typical symptoms include increased appetite with no weight gain, problems taking up nutrients from food, coughing, wheezing, and pneumonia. The CFTR allele (gene), which causes cystic fibrosis when 2 copies are present, is 1577 basepairs long and is located on human chromosome 7.

MLAGAN was run on a region of DNA (containing CFTR) 1.8 megabases long, aligning sequences from humans, 8 other mammals, 1 bird, and 2 fish. In mammals it aligned 98% of exons perfectly and 99.8% of exons to an accuracy of > 90%; the numbers for alignment in birds and fish were 82% perfectly and 91% to an accuracy of > 90%. NB: exons are regions of DNA that are translated into proteins; introns are regions of DNA that are spliced out at the RNA level and are not translated into proteins. MLAGAN outperformed LAGAN, AVID, and MUMmer in accuracy and ran in 75 minutes using < 700 MB of memory.
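Returning to the bounded Needleman-Wunsch idea behind Restricted MDP: the sketch below shows only the pairwise ($N = 2$) special case, where the dynamic program is confined to cells within distance $R$ of the path of an existing alignment. The scoring values and the names `path_cells` and `restricted_nw` are assumptions made for illustration, not part of the lecture.

```python
MATCH, MISMATCH, GAP = 1, -1, -2  # hypothetical scoring parameters

def path_cells(row_x, row_y):
    """DP-matrix cells (i, j) visited by a gapped pairwise alignment."""
    i = j = 0
    cells = [(0, 0)]
    for a, b in zip(row_x, row_y):
        i += a != "-"  # advance in x unless this column is a gap in x
        j += b != "-"  # advance in y unless this column is a gap in y
        cells.append((i, j))
    return cells

def restricted_nw(x, y, row_x, row_y, R):
    """Needleman-Wunsch score using only cells within distance R of the path."""
    allowed = {(i + di, j + dj)
               for i, j in path_cells(row_x, row_y)
               for di in range(-R, R + 1)
               for dj in range(-R, R + 1)
               if 0 <= i + di <= len(x) and 0 <= j + dj <= len(y)}
    F = {(0, 0): 0}
    for i, j in sorted(allowed):  # lexicographic order: predecessors come first
        if (i, j) == (0, 0):
            continue
        cands = []
        if (i - 1, j - 1) in F:
            s = MATCH if x[i - 1] == y[j - 1] else MISMATCH
            cands.append(F[i - 1, j - 1] + s)
        if (i - 1, j) in F:
            cands.append(F[i - 1, j] + GAP)
        if (i, j - 1) in F:
            cands.append(F[i, j - 1] + GAP)
        if cands:  # cells unreachable inside the allowed region are skipped
            F[i, j] = max(cands)
    return F[len(x), len(y)]

print(restricted_nw("GAAGTT", "GACTT", "GAAGTT", "GAC-TT", 1))
```

Because the original path always lies inside the allowed region, the restricted score can never be worse than the score of the starting alignment; increasing `R` trades running time for a larger search space.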

## Alignments and Rearrangements

We now go back to simple alignments and discuss how to deal with gene duplications, inversions, and translocations. A gene duplication simply implies that a particular sequence of DNA was copied and reinserted into the genome. A translocation implies that a particular sequence of DNA was shifted from one location to another in the genome. Inversions are a little trickier; Figure 1 shows the mechanism that causes a DNA inversion. When an inversion takes place, a segment of DNA is replaced with its reverse complement, meaning that the segment reverses orientation and each nucleotide is complemented, where C is the complement of G, and A is the complement of T.


## Figure 1: DNA Inversion

Shows two complementary strands of DNA, wherein one section gets rotated 180 degrees and a section from the bottom strand becomes part of the top strand.
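As a minimal sketch of the reverse-complement operation that defines an inversion, the following assumes sequences over the uppercase alphabet A, C, G, T; the helper names `reverse_complement` and `invert_segment` are illustrative.

```python
# Translation table for complementing each nucleotide: A<->T, C<->G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the orientation."""
    return seq.translate(COMPLEMENT)[::-1]

def invert_segment(seq: str, start: int, end: int) -> str:
    """Replace seq[start:end] with its reverse complement (an inversion)."""
    return seq[:start] + reverse_complement(seq[start:end]) + seq[end:]

print(reverse_complement("GAAGTT"))
print(invert_segment("AAGGGTT", 2, 5))
```

Applying `invert_segment` twice to the same interval restores the original sequence, which matches the picture of a segment being rotated 180 degrees.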

In order to find some of the rearrangements when aligning two sequences, we run a regular Smith-Waterman local alignment algorithm, and if we notice that we have a particularly high score off the diagonal, we may guess that a duplication or a translocation took place. A high Smith-Waterman score off the diagonal simply means that we have a very good local match between 2 subsequences that happen to be in different parts of our original sequences; this may mean that since the time when the sequences diverged evolutionarily, one of these subsequences could have been duplicated or translocated to the location where we see it today. To detect inversions, we need to run Smith-Waterman a second time on one of the original sequences against the reverse complement of the other sequence.

These two Smith-Waterman runs will give us 2 matrices with locations of probable local alignments, each of which could correspond to either (i) an alignment, (ii) a translocation, (iii) a duplication, or (iv) an inversion. The goal is to put together the most cost-effective alignment using these four components. We will usually have affine penalties for rearrangements (ii)-(iv), where it costs a certain amount to use a particular rearrangement and there is a fixed per-base penalty for extending its size.

The algorithm S-LAGAN (pronounced ess-law-gun) does something like this. It finds all local alignments with two Smith-Waterman runs and builds a rough homology map showing which parts of one sequence align to which parts of the other (and in which direction). Once the sequences can be broken down into fragments that we know align to one another, we run a global alignment on these smaller fragments to align them as consistently as possible. This was referred to in class as a glocal alignment.

The final slides of lecture 12 show some good examples of human/mouse and human/rat glocal alignments using S-LAGAN. In particular, slide 38 has a good example of human/mouse and human/rat alignments where a major piece in the middle of the genome has been inverted in both alignments. This means that the inversion event occurred in a common ancestor of rats and mice (which is why it appears in both today), but it took place after the lineage of humans had diverged from that of rats and mice (which is why the inversion is not observed in humans today). Of course, the inversion could have happened in the human lineage and not in the rat and mouse lineages; we cannot tell whether the ancestral sequence was similar to the human sequence or to the rat-mouse sequence. See Figure 2 for an example of a phylogenetic tree where the ancestral sequence is similar to the human sequence, and the inversion event took place in the rat-mouse lineage.

## Figure 2: Rat-Mouse-Human Phylogeny

Shows a tree, drawn along a time axis, from the common ancestor of all three species to human, rat, and mouse. An inversion event occurs after the rat/mouse group diverged from humans but before rats and mice diverged from each other. The inversion, thus, should be observed in rats and mice, but not humans.
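The two Smith-Waterman runs used to detect inversions can be sketched as follows. This is not S-LAGAN itself: it uses a plain linear-gap Smith-Waterman with hypothetical scoring values, reports only the single best local hit of each run, and the names `smith_waterman` and `best_orientation` are illustrative.

```python
MATCH, MISMATCH, GAP = 2, -1, -2  # hypothetical scoring parameters
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def smith_waterman(x, y):
    """Return (best local score, end index in x, end index in y)."""
    prev = [0] * (len(y) + 1)
    best = (0, 0, 0)
    for i in range(1, len(x) + 1):
        cur = [0] * (len(y) + 1)
        for j in range(1, len(y) + 1):
            s = MATCH if x[i - 1] == y[j - 1] else MISMATCH
            # Local alignment: scores are floored at 0 so a match can start anywhere.
            cur[j] = max(0, prev[j - 1] + s, prev[j] + GAP, cur[j - 1] + GAP)
            if cur[j] > best[0]:
                best = (cur[j], i, j)
        prev = cur
    return best

def best_orientation(x, y):
    """Run Smith-Waterman on y and on its reverse complement; keep the better hit."""
    fwd = smith_waterman(x, y)
    rev = smith_waterman(x, reverse_complement(y))
    return ("forward", fwd) if fwd[0] >= rev[0] else ("inverted", rev)

x = "GATTACAGG"
print(best_orientation(x, reverse_complement(x)))
```

The end indices returned with each hit indicate how far the match lies from the main diagonal, which is the signal the notes use to suspect a duplication or translocation; a hit that only appears in the reverse-complement run suggests an inversion.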