
CSE 6411

Computational Biology

Sequence alignment and multiple sequence alignment (Dynamic programming)

Atif Hasan Rahman


CSE, BUET

April, 2019
Sequence alignment
• Global sequence alignment
– Needleman-Wunsch algorithm
• Local sequence alignment
– Smith-Waterman algorithm
• Global/local alignment
– Affine gap penalties
• Time/space requirements
– O(mn) time
– O(mn) space
Needleman-Wunsch (NW) algorithm
• Fill up an m×n table
• The recurrence
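A common form of the recurrence is V(i,j) = max{ V(i-1,j-1) + score(s_i, t_j), V(i-1,j) + gap, V(i,j-1) + gap }. The sketch below fills the table under that assumption. The function name `needleman_wunsch_table` and the scores (match = 1, mismatch = -1, linear gap = -1) are illustrative choices, not values from the slides.

```python
def needleman_wunsch_table(s, t, match=1, mismatch=-1, gap=-1):
    """Fill the (m+1) x (n+1) global alignment table V and return it.

    Scoring values are assumptions for illustration only.
    """
    m, n = len(s), len(t)
    V = [[0] * (n + 1) for _ in range(m + 1)]
    # Base cases: a prefix aligned against the empty string costs only gaps.
    for i in range(1, m + 1):
        V[i][0] = i * gap
    for j in range(1, n + 1):
        V[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(V[i - 1][j - 1] + sub,  # match/mismatch
                          V[i - 1][j] + gap,      # gap in t
                          V[i][j - 1] + gap)      # gap in s
    return V


table = needleman_wunsch_table("AGT", "AT")
print(table[-1][-1])  # optimal global score is in the bottom-right cell
```

The table has (m+1)(n+1) cells and each cell takes constant work, which is where the O(mn) time and space figures come from.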
If we only need the score
• Store only two rows at a time
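A minimal sketch of the two-row idea, assuming the same illustrative scoring as before (match = 1, mismatch = -1, linear gap = -1). The name `nw_score` is hypothetical. Only the previous and current rows are kept, so space drops to O(min(m, n)) while time stays O(mn).

```python
def nw_score(s, t, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score using two rows of the table."""
    # Scoring here is symmetric, so put the shorter string along the row.
    if len(t) > len(s):
        s, t = t, s
    prev = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        curr = [i * gap] + [0] * len(t)
        for j in range(1, len(t) + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
        prev = curr  # the older row is no longer needed
    return prev[-1]


print(nw_score("AGT", "AT"))
```

Because earlier rows are discarded, the traceback is lost; this variant returns the score only.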
What if we also need the alignment
• Hirschberg’s algorithm
– Global sequence alignment in linear space
• Combines dynamic programming with divide and
conquer
• Requires O(m) or O(n) space
– Whichever is smaller
• Runs in O(mn) time
Hirschberg’s algorithm
• The best path passes through some cell (k, n/2) in the middle column
Hirschberg’s algorithm
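• Maintain an array M to store where the path crossed the middle column
• Update M_{i,j} when V_{i,j} is computed for j ≥ n/2
• M_{i,j} is copied from the predecessor cell that gives V_{i,j}
[Figure: example cell with M_{i-1,j} = 3, M_{i,j-1} = 6, M_{i-1,j-1} = 5; middle column at n/2]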
Hirschberg’s algorithm
• Divide
– Find k as discussed
– Recursively find path by aligning s[1…k] with t[1…n/2]
– Recursively find path by aligning s[k+1…m] with t[n/2+1…n]
• Conquer
– Concatenate the two paths with k in the middle
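The divide and conquer steps can be sketched as follows. This is one possible reading of the scheme, not a reference implementation: alongside each score it carries the row at which the best path crossed the middle column, then recurses on the two halves. The names `nw_align`, `crossing_row` and `hirschberg`, the scoring values, and the choice to solve tiny base cases with a full table are all assumptions.

```python
def nw_align(s, t, match=1, mismatch=-1, gap=-1):
    """Full-table alignment with traceback; used only for tiny base cases."""
    m, n = len(s), len(t)
    V = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        V[i][0] = i * gap
    for j in range(1, n + 1):
        V[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(V[i - 1][j - 1] + sub, V[i - 1][j] + gap, V[i][j - 1] + gap)
    a, b, i, j = [], [], m, n
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and s[i - 1] == t[j - 1] else mismatch
        if i > 0 and j > 0 and V[i][j] == V[i - 1][j - 1] + sub:
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and V[i][j] == V[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return "".join(reversed(a)), "".join(reversed(b))


def crossing_row(s, t, mid, match, mismatch, gap):
    """Return a row k such that an optimal path passes through cell (k, mid)."""
    n = len(t)
    prev_v = [j * gap for j in range(n + 1)]
    prev_m = [0] * (n + 1)  # along row 0 the path crosses column mid at row 0
    for i in range(1, len(s) + 1):
        cur_v = [i * gap] + [0] * n
        cur_m = [0] * (n + 1)
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            # Each candidate is (score, crossing row inherited from that predecessor).
            best = max((prev_v[j - 1] + sub, prev_m[j - 1]),
                       (prev_v[j] + gap, prev_m[j]),
                       (cur_v[j - 1] + gap, cur_m[j - 1]))
            cur_v[j] = best[0]
            cur_m[j] = i if j == mid else best[1]  # values for j < mid are unused
        prev_v, prev_m = cur_v, cur_m
    return prev_m[n]


def hirschberg(s, t, match=1, mismatch=-1, gap=-1):
    """Return an optimal global alignment (two gapped strings)."""
    if len(s) <= 1 or len(t) <= 1:
        return nw_align(s, t, match, mismatch, gap)  # table is only O(m + n) here
    mid = len(t) // 2
    k = crossing_row(s, t, mid, match, mismatch, gap)
    a1, b1 = hirschberg(s[:k], t[:mid], match, mismatch, gap)
    a2, b2 = hirschberg(s[k:], t[mid:], match, mismatch, gap)
    return a1 + a2, b1 + b2


print(hirschberg("AGTACGCA", "TATGC"))
```

Each call to `crossing_row` keeps two rows of scores and two rows of crossing indices, so the working space is linear in the length of t, plus the space for the output alignment.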
Hirschberg’s algorithm
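Finding the Middle Point
[Figure: grid with axis ticks 0, m/4, m/2, 3m/4, m]
Finding the Middle Point Again
[Figure: grid with axis ticks 0, m/4, m/2, 3m/4, m]
And Again…
[Figure: grid with axis ticks 0, m/8, m/4, 3m/8, m/2, 5m/8, 3m/4, 7m/8, m]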
Time = Area: First Pass
• On first pass, the algorithm covers the entire area.

Area = nm
Time = Area: Second Pass
• On second pass, the algorithm covers only 1/2 of the area

Area/2
Time = Area: Third Pass
• On third pass, only 1/4th is covered.

Area/4
Geometric Reduction At Each Iteration
• 1 + ½ + ¼ + ... + (½)^k ≤ 2
• Runtime: O(Area) = O(nm)
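As a worked version of the bound, assuming each pass covers half the area of the previous one and writing T(n, m) for the total work over k + 1 passes (a symbol introduced here for illustration):

```latex
\begin{aligned}
T(n,m) &\le nm + \tfrac{1}{2}\,nm + \tfrac{1}{4}\,nm + \dots + \left(\tfrac{1}{2}\right)^{k} nm \\
       &= nm \sum_{i=0}^{k} \left(\tfrac{1}{2}\right)^{i}
        = nm \left( 2 - \left(\tfrac{1}{2}\right)^{k} \right)
        \le 2\,nm
\end{aligned}
```

So the linear-space algorithm does at most about twice the work of a single table fill.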
[Figure: fraction of the area covered per pass. First pass: 1; 2nd pass: 1/2; 3rd pass: 1/4; 4th pass: 1/8; 5th pass: 1/16]
Multiple sequence alignment (MSA)
• Aligning more than two sequences
• Generalize DP to MSA
– Impractical
• Heuristic approaches to MSA
– Progressive alignment
– Iterative methods
– Consensus methods
– Hidden Markov models/ statistical approaches
– Divide and conquer
Multiple sequence alignment
Applications
• Phylogeny
• Motifs
• Patterns
• Structure prediction (RNA, protein)
Aligning three sequences
• Same strategy as aligning two sequences
• Use a 3-D “Manhattan Cube”, with each axis representing a sequence to align
• For global alignments, go from source to sink
2D vs 3D alignment grid

2D table

3D graph
DP recursion (3 edges vs 7)

• Pairwise: 3 possible paths (match/mismatch, insertion, and deletion)
• In 3-D, 7 edges in each unit cube
Architecture of 3D alignment cell
[Figure: unit cube whose vertices are the seven predecessors (i-1,j-1,k-1), (i-1,j-1,k), (i-1,j,k-1), (i,j-1,k-1), (i-1,j,k), (i,j-1,k), (i,j,k-1) and the current cell (i,j,k)]
Multiple alignment: dynamic programming
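$$
s_{i,j,k} = \max
\begin{cases}
s_{i-1,j-1,k-1} + \delta(v_i, w_j, u_k) & \text{cube diagonal: no indels} \\
s_{i-1,j-1,k} + \delta(v_i, w_j, \_) & \text{face diagonal: one indel} \\
s_{i-1,j,k-1} + \delta(v_i, \_, u_k) & \text{face diagonal: one indel} \\
s_{i,j-1,k-1} + \delta(\_, w_j, u_k) & \text{face diagonal: one indel} \\
s_{i-1,j,k} + \delta(v_i, \_, \_) & \text{edge diagonal: two indels} \\
s_{i,j-1,k} + \delta(\_, w_j, \_) & \text{edge diagonal: two indels} \\
s_{i,j,k-1} + \delta(\_, \_, u_k) & \text{edge diagonal: two indels}
\end{cases}
$$

• δ(x, y, z) is an entry in the 3D scoring matrix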
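A minimal sketch of the seven-way recurrence for three sequences v, w and u. The column scoring function is an assumption: `sp_score` uses a sum-of-pairs scheme (match = 1, mismatch = -1, symbol against gap = -1, gap against gap = 0), which is only one possible choice of 3D scoring matrix. The function names are hypothetical.

```python
from itertools import product


def sp_score(x, y, z, match=1, mismatch=-1, gap=-1):
    """Assumed column score: sum over the three pairs in the column."""
    total = 0
    for a, b in ((x, y), (x, z), (y, z)):
        if a == "-" and b == "-":
            continue  # a gap paired with a gap contributes nothing
        if a == "-" or b == "-":
            total += gap
        else:
            total += match if a == b else mismatch
    return total


def align3_score(v, w, u, delta=sp_score):
    """Return the optimal score of a global alignment of three sequences."""
    n1, n2, n3 = len(v), len(w), len(u)
    neg_inf = float("-inf")
    s = [[[neg_inf] * (n3 + 1) for _ in range(n2 + 1)] for _ in range(n1 + 1)]
    s[0][0][0] = 0
    # The 7 non-zero 0/1 vectors are the 7 incoming edges of each cell.
    moves = [mv for mv in product((0, 1), repeat=3) if any(mv)]
    # Lexicographic order guarantees every predecessor is filled first.
    for i, j, k in product(range(n1 + 1), range(n2 + 1), range(n3 + 1)):
        if i == j == k == 0:
            continue
        for di, dj, dk in moves:
            pi, pj, pk = i - di, j - dj, k - dk
            if pi < 0 or pj < 0 or pk < 0:
                continue
            column = (v[i - 1] if di else "-",
                      w[j - 1] if dj else "-",
                      u[k - 1] if dk else "-")
            s[i][j][k] = max(s[i][j][k], s[pi][pj][pk] + delta(*column))
    return s[n1][n2][n3]


print(align3_score("ACG", "ACG", "ACG"))
```

The three nested index ranges and the seven moves per cell make the cost about 7 times the product of the three lengths, which already limits this to short sequences.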


MSA: running time
• For 3 sequences of length n, the run time is 7n^3; O(n^3)
• For k sequences, build a k-dimensional Manhattan grid, with run time (2^k − 1)(n^k); O(2^k n^k)
• Conclusion: the DP approach for aligning two sequences is easily extended to k sequences, but it is impractical due to its exponential running time.
• Computing an exact MSA this way quickly becomes computationally infeasible as k grows, so heuristics are used in practice
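To get a feel for the growth, the count (2^k − 1)·n^k can be evaluated directly. The helper name `msa_dp_edges` is hypothetical, and the result is an operation count, not a measured running time.

```python
def msa_dp_edges(k, n):
    """Number of edge relaxations for k sequences of length n: (2^k - 1) * n^k."""
    return (2 ** k - 1) * n ** k


for k in (2, 3, 5, 10):
    print(k, msa_dp_edges(k, 100))
```

For n = 100, going from two to three sequences raises the count from 30,000 to 7,000,000, and each further sequence multiplies it by more than 200.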
Progressive alignment

[Huson et al., 2010]


Progressive alignment
• Progressive alignment does not realign the sequences
• Progressive alignment is not guaranteed to converge to the optimal solution
• Works reasonably well for similar sequences
Iterative alignment
• Generate an MSA and iteratively refine it
Searching in a database
• Search for matches in a large database of sequences such as GenBank
• BLAST: Basic Local Alignment Search Tool
• Created in 1990 by Altschul, Gish, Miller, Myers, & Lipman
• Similar to seed and extend
• https://blast.ncbi.nlm.nih.gov/Blast.cgi
Original BLAST: Example
[Figure: dot plot of A C G A A G T A A G G T C C A G T against G A T C C T G G A T T G C G A]
• w=4
• Exact keyword match of GGTC
• Extend diagonals with mismatches until score is under 50%.
• Output result:
GTAAGGTCC
GTTAGGTCC

From lectures by Serafim Batzoglou (Stanford)
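The seed and extend idea can be sketched as below. This is not BLAST's actual code: exact w-mers are found with a hash index, and each hit is extended without gaps along its diagonal, keeping the best-scoring extension. The stopping rule is an assumption (stop once the running score falls more than `x_drop` below the best seen) and stands in for the "under 50%" rule on the slide. The database string in the example is the slide's second sequence read in the direction that contains the reported hit.

```python
from collections import defaultdict


def _extend(query, db, q, p, step, match, mismatch, x_drop):
    """Return how many positions to extend from (q, p) in direction step."""
    score = best = 0
    length = best_len = 0
    while 0 <= q < len(query) and 0 <= p < len(db):
        score += match if query[q] == db[p] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > x_drop:
            break  # assumed cutoff: extension has fallen too far below its best
        q += step
        p += step
    return best_len


def seed_and_extend(query, db, w=4, match=1, mismatch=-1, x_drop=3):
    """Return sorted (query_segment, db_segment) pairs from ungapped extension."""
    index = defaultdict(list)  # every w-mer of db -> its start positions
    for p in range(len(db) - w + 1):
        index[db[p:p + w]].append(p)
    hits = set()
    for q in range(len(query) - w + 1):
        for p in index.get(query[q:q + w], ()):
            left = _extend(query, db, q - 1, p - 1, -1, match, mismatch, x_drop)
            right = _extend(query, db, q + w, p + w, 1, match, mismatch, x_drop)
            hits.add((query[q - left:q + w + right], db[p - left:p + w + right]))
    return sorted(hits)


print(seed_and_extend("ACGAAGTAAGGTCCAGT", "AGCGTTAGGTCCTAG"))
# [('GTAAGGTCC', 'GTTAGGTCC')]
```

Several overlapping seeds on the same diagonal (AGGT, GGTC, GTCC) extend to the same segment here, which is why the hits are collected in a set.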


Gapped BLAST: Example
[Figure: dot plot of A C G A A G T A A G G T C C A G T against G A T C C T G G A T T G C G A]
• Original BLAST exact keyword search, THEN:
• Extend with gaps around ends of exact match until score < threshold
• Output result:
GTAAGGTCCAGT
GTTAGGTC-AGT

From lectures by Serafim Batzoglou (Stanford)


Whole genome alignment
• Genomes may have translocations, inversions, duplications in addition to insertions, deletions and SNPs

CCGGTAGGATATTAAACGGGGTGAGGAGCGTTGGCATAGCA

CCGCTAGGCTATTAAAACCCCGGAGGAG....GGCTGAGCA
[Figure: genome B plotted against genome A, with a translocation, an inversion and an insertion marked]
http://mummer.sourceforge.net/manual/AlignmentTypes.pdf
MUMmer
• Maximal Unique Matcher (MUM)
– Match
• exact match of a minimum length
– Maximal
• cannot be extended in either direction without a mismatch
– Unique
• occurs only once in both sequences (MUM)
• occurs only once in a single sequence (MAM)
• occurs one or more times in either sequence (MEM)
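The three conditions for a MUM (match, maximal, unique) can be checked directly with a brute-force sketch. This is for illustration on short strings only and is not how MUMmer is implemented; the function names and the `min_len` default are assumptions.

```python
def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, allowing overlaps."""
    count, pos = 0, text.find(pattern)
    while pos != -1:
        count += 1
        pos = text.find(pattern, pos + 1)
    return count


def find_mums(a, b, min_len=3):
    """Return (start_in_a, start_in_b, length) for each maximal unique match."""
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # Maximal on the left: the preceding characters must differ.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Maximal on the right: extend until the first mismatch.
            length = 0
            while (i + length < len(a) and j + length < len(b)
                   and a[i + length] == b[j + length]):
                length += 1
            if length < min_len:
                continue
            match = a[i:i + length]
            # Unique: exactly one occurrence in each sequence.
            if count_occurrences(a, match) == 1 and count_occurrences(b, match) == 1:
                mums.append((i, j, length))
    return mums


print(find_mums("GGACGTCC", "TTACGTAA"))  # [(2, 2, 4)]
```

Relaxing the uniqueness test to one sequence only, or dropping it, would give the MAM and MEM variants listed above.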
MUMmer
• It is used to compare whole genomes
• Find all MUMs
• Cluster consistent MUMs
• Extend alignments
[Figure: plot of matches with axes R and Q]
Genome rearrangement
• Although cabbages and turnips share a recent common ancestor, they look and taste different.
Genome rearrangement
• Turnip vs. Cabbage
– In the 1980s Jeffrey Palmer studied evolution of plant organelles by comparing mitochondrial genomes of cabbage and turnip.
– He found 99% similarity between genes.
– These surprisingly similar gene sequences differed in gene order.
– This study helped pave the way to analyzing genome rearrangements in molecular evolution.
Genome rearrangement problem
• Represent the order of genes in two genomes by permutations over 1,2,…,n
• Genome rearrangement problem
– Find the minimum number of genome rearrangement operations to transform one permutation into the other
Rearrangement operations
• Reversal: 1 2 3 4 5 6 → 1 2 5 4 3 6
• Transposition: 1 2 3 4 5 6 → 1 2 5 3 4 6
• Translocation: (1 2 3), (4 5 6) → (1 2 6), (4 5 3)
• Fusion: (1 2 3 4), (5 6) → (1 2 3 4 5 6)
• Fission: (1 2 3 4 5 6) → (1 2 3 4), (5 6)
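For small permutations, the minimum number of operations can be found by breadth-first search over permutations. The sketch below is restricted to reversals of unsigned permutations, which is a simplifying assumption; the other operations are not modelled, and the function name `reversal_distance` is hypothetical. The search space grows as n!, so this is only usable for very short gene orders.

```python
from collections import deque


def reversal_distance(source, target):
    """Minimum number of reversals turning source into target (BFS, small n only)."""
    source, target = tuple(source), tuple(target)
    if sorted(source) != sorted(target):
        raise ValueError("both gene orders must contain the same genes")
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        perm, dist = queue.popleft()
        if perm == target:
            return dist
        n = len(perm)
        for i in range(n):
            for j in range(i + 2, n + 1):  # reverse a block of at least 2 genes
                nxt = perm[:i] + perm[i:j][::-1] + perm[j:]
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, dist + 1))


print(reversal_distance([1, 2, 3, 4, 5, 6], [1, 2, 5, 4, 3, 6]))  # 1
```

The example reproduces the single reversal of the block 3 4 5 shown in the list of operations.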
Applications in phylogenetics

History of chromosome X
Application in cancer biology
• Rearrangements may disrupt genes and alter gene regulation.

• Example: translocation in leukemia yields “Philadelphia” chromosome:
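[Figure: before the translocation, Chr 9 carries promoter + ABL gene and Chr 22 carries promoter + BCR gene; after it, the labelled products are promoter + BCR gene and promoter + c-abl oncogene]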

• There are thousands of individual rearrangements known for different tumors.
References
• An Introduction to Bioinformatics Algorithms by Jones and Pevzner (Chapter 7)
• Algorithms on Strings, Trees and Sequences by Dan Gusfield (Chapter 12)
• Algorithm Design by Kleinberg and Tardos
