Computational Biology
April, 2019
Sequence alignment
• Global sequence alignment
– Needleman-Wunsch algorithm
• Local sequence alignment
– Smith-Waterman algorithm
• Global/local alignment
– Affine gap penalties
• Time/space requirements
– O(mn) time
– O(mn) space
Needleman-Wunsch (NW) algorithm
• Fill up an m×n table
• The recurrence
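A common form of the recurrence, assuming a linear gap penalty g and a substitution score σ, is V(i,j) = max{ V(i-1,j-1) + σ(s_i, t_j), V(i-1,j) + g, V(i,j-1) + g }. The sketch below fills the table with that rule. The function name `nw_score_table` and the score values (match +1, mismatch -1, gap -1) are illustrative assumptions, not taken from the slides.

```python
def nw_score_table(s, t, match=1, mismatch=-1, gap=-1):
    """Fill the (m+1) x (n+1) Needleman-Wunsch table V for strings s and t."""
    m, n = len(s), len(t)
    V = [[0] * (n + 1) for _ in range(m + 1)]
    # Boundary: aligning a prefix against the empty string costs only gaps.
    for i in range(1, m + 1):
        V[i][0] = i * gap
    for j in range(1, n + 1):
        V[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(V[i - 1][j - 1] + sub,  # match/mismatch
                          V[i - 1][j] + gap,      # gap in t
                          V[i][j - 1] + gap)      # gap in s
    return V


table = nw_score_table("ACGT", "AGT")
print(table[-1][-1])  # optimal global alignment score
```

The bottom-right cell holds the optimal score; recovering the alignment itself requires a traceback through the full table, which is where the O(mn) space goes.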
If we only need the score
• Keep only the previous row (or column) of the table: linear space instead of O(mn)
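A minimal sketch of the score-only computation: each row of the table depends only on the row before it, so two rows are enough. The name `nw_score` and the score values are illustrative assumptions.

```python
def nw_score(s, t, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of s and t using O(len(t)) memory."""
    prev = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        curr = [i * gap] + [0] * len(t)
        for j in range(1, len(t) + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
        prev = curr  # the older row is no longer needed
    return prev[-1]


print(nw_score("ACGT", "AGT"))
```

The alignment path cannot be traced back from two rows alone, which is the problem the divide-and-conquer scheme addresses.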
Hirschberg’s algorithm
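• Maintain an array M that records the row at which the best path to each cell crossed the middle column n/2
• Update $M_{i,j}$ when $V_{i,j}$ is computed for $j \ge n/2$; the value is copied from whichever predecessor gave the maximum
– From above: $M_{i,j} = M_{i-1,j}$ (figure example: 3)
– From the left: $M_{i,j} = M_{i,j-1}$ (figure example: 6)
– From the diagonal: $M_{i,j} = M_{i-1,j-1}$ (figure example: 5)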
Hirschberg’s algorithm
• Divide
– Find k as discussed
– Recursively find path by aligning s[1…k] with t[1…n/2]
– Recursively find path by aligning s[k+1…m] with t[n/2+1…n]
• Conquer
– Concatenate the two paths with k in the middle
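The divide and conquer steps can be sketched as follows. This is an illustration under stated assumptions, not the slides' exact procedure: it finds k with the common forward/backward-score variant (score of the left half plus score of the reversed right half) rather than with the M array described above, and the scores `MATCH`, `MISMATCH`, `GAP` and all function names are hypothetical.

```python
MATCH, MISMATCH, GAP = 1, -1, -1  # illustrative scores (assumed)


def last_row(a, b):
    """Scores of aligning all of a against every prefix of b, in linear space."""
    prev = [j * GAP for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * GAP] + [0] * len(b)
        for j in range(1, len(b) + 1):
            sub = MATCH if a[i - 1] == b[j - 1] else MISMATCH
            curr[j] = max(prev[j - 1] + sub, prev[j] + GAP, curr[j - 1] + GAP)
        prev = curr
    return prev


def hirschberg(s, t):
    """Return an optimal global alignment of s and t as two gapped strings."""
    m, n = len(s), len(t)
    if m == 0:
        return "-" * n, t
    if n == 0:
        return s, "-" * m
    if n == 1:
        # Base case: put the single character of t on a matching position of s,
        # or on position 0 if none matches (valid because MISMATCH >= 2 * GAP).
        i = max(s.find(t), 0)
        return s, "-" * i + t + "-" * (m - 1 - i)
    mid = n // 2
    fwd = last_row(t[:mid], s)              # fwd[k]: s[:k] vs t[:mid]
    bwd = last_row(t[mid:][::-1], s[::-1])  # bwd[k]: last k chars of s vs t[mid:]
    # Divide: k is the row where an optimal path crosses column n/2.
    k = max(range(m + 1), key=lambda r: fwd[r] + bwd[m - r])
    left_s, left_t = hirschberg(s[:k], t[:mid])
    right_s, right_t = hirschberg(s[k:], t[mid:])
    # Conquer: concatenate the two partial alignments.
    return left_s + right_s, left_t + right_t


def alignment_score(a, b):
    """Score of a gapped alignment given as two equal-length strings."""
    total = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            total += GAP
        else:
            total += MATCH if x == y else MISMATCH
    return total


top, bottom = hirschberg("GATTACA", "GCATGCU")
print(top)
print(bottom)
print(alignment_score(top, bottom))
```

Each recursive call works on a strictly shorter piece of t, and only linear-size score rows are alive at any time, so the memory used is linear in the sequence lengths.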
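Finding the Middle Point
[Figure: table axis marked 0, m/4, m/2, 3m/4, m]
Finding the Middle Point Again
[Figure: table axis marked 0, m/4, m/2, 3m/4, m]
And Again…
[Figure: table axis marked in eighths: 0, m/8, m/4, 3m/8, m/2, 5m/8, 3m/4, 7m/8, m]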
Time = Area: First Pass
• On the first pass, the algorithm covers the entire area: Area = nm.
Time = Area: Second Pass
• On the second pass, the algorithm covers only 1/2 of the area: Area/2.
Time = Area: Third Pass
• On the third pass, only 1/4 of the area is covered: Area/4.
Geometric Reduction At Each Iteration
• 5th pass: 1/16 of the area
• $1 + \frac{1}{2} + \frac{1}{4} + \dots + \left(\frac{1}{2}\right)^k \le 2$
• Runtime: O(Area) = O(nm)
2D table
3D graph
DP recursion (3 edges vs 7)
• Pairwise: 3 possible paths (match/mismatch, insertion, and deletion)
• In 3-D: 7 edges in each unit cube
Architecture of 3D alignment cell
• Vertices of the cell: (i-1,j-1,k-1), (i-1,j,k-1), (i-1,j-1,k), (i-1,j,k), (i,j-1,k-1), (i,j,k-1), (i,j-1,k), (i,j,k)
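Multiple alignment: dynamic programming
$$
s_{i,j,k} = \max
\begin{cases}
s_{i-1,j-1,k-1} + \delta(v_i, w_j, u_k) & \text{cube diagonal: no indels} \\
s_{i-1,j-1,k} + \delta(v_i, w_j, \_) & \text{face diagonal: one indel} \\
s_{i-1,j,k-1} + \delta(v_i, \_, u_k) & \text{face diagonal: one indel} \\
s_{i,j-1,k-1} + \delta(\_, w_j, u_k) & \text{face diagonal: one indel} \\
s_{i-1,j,k} + \delta(v_i, \_, \_) & \text{edge diagonal: two indels} \\
s_{i,j-1,k} + \delta(\_, w_j, \_) & \text{edge diagonal: two indels} \\
s_{i,j,k-1} + \delta(\_, \_, u_k) & \text{edge diagonal: two indels}
\end{cases}
$$
• $\delta(x, y, z)$ is the score of one alignment column; "_" denotes a gap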
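The seven-way maximization can be sketched directly, one move per edge of the unit cube. The column-scoring function (called `delta` here) is not specified on the slide, so this sketch assumes a sum-of-pairs score with match +1, mismatch -1, character-versus-gap -1, and gap-versus-gap 0. All names are illustrative.

```python
from itertools import product


def pair_score(x, y):
    """Assumed pairwise score: +1 match, -1 mismatch or single gap, 0 for two gaps."""
    if x == "-" and y == "-":
        return 0
    if x == "-" or y == "-":
        return -1
    return 1 if x == y else -1


def delta(x, y, z):
    """Sum-of-pairs score of one three-way alignment column (an assumption)."""
    return pair_score(x, y) + pair_score(x, z) + pair_score(y, z)


def align3_score(v, w, u):
    """Optimal three-way global alignment score by dynamic programming."""
    lv, lw, lu = len(v), len(w), len(u)
    neg_inf = float("-inf")
    s = [[[neg_inf] * (lu + 1) for _ in range(lw + 1)] for _ in range(lv + 1)]
    s[0][0][0] = 0
    # The 7 edges entering a cell: every non-zero step in {0,1}^3.
    moves = [d for d in product((0, 1), repeat=3) if any(d)]
    for i, j, k in product(range(lv + 1), range(lw + 1), range(lu + 1)):
        for di, dj, dk in moves:
            pi, pj, pk = i - di, j - dj, k - dk
            if pi < 0 or pj < 0 or pk < 0:
                continue
            x = v[i - 1] if di else "-"
            y = w[j - 1] if dj else "-"
            z = u[k - 1] if dk else "-"
            s[i][j][k] = max(s[i][j][k], s[pi][pj][pk] + delta(x, y, z))
    return s[lv][lw][lu]


print(align3_score("ACGT", "AGT", "ACT"))
```

For three sequences of length n this fills about n³ cells with 7 candidates each, which is why exact dynamic programming becomes impractical as more sequences are added.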
[Figure: dot matrix with the sequence G A T C C T G G A T T G C G A along the axis]
• Exact keyword match of GGTC
• Extend diagonals with mismatches until the score drops below 50%
• Output result:
  GTAAGGTCC
  GTTAGGTCC
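A rough sketch of the keyword-then-extend idea, without gaps. The slide does not define the score, so this assumes "score" means the fraction of identical positions in the extended segment and stops extending when it would fall below 50%. The function name `seed_and_extend` and the parameter `min_identity` are hypothetical.

```python
def seed_and_extend(a, b, word, min_identity=0.5):
    """Ungapped extension of exact `word` hits along their diagonals."""
    w = len(word)
    starts_a = [i for i in range(len(a) - w + 1) if a[i:i + w] == word]
    starts_b = [j for j in range(len(b) - w + 1) if b[j:j + w] == word]
    hits = []
    for i in starts_a:
        for j in starts_b:
            lo_a, lo_b, length, matches = i, j, w, w
            # Extend to the right along the diagonal.
            while lo_a + length < len(a) and lo_b + length < len(b):
                m = matches + (a[lo_a + length] == b[lo_b + length])
                if m / (length + 1) < min_identity:
                    break
                length, matches = length + 1, m
            # Extend to the left along the diagonal.
            while lo_a > 0 and lo_b > 0:
                m = matches + (a[lo_a - 1] == b[lo_b - 1])
                if m / (length + 1) < min_identity:
                    break
                lo_a, lo_b, length, matches = lo_a - 1, lo_b - 1, length + 1, m
            hits.append((a[lo_a:lo_a + length], b[lo_b:lo_b + length]))
    return hits


for top, bottom in seed_and_extend("GTAAGGTCC", "GTTAGGTCC", "GGTC"):
    print(top)
    print(bottom)
```

On the slide's two strings the seed GGTC extends across the single A/T mismatch to cover both sequences in full.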
[Figure: dot matrix with the sequence G A T C C T G G A T T G C G A along the axis]
• Keyword search, then:
• Extend with gaps around the ends of the exact match until score < threshold
• Output result:
  GTAAGGTCCAGT
  GTTAGGTC-AGT
CCGGTAGGATATTAAACGGGGTGAGGAGCGTTGGCATAGCA
CCGCTAGGCTATTAAAACCCCGGAGGAG....GGCTGAGCA
[Figure: alignment types (translocation, inversion, insertion) for sequences A and B]
http://mummer.sourceforge.net/manual/AlignmentTypes.pdf
MUMmer
• Maximal Unique Match (MUM)
– Match
• exact match of a minimum length
– Maximal
• cannot be extended in either direction without a mismatch
– Unique
• occurs exactly once in each of the two sequences (MUM)
• occurs exactly once in one sequence, with no such requirement on the other (MAM)
• no uniqueness requirement: may occur any number of times in either sequence (MEM)
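The three conditions (minimum length, maximal, unique in both sequences) can be checked by brute force on short strings. This sketch only illustrates the definition: it is quadratic or worse, whereas tools built for whole genomes rely on indexed data structures. The names `find_mums`, `count_occurrences`, and `min_len` are hypothetical.

```python
def count_occurrences(text, pattern):
    """Number of (possibly overlapping) occurrences of pattern in text."""
    return sum(1 for i in range(len(text) - len(pattern) + 1)
               if text.startswith(pattern, i))


def find_mums(a, b, min_len=3):
    """Return (start_in_a, start_in_b, length) for every MUM of a and b."""
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # Maximal on the left: the preceding characters must differ.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Maximal on the right: extend until a mismatch or a sequence end.
            length = 0
            while (i + length < len(a) and j + length < len(b)
                   and a[i + length] == b[j + length]):
                length += 1
            if length < min_len:
                continue
            match = a[i:i + length]
            # Unique: exactly one occurrence in each sequence.
            if count_occurrences(a, match) == 1 and count_occurrences(b, match) == 1:
                mums.append((i, j, length))
    return mums


print(find_mums("GGACGTCC", "TTACGTAA"))
```

Dropping the two uniqueness checks would report maximal exact matches (MEMs) instead.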
MUMmer
• It is used to compare whole genomes
• Find all MUMs
• Cluster consistent MUMs
• Extend alignments
[Figure: dot plot with axes R and Q]
Genome rearrangement
• Although cabbages and turnips share a recent
common ancestor, they look and taste different.
Genome rearrangement
• Turnip vs. Cabbage
– In the 1980s Jeffrey Palmer studied evolution of plant
organelles by comparing mitochondrial genomes of
cabbage and turnip.
– He found 99% similarity between genes.
– These surprisingly similar gene sequences differed in
gene order.
– This study helped pave the way to analyzing genome
rearrangements in molecular evolution.
Genome rearrangement problem
• Represent the order of genes in two genomes by
permutations over 1,2,…,n
• Genome rearrangement problem
– Find the minimum number of genome rearrangement
operations needed to transform one permutation into the other
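For very small n the problem can be solved by exhaustive search. The sketch below restricts the allowed operations to reversals of unsigned permutations (an assumption made here to keep the example short) and uses breadth-first search, so the first time the target is reached the number of steps is minimal. The name `reversal_distance` is hypothetical, and the approach is only feasible for a handful of genes because there are n! permutations.

```python
from collections import deque


def reversal_distance(start, target):
    """Minimum number of reversals turning start into target (BFS, small n only)."""
    start, target = tuple(start), tuple(target)
    if start == target:
        return 0
    n = len(start)
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        perm, dist = queue.popleft()
        # Try every reversal of a segment perm[i:j] with at least two elements.
        for i in range(n - 1):
            for j in range(i + 2, n + 1):
                nxt = perm[:i] + perm[i:j][::-1] + perm[j:]
                if nxt == target:
                    return dist + 1
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, dist + 1))
    return None  # target is not a rearrangement of start


print(reversal_distance([1, 2, 3, 4, 5, 6], [1, 2, 5, 4, 3, 6]))
```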
Rearrangement operations
• Reversal: 1 2 3 4 5 6 → 1 2 5 4 3 6
• Transposition: 1 2 3 4 5 6 → 1 2 5 3 4 6
• Translocation: (1 2 3), (4 5 6) → (1 2 6), (4 5 3)
• Fusion: (1 2 3 4), (5 6) → (1 2 3 4 5 6)
• Fission: (1 2 3 4 5 6) → (1 2 3 4), (5 6)
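The operations above can be written as small list manipulations. This sketch treats a chromosome as a Python list of unsigned gene labels (an assumption; signed permutations, where a reversal also flips gene orientation, are not modeled), and all function names and index conventions are illustrative.

```python
def reversal(p, i, j):
    """Reverse the segment p[i:j]."""
    return p[:i] + p[i:j][::-1] + p[j:]


def transposition(p, i, j, k):
    """Swap the adjacent blocks p[i:j] and p[j:k]."""
    return p[:i] + p[j:k] + p[i:j] + p[k:]


def translocation(c1, c2, i, j):
    """Exchange the tails c1[i:] and c2[j:] of two chromosomes."""
    return c1[:i] + c2[j:], c2[:j] + c1[i:]


def fusion(c1, c2):
    """Join two chromosomes end to end."""
    return c1 + c2


def fission(c, i):
    """Split one chromosome into c[:i] and c[i:]."""
    return c[:i], c[i:]


genome = [1, 2, 3, 4, 5, 6]
print(reversal(genome, 2, 5))
print(transposition(genome, 2, 4, 5))
print(translocation([1, 2, 3], [4, 5, 6], 2, 2))
print(fusion([1, 2, 3, 4], [5, 6]))
print(fission(genome, 4))
```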
Applications in phylogenetics
History of chromosome X
Application in cancer biology
• Rearrangements may disrupt genes and alter gene regulation.