6.096 Algorithms for Computational Biology
Sequence Alignment and Dynamic Programming
  Lecture 1 - Introduction
  Lecture 2 - Hashing and BLAST
  Lecture 3 - Combinatorial Motif Finding
  Lecture 4 - Statistical Motif Finding
Challenges in Computational Biology
  [Figure: overview diagram (DNA, RNA transcript) of the numbered course topics: Gene Finding, Sequence alignment, Database lookup, Genome Assembly, Regulatory motif discovery, Comparative Genomics, Evolutionary Theory, Gene expression analysis, Cluster discovery, Gibbs sampling, Protein network analysis, Regulatory network inference, Emerging network properties. Example sequences shown in the figure: TCATGCTAT, TCGTGATAA, TGAGGATAT, TTATCATAT, TTATGATTT]
Comparing two DNA sequences
  Given two possibly related strings S1 and S2
  What is the longest common subsequence?
    S1:   A C G T C A T C A
    S2:   T A G T G T C A
    LCSS: A G T T C A
  Edit distance: number of changes needed for S1 → S2
How can we compute the best alignment?
    S1: A C G T C A T C A
    S2: T A G T G T C A
  Need a scoring function:
    Score(alignment) = total cost of editing S1 into S2
    Cost of mutation
    Cost of insertion / deletion
    Reward of match
  Need an algorithm for inferring the best alignment
    Enumeration?
    How would you do it?
    How many alignments are there?
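Why we need a smart algorithm
  Ways to align two sequences of length m, n:
    $\binom{m+n}{m} = \frac{(m+n)!}{m!\,n!}$, which for $m = n$ becomes $\frac{(2m)!}{(m!)^2} \approx \frac{2^{2m}}{\sqrt{\pi m}}$
  For two sequences of length n:
    n      Enumeration    Today's lecture (n^2 table cells)
    10     184,756        100
    20     1.40E+11       400
    100    9.00E+58       10,000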
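The table above can be reproduced with a few lines of Python. This is only an illustrative sketch: the function names `count_alignments` and `dp_table_cells` are made up here, and the count is the binomial coefficient quoted on the slide.

```python
from math import comb

def count_alignments(m: int, n: int) -> int:
    """Binomial coefficient C(m+n, m): the enumeration count quoted on the slide."""
    return comb(m + n, m)

def dp_table_cells(m: int, n: int) -> int:
    """Number of (i, j) cells a dynamic-programming table has to fill."""
    return m * n

# Compare brute-force enumeration with the table size for equal-length sequences.
for n in (10, 20, 100):
    print(n, f"{count_alignments(n, n):.2e}", dp_table_cells(n, n))
```

The gap between the two columns is the whole argument for dynamic programming: the enumeration grows exponentially in n, the table only quadratically.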
Key insight: score is additive!
    S1: A C G T C A T C A
    S2: T A G T G T C A
  Compute the best alignment recursively
    For a given split (i, j), the best alignment is:
      best alignment of S1[1..i] and S2[1..j]
      + best alignment of S1[i+1..n] and S2[j+1..m]
  [Figure: S1 and S2 cut at several different split points (i, j); the same sub-alignments, e.g. ACGT against TAGTG and CGTCATCA against TGTCA, show up under different splits]
Key insight: re-use computation
  Identical sub-problems! We can reuse our work!
Solution #1: Memoization
  Create a big dictionary, indexed by aligned seqs
  When you encounter a new pair of sequences
    If it is in the dictionary:
      Look up the solution
    If it is not in the dictionary:
      Compute the solution
      Insert the solution in the dictionary
  Ensures that there is no duplicated work
    Only need to compute each sub-alignment once!
  Top-down approach
Solution #2: Dynamic programming
  Create a big table, indexed by (i, j)
    Fill it in from the beginning all the way till the end
    You know that you'll need every subpart
    Guaranteed to explore entire search space
  Ensures that there is no duplicated work
    Only need to compute each sub-alignment once!
  Very simple computationally!
  Bottom-up approach
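The two solutions can be put side by side in a short Python sketch. The scores (match +1, mismatch -1, gap -2) are an assumption borrowed from the worked matrices later in the deck, and the function names are illustrative only. The memoized version keys its dictionary on prefix lengths (i, j) rather than on the sequences themselves, which is one possible way to index the sub-problems.

```python
from functools import lru_cache

MATCH, MISMATCH, GAP = 1, -1, -2  # assumed scores, not fixed by this slide

def score_top_down(s: str, t: str) -> int:
    """Solution #1: recursion plus a dictionary (here functools.lru_cache)."""
    @lru_cache(maxsize=None)
    def best(i: int, j: int) -> int:
        # Best score for aligning the prefixes s[:i] and t[:j].
        if i == 0:
            return j * GAP
        if j == 0:
            return i * GAP
        sub = MATCH if s[i - 1] == t[j - 1] else MISMATCH
        return max(best(i - 1, j - 1) + sub,
                   best(i - 1, j) + GAP,
                   best(i, j - 1) + GAP)
    return best(len(s), len(t))

def score_bottom_up(s: str, t: str) -> int:
    """Solution #2: fill a table indexed by (i, j) from the beginning to the end."""
    table = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(1, len(s) + 1):
        table[i][0] = i * GAP
    for j in range(1, len(t) + 1):
        table[0][j] = j * GAP
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            sub = MATCH if s[i - 1] == t[j - 1] else MISMATCH
            table[i][j] = max(table[i - 1][j - 1] + sub,
                              table[i - 1][j] + GAP,
                              table[i][j - 1] + GAP)
    return table[len(s)][len(t)]

print(score_top_down("ACGTCATCA", "TAGTGTCA"), score_bottom_up("ACGTCATCA", "TAGTGTCA"))
```

Both functions evaluate the same recurrence, so they always agree; they differ only in the order in which sub-problems are visited.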
Key insight: matrix representation of alignments
    S1: A C G T C A T C A
    S2: T A G T G T C A
  [Figure: matrix with S1 along the top and S2 down the side; an alignment is drawn as a path through the matrix]
  Goal: find the best path through the matrix
Sequence alignment
  Dynamic Programming
  Global alignment
0. Setting up the scoring matrix
         -    A    G    T
    -    0
    A
    A
    G
    C
  Initialization: Top left: 0
  Update rule: A(i,j) = max{ }
  Termination: Bottom right
1. Allowing gaps in s
         -    A    G    T
    -    0
    A   -2
    A   -4
    G   -6
    C   -8
  Initialization: Top left: 0
  Update rule:
    A(i,j) = max{
      A(i-1, j) - 2
    }
  Termination: Bottom right
2. Allowing gaps in t
         -    A    G    T
    -    0   -2   -4   -6
    A   -2   -4   -6   -8
    A   -4   -6   -8  -10
    G   -6   -8  -10  -12
    C   -8  -10  -12  -14
  Initialization: Top left: 0
  Update rule:
    A(i,j) = max{
      A(i-1, j) - 2
      A(i, j-1) - 2
    }
  Termination: Bottom right
3. Allowing mismatches
         -    A    G    T
    -    0   -2   -4   -6
    A   -2   -1   -3   -5
    A   -4   -3   -2   -4
    G   -6   -5   -4   -3
    C   -8   -7   -6   -5
  (every diagonal step costs -1)
  Initialization: Top left: 0
  Update rule:
    A(i,j) = max{
      A(i-1, j) - 2
      A(i, j-1) - 2
      A(i-1, j-1) - 1
    }
  Termination: Bottom right
4. Choosing optimal paths
         -    A    G    T
    -    0   -2   -4   -6
    A   -2   -1   -3   -5
    A   -4   -3   -2   -4
    G   -6   -5   -4   -3
    C   -8   -7   -6   -5
  [Figure: arrows record, for each cell, which of the three neighbours achieved the max]
  Initialization: Top left: 0
  Update rule:
    A(i,j) = max{
      A(i-1, j) - 2
      A(i, j-1) - 2
      A(i-1, j-1) - 1
    }
  Termination: Bottom right
5. Rewarding matches
         -    A    G    T
    -    0   -2   -4   -6
    A   -2    1   -1   -3
    A   -4   -1    0   -2
    G   -6   -3    0   -1
    C   -8   -5   -2   -1
  Initialization: Top left: 0
  Update rule:
    A(i,j) = max{
      A(i-1, j) - 2
      A(i, j-1) - 2
      A(i-1, j-1) ± 1      (+1 for a match, -1 for a mismatch)
    }
  Termination: Bottom right
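Steps 0 to 5 can be combined into one small Python function. This is a sketch under the scores used on these slides (match +1, mismatch -1, gap -2); the name `global_align` and the tie-breaking order in the traceback (diagonal first, then up, then left) are choices made here for illustration.

```python
def global_align(s: str, t: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Fill the table A (rows = s, columns = t) and trace one optimal path back."""
    n, m = len(s), len(t)
    A = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        A[i][0] = i * gap                      # leading gaps in t
    for j in range(1, m + 1):
        A[0][j] = j * gap                      # leading gaps in s
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            A[i][j] = max(A[i - 1][j - 1] + sub, A[i - 1][j] + gap, A[i][j - 1] + gap)

    # Traceback from the bottom-right cell to the top-left cell.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = (match if s[i - 1] == t[j - 1] else mismatch) if i > 0 and j > 0 else None
        if i > 0 and j > 0 and A[i][j] == A[i - 1][j - 1] + sub:
            top.append(s[i - 1]); bottom.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and A[i][j] == A[i - 1][j] + gap:
            top.append(s[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(t[j - 1]); j -= 1
    return A, "".join(reversed(top)), "".join(reversed(bottom))

table, aligned_s, aligned_t = global_align("AAGC", "AGT")
for row in table:
    print(row)
print(aligned_s)
print(aligned_t)
```

For s = AAGC and t = AGT the filled table matches the matrix of step 5, and the bottom-right entry is the score of the returned alignment.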
Sequence alignment
  Global Alignment
  Semi-Global
  Dynamic Programming
Semi-Global Motivation
  Aligning the following sequences
    CAGCACTTGGATTCTCGG
    CAGC-----G-T----GG
    vvvv-----v-v----vv    = 8(1) + 0(-1) + 10(-2) = -12
  We might prefer the alignment
    CAGCA-CTTGGATTCTCGG
    ---CAGCGTGG--------
    ---vv-vxvvv--------   = 6(1) + 1(-1) + 12(-2) = -19
  (v = match, x = mismatch, - = gap)
  New qualities sought, new scoring scheme designed
    Intuitively, don't penalize "missing" end of the sequence
    We'd like to model this intuition
Ignoring starting gaps
         -    A    G    T
    -    0    0    0    0
    A    0    1   -1   -1
    A    0    1    0   -2
    G    0   -1    2    0
    C    0   -1    0    1
  Initialization: 1st row/col: 0
  Update rule:
    A(i,j) = max{
      A(i-1, j) - 2
      A(i, j-1) - 2
      A(i-1, j-1) ± 1
    }
  Termination: Bottom right
Ignoring trailing gaps
         -    A    G    T
    -    0    0    0    0
    A    0    1   -1   -1
    A    0    1    0   -2
    G    0   -1    2    0
    C    0   -1    0    1
  Initialization: 1st row/col: 0
  Update rule:
    A(i,j) = max{
      A(i-1, j) - 2
      A(i, j-1) - 2
      A(i-1, j-1) ± 1
    }
  Termination: max(last row/col)
Using the new scoring scheme
  With the old scoring scheme (all gaps count -2)
    CAGCACTTGGATTCTCGG
    CAGC-----G-T----GG
    vvvv-----v-v----vv    = 8(1) + 0(-1) + 10(-2) + 0(-0) = -12
  New score (end gaps are free)
    CAGCA-CTTGGATTCTCGG
    ---CAGCGTGG--------
    ---vv-vxvvv--------   = 6(1) + 1(-1) + 1(-2) + 11(-0) = 3
  (terms, in order: match, mismatch, gap, end gap)
Semi-global alignments
  Applications (query against subject):
    Finding a gene in a genome
    Aligning a read onto an assembly
    Finding the best alignment of a PCR primer
    Placing a marker onto a chromosome
  These situations have in common
    One sequence is much shorter than the other
    Alignment should span the entire length of the smaller sequence
    No need to align the entire length of the longer sequence
  In our scoring scheme we should
    Penalize end-gaps for subject sequence
    Do not penalize end-gaps for query sequence
Semi-Global Alignment
  (rows: s = AAGC, columns: t = AGT; match +1, mismatch -1, gap -2)
  Query: s, Subject: t (align all of s)
    Initialization: 1st row: 0 (1st column still penalized: A(i,0) = -2i)
    Update rule:
      A(i,j) = max{
        A(i-1, j) - 2
        A(i, j-1) - 2
        A(i-1, j-1) ± 1
      }
    Termination: max(last row)
         -    A    G    T
    -    0    0    0    0
    A   -2    1   -1   -1
    A   -4   -1    0   -2
    G   -6   -3    0   -1
    C   -8   -5   -2   -1
  ...or...
  Query: t, Subject: s (align all of t)
    Initialization: 1st col: 0 (1st row still penalized: A(0,j) = -2j)
    Update rule: same as above
    Termination: max(last col)
         -    A    G    T
    -    0   -2   -4   -6
    A    0    1   -1   -3
    A    0    1    0   -2
    G    0   -1    2    0
    C    0   -1    0    1
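A minimal Python sketch of the "align all of the query" variant follows. It assumes the same scores as the slides (match +1, mismatch -1, gap -2), puts the query on the rows and the subject on the columns, and keeps only two rows of the table because only the score is returned. The function name `semi_global_score` is illustrative.

```python
def semi_global_score(query: str, subject: str,
                      match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Best score for aligning ALL of `query` against any substring of `subject`."""
    n, m = len(query), len(subject)
    prev = [0] * (m + 1)                 # 1st row is 0: skipping a subject prefix is free
    for i in range(1, n + 1):
        cur = [i * gap] + [0] * m        # 1st column: skipping query characters is penalized
        for j in range(1, m + 1):
            sub = match if query[i - 1] == subject[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    return max(prev)                     # termination: max over the last row

print(semi_global_score("AGT", "AAGC"))
print(semi_global_score("CAGCGTGG", "CAGCACTTGGATTCTCGG"))
```

Swapping the two arguments gives the other variant on the slide, where all of the second sequence must be aligned instead.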
Sequence alignment
  Global Alignment
  Semi-Global
  Local Alignment
  Dynamic Programming
Intro to Local Alignments
  Statement of the problem
    A local alignment of strings s and t is an alignment of a substring of s with a substring of t
  Definitions (reminder):
    A substring consists of consecutive characters
    A subsequence of s need not be contiguous in s
  Naïve algorithm
    Now that we know how to use dynamic programming
    Take all O((nm)^2) pairs of substrings, and run each alignment in O(nm) time
  Dynamic programming
    By modifying our existing algorithms, we achieve O(mn)
Global Alignment
         -    A    G    T
    -    0   -2   -4   -6
    A   -2    1   -1   -3
    A   -4   -1    0   -2
    G   -6   -3    0   -1
    C   -8   -5   -2   -1
  Initialization: Top left: 0
  Update rule:
    A(i,j) = max{
      A(i-1, j) - 2
      A(i, j-1) - 2
      A(i-1, j-1) ± 1
    }
  Termination: Bottom right
Local Alignment
         -    A    G    T
    -    0    0    0    0
    A    0    1    0    0
    A    0    1    0    0
    G    0    0    2    0
    C    0    0    0    1
  Initialization: Top left: 0
  Update rule:
    A(i,j) = max{
      A(i-1, j) - 2
      A(i, j-1) - 2
      A(i-1, j-1) ± 1
      0
    }
  Termination: Anywhere
Local Alignment issues
  Resolving ambiguities
    When following arrows back, one can stop at any of the zero entries. Only stop when no arrow leaves. Longest.
  Correctness sketch by induction
    Assume we've correctly aligned up to (i,j)
    Consider the four cases of our max computation
    By inductive hypothesis recurse on (i-1,j-1), (i-1,j), (i,j-1)
    Base case: empty strings are suffixes aligned optimally
  Time analysis
    O(mn) time
    O(mn) space, can be brought to O(m+n)
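The local-alignment recurrence and its traceback can be sketched in Python as follows. The scores (match +1, mismatch -1, gap -2) are those of the matrix above; the name `local_align` is illustrative. As a simplification, this traceback stops at the first zero entry it reaches, whereas the slide suggests continuing while an arrow still leaves the cell.

```python
def local_align(s: str, t: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Return (best score, aligned piece of s, aligned piece of t)."""
    n, m = len(s), len(t)
    A = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            # The extra 0 lets an alignment start fresh at any cell.
            A[i][j] = max(0, A[i - 1][j - 1] + sub, A[i - 1][j] + gap, A[i][j - 1] + gap)
            if A[i][j] > best:
                best, best_pos = A[i][j], (i, j)

    # Termination is "anywhere": start the traceback at the best cell.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and A[i][j] > 0:
        sub = match if s[i - 1] == t[j - 1] else mismatch
        if A[i][j] == A[i - 1][j - 1] + sub:
            top.append(s[i - 1]); bottom.append(t[j - 1]); i -= 1; j -= 1
        elif A[i][j] == A[i - 1][j] + gap:
            top.append(s[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(t[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))

print(local_align("AAGC", "AGT"))
```

Only the initialization, the extra 0 in the max, and the starting point of the traceback differ from the global version, which is why the O(mn) running time carries over unchanged.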
Sequence alignment
  Global Alignment
  Semi-Global
  Local Alignment
  Affine Gap Penalty
  Dynamic Programming
Scoring the gaps more accurately
  Current model:
    Gap of length n incurs penalty n × d
  However, gaps usually occur in bunches
  Convex gap penalty function γ(n):
    for all n, γ(n+1) - γ(n) ≤ γ(n) - γ(n-1)
  [Figure: plots of γ(n) against n for the current model and for a convex penalty]
General gap dynamic programming
  Initialization: same
  Iteration:
    F(i, j) = max {
      F(i-1, j-1) + s(x_i, y_j)
      max_{k=0…i-1} F(k, j) - γ(i-k)
      max_{k=0…j-1} F(i, k) - γ(j-k)
    }
  Termination: same
  Running time: O(N^2 M)   (assume N > M)
  Space: O(NM)
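Compromise: affine gaps
  γ(n) = d + (n-1) × e
    d: gap open penalty
    e: gap extend penalty
  [Figure: plot of γ(n), starting at d and rising with slope e]
  To compute the optimal alignment, at position i, j, need to "remember" the best score if the gap is open and the best score if the gap is not open
    F(i, j): score of alignment x_1 … x_i to y_1 … y_j if x_i aligns to y_j
    G(i, j): score if x_i, or y_j, aligns to a gap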
Motivation for affine gap penalty
  Modeling evolution
    To introduce the first gap, a break must occur in DNA
    Multiple consecutive gaps are likely to be introduced by the same evolutionary event. Once the break is made, it's relatively easy to make multiple insertions or deletions.
    Fixed cost for opening a gap: p + q
    Linear cost increment for increasing number of gaps: q
  Affine gap cost function
    New gap function for length k: w(k) = p + q*k
    p + q is the cost of the first gap in a run
    q is the additional cost of each additional gap in the same run
Additional Matrices
  The amount of state needed increases
    In scoring a single entry in our matrix, we need to remember an extra piece of information
      Are we continuing a gap in s? (if not, start is more expensive)
      Are we continuing a gap in t? (if not, start is more expensive)
      Are we continuing from a match between s(i) and t(j)?
  Dynamic programming framework
    We encode this information in three different states for each element (i,j) of our alignment. Use three matrices
      a(i,j): best alignment of s[1..i] & t[1..j] that aligns s[i] with t[j]
      b(i,j): best alignment of s[1..i] & t[1..j] that aligns gap with t[j]
      c(i,j): best alignment of s[1..i] & t[1..j] that aligns s[i] with gap
Update rules
  When s[i] and t[j] are aligned (score can be different for each pair of chars):
    a(i,j) = score(s[i], t[j]) + max{
      a(i-1, j-1)
      b(i-1, j-1)
      c(i-1, j-1)
    }
  When t[j] aligns with a gap in s:
    b(i,j) = max{
      a(i, j-1) - (p+q)      starting a gap in s
      b(i, j-1) - q          extending a gap in s
      c(i, j-1) - (p+q)      stopping a gap in t, and starting one in s
    }
  When s[i] aligns with a gap in t:
    c(i,j) = max{
      a(i-1, j) - (p+q)
      c(i-1, j) - q
      b(i-1, j) - (p+q)
    }
  Find maximum over all three arrays: max(a[m,n], b[m,n], c[m,n]).
  Follow arrows back, skipping from matrix to matrix
Simplified rules
  Transitions from b to c are not necessary...
    ...if the worst mismatch costs less than p+q
      ACC-GGTA        ACCGGTA
      A--TGGTA        A-TGGTA
  When s[i] and t[j] are aligned (score can be different for each pair of chars):
    a(i,j) = score(s[i], t[j]) + max{
      a(i-1, j-1)
      b(i-1, j-1)
      c(i-1, j-1)
    }
  When t[j] aligns with a gap in s:
    b(i,j) = max{
      a(i, j-1) - (p+q)      starting a gap in s
      b(i, j-1) - q          extending a gap in s
    }
  When s[i] aligns with a gap in t:
    c(i,j) = max{
      a(i-1, j) - (p+q)
      c(i-1, j) - q
    }
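The three-matrix update rules can be written out directly in Python. This sketch uses the full (unsimplified) rules with gap cost w(k) = p + q*k. Several things are assumptions made here and not taken from the slides: the boundary initialization (a(0,0) = 0, b(0,j) = -(p + q*j), c(i,0) = -(p + q*i), everything else minus infinity), the default values p = 2 and q = 1, and the match/mismatch scores.

```python
NEG_INF = float("-inf")

def affine_score(s: str, t: str, p: int = 2, q: int = 1,
                 match: int = 1, mismatch: int = -1):
    """Global alignment score with affine gap cost w(k) = p + q*k, using matrices a, b, c."""
    n, m = len(s), len(t)
    a = [[NEG_INF] * (m + 1) for _ in range(n + 1)]   # s[i] aligned with t[j]
    b = [[NEG_INF] * (m + 1) for _ in range(n + 1)]   # gap aligned with t[j]
    c = [[NEG_INF] * (m + 1) for _ in range(n + 1)]   # s[i] aligned with gap
    a[0][0] = 0
    for j in range(1, m + 1):
        b[0][j] = -(p + q * j)        # assumed boundary: one gap run of length j in s
    for i in range(1, n + 1):
        c[i][0] = -(p + q * i)        # assumed boundary: one gap run of length i in t
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = sub + max(a[i - 1][j - 1], b[i - 1][j - 1], c[i - 1][j - 1])
            b[i][j] = max(a[i][j - 1] - (p + q),      # starting a gap in s
                          b[i][j - 1] - q,            # extending a gap in s
                          c[i][j - 1] - (p + q))      # gap in t ends, gap in s starts
            c[i][j] = max(a[i - 1][j] - (p + q),
                          c[i - 1][j] - q,
                          b[i - 1][j] - (p + q))
    return max(a[n][m], b[n][m], c[n][m])

print(affine_score("AAAA", "AA"))
```

With p = 2 and q = 1, aligning AAAA to AA scores -2: two matches and one gap run of length 2 costing p + 2q = 4, which beats two separate gaps of length 1 costing 3 each.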
General Gap Penalty
  Gap penalties are limited by the amount of state
  Linear gap penalty: w(k) = k*p
    State: current index tells if in a gap or not
  Affine gap penalty: w(k) = p + q*k, where q < p
    State: add binary value for each sequence: starting a gap or not
  What about quadratic: w(k) = p + q*k + r*k^2?
    State: needs to encode the length of the gap, which can be O(n)
    To encode it we need O(log n) bits of information. Not feasible
  What about a (mod 3) gap penalty for protein alignments?
    Gaps of length divisible by 3 are penalized less: conserve frame
    This is feasible, but requires more possible states
    Possible states are: starting, mod 3 = 1, mod 3 = 2, mod 3 = 0
Sequence alignment
  Global Alignment
  Semi-Global
  Local Alignment
  Affine Gap Penalty
  Variations on the Theme
  Dynamic Programming
Dynamic Programming Versatility
  Unified framework
    Dynamic programming algorithm. Local updates.
    Re-using past results in future computations.
    Memory usage optimizations
  Tools at our disposal
    Global alignment: entire length of two orthologous genes
    Semi-global alignment: piece of a larger sequence aligned entirely
    Local alignment: two genes sharing a functional domain
    Affine gap penalty: penalize first gap more than subsequent gaps
    Edit distance, min # of edit operations: M = 0, m = g = -1; every operation subtracts 1, be it mutation or gap
    Longest common subsequence: M = 1, m = g = 0. Every match adds one, be it contiguous or not with previous.
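The last two bullets say that edit distance and longest common subsequence are the same global-alignment recurrence under different scores. A short Python sketch makes that concrete; the function names are illustrative, and M, m, g are passed as `match`, `mismatch`, `gap`.

```python
def global_score(s: str, t: str, match: int, mismatch: int, gap: int) -> int:
    """Global alignment score, keeping only two rows of the table."""
    prev = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        cur = [i * gap] + [0] * len(t)
        for j in range(1, len(t) + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    return prev[-1]

def edit_distance(s: str, t: str) -> int:
    # M = 0, m = g = -1: the best score is minus the number of edit operations.
    return -global_score(s, t, match=0, mismatch=-1, gap=-1)

def lcs_length(s: str, t: str) -> int:
    # M = 1, m = g = 0: only matches contribute, so the score counts them.
    return global_score(s, t, match=1, mismatch=0, gap=0)

print(edit_distance("kitten", "sitting"), lcs_length("ACGTCATCA", "TAGTGTCA"))
```

For the opening example S1 = ACGTCATCA and S2 = TAGTGTCA, `lcs_length` returns 6, the length of the common subsequence AGTTCA shown earlier.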
DP Algorithm Variations
  (match +1, mismatch -1, gap -2; rows: s, columns: t)
  Global Alignment
         -    A    G    T
    -    0   -2   -4   -6
    A   -2    1   -1   -3
    A   -4   -1    0   -2
    G   -6   -3    0   -1
    C   -8   -5   -2   -1
  Semi-Global Alignment
         -    A    G    T
    -    0   -2   -4   -6
    A    0    1   -1   -3
    A    0    1    0   -2
    G    0   -1    2    0
    C    0   -1    0    1
  Local Alignment
         -    A    G    T
    -    0    0    0    0
    A    0    1    0    0
    A    0    1    0    0
    G    0    0    2    0
    A    0    1    0    1
Linear-Space Alignment
  Hirschberg's algorithm
    Longest common subsequence
      Given sequences s = s_1 s_2 … s_m, t = t_1 t_2 … t_n,
      find longest common subsequence u = u_1 … u_k
    Algorithm:
      F(i, j) = max {
        F(i-1, j)
        F(i, j-1)
        F(i-1, j-1) + [1, if s_i = t_j; 0 otherwise]
      }
    Hirschberg's algorithm solves this in linear space
Introduction: Compute optimal score
  It is easy to compute F(M, N) in linear space
    Allocate( column[1] )
    Allocate( column[2] )
    For i = 1 … M
      If i > 1, then:
        Free( column[i-2] )
        Allocate( column[i] )
      For j = 1 … N
        F(i, j) = …
Linear-space alignment
  To compute both the optimal score and the optimal alignment:
  Divide & Conquer approach:
  Notation:
    x^r, y^r: reverse of x, y
    E.g. x = accgg; x^r = ggcca
    F^r(i, j): optimal score of aligning x^r_1 … x^r_i & y^r_1 … y^r_j
      same as F(M-i+1, N-j+1)
Linear-space alignment
  Lemma:
    F(M, N) = max_{k=0…N} ( F(M/2, k) + F^r(M/2, N-k) )
  [Figure: DP matrix of x against y split at M/2; the optimal path crosses the split at k*, with F(M/2, k) covering one part and F^r(M/2, N-k) the other]
Linear-space alignment
  Now, using 2 columns of space, we can compute
    for k = 0 … N, F(M/2, k), F^r(M/2, N-k)
    PLUS the backpointers
Linear-space alignment
  Now, we can find k* maximizing F(M/2, k) + F^r(M/2, N-k)
  Also, we can trace the path exiting column M/2 from k*
Linear-space alignment
  Iterate this procedure to the left and right!
  [Figure: the two remaining sub-problems, of sizes M/2 × k* and M/2 × (N-k*)]
Linear-space alignment
  Hirschberg's Linear-space algorithm:
  MEMALIGN(l, l', r, r'):   (aligns x_l … x_l' with y_r … y_r')
    1. Let h = ⌈(l'-l)/2⌉
    2. Find in Time O((l'-l) × (r'-r)), Space O(r'-r)
       the optimal path, L_h, entering column h-1, exiting column h
       Let k_1 = pos'n at column h-2 where L_h enters
           k_2 = pos'n at column h+1 where L_h exits
    3. MEMALIGN(l, h-2, r, k_1)
    4. Output L_h
    5. MEMALIGN(h+1, l', k_2, r')
  Top level call: MEMALIGN(1, M, 1, N)
Linear-space alignment
  Time, Space analysis of Hirschberg's algorithm:
  To compute optimal path at middle column,
    For box of size M × N,
      Space: 2N
      Time: cMN, for some constant c
    Then, left, right calls cost c( M/2 × k* + M/2 × (N-k*) ) = cMN/2
  All recursive calls cost
    Total Time: cMN + cMN/2 + cMN/4 + … = 2cMN = O(MN)
    Total Space: O(N) for computation,
                 O(N+M) to store the optimal alignment
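The divide-and-conquer idea can be sketched in Python as below. This is not a transcription of MEMALIGN: it splits x at its midpoint, uses the lemma to pick k*, and recurses on the two halves, with simple base cases chosen here for illustration. The scores (match +1, mismatch -1, gap -2) and all function names are assumptions.

```python
MATCH, MISMATCH, GAP = 1, -1, -2   # assumed scores

def last_row(x: str, y: str) -> list:
    """Scores of aligning all of x with every prefix of y, using two rows of memory."""
    prev = [j * GAP for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [i * GAP] + [0] * len(y)
        for j in range(1, len(y) + 1):
            sub = MATCH if x[i - 1] == y[j - 1] else MISMATCH
            cur[j] = max(prev[j - 1] + sub, prev[j] + GAP, cur[j - 1] + GAP)
        prev = cur
    return prev

def hirschberg(x: str, y: str):
    """Return one optimal global alignment (aligned x, aligned y)."""
    if len(x) == 0:
        return "-" * len(y), y
    if len(y) == 0:
        return x, "-" * len(x)
    if len(x) == 1:
        # Put x opposite its best partner in y, or opposite a gap if that scores higher.
        k = max(range(len(y)), key=lambda j: MATCH if y[j] == x else MISMATCH)
        sub = MATCH if y[k] == x else MISMATCH
        if sub + (len(y) - 1) * GAP >= (len(y) + 1) * GAP:
            return "-" * k + x + "-" * (len(y) - k - 1), y
        return x + "-" * len(y), "-" + y
    mid = len(x) // 2
    forward = last_row(x[:mid], y)                 # plays the role of F(M/2, k)
    backward = last_row(x[mid:][::-1], y[::-1])    # plays the role of F^r(M/2, N-k)
    k_star = max(range(len(y) + 1), key=lambda k: forward[k] + backward[len(y) - k])
    left_x, left_y = hirschberg(x[:mid], y[:k_star])
    right_x, right_y = hirschberg(x[mid:], y[k_star:])
    return left_x + right_x, left_y + right_y

def alignment_score(top: str, bottom: str) -> int:
    """Re-score an alignment column by column, as a sanity check."""
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += GAP
        else:
            total += MATCH if a == b else MISMATCH
    return total

top, bottom = hirschberg("ACGTCATCA", "TAGTGTCA")
print(top)
print(bottom)
print(alignment_score(top, bottom), last_row("ACGTCATCA", "TAGTGTCA")[-1])
```

The final line prints the score of the recovered alignment next to the optimal score from the two-row pass; the two should coincide. The working memory is the pair of rows in `last_row`, plus the alignment being assembled.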
The Four-Russian Algorithm
  A useful speedup of Dynamic Programming
  Main Observation
    Within a rectangle of the DP matrix, values of D depend only on the values of A, B, C, and substrings x_l…l', y_r…r'
  Definition:
    A t-block is a t × t square of the DP matrix
  Idea:
    Divide matrix in t-blocks,
    Precompute t-blocks
  Speedup: O(t)
  [Figure: a t-block spanning x_l … x_l' and y_r … y_r', with A the top-left corner, B the top row, C the left column and D the remaining cells]
The Four-Russian Algorithm
  Main structure of the algorithm:
    Divide N × N DP matrix into K × K log_2 N-blocks that overlap by 1 column & 1 row
    For i = 1 … K
      For j = 1 … K
        Compute D_i,j as a function of A_i,j, B_i,j, C_i,j, x[l_i … l'_i], y[r_j … r'_j]
  Time: O(N^2 / log^2 N) times the cost of step 4
The Four-Russian Algorithm
  Another observation:
  (Assume m = 0, s = 1, d = 1)
  Lemma. Two adjacent cells of F(.,.) differ by at most 1
  Gusfield's book covers the case where m = 0, called the edit distance (p. 216):
    minimum # of substitutions + gaps to transform one string to another
The Four-Russian Algorithm
  Proof of Lemma:
  1. Same row:
     a. F(i, j) - F(i-1, j) ≤ +1
        At worst, one more gap:   x_1 … x_{i-1} x_i
                                  y_1 … y_j     -
     b. F(i, j) - F(i-1, j) ≥ -1
        [Figure: two-case table comparing an optimal alignment for F(i, j) with an alignment of x_1 … x_{i-1} against y_1 … y_j derived from it; the cases give differences ≥ -1 and +1]
  2. Same column: similar argument
The Four-Russian Algorithm
  Proof of Lemma:
  3. Same diagonal:
     a. F(i, j) - F(i-1, j-1) ≤ +1
        At worst, one additional mismatch in F(i, j)
     b. F(i, j) - F(i-1, j-1) ≥ -1
        [Figure: two-case table comparing an optimal alignment for F(i, j) with an alignment of x_1 … x_{i-1} against y_1 … y_{j-1} derived from it; the cases give differences ≥ -1 and +1]
The Four-Russian Algorithm
  Definition:
    The offset vector is a t-long vector of values from {-1, 0, 1}, where the first entry is 0
  If we know the value at A,
    and the top row, left column offset vectors,
    and x_l … x_l', y_r … y_r',
  Then we can find D
  [Figure: t-block with corner A, top row B, left column C and interior D]
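The Four-Russian Algorithm
  Example:
    x = AACT
    y = CACT
           A    A    C    T
      C    5    6    5    5
      A    6    5    5    4
      C    5    6    5    5
      T    4    5    6    5
  Offset vectors read off this block (first entry 0, then the difference between consecutive cells):
    top row:      0   1  -1   0
    left column:  0   1  -1  -1
    bottom row:   0   1   1  -1
    right column: 0  -1   1   0
The Four-Russian Algorithm
  Example (same block, different value at A):
    x = AACT
    y = CACT
           A    A    C    T
      C    1    2    1    1
      A    2    1    1    0
      C    1    2    1    1
      T    0    1    2    1
  The offset vectors are the same as in the previous example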
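The offset vectors in the two examples, and the lemma they rely on, can be checked with a few lines of Python. The sketch uses the edit-distance setting named on the lemma slide (substitution and gap each cost 1); the function names are illustrative.

```python
def offset_vector(values):
    """First entry 0, then the difference between each cell and the previous one."""
    return [0] + [b - a for a, b in zip(values, values[1:])]

def edit_distance_table(x: str, y: str):
    """Full edit-distance table: rows follow x, columns follow y."""
    D = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(len(x) + 1):
        D[i][0] = i
    for j in range(len(y) + 1):
        D[0][j] = j
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            mismatch = 0 if x[i - 1] == y[j - 1] else 1
            D[i][j] = min(D[i - 1][j - 1] + mismatch, D[i - 1][j] + 1, D[i][j - 1] + 1)
    return D

def adjacent_cells_differ_by_at_most_one(D) -> bool:
    """Check the lemma for row, column and diagonal neighbours."""
    rows, cols = len(D), len(D[0])
    for i in range(rows):
        for j in range(cols):
            for di, dj in ((0, 1), (1, 0), (1, 1)):
                if i + di < rows and j + dj < cols and abs(D[i + di][j + dj] - D[i][j]) > 1:
                    return False
    return True

print(offset_vector([5, 6, 5, 5]), offset_vector([1, 2, 1, 1]))
print(adjacent_cells_differ_by_at_most_one(edit_distance_table("AACT", "CACT")))
```

Both top rows, 5 6 5 5 and 1 2 1 1, give the same offset vector, which is why a block's behaviour can be tabulated independently of the value at A.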
The Four-Russian Algorithm
  Definition:
    The offset function of a t-block is a function that, for any given offset vectors of top row, left column, and x_l … x_l', y_r … y_r', produces offset vectors of bottom row, right column
  [Figure: t-block with corner A, top row B, left column C and interior D]
The Four-Russian Algorithm
  We can pre-compute the offset function:
    3^{2(t-1)} possible input offset vectors
    4^{2t} possible strings x_l … x_l', y_r … y_r'
    Therefore 3^{2(t-1)} × 4^{2t} values to pre-compute
  We can keep all these values in a table, and look up in linear time,
    or in O(1) time if we assume constant-lookup RAM for log-sized inputs
The Four-Russian Algorithm
  Four-Russians Algorithm: (Arlazarov, Dinic, Kronrod, Faradzev)
    1. Cover the DP table with t-blocks
    2. Initialize values F(.,.) in first row & column
    3. Row-by-row, use offset values at leftmost column and top row of each block, to find offset values at rightmost column and bottom row
    4. Let Q = total of offsets at row N
       F(N, N) = Q + F(N, 0)
The Four-Russian Algorithm
  [Figure: DP table covered with t × t blocks]
Evolution at the DNA level
    ACGGTGCAGTCACCA
    ACGTTGCAGTCCACCA
  [Figure: the two sequences annotated with the changes between them; surviving labels: "Sequence Changes", "Computing best alignment", "In absence of gaps"]
Sequence Alignment
    AGGCTATCACCTGACCTCCAGGCCGATGCCC
    TAGCTATCACGACCGCGGTCGATTTGCCCGAC

    -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
    TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
  Definition
    Given two strings x = x_1 x_2 ... x_M, y = y_1 y_2 … y_N,
    an alignment is an assignment of gaps to positions 0, …, M in x, and 0, …, N in y, so as to line up each letter in one sequence with either a letter, or a gap in the other sequence
Scoring Function
  Sequence edits:     AGGCCTC
    Mutations         AGGACTC
    Insertions        AGGGCCTC
    Deletions         AGG.CTC
  Scoring Function:
    Match:    +m
    Mismatch: -s
    Gap:      -d
  Score F = (# matches) × m - (# mismatches) × s - (# gaps) × d
How do we compute the best alignment?
    AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA    (along one axis)
    AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC      (along the other axis)
  [Figure: alignment matrix of the two sequences]
  Too many possible alignments: O(2^{M+N})
Alignment is additive
  Observation:
    The score of aligning   x_1 … x_M
                            y_1 … y_N
    is additive
  Say that    x_1 … x_i    x_{i+1} … x_M
  aligns to   y_1 … y_j    y_{j+1} … y_N
  The two scores add up:
    F(x[1:M], y[1:N]) = F(x[1:i], y[1:j]) + F(x[i+1:M], y[j+1:N])
Dynamic Programming
  We will now describe a dynamic programming algorithm
  Suppose we wish to align
    x_1 … x_M
    y_1 … y_N
  Let
    F(i, j) = optimal score of aligning
      x_1 … x_i
      y_1 … y_j
Dynamic Programming (cont'd)
  Notice three possible cases:
  1. x_i aligns to y_j
       x_1 … x_{i-1} x_i
       y_1 … y_{j-1} y_j
     F(i, j) = F(i-1, j-1) + { m, if x_i = y_j;  -s, if not }
  2. x_i aligns to a gap
       x_1 … x_{i-1} x_i
       y_1 … y_j     -
     F(i, j) = F(i-1, j) - d
  3. y_j aligns to a gap
       x_1 … x_i     -
       y_1 … y_{j-1} y_j
     F(i, j) = F(i, j-1) - d
Dynamic Programming (cont'd)
  How do we know which case is correct?
  Inductive assumption:
    F(i, j-1), F(i-1, j), F(i-1, j-1) are optimal
  Then,
    F(i, j) = max {
      F(i-1, j-1) + s(x_i, y_j)
      F(i-1, j) - d
      F(i, j-1) - d
    }
  Where s(x_i, y_j) = m, if x_i = y_j;  -s, if not
Example
  x = AGTA
  y = ATA
  m = 1, s = 1, d = 1   (so a match scores +1, a mismatch -1, a gap -1)
    F(i,j)     i=0    1    2    3    4
                      A    G    T    A
    j=0          0   -1   -2   -3   -4
      1    A    -1    1    0   -1   -2
      2    T    -2    0    0    1    0
      3    A    -3   -1   -1    0    2
  Optimal Alignment:
    F(4, 3) = 2
    AGTA
    A-TA
The Needleman-Wunsch Matrix
  [Figure: DP matrix with x_1 … x_M along one axis and y_1 … y_N along the other]
  Every nondecreasing path from (0,0) to (M, N) corresponds to an alignment of the two sequences
  Can think of it as a divide-and-conquer algorithm
The Needleman-Wunsch Algorithm
  1. Initialization.
     a. F(0, 0) = 0
     b. F(0, j) = - j × d
     c. F(i, 0) = - i × d
  2. Main Iteration. Filling-in partial alignments
     a. For each i = 1 … M
          For each j = 1 … N
            F(i, j) = max {
              F(i-1, j-1) + s(x_i, y_j)    [case 1]
              F(i-1, j) - d                [case 2]
              F(i, j-1) - d                [case 3]
            }
            Ptr(i, j) = DIAG, if [case 1]
                        LEFT, if [case 2]
                        UP,   if [case 3]
  3. Termination. F(M, N) is the optimal score, and from Ptr(M, N) can trace back optimal alignment
Performance
  Time:  O(NM)
  Space: O(NM)
  Later we will cover more efficient methods
A variant of the basic algorithm:
  Maybe it is OK to have an unlimited # of gaps in the beginning and end:
    ----------CTATCACCTGACCTCCAGGCCGATGCCCCTTCCGGC
    GCGAGTTCATCTATCAC--GACCGC--GGTCG--------------
  Then, we don't want to penalize gaps in the ends
Different types of overlaps
  [Figure: the ways two sequences can overlap]
The Overlap Detection variant
  Changes:
  1. Initialization
       For all i, j,
         F(i, 0) = 0
         F(0, j) = 0
  2. Termination
       F_OPT = max { max_i F(i, N),  max_j F(M, j) }
  [Figure: DP matrix of x_1 … x_M against y_1 … y_N with free first row/column and termination on the last row/column]
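The two changes listed for the overlap-detection variant are small enough to show in a Python sketch. The default scores (m = 1, s = 1, d = 2) and the name `overlap_score` are assumptions made for the example; the slide leaves m, s and d unspecified.

```python
def overlap_score(x: str, y: str, m: int = 1, s: int = 1, d: int = 2) -> int:
    """Best alignment score when gaps at the ends of either sequence are free."""
    M, N = len(x), len(y)
    # Change 1: F(i, 0) = F(0, j) = 0, so leading gaps cost nothing.
    F = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            F[i][j] = max(F[i - 1][j - 1] + sub, F[i - 1][j] - d, F[i][j - 1] - d)
    # Change 2: the alignment may end anywhere on the last column or last row.
    best_last_column = max(F[i][N] for i in range(M + 1))
    best_last_row = max(F[M][j] for j in range(N + 1))
    return max(best_last_column, best_last_row)

print(overlap_score("TTTACGT", "ACGTGGG"))
```

In the usage line, the suffix ACGT of the first string overlaps the prefix ACGT of the second, and the unaligned ends on both sides go unpenalized.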
The local alignment problem
  Given two strings x = x_1 … x_M,
                    y = y_1 … y_N
  Find substrings x', y' whose similarity (optimal global alignment value) is maximum
  e.g. x = aaaacccccgggg
       y = cccgggaaccaacc
Why local alignment
  Genes are shuffled between genomes
  Portions of proteins (domains) are often conserved
  Image removed due to copyright restrictions.
Cross-species genome similarity
  98% of genes are conserved between any two mammals
  >70% average similarity in protein sequence
  Example: atoh enhancer in human, mouse, rat, fugu fish
hum_a:GTTGACAATAGAGGGTCTGGCAGAGGCTC--------------------- @57331/400001
mus_a:GCTGACAATAGAGGGGCTGGCAGAGGCTC--------------------- @78560/400001
rat_a:GCTGACAATAGAGGGGCTGGCAGAGACTC--------------------- @112658/369938
fug_a:TTTGTTGATGGGGAGCGTGCATTAATTTCAGGCTATTGTTAACAGGCTCG@36008/68174
hum_a:CTGGCCGCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG@57381/400001
mus_a:CTGGCCCCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG@78610/400001
rat_a:CTGGCCCCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG@112708/369938
fug_a:TGGGCCGAGGTGTTGGATGGCCTGAGTGAAGCACGCGCTGTCAGCTGGCG@36058/68174
hum_a:AGCGCACTCTCCTTTCAGGCAGCTCCCCGGGGAGCTGTGCGGCCACATTT@57431/400001
mus_a:AGCGCACTCG-CTTTCAGGCCGCTCCCCGGGGAGCTGAGCGGCCACATTT@78659/400001
rat_a:AGCGCACTCG-CTTTCAGGCCGCTCCCCGGGGAGCTGCGCGGCCACATTT@112757/369938
fug_a:AGCGCTCGCG------------------------AGTCCCTGCCGTGTCC@36084/68174
hum_a:AACACCATCATCACCCCTCCCCGGCCTCCTCAACCTCGGCCTCCTCCTCG@57481/400001
mus_a:AACACCGTCGTCA-CCCTCCCCGGCCTCCTCAACCTCGGCCTCCTCCTCG@78708/400001
rat_a:AACACCGTCGTCA-CCCTCCCCGGCCTCCTCAACCTCGGCCTCCTCCTCG@112806/369938
fug_a:CCGAGGACCCTGA------------------------------------- @36097/68174
(alignment rows above: atoh enhancer in human, mouse, rat, fugu fish)
The Smith-Waterman algorithm
  Idea: Ignore badly aligning regions
  Modifications to Needleman-Wunsch:
    Initialization: F(0, j) = F(i, 0) = 0
    Iteration:
      F(i, j) = max {
        0
        F(i-1, j) - d
        F(i, j-1) - d
        F(i-1, j-1) + s(x_i, y_j)
      }
The Smith-Waterman algorithm (cont'd)
  Termination:
  1. If we want the best local alignment
       F_OPT = max_{i,j} F(i, j)
  2. If we want all local alignments scoring > t
       For all i, j find F(i, j) > t, and trace back
Scoring the gaps more accurately
  Simple, linear gap model:
    Gap of length n incurs penalty n × d
  However, gaps usually occur in bunches
  Convex gap penalty function γ(n):
    for all n, γ(n+1) - γ(n) ≤ γ(n) - γ(n-1)
  Algorithm: O(N^3) time, O(N^2) space
Compromise: affine gaps
  γ(n) = d + (n-1) × e
    d: gap open penalty
    e: gap extend penalty
  To compute the optimal alignment, at position i, j, need to "remember" the best score if the gap is open and the best score if the gap is not open
    F(i, j): score of alignment x_1 … x_i to y_1 … y_j if x_i aligns to y_j
    G(i, j): score if x_i, or y_j, aligns to a gap
Why do we need two matrices?
  1. x_i aligns to y_j
       x_1 … x_{i-1} x_i  x_{i+1}
       y_1 … y_{j-1} y_j  -
     Add -d
  2. x_i aligns to a gap
       x_1 … x_{i-1} x_i  x_{i+1}
       y_1 … y_j     -    -
     Add -e
Needleman-Wunsch with affine gaps
  Initialization:
    F(i, 0) = -(d + (i-1) × e)
    F(0, j) = -(d + (j-1) × e)
  Iteration:
    F(i, j) = max {
      F(i-1, j-1) + s(x_i, y_j)
      G(i-1, j-1) + s(x_i, y_j)
    }
    G(i, j) = max {
      F(i-1, j) - d
      F(i, j-1) - d
      G(i, j-1) - e
      G(i-1, j) - e
    }
  Termination: same
To generalize a little…
  … think of how you would compute the optimal alignment with this gap function γ(n)
  [Figure: plot of the gap function γ(n); its shape is not recoverable from the extraction]
  … in time O(MN)
Bounded Dynamic Programming
  Assume we know that x and y are very similar
  Assumption: # gaps(x, y) < k(N)   (say N > M)
  Then, x_i aligned to y_j implies | i - j | < k(N)
  We can align x and y more efficiently:
    Time, Space: O(N × k(N)) << O(N^2)
Bounded Dynamic Programming
  Initialization:
    F(i, 0), F(0, j) undefined for i, j > k
  Iteration:
    For i = 1 … M
      For j = max(1, i - k) … min(N, i + k)
        F(i, j) = max {
          F(i-1, j-1) + s(x_i, y_j)
          F(i, j-1) - d,  if j > i - k(N)
          F(i-1, j) - d,  if j < i + k(N)
        }
  Termination: same
  Easy to extend to the affine gap case
  [Figure: DP matrix of x_1 … x_M against y_1 … y_N with only a band of width k(N) around the main diagonal filled in]
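The banded iteration can be sketched in Python as follows. The table is stored as a dictionary so that cells outside the band simply do not exist; this storage choice, the default scores (m = 1, s = 1, d = 1) and the name `banded_score` are all assumptions made for illustration.

```python
def banded_score(x: str, y: str, k: int, m: int = 1, s: int = 1, d: int = 1):
    """Global alignment score restricted to cells with |i - j| <= k.

    Returns None when the end cell (M, N) lies outside the band.
    """
    M, N = len(x), len(y)
    F = {(0, 0): 0}
    for i in range(1, min(M, k) + 1):
        F[(i, 0)] = -i * d                 # F(i, 0) is undefined for i > k
    for j in range(1, min(N, k) + 1):
        F[(0, j)] = -j * d                 # F(0, j) is undefined for j > k
    for i in range(1, M + 1):
        for j in range(max(1, i - k), min(N, i + k) + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            best = F[(i - 1, j - 1)] + sub
            if j > i - k:                  # cell (i, j-1) is inside the band
                best = max(best, F[(i, j - 1)] - d)
            if j < i + k:                  # cell (i-1, j) is inside the band
                best = max(best, F[(i - 1, j)] - d)
            F[(i, j)] = best
    return F.get((M, N))

print(banded_score("AGTA", "ATA", k=1))
```

Only about N × (2k + 1) cells are ever created, which is where the O(N × k(N)) time and space bound comes from.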
