Sequence Alignment and Dynamic Programming: 6.096 - Algorithms For Computational Biology

6.096 Algorithms for Computational Biology
Sequence Alignment and Dynamic Programming
Lecture 1 - Introduction
Lecture 2 - Hashing and BLAST
Lecture 3 - Combinatorial Motif Finding
Lecture 4 - Statistical Motif Finding
Challenges in Computational Biology
[Figure: overview diagram linking DNA and RNA transcript to the numbered topics below]
1 Gene Finding
2 Sequence alignment
3 Database lookup
4 Genome Assembly
5 Regulatory motif discovery
6 Comparative Genomics
7 Evolutionary Theory
8 Gene expression analysis
9 Cluster discovery
10 Gibbs sampling
11 Protein network analysis
12 Regulatory network inference
13 Emerging network properties
Comparing two DNA sequences
Given two possibly related strings S1 and S2:
- What is the longest common subsequence (LCSS)?
- Edit distance: the number of changes needed to turn S1 into S2.
S1:   A C G T C A T C A
S2:   T A G T G T C A
LCSS: A G T T C A
How can we compute the best alignment?
S1: A C G T C A T C A
S2: T A G T G T C A
Need a scoring function:
- Score(alignment) = total cost of editing S1 into S2
- Cost of mutation
- Cost of insertion / deletion
- Reward of match
Need an algorithm for inferring the best alignment:
- Enumeration?
- How would you do it?
- How many alignments are there?
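Why we need a smart algorithm
Ways to align two sequences of length m, n:
$$\binom{m+n}{m} = \frac{(m+n)!}{m!\,n!}, \qquad \text{and for } n = m: \quad \binom{2m}{m} = \frac{(2m)!}{(m!)^2} \approx \frac{2^{2m}}{\sqrt{\pi m}}$$
For two sequences of length n:
n      Enumeration   Today's lecture
10     184,756       100
20     1.40E+11      400
100    9.00E+58      10,000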
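The table above can be checked numerically. The sketch below is illustrative only: the function names are made up for this example, the "Enumeration" column is taken to be the binomial count from the formula, and the "Today's lecture" column is taken to be the n × n table size.

```python
from math import comb, pi, sqrt


def count_alignments(m: int, n: int) -> int:
    """Binomial count C(m+n, m) used on the slide for sequences of length m and n."""
    return comb(m + n, m)


def central_binomial_estimate(n: int) -> float:
    """Stirling-style estimate 2^(2n) / sqrt(pi * n) of C(2n, n)."""
    return 2 ** (2 * n) / sqrt(pi * n)


if __name__ == "__main__":
    for n in (10, 20, 100):
        exact = count_alignments(n, n)
        # Columns: n, exact enumeration count, estimate, DP table size n*n
        print(n, f"{exact:.2e}", f"{central_binomial_estimate(n):.2e}", n * n)
```

The exact count grows exponentially while the dynamic programming table grows only quadratically, which is the point of the comparison.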
Key insight: score is additive!
Compute the best alignment recursively.
For a given split (i, j), the best alignment is:
  best alignment of S1[1..i] and S2[1..j]
+ best alignment of S1[i+1..n] and S2[j+1..m]
[Figure: S1 = A C G T C A T C A and S2 = T A G T G T C A cut at positions i and j; different splits share sub-alignments such as the prefix pair A C G T / T A G T G and the suffix pair C G T C A T C A / T G T C A]
Key insight: re-use computation
Identical sub-problems! We can reuse our work!
Solution #1: Memoization
- Create a big dictionary, indexed by aligned seqs
- When you encounter a new pair of sequences:
  - If it is in the dictionary: look up the solution
  - If it is not in the dictionary: compute the solution, then insert the solution in the dictionary
- Ensures that there is no duplicated work
  - Only need to compute each sub-alignment once!
- Top-down approach
Solution #2: Dynamic programming
- Create a big table, indexed by (i,j)
- Fill it in from the beginning all the way till the end
- You know that you'll need every subpart
- Guaranteed to explore entire search space
- Ensures that there is no duplicated work
  - Only need to compute each sub-alignment once!
- Very simple computationally!
- Bottom-up approach
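The two solutions can be compared side by side. This is a minimal sketch, not the lecture's code: the scores (match +1, mismatch -1, gap -2) are borrowed from the worked matrices later in the deck, and the function names are hypothetical.

```python
from functools import lru_cache

MATCH, MISMATCH, GAP = 1, -1, -2  # assumed scores, matching the later slides


def score_top_down(s1: str, s2: str) -> int:
    """Memoization: recurse from the full problem, caching each (i, j) once."""

    @lru_cache(maxsize=None)  # the "big dictionary", keyed by prefix lengths
    def best(i: int, j: int) -> int:
        if i == 0:
            return j * GAP
        if j == 0:
            return i * GAP
        sub = MATCH if s1[i - 1] == s2[j - 1] else MISMATCH
        return max(best(i - 1, j - 1) + sub,
                   best(i - 1, j) + GAP,
                   best(i, j - 1) + GAP)

    return best(len(s1), len(s2))


def score_bottom_up(s1: str, s2: str) -> int:
    """Dynamic programming: fill the whole (i, j) table from the beginning."""
    n, m = len(s1), len(s2)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        table[i][0] = i * GAP
    for j in range(1, m + 1):
        table[0][j] = j * GAP
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = MATCH if s1[i - 1] == s2[j - 1] else MISMATCH
            table[i][j] = max(table[i - 1][j - 1] + sub,
                              table[i - 1][j] + GAP,
                              table[i][j - 1] + GAP)
    return table[n][m]
```

Both functions compute each sub-alignment once; the top-down version only visits sub-problems it actually reaches, while the bottom-up version avoids recursion entirely.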
Key insight: Matrix representation of alignments
Goal: find the best path through the matrix.
[Figure: alignment matrix with S1 = A C G T C A T C A across the top and S2 = T A G T G T C A down the side; the highlighted path is labelled A G T C/G T C A]
Sequence alignment
Dynamic Programming
- Global alignment

0. Setting up the scoring matrix
         -    A    G    T
    -    0
    A
    A
    G
    C
Initialization: top left: 0
Update Rule: A(i,j) = max{ }
Termination: bottom right
1. Allowing gaps in s
         -    A    G    T
    -    0
    A   -2
    A   -4
    G   -6
    C   -8
Initialization: top left: 0
Update Rule: A(i,j) = max{ A(i-1,j) - 2 }
Termination: bottom right
2. Allowing gaps in t
         -    A    G    T
    -    0   -2   -4   -6
    A   -2   -4   -6   -8
    A   -4   -6   -8  -10
    G   -6   -8  -10  -12
    C   -8  -10  -12  -14
Initialization: top left: 0
Update Rule: A(i,j) = max{ A(i-1,j) - 2, A(i,j-1) - 2 }
Termination: bottom right
3. Allowing mismatches
         -    A    G    T
    -    0   -2   -4   -6
    A   -2   -1   -3   -5
    A   -4   -3   -2   -4
    G   -6   -5   -4   -3
    C   -8   -7   -6   -5
Every diagonal step costs -1.
Initialization: top left: 0
Update Rule: A(i,j) = max{ A(i-1,j) - 2, A(i,j-1) - 2, A(i-1,j-1) - 1 }
Termination: bottom right
4. Choosing optimal paths
         -    A    G    T
    -    0   -2   -4   -6
    A   -2   -1   -3   -5
    A   -4   -3   -2   -4
    G   -6   -5   -4   -3
    C   -8   -7   -6   -5
[Figure: arrows mark which predecessor achieved each maximum; every diagonal step costs -1]
Initialization: top left: 0
Update Rule: A(i,j) = max{ A(i-1,j) - 2, A(i,j-1) - 2, A(i-1,j-1) - 1 }
Termination: bottom right
5. Rewarding matches
         -    A    G    T
    -    0   -2   -4   -6
    A   -2    1   -1   -3
    A   -4   -1    0   -2
    G   -6   -3    0   -1
    C   -8   -5   -2   -1
Diagonal steps score +1 for a match and -1 for a mismatch.
Initialization: top left: 0
Update Rule: A(i,j) = max{ A(i-1,j) - 2, A(i,j-1) - 2, A(i-1,j-1) ± 1 }
Termination: bottom right
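The step-by-step construction above can be reproduced with a short table-filling routine. This is a sketch under stated assumptions: `global_matrix` is a hypothetical name, rows are taken to be A A G C and columns A G T as in the matrices, and the default scores are those of step 5.

```python
def global_matrix(s, t, match=1, mismatch=-1, gap=-2):
    """Fill the global-alignment table A for s (rows) against t (columns)."""
    rows, cols = len(s) + 1, len(t) + 1
    A = [[0] * cols for _ in range(rows)]
    # First column and first row: only gaps are possible.
    for i in range(1, rows):
        A[i][0] = A[i - 1][0] + gap
    for j in range(1, cols):
        A[0][j] = A[0][j - 1] + gap
    # Update rule: best of gap from above, gap from the left, or diagonal step.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = match if s[i - 1] == t[j - 1] else mismatch
            A[i][j] = max(A[i - 1][j] + gap,
                          A[i][j - 1] + gap,
                          A[i - 1][j - 1] + diag)
    return A


if __name__ == "__main__":
    for row in global_matrix("AAGC", "AGT"):
        print(row)
```

Calling `global_matrix("AAGC", "AGT", match=-1)` makes every diagonal step cost -1, which corresponds to the "Allowing mismatches" step; the default arguments correspond to "Rewarding matches", and the bottom-right entry is the global score.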
Sequence alignment
Dynamic Programming
- Global Alignment
- Semi-Global

Semi-Global Motivation
Aligning the following sequences:
CAGCACTTGGATTCTCGG
CAGC-----G-T----GG
vvvv-----v-v----vv = 8(1) + 0(-1) + 10(-2) = -12
We might prefer the alignment:
CAGCA-CTTGGATTCTCGG
---CAGCGTGG--------
---vv-vxvvv-------- = 6(1) + 1(-1) + 12(-2) = -19
(v = match, x = mismatch, - = gap; terms in order: match, mismatch, gap)
New qualities sought, new scoring scheme designed:
- Intuitively, don't penalize missing the end of the sequence
- We'd like to model this intuition
Ignoring starting gaps
         -    A    G    T
    -    0    0    0    0
    A    0    1   -1   -1
    A    0    1    0   -2
    G    0   -1    2    0
    C    0   -1    0    1
Initialization: 1st row/col: 0
Update Rule: A(i,j) = max{ A(i-1,j) - 2, A(i,j-1) - 2, A(i-1,j-1) ± 1 }
Termination: bottom right
Ignoring trailing gaps
         -    A    G    T
    -    0    0    0    0
    A    0    1   -1   -1
    A    0    1    0   -2
    G    0   -1    2    0
    C    0   -1    0    1
Initialization: 1st row/col: 0
Update Rule: A(i,j) = max{ A(i-1,j) - 2, A(i,j-1) - 2, A(i-1,j-1) ± 1 }
Termination: max(last row/col)
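The two changes (zero initialization, termination at the best cell of the last row or column) can be sketched as follows. The names `ends_free_matrix` and `ends_free_score` are illustrative, and the scores are the +1 / -1 / -2 values assumed throughout these slides.

```python
def ends_free_matrix(s, t, match=1, mismatch=-1, gap=-2):
    """Table with free starting gaps: first row and first column stay 0."""
    A = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            diag = match if s[i - 1] == t[j - 1] else mismatch
            A[i][j] = max(A[i - 1][j] + gap,
                          A[i][j - 1] + gap,
                          A[i - 1][j - 1] + diag)
    return A


def ends_free_score(s, t, **scores):
    """Free trailing gaps: take the maximum over the last row and last column."""
    A = ends_free_matrix(s, t, **scores)
    last_row = max(A[-1])
    last_col = max(row[-1] for row in A)
    return max(last_row, last_col)
```

For the running example, `ends_free_score("AAGC", "AGT")` picks the bottom-right 1, since no other entry in the last row or column is larger.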
Using the new scoring scheme
With the old scoring scheme (all gaps count -2):
CAGCACTTGGATTCTCGG
CAGC-----G-T----GG
vvvv-----v-v----vv = 8(1) + 0(-1) + 10(-2) + 0(-0) = -12
New score (end gaps are free):
CAGCA-CTTGGATTCTCGG
---CAGCGTGG--------
---vv-vxvvv-------- = 6(1) + 1(-1) + 1(-2) + 11(-0) = 3
(terms in order: match, mismatch, gap, end gap)
Semi-global alignments
Applications (a short query aligned against a long subject):
- Finding a gene in a genome
- Aligning a read onto an assembly
- Finding the best alignment of a PCR primer
- Placing a marker onto a chromosome
These situations have in common:
- One sequence is much shorter than the other
- Alignment should span the entire length of the smaller sequence
- No need to align the entire length of the longer sequence
In our scoring scheme we should:
- Penalize end-gaps for subject sequence
- Do not penalize end-gaps for query sequence
Semi-Global Alignment
Update Rule (both variants): A(i,j) = max{ A(i-1,j) - 2, A(i,j-1) - 2, A(i-1,j-1) ± 1 }
Query: s, Subject: t (align all of s)
- Initialization: 1st row: 0
- Termination: max(last row)
...or...
Query: t, Subject: s (align all of t)
- Initialization: 1st col: 0
- Termination: max(last col)
[Figure: the two DP matrices for A A G C (rows) against A G T (columns), one per variant]
Sequence alignment
Dynamic Programming
- Global Alignment
- Semi-Global
- Local Alignment

Intro to Local Alignments
Statement of the problem:
- A local alignment of strings s and t is an alignment of a substring of s with a substring of t
Definitions (reminder):
- A substring consists of consecutive characters
- A subsequence of s need not be contiguous in s
Naïve algorithm:
- Now that we know how to use dynamic programming
- Take all O((nm)^2) pairs of substrings, and run each alignment in O(nm) time
Dynamic programming:
- By modifying our existing algorithms, we achieve O(mn)
Global Alignment
         -    A    G    T
    -    0   -2   -4   -6
    A   -2    1   -1   -3
    A   -4   -1    0   -2
    G   -6   -3    0   -1
    C   -8   -5   -2   -1
Initialization: top left: 0
Update Rule: A(i,j) = max{ A(i-1,j) - 2, A(i,j-1) - 2, A(i-1,j-1) ± 1 }
Termination: bottom right
Local Alignment
         -    A    G    T
    -    0    0    0    0
    A    0    1    0    0
    A    0    1    0    0
    G    0    0    2    0
    C    0    0    0    1
Initialization: top left: 0
Update Rule: A(i,j) = max{ A(i-1,j) - 2, A(i,j-1) - 2, A(i-1,j-1) ± 1, 0 }
Termination: anywhere
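The only changes relative to the global table are the extra 0 in the max and the free choice of end point. A minimal sketch, with hypothetical names `local_matrix` and `best_local` and the same assumed scores as before:

```python
def local_matrix(s, t, match=1, mismatch=-1, gap=-2):
    """Local-alignment table: every cell is clamped at 0."""
    A = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            diag = match if s[i - 1] == t[j - 1] else mismatch
            A[i][j] = max(0,                      # start a fresh alignment here
                          A[i - 1][j] + gap,
                          A[i][j - 1] + gap,
                          A[i - 1][j - 1] + diag)
    return A


def best_local(s, t, **scores):
    """Termination 'anywhere': return (best score, (i, j)) over the whole table."""
    A = local_matrix(s, t, **scores)
    score, i, j = max((A[i][j], i, j)
                      for i in range(len(A))
                      for j in range(len(A[0])))
    return score, (i, j)
```

For A A G C against A G T the best cell is the 2 at row G, column G, corresponding to the local alignment of AG with AG.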
Local Alignment issues
Resolving ambiguities:
- When following arrows back, one can stop at any of the zero entries. Only stop when no arrow leaves. Longest.
Correctness sketch by induction:
- Assume we've correctly aligned up to (i,j)
- Consider the four cases of our max computation
- By inductive hypothesis recurse on (i-1,j-1), (i-1,j), (i,j-1)
- Base case: empty strings are suffixes aligned optimally
Time analysis:
- O(mn) time
- O(mn) space, can be brought to O(m+n)
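Sequence alignment
Dynamic Programming
- Global Alignment
- Semi-Global
- Local Alignment
- Affine Gap Penalty

Scoring the gaps more accurately
Current model: a gap of length n incurs penalty $\gamma(n) = n \times d$
However, gaps usually occur in bunches
Convex gap penalty function $\gamma(n)$: for all n, $\gamma(n+1) - \gamma(n) \le \gamma(n) - \gamma(n-1)$
[Figure: plots of $\gamma(n)$ for the current model and for a convex penalty]

General gap dynamic programming
Initialization: same
Iteration:
$$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i, y_j) \\ \max_{k=0 \ldots i-1} F(k,j) - \gamma(i-k) \\ \max_{k=0 \ldots j-1} F(i,k) - \gamma(j-k) \end{cases}$$
Termination: same
Running Time: O(N^2 M) (assume N > M)
Space: O(NM)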
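Compromise: affine gaps
$\gamma(n) = d + (n-1) \times e$, where d is the gap open penalty and e is the gap extend penalty
[Figure: plot of $\gamma(n)$ with the first step d and each further step e]
To compute the optimal alignment, at position i,j we need to remember the best score if a gap is open and the best score if a gap is not open:
F(i,j): score of alignment $x_1 \ldots x_i$ to $y_1 \ldots y_j$ if $x_i$ aligns to $y_j$
G(i,j): score if $x_i$, or $y_j$, aligns to a gap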
Motivation for affine gap penalty
Modeling evolution:
- To introduce the first gap, a break must occur in DNA
- Multiple consecutive gaps are likely to be introduced by the same evolutionary event. Once the break is made, it's relatively easy to make multiple insertions or deletions.
- Fixed cost for opening a gap: p+q
- Linear cost increment for increasing number of gaps: q
Affine gap cost function:
- New gap function for length k: w(k) = p + q*k
- p+q is the cost of the first gap in a run
- q is the additional cost of each additional gap in the same run
Additional Matrices
The amount of state needed increases:
- In scoring a single entry in our matrix, we need to remember an extra piece of information
  - Are we continuing a gap in s? (if not, start is more expensive)
  - Are we continuing a gap in t? (if not, start is more expensive)
  - Are we continuing from a match between s(i) and t(j)?
Dynamic programming framework:
- We encode this information in three different states for each element (i,j) of our alignment. Use three matrices:
  - a(i,j): best alignment of s[1..i] & t[1..j] that aligns s[i] with t[j]
  - b(i,j): best alignment of s[1..i] & t[1..j] that aligns gap with t[j]
  - c(i,j): best alignment of s[1..i] & t[1..j] that aligns s[i] with gap
Update rules
When s[i] and t[j] are aligned (the score can be different for each pair of chars):
$$a(i,j) = \mathrm{score}(s[i], t[j]) + \max \begin{cases} a(i-1,j-1) \\ b(i-1,j-1) \\ c(i-1,j-1) \end{cases}$$
When t[j] aligns with a gap in s:
$$b(i,j) = \max \begin{cases} a(i,j-1) - (p+q) & \text{starting a gap in } s \\ b(i,j-1) - q & \text{extending a gap in } s \\ c(i,j-1) - (p+q) & \text{stopping a gap in } t \text{, and starting one in } s \end{cases}$$
When s[i] aligns with a gap in t:
$$c(i,j) = \max \begin{cases} a(i-1,j) - (p+q) \\ c(i-1,j) - q \\ b(i-1,j) - (p+q) \end{cases}$$
Find maximum over all three arrays: max(a[m,n], b[m,n], c[m,n]).
Follow arrows back, skipping from matrix to matrix.
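The three update rules translate directly into three tables. The sketch below is an illustration, not the lecture's implementation: the function name, the default values p = 2 and q = 1, the match/mismatch scores, and the boundary initialization (a gap run of length k along an edge costs p + q*k) are all assumptions chosen to be consistent with w(k) = p + q*k.

```python
NEG_INF = float("-inf")


def affine_global_score(s, t, p=2, q=1, match=1, mismatch=-1):
    """Global alignment score with gap cost w(k) = p + q*k, using tables a, b, c."""
    m, n = len(s), len(t)
    a = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # s[i] aligned with t[j]
    b = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # gap aligned with t[j]
    c = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # s[i] aligned with gap
    a[0][0] = 0
    for j in range(1, n + 1):
        b[0][j] = -(p + q * j)  # t[1..j] against a single run of gaps
    for i in range(1, m + 1):
        c[i][0] = -(p + q * i)  # s[1..i] against a single run of gaps
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = sub + max(a[i - 1][j - 1], b[i - 1][j - 1], c[i - 1][j - 1])
            b[i][j] = max(a[i][j - 1] - (p + q),   # start a gap in s
                          b[i][j - 1] - q,         # extend a gap in s
                          c[i][j - 1] - (p + q))   # switch from gap in t to gap in s
            c[i][j] = max(a[i - 1][j] - (p + q),   # start a gap in t
                          c[i - 1][j] - q,         # extend a gap in t
                          b[i - 1][j] - (p + q))   # switch from gap in s to gap in t
    return max(a[m][n], b[m][n], c[m][n])
```

With these assumed parameters, a single gap of length 3 costs 5 rather than the 9 it would cost as three independently opened gaps, which is the behavior the affine model is meant to capture.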
Simplified rules
Transitions from b to c are not necessary... if the worst mismatch costs less than p+q:
ACC-GGTA                      ACCGGTA
A--TGGTA   can be replaced by A-TGGTA
When s[i] and t[j] are aligned (the score can be different for each pair of chars):
$$a(i,j) = \mathrm{score}(s[i], t[j]) + \max \begin{cases} a(i-1,j-1) \\ b(i-1,j-1) \\ c(i-1,j-1) \end{cases}$$
When t[j] aligns with a gap in s:
$$b(i,j) = \max \begin{cases} a(i,j-1) - (p+q) & \text{starting a gap in } s \\ b(i,j-1) - q & \text{extending a gap in } s \end{cases}$$
When s[i] aligns with a gap in t:
$$c(i,j) = \max \begin{cases} a(i-1,j) - (p+q) \\ c(i-1,j) - q \end{cases}$$
General Gap Penalty
Gap penalties are limited by the amount of state:
- Linear gap penalty: w(k) = k*p
  - State: current index tells if in a gap or not
- Affine gap penalty: w(k) = p + q*k, where q < p
  - State: add binary value for each sequence: starting a gap or not
- What about quadratic: w(k) = p + q*k + r*k^2?
  - State: needs to encode the length of the gap, which can be O(n)
  - To encode it we need O(log n) bits of information. Not feasible
- What about a (mod 3) gap penalty for protein alignments?
  - Gaps of length divisible by 3 are penalized less: conserve frame
  - This is feasible, but requires more possible states
  - Possible states are: starting, mod 3 = 1, mod 3 = 2, mod 3 = 0
Sequence alignment
Dynamic Programming
- Global Alignment
- Semi-Global
- Local Alignment
- Affine Gap Penalty
- Variations on the Theme

Dynamic Programming Versatility
Unified framework:
- Dynamic programming algorithm. Local updates.
- Re-using past results in future computations.
- Memory usage optimizations
Tools at our disposal:
- Global alignment: entire length of two orthologous genes
- Semi-global alignment: piece of a larger sequence aligned entirely
- Local alignment: two genes sharing a functional domain
- Affine Gap Penalty: penalize first gap more than subsequent gaps
- Edit distance, min # of edit operations: M=0, m=g=-1. Every operation subtracts 1, be it mutation or gap
- Longest common subsequence: M=1, m=g=0. Every match adds one, be it contiguous or not with previous.
DP Algorithm Variations
[Figure: three DP matrices, each with s down the side and t across the top, for Global Alignment, Semi-Global Alignment, and Local Alignment]
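Bounded Dynamic Programming
Initialization:
F(i,0), F(0,j) undefined for i, j > k
Iteration:
For i = 1…M
  For j = max(1, i-k)…min(N, i+k)
$$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i, y_j) \\ F(i,j-1) - d, & \text{if } j > i - k(N) \\ F(i-1,j) - d, & \text{if } j < i + k(N) \end{cases}$$
Termination: same
Easy to extend to the affine gap case
[Figure: DP matrix for $x_1 \ldots x_M$ against $y_1 \ldots y_N$ with a diagonal band of width k(N)]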
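A banded fill can be sketched by storing only the cells with |i - j| ≤ k. This is an illustrative version, not the lecture's code: the function name and default scores are assumptions, and cells outside the band are treated as minus infinity, which has the same effect as the "if j > i - k" and "if j < i + k" guards in the recurrence.

```python
def banded_global_score(x, y, k, match=1, mismatch=-1, d=2):
    """Global alignment score restricted to the band |i - j| <= k."""
    M, N = len(x), len(y)
    if abs(M - N) > k:
        raise ValueError("band too narrow: (M, N) lies outside |i - j| <= k")
    neg_inf = float("-inf")
    F = {(0, 0): 0}  # sparse table: only in-band cells are stored

    def get(i, j):
        return F.get((i, j), neg_inf)  # out-of-band cells are 'undefined'

    for i in range(1, min(M, k) + 1):
        F[(i, 0)] = -i * d
    for j in range(1, min(N, k) + 1):
        F[(0, j)] = -j * d
    for i in range(1, M + 1):
        for j in range(max(1, i - k), min(N, i + k) + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            F[(i, j)] = max(get(i - 1, j - 1) + sub,
                            get(i, j - 1) - d,
                            get(i - 1, j) - d)
    return F[(M, N)]
```

Only about N × (2k + 1) cells are touched, so the work drops from O(N^2) to O(N × k) when the two sequences are known to be similar.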
Linear-space alignment
Now, we can find $k^*$ maximizing $F(M/2, k) + F^r(M/2, N-k)$
Also, we can trace the path exiting column M/2 from $k^*$
Linear-Space Alignment
Hirschberg's algorithm
Longest common subsequence:
- Given sequences $s = s_1 s_2 \ldots s_m$, $t = t_1 t_2 \ldots t_n$
- Find longest common subsequence $u = u_1 \ldots u_k$
Algorithm:
$$F(i,j) = \max \begin{cases} F(i-1,j) \\ F(i,j-1) \\ F(i-1,j-1) + [1 \text{ if } s_i = t_j;\ 0 \text{ otherwise}] \end{cases}$$
Hirschberg's algorithm solves this in linear space
Introduction: Compute optimal score
It is easy to compute F(M, N) in linear space:
Allocate(column[1])
Allocate(column[2])
For i = 1…M
    If i > 1, then:
        Free(column[i-2])
        Allocate(column[i])
    For j = 1…N
        F(i,j) = …
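The column-recycling idea can be written out for the longest common subsequence recurrence. This is a small sketch with a hypothetical function name; it keeps two arrays instead of explicitly allocating and freeing columns.

```python
def lcs_length_linear_space(s, t):
    """Length of the longest common subsequence using two arrays of size len(t)+1."""
    prev = [0] * (len(t) + 1)          # F(i-1, .)
    for i in range(1, len(s) + 1):
        curr = [0] * (len(t) + 1)      # F(i, .)
        for j in range(1, len(t) + 1):
            bonus = 1 if s[i - 1] == t[j - 1] else 0
            curr[j] = max(prev[j], curr[j - 1], prev[j - 1] + bonus)
        prev = curr                    # the older array is no longer needed
    return prev[len(t)]
```

This yields only the score; recovering the alignment itself in linear space is what the divide and conquer construction is for.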
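To compute both the optimal score and the optimal alignment:
Divide & Conquer approach:
Notation:
- $x^r$, $y^r$: reverse of x, y. E.g. x = accgg; $x^r$ = ggcca
- $F^r(i,j)$: optimal score of aligning $x^r_1 \ldots x^r_i$ & $y^r_1 \ldots y^r_j$, same as F(M-i+1, N-j+1)
Lemma:
$$F(M,N) = \max_{k=0 \ldots N} \left( F(M/2, k) + F^r(M/2, N-k) \right)$$
[Figure: DP matrix for x against y split at M/2; the optimal path crosses at $k^*$, with F(M/2, k) on one side and $F^r(M/2, N-k)$ on the other]
Now, using 2 columns of space, we can compute, for k = 0…N, F(M/2, k) and $F^r(M/2, N-k)$, PLUS the backpointers
Now, we can find $k^*$ maximizing $F(M/2, k) + F^r(M/2, N-k)$
Also, we can trace the path exiting column M/2 from $k^*$
Iterate this procedure to the left and right!
[Figure: the matrix divided at M/2 and $k^*$ into a top-left sub-problem of width $k^*$ and a bottom-right sub-problem of width $N - k^*$]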
Hirschberg's Linear-space algorithm:
MEMALIGN(l, l', r, r'): (aligns $x_l \ldots x_{l'}$ with $y_r \ldots y_{r'}$)
1. Let h = ⌈(l'-l)/2⌉
2. Find, in Time O((l'-l) × (r'-r)), Space O(r'-r), the optimal path $L_h$, entering column h-1, exiting column h
   Let $k_1$ = pos'n at column h-2 where $L_h$ enters
       $k_2$ = pos'n at column h+1 where $L_h$ exits
3. MEMALIGN(l, h-2, r, $k_1$)
4. Output $L_h$
5. MEMALIGN(h+1, l', $k_2$, r')
Top level call: MEMALIGN(1, M, 1, N)
Time, Space analysis of Hirschberg's algorithm:
To compute optimal path at middle column, for box of size M × N:
- Space: 2N
- Time: cMN, for some constant c
Then, left, right calls cost c(M/2 × k* + M/2 × (N-k*)) = cMN/2
All recursive calls cost:
- Total Time: cMN + cMN/2 + cMN/4 + ... = 2cMN = O(MN)
- Total Space: O(N) for computation, O(N+M) to store the optimal alignment
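The divide and conquer scheme can be sketched for the longest common subsequence problem. This is a simplified illustration rather than the MEMALIGN procedure itself: it splits the first sequence in half, picks the split point k* of the second sequence from a forward and a reverse score row, and recurses, without tracking explicit backpointers. The function names are hypothetical.

```python
def lcs_last_row(s, t):
    """Last DP row of LCS lengths of s against every prefix of t (linear space)."""
    prev = [0] * (len(t) + 1)
    for ch in s:
        curr = [0]
        for j, tj in enumerate(t, start=1):
            curr.append(max(prev[j], curr[j - 1], prev[j - 1] + (ch == tj)))
        prev = curr
    return prev


def hirschberg_lcs(s, t):
    """Return one longest common subsequence of s and t."""
    if not s or not t:
        return ""
    if len(s) == 1:
        return s if s in t else ""
    mid = len(s) // 2
    forward = lcs_last_row(s[:mid], t)                # F(M/2, k) for all k
    backward = lcs_last_row(s[mid:][::-1], t[::-1])   # F^r(M/2, N-k) for all k
    n = len(t)
    # k* maximizes F(M/2, k) + F^r(M/2, N-k), as in the Lemma.
    k_star = max(range(n + 1), key=lambda k: forward[k] + backward[n - k])
    # Iterate the procedure to the left and right of the split.
    return hirschberg_lcs(s[:mid], t[:k_star]) + hirschberg_lcs(s[mid:], t[k_star:])
```

Each level of recursion does work proportional to the area of its sub-boxes, which halves at every level, matching the cMN + cMN/2 + cMN/4 + ... estimate above. The string slices make this sketch simple to read; an index-based version would avoid the copies.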
The Four-Russian Algorithm
A useful speedup of Dynamic Programming
Main Observation:
Within a rectangle of the DP matrix, values of D depend only on the values of A, B, C, and substrings $x_{l \ldots l'}$, $y_{r \ldots r'}$
Definition:
A t-block is a t × t square of the DP matrix
Idea:
Divide matrix in t-blocks, precompute t-blocks
Speedup: O(t)
[Figure: a t-block spanning $x_l \ldots x_{l'}$ and $y_r \ldots y_{r'}$, with regions labelled A, B, C, D]
Main structure of the algorithm:
- Divide the N × N DP matrix into K × K $\log_2 N$-blocks that overlap by 1 column & 1 row
- For i = 1…K
    For j = 1…K
      Compute $D_{i,j}$ as a function of $A_{i,j}$, $B_{i,j}$, $C_{i,j}$, $x[l_i \ldots l'_i]$, $y[r_j \ldots r'_j]$
Time: O(N^2 / log^2 N) times the cost of step 4
Another observation:
(Assume m = 0, s = 1, d = 1)
Lemma. Two adjacent cells of F(.,.) differ by at most 1
Gusfield's book covers the case where m = 0, called the edit distance (p. 216): minimum # of substitutions + gaps to transform one string to another
Proof of Lemma:
1. Same row:
   a. F(i,j) - F(i-1,j) ≤ +1
      At worst, one more gap: $x_1 \ldots x_{i-1}\ x_i$ aligned to $y_1 \ldots y_j$ followed by a gap
   b. F(i,j) - F(i-1,j) ≥ -1
      [Figure: case analysis comparing an optimal alignment for F(i,j) with the alignments of $x_1 \ldots x_{i-1}$ that it induces, depending on whether $x_i$ is aligned to some $y_a$ or to a gap]
2. Same column: similar argument
3. Same diagonal:
   a. F(i,j) - F(i-1,j-1) ≤ +1
      At worst, one additional mismatch in F(i,j)
   b. F(i,j) - F(i-1,j-1) ≥ -1
      [Figure: case analysis comparing alignments for F(i,j) and F(i-1,j-1)]
Definition:
The offset vector is a t-long vector of values from {-1, 0, 1}, where the first entry is 0
If we know the value at A, and the top row, left column offset vectors, and $x_l \ldots x_{l'}$, $y_r \ldots y_{r'}$, then we can find D
[Figure: t-block with regions labelled A, B, C, D]
Example:
x = AACT
y = CACT
5 6 5 5
6 5 5 4
5 6 5 5
4 5 6 5
[Figure: the block above with its offset vectors marked along the boundary rows and columns]
Example:
x = AACT
y = CACT
1 2 1 1
2 1 1 0
1 2 1 1
0 1 2 1
[Figure: the block above with its offset vectors marked along the boundary rows and columns]
Definition:
The offset function of a t-block is a function that, for any given offset vectors of top row, left column, and $x_l \ldots x_{l'}$, $y_r \ldots y_{r'}$, produces offset vectors of bottom row, right column
[Figure: t-block with regions labelled A, B, C, D]
We can pre-compute the offset function:
- $3^{2(t-1)}$ possible input offset vectors
- $4^{2t}$ possible strings $x_l \ldots x_{l'}$, $y_r \ldots y_{r'}$
Therefore $3^{2(t-1)} \times 4^{2t}$ values to pre-compute
We can keep all these values in a table, and look up in linear time, or in O(1) time if we assume constant-lookup RAM for log-sized inputs
Four-Russians Algorithm: (Arlazarov, Dinic, Kronrod, Faradzev)
1. Cover the DP table with t-blocks
2. Initialize values F(.,.) in first row & column
3. Row-by-row, use offset values at leftmost column and top row of each block, to find offset values at rightmost column and bottom row
4. Let Q = total of offsets at row N
   F(N,N) = Q + F(N,0)
Evolution at the DNA level
[Figure: sequence changes between ACGGTGCAGTCACCA and ACGTTGCAGTCCACCA; computing the best alignment in absence of gaps]
Sequence Alignment
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
Definition
Given two strings $x = x_1 x_2 \ldots x_M$, $y = y_1 y_2 \ldots y_N$, an alignment is an assignment of gaps to positions 0,…,M in x, and 0,…,N in y, so as to line up each letter in one sequence with either a letter, or a gap in the other sequence

Scoring Function
Sequence edits: AGGCCTC
- Mutations: AGGACTC
- Insertions: AGGGCCTC
- Deletions: AGG.CTC
Scoring Function:
- Match: +m
- Mismatch: -s
- Gap: -d
Score F = (# matches) × m - (# mismatches) × s - (# gaps) × d
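The scoring function can be applied to any given alignment by counting columns. A minimal sketch, assuming the two rows are equal-length strings with '-' marking gaps; the function name is illustrative.

```python
def alignment_score(row_x, row_y, m=1, s=1, d=1):
    """Score F = (#matches)*m - (#mismatches)*s - (#gaps)*d of a given alignment."""
    if len(row_x) != len(row_y):
        raise ValueError("aligned rows must have equal length")
    matches = mismatches = gaps = 0
    for a, b in zip(row_x, row_y):
        if a == "-" and b == "-":
            raise ValueError("a column cannot align a gap with a gap")
        if a == "-" or b == "-":
            gaps += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return matches * m - mismatches * s - gaps * d
```

With m = 1, s = 1, d = 2 this reproduces the arithmetic of the semi-global motivation example, where the two candidate alignments score -12 and -19.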
How do we compute the best alignment?
[Figure: DP grid with AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA along one axis and AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC along the other]
Too many possible alignments: O(2^{M+N})
Alignment is additive
Observation:
The score of aligning $x_1 \ldots x_M$ with $y_1 \ldots y_N$ is additive
Say that $x_1 \ldots x_i \; x_{i+1} \ldots x_M$ aligns to $y_1 \ldots y_j \; y_{j+1} \ldots y_N$
The two scores add up:
F(x[1:M], y[1:N]) = F(x[1:i], y[1:j]) + F(x[i+1:M], y[j+1:N])
Dynamic Programming
We will now describe a dynamic programming algorithm
Suppose we wish to align $x_1 \ldots x_M$ and $y_1 \ldots y_N$
Let F(i,j) = optimal score of aligning $x_1 \ldots x_i$ and $y_1 \ldots y_j$
Dynamic Programming (cont'd)
Notice three possible cases:
1. $x_i$ aligns to $y_j$
   $x_1 \ldots x_{i-1}\ x_i$
   $y_1 \ldots y_{j-1}\ y_j$
   F(i,j) = F(i-1,j-1) + (m, if $x_i = y_j$; -s, if not)
2. $x_i$ aligns to a gap
   $x_1 \ldots x_{i-1}\ x_i$
   $y_1 \ldots y_j\ -$
   F(i,j) = F(i-1,j) - d
3. $y_j$ aligns to a gap
   $x_1 \ldots x_i\ -$
   $y_1 \ldots y_{j-1}\ y_j$
   F(i,j) = F(i,j-1) - d
Dynamic Programming (cont'd)
How do we know which case is correct?
Inductive assumption: F(i,j-1), F(i-1,j), F(i-1,j-1) are optimal
Then,
$$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i, y_j) \\ F(i-1,j) - d \\ F(i,j-1) - d \end{cases}$$
where $s(x_i, y_j) = m$, if $x_i = y_j$; $-s$, if not
Example
x = AGTA
y = ATA
m = 1, s = 1, d = 1
F(i,j)   i=0    1    2    3    4
                A    G    T    A
j=0        0   -1   -2   -3   -4
  1  A    -1    1    0   -1   -2
  2  T    -2    0    0    1    0
  3  A    -3   -1   -1    0    2
Optimal Alignment: F(4,3) = 2
AGTA
A-TA
The Needleman-Wunsch Matrix
[Figure: DP matrix with $x_1 \ldots x_M$ along one axis and $y_1 \ldots y_N$ along the other]
Every nondecreasing path from (0,0) to (M,N) corresponds to an alignment of the two sequences
Can think of it as a divide-and-conquer algorithm
The Needleman-Wunsch Algorithm
1. Initialization.
   a. F(0,0) = 0
   b. F(0,j) = -j × d
   c. F(i,0) = -i × d
2. Main Iteration. Filling-in partial alignments
   a. For each i = 1…M
        For each j = 1…N
$$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i, y_j) & \text{[case 1]} \\ F(i-1,j) - d & \text{[case 2]} \\ F(i,j-1) - d & \text{[case 3]} \end{cases}$$
$$\mathrm{Ptr}(i,j) = \begin{cases} \text{DIAG}, & \text{if [case 1]} \\ \text{LEFT}, & \text{if [case 2]} \\ \text{UP}, & \text{if [case 3]} \end{cases}$$
3. Termination. F(M,N) is the optimal score, and from Ptr(M,N) can trace back optimal alignment
Performance
Time: O(NM)
Space: O(NM)
Later we will cover more efficient methods
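The three steps above, including the traceback, can be sketched as follows. This is an illustrative implementation, not the lecture's code: the function name and the tie-breaking order (DIAG before LEFT before UP) are assumptions, and the pointer labels follow the case numbering, with LEFT meaning that $x_i$ is aligned to a gap and UP meaning that $y_j$ is aligned to a gap.

```python
def needleman_wunsch(x, y, m=1, s=1, d=1):
    """Return (optimal score, aligned x, aligned y) for match +m, mismatch -s, gap -d."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    ptr = [[None] * (N + 1) for _ in range(M + 1)]
    # 1. Initialization
    for i in range(1, M + 1):
        F[i][0], ptr[i][0] = -i * d, "LEFT"
    for j in range(1, N + 1):
        F[0][j], ptr[0][j] = -j * d, "UP"
    # 2. Main iteration: record which case achieved the maximum
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            candidates = [
                (F[i - 1][j - 1] + sub, "DIAG"),  # case 1: x_i aligns to y_j
                (F[i - 1][j] - d, "LEFT"),        # case 2: x_i aligns to a gap
                (F[i][j - 1] - d, "UP"),          # case 3: y_j aligns to a gap
            ]
            F[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])
    # 3. Termination: trace back from Ptr(M, N)
    top, bottom = [], []
    i, j = M, N
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "DIAG":
            top.append(x[i - 1]); bottom.append(y[j - 1]); i -= 1; j -= 1
        elif move == "LEFT":
            top.append(x[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(y[j - 1]); j -= 1
    return F[M][N], "".join(reversed(top)), "".join(reversed(bottom))
```

On the earlier example, `needleman_wunsch("AGTA", "ATA")` returns the score 2 together with the alignment AGTA over A-TA.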
A variant of the basic algorithm:
Maybe it is OK to have an unlimited # of gaps in the beginning and end:
----------CTATCACCTGACCTCCAGGCCGATGCCCCTTCCGGC
GCGAGTTCATCTATCAC--GACCGC--GGTCG--------------
Then, we don't want to penalize gaps in the ends
[Figure: different types of overlaps]
The Overlap Detection variant
Changes:
1. Initialization
   For all i, j: F(i,0) = 0, F(0,j) = 0
2. Termination
$$F_{OPT} = \max \begin{cases} \max_i F(i, N) \\ \max_j F(M, j) \end{cases}$$
The local alignment problem
Given two strings $x = x_1 \ldots x_M$, $y = y_1 \ldots y_N$
Find substrings x', y' whose similarity (optimal global alignment value) is maximum
e.g. x = aaaacccccgggg
     y = cccgggaaccaacc
Why local alignment
- Genes are shuffled between genomes
- Portions of proteins (domains) are often conserved
[Image removed due to copyright restrictions.]
Cross-species genome similarity
- 98% of genes are conserved between any two mammals
- >70% average similarity in protein sequence
hum_a:GTTGACAATAGAGGGTCTGGCAGAGGCTC--------------------- @57331/400001
mus_a:GCTGACAATAGAGGGGCTGGCAGAGGCTC--------------------- @78560/400001
rat_a:GCTGACAATAGAGGGGCTGGCAGAGACTC--------------------- @112658/369938
fug_a:TTTGTTGATGGGGAGCGTGCATTAATTTCAGGCTATTGTTAACAGGCTCG@36008/68174
hum_a:CTGGCCGCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG@57381/400001
mus_a:CTGGCCCCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG@78610/400001
rat_a:CTGGCCCCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG@112708/369938
fug_a:TGGGCCGAGGTGTTGGATGGCCTGAGTGAAGCACGCGCTGTCAGCTGGCG@36058/68174
(atoh enhancer in human, mouse, rat, fugu fish)
hum_a:AGCGCACTCTCCTTTCAGGCAGCTCCCCGGGGAGCTGTGCGGCCACATTT@57431/400001
mus_a:AGCGCACTCG-CTTTCAGGCCGCTCCCCGGGGAGCTGAGCGGCCACATTT@78659/400001
rat_a:AGCGCACTCG-CTTTCAGGCCGCTCCCCGGGGAGCTGCGCGGCCACATTT@112757/369938
fug_a:AGCGCTCGCG------------------------AGTCCCTGCCGTGTCC@36084/68174
hum_a:AACACCATCATCACCCCTCCCCGGCCTCCTCAACCTCGGCCTCCTCCTCG@57481/400001
mus_a:AACACCGTCGTCA-CCCTCCCCGGCCTCCTCAACCTCGGCCTCCTCCTCG@78708/400001
rat_a:AACACCGTCGTCA-CCCTCCCCGGCCTCCTCAACCTCGGCCTCCTCCTCG@112806/369938
fug_a:CCGAGGACCCTGA------------------------------------- @36097/68174
The Smith-Waterman algorithm
Idea: Ignore badly aligning regions
Modifications to Needleman-Wunsch:
Initialization: F(0,j) = F(i,0) = 0
Iteration:
$$F(i,j) = \max \begin{cases} 0 \\ F(i-1,j) - d \\ F(i,j-1) - d \\ F(i-1,j-1) + s(x_i, y_j) \end{cases}$$
Termination:
1. If we want the best local alignment: $F_{OPT} = \max_{i,j} F(i,j)$
2. If we want all local alignments scoring > t: for all i, j find F(i,j) > t, and trace back
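Needleman-Wunsch with affine gaps
Initialization: F(i,0) = d + (i-1) × e
                F(0,j) = d + (j-1) × e
Iteration:
$$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i, y_j) \\ G(i-1,j-1) + s(x_i, y_j) \end{cases}$$
$$G(i,j) = \max \begin{cases} F(i-1,j) - d \\ F(i,j-1) - d \\ G(i,j-1) - e \\ G(i-1,j) - e \end{cases}$$
Termination: same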
Simple, linear gap model: a gap of length n incurs penalty $\gamma(n) = n \times d$
General gap function $\gamma(n)$: Algorithm: O(N^3) time, O(N^2) space
Why do we need two matrices?
1. $x_i$ aligns to $y_j$
   $x_1 \ldots x_{i-1}\ x_i\ x_{i+1}$
   $y_1 \ldots y_{j-1}\ y_j\ -$
   Add -d
2. $x_i$ aligns to a gap
   $x_1 \ldots x_{i-1}\ x_i\ x_{i+1}$
   $y_1 \ldots y_j\ -\ -$
   Add -e
To generalize a little…
…think of how you would compute optimal alignment with this gap function $\gamma(n)$
[Figure: plot of the gap function $\gamma(n)$]
…in time O(MN)
Assume we know that x and y are very similar
Assumption: # gaps(x, y) < k(N) (say N > M)
Then, $x_i$ aligned to $y_j$ implies |i - j| < k(N)
We can align x and y more efficiently:
Time, Space: O(N × k(N)) << O(N^2)
