
Pooja Anshul Saxena Engr 692: Special Topics Computational Biology

Global Sequence Alignment


The Needleman-Wunsch algorithm performs a global alignment of two sequences.
It is an example of dynamic programming, and was the first application of dynamic programming to biological sequence comparison.
It is suitable when the two sequences are of similar length, with a significant degree of similarity throughout.
Aim: to find the best alignment over the entire length of the two sequences.

Three steps in the Needleman-Wunsch Algorithm


Initialization
Scoring
Trace back (Alignment)

The two DNA sequences to be globally aligned are:
Sequence 1: ATCG (x = 4, length of sequence 1)
Sequence 2: TCG (y = 3, length of sequence 2)

Scoring Scheme
Match score = +1
Mismatch score = -1
Gap penalty = -1

Substitution Matrix

        A    C    G    T
   A    1   -1   -1   -1
   C   -1    1   -1   -1
   G   -1   -1    1   -1
   T   -1   -1   -1    1
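The scoring scheme above can be written as a small lookup table. This is a minimal sketch; the names `MATCH`, `MISMATCH`, `GAP`, `SUBST` and `S` are illustrative assumptions, and only the numeric scores come from the slide.

```python
# Scores taken from the scoring scheme; the names are illustrative assumptions.
MATCH, MISMATCH, GAP = 1, -1, -1
ALPHABET = "ACGT"

# Build the 4x4 substitution matrix as a dictionary keyed by letter pairs.
SUBST = {
    (a, b): (MATCH if a == b else MISMATCH)
    for a in ALPHABET
    for b in ALPHABET
}


def S(a, b):
    """Return the substitution score for aligning letter a with letter b."""
    return SUBST[(a, b)]


print(S("A", "A"))  # 1
print(S("A", "T"))  # -1
```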

Initialization Step
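Create a matrix with y + 1 rows and x + 1 columns, so that sequence 1 (ATCG) labels the columns and sequence 2 (TCG) labels the rows, as in the matrix shown here. The first row and the first column of the score matrix are filled with multiples of the gap penalty.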

              A    T    C    G
         0   -1   -2   -3   -4
   T    -1
   C    -2
   G    -3
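The initialization step can be sketched as follows. The function name `init_global` is a hypothetical helper, and the layout (sequence 1 across the columns, sequence 2 down the rows) is assumed from the matrix above.

```python
def init_global(seq1, seq2, gap=-1):
    """Return a score matrix with the first row and column set to multiples of gap.

    Assumed layout: seq1 labels the columns, seq2 labels the rows.
    Interior cells are left as None until the scoring step fills them.
    """
    rows, cols = len(seq2) + 1, len(seq1) + 1
    C = [[None] * cols for _ in range(rows)]
    for j in range(cols):
        C[0][j] = j * gap  # first row: 0, -1, -2, ...
    for i in range(rows):
        C[i][0] = i * gap  # first column: 0, -1, -2, ...
    return C


for row in init_global("ATCG", "TCG"):
    print(row)
```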

Scoring

The score of any cell C(i, j) is the maximum of:
scorediag = C(i-1, j-1) + S(i, j)
scoreup = C(i-1, j) + g
scoreleft = C(i, j-1) + g
where S(i, j) is the substitution score for the two letters aligned at cell (i, j), and g is the gap penalty.


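Example: The calculation for the cell C(2, 2), counting the initialized row and column as row 1 and column 1 (so this is the cell where T is aligned with A):
scorediag = C(i-1, j-1) + S(i, j) = 0 + (-1) = -1
scoreup = C(i-1, j) + g = -1 + (-1) = -2
scoreleft = C(i, j-1) + g = -1 + (-1) = -2
The maximum of the three is -1, so C(2, 2) = -1.

              A    T    C    G
         0   -1   -2   -3   -4
   T    -1   -1
   C    -2
   G    -3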
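The recurrence can be applied to every cell in row-major order. The sketch below assumes the same layout as the matrix above (sequence 1 across the columns, sequence 2 down the rows); `nw_fill` is a hypothetical name. The code uses 0-based indices, so the slide's C(2, 2) is `C[1][1]` here.

```python
def nw_fill(seq1, seq2, match=1, mismatch=-1, gap=-1):
    """Fill the global-alignment score matrix and return it as a list of rows."""
    rows, cols = len(seq2) + 1, len(seq1) + 1
    C = [[0] * cols for _ in range(rows)]
    # Initialization: multiples of the gap penalty along the top and left edges.
    for j in range(1, cols):
        C[0][j] = j * gap
    for i in range(1, rows):
        C[i][0] = i * gap
    # Scoring: each cell is the best of the diagonal, up and left moves.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq2[i - 1] == seq1[j - 1] else mismatch
            score_diag = C[i - 1][j - 1] + s
            score_up = C[i - 1][j] + gap
            score_left = C[i][j - 1] + gap
            C[i][j] = max(score_diag, score_up, score_left)
    return C


C = nw_fill("ATCG", "TCG")
print(C[1][1])   # -1, the worked example
print(C[-1][-1])  # 2, the score of the best global alignment
```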


Final Scoring Matrix


              A    T    C    G
         0   -1   -2   -3   -4
   T    -1   -1    0   -1   -2
   C    -2   -2   -1    1    0
   G    -3   -3   -2    0    2

Trace back
The trace back step determines the actual alignment(s) that result in the maximum score.
There may be multiple maximal alignments.
Trace back starts from the last cell, i.e. position X, Y in the matrix (the bottom-right cell).
It gives the alignment in reverse order.

There are three possible moves: diagonally (toward the top-left corner of the matrix), up, or left.
Trace back takes the current cell and looks at the neighbor cells that could be its direct predecessors: the neighbor to the left (gap in sequence #2), the diagonal neighbor (match/mismatch), and the neighbor above it (gap in sequence #1).
The trace back algorithm then chooses one of the possible predecessors as the next cell on the path.

              A    T    C    G
         0   -1   -2   -3   -4
   T    -1   -1    0   -1   -2
   C    -2   -2   -1    1    0
   G    -3   -3   -2    0    2

The only possible predecessor is the diagonal match/mismatch neighbor. If more than one possible predecessor exists, any can be chosen. This gives us a current alignment of:

Seq 1: G
       |
Seq 2: G
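Whether a neighbor is a possible predecessor can be checked by asking which of the three moves reproduces the current cell's score. The sketch below uses the final scoring matrix from the slides as a literal; `predecessors` is a hypothetical helper name, and the layout (sequence 1 across the columns, sequence 2 down the rows, 0-based indices) is an assumption.

```python
def predecessors(C, seq1, seq2, i, j, match=1, mismatch=-1, gap=-1):
    """List the moves ("diag", "up", "left") that could have produced C[i][j]."""
    moves = []
    if i > 0 and j > 0:
        s = match if seq2[i - 1] == seq1[j - 1] else mismatch
        if C[i][j] == C[i - 1][j - 1] + s:
            moves.append("diag")  # match/mismatch
    if i > 0 and C[i][j] == C[i - 1][j] + gap:
        moves.append("up")        # gap in sequence 1
    if j > 0 and C[i][j] == C[i][j - 1] + gap:
        moves.append("left")      # gap in sequence 2
    return moves


# Final scoring matrix for ATCG (columns) against TCG (rows).
C = [
    [0, -1, -2, -3, -4],
    [-1, -1, 0, -1, -2],
    [-2, -2, -1, 1, 0],
    [-3, -3, -2, 0, 2],
]
print(predecessors(C, "ATCG", "TCG", 3, 4))  # ['diag']
print(predecessors(C, "ATCG", "TCG", 2, 1))  # ['diag', 'up']
```

The second call shows a cell with two possible predecessors, which is where multiple maximal alignments can arise.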


Final Trace back


Cells on the trace back path are shown in brackets.

              A    T    C    G
       [0] [-1]   -2   -3   -4
   T    -1   -1  [0]   -1   -2
   C    -2   -2   -1  [1]    0
   G    -3   -3   -2    0  [2]

Best Alignment:

Seq 1: ATCG
        |||
Seq 2: _TCG
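The three steps can be put together in one function. This is a minimal sketch rather than a reference implementation: `needleman_wunsch` is a hypothetical name, ties are broken in the arbitrary order diagonal, up, left (so only one of possibly several maximal alignments is returned), and gaps are written as "_" to match the slide's notation.

```python
def needleman_wunsch(seq1, seq2, match=1, mismatch=-1, gap=-1):
    """Return (aligned seq1, aligned seq2, score) for one best global alignment."""
    rows, cols = len(seq2) + 1, len(seq1) + 1
    C = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):
        C[0][j] = j * gap
    for i in range(1, rows):
        C[i][0] = i * gap
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq2[i - 1] == seq1[j - 1] else mismatch
            C[i][j] = max(C[i - 1][j - 1] + s, C[i - 1][j] + gap, C[i][j - 1] + gap)

    # Trace back from the bottom-right cell; the alignment is built in reverse.
    i, j = rows - 1, cols - 1
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if seq2[i - 1] == seq1[j - 1] else mismatch
            if C[i][j] == C[i - 1][j - 1] + s:  # diagonal: match/mismatch
                top.append(seq1[j - 1])
                bottom.append(seq2[i - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and C[i][j] == C[i - 1][j] + gap:  # up: gap in sequence 1
            top.append("_")
            bottom.append(seq2[i - 1])
            i -= 1
        else:  # left: gap in sequence 2
            top.append(seq1[j - 1])
            bottom.append("_")
            j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom)), C[-1][-1]


print(needleman_wunsch("ATCG", "TCG"))  # ('ATCG', '_TCG', 2)
```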

Local Sequence Alignment


The Smith-Waterman algorithm performs a local alignment of two sequences.
It is an example of dynamic programming.
It is useful for dissimilar sequences that are suspected to contain regions of similarity or similar sequence motifs within their larger sequence context.
Aim: to find the best alignment over the conserved domain of the two sequences.

Differences between the Needleman-Wunsch and Smith-Waterman Algorithms:

In the initialization stage, the first row and first column are all filled with 0s.
While filling the matrix, if a score becomes negative, put in 0 instead.
In the trace back, start with the cell that has the highest score and work back until a cell with a score of 0 is reached.

Three steps in Smith-Waterman Algorithm


Initialization
Scoring
Trace back (Alignment)

The two DNA sequences to be locally aligned are:
Sequence 1: ATCG (x = 4, length of sequence 1)
Sequence 2: TCG (y = 3, length of sequence 2)

Scoring Scheme
Match score = +1
Mismatch score = -1
Gap penalty = -1

Substitution Matrix

        A    C    G    T
   A    1   -1   -1   -1
   C   -1    1   -1   -1
   G   -1   -1    1   -1
   T   -1   -1   -1    1

Initialization Step
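Create a matrix with y + 1 rows and x + 1 columns, so that sequence 1 (ATCG) labels the columns and sequence 2 (TCG) labels the rows, as in the matrix shown here. The first row and the first column of the score matrix are filled with 0s.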

              A    T    C    G
         0    0    0    0    0
   T     0
   C     0
   G     0

Scoring
The score of any cell C(i, j) is the maximum of:
scorediag = C(i-1, j-1) + S(i, j)
scoreup = C(i-1, j) + g
scoreleft = C(i, j-1) + g
and 0
(here S(i, j) is the substitution score for the two letters aligned at cell (i, j), and g is the gap penalty).


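Example: The calculation for the cell C(2, 2), counting the initialized row and column as row 1 and column 1 (so this is the cell where T is aligned with A):
scorediag = C(i-1, j-1) + S(i, j) = 0 + (-1) = -1
scoreup = C(i-1, j) + g = 0 + (-1) = -1
scoreleft = C(i, j-1) + g = 0 + (-1) = -1
All three candidates are negative, so the cell is set to 0: C(2, 2) = 0.

              A    T    C    G
         0    0    0    0    0
   T     0    0
   C     0
   G     0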
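The local-alignment recurrence differs from the global one only in the zero initialization and the extra 0 candidate. A minimal sketch, assuming the same layout as the matrix above (sequence 1 across the columns, sequence 2 down the rows); `sw_fill` is a hypothetical name and indices are 0-based, so the slide's C(2, 2) is `H[1][1]`.

```python
def sw_fill(seq1, seq2, match=1, mismatch=-1, gap=-1):
    """Fill the local-alignment score matrix and return it as a list of rows."""
    rows, cols = len(seq2) + 1, len(seq1) + 1
    # Initialization: the first row and column stay at 0.
    H = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq2[i - 1] == seq1[j - 1] else mismatch
            score_diag = H[i - 1][j - 1] + s
            score_up = H[i - 1][j] + gap
            score_left = H[i][j - 1] + gap
            # Negative scores are replaced by 0.
            H[i][j] = max(0, score_diag, score_up, score_left)
    return H


for row in sw_fill("ATCG", "TCG"):
    print(row)
```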


Final Scoring Matrix


              A    T    C    G
         0    0    0    0    0
   T     0    0    1    0    0
   C     0    0    0    2    1
   G     0    0    0    1    3

Note: The last cell does not necessarily hold the maximum alignment score!
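Because the maximum can sit anywhere, the start of the trace back has to be found by scanning the whole matrix. A minimal sketch using the final scoring matrix from the slides as a literal; `best_cell` is a hypothetical helper, and on ties it keeps the first maximum in row-major order, which is an arbitrary choice.

```python
def best_cell(H):
    """Return (row, column) of the highest score, using 0-based indices."""
    best = (0, 0)
    for i, row in enumerate(H):
        for j, value in enumerate(row):
            if value > H[best[0]][best[1]]:
                best = (i, j)
    return best


# Final local-alignment matrix for ATCG (columns) against TCG (rows).
H = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 2, 1],
    [0, 0, 0, 1, 3],
]
print(best_cell(H))  # (3, 4)
```

In this example the maximum happens to be in the last cell, but the scan does not rely on that.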

Trace back
The trace back step determines the actual alignment(s) that result in the maximum score.
There may be multiple maximal alignments.
Trace back starts from the cell with the maximum value in the matrix.
It gives the alignment in reverse order.


There are three possible moves: diagonally (toward the top-left corner of the matrix), up, or left.
Trace back takes the current cell and looks at the neighbor cells that could be its direct predecessors: the neighbor to the left (gap in sequence #2), the diagonal neighbor (match/mismatch), and the neighbor above it (gap in sequence #1).
The trace back algorithm then chooses one of the possible predecessors as the next cell on the path. This continues until a cell with value 0 is reached.

              A    T    C    G
         0    0    0    0    0
   T     0    0    1    0    0
   C     0    0    0    2    1
   G     0    0    0    1    3

The only possible predecessor is the diagonal match/mismatch neighbor. If more than one possible predecessor exists, any can be chosen. This gives us a current alignment of:

Seq 1: G
       |
Seq 2: G


Final Trace back


Cells on the trace back path are shown in brackets.

              A    T    C    G
         0  [0]    0    0    0
   T     0    0  [1]    0    0
   C     0    0    0  [2]    1
   G     0    0    0    1  [3]

Best Alignment:

Seq 1: TCG
       |||
Seq 2: TCG
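The whole local alignment can be sketched in one function. This is an illustration under stated assumptions, not a reference implementation: `smith_waterman` is a hypothetical name, the first maximum in row-major order is used as the starting cell, ties between predecessors are broken in the order diagonal, up, left, and gaps are written as "_" to match the slide's notation.

```python
def smith_waterman(seq1, seq2, match=1, mismatch=-1, gap=-1):
    """Return (aligned part of seq1, aligned part of seq2, score) for one best local alignment."""
    rows, cols = len(seq2) + 1, len(seq1) + 1
    H = [[0] * cols for _ in range(rows)]
    best_i, best_j = 0, 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq2[i - 1] == seq1[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > H[best_i][best_j]:
                best_i, best_j = i, j

    # Trace back from the highest-scoring cell until a 0 is reached.
    i, j = best_i, best_j
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if seq2[i - 1] == seq1[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:  # diagonal: match/mismatch
            top.append(seq1[j - 1])
            bottom.append(seq2[i - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:  # up: gap in sequence 1
            top.append("_")
            bottom.append(seq2[i - 1])
            i -= 1
        else:  # left: gap in sequence 2
            top.append(seq1[j - 1])
            bottom.append("_")
            j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom)), H[best_i][best_j]


print(smith_waterman("ATCG", "TCG"))  # ('TCG', 'TCG', 3)
```

Unlike the global alignment of the same two sequences, the leading A of sequence 1 is simply left out instead of being paired with a gap.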