You are on page 1of 28

Bioinformatics

Lecture -04

Nadia Afrin Ritu


Lecturer, Computer Science and Engineering
Chapter 2: The Needleman and Wunsch Algorithm

The Needleman and Wunsch Algorithm

Global Alignment

✓Dynamic Programming
Problems Sub-problems Optimal solution
How does dynamic programming work?

A divide-and-conquer strategy:
• Break the problem into smaller subproblems.
• Solve the smaller problems optimally.
• Use the sub-problem solutions to construct an optimal solution for
the original problem.
Dynamic Programming vs exhaustive search
✓ In Exhaustive search, all possible alignments is generally not feasible. As the
lengths of the sequences grow, the number of possible alignments to search
quickly becomes intractable or impossible to compute in a reasonable amount of
time. We can overcome this problem using dynamic programming.
✓Dynamic programming is used for optimal alignment of two sequences.
✓ It finds the alignment in a more quantitative way by giving some scores for
matches and mismatches (Scoring matrices), rather than only applying dots. By
searching the highest scores in the matrix, alignment can be accurately obtained.
✓ The Dynamic Programming solves the original problem by dividing the problem
into smaller independent sub problems.
✓ Needleman-Wunsch and Smith-Waterman algorithms for sequence alignment are
defined by dynamic programming approach.
Scoring matrices

-In optimal alignment procedures, mostly Needleman-Wunsch and Smith-Waterman


algorithms use scoring system.
- For nucleotide sequence alignment, the scoring matrices used are relatively simpler
since the frequency of mutation for all the bases are equal.
- Positive or higher value is assigned for a match and a negative or a lower value is
assigned for mismatch. These assumption based scores can be used for scoring the
matrices.
Scoring matrices
Mainly used predefined matrices are PAM and BLOSUM.

• PAM Matrices: Margaret Dayhoff was the first one to develop the PAM matrix,
PAM stands for Point Accepted Mutations. PAM matrices are calculated by
observing the differences in closely related proteins. One PAM unit (PAM1)
specifies one accepted point mutation per 100 amino acid residues, i.e. 1% change
and 99% remains as such.

• BLOSUM: Blocks Substitution Matrix, developed by Henikoff and Henikoff in


1992, used conserved regions. These matrices are actual percentage identity
values. Simply to say, they depend on similarity. Blosum 62 means there is 62 %
similarity.
The Needleman and Wunsch Algorithm
Scoring Scheme:
+1 for every match
-1 for mismatch
-10 for gap penalty
The Needleman and Wunsch Algorithm

The Needleman-Wunsch algorithm consists of three steps:

1. Create 2D matrix

2. Trace Back

3. Final Alignment

© Bioinformatics| Nadia Afrin Ritu


Continued

© Bioinformatics| Nadia Afrin Ritu


Consider the simple example
• The alignment of two sequences SEND and AND. Find the best
alignment between two sequences using Needleman and Wunsch
Algorithm.
The score and traceback matrices
Initialization
Scoring
Continued
The Needleman-Wunsch progression
The value of C(2, 2)
Filling the score and traceback matrices
The progression is recursive
The final score and traceback matrices
The traceback
• Traceback = the process of deduction of the best alignment from the
traceback matrix.
• The traceback always begins with the last cell to be filled with the
score, i.e. the bottom right cell.
• One moves according to the traceback value written in the cell. There
are three possible moves: diagonally (toward the top-left corner of
the matrix), up, or left.
• The traceback is completed when the first, top-left cell of the matrix
is reached (”done” cell).
The traceback path
The traceback step-by-step (1)
• The first cell from the traceback path is ”diag” implying that the
corresponding letters are aligned:
D
D
The traceback step-by-step (2)
• The second cell from the traceback path is also ”diag” implying that
the corresponding letters are aligned:
ND
ND
The traceback step-by-step (3)
• The third cell from the traceback path is ”left” implying the gap in the
left sequence (i.e. we stay on the letter A from the left sequence):
END
-ND
The traceback step-by-step (4)
• The fourth cell from the traceback path is also ”diag” implying that
the corresponding letters are aligned. We consider the letter A again,
this time it is aligned with S:
SEND
A-ND
Compare with the exhaustive search
• The best alignment via the Needleman-Wunsch algorithm:
SEND
A-ND
-The exhaustive search:
SEND
-AND score: +1
A-ND score: +3 ← the best
AN-D score: -3
AND- score: -8
Summary
• The alignment is over the entire length of two sequences: the
traceback starts from the lower right corner of the traceback matrix,
and completes in the upper left cell of the matrix.
• The Needleman-Wunsch algorithm works in the same way regardless
of the length or complexity of sequences, and guarantees to find the
best alignment.
• The Needleman-Wunsch algorithm is appropriate for finding the best
alignment of two sequences which are (i) of the similar length; (ii)
similar across their entire lengths.
Continued

You might also like