You are on page 1of 36

Sequence

Analysis
Sequence analysis

Pairwise Alignment:
¾Dot Matrix
¾Dynamic Programming
¾Heuristic Methods:
FASTA
BLAST
Pairwise Sequence Alignment

• The process of comparing two sequences by


searching for a series of individual characters
that are in the same order in the sequences
Terms used to indicate sequence similaity

Term used Description of the term


Homologous Characters similar due to common ancestry

Analogous Characters similar due to convergent evolution

Orthologous Charecters homologous with conserved function

Paralogous Characters homologous with divergent function

Xenologous Sequence identity due to horizontal transfer


Dot matrix

• Useful for simple alignments


• Uses graphical methods
• Easy to understand and apply
• Disadvantage: Does not produce an
optimal alignment
Dot matrix

• Constructed to visually compare two


sequences and detect regions of close
similarity between them
• Similarity between two sequences can be
detected as a diagonal on an identity
matrix
• Word size can be fixed to filter the best dot
plot
Dynamic programming

• A little bit challenging to understand when


compared to Dot matrix
• Produces optical alignment using both
Global and Local alignments
• Disadvantage: Needs skill to produce the
most optical alignment
Scoring Matrices

• The alignment that gives the highest


similarity score is the optimal alignment
• A scoring method used in the alignment of
one residue against another
• The number of alignments that must be
checked to identify the optimal alignment
increases exponentially with the lengths of
the sequences
SCORING MATRICES for Proteins

• PAM (Percent Accepted Mutations)


• BLOSUM (BLOcks SUbstitution Matrix)
PAM

• The first substitution matrix developed for comparision of


protein sequences
• Also called as Dayhoff, MDM or PAM matrices
• Developed by Margaret Dayhoff and co-workers
• Developed from global alignments of closely related
sequences
• Number accompanying PAM refers to evolutionary
distance between the sequences
• Larger number – greater evolutionary distance
Table showing the PAM matrices and the corresponding
percent identities

Matrix PAM0 PAM30 PAM80 PAM110 PAM200 PAM250

% Identity 100 75 50 60 25 20
BLOSUM

• Developed by Steve Henikoff and co-workers


• Based on local alignments of distantly related sequences
• Small number - greater evolutionary distance
Global Alignment

• Is done across the entire sequence length


to include as many matches as possible
upto and including the sequence end
• Needleman-Wunsh Algorithm
• Example of programme using global
alignment by implementing Needleman-
Wunsh algorithm: Needle tool of EMBOSS
Local Alignment

• Finds out the pair of regions, one from


each sequence that exhibit high similarity
• Smith-Waterman Algorithm
• Example of programme using global
alignment by implementing Smith-
Waterman algorithm: Water tool of
EMBOSS
Heuristics methods

• Fast and effective but approximate


methods
• BLAST and FASTA
FASTA

• It is a programme written by W.R.Pearson


and D.J.Lipman
• Sensitive and rapid algorithm
• EMBL and DDBJ use FASTA for whole
database search
• Disadvantage: Completely ignores more
more distantly related sequences that
have functional homology but no longer
retain complete identity
Flavours of FASTA

TYPE of Query sequence Library/ Database scanned


fasta
fasta Protein or DNA Protein or DNA
tfasta Protein Translated DNA
lfasta Compares two sequences for local similarity and shows local sequence
alignments
plfasta Compares two sequences for local similarity and plots the local
sequence alignments

fastx Tanslated DNA Protein


(all the six frames)
fasty
FASTA

Considers exact matches for a given set of


sequences

Uses DP algorithm to compute optimal


alignment out of the exact matches
BLAST (Basic Local Alignment Search Tool)

• Karlin and Altschul


• Multiple hits to the same sequence is
allowed in order to find out patches of
regional similarity
BLAST

Program Database Query

BLASTN Nucleotide Nucleotide

BLASTP Protein Protein


Nucleotide translated
BLASTX Protein
into protein
Nucleotide translated into
TBLASTN Protein
protein
Nucleotide translated into Nucleotide translated
TBLASTX
protein into protein
BLAST

Looks for short subsequences that are likely


to have significant matches

Tries to extend these subsequences in order


to obtain maximum sequence similarity
Steps in alignment process

1. Matrix construction.
2. Matrix filling.
3. Back tracing.
4. Alignment plotting.
The two example sequences taken are:

5’ GAATTCAGTTA 3’
5’ CCATCGG 3’
Matrix construction and filling:
The matrix is constructed as in the slides that follow, the
number of rows and columns taken depends on the number of
bases present in both the sequences to be aligned. Once the
matrix is constructed, it is filled based on some scoring rules and
formulas as follows:

Formula for determining the score of a cell (for global) :


Si,j = Max [(Si-1,j-1 + W), (Si-1,j + W), (Si,j-1 + W)] (or)
Si,j = Max [(Si-1,j-i), (Si-1,j), (Si,j-1)] + W
Where,
Si,j = Score of the cell represented by ith column and jth row
i = row in which the cell of consideration lies
j = column in which the cell of consideration lies
W = weight given to the substitution of bases represented by the cell of
consideration
Scoring rules used:
The weights given in this example for substitution or indel are the following :-
Match = 2 Si,j = Max [Si-1,j-1, Si-1,j, Si,j-1] + W
Mismatch = -1
Gap = -2

S2,2 = Max [S1,1, S1,2, S2,1] + W

For example, the calculation of score in the case of the cell S2,2:
S1,1 = 0 S1,2 = -2 S2,1 = -2 and W (GQC) = -1
Max of all the three is 0. So, 0 + -1 = -1
So, this score is given to the cell and arrow is placed from the diagonal
cell from which the value is obtained, it can be to more than one cell also.
row (i)
G A A T T C A G T T A
col 1 2 3 4 5 6 7 8 9 10 11 12
(j) 0 -2 -4 -6 -8 -10 -12 -14 -16 -18 -20 -22
2
C -2 -1
3
C -4
4
A -6
5
T -8
6
C -10
7
G -12
8
G -14
Similarly all the cells are assigned scores and arrows are placed accordingly.
Finally when the entire matrix is filled, the back tracing is done.

The two differences between local and global alignment are:


™ Scoring: In case of global alignment whatever score we get from the
formula, we assign it to the cell. Where as in case of local alignment the cell
score is equaled to zero if the calculated maximum score is less than zero.
So, the formula for calculating score of a cell for local alignment is written as:
Si,j = Max [(Si-1,j-1 + W), (Si-1,j + W), (Si,j-1 + W), 0]
™ Back tracing:
In case of global it starts from the bottom right and ends at the top
left. Where as in case of local alignment, back tracing is done starting from
the highest score in the matrix to the lowest score other than zero, and if
more than one traces are there the longest one is chosen to plot the optimal
local alignment.
While back tracing, it is a must that trace should be always from the
cell of consideration to only those cells from which it’s score is attained.
The final filled matrix for global
alignment (including the optimal
alignment trace in red) obtained for the
example sequences taken is . . . .
row (i)
G A A T T C A G T T A
col 1 2 3 4 5 6 7 8 9 10 11 12
(j) 0 -2 -4 -6 -8 -10 -12 -14 -16 -18 -20 -22
2
C -2 -1 -2 -3 -4 -5 -3 -4 -5 -6 -7 -8
3
C -4 -2 -2 -3 -4 -5 -3 -4 -5 -6 -7 -8
4
A -6 -3 0 2 1 0 -1 1 0 -1 -2 0
5
T -8 -4 -1 1 4 6 5 4 3 5 7 6
6
C -10 -5 -2 0 3 5 8 7 6 5 4 6
7
G -12 -3 -3 -1 2 4 7 7 9 8 7 6
8
G -14 -1 -2 -2 1 3 6 6 11 10 9 8
If a score is followed by a diagonal, then there will be no gap. The
bases represented by the cell can be aligned side by side in such case.
If a score is followed by a horizontal or vertical arrow, a gap is inserted
in the sequence towards which the arrow is pointing.

CAUTION:
Gap must be inserted only when we come across a horizontal or
vertical pointing arrow following a score, but not for a diagonal arrow
following it. A diagonal arrow following a score represents an
acceptable match, which can be either match or mismatch.
Properly check where a gap must be inserted (the sequence towards
which the arrow points should get a gap).
From the previous slide, there can be a total of four OPTIMAL GLOBAL
ALIGNMENTS possible. These are:

GA A T T C A GT T A
C C ATCG G

GA A T T C A GT T A
C C ATC GG

G AA T T C A GT T A
C C ATCG G

G AA T T C A GT T A
C C ATC GG
The final filled matrix for local alignment
(including the optimal alignment trace
in red) obtained for the example
sequences taken is . . . .
row (i)
G A A T T C A G T T A
col 1 2 3 4 5 6 7 8 9 10 11 12
(j) 0 0 0 0 0 0 0 0 0 0 0 0
2
C 0 0 0 0 0 0 2 1 0 0 0 0
3
C 0 0 0 0 0 0 4 3 2 1 0 0
4
A 0 0 2 4 3 2 3 6 5 4 3 5
5
T 0 0 1 3 6 8 7 6 5 7 9 8
6
C 0 0 0 2 5 7 10 9 8 7 8 8
7
G 0 2 1 1 4 6 9 9 11 10 9 8
8
G 0 4 3 2 3 5 8 8 13 12 11 10
From the previous slide, there can be a total of two OPTIMAL LOCAL
ALIGNMENTS possible. These are:

A A T T C A G A A T T C A G
A TC G G A TC GG
Distance

• Distance between two sequences is the


minimal sum of weights for a set of
mutations transforming one into the other
• Similarity is useful for database searching
• Distance measures are useful for
phylogenetic tree construction

You might also like