L3.4 Alignment

Sequence Alignment
What is alignment?
• a sequence alignment is a way of arranging the sequences of
DNA, RNA, or protein to identify regions of similarity
Example sequences: ATGCATGC, TGCATGCA, GAATTCAG, GGATCG
Example alignment (TGCATGCA shifted right by one position):
ATGCATGC-
-TGCATGCA
Dr. Md. Khademul Islam

Sequence Alignment
• Match: same (or similar) letter in both rows

• Mismatch: different letters in both rows
• Insertion: the letter opposite to a space
• Deletion: the space opposite to a letter
• Indel: a column containing a space
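
The column types above can be illustrated with a short Python sketch. It assumes that the space character is written as "-" and that a match means identical letters only; the function name `classify_columns` is hypothetical and used only for illustration.

```python
def classify_columns(top, bottom, gap="-"):
    """Label each column of a pairwise alignment as match, mismatch, or indel."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    labels = []
    for a, b in zip(top, bottom):
        if a == gap or b == gap:
            labels.append("indel")      # a column containing a space
        elif a == b:
            labels.append("match")      # same letter in both rows
        else:
            labels.append("mismatch")   # different letters in both rows
    return labels


print(classify_columns("ATG-CA", "A-GTCC"))
```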

Sequence Alignment
Why align sequences?

• Useful for discovering
-- structure
-- function
-- evolutionary relationship
• For example:
-- two proteins with similar sequences will probably be
structurally or functionally similar.
-- to find whether two (or more) genes or proteins are
evolutionarily related to each other

Sequence Alignment
• Types of alignment by number of sequences:
(i) Pair-wise
(ii) Multiple
• Types of alignment by extent:
(i) Global
(ii) Local

Sequence Alignment
• Global
— attempt to align every residue in every sequence,
— most useful when the sequences in the query set are similar and of
roughly equal size. (This does not mean global alignments cannot end in
gaps.)
— A general global alignment technique is the Needleman–Wunsch
algorithm.
• Local
— Gathers islands of matches
— Stretches of sequences with highest density of matches are aligned
— more useful for dissimilar sequences that are suspected to contain
regions of similarity or similar sequence motifs within their larger
sequence context.
— The Smith–Waterman algorithm is a general local alignment method

Sequence Alignment
• What is an optimal alignment?

Sequence Alignment
• Score matrix & Cost Function?
• The total score assigned to an alignment is the
— sum of terms for matches and mismatches,
— plus the sum of terms for gaps (i.e., indels).
• Formally we can treat the space character just like any other.
Then we obtain the following additive scoring scheme:
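• The score of an alignment (x′, y′) of length L is
$$\mathrm{score}(x', y') = \sum_{i=1}^{L} \delta(x'_i, y'_i)$$
• Where: δ is a score matrix (scoring scheme) that assigns a value to
every pair of aligned characters, including the space character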

Sequence Alignment
• So how should we choose the score matrix δ?
• Hamming distance
• In the simplest case, we do not allow any gaps at all and charge
all mismatches at unit cost. This is called the Hamming distance.
• Edit distance
• If we charge all insertions, deletions, and mismatches at unit
cost, we obtain the Levenshtein distance. This measure is also
called the edit distance, because it counts the number of
elementary edit operations needed to transform one sequence
into the other.
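
Both distances can be computed in a few lines. The following is a minimal Python sketch, not a reference implementation; the function names are chosen here for illustration, and the Levenshtein version uses the standard row-by-row dynamic programming table with unit costs.

```python
def hamming(x, y):
    """Number of mismatching positions; no gaps allowed, so lengths must agree."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(a != b for a, b in zip(x, y))


def levenshtein(x, y):
    """Edit distance with unit cost for insertion, deletion, and mismatch."""
    prev = list(range(len(y) + 1))          # distances from the empty prefix of x
    for i, a in enumerate(x, start=1):
        curr = [i]
        for j, b in enumerate(y, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (a != b)))  # match or mismatch
        prev = curr
    return prev[-1]


print(hamming("ATGCATGC", "TGCATGCA"))      # 8: every position differs
print(levenshtein("ATGCATGC", "TGCATGCA"))  # 2: delete the first A, append an A
```

The pair ATGCATGC / TGCATGCA shows why gaps matter: without gaps every position mismatches, while two edit operations are enough once insertions and deletions are allowed.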

Sequence Alignment
• Shall we use the same penalty for all gaps?
• Linear Gap penalty
Edit distance: 15
• Affine Gap penalty

• The cost of opening a gap differs from the cost of extending it
• Higher cost for gap open, but lower cost for extending it.
• e.g. open: -5, extension: -1, match: +1, mismatch: -1
Score (26 matches, four separate gaps): 26-5-7-5-1-5-3-5 = -5
Score (26 matches, one long gap): 26-5-14 = +7
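
The two scores can be reproduced with a small Python sketch. It assumes the convention implied by the arithmetic above: the first position of a gap costs the open penalty and every further position costs the extension penalty, so a gap of length k costs 5 + (k - 1). The function name `affine_score` and the toy sequences are hypothetical; they are built only to contain 26 matches and 15 gap positions.

```python
def affine_score(x, y, match=1, mismatch=-1, gap_open=-5, gap_extend=-1):
    """Score two aligned rows ('-' marks a gap) under an affine gap penalty."""
    if len(x) != len(y):
        raise ValueError("aligned rows must have the same length")
    score = 0
    gap_row = None                      # row holding the gap in the previous column
    for a, b in zip(x, y):
        if a == "-" and b == "-":
            raise ValueError("a column cannot contain two gaps")
        if a == "-" or b == "-":
            row = "x" if a == "-" else "y"
            # Continuing a gap in the same row is cheaper than opening a new one.
            score += gap_extend if gap_row == row else gap_open
            gap_row = row
        else:
            score += match if a == b else mismatch
            gap_row = None
    return score


# 26 matches with gaps of length 8, 2, 4, and 1 (15 gap positions in total).
many_x = "A" * 5 + "-" * 8 + "A" * 5 + "-" * 2 + "A" * 5 + "-" * 4 + "A" * 11 + "-"
many_y = "A" * 5 + "C" * 8 + "A" * 5 + "C" * 2 + "A" * 5 + "C" * 4 + "A" * 11 + "C"
# 26 matches with one gap of length 15.
one_x = "A" * 26 + "-" * 15
one_y = "A" * 26 + "C" * 15

print(affine_score(many_x, many_y))  # -5
print(affine_score(one_x, one_y))    # 7
```

With a linear penalty both alignments would score the same, since both contain 15 gap positions; the affine scheme prefers the single long gap.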
Homology vs. Similarity
• When two sequences are descended from a common evolutionary
origin, they are said to have a homologous relationship or share
homology.
• A related but different term is sequence similarity, which is the
percentage of aligned residues that are similar in physicochemical
properties such as size, charge, and hydrophobicity.

Similarity vs. Identity
• Sequence similarity and sequence identity are synonymous for
nucleotide sequences. For protein sequences, however, the two
concepts are very different.
• In a protein sequence alignment, sequence identity refers to the
percentage of matches of the same amino acid residues
between two aligned sequences.
• Similarity refers to the percentage of aligned residues that have
similar physicochemical characteristics and can be more readily
substituted for each other
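
The difference between the two percentages can be shown with a minimal Python sketch. The similarity groups used here (aromatic F/Y/W, basic R/K/H, acid and amide D/E/N/Q) are an illustrative assumption, not a standard definition, and the choice to count only columns without gaps in the denominator is also an assumption; real tools usually decide similarity from a substitution matrix.

```python
# Illustrative groups of residues treated as "similar" (an assumption).
SIMILAR_GROUPS = [set("FYW"), set("RKH"), set("DENQ")]


def identity_and_similarity(a, b, gap="-"):
    """Return (percent identity, percent similarity) over gap-free columns."""
    columns = [(p, q) for p, q in zip(a, b) if p != gap and q != gap]
    if not columns:
        raise ValueError("no aligned residue pairs to compare")
    identical = sum(p == q for p, q in columns)
    similar = sum(
        p == q or any(p in group and q in group for group in SIMILAR_GROUPS)
        for p, q in columns
    )
    return 100.0 * identical / len(columns), 100.0 * similar / len(columns)


# F/Y and D/E are similar but not identical; K and A are identical.
print(identity_and_similarity("FKDA", "YKEA"))  # (50.0, 100.0)
```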

Similarity vs. Identity

Pair-wise Sequence Alignment
How to do pair-wise alignment?

• Dot plot
• Dynamic programming
Dot Matrix Pair-wise alignment

• Graphically displays any possible sequence alignments
• Suitable for sequences not very much alike
• Can identify all possible matches of residues
• Can detect presence of: Insertion, deletion, direct & inverted
repeats, RNA self-complementary regions
• Generally does not provide actual alignment

Dot Matrix Pair-wise alignment
Example:
• acgcg acacg
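
A dot matrix for these two sequences can be produced with a few lines of Python. This is a minimal sketch without the window and threshold filtering that dot plot programs normally offer; the function name `dot_matrix` is hypothetical. Rows correspond to the first sequence and columns to the second.

```python
def dot_matrix(x, y, dot="*", blank="."):
    """Return one string per residue of x, with a dot wherever it equals a residue of y."""
    return ["".join(dot if a == b else blank for b in y) for a in x]


x, y = "acgcg", "acacg"
print("  " + y)
for residue, row in zip(x, dot_matrix(x, y)):
    print(residue + " " + row)
```

Diagonal runs of dots mark stretches that align without gaps; a run that shifts to a neighbouring diagonal suggests an insertion or deletion.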

Dynamic Programming Pair-wise alignment
• The dot matrix method may not provide an actual alignment and
is not quantitative.
• Dynamic programming is a method that determines the optimal
alignment by comparing all possible pairs of characters between
the two sequences.
• It is fundamentally similar to the dot matrix method in that it
also creates a two-dimensional alignment grid.
• However, it finds the alignment in a more quantitative way by
converting a dot matrix into a scoring matrix to account for
matches and mismatches between sequences.
• By searching for the set of highest scores in this matrix, the
best alignment can be accurately obtained

Dynamic Programming Pair-wise alignment
Dynamic Programming Idea:
• dynamic programming: solve an instance of a problem by
taking advantage of solutions for subparts of the problem
– reduce problem of best alignment of two sequences to best
alignment of all prefixes of the sequences
– avoid recalculating the scores already considered
• example: Fibonacci sequence 1, 1, 2, 3, 5, 8, 13, 21, 34…
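
The Fibonacci example can be written as a table that is filled once, so that no term is ever recomputed. This is a minimal Python sketch; the 1-based indexing matches the sequence as listed above.

```python
def fibonacci(n):
    """Return the n-th Fibonacci number (1-indexed: 1, 1, 2, 3, 5, ...)."""
    table = [0, 1]                       # table[i] holds the i-th term
    for i in range(2, n + 1):
        # Each entry reuses the two stored sub-solutions instead of recursing.
        table.append(table[i - 1] + table[i - 2])
    return table[n]


print([fibonacci(i) for i in range(1, 10)])  # [1, 1, 2, 3, 5, 8, 13, 21, 34]
```

Sequence alignment uses the same idea with a two-dimensional table: each cell stores the best score for a pair of prefixes.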

Dynamic Programming Idea: Pair-wise alignment
[A] is the maximum score from one of the three directions plus the
matching score at the current position

Dynamic Programming Idea: Pair-wise alignment

Dynamic Programming Idea
Dynamic Programming Types:
• Global alignment: The classical global pair-wise alignment
algorithm using dynamic programming is the Needleman–Wunsch
algorithm.
• Local alignment: The first application of dynamic programming
in local alignment is the Smith–Waterman algorithm.
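
The table-filling step shared by both algorithms can be sketched in Python. This is a score-only sketch under stated assumptions: a linear gap penalty and illustrative scores (match +1, mismatch -1, gap -1), with no traceback to recover the alignment itself. The function name `alignment_score` and the `local` flag are hypothetical.

```python
def alignment_score(x, y, match=1, mismatch=-1, gap=-1, local=False):
    """Optimal global (Needleman-Wunsch) or local (Smith-Waterman) alignment score."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    if not local:
        # Global alignment: leading gaps are charged in the first row and column.
        for i in range(1, n + 1):
            F[i][0] = i * gap
        for j in range(1, m + 1):
            F[0][j] = j * gap
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            cell = max(F[i - 1][j - 1] + s,   # diagonal: align x[i-1] with y[j-1]
                       F[i - 1][j] + gap,     # up: gap in y
                       F[i][j - 1] + gap)     # left: gap in x
            if local:
                cell = max(cell, 0)           # local alignment may restart anywhere
            F[i][j] = cell
            best = max(best, cell)
    # Global: score of aligning both full sequences; local: best cell anywhere.
    return best if local else F[n][m]


print(alignment_score("GGGACGTCCC", "TTTACGTAAA"))              # -2
print(alignment_score("GGGACGTCCC", "TTTACGTAAA", local=True))  # 4
```

The two calls use the same pair of sequences: the global score is dragged down by the unrelated flanks, while the local score reports only the shared ACGT island.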

Multiple Sequence Alignment (MSA)
• MSA can be seen as a generalization of pairwise sequence
alignment: instead of aligning two sequences, three or more
sequences are aligned together.
• A representation of a set of sequences, in which equivalent
residues (e.g. functional or structural) are aligned in columns.

MSA Application
• Phylogenetic studies
• Comparative genomics
• Hierarchical function annotation: homologs, domains, motifs
• Gene identification, validation
• Structure comparison, modelling
• RNA sequence, structure, function
• Interaction networks
• Human genetics, SNPs
• Therapeutics, drug design, drug discovery
[Figure labels: DBD, insertion domain, LBD, binding sites / mutations]

MSA Application

MSA Construction
• Traditional approaches
-- Optimal multiple alignment
-- Progressive multiple alignment
• Alignment parameters
-- Residue similarity matrices
-- Gap penalties
• Alternative approaches
-- Iterative alignment methods
-- Consistency or combinatorial algorithms

MSA: Which is Optimum Method
• Limitations of Dynamic programming for MSA?
MSA for 3 sequences by dynamic approach:
Each alignment is a path through the dynamic programming matrix
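
The limitation can be made concrete by counting table cells: the dynamic programming matrix has one axis per sequence, so its size is the product of the sequence lengths (plus one on each axis for the empty prefix). The following Python sketch only does this arithmetic; the sequence length of 300 is an arbitrary illustrative value.

```python
from math import prod


def dp_cells(lengths):
    """Number of cells in a full dynamic programming table for the given sequences."""
    return prod(n + 1 for n in lengths)


for count in (2, 3, 4, 5):
    print(count, "sequences of length 300:", dp_cells([300] * count), "cells")
```

Each added sequence multiplies the table size by roughly the sequence length, which is why exact dynamic programming is practical for pairs but not for routine multiple alignment.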

MSA: Heuristic Algorithms
Because the use of dynamic programming is not feasible for
routine multiple sequence alignment, faster heuristic
algorithms have been developed.
• Progressive: e.g. ClustalW
• Iterative: e.g. MUSCLE
• Consistency-based: e.g. T-Coffee and ProbCons

MSA: Progressive Method
The most practical and widely used method in multiple
sequence alignment is the hierarchical extension of pairwise
alignment methods.
The principle is that a multiple alignment is achieved by
successive application of pairwise methods.
This is based on global alignment.

MSA: Choosing sequences for alignment
• The more sequences to align, the better.
• Don’t include very similar (>80%) sequences.
• Sub-groups should be pre-aligned separately, and one member
of each subgroup should be included in the final multiple
alignment.

MSA: Progressive Algorithm Principle

Limitations of Progressive Algorithm
1. Not suitable for comparing sequences of different lengths
because it is a global alignment–based method.
2. As a result of the use of affine gap penalties, long gaps are not
allowed, and, in some cases, this may limit the accuracy of the
method.
3. It depends on the initial pair-wise alignments.
4. The final alignment result is also influenced by the order of
sequence addition.
5. Once gaps are introduced in the early steps of alignment, they
are fixed.
6. Any errors made in these steps cannot be corrected.
7. This problem of “once an error, always an error” can propagate
throughout the entire alignment.

MSA: Consistency Based Algorithm
• A new generation of algorithms has been developed, which specifically
targets some of the problems of the Clustal program.
• T-Coffee (Tree-based Consistency Objective Function for alignment
Evaluation; www.ch.embnet.org/software/TCoffee.html) performs
progressive sequence alignments as in Clustal.
• The main difference is that, in processing a query, T-Coffee performs both
global and local pair-wise alignment for all possible pairs involved.
• The global pair-wise alignment is performed using the Clustal program. The
local pair-wise alignment is generated by the Lalign program, from which
the top ten scored alignments are selected.
• The collection of local and global sequence alignments is pooled to form
a library.
• The consistency of the alignments is evaluated. For every pair of residues in
a pair of sequences, a consistency score is calculated for both global and
local alignments. Each pair-wise alignment is further aligned with a
possible third sequence.

Editing and displaying alignments
• Sequence editors are used for:
-- manual alignment/editing of sequences
-- visualization of data
-- data management
-- import/export of data
-- graphical enhancement of data for presentations

Alignment Editors
• Alignments produced with PILEUP (or CLUSTAL) can be adjusted
with LINEUP.
• Nicely shaded printouts can be produced with PRETTYBOX
• GCG's SeqLab X-Windows interface has a superb multiple
sequence editor - the best editor of any kind.
• JalView

Other Alignment Editors
• The MACAW and SeqVu programs for Macintosh and
GeneDoc and DCSE for PCs are free and provide
excellent editor functionality.
• Many “comprehensive” molecular biology programs
include multiple alignment functions:
-- MacVector, OMIGA, Vector NTI, and GeneTool/PepTool all
include a built-in version of CLUSTAL

Alignment parameters: Scoring matrices
What should the scores for a match and a mismatch be? (How should
a change in sequence be evaluated?)
• Scoring system: a set of values for quantifying the likelihood of
one residue being substituted by another in an alignment.
• The scoring system is called a substitution matrix.
• It is derived from statistical analysis of residue substitution data
from sets of reliable alignments of highly related sequences.

Scoring matrices: Nucleotide Sequences
• Scoring matrices for nucleotide sequences are relatively
simple. A positive value or high score is given for a match and
a negative value or low score for a mismatch.
• Scoring is based on the assumption that the frequencies of
mutation are equal for all bases.
• However, this assumption may not be realistic; observations
show that transitions occur more frequently than
transversions.
• Therefore, a more sophisticated statistical model with
different probability values to reflect the two types of
mutations is needed.

• Models of Nucleotide Substitution

– Jukes Cantor model
– Kimura model

DNA evolutionary models: Jukes-Cantor

• Assumes that nucleotides are uniformly distributed and independently
substituted, so that the probabilities of all nucleotide substitutions are
the same and do not depend on the particular nucleotides involved
• Assumes equal rates of mutation for transitions and transversions
Prob(mutation to a given other base in one unit of time) = α, α << 1
Prob(no mutation) = 1-3α
[Figure: the four bases, with every substitution between them labelled α]
Purines = A, G
Pyrimidines = T, C

DNA evolutionary models: Kimura 2-parameter
• In DNA replication, errors can be transitions (purine for purine,
pyrimidine for pyrimidine) or transversions (purine for pyrimidine and
vice versa)
• In reality, transitions are more common than transversions
[Figure: the four bases, with transitions labelled α and transversions
labelled β]
Transition probability = α
Transversion probability = β
Purines = A, G; Pyrimidines = T, C
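
The two models can be compared by writing out their one-step substitution probabilities. This is a minimal Python sketch: the function names are hypothetical and the numeric values of α and β are arbitrary illustrative choices. Each base has one possible transition and two possible transversions, so the probability of no change is 1 - α - 2β; setting β = α recovers the Jukes-Cantor value 1 - 3α.

```python
BASES = "AGCT"
PURINES = {"A", "G"}


def is_transition(a, b):
    """True when a and b are both purines or both pyrimidines (and differ)."""
    return a != b and (a in PURINES) == (b in PURINES)


def kimura_matrix(alpha, beta):
    """One-step substitution probabilities under the Kimura 2-parameter model."""
    P = {}
    for a in BASES:
        for b in BASES:
            if a == b:
                P[a, b] = 1 - alpha - 2 * beta   # no mutation
            elif is_transition(a, b):
                P[a, b] = alpha                  # A<->G or C<->T
            else:
                P[a, b] = beta                   # purine <-> pyrimidine
    return P


def jukes_cantor_matrix(alpha):
    """Jukes-Cantor is the special case with equal rates for all substitutions."""
    return kimura_matrix(alpha, alpha)


k2p = kimura_matrix(alpha=0.02, beta=0.005)
print(k2p["A", "G"], k2p["A", "C"], k2p["A", "A"])
```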
Scoring matrices: Amino Acid
• Amino acid substitution matrices, which are 20 × 20 matrices, have
been devised to reflect the likelihood of residue substitutions.
• Scoring matrices for amino acids are more complicated because scoring
has to reflect the physicochemical properties of amino acid residues, as
well as the likelihood of certain residues being substituted among true
homologous sequences.
• Amino acids with similar physicochemical properties:
-- are easily substituted for each other
-- are likely to preserve the essential functional and structural
features.
• Substitutions between residues of different physicochemical
properties:
-- are more likely to cause disruptions to the structure and function.
-- are less likely to be selected in evolution because they render
proteins nonfunctional.
• For example, phenylalanine, tyrosine, and tryptophan all
share aromatic ring structures. Because of their chemical
similarities, they are easily substituted for each other without
perturbing the regular function and structure of the protein.
• Similarly, arginine, lysine, and histidine are all large basic
residues and there is a high probability of them being
substituted for each other.
• Aspartic acid, glutamic acid, asparagine, and glutamine belong
to the acid and acid amide groups and can be associated with
relatively high frequencies of substitution.


Types of amino acid substitution matrices
(i) Based on the interchangeability of the genetic code or on the
physicochemical features of amino acids; less accurate.
(ii) Derived from empirical studies of amino acid substitutions: based on
surveys of actual amino acid substitutions among related proteins; most
popular in sequence alignment applications. Examples:
• PAM = Point Accepted Mutation
• BLOSUM = Blocks Substitution Matrix

• The empirical matrices, which include PAM and BLOSUM matrices, are derived
from actual alignments of highly similar sequences.
• High score: for a more likely substitution.
• Low score: for a rare substitution.
• Positive score: frequency of amino acid substitutions found in a data set of
homologous sequences is greater than would have occurred by random chance.
This occurs with substitutions of very similar or identical residues.
• Zero score: frequency of amino acid substitutions found in the homologous
sequence data set is equal to that expected by chance. In this case, the
relationship between the amino acids is weakly similar at best in terms of
physicochemical properties.
• Negative score: frequency of amino acid substitutions found in the homologous
sequence data set is less than would have occurred by random chance. This
normally occurs with substitutions between dissimilar residues.
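
The sign rules above follow from expressing each entry as a log-odds ratio: the logarithm of the observed substitution frequency divided by the frequency expected by chance. The Python sketch below shows only this core calculation, with made-up frequencies and a base-2 logarithm chosen for illustration; published matrices additionally scale and round the values.

```python
import math


def log_odds_score(observed_freq, expected_freq):
    """Log-odds score: positive if observed more often than chance, negative if less."""
    return math.log2(observed_freq / expected_freq)


# Hypothetical frequencies, chosen only to show the three cases.
print(log_odds_score(0.020, 0.010))  # observed twice as often as chance: positive
print(log_odds_score(0.010, 0.010))  # observed as often as chance: zero
print(log_odds_score(0.005, 0.010))  # observed half as often as chance: negative
```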
Scoring matrices: (1) PAM
• Point Accepted Mutation, also called Dayhoff PAM matrices
• Derived based on the evolutionary divergence between
sequences of the same cluster.
• One PAM unit is defined as 1% of the amino acid positions that
have been changed.
• To construct a PAM1 substitution table, a group of closely
related sequences with mutation frequencies corresponding to
one PAM unit is chosen.
• Based on the collected mutational data from this group of
sequences, a substitution matrix can be derived.

Scoring matrices: Amino Acid : PAM
• One PAM unit is defined as 1% amino acid change or one
mutation per 100 residues.

Scoring matrices: Amino Acid : PAM
• PAM250 amino acid substitution matrix.
• Residues are grouped according to physicochemical similarities.
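
Higher-numbered PAM matrices are extrapolated from PAM1: the PAM1 mutation probability matrix is multiplied by itself n times to model n PAM units of divergence, and the result is then converted to scores. The Python sketch below illustrates only the extrapolation step, using a made-up two-letter alphabet in place of the 20 × 20 amino acid matrix; the 1% change rate mirrors the definition of one PAM unit, and the function names are hypothetical.

```python
def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    size = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def mat_power(M, n):
    """Multiply M by itself n times (n >= 1)."""
    result = M
    for _ in range(n - 1):
        result = mat_mul(result, M)
    return result


# Toy "PAM1": each of two residues changes into the other with probability 1%.
toy_pam1 = [[0.99, 0.01],
            [0.01, 0.99]]
toy_pam250 = mat_power(toy_pam1, 250)
print(toy_pam250)
```

After 250 steps the probability of observing a different residue is close to one half in this toy model, far more than 250 × 1% would naively suggest is impossible, because repeated and reverse changes at the same position are accounted for.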

Scoring matrices: (2): BLOSUM
• PAM1 is based on a relatively small set of extremely closely
related sequences.
• Sequence alignment statistics for more divergent sequences are
not available. To fill in the gap, a new set of substitution matrices
has been developed. This is the series of blocks amino acid
substitution matrices (BLOSUM), all of which are derived based on
direct observation for every possible amino acid substitution in
multiple sequence alignments.
• These were constructed based on more than 2,000 conserved
amino acid patterns representing 500 groups of protein
sequences.
• The sequence patterns, also called blocks, are ungapped
alignments of less than sixty amino acid residues in length. The
frequencies of amino acid substitutions of the residues in these
blocks are calculated to produce a numerical table, or block
substitution matrix.

Scoring matrices: Amino Acid : BLOSUM
• Instead of being extrapolated from a single matrix, as the PAM
matrices are, each BLOSUM matrix is built directly from sequences
selected at a particular percentage identity.
• For example, BLOSUM62 indicates that the matrix was built from
blocks in which sequences sharing at least 62% identity were
clustered together, so that each cluster counts as a single sequence.
• Other BLOSUM matrices based on sequence groups of various
identity levels have also been constructed
Scoring matrices: Amino Acid : BLOSUM

Scoring matrices: Amino Acid : PAM & BLOSUM


STATISTICAL SIGNIFICANCE OF SEQUENCE ALIGNMENT
• By calculating alignment scores of a large number of unrelated
sequence pairs, a distribution model of the randomized sequence
scores can be derived.
• From the distribution, a statistical test can be performed based
on the number of standard deviations from the average score.

Procedure:
1. An optimal alignment between the two given sequences is computed.
2. Unrelated sequences of the same length are then generated through a
randomization process in which one of the two sequences is randomly
shuffled.
3. A new alignment score is then computed for the shuffled sequence
pair.
4. More such scores are similarly obtained through repeated shuffling.
5. The pool of alignment scores from the shuffled sequences is used to
generate parameters for the extreme value distribution.
6. The original alignment score is then compared against the distribution
of random alignments to determine whether the score is beyond
random chance.
7. If the score is located in the extreme margin of the distribution, that
means that the alignment between the two sequences is unlikely to be due
to random chance and is thus considered significant.
8. A P-value is given to indicate the probability that the original
alignment is due to random chance.
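
The shuffling procedure can be sketched in Python as follows. This is a simplified illustration, not the method used by any particular program: it scores alignments with a small Smith-Waterman routine using assumed scores (match +1, mismatch -1, gap -1), and it reports a z-score and an empirical P-value counted directly from the shuffled scores instead of fitting an extreme value distribution as in step 5. All function names and the example sequence are hypothetical.

```python
import random


def local_score(x, y, match=1, mismatch=-1, gap=-1):
    """Best local alignment score (Smith-Waterman, linear gap penalty)."""
    prev = [0] * (len(y) + 1)
    best = 0
    for a in x:
        curr = [0]
        for j, b in enumerate(y, start=1):
            s = match if a == b else mismatch
            cell = max(0, prev[j - 1] + s, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


def shuffle_test(x, y, n_shuffles=200, seed=0):
    """Return (observed score, z-score, empirical P-value) from shuffling y."""
    rng = random.Random(seed)               # fixed seed for reproducibility
    observed = local_score(x, y)            # step 1
    letters = list(y)
    null_scores = []
    for _ in range(n_shuffles):             # steps 2 to 4
        rng.shuffle(letters)
        null_scores.append(local_score(x, "".join(letters)))
    mean = sum(null_scores) / n_shuffles
    var = sum((s - mean) ** 2 for s in null_scores) / (n_shuffles - 1)
    z = (observed - mean) / var ** 0.5 if var > 0 else float("inf")
    # Fraction of shuffled scores at least as high as the observed one;
    # the +1 terms keep the estimate above zero.
    p = (1 + sum(s >= observed for s in null_scores)) / (n_shuffles + 1)
    return observed, z, p


seq = "ATGCATGCGGATCGAATTCAG"
print(shuffle_test(seq, seq))
```

Shuffling preserves the length and composition of the sequence, so the null scores reflect what composition alone can produce.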
• A P-value resulting from the test provides a much more reliable
indicator of possible homologous relationships than using percent
identity values. It has been shown that:
• P-value < 10^-100: an exact match between the two sequences.
• P-value in the range of 10^-50 to 10^-100: a nearly identical match.
• P-value in the range of 10^-5 to 10^-50: clear homology.
• P-value in the range of 10^-1 to 10^-5: distant homologs.
• P-value > 10^-1: may be randomly related.

• Which substitution favors?

Objectives:
• Similarity search in a database using BLASTn to verify the
species to which the generated sequence belongs.
• Identification of the species and gene to which the given sequence
shows the highest similarity.
Tools:
• BLAST
http://blast.ncbi.nlm.nih.gov/Blast.cgi

BLAST
Basic Local Alignment Search Tool

• A BLAST search enables a researcher to compare a query
sequence with a library or database of sequences, and
identify library sequences that resemble the query sequence
above a certain threshold.
Query (my sequence): ABCDEF
Database sequences:
Fish: AAAAABCDEFAAAA
Dog: CCCBBBBDDDDAAAA
Mouse: NNNOOOOPPPPQQQ

BLAST: How it works
• high-scoring segment pair (HSP)

BLASTn example
Query (the sequence that you want to know about):
ATCGGACGTGGATCCATCGATC
GATGCGATCGATCGAAATCG

Matrix Selection

Understanding BLASTn result

Alignment score
The sum of the scores of each aligned pair of bases or residues,
and of their nulls (gaps), in the alignment. The higher the alignment
score, the better the alignment.

Max and Total score
High-scoring segment pairs = HSPs
Max Score: the score of the single best-scoring HSP between the hit
and the query. The higher the Max Score, the better the alignment
between the hit and the query.
Total Score: the sum of scores from all HSPs from the same
database sequence.

E (expected) Value
• It describes the number of alignments with an equal or better score
that are expected to occur by chance in a database of a particular size.
• An E value is used (instead of a P value) to describe the
significance of each sequence alignment hit to the query.
• The lower the E value is, the more significant the alignment is.

What do the Score and the e-value really mean?
• The quality of the alignment is represented by the Score (S).
• The score of an alignment is calculated as the sum of substitution
and gap scores. Substitution scores are given by a look-up table
(PAM, BLOSUM) whereas gap scores are assigned empirically.
• The significance of each alignment is computed as an E value (E).
• Expectation value: the number of different alignments with scores
equivalent to or better than S that are expected to occur in a
database search by chance. The lower the E value, the more
significant the score.

What do the Score and the e-value really mean?
Notes on E-values
• Low E-values suggest that sequences are homologous
• High E-values can’t show non-homology
• Statistical significance depends on both the size of the
alignments and the size of the sequence database
• Important consideration for comparing results across
different searches
• E-value increases as database gets bigger
• E-value decreases as alignments get longer
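
The dependence on database size can be seen from the standard relation between a normalized bit score S' and the E-value, E = m · n · 2^(-S'), where m is the query length and n is the total database length. The Python sketch below only evaluates this formula; the function name and the numbers are illustrative assumptions, and real BLAST output also applies length corrections that are not modelled here.

```python
def expect_value(bit_score, query_len, db_len):
    """E-value for a normalized bit score: E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)


small_db = expect_value(bit_score=40, query_len=100, db_len=10**6)
large_db = expect_value(bit_score=40, query_len=100, db_len=2 * 10**6)
better_hit = expect_value(bit_score=50, query_len=100, db_len=10**6)
print(small_db, large_db, better_hit)
```

Doubling the database doubles the E-value for the same score, while a higher score, which longer good alignments tend to produce, lowers it exponentially.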

Homology: Some Guidelines
• Similarity can be indicative of homology
• Generally, if two sequences are significantly similar over their
entire length, they are likely homologous
• Low complexity regions can be highly similar without
being homologous
• Homologous sequences are not always highly similar

Query Coverage and Max Identity
• Query coverage: the amount of the query sequence, expressed as a
percent, that overlaps the subject sequence
• Max identity: the highest percent identity for a set of aligned
segments to the same subject sequence.

Suggested BLAST Cutoffs
• For nucleotide-based searches, one should look for hits with
E-values of 10^-6 or less and sequence identity of 70% or more
• For protein-based searches, one should look for hits with
E-values of 10^-3 or less and sequence identity of 25% or more
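
These cutoffs can be applied to a list of hits with a small filter. The Python sketch below assumes that each hit has already been reduced to a name, an E-value, and a percent identity; the hit records and the function name are hypothetical, and parsing real BLAST output is not shown.

```python
# Suggested cutoffs from the guidelines above: (maximum E-value, minimum % identity).
CUTOFFS = {"nucleotide": (1e-6, 70.0), "protein": (1e-3, 25.0)}


def passes_cutoff(evalue, percent_identity, search_type):
    """True if a hit meets the suggested cutoffs for the given search type."""
    max_evalue, min_identity = CUTOFFS[search_type]
    return evalue <= max_evalue and percent_identity >= min_identity


# Hypothetical hits: (name, E-value, percent identity).
hits = [("hit1", 1e-30, 98.0), ("hit2", 1e-4, 85.0), ("hit3", 1e-10, 60.0)]
kept = [name for name, e, ident in hits if passes_cutoff(e, ident, "nucleotide")]
print(kept)  # ['hit1']
```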

Objectives:
• To understand the similarities among a group of sequences
• To determine conserved regions
• To understand the evolutionary relationships among related
sequences.
Tools:
• ClustalW2
http://www.ebi.ac.uk/Tools/msa/clustalw2/
• T-Coffee:
http://tcoffee.crg.cat/apps/tcoffee/index.html
• CLUSTAL: multiple sequence alignment
http://www.ebi.ac.uk/Tools/msa/clustalo/
• Multiple sequence alignment using BioEdit

