
Sequence Alignment

What is alignment?
• A sequence alignment is a way of arranging DNA, RNA, or protein
sequences to identify regions of similarity
[Figure: example sequences ATGCATGC, TGCATGCA, GAATTCAG, and GGATCG; shifting TGCATGCA by one position lines it up with ATGCATGC]

ATGCATGC
 TGCATGCA

Dr. Md. Khademul Islam


Sequence Alignment

• Match: same (or similar) letter in both rows


• Mismatch: different letters in both rows
• Insertion: the letter opposite to a space
• Deletion: the space opposite to a letter
• Indel: a column containing a space



Sequence Alignment

Why align sequences?


• Useful for discovering
-- structure
-- function
-- evolutionary relationship
• For example:
-- two proteins with similar sequences will probably be
structurally or functionally similar.
-- to find whether two (or more) genes or proteins are
evolutionarily related to each other



Sequence Alignment
• Types of alignment by number of sequences:
(i) Pair-wise
(ii) Multiple

• Types of alignment by extent:
(i) Global
(ii) Local



Sequence Alignment
• Global
— attempt to align every residue in every sequence,
— most useful when the sequences in the query set are similar and of
roughly equal size. (This does not mean global alignments cannot end in
gaps.)
— A general global alignment technique is the Needleman–Wunsch
algorithm.

• Local
— Gathers islands of matches
— Stretches of sequences with highest density of matches are aligned
— more useful for dissimilar sequences that are suspected to contain
regions of similarity or similar sequence motifs within their larger
sequence context.
— The Smith–Waterman algorithm is a general local alignment method



Sequence Alignment
• What is an optimal alignment?



Sequence Alignment
• Score matrix & Cost Function?
• The total score assigned to an alignment is the
— sum of terms for matches and mismatches,
— plus the sum of terms for gaps (i.e., indels).

• Formally we can treat the space character just like any other.
Then we obtain the following additive scoring scheme:
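• The score of an alignment (x′, y′) with n columns is

$$S(x', y') = \sum_{i=1}^{n} \delta(x'_i, y'_i)$$

• Where: δ is a score matrix (scoring scheme), and x′_i and y′_i are the
symbols (letters or spaces) in column i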
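
The additive scheme described above can be sketched in a few lines of Python. The function names and the numeric values in `simple_delta` are assumptions made for illustration; they are not taken from the slides.

```python
def alignment_score(x_aligned, y_aligned, delta):
    """Sum delta over the columns of an alignment; '-' is the space character."""
    if len(x_aligned) != len(y_aligned):
        raise ValueError("aligned rows must have the same length")
    return sum(delta(a, b) for a, b in zip(x_aligned, y_aligned))


def simple_delta(a, b, match=1, mismatch=-1, gap=-2):
    """Hypothetical scoring scheme; the values are illustrative only."""
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


# Seven matching columns and two gap columns.
score = alignment_score("ATGCATGC-", "-TGCATGCA", simple_delta)
print(score)
```

Because the space is treated like any other character, gaps need no special handling in the sum itself; only `delta` knows about them.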



Sequence Alignment
• So how should we choose the score matrix δ?
• Hamming distance
• In the simplest case, we do not allow any gaps at all and charge
all mismatches at unit cost. This is called the Hamming distance.

• Edit distance
• If we charge all insertions, deletions, and mismatches at unit
cost, we obtain the Levenshtein distance. This measure is also
called the edit distance, because it counts the number of
elementary edit operations needed to transform one sequence
into the other.
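
A minimal sketch of both distances, assuming plain Python strings as sequences. The function names are chosen here for illustration.

```python
def hamming_distance(x, y):
    """Number of mismatching positions; no gaps are allowed."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(a != b for a, b in zip(x, y))


def edit_distance(x, y):
    """Levenshtein distance with unit cost for every edit operation."""
    # prev[j] is the distance between the previous prefix of x and y[:j]
    prev = list(range(len(y) + 1))
    for i, a in enumerate(x, start=1):
        curr = [i]
        for j, b in enumerate(y, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (a != b)))  # match or mismatch
        prev = curr
    return prev[-1]


print(hamming_distance("ATGCATGC", "TGCATGCA"))
print(edit_distance("ATGCATGC", "TGCATGCA"))
```

For this pair every position mismatches under the Hamming distance, while the edit distance needs only one deletion and one insertion, which shows why allowing gaps matters.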



Sequence Alignment
• Shall we use the same penalty for all gaps?
• Linear gap penalty: every gap position is charged the same cost

Edit distance: 15

• Affine Gap penalty


• Gap opening and gap extension do not have the same cost
• Higher cost for opening a gap, lower cost for extending it
• e.g. open: -5, extension: -1, match: +1, mismatch: -1

Score: 26-5-7-5-1-5-3-5 = -5

Score: 26-5-14 = +7
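
The two scores above can be reproduced with a small sketch. The aligned sequences themselves are not shown in the text, so the gap lengths below (8, 2, 4, and 1 positions versus a single gap of 15) are inferred from the arithmetic and should be read as an assumption. So should the convention that a gap of k positions costs one opening plus k-1 extensions.

```python
def affine_gap_cost(length, gap_open=5, gap_extend=1):
    """Cost of one gap: pay the opening once, then extend the remaining positions."""
    if length <= 0:
        return 0
    return gap_open + (length - 1) * gap_extend


def score_with_affine_gaps(matches, gap_lengths, match=1):
    """Match reward minus the affine cost of every gap (no mismatches assumed)."""
    return matches * match - sum(affine_gap_cost(k) for k in gap_lengths)


scattered = score_with_affine_gaps(26, [8, 2, 4, 1])  # 26-5-7-5-1-5-3-5
contiguous = score_with_affine_gaps(26, [15])         # 26-5-14
print(scattered, contiguous)
```

Both layouts contain 15 gap positions, so a linear penalty cannot tell them apart; the affine penalty prefers the single contiguous gap.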
Homology vs. Similarity

• When two sequences are descended from a common
evolutionary origin, they are said to have a homologous
relationship or share homology.

• A related but different term is sequence similarity, which is the
percentage of aligned residues that are similar in physicochemical
properties such as size, charge, and hydrophobicity.



Similarity vs. Identity

• Sequence similarity and sequence identity are synonymous for
nucleotide sequences. For protein sequences, however, the two
concepts are very different.
• In a protein sequence alignment, sequence identity refers to the
percentage of matches of the same amino acid residues
between two aligned sequences.
• Similarity refers to the percentage of aligned residues that have
similar physicochemical characteristics and can be more readily
substituted for each other.
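
A minimal sketch of the difference between identity and similarity. The similarity groups are a hypothetical simplification (aromatic, basic, and acid/amide residues); real tools decide similarity from a substitution matrix. Counting only gap-free columns in the denominator is also an assumption, since several conventions exist.

```python
# Hypothetical similarity groups; not a substitute for a substitution matrix.
SIMILAR_GROUPS = [set("FYW"), set("RKH"), set("DENQ")]


def is_similar(a, b):
    """Identical residues, or residues that share one of the groups above."""
    return a == b or any(a in g and b in g for g in SIMILAR_GROUPS)


def identity_and_similarity(x_aligned, y_aligned):
    """Percent identity and percent similarity over gap-free columns."""
    columns = [(a, b) for a, b in zip(x_aligned, y_aligned)
               if a != "-" and b != "-"]
    if not columns:
        raise ValueError("no aligned residue pairs")
    identical = sum(a == b for a, b in columns)
    similar = sum(is_similar(a, b) for a, b in columns)
    return 100.0 * identical / len(columns), 100.0 * similar / len(columns)


print(identity_and_similarity("FKDA", "YKEA"))
```

In this toy pair half of the columns are identical, but every column is similar, so similarity is never lower than identity.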



Pair-wise Sequence Alignment

How to do pair-wise alignment?


• Dot plot
• Dynamic programming

Dot Matrix Pair-wise alignment


• Graphically displays any possible sequence alignments
• Suitable for sequences not very much alike
• Can identify all possible matches of residues
• Can detect presence of: Insertion, deletion, direct & inverted
repeats, RNA self-complementary regions
• Generally does not provide actual alignment



Dot Matrix Pair-wise alignment

Example:
• Sequences: acgcg and acacg
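
A minimal sketch of a dot matrix for the two example sequences, with no window or threshold filtering. The function names are chosen here for illustration.

```python
def dot_matrix(x, y):
    """True wherever the residue in row x[i] equals the residue in column y[j]."""
    return [[a == b for b in y] for a in x]


def render(x, y):
    """Text rendering: '*' marks a match, '.' marks a non-match."""
    lines = ["  " + " ".join(y)]
    for a, row in zip(x, dot_matrix(x, y)):
        lines.append(a + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


print(render("acgcg", "acacg"))
```

Runs of `*` along a diagonal correspond to stretches that could be aligned without gaps. The plot shows where matches are, but it does not pick an alignment.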



Dynamic Programming Pair-wise alignment

• The dot matrix method may not provide the actual aligned sequences
and is not quantitative.
• Dynamic programming is a method that determines the optimal
alignment by matching two sequences for all possible pairs of
characters between the two sequences.
• It is fundamentally similar to the dot matrix method in that it
also creates a two-dimensional alignment grid.
• However, it finds the alignment in a more quantitative way by
converting a dot matrix into a scoring matrix to account for
matches and mismatches between sequences.
• By searching for the set of highest scores in this matrix, the
best alignment can be accurately obtained.



Dynamic Programming Pair-wise alignment

Dynamic Programming Idea:

• Dynamic programming: solve an instance of a problem by
taking advantage of solutions for subparts of the problem
– reduce the problem of the best alignment of two sequences to the
best alignment of all prefixes of the sequences
– avoid recalculating the scores already considered
• Example: Fibonacci sequence 1, 1, 2, 3, 5, 8, 13, 21, 34…
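
The Fibonacci example can be sketched with memoization, which is one way to avoid recalculating values that were already computed. Using `functools.lru_cache` is a choice made here for brevity.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def fib(n):
    """n-th Fibonacci number; each value is computed once and then reused."""
    if n <= 2:
        return 1
    return fib(n - 1) + fib(n - 2)


print([fib(n) for n in range(1, 10)])
```

Without the cache the same subproblems would be solved again and again. Alignment by dynamic programming reuses prefix scores in the same way.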



Dynamic Programming Idea: Pair-wise alignment

[A] is the maximum score from one of the three directions plus
the matching score at the current position
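
One common way to write this rule for global alignment with a linear gap penalty is given below. The symbols F (the score grid), s (the match/mismatch score), and d (the gap penalty) are notation assumed here, not taken from the slides.

```latex
F(i,j) = \max
\begin{cases}
F(i-1,\,j-1) + s(x_i, y_j) & \text{(diagonal: align } x_i \text{ with } y_j\text{)} \\
F(i-1,\,j) - d & \text{(vertical: gap in } y\text{)} \\
F(i,\,j-1) - d & \text{(horizontal: gap in } x\text{)}
\end{cases}
\qquad F(0,0)=0,\quad F(i,0)=-i\,d,\quad F(0,j)=-j\,d
```

Only the diagonal move adds the matching score; the other two moves each add a gap.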



Dynamic Programming Idea

Dynamic Programming Types:

• Global alignment: The classical global pair-wise alignment
algorithm using dynamic programming is the Needleman–Wunsch
algorithm.

• Local alignment: The first application of dynamic
programming in local alignment is the Smith–Waterman
algorithm.
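
A minimal sketch of both algorithms, computing only the optimal score (no traceback) with a linear gap penalty. The scoring values are illustrative assumptions, and production implementations differ in many details.

```python
def needleman_wunsch_score(x, y, match=1, mismatch=-1, gap=-1):
    """Global alignment score: every residue of both sequences is aligned."""
    prev = [j * gap for j in range(len(y) + 1)]   # first row: leading gaps
    for i, a in enumerate(x, start=1):
        curr = [i * gap]                          # first column: leading gaps
        for j, b in enumerate(y, start=1):
            diag = prev[j - 1] + (match if a == b else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def smith_waterman_score(x, y, match=1, mismatch=-1, gap=-1):
    """Local alignment score: cells never drop below zero, best cell wins."""
    best = 0
    prev = [0] * (len(y) + 1)
    for a in x:
        curr = [0]
        for j, b in enumerate(y, start=1):
            diag = prev[j - 1] + (match if a == b else mismatch)
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


print(needleman_wunsch_score("acgcg", "acacg"))
print(smith_waterman_score("GGGACGTAAA", "CCCACGTTTT"))
```

The only structural differences are the initialization, the floor at zero, and where the final score is read from: the last cell for global alignment, the best cell anywhere for local alignment.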



Multiple Sequence Alignment (MSA)

• MSA can be seen as a generalization of pairwise sequence
alignment: instead of aligning two sequences, three or more
sequences are aligned together.
• A representation of a set of sequences, in which equivalent
residues (e.g. functional or structural) are aligned in columns.



MSA Application
• Phylogenetic studies
• Comparative genomics
• Hierarchical function annotation: homologs, domains, motifs
• Gene identification, validation
• Structure comparison, modelling
• RNA sequence, structure, function
• Interaction networks
• Human genetics, SNPs
• Therapeutics, drug design, drug discovery
[Figure labels: DBD, insertion domain, LBD, binding sites / mutations]



MSA Construction

• Traditional approaches
– Optimal multiple alignment
– Progressive multiple alignment

• Alignment parameters
– Residue similarity matrices
– Gap penalties

• Alternative approaches
– Iterative alignment methods
– Consistency or combinatorial algorithms



MSA: Which is Optimum Method

• Limitations of Dynamic programming for MSA?

MSA for 3 sequences by dynamic approach:

Each alignment is a path through the dynamic programming matrix
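
Each additional sequence adds one dimension to the dynamic programming matrix, so the number of cells grows exponentially with the number of sequences. The sketch below only counts cells; the sequence length of 300 is an arbitrary example value.

```python
def dp_cells(lengths):
    """Number of cells in a full dynamic programming lattice for these lengths."""
    cells = 1
    for n in lengths:
        cells *= n + 1   # one extra row/column per dimension for the empty prefix
    return cells


for count in (2, 3, 4):
    print(count, "sequences:", dp_cells([300] * count), "cells")
```

Going from two to three sequences of this length already multiplies the work by about 300, which is why exact dynamic programming is rarely used beyond a handful of sequences.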



MSA: Heuristic Algorithms

Because the use of dynamic programming is not feasible for
routine multiple sequence alignment, faster heuristic
algorithms have been developed.

• Progressive: e.g. ClustalW

• Iterative: e.g. MUSCLE

• Consistency-based: e.g. T-Coffee and ProbCons



MSA: Progressive Method

The most practical and widely used method in multiple
sequence alignment is the hierarchical extension of pairwise
alignment methods.

The principle is that a multiple alignment is achieved by
successive application of pairwise methods.

This approach is based on global alignment.



MSA: Choosing sequences for alignment

• The more sequences to align the better.
• Don’t include highly similar (>80%) sequences.
• Sub-groups should be pre-aligned separately, and one member
of each subgroup should be included in the final multiple
alignment.



MSA: Progressive Algorithm Principle
Limitations of Progressive Algorithm

1. Not suitable for comparing sequences of different lengths
because it is a global alignment–based method.
2. As a result of the use of affine gap penalties, long gaps are not
allowed, and, in some cases, this may limit the accuracy of the
method.
3. It depends on the initial pair-wise alignments.
4. The final alignment result is also influenced by the order of
sequence addition.
5. Once gaps are introduced in the early steps of alignment, they are
fixed.
6. Any errors made in these steps cannot be corrected.
7. This problem of “once an error, always an error” can propagate
throughout the entire alignment.
MSA: Consistency Based Algorithm
• A new generation of algorithms has been developed, which specifically
targets some of the problems of the Clustal program.
• T-Coffee (Tree-based Consistency Objective Function for alignment
Evaluation; www.ch.embnet.org/software/TCoffee.html) performs
progressive sequence alignments as in Clustal.
• The main difference is that, in processing a query, T-Coffee performs both
global and local pair-wise alignment for all possible pairs involved.
• The global pair-wise alignment is performed using the Clustal program. The
local pair-wise alignment is generated by the Lalign program, from which
the top ten scored alignments are selected.
• The collection of local and global sequence alignments is pooled to form
a library.
• The consistency of the alignments is evaluated. For every pair of residues in
a pair of sequences, a consistency score is calculated for both global and
local alignments. Each pair-wise alignment is further aligned with a
possible third sequence.
Editing and displaying alignments

• Sequence editors are used for:
– manual alignment/editing of sequences
– visualization of data
– data management
– import/export of data
– graphical enhancement of data for presentations



Alignment Editors

• Alignments produced with PILEUP (or CLUSTAL) can be adjusted
with LINEUP.
• Nicely shaded printouts can be produced with PRETTYBOX.
• GCG's SeqLab X-Windows interface has a superb multiple
sequence editor - the best editor of any kind.
• JalView



Other Alignment Editors

• The MACAW and SeqVu programs for Macintosh and
GeneDoc and DCSE for PCs are free and provide
excellent editor functionality.
• Many “comprehensive” molecular biology programs
include multiple alignment functions:
– MacVector, OMIGA, Vector NTI, and GeneTool/PepTool all
include a built-in version of CLUSTAL

Alignment parameters: Scoring matrices

What should the score of a match or a mismatch be? (How should a
change in sequence be evaluated?)

• Scoring system: a set of values for quantifying the likelihood of
one residue being substituted by another in an alignment.
• The scoring system is called a substitution matrix.
• It is derived from statistical analysis of residue substitution data
from sets of reliable alignments of highly related sequences.



Scoring matrices: Nucleotide Sequences

• Scoring matrices for nucleotide sequences are relatively
simple. A positive value or high score is given for a match and
a negative value or low score for a mismatch.
• Scoring is based on the assumption that the frequencies of
mutation are equal for all bases.
• However, this assumption may not be realistic; observations
show that transitions occur more frequently than
transversions.
• Therefore, a more sophisticated statistical model with
different probability values to reflect the two types of
mutations is needed.



Scoring matrices: Nucleotide Sequences

• Models of nucleotide substitution
– Jukes–Cantor model
– Kimura model



Scoring matrices: Nucleotide Sequences

DNA evolutionary models: Jukes–Cantor

• Assumes that nucleotides are uniformly distributed and independently
substituted, so that the probabilities of nucleotide substitutions are
all the same and do not depend on the particular nucleotides
• Assumes an equal rate of mutation for transitions and transversions

Prob(mutation to a given other nucleotide in one unit of time) = α, α << 1
Prob(no mutation) = 1-3α

[Figure: substitution diagram in which every arrow between nucleotides is labeled α]

Purines = A, G
Pyrimidines = T, C
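
A minimal sketch of the one-step Jukes–Cantor probability table, using a nested dictionary as the matrix. The value of α is an arbitrary example.

```python
BASES = "AGCT"


def jukes_cantor_matrix(alpha):
    """One-step probabilities: alpha for any change, 1 - 3*alpha for no change."""
    return {a: {b: (1 - 3 * alpha if a == b else alpha) for b in BASES}
            for a in BASES}


jc = jukes_cantor_matrix(0.01)
print(jc["A"])
```

Every row sums to 1 because each base has exactly three possible substitutions, each with probability α.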
Scoring matrices: Nucleotide Sequences

DNA evolutionary models: Kimura 2-parameter

• In DNA replication, errors can be transitions (purine for purine,
pyrimidine for pyrimidine) or transversions (purine for pyrimidine and
vice versa)
• In reality, transitions are more common than transversions

[Figure: substitution diagram with transition arrows labeled α and transversion arrows labeled β]

Transition probability = α
Transversion probability = β
Purines = A, G; Pyrimidines = T, C
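
A minimal sketch of the one-step Kimura two-parameter table. Each base has one transition partner and two transversion partners, so the probability of no change is 1 - α - 2β. The numeric values are arbitrary examples.

```python
PURINES = set("AG")
PYRIMIDINES = set("CT")


def is_transition(a, b):
    """Purine<->purine or pyrimidine<->pyrimidine change."""
    return a != b and ({a, b} <= PURINES or {a, b} <= PYRIMIDINES)


def kimura_matrix(alpha, beta):
    """One-step probabilities: alpha for transitions, beta for transversions."""
    matrix = {}
    for a in "AGCT":
        row = {}
        for b in "AGCT":
            if a == b:
                row[b] = 1 - alpha - 2 * beta
            elif is_transition(a, b):
                row[b] = alpha
            else:
                row[b] = beta
        matrix[a] = row
    return matrix


k2p = kimura_matrix(alpha=0.02, beta=0.005)
print(k2p["A"])
```

Setting α equal to β gives the Jukes–Cantor table, where all substitutions are equally likely.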
Scoring matrices: Amino Acid
• Amino acid substitution matrices, which are 20 × 20 matrices, have
been devised to reflect the likelihood of residue substitutions.
• Scoring matrices for amino acids are more complicated because scoring
has to reflect the physicochemical properties of amino acid residues, as
well as the likelihood of certain residues being substituted among true
homologous sequences.
• Amino acids with similar physicochemical properties:
-- are easily substituted for each other
-- are likely to preserve the essential functional and structural
features.
• Substitutions between residues of different physicochemical
properties:
-- are more likely to cause disruptions to structure and function.
-- are less likely to be selected in evolution because they render
proteins nonfunctional.
Scoring matrices: Amino Acid
• For example, phenylalanine, tyrosine, and tryptophan all
share aromatic ring structures. Because of their chemical
similarities, they are easily substituted for each other without
perturbing the regular function and structure of the protein.

• Similarly, arginine, lysine, and histidine are all large basic
residues and there is a high probability of them being
substituted for each other.

• Aspartic acid, glutamic acid, asparagine, and glutamine belong
to the acid and acid amide groups and can be associated with
relatively high frequencies of substitution.



Scoring matrices: Amino Acid

Types of amino acid substitution matrices

(i) Based on interchangeability of the genetic code or amino acid properties:
built from the genetic code or the physicochemical features of amino acids;
less accurate.

(ii) Derived from empirical studies of amino acid substitutions: based on
surveys of actual amino acid substitutions among related proteins; most
popular in sequence alignment applications. Examples:
• PAM = Point Accepted Mutation
• BLOSUM = Blocks Substitution Matrix



Scoring matrices: Amino Acid

• The empirical matrices, which include PAM and BLOSUM matrices, are derived
from actual alignments of highly similar sequences.

• High score: for a more likely substitution.
• Low score: for a rare substitution.

• Positive score: the frequency of amino acid substitutions found in a data set
of homologous sequences is greater than would have occurred by random chance.
This normally occurs with substitutions of very similar or identical residues.
• Zero score: the frequency of amino acid substitutions found in the homologous
sequence data set is equal to that expected by chance. In this case, the
relationship between the amino acids is weakly similar at best in terms of
physicochemical properties.
• Negative score: the frequency of amino acid substitutions found in the
homologous sequence data set is less than would have occurred by random chance.
This normally occurs with substitutions between dissimilar residues.
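
The sign rules above follow from a log-odds score: the logarithm of the observed substitution frequency divided by the frequency expected by chance. The sketch below is a simplified version with hypothetical frequencies and a hypothetical scale factor; published matrices use their own counts, scaling, and rounding.

```python
import math


def log_odds_score(observed_pair_freq, freq_a, freq_b, scale=2.0):
    """Scaled log2 ratio of observed to chance-expected pair frequency."""
    expected = freq_a * freq_b   # chance expectation if residues were independent
    return scale * math.log2(observed_pair_freq / expected)


# Hypothetical frequencies: both residues have background frequency 0.1.
print(log_odds_score(0.02, 0.1, 0.1))    # observed more often than chance
print(log_odds_score(0.01, 0.1, 0.1))    # observed as often as chance
print(log_odds_score(0.005, 0.1, 0.1))   # observed less often than chance
```

A ratio above 1 gives a positive score, a ratio of 1 gives zero, and a ratio below 1 gives a negative score.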
Scoring matrices: (1) PAM

• Point Accepted Mutation, also called Dayhoff PAM matrices
• Derived based on the evolutionary divergence between
sequences of the same cluster.
• One PAM unit is defined as 1% of the amino acid positions that
have been changed.
• To construct a PAM1 substitution table, a group of closely
related sequences with mutation frequencies corresponding to
one PAM unit is chosen.
• Based on the collected mutational data from this group of
sequences, a substitution matrix can be derived.



Scoring matrices: Amino Acid : PAM
• One PAM unit is defined as 1% amino acid change or one
mutation per 100 residues.



Scoring matrices: Amino Acid : PAM
• PAM250 amino acid substitution matrix.
• Residues are grouped according to physicochemical similarities.
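
In the Dayhoff model, matrices for larger evolutionary distances such as PAM250 are extrapolated from PAM1 by multiplying the PAM1 mutation probability matrix by itself. The sketch below uses a toy two-letter alphabet with hypothetical probabilities, not the real 20 × 20 data, and shows only the probability step, not the conversion to log-odds scores.

```python
def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(m, n):
    """n-th power of a square matrix, for n >= 1."""
    result = m
    for _ in range(n - 1):
        result = mat_mul(result, m)
    return result


# Toy "PAM1": 1% chance that a residue changes in one PAM unit.
toy_pam1 = [[0.99, 0.01],
            [0.01, 0.99]]
toy_pam250 = mat_power(toy_pam1, 250)
print(toy_pam250[0])
```

After 250 steps the probability of finding the original residue has dropped a long way from 0.99, yet the rows still sum to 1. This is why 250 PAM units does not mean that 250% of positions differ.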



Scoring matrices: (2): BLOSUM

• PAM1 is based on a relatively small set of extremely closely
related sequences.
• Sequence alignment statistics for more divergent sequences are
not available. To fill in the gap, a new set of substitution matrices
has been developed. This is the series of blocks amino acid
substitution matrices (BLOSUM), all of which are derived based on
direct observation for every possible amino acid substitution in
multiple sequence alignments.
• These were constructed based on more than 2,000 conserved
amino acid patterns representing 500 groups of protein
sequences.
• The sequence patterns, also called blocks, are ungapped
alignments of less than sixty amino acid residues in length. The
frequencies of amino acid substitutions of the residues in these
blocks are calculated to produce a numerical table, or block
substitution matrix.
Scoring matrices: Amino Acid : BLOSUM

• Instead of using an extrapolation function, the BLOSUM
matrices are derived directly from sequences at a stated level of
identity: the number in the matrix name is the percentage identity
value of the sequences selected for construction of the matrix.
• For example, BLOSUM62 indicates that the sequences
selected for constructing the matrix share an average identity
value of 62%.
• Other BLOSUM matrices based on sequence groups of various
identity levels have also been constructed.



STATISTICAL SIGNIFICANCE OF SEQUENCE ALIGNMENT

• By calculating alignment scores of a large number of unrelated
sequence pairs, a distribution model of the randomized sequence
scores can be derived.

• From the distribution, a statistical test can be performed based
on the number of standard deviations from the average score.



STATISTICAL SIGNIFICANCE OF SEQUENCE ALIGNMENT
Procedure:

1. An optimal alignment between the two given sequences is computed
and scored.
2. Unrelated sequences of the same length are then generated through a
randomization process in which one of the two sequences is randomly
shuffled.
3. A new alignment score is then computed for the shuffled sequence
pair.
4. More such scores are similarly obtained through repeated shuffling.
5. The pool of alignment scores from the shuffled sequences is used to
generate parameters for the extreme value distribution.
6. The original alignment score is then compared against the distribution
of random alignments to determine whether the score is beyond
random chance.
7. If the score is located in the extreme margin of the distribution, that
means that the alignment between the two sequences is unlikely to be due
to random chance and is thus considered significant.
8. A P-value is given to indicate the probability that the original
alignment is due to random chance.
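
The shuffling procedure can be sketched as follows. This sketch swaps in a simpler technique for the last steps: instead of fitting an extreme value distribution, it reports an empirical P-value, the fraction of shuffled scores at least as high as the original. The scoring values, the number of shuffles, and the function names are assumptions made here.

```python
import random


def local_score(x, y, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman style local alignment score with a linear gap penalty."""
    best = 0
    prev = [0] * (len(y) + 1)
    for a in x:
        curr = [0]
        for j, b in enumerate(y, start=1):
            diag = prev[j - 1] + (match if a == b else mismatch)
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


def shuffle_test(x, y, n_shuffles=200, seed=0):
    """Score x against y, then against shuffled copies of y."""
    rng = random.Random(seed)          # fixed seed keeps the run repeatable
    observed = local_score(x, y)
    letters = list(y)
    shuffled_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)           # same composition, random order
        shuffled_scores.append(local_score(x, "".join(letters)))
    at_least = sum(s >= observed for s in shuffled_scores)
    p_value = (at_least + 1) / (n_shuffles + 1)   # empirical, never exactly 0
    return observed, shuffled_scores, p_value


query = "ATCGGACGTGGATCCATCGATC"
observed, scores, p = shuffle_test(query, query, n_shuffles=50)
print(observed, max(scores), p)
```

Shuffling keeps the base composition but destroys the order, so the shuffled scores estimate what chance alone produces for sequences of this length and composition.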
STATISTICAL SIGNIFICANCE OF SEQUENCE ALIGNMENT

• A P-value resulting from the test provides a much more reliable
indicator of possible homologous relationships than percent
identity values. As a guideline:

• P-value < 10^−100: an exact match between the two sequences.
• P-value 10^−50 to 10^−100: a nearly identical match.
• P-value 10^−5 to 10^−50: clear homology.
• P-value 10^−1 to 10^−5: distant homologs.
• P-value > 10^−1: may be randomly related.



Scoring matrices: Amino Acid : PAM & BLOSUM

• Which substitutions are favored?



Objectives:
• Similarity search in a database using BLASTn to verify the
species to which the generated sequence belongs.
• Identification of the species and gene to which the given sequence
shows the highest similarity.

Tools:
• BLAST
http://blast.ncbi.nlm.nih.gov/Blast.cgi



BLAST

Basic Local Alignment Search Tool


• A BLAST search enables a researcher to compare a query
sequence with a library or database of sequences, and
identify library sequences that resemble the query sequence
above a certain threshold.

[Figure: the query "ABCDEF" (my sequence) is compared with database sequences: Fish AAAAABCDEFAAAA, Dog CCCBBBBDDDDAAAA, Mouse NNNOOOOPPPPQQQ]

BLAST: How it works

high-scoring segment pair (HSP)


BLASTn example

Query (the sequence that you want to know about):
ATCGGACGTGGATCCATCGATC
GATGCGATCGATCGAAATCG



Matrix Selection



Understanding BLASTn result

Alignment score

A summation of the scores of each aligned pair of bases or residues,
and of their nulls (gaps), in the alignment. The higher the alignment
score, the better the alignment.



Max and Total score

HSPs = high-scoring segment pairs

Max Score: the score of the best-scoring HSP between the hit and the
query. The higher the Max Score, the better the alignment between the
hit and the query.

Total Score: the sum of the scores of all HSPs from the same
database sequence.



E (expected) Value

• It describes the chance of randomly achieving the same
alignment in a database of a particular size.

• An E value is used to describe the significance (instead of a P
value) of each sequence alignment hit to the query.

• The lower the E value is, the more significant the alignment is.



What do the Score and the e-value really mean?

• The quality of the alignment is represented by the Score (S).

• The score of an alignment is calculated as the sum of substitution
and gap scores. Substitution scores are given by a look-up table
(PAM, BLOSUM), whereas gap scores are assigned empirically.

• The significance of each alignment is computed as an E value (E).

• Expectation value: the number of different alignments with scores
equivalent to or better than S that are expected to occur in a
database search by chance. The lower the E value, the more
significant the score.



What do the Score and the e-value really mean?
Notes on E-values
• Low E-values suggest that sequences are homologous

• High E-values cannot show non-homology

• Statistical significance depends on both the size of the
alignments and the size of the sequence database

• This is an important consideration when comparing results across
different searches

• E-value increases as the database gets bigger

• E-value decreases as alignments get longer
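
These trends follow from the Karlin–Altschul relation used in BLAST statistics, E = K·m·n·e^(−λS), where m is the query length, n is the database length, and S is the raw score. In the sketch below the values of K and λ are hypothetical placeholders; real values depend on the scoring system.

```python
import math


def expect_value(score, query_len, db_len, k=0.1, lam=0.3):
    """Expected number of chance hits scoring at least `score` (K, lambda assumed)."""
    return k * query_len * db_len * math.exp(-lam * score)


small_db = expect_value(score=60, query_len=200, db_len=1_000_000)
large_db = expect_value(score=60, query_len=200, db_len=100_000_000)
higher_score = expect_value(score=90, query_len=200, db_len=1_000_000)
print(small_db, large_db, higher_score)
```

The E-value grows in proportion to the database size and falls exponentially with the score. Longer alignments usually reach higher scores, which is why they tend to have lower E-values.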



Homology: Some Guidelines

• Similarity can be indicative of homology

• Generally, if two sequences are significantly similar over their
entire length, they are likely homologous

• Low complexity regions can be highly similar without
being homologous

• Homologous sequences are not always highly similar



Query Coverage and Max Identity

• Query coverage: the amount of the query sequence, expressed as a
percent, that overlaps the subject sequence

• Max identity: the highest percent identity for a set of aligned
segments to the same subject sequence



Suggested BLAST Cutoffs

• For nucleotide-based searches, one should look for hits with
E-values of 10^-6 or less and sequence identity of 70% or more

• For protein-based searches, one should look for hits with
E-values of 10^-3 or less and sequence identity of 25% or more
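
The cutoffs above can be applied as a simple filter. The `Hit` record and the example hits are hypothetical; they do not correspond to any real BLAST output format.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    subject: str      # name of the database sequence (hypothetical)
    evalue: float     # E-value reported for the hit
    identity: float   # percent identity


# (maximum E-value, minimum percent identity) per search type, as suggested above.
CUTOFFS = {"nucleotide": (1e-6, 70.0), "protein": (1e-3, 25.0)}


def passes_cutoff(hit, search_type):
    """True if the hit satisfies both the E-value and the identity cutoff."""
    max_evalue, min_identity = CUTOFFS[search_type]
    return hit.evalue <= max_evalue and hit.identity >= min_identity


hits = [Hit("subject_1", 1e-40, 98.0),
        Hit("subject_2", 1e-4, 85.0),
        Hit("subject_3", 1e-10, 60.0)]
kept = [h.subject for h in hits if passes_cutoff(h, "nucleotide")]
print(kept)
```

A hit must pass both tests: the second example hit fails on E-value and the third fails on identity.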



Objectives:
• To understand the similarities among a group of sequences
• To determine conserved regions
• To understand the evolutionary relationship among related
sequences.

Tools:
• ClustalW2:
http://www.ebi.ac.uk/Tools/msa/clustalw2/

• T-Coffee:
http://tcoffee.crg.cat/apps/tcoffee/index.html
CLUSTAL: multiple sequence alignment
http://www.ebi.ac.uk/Tools/msa/clustalo/
Multiple sequence alignment using BioEdit
