What is alignment?
• A sequence alignment is a way of arranging the sequences of
DNA, RNA, or protein to identify regions of similarity.
[Figure: example sequences ATGCATGC, TGCATGCA, GAATTCAG, and GGATCG; ATGCATGC is aligned against TGCATGCA]
• Types of alignment:
(i) Global
(ii) Local
• Local
— Gathers islands of matches
— Stretches of sequences with highest density of matches are aligned
— More useful for dissimilar sequences that are suspected to contain
regions of similarity or similar sequence motifs within their larger
sequence context.
— The Smith–Waterman algorithm is a general local alignment method
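The local alignment idea can be sketched in a few lines of Python. This is a minimal Smith–Waterman sketch with a linear gap penalty. The scoring values (match = 2, mismatch = -1, gap = -2) and the function name are assumptions chosen for illustration; they do not come from the slides.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for one best local alignment.

    The scoring values are illustrative assumptions, not prescribed ones.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets an alignment restart anywhere: this is what makes it local.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a cell with score 0.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = smith_waterman("ATGCATGC", "TGCATGCA")
print(score)
print(top)
print(bottom)
```

For the two example sequences, the island of matches found is the shared stretch TGCATGC; the unmatched ends are simply left out of the alignment.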
• Formally we can treat the space character just like any other.
Then we obtain the following additive scoring scheme:
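• The score of an alignment (x′, y′) of length L is the sum of its column scores:
score(x′, y′) = Σ_{i=1..L} s(x′_i, y′_i), where s also assigns a score to columns containing the space character.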
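An additive scoring scheme sums one score per alignment column. The sketch below assumes a simple scheme (match = +1, mismatch = -1, column with a space = -2) and uses "-" as the space character; these values and the function name are illustrative assumptions, not values from the slides.

```python
def alignment_score(x_aln, y_aln, match=1, mismatch=-1, gap=-2):
    """Sum the column scores of an alignment given as two equal-length rows."""
    if len(x_aln) != len(y_aln):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(x_aln, y_aln):
        if a == "-" or b == "-":
            total += gap        # column containing the space character
        elif a == b:
            total += match      # identical residues
        else:
            total += mismatch   # substitution
    return total


print(alignment_score("ATGCATGC-", "-TGCATGCA"))
```

Under these assumed values, the alignment above has seven matching columns and two columns with a space.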
• Edit distance
• If we charge all insertions, deletions, and mismatches at unit
cost, we obtain the Levenshtein distance. This measure is also
called the edit distance, because it counts the number of
elementary edit operations needed to transform one sequence
into the other.
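The unit-cost edit distance can be computed with dynamic programming. The following is a minimal sketch that keeps only two rows of the table; the function name is an assumption for illustration.

```python
def edit_distance(s, t):
    """Levenshtein distance: insertions, deletions, and mismatches cost 1 each."""
    # prev[j] = distance between the first i-1 characters of s and the first j of t
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]  # turning s[:i] into the empty string takes i deletions
        for j, ct in enumerate(t, 1):
            curr.append(min(
                prev[j] + 1,               # delete cs
                curr[j - 1] + 1,           # insert ct
                prev[j - 1] + (cs != ct),  # match (cost 0) or mismatch (cost 1)
            ))
        prev = curr
    return prev[-1]


print(edit_distance("acgcg", "acacg"))
```

The two short sequences used here differ by a single substitution at the third position.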
Edit distance: 15
Score: 26-5-7-5-1-5-3-5 = -5
Score: 26-5-14 = +7
Dr. Md. Khademul Islam
Homology vs. Similarity
Example:
• acgcg acacg
• Traditional approaches
Optimal multiple alignment
Progressive multiple alignment
• Alignment parameters
Residue similarity matrices
Gap penalties
• Alternative approaches
Iterative alignment methods
Consistency or Combinatorial algorithms
Scoring matrices: Nucleotide Sequences
Transition probability = α (purine ↔ purine or pyrimidine ↔ pyrimidine)
Transversion probability = β (purine ↔ pyrimidine)
Purines = A, G; pyrimidines = T, C
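The purine/pyrimidine classes determine whether a substitution is a transition or a transversion. The sketch below classifies a substitution and builds a small nucleotide scoring matrix from it. The score values (match = 1, transition = -1, transversion = -2) are hypothetical placeholders; the slides give only the probabilities α and β, not scores.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(a, b):
    """Classify a pair of nucleotides as match, transition, or transversion."""
    if a == b:
        return "match"
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def nucleotide_matrix(match=1, transition=-1, transversion=-2):
    """Build a 4 x 4 scoring matrix as a dict keyed by (base, base).

    The default scores are illustrative assumptions only.
    """
    scores = {"match": match, "transition": transition, "transversion": transversion}
    return {(a, b): scores[substitution_type(a, b)] for a in "ACGT" for b in "ACGT"}


matrix = nucleotide_matrix()
print(substitution_type("A", "G"), matrix[("A", "G")])
print(substitution_type("A", "C"), matrix[("A", "C")])
```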
Scoring matrices: Amino Acid
• Amino acid substitution matrices, which are 20 × 20 matrices, have
been devised to reflect the likelihood of residue substitutions.
• Scoring matrices for amino acids are more complicated because scoring
has to reflect the physicochemical properties of amino acid residues, as
well as the likelihood of certain residues being substituted among true
homologous sequences.
• Amino acids with similar physicochemical properties:
-- are easily substituted for one another
-- are likely to preserve the essential functional and structural
features.
• Substitutions between residues of different physicochemical
properties:
-- are more likely to disrupt structure and function.
-- are less likely to be selected in evolution because they tend to render
proteins nonfunctional.
• For example, phenylalanine, tyrosine, and tryptophan all
share aromatic ring structures. Because of their chemical
similarities, they are easily substituted for each other without
perturbing the regular function and structure of the protein.
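The idea of conservative substitutions can be illustrated with a toy scoring function. The aromatic group (F, Y, W) follows the example above; the other groups are one common textbook grouping, and the score values are hypothetical. Real searches would use an empirical matrix rather than a scheme like this.

```python
# One-letter amino acid codes grouped by physicochemical property.
# Only the aromatic group comes from the text; the rest is an assumed grouping.
GROUPS = {
    "aromatic": set("FYW"),
    "aliphatic": set("AVLI"),
    "acidic": set("DE"),
    "basic": set("KRH"),
}


def shared_group(a, b):
    """Return the name of a group containing both residues, or None."""
    for name, members in GROUPS.items():
        if a in members and b in members:
            return name
    return None


def toy_score(a, b, identical=2, conservative=1, other=-1):
    """Score a residue pair; all three values are illustrative assumptions."""
    if a == b:
        return identical
    return conservative if shared_group(a, b) else other


print(shared_group("F", "Y"), toy_score("F", "Y"))
print(shared_group("F", "D"), toy_score("F", "D"))
```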
• The empirical matrices, which include PAM and BLOSUM matrices, are derived
from actual alignments of highly similar sequences.
Tools:
• BLAST
http://blast.ncbi.nlm.nih.gov/Blast.cgi
My sequence: ABCDEF
Fish:  AAAAABCDEFAAAA
Dog:   CCCBBBBDDDDAAAA
Mouse: NNNOOOOPPPPQQQ
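The toy database above can be searched with a simple exact substring lookup. This is only a sketch of the idea of querying a sequence against a database; it is not how BLAST works internally, and the function name is an assumption.

```python
def find_hits(query, database):
    """Return {name: position} for every database sequence containing the query."""
    hits = {}
    for name, seq in database.items():
        pos = seq.find(query)  # -1 when the query does not occur
        if pos != -1:
            hits[name] = pos
    return hits


database = {
    "Fish": "AAAAABCDEFAAAA",
    "Dog": "CCCBBBBDDDDAAAA",
    "Mouse": "NNNOOOOPPPPQQQ",
}
hits = find_hits("ABCDEF", database)
print(hits)
```

Only the Fish sequence contains the query, starting at zero-based position 4.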
BLAST: How it works
Query (the sequence that you want to know about):
ATCGGACGTGGATCCATCGATC
GATGCGATCGATCGAAATCG
Max Score: the score of the best-scoring HSP (high-scoring segment pair)
between the query and the hit. The higher the Max Score, the better the
alignment between the hit and the query.
Total Score: the sum of the scores of all HSPs from the same
database sequence.
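The relationship between the two scores can be shown with a short sketch: given the HSP scores for each database sequence, the Max Score is the largest one and the Total Score is their sum. The sequence names and score values below are made up for illustration.

```python
def summarize_hsps(hsp_scores_by_subject):
    """Map each database sequence to its max and total HSP score."""
    summary = {}
    for subject, scores in hsp_scores_by_subject.items():
        if not scores:
            continue  # a sequence with no HSPs is not a hit
        summary[subject] = {"max_score": max(scores), "total_score": sum(scores)}
    return summary


# Hypothetical HSP scores for two database sequences.
example = {"seqA": [120.0, 45.5, 30.0], "seqB": [98.0]}
summary = summarize_hsps(example)
print(summary)
```

When a hit has a single HSP, as for the hypothetical seqB, the two scores are equal.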
The lower the E value is, the more significant the alignment is.
• For nucleotide-based searches, one should look for hits with
E-values of 10^-6 or less and sequence identity of 70% or more.
• For protein-based searches, one should look for hits with
E-values of 10^-3 or less and sequence identity of 25% or more.
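These rules of thumb can be applied as a simple filter. The sketch below hard-codes the two thresholds stated above; the function name and the example hit values are assumptions for illustration.

```python
# (maximum E-value, minimum percent identity) per search type, as stated above.
THRESHOLDS = {
    "nucleotide": (1e-6, 70.0),
    "protein": (1e-3, 25.0),
}


def is_significant(evalue, identity_pct, search_type):
    """Return True if a hit passes the rule-of-thumb cutoffs for its search type."""
    max_evalue, min_identity = THRESHOLDS[search_type]
    return evalue <= max_evalue and identity_pct >= min_identity


print(is_significant(1e-10, 85.0, "nucleotide"))
print(is_significant(1e-4, 85.0, "nucleotide"))
print(is_significant(1e-4, 30.0, "protein"))
```

The same hypothetical E-value of 10^-4 fails the nucleotide cutoff but passes the protein one, which is why the search type has to be taken into account.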
Tools:
• ClustalW2
http://www.ebi.ac.uk/Tools/msa/clustalw2/
• T-Coffee:
http://tcoffee.crg.cat/apps/tcoffee/index.html
CLUSTAL: multiple sequence alignment
http://www.ebi.ac.uk/Tools/msa/clustalo/
Multiple sequence alignment using BioEdit