
Chapter - 3

Biological Sequence Analysis & Alignment


3.1. Sequence comparison, similarity alignment
3.2. Database Similarity Searching: Database Searching
Tools & Formats (BLAST, FASTA, etc)
3.3. Sequence alignment methods (local & global) and
algorithms
3.4. Sequence alignment techniques: Pairwise & Multiple
Sequence Alignment
3.5. Tools for Sequence alignment: ClustalW, T-Coffee, etc.
3.6. Alignment Interpretation and Scoring methods
Biological Sequence Analysis
• Biological Sequence
➢ A biological sequence is a single, continuous molecule of
nucleic acid or protein.
➢ The methodologies implemented under sequence analysis
include:
- sequence alignment (pairwise & multiple sequence
alignment),
- phylogenetic analysis,
- motif & domain search/prediction,
- identification of novel genes for drug discovery.
• Sequence alignment
➢ Sequence alignment is a way of arranging DNA, RNA or
amino acid (AA) sequences to identify regions of similarity
that may be a consequence of functional, structural or
evolutionary relationships.
➢ The goal of alignment is to find the conserved regions (if
present) between two or more sequences.
➢ These conserved regions are presumed to be functionally
important regions (domains or motifs) of the sequences.
The dilemma: DNA or protein?
Search by similarity: using the nucleotide sequence, or using
the amino acid sequence?

By translating into an amino acid sequence, are we losing
information? Yes! The genetic code is degenerate (two
or more codons can represent the same amino acid).
➢ Very different DNA sequences may code for similar
protein sequences → we certainly do not want to miss those
cases!
• Conclusion:
It is almost always better to compare coding sequences in
their amino acid form, especially if they are very divergent.
For very highly similar sequences, comparing at the
nucleotide level may give better results.
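The effect of degeneracy can be sketched in a few lines of Python. This is only an illustration: the two DNA sequences are made up for the example, and the function names `translate` and `fraction_identical` are hypothetical helpers, not part of any tool named in this chapter. The codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a coding DNA sequence (reading frame 1) into amino acids."""
    usable = len(dna) - len(dna) % 3  # ignore an incomplete trailing codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def fraction_identical(a, b):
    """Fraction of positions that match between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Two made-up coding sequences (assumption: chosen only for illustration).
dna_a = "CTGAGCCGTTCA"
dna_b = "TTATCTAGAAGT"

print(translate(dna_a), translate(dna_b))
print("nucleotide identity:", fraction_identical(dna_a, dna_b))
print("protein identity:   ", fraction_identical(translate(dna_a), translate(dna_b)))
```

The two sequences share only 2 of 12 nucleotides, yet both translate to the same four amino acids, which is the case a nucleotide-level search would be likely to miss.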
Biological Sequence Similarity
• Sequence similarity is used to infer:
1. Homologous genes (homologs)
- are genes that derive from a common ancestral gene.
- Homology is an evolutionary relationship that either
exists or does not; it is not a matter of degree.
- High similarity is evidence for homology. Similar
sequences may be orthologs or paralogs.
2. Orthologous genes
- are homologous genes in different organisms that arose
by speciation; they usually share the same function.
3. Paralogous genes
- are homologous genes within one organism that derive from
gene duplication → they often have divergent functions.
Orthologs and paralogs are often viewed in a single tree.
(Figure: homologous and paralogous genes)
Causes for sequence (dis)similarity
• Mutation (substitution): a nucleotide at a certain location
is replaced by another nucleotide (e.g.: ATA → AGA)
➢ Transition mutations: change from a purine to a purine or
a pyrimidine to a pyrimidine. E.g.: A to G; G to A; C
to T; T to C
➢ Transversion mutations: change from a purine to a
pyrimidine or vice versa. E.g.: A to T; G to C
➢ Synonymous & non-synonymous mutations: a synonymous
mutation leaves the encoded amino acid unchanged; a
non-synonymous mutation changes it.
• Insertion: at a certain location one new nucleotide is
inserted in between two existing nucleotides
(e.g.: AA → AGA)
• Deletion: at a certain location one existing nucleotide
is deleted (e.g.: ACTG → AC-G)
➢ Indel: an insertion or a deletion
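The transition/transversion rule can be written down directly. The sketch below is a minimal illustration for DNA only; the function names `classify_substitution` and `substitutions` are hypothetical, and the sequences must already have equal length (no indels).

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref, alt):
    """Classify a single-nucleotide change as a transition or a transversion."""
    ref, alt = ref.upper(), alt.upper()
    for base in (ref, alt):
        if base not in PURINES | PYRIMIDINES:
            raise ValueError(f"not a DNA base: {base!r}")
    if ref == alt:
        return "no change"
    same_class = ({ref, alt} <= PURINES) or ({ref, alt} <= PYRIMIDINES)
    return "transition" if same_class else "transversion"


def substitutions(before, after):
    """List (position, ref, alt, class) for every differing position (1-based)."""
    if len(before) != len(after):
        raise ValueError("sequences must have the same length (no indels)")
    return [
        (pos, ref, alt, classify_substitution(ref, alt))
        for pos, (ref, alt) in enumerate(zip(before, after), start=1)
        if ref != alt
    ]


# The substitution example from the text: ATA -> AGA
print(substitutions("ATA", "AGA"))
```

In the ATA → AGA example, T (a pyrimidine) is replaced by G (a purine), so the change is a transversion.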
Classification of sequence alignment algorithms
➢ There are two main classes of sequence alignment methods:
- global alignments and local alignments.
➢ In global alignments the entire sequences are aligned,
whereas in local alignments only portions of the sequences
are aligned.
➢ Global alignments are useful for aligning closely related
sequences, whereas local alignments are more suitable
when comparing distantly related sequences.
➢ Pairwise & multiple alignments are the basic tools to
compare sequences.
➢ Global alignment is typically used when closely related
sequences of similar length are aligned together;
➢ the alignment is carried out from the start to the end of
the sequences while searching for the best possible
alignment.
→ Needleman-Wunsch algorithm
➢ Local alignment is mainly used for sequences that differ
in length or share only a region of similarity.
→ This method finds local matches within a stretch of the
sequence instead of looking at the entire sequence.
→ Smith-Waterman algorithm
→ BLAST (Basic Local Alignment Search Tool) is the most
commonly used tool for local sequence alignment & similarity
search.
➢ Gaps are used to show that a nucleotide or amino acid (AA)
has no match in the other sequence; in an evolutionary
context, gaps represent insertions or deletions.
➢ Once an alignment is constructed, identity & similarity
can be quantified.
• Identity is the proportion of aligned positions at which
the nucleotides or AAs match exactly.
• Similarity additionally counts positions where the residues
differ but are of similar type (conservative substitutions),
and may also take gaps into account.
• Global alignment (top) includes matches ignored by local
alignment (bottom).
(Figure: global alignment, 15% identity; local alignment,
30% identity)
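Percent identity can be computed from any gapped alignment. The sketch below assumes one common convention, dividing the number of identical columns by the total alignment length including gap columns; other tools may use a different denominator. The function name `percent_identity` is hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of alignment columns in which both sequences carry the same residue.

    Assumption: the denominator is the full alignment length, gap columns included.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    matches = sum(
        x == y and x != gap
        for x, y in zip(aligned_a, aligned_b)
    )
    return 100.0 * matches / len(aligned_a)


# The deletion example used earlier in the chapter: ACTG aligned with AC-G
print(percent_identity("ACTG", "AC-G"))
```

For ACTG aligned with AC-G, three of the four columns match, giving 75% identity.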
Sequence Similarity & Scoring Methods
1. Dot-Matrix Method
➢ One sequence is placed along the y-axis (left side) &
the other along the x-axis (top).
➢ This produces a simple matrix in which each cell is a
measure of the similarity of the two residues at that
horizontal & vertical position; in the simplest form a dot
is drawn wherever the two residues are identical, so
similar regions appear as diagonal runs of dots.
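A dot matrix in its simplest form, identity only with no window or threshold, can be sketched as follows. The function name `dot_matrix` and the choice of `*` and `.` as cell symbols are assumptions made for the illustration.

```python
def dot_matrix(seq_x, seq_y, dot="*", blank="."):
    """Return one text row per residue of seq_y; a dot marks identical residues.

    seq_x runs along the top (x-axis), seq_y down the left side (y-axis).
    """
    return [
        "".join(dot if x == y else blank for x in seq_x)
        for y in seq_y
    ]


def print_dot_matrix(seq_x, seq_y):
    """Print the matrix with both sequences as axis labels."""
    print("  " + seq_x)
    for residue, row in zip(seq_y, dot_matrix(seq_x, seq_y)):
        print(residue + " " + row)


print_dot_matrix("GATTACA", "ATTAC")
```

A shared region shows up as an unbroken diagonal; isolated dots off the diagonal are chance matches, which practical dot-plot programs usually suppress with a sliding window.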
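2. Dynamic Programming
➢ Dynamic programming is a general optimisation method also
used in computer science, mathematics, management science
& economics.
➢ For alignment, it builds the optimal alignment from the
optimal alignments of shorter prefixes of the sequences; the
Needleman-Wunsch and Smith-Waterman algorithms both work
this way.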
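The dynamic-programming recurrence shared by Needleman-Wunsch (global) and Smith-Waterman (local) can be sketched in one function. This is a minimal score-only version under assumed parameters: a simple match/mismatch score instead of a substitution matrix, a linear gap penalty, and no traceback. The name `alignment_score` and the default scores are choices made for the illustration.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Optimal alignment score of a and b by dynamic programming.

    local=False: global alignment score (Needleman-Wunsch style).
    local=True:  best local alignment score (Smith-Waterman style).
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    if not local:
        # Global alignment: leading gaps are penalised.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap

    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            cell = max(diagonal, up, left)
            if local:
                cell = max(cell, 0)         # a local alignment may restart anywhere
            score[i][j] = cell
            best = max(best, cell)

    # Global: score of aligning both full sequences; local: best cell anywhere.
    return best if local else score[-1][-1]


print(alignment_score("ACGT", "AGT"))                         # global
print(alignment_score("TTTACGTTT", "GGACGGG", local=True))    # local
```

The only differences between the two modes are the initialisation of the first row and column, the floor at zero, and where the final score is read from. Recovering the alignment itself would additionally require a traceback through the matrix.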
Multiple Sequence Alignment
• EBI ClustalW Server
Preparing Multiple Sequence
In the ClustalW output, the symbols beneath each column mean:
➢ “*”: the residues or nucleotides in that column are identical in all
sequences in the alignment.
➢ “:”: conserved substitutions have been observed.
➢ “.”: semi-conserved substitutions have been observed.
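The “*” symbol is the easiest of the three to reproduce. The sketch below marks only fully identical columns and leaves every other column blank; it does not attempt the “:” and “.” symbols, which depend on the residue groups defined by ClustalW. The function name `identity_line` is hypothetical.

```python
def identity_line(aligned_seqs, gap="-"):
    """Return a line with '*' under every column that is identical in all sequences.

    Simplification: columns that are not fully identical, or that contain gaps,
    are left blank (no ':' or '.' symbols).
    """
    lengths = {len(seq) for seq in aligned_seqs}
    if len(lengths) != 1:
        raise ValueError("need at least one sequence, all of the same aligned length")
    return "".join(
        "*" if len(set(column)) == 1 and column[0] != gap else " "
        for column in zip(*aligned_seqs)
    )


alignment = ["ACGT", "ACGA", "ACGT"]
for seq in alignment:
    print(seq)
print(identity_line(alignment))
```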
• MAFFT: Multiple Alignment using Fast Fourier Transform
• MUSCLE: MUltiple Sequence Comparison by Log-Expectation
• T-Coffee: Tree-based Consistency Objective Function For alignment Evaluation


Exercise - 1
1. Pairwise alignment – online + CLC Genomics
Workbench
2. Multiple alignment – online + CLC Genomics
Workbench
3. Local alignment – online
4. Global alignment – online
Database searching tips
➢ Use the latest database version.
➢ Use BLAST (Basic Local Alignment Search Tool) first.
➢ Search both strands when using FASTA.
➢ Translate sequences where relevant.
➢ E < 0.05 is statistically significant, and usually biologically
interesting.
➢ If the query has repeated segments, delete them & repeat
the search.
➢ BLAST is the most used algorithm in bioinformatics; it is even
used as a verb: "to blast".
➢ BLAST allows rapid comparison of a query
sequence against a database.
➢ The BLAST algorithm is fast, accurate, & accessible both
via the web & the command line.
➢ It is popular because it offers a good balance of sensitivity
& speed, and is reliable & flexible.
BLAST
➢ BLAST is fast & can be used to analyse thousands
of sequences & even to compare two genomes.
➢ BLAST is freely available to everyone.
➢ BLAST is straightforward to use and produces
very informative output.
➢ BLAST is a heuristic word-search method: it eliminates
irrelevant sequences early & so saves search time.
➢ BLAST has several subprograms:
- BLASTn - aligns a nucleotide query sequence with a
nucleotide database.
- BLASTp - aligns a protein query sequence with a protein
database.
- BLASTx - aligns a nucleotide query with a protein
database by comparing the six-frame conceptual
translation of the nucleotide sequence.
- tBLASTx - aligns the six-frame translation of a
nucleotide query with the six-frame translation of a
nucleotide database.
- tBLASTn - aligns a protein query sequence with a
translated nucleotide database.
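The "six-frame conceptual translation" used by BLASTx, tBLASTn and tBLASTx can be sketched as follows: three reading frames on the given strand and three on its reverse complement. This is an illustration only, not BLAST's own code; the frame labels (+1 to +3, -1 to -3) and the function names are assumptions, and stop codons are shown as `*`.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA sequence."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate from the first base, ignoring an incomplete trailing codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def six_frame_translation(dna):
    """Translate a DNA sequence in all six reading frames."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for frame, protein in six_frame_translation("ATGGCC").items():
    print(frame, protein)
```

Each of the six translations is then compared at the protein level, which is why translated searches can detect similarity that a nucleotide-level search would miss.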
(Screenshots of the NCBI BLAST web interface:
BLASTn: Search Set, Program Selection, Result, Graphic Summary, Description, Alignment, Tree View;
BLASTp against PDB: Search Set, Graphic Summary, Description, Tree View)
A practical example of sequence alignment
http://www.ncbi.nlm.nih.gov

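BLAST results
➢ E value: the expectation value, i.e. the number of hits with
a score at least as good that are expected to occur by chance
when searching a database of this size. The lower the E,
the more significant the score.
➢ E = 0.0 means the E value is vanishingly small (≤ 10^-1000)
and is rounded to zero.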
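How the E value depends on the score and on the size of the search can be sketched with the standard relation between a bit score S' and the E value, E = m × n × 2^(−S'), where m is the query length and n the total database length. The sketch uses raw lengths for simplicity (an assumption; BLAST itself uses corrected "effective" lengths), and the function name `expected_hits` is hypothetical.

```python
def expected_hits(bit_score, query_length, database_length):
    """E value for a given bit score: E = m * n * 2 ** (-S').

    Assumption: raw lengths are used in place of BLAST's effective lengths.
    """
    return query_length * database_length * 2.0 ** (-bit_score)


# The same bit score is less significant in a larger database.
for db_length in (1_000, 1_000_000, 1_000_000_000):
    print(db_length, expected_hits(40, query_length=300, database_length=db_length))
```

Each additional bit of score halves E, and each tenfold increase in database size multiplies E by ten, which is why E values from searches against different databases are not directly comparable.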
Exercise - 2
BLAST the sequences with the following accession numbers:
• KR093978
• KR093979
• KR093980
• KR093981
• KR093982
