PAIR-WISE SEQUENCE COMPARISON

UNIT -II
Sequence analysis

Sequence
What is sequencing? What is it good for?
• The genetic information of an organism is encoded in its DNA.
• DNA is composed of 4 building blocks (nucleotides), represented as A, T, C and G.
Ex: DNA: ATCCGGATGG TAGGCCTACC
• Some DNA encodes proteins, RNA, or regulatory elements.
• Proteins are made of 20 building blocks (amino acids).
Ex: Protein: CATGAW
Alignment
Sequences from related proteins or genes were found to be similar:
they can be aligned so that many corresponding residues match.
This discovery was very important, since strong similarity between
two genes is a strong argument for their homology.
Much of bioinformatics is based on it.
Why Do Alignment?
Homology: similarity due to descent from a common ancestor.
Often we can infer homology from similarity; thus we can infer
structure/function from sequence similarity.
Sequence Alignment
• Sequence alignment lies at the heart of bioinformatics. It is a way of
arranging DNA, RNA or protein sequences in order to identify regions of
similarity among them.
• It is used to infer structural, functional and evolutionary relationships between the
sequences.
• Alignment finds the level of similarity between a query sequence and different database
sequences.
• It measures the alignment quantitatively by assigning scores.
Homologous
Homologous sequences can be divided into groups:
• Orthologous sequences: sequences that differ because they are found in
different species, i.e. they diverged after a speciation event
(e.g. human α-globin and mouse α-globin)
• Paralogous sequences: sequences that differ because of a gene duplication
event (e.g. human α-globin and human β-globin, various versions of both)
• Xenologs (genetics): homologous sequences that are found in different
species because of horizontal gene transfer.
Importance of sequence comparison
Discovering functional and evolutionary relationships in biological sequences:
– Similar sequences → evolutionary relationship
– Evolutionary relationship → related function
– Orthologs → same (or almost the same) function in different organisms.

Three Key Questions
Q1: What do we want to align?
• Global alignment: find the best match of both sequences in their entirety
• Local alignment: find the best subsequence match
• Semi-global alignment: find the best match without penalizing gaps on the
ends of the alignment
Q2: How do we “score” an alignment?
• gap penalty function
• substitution matrix
Q3: How do we find the “best” alignment?
• simple approach: compute & score all possible alignments (infeasible for long
sequences, because the number of possible alignments grows exponentially with length)
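To make Q2 concrete, here is a minimal sketch of scoring one already-aligned pair of sequences. The function name `score_alignment` and the default values (match = 1, mismatch = -1, gap = -2) are illustrative assumptions. A flat match/mismatch score stands in for a full substitution matrix, and the gap penalty is assumed to be linear (the same cost for every gap column).

```python
def score_alignment(row_a, row_b, match=1, mismatch=-1, gap=-2):
    """Score two aligned rows of equal length; '-' marks a gap.

    Assumed scheme: flat match/mismatch scores and a linear gap penalty.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            raise ValueError("a column cannot pair a gap with a gap")
        if x == "-" or y == "-":
            total += gap        # one gap column
        elif x == y:
            total += match      # identical residues
        else:
            total += mismatch   # substitution
    return total


print(score_alignment("AAAC", "A-GC"))
```

With a real substitution matrix, the `match`/`mismatch` branch would become a table lookup on the residue pair.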
Discovering sequence similarity by dot plots
Dot plots are two-dimensional graphs showing a comparison of two sequences.
The principle used to generate the dot plot is:
The top (x) and left (y) axes of a rectangular array are used to represent the
two sequences to be compared.
Ex:
Seq 1: TWILIGHTZONE
Seq 2: MIDNIGHTZONE
S: T T A C T C A A T
T: A C T C A T T A C
What is dot matrix analysis?
A dot matrix analysis is a method for comparing two sequences to look for
possible alignments (Gibbs and McIntyre 1970).
The algorithm for a dot matrix:
1. One sequence (A) is listed across the top of the matrix and the other (B) is
listed down the left side.
2. Starting from the first character in B, one moves across the page, keeping in the
first row and placing a dot in every column where the character in A is the same.
3. The process is continued row by row until all possible comparisons between A and B
are made.
4. Any region of similarity is revealed by a diagonal row of dots.
5. Isolated dots not on a diagonal represent random matches.
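The steps above can be sketched in a few lines. This is a minimal illustration with no window or threshold filtering; the function names `dot_matrix` and `render` are hypothetical, and the two sequences are the TWILIGHTZONE / MIDNIGHTZONE pair used as an example earlier.

```python
def dot_matrix(seq_a, seq_b):
    """Boolean matrix: seq_a runs across the top, seq_b down the left side.

    Cell [i][j] is True when seq_b[i] equals seq_a[j].
    """
    return [[a == b for a in seq_a] for b in seq_b]


def render(seq_a, seq_b):
    """Return a text picture of the dot matrix ('*' = match, '.' = no match)."""
    lines = ["  " + " ".join(seq_a)]
    for ch, row in zip(seq_b, dot_matrix(seq_a, seq_b)):
        lines.append(ch + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


print(render("TWILIGHTZONE", "MIDNIGHTZONE"))
```

In the printed matrix the shared suffix IGHTZONE shows up as an unbroken diagonal run of `*`, while the scattered off-diagonal `*` marks are chance matches.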
Types of dot plot matrix
A region of similarity appears as a diagonal run of dots:
The principal diagonal shows identical sequences.
Global and local alignments appear as diagonals; DIRECT repeats appear as diagonals parallel to the principal diagonal.
Multiple diagonals indicate repetition.
A reverse diagonal (perpendicular to the principal diagonal) indicates an INVERSION.
A reverse diagonal crossing the principal diagonal (X) indicates a PALINDROME.
Formation of a box indicates a low-complexity region.

Alignment algorithms
• An alignment program tries to find the best alignment between two sequences given
the scoring system.
• This can be seen as trying to find a path through the dot plot diagram that includes all (or
the most visible) diagonals.
Alignment types:
• Global: alignment between the complete sequence A and the complete sequence B
• Local: alignment between a subsequence of A and a subsequence of B
Computer implementation (Algorithms)
• Dynamic programming:
◦ Global: Needleman-Wunsch
◦ Local: Smith-Waterman
Types of Sequence Alignment
• There are several different types of alignment:
1. Pair-wise alignment (align 2 sequences) vs. 2. Multiple sequence alignment (align > 2 sequences)
3. Global alignment (across the entire length of the sequences) vs. 4. Local alignment (across part of the sequences)
Definitions
Optimal alignment - one that exhibits the most
correspondences. It is the alignment with the highest
score. May or may not be biologically meaningful.
Global alignment - Needleman-Wunsch (1970)
maximizes the number of matches between the
sequences along the entire length of the sequences.
Local alignment - Smith-Waterman (1981) gives the
highest scoring local match between two sequences.
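As a rough illustration of the local case, the sketch below follows the usual textbook form of Smith-Waterman: cell scores are never allowed to drop below 0, the best score may sit anywhere in the matrix, and the traceback stops at the first 0 cell. The function name and the scores (match = 1, mismatch = -1, linear gap = -2) are assumptions chosen for illustration, not values from these notes.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-2):
    """Return (best local score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]   # first row and column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,                     # restart the local alignment
                          H[i - 1][j - 1] + s,   # diagonal move
                          H[i - 1][j] + gap,     # gap in b
                          H[i][j - 1] + gap)     # gap in a
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until a zero cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("TWILIGHTZONE", "MIDNIGHTZONE"))
```

For this pair the call returns the shared region IGHTZONE with score 8; the differing prefixes are simply left out of the local alignment.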
Dynamic Programming
Dynamic programming solves the original problem by dividing it into
smaller, overlapping subproblems whose solutions are stored and reused.
The Needleman-Wunsch and Smith-Waterman algorithms for sequence alignment
are both based on the dynamic programming approach.
Three steps in Dynamic Programming
1. Initialization
2. Matrix fill or scoring
3. Traceback and alignment
Developing Pairwise Sequence Alignment Algorithms
Pairwise Global Alignment
Global alignment - Needleman-Wunsch (1970)
◦ maximizes the number of matches between the sequences along the
entire length of the sequences.
Reasons for making a global alignment:
◦ Checking minor differences between two sequences
◦ Analyzing polymorphisms (e.g. SNPs) between closely related sequences
Needleman & Wunsch
Place each sequence along one axis.
Place score 0 at the upper-left corner.
Fill in the 1st row & column with multiples of the gap penalty.
Fill in the rest of the matrix with the max value of 3 possible moves:
◦ Vertical move: score of the cell above + gap penalty
◦ Horizontal move: score of the cell to the left + gap penalty
◦ Diagonal move: score of the upper-left cell + match/mismatch score
The optimal alignment score is in the lower-right corner.
To reconstruct the optimal alignment, trace back where the max at each step
came from, and stop when the origin is reached.
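A minimal sketch of these steps follows, assuming a flat match/mismatch score and a linear gap penalty. The function name, argument names and default scores are illustrative assumptions. When several moves tie, this sketch prefers the diagonal move, so it reports only one of the possibly several optimal alignments.

```python
def needleman_wunsch(top, left, match=1, mismatch=-1, gap=-2):
    """Global alignment. Returns (score matrix, aligned top, aligned left)."""
    rows, cols = len(left) + 1, len(top) + 1
    F = [[0] * cols for _ in range(rows)]
    # Initialization: first row and column hold multiples of the gap penalty.
    for j in range(1, cols):
        F[0][j] = j * gap
    for i in range(1, rows):
        F[i][0] = i * gap
    # Matrix fill: best of the diagonal, vertical and horizontal moves.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if left[i - 1] == top[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)
    # Traceback from the lower-right corner to the origin.
    i, j = rows - 1, cols - 1
    out_top, out_left = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if left[i - 1] == top[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + s:       # diagonal move
                out_top.append(top[j - 1])
                out_left.append(left[i - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and F[i][j] == F[i - 1][j] + gap:   # vertical move
            out_top.append("-")
            out_left.append(left[i - 1])
            i -= 1
        else:                                        # horizontal move
            out_top.append(top[j - 1])
            out_left.append("-")
            j -= 1
    return F, "".join(reversed(out_top)), "".join(reversed(out_left))


F, aligned_top, aligned_left = needleman_wunsch("AAAC", "AGC")
print(F[-1][-1])      # optimal score, taken from the lower-right corner
print(aligned_top)
print(aligned_left)
```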
Example
Let gap = -2, match = 1, mismatch = -1.

        empty    A    A    A    C
empty       0   -2   -4   -6   -8
A          -2    1   -1   -3   -5
G          -4   -1    0   -2   -4
C          -6   -3   -2   -1   -1

Two optimal alignments, each with score -1:
AAAC    AAAC
A-GC    -AGC
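The example shows more than one alignment because the traceback can branch wherever two moves tie for the maximum. The sketch below (function name hypothetical, same scores as the example) follows every tied branch instead of picking one. Under these scores it finds the two alignments listed above plus a third, AAAC / AG-C, which also scores -1.

```python
def all_optimal_alignments(top, left, match=1, mismatch=-1, gap=-2):
    """Return (optimal score, sorted list of all optimal global alignments)."""
    rows, cols = len(left) + 1, len(top) + 1
    F = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):
        F[0][j] = j * gap
    for i in range(1, rows):
        F[i][0] = i * gap
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if left[i - 1] == top[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)

    results = []

    def walk(i, j, rev_top, rev_left):
        """Follow every move that could have produced F[i][j]."""
        if i == 0 and j == 0:
            # Columns were collected from the end, so reverse them.
            results.append((rev_top[::-1], rev_left[::-1]))
            return
        if i > 0 and j > 0:
            s = match if left[i - 1] == top[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + s:            # diagonal
                walk(i - 1, j - 1, rev_top + top[j - 1], rev_left + left[i - 1])
        if i > 0 and F[i][j] == F[i - 1][j] + gap:        # vertical
            walk(i - 1, j, rev_top + "-", rev_left + left[i - 1])
        if j > 0 and F[i][j] == F[i][j - 1] + gap:        # horizontal
            walk(i, j - 1, rev_top + top[j - 1], rev_left + "-")

    walk(rows - 1, cols - 1, "", "")
    return F[-1][-1], sorted(results)


score, alignments = all_optimal_alignments("AAAC", "AGC")
print(score)
for aligned_top, aligned_left in alignments:
    print(aligned_top, aligned_left)
```

Enumerating every branch is practical only for short sequences, since the number of co-optimal paths can grow quickly.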
