
UNIT -II

Sequence analysis

PAIR-WISE SEQUENCE COMPARISON


Sequence
What is sequencing? What is it good for?

• Genetic information of an organism is encoded in its DNA.
• DNA is composed of 4 building blocks (nucleotides), represented as A, T, C and G.
Ex: DNA: ATCCGGATGG TAGGCCTACC
• Some DNA encodes proteins, RNA, or regulatory elements.
• Proteins are made of 20 building blocks (amino acids).
Ex: Protein: CATGAW
Alignment
The sequences of related proteins or genes are similar enough that
they can be aligned so that many corresponding residues match.
This discovery was very important, since strong similarity between
two genes is a strong argument for their homology.
Much of bioinformatics is based on it.
Why Do Alignment?
Homology: similarity due to descent from a common ancestor.
Often we can infer homology from similarity, and thus we can infer
structure/function from sequence similarity.
Sequence Alignment
• Sequence alignment lies at the heart of bioinformatics. It describes the
arrangement of DNA/RNA or protein sequences so as to identify regions of
similarity among them.
• It is used to infer structural, functional and evolutionary relationships between the
sequences.
• Alignment finds the level of similarity between a query sequence and different
database sequences.
• It measures the alignment quantitatively by assigning scores.
Homologous
Homologous sequences can be divided into the following groups:

• Orthologous sequences: sequences that differ because they are found in
different species (e.g. human α-globin and mouse α-globin)

• Paralogous sequences: sequences that differ because of a gene duplication
event (e.g. human α-globin and human β-globin, various versions of both)

• Xenologous sequences (xenologs): homologous sequences that are found in
different species because of horizontal gene transfer.
Importance of sequence comparison

Discovering functional and evolutionary relationships in biological sequences:

– Similar sequences → evolutionary relationship

– Evolutionary relationship → related function

– Orthologs → same (or almost the same) function in different organisms.


Three Key Questions
Q1: what do we want to align?
Global alignment: find the best match of both sequences in their entirety
Local alignment: find the best subsequence match
Semi-global alignment: find the best match without penalizing gaps at the
ends of the alignment
Q2: how do we “score” an alignment?
gap penalty function
substitution matrix
Q3: how do we find the “best” alignment?
simple approach: compute & score all possible alignments
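For Q2, a minimal sketch of how a gap penalty and a match/mismatch score combine into a single alignment score. The function name `score_alignment`, the linear gap penalty, and the values match = 1, mismatch = -1, gap = -2 are illustrative assumptions; a real substitution matrix would assign a separate score to each residue pair.

```python
def score_alignment(row_a, row_b, match=1, mismatch=-1, gap=-2):
    """Score two equal-length gapped rows column by column (hypothetical helper)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            total += gap        # linear gap penalty: every gap column costs the same
        elif x == y:
            total += match      # identical residues
        else:
            total += mismatch   # substitution
    return total


print(score_alignment("AAAC", "A-GC"))  # 1 - 2 - 1 + 1 = -1
```

The simple approach for Q3 would call such a function on every possible gapped arrangement of the two sequences and keep the highest-scoring one.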
Discovering sequence similarity by dot plots

Dot plots are two-dimensional graphs showing a comparison of two
sequences.
The principle used to generate the dot plot is:
The top (x) and left (y) axes of a rectangular array are used to represent the
two sequences to be compared.
Ex:
Seq 1: TWILIGHTZONE
Seq 2: MIDNIGHTZONE
S: T T A C T C A A T
T: A C T C A T T A C
What is Dot matrix analysis

A dot matrix analysis is a method for comparing two sequences to look for
possible alignment (Gibbs and McIntyre 1970)
The algorithm for a dot matrix:
1. One sequence (A) is listed across the top of the matrix and the other (B) is
listed down the left side
2. Starting from the first character in B, one moves across the page, keeping in the
first row and placing a dot in every column where the character in A is the same
3. The process is continued until all possible comparisons between A and B are
made
4. Any region of similarity is revealed by a diagonal row of dots
5. Isolated dots not on diagonal represent random matches
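The steps above can be sketched in a few lines. This is a minimal illustration, not a reference implementation; the function names `dot_matrix` and `render` are hypothetical, and no window or threshold filtering is applied, so isolated random matches are shown as well.

```python
def dot_matrix(seq_a, seq_b):
    """Rows follow seq_b (left side), columns follow seq_a (top).

    A cell is True where the two characters are identical.
    """
    return [[a == b for a in seq_a] for b in seq_b]


def render(seq_a, seq_b):
    """Return the dot matrix as text, with '*' for a dot and '.' for no match."""
    matrix = dot_matrix(seq_a, seq_b)
    lines = ["  " + " ".join(seq_a)]
    for ch, row in zip(seq_b, matrix):
        lines.append(ch + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


print(render("TWILIGHTZONE", "MIDNIGHTZONE"))
```

For the two example sequences, the shared suffix IGHTZONE shows up as a diagonal run of eight dots in the lower-right part of the matrix.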
Types of dot plot matrix
A region of similarity appears as a diagonal run of dots:
Principal diagonal shows identical sequences.
Global and local alignments appear as diagonal runs; diagonals parallel to the
principal diagonal indicate DIRECT repeats.
Multiple diagonals indicate repetition.
Reverse diagonal (perpendicular to the principal diagonal) indicates INVERSION.
Reverse diagonal crossing the principal diagonal (X shape) indicates PALINDROMES.
Formation of a box indicates a low-complexity region.


Alignment algorithms
• An alignment program tries to find the best alignment between two sequences given
the scoring system.
• This can be seen as trying to find a path through the dotplot diagram including all (or
the most visible) diagonals.
Alignment types:
• Global alignment: between the complete sequence A and the complete sequence B
• Local alignment: between a subsequence of A and a subsequence of B
Computer implementation (Algorithms)
• Dynamic programming:
• Global: Needleman-Wunsch
• Local: Smith-Waterman
Types of Sequence Alignment
• There are several different types of alignment:

1. Pair-wise alignment (align 2 sequences) vs. 2. Multiple sequence alignment (align > 2 sequences)

3. Global alignment (across the entire length of the sequences) vs. 4. Local alignment (across part of the sequences)
Definitions
Optimal alignment - one that exhibits the most
correspondences. It is the alignment with the highest
score. May or may not be biologically meaningful.
Global alignment - Needleman-Wunsch (1970)
maximizes the number of matches between the
sequences along the entire length of the sequences.
Local alignment - Smith-Waterman (1981) gives the
highest scoring local match between two sequences.
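As an illustration of the local-alignment definition, here is a minimal sketch of the standard Smith-Waterman scoring recurrence. The function name and the scoring values (match = 1, mismatch = -1, linear gap = -2) are assumptions chosen for the example; only the best local score is returned, without a traceback.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the highest local alignment score between a and b (sketch)."""
    rows, cols = len(b) + 1, len(a) + 1
    # First row and column stay 0: a local alignment may start anywhere.
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[j - 1] == b[i - 1] else mismatch)
            # The extra 0 lets a poor-scoring prefix be dropped.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best


print(smith_waterman_score("TWILIGHTZONE", "MIDNIGHTZONE"))  # 8, from IGHTZONE
```

Unlike the global case, the best score can sit in any cell of the matrix, which is why the maximum is tracked while filling.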
Dynamic Programming
Dynamic programming solves the original problem by dividing it
into smaller, overlapping subproblems whose solutions are stored and
reused. The Needleman-Wunsch and Smith-Waterman algorithms for
sequence alignment are both based on the dynamic programming approach.
Three steps in Dynamic Programming
1. Initialization
2. Matrix fill or scoring
3. Traceback and alignment
Developing Pairwise Sequence Alignment Algorithms
Pairwise Global Alignment
Global alignment - Needleman-Wunsch (1970)
◦ maximizes the number of matches between the sequences along the
entire length of the sequences.

Reasons for making a global alignment:

◦ Checking minor differences between two sequences
◦ Analyzing polymorphisms (e.g. SNPs) between closely related sequences
Needleman & Wunsch
Place each sequence along one axis
Place score 0 in the upper-left corner
Fill in the 1st row & column with multiples of the gap penalty
Fill in each remaining cell with the max value of 3 possible moves:
◦ Vertical move: score of the cell above + gap penalty
◦ Horizontal move: score of the cell to the left + gap penalty
◦ Diagonal move: score of the upper-left cell + match/mismatch score
The optimal alignment score is in the lower-right corner
To reconstruct the optimal alignment, trace back where the max at each step
came from; stop when the origin is reached.
Example
Let gap = -2, match = 1, mismatch = -1.

        empty    A    A    A    C
empty       0   -2   -4   -6   -8
A          -2    1   -1   -3   -5
G          -4   -1    0   -2   -4
C          -6   -3   -2   -1   -1

Two optimal alignments (score -1, the lower-right cell):
AAAC    AAAC
A-GC    -AGC
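The example can be reproduced with a short sketch of the Needleman-Wunsch steps (initialization, matrix fill, traceback). The function name and the tie-breaking order in the traceback (diagonal first, then horizontal, then vertical) are assumptions made here; with a different order, another equally scoring alignment may be returned.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment sketch: a runs along the top, b down the left side."""
    rows, cols = len(b) + 1, len(a) + 1
    F = [[0] * cols for _ in range(rows)]
    # Initialization: first row and column hold multiples of the gap penalty.
    for j in range(1, cols):
        F[0][j] = j * gap
    for i in range(1, rows):
        F[i][0] = i * gap
    # Matrix fill: best of the diagonal, vertical and horizontal moves.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[j - 1] == b[i - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] + gap, F[i][j - 1] + gap)
    # Traceback from the lower-right corner to the origin.
    top, bottom = [], []
    i, j = len(b), len(a)
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[j - 1] == b[i - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            top.append(a[j - 1]); bottom.append(b[i - 1]); i -= 1; j -= 1
        elif j > 0 and F[i][j] == F[i][j - 1] + gap:
            top.append(a[j - 1]); bottom.append("-"); j -= 1
        else:
            top.append("-"); bottom.append(b[i - 1]); i -= 1
    return F, "".join(reversed(top)), "".join(reversed(bottom))


F, top, bottom = needleman_wunsch("AAAC", "AGC")
print(F[-1][-1])  # -1
print(top)        # AAAC
print(bottom)     # -AGC
```

The returned matrix `F` holds the same values as the table in the example, row by row.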
