CSCE555 Bioinformatics

Lecture 6: Sequence Alignment (Part III)

Meeting: MW 4:00PM-5:15PM SWGN2A21


Instructor: Dr. Jianjun Hu
Course page: http://www.scigen.org/csce555

University of South Carolina


Department of Computer Science and Engineering
2008 www.cse.sc.edu.
Roadmap
• Hash-function-based quick search
• Heuristic algorithms: FASTA, BLAST
• Multiple sequence alignment algorithm: Clustal W
• Summary

07/30/20 2
Hash Table for Quick Search
[Figure: a table of (name, age) records (Smith 18, Alice 19, Bob 18, Lucy 28, Alicia 32, Dan 30, Ron 32, George 32), shown alongside three lookup costs: O(n), O(log(n)), and O(1).]
Searching
Consider the problem of searching an array for a given value
◦ If the array is not sorted, the search requires O(n) time
  ▪ If the value isn’t there, we need to search all n elements
  ▪ If the value is there, we search n/2 elements on average
◦ If the array is sorted, we can do a binary search
  ▪ A binary search requires O(log n) time
  ▪ About equally fast whether the element is found or not
◦ It doesn’t seem like we could do much better
  ▪ How about an O(1), that is, constant-time search?
  ▪ We can do it if the array is organized in a particular way

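The two search strategies compared on this slide can be sketched in C as follows. This is a minimal illustration on an `int` array; the function names `linear_search` and `binary_search` are hypothetical, and the binary search assumes the array is sorted in ascending order.

```c
#include <stddef.h>

/* Linear search: O(n). Returns the index of key, or -1 if it is absent. */
long linear_search(const int *a, size_t n, int key)
{
    for (size_t i = 0; i < n; i++)
        if (a[i] == key)
            return (long)i;
    return -1;
}

/* Binary search on an ascending array: O(log n). Returns the index or -1. */
long binary_search(const int *a, size_t n, int key)
{
    size_t lo = 0, hi = n;              /* search the half-open range [lo, hi) */

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (a[mid] == key)
            return (long)mid;
        if (a[mid] < key)
            lo = mid + 1;               /* key can only be to the right */
        else
            hi = mid;                   /* key can only be to the left */
    }
    return -1;
}
```

Both functions return the same index on a sorted array; only the number of elements inspected differs.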
Hashing
Suppose we were to come up with a “magic
function” that, given a value to search for,
would tell us exactly where in the array to look
◦ If it’s in that location, it’s in the array
◦ If it’s not in that location, it’s not in the array
This function is called a hash function because it
“makes hash” of its inputs

(Magic) Hashing Function
A hash function is a function that:
◦ When applied to an Object, returns a number
◦ When applied to equal Objects, returns the
same number for each
◦ When applied to unequal Objects, is very
unlikely to return the same number for each
Hash functions turn out to be very
important for searching, that is, looking
things up fast

Example (ideal) hash function
Suppose our hash function gave us the following values:
hashCode("apple") = 5
hashCode("watermelon") = 3
hashCode("grapes") = 8
hashCode("cantaloupe") = 7
hashCode("kiwi") = 0
hashCode("strawberry") = 9
hashCode("mango") = 6
hashCode("banana") = 2

Resulting table (index, contents):
0 kiwi
1 (empty)
2 banana
3 watermelon
4 (empty)
5 apple
6 mango
7 cantaloupe
8 grapes
9 strawberry
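Example of Hash Function
/* Hash a NUL-terminated string into a bucket index in [0, size). */
static int hash_number(const char *key, int size)
{
    int hash = 0;

    if (key) {
        const unsigned char *ptr = (const unsigned char *)key;

        /* Multiply-and-add over the bytes, reduced modulo size at each step. */
        for (; *ptr; ptr++)
            hash = (hash * 3 + *ptr) % size;
    }
    return hash;
}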
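As a rough sketch of how such a hash function supports fast lookup, the C code below builds a small chained hash table on top of it. The table size, the `struct entry` layout, and the names `buckets`, `table_put`, and `table_get` are assumptions made for illustration; they do not come from the slides. Collisions are handled by chaining, which is one common choice among several.

```c
#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 11               /* hypothetical number of buckets */

struct entry {
    const char *key;                /* not copied: the caller keeps it alive */
    int value;
    struct entry *next;             /* next entry in the same bucket */
};

struct entry *buckets[TABLE_SIZE];  /* zero-initialized: all buckets empty */

/* Same multiply-and-add hash as on the slide. */
int hash_number(const char *key, int size)
{
    int hash = 0;
    if (key) {
        const unsigned char *ptr = (const unsigned char *)key;
        for (; *ptr; ptr++)
            hash = (hash * 3 + *ptr) % size;
    }
    return hash;
}

/* Insert or update a key. Returns 0 on success, -1 if malloc fails. */
int table_put(const char *key, int value)
{
    int slot = hash_number(key, TABLE_SIZE);
    struct entry *e;

    for (e = buckets[slot]; e; e = e->next) {
        if (strcmp(e->key, key) == 0) {
            e->value = value;       /* key already present: overwrite */
            return 0;
        }
    }
    e = malloc(sizeof *e);
    if (!e)
        return -1;
    e->key = key;
    e->value = value;
    e->next = buckets[slot];        /* push onto the front of the chain */
    buckets[slot] = e;
    return 0;
}

/* Look up a key. Returns 1 and stores the value in *out if found, else 0. */
int table_get(const char *key, int *out)
{
    struct entry *e;

    for (e = buckets[hash_number(key, TABLE_SIZE)]; e; e = e->next) {
        if (strcmp(e->key, key) == 0) {
            *out = e->value;
            return 1;
        }
    }
    return 0;
}
```

A lookup inspects only the entries in one bucket, so it takes roughly constant time as long as the chains stay short.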
FASTA (Fast Alignment)

BLAST (Basic Local Alignment Search Tool)
• Approach (BLAST) (Altschul et al. 1990, developed by NCBI)
◦ View sequences as sequences of short words (k-tuples)
  ▪ DNA: 11 bases, protein: 3 amino acids
◦ Create a hash table of neighborhood (closely matching) words
◦ Use statistics to set the threshold for “closeness”
◦ Start from exact matches to neighborhood words
• Motivation
◦ Good alignments should contain many close matches
◦ Statistics can determine which matches are significant
  ▪ Much more sensitive than % identity
◦ Hashing can look up each word match in O(1) time
◦ Extending matches in both directions finds the alignment
  ▪ Yields high-scoring/maximum segment pairs (HSP/MSP)

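The word-indexing step of the BLAST approach can be sketched in C as follows. This is a simplified illustration, not BLAST itself: it handles exact DNA words only (no neighborhood words, score threshold, or extension step), it uses a direct-address table indexed by the 2-bit-encoded word instead of a general hash table, and the names `base_code`, `encode_word`, and `count_seed_hits` are hypothetical. Only uppercase A, C, G, T are treated as valid bases.

```c
#include <stdlib.h>
#include <string.h>

/* Map a base to 2 bits; returns -1 for anything other than A, C, G, T. */
int base_code(char b)
{
    switch (b) {
    case 'A': return 0;
    case 'C': return 1;
    case 'G': return 2;
    case 'T': return 3;
    default:  return -1;
    }
}

/* Encode the k bases starting at s as an integer in [0, 4^k); -1 if invalid. */
long encode_word(const char *s, int k)
{
    long code = 0;
    for (int i = 0; i < k; i++) {
        int c = base_code(s[i]);
        if (c < 0)
            return -1;
        code = code * 4 + c;
    }
    return code;
}

/*
 * Count exact k-word matches (seed hits) between query and subject, that is,
 * the number of (query position, subject position) pairs whose k-words agree.
 * Returns -1 if k is outside 1..11 or memory cannot be allocated.
 */
long count_seed_hits(const char *query, const char *subject, int k)
{
    if (k < 1 || k > 11)
        return -1;

    size_t qlen = strlen(query), slen = strlen(subject);
    size_t nwords = (size_t)1 << (2 * k);     /* 4^k possible words */
    int *occ = calloc(nwords, sizeof *occ);   /* occ[w] = count of w in query */
    long hits = 0;

    if (!occ)
        return -1;

    /* Index every k-word of the query. */
    for (size_t i = 0; i + (size_t)k <= qlen; i++) {
        long w = encode_word(query + i, k);
        if (w >= 0)
            occ[w]++;
    }
    /* Scan the subject: each word lookup is a single O(1) table access. */
    for (size_t j = 0; j + (size_t)k <= slen; j++) {
        long w = encode_word(subject + j, k);
        if (w >= 0)
            hits += occ[w];
    }
    free(occ);
    return hits;
}
```

For example, `count_seed_hits("ACGTACGT", "TTACGTTT", 4)` returns 3: the subject word TACG matches one query position and ACGT matches two.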
BLAST (Basic Local Alignment Search Tool)

Multiple Sequence Alignment
• Alignment containing multiple DNA / protein sequences
• Look for conserved regions → similar function
• Example:
#Rat      ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT
#Mouse    ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT
#Rabbit   ATGGTGCATCTGTCCAGT---GAGGAGAAGTCTGC
#Human    ATGGTGCACCTGACTCCT---GAGGAGAAGTCTGC
#Opossum  ATGGTGCACTTGACTTTT---GAGGAGAAGAACTG
#Chicken  ATGGTGCACTGGACTGCT---GAGGAGAAGCAGCT
#Frog     ---ATGGGTTTGACAGCACATGATCGT---CAGCT

Multiple Sequence Alignment: Why?
• Identify highly conserved residues
◦ Likely to be essential sites for structure/function
◦ More precision from multiple sequences
◦ Better structure/function prediction and better pairwise alignments
• Building gene/protein families
◦ Use conserved regions to guide search
• Basis for phylogenetic analysis
◦ Infer evolutionary relationships between genes
• Develop primers & probes
◦ Use conserved regions to develop
  ▪ Primers for PCR
  ▪ Probes for DNA microarrays

Multiple Alignment Model

Input: N sequences
X1 = x11,…,x1m1
X2 = x21,…,x2m2
…
XN = xN1,…,xNmN

Model: scoring function s: A → ℝ
Possible alignments of all Xi’s: A = {a1,…,ak}
Find the best alignment(s):
$$a^* = \arg\max_{a} s\big(a(X_1, X_2, \ldots, X_N)\big)$$
Example from the slide: S(a*) = 21

Q1: How should we define s?
Q2: How should we define A?
Q3: How can we find a* quickly?
Q4: Is the alignment biologically meaningful?

Minimum Entropy Scoring
• Intuition:
◦ A perfectly aligned column has one single symbol (least uncertainty)
◦ A poorly aligned column has many distinct symbols (high uncertainty)
$$S(m_i) = -\sum_{a} p_{ia} \log p_{ia}, \qquad p_{ia} = \frac{c_{ia}}{\sum_{a'} c_{ia'}}$$
where $c_{ia}$ is the count of symbol a in column i.
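A small worked example may help make the entropy score concrete. It assumes natural logarithms and two hypothetical four-row columns, neither of which comes from the slides:

```latex
\begin{align*}
\text{column } (A, A, A, C):\quad & p_{iA} = \tfrac{3}{4},\quad p_{iC} = \tfrac{1}{4} \\
& S(m_i) = -\left(\tfrac{3}{4}\ln\tfrac{3}{4} + \tfrac{1}{4}\ln\tfrac{1}{4}\right) \approx 0.562 \\
\text{column } (A, A, A, A):\quad & p_{iA} = 1,\quad S(m_i) = -1 \cdot \ln 1 = 0
\end{align*}
```

The fully conserved column scores 0, the minimum, while the mixed column scores higher, matching the intuition that lower entropy means better alignment.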

Multidimensional Dynamic Programming
Assumptions: (1) columns are independent (2) linear gap cost
$$S(m) = G + \sum_{i} S(m_i), \qquad G = \gamma(g) = dg$$
$\alpha_{i_1, i_2, \ldots, i_N}$ = maximum score of an alignment up to the subsequences ending with $x^1_{i_1}, x^2_{i_2}, \ldots, x^N_{i_N}$

$$\alpha_{0,0,\ldots,0} = 0$$
$$\alpha_{i_1, i_2, \ldots, i_N} = \max \begin{cases}
\alpha_{i_1-1,\, i_2-1,\, \ldots,\, i_N-1} + S(x^1_{i_1}, x^2_{i_2}, \ldots, x^N_{i_N}) \\
\alpha_{i_1,\, i_2-1,\, \ldots,\, i_N-1} + S(-, x^2_{i_2}, \ldots, x^N_{i_N}) \\
\alpha_{i_1-1,\, i_2,\, \ldots,\, i_N-1} + S(x^1_{i_1}, -, \ldots, x^N_{i_N}) \\
\quad\vdots \\
\alpha_{i_1,\, i_2,\, \ldots,\, i_N-1} + S(-, -, \ldots, x^N_{i_N}) \\
\quad\vdots \\
\alpha_{i_1-1,\, i_2,\, \ldots,\, i_N} + S(x^1_{i_1}, -, \ldots, -)
\end{cases}$$
Alignment: a path from (0, 0, …, 0) to (|x1|, …, |xN|)

We can vary both the model and the alignment strategies

This is an NP-complete problem with high complexity.
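To see why the exact dynamic program is impractical, a rough cost estimate can be worked out. It assumes N sequences that all have the same length L (a simplification not stated on the slide) and counts one column-score evaluation per predecessor cell:

```latex
\begin{align*}
\text{cells} &= (L+1)^N, \qquad \text{predecessors per cell} = 2^N - 1 \\
\text{work} &\approx (2^N - 1)\,(L+1)^N \ \text{column-score evaluations} \\
N = 3,\ L = 100 &:\quad 7 \times 101^{3} \approx 7.2 \times 10^{6} \\
N = 10,\ L = 100 &:\quad 1023 \times 101^{10} \approx 1.1 \times 10^{23}
\end{align*}
```

The cost grows exponentially in the number of sequences, which is why heuristic methods are used for more than a handful of sequences.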

Approximate Algorithms for Multiple Alignment
• Two major methods (but it remains a worthy research topic)
◦ Reduce a multiple alignment to a series of pairwise alignments and then combine the results (e.g., Feng-Doolittle alignment)
◦ Use HMMs (Hidden Markov Models)
• Feng-Doolittle alignment (4 steps)
◦ Compute all possible pairwise alignments
◦ Convert alignment scores to distances
◦ Construct a “guide tree” by clustering
◦ Progressive alignment based on the guide tree (bottom up)

Progressive Alignment
How to Align One Sequence to an Existing Alignment?
Add a sequence to an existing group:
We want to align a sequence s: CGAAATC to an existing alignment
s1 AG-AT-
s2 -GAATC
The highest-scoring pairwise alignment is
s2 -G-AATC
s  CGAAATC
Hence, s is merged into the group alignment as:
s1 AG--AT-   (add gaps if needed)
s2 -G-AATC   (fixed)
s  CGAAATC
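The "add gaps if needed" step can be sketched in C as follows. The function name `regap_row` and its interface are hypothetical. It assumes the pairwise alignment only inserts extra `-` characters into the row it was aligned against (s2 in the example) and then copies those new gap columns into another row of the existing alignment. When a new gap sits next to an existing gap in that row, the sketch simply treats the first `-` as the existing column.

```c
#include <stddef.h>

/*
 * Re-gap one row of an existing alignment after another row of the same
 * alignment (the "anchor") picked up new gaps in a pairwise alignment.
 *   anchor_old: anchor row as it appears in the existing alignment
 *   anchor_new: the same row after the pairwise alignment (extra '-' only)
 *   row:        another row of the existing alignment
 *   out:        receives the re-gapped row; out_size is its capacity in bytes
 * Returns 0 on success, -1 if the inputs are inconsistent or out is too small.
 */
int regap_row(const char *anchor_old, const char *anchor_new,
              const char *row, char *out, size_t out_size)
{
    size_t i = 0;                           /* column in the existing alignment */
    size_t j = 0;                           /* column in the merged alignment */

    for (; anchor_new[j] != '\0'; j++) {
        if (j + 1 >= out_size)
            return -1;                      /* no room for this char plus NUL */
        if (anchor_old[i] != '\0' && anchor_new[j] == anchor_old[i]) {
            if (row[i] == '\0')
                return -1;                  /* row is shorter than the anchor */
            out[j] = row[i];                /* existing column: copy it */
            i++;
        } else if (anchor_new[j] == '-') {
            out[j] = '-';                   /* new column: pad with a gap */
        } else {
            return -1;                      /* not anchor_old plus gaps */
        }
    }
    if (anchor_old[i] != '\0' || out_size == 0)
        return -1;                          /* old columns left unmatched */
    out[j] = '\0';
    return 0;
}
```

With the slide's example, calling `regap_row("-GAATC", "-G-AATC", "AG-AT-", buf, sizeof buf)` leaves `AG--AT-` in `buf`, which is the re-gapped s1 shown above.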
How to Align a Group to Another Group?
Two groups:
• S1 ATTGCCATT--
• S2 ATC-CAATTTT

• S3 ATGGCCATT
• S4 ATCTTC-TT
The highest-scoring pairwise alignment between the two groups is S1 with S3, so it is used for aligning the two groups as
S2 ATC-CAATTTT
S1 ATTGCCATT--
S3 ATGGCCATT--
S4 ATCTTC-TT--
Limitations of Feng-Doolittle Alignment
• Problems of Feng-Doolittle alignment
◦ All alignments are completely determined by the pairwise alignments (restricted search space)
◦ No backtracking (a subalignment is “frozen”)
  ▪ No way to correct an early mistake
  ▪ Non-optimality: mismatches and gaps in highly conserved regions should be penalized more, but we can’t tell where the highly conserved regions are early in the process
• Iterative refinement
◦ Re-assign a sequence to a different cluster/profile
◦ Repeat this a fixed number of times or until the score converges
◦ Essentially enlarges the search space

Clustal W: A Multiple Alignment Tool
• CLUSTAL and its variants are software packages often used to produce multiple alignments
• Essentially follows Feng-Doolittle
◦ Do pairwise alignment (dynamic programming)
◦ Do score conversion/normalization (Kimura’s model)
◦ Construct a guide tree (neighbour-joining clustering)
◦ Progressively align all sequences using profile alignment
• Offers the option of using substitution matrices such as BLOSUM or PAM
• Many heuristics

One example of MSA using ClustalW
More Advanced MSA Algorithms
• Kalign
• MAFFT (Multiple Alignment using Fast Fourier Transform)
• MUSCLE (MUltiple Sequence Comparison by Log-Expectation): claimed to achieve both better average accuracy and better speed than ClustalW2 or T-Coffee
• T-Coffee: allows you to combine results obtained with several alignment methods
Measuring Alignment Significance
The statistical significance of an
alignment score is used to try to determine
whether an alignment is the result of homology
or just random chance.
The E-value of an alignment score is the
expected number of unrelated sequences
in a database that would have a score at
least as good.

E-values and p-values
The E-value of a particular score is
determined by multiplying the number of
sequences in the database, n, by the
p-value of the score.
The p-value of score X is the probability
of a single random alignment having
score X or larger.
E-value(X) = n • p-value(X)

Computing p-values
• To compute the p-value of X, we must know how random scores are distributed.
• The p-value of X is equal to the area under the distribution curve to the right of X.
• For ungapped local alignments, the distribution can be computed analytically.
• For gapped alignments, it must be estimated empirically.
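One simple way to estimate a p-value empirically, and then turn it into an E-value, is sketched below in C. It assumes we already have an array of scores from random alignments (for example, from aligning shuffled sequences; how those are generated is outside this sketch). The function names `empirical_p_value` and `e_value` are hypothetical.

```c
#include <stddef.h>

/*
 * Empirical p-value of score x: the fraction of random (null) scores that
 * are greater than or equal to x. This approximates the area under the
 * score distribution to the right of x.
 */
double empirical_p_value(const double *random_scores, size_t n_random, double x)
{
    size_t at_least = 0;

    if (n_random == 0)
        return 1.0;                 /* no data: report the conservative value */
    for (size_t i = 0; i < n_random; i++)
        if (random_scores[i] >= x)
            at_least++;
    return (double)at_least / (double)n_random;
}

/* E-value(X) = n * p-value(X), with n the number of database sequences. */
double e_value(double p_value, double n_database)
{
    return n_database * p_value;
}
```

The resolution of this estimate is limited by the number of random scores: with n_random samples, the smallest non-zero p-value it can report is 1/n_random.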

Summary
Hashing for quick search
BLAST and FASTA
Progressive multiple sequence alignment
Testing significance of alignments
Next Lecture
Profiles and HMMs
Reading:
◦ Textbook (CG) chapter 4
◦ Textbook (EB) chapter 6
