W8 Lecture

CME 4425
Introduction to Bioinformatics Algorithms
Zerrin Işık
zerrin@cs.deu.edu.tr
Sequence Database Searching
What is database searching*
Goal: find similar (homologous) sequences of a query
sequence in a sequence of database
Input: query sequence & database
Output: hits (pairwise alignments)

Basics of database searching
Core: pairwise alignment algorithm
Speed (fast sequence comparison)
Relevance of the search results (statistical tests)
Recovering all information of interest
The results depend on the search parameters like gap
penalty, scoring matrix.
Sometimes searches with more than one matrix should be
preformed
Specificity and sensitivity
The different strategies for
database searching
Alignment
Query Database
methods
BLAST
Sequence FASTA Sequence Faster
SW
MSA Psi-Blast Sequence
Profile
MSA MSA Higher
alignment sensitivity
HMM
Sequence hmmer
models
What program to use for searching?
BLAST is fastest and easily accessed on the Web

A suite; BLASTP, BLASTN, BLASTX
lag behind the development of sequencing techniques
FASTA
more sensitive for DNA-DNA comparisons
FASTX and TFASTX can find similarities in sequences with
frameshifts
Smith-Waterman is slower, but more sensitive
known as a “rigorous” or “exhaustive” search
SSEARCH in GCG and standalone FASTA
FASTA
Derived from logic of the dot plot
compute best diagonals from all frames of alignment
Word method looks for exact matches between words in
query and test sequence
hash tables (fast computer technique)
DNA words are usually 6 bases
protein words are 1 or 2 amino acids
only searches for diagonals in region of word
matches = faster searching
FASTA algorithm
Makes longest diagonal
After all diagonals found, tries to join diagonals by adding
gaps
Computes alignments in regions of best diagonals

FASTA alignments
FASTA results: histogram
!!SEQUENCE_LIST 1.0
(Nucleotide) FASTA of: b2.seq from: 1 to: 693 December 9, 2002 14:02
TO: /u/browns02/Victor/Search-set/*.seq Sequences: 2,050 Symbols:
913,285 Word Size: 6
Searching with both strands of the query.
Scoring matrix: GenRunData:fastadna.cmp
Constant pamfactor used
Gap creation penalty: 16 Gap extension penalty: 4
Histogram Key:
Each histogram symbol represents 4 search set sequences
Each inset symbol represents 1 search set sequences
z-scores computed from opt scores
z-score obs exp
(=) (*)
< 20 0 0:
22 0 0:
24 3 0:=
26 2 0:=
28 5 0:==
30 11 3:*==
32 19 11:==*==
34 38 30:=======*==
36 58 61:===============*
38 79 100:==================== *
40 134 140:==================================*
42 167 171:==========================================*
44 205 189:===============================================*====
46 209 192:===============================================*=====
48 177 184:=============================================*
FASTA results: list
The best scores are: init1 initn opt z-sc E(1018780)..
SW:PPI1_HUMAN Begin: 1 End: 269

! Q00169 homo sapiens (human). phosph... 1854 1854 1854 2249.3 1.8e-117
SW:PPI1_RABIT Begin: 1 End: 269
! P48738 oryctolagus cuniculus (rabbi... 1840 1840 1840 2232.4 1.6e-116
SW:PPI1_RAT Begin: 1 End: 270
! P16446 rattus norvegicus (rat). pho... 1543 1543 1837 2228.7 2.5e-116
SW:PPI1_MOUSE Begin: 1 End: 270
! P53810 mus musculus (mouse). phosph... 1542 1542 1836 2227.5 2.9e-116
SW:PPI2_HUMAN Begin: 1 End: 270
! P48739 homo sapiens (human). phosph... 1533 1533 1533 1861.0 7.7e-96
SPTREMBL_NEW:BAC25830 Begin: 1 End: 270
! Bac25830 mus musculus (mouse). 10, ... 1488 1488 1522 1847.6 4.2e-95
SP_TREMBL:Q8N5W1 Begin: 1 End: 268
! Q8n5w1 homo sapiens (human). simila... 1477 1477 1522 1847.6 4.3e-95
SW:PPI2_RAT Begin: 1 End: 269
! P53812 rattus norvegicus (rat). pho... 1482 1482 1516 1840.4 1.1e-94
FASTA results: alignment
SCORES Init1: 1515 Initn: 1565 Opt: 1687 z-score: 1158.1 E(): 2.3e-58
>>GB_IN3:DMU09374 (2038 nt)
initn: 1565 init1: 1515 opt: 1687 Z-score: 1158.1 expect(): 2.3e-58 66.2% identity in 875
nt overlap
(83-957:151-1022)
60 70 80 90 100 110
u39412.gb_pr CCCTTTGTGGCCGCCATGGACAATTCCGGGAAGGAAGCGGAGGCGATGGCGCTGTTGGCC
|| ||| | ||||| | ||| |||||
DMU09374 AGGCGGACATAAATCCTCGACATGGGTGACAACGAACAGAAGGCGCTCCAACTGATGGCC
130 140 150 160 170 180
120 130 140 150 160 170

u39412.gb_pr GAGGCGGAGCGCAAAGTGAAGAACTCGCAGTCCTTCTTCTCTGGCCTCTTTGGAGGCTCA
||||||||| || ||| | | || ||| | || || ||||| ||
DMU09374 GAGGCGGAGAAGAAGTTGACCCAGCAGAAGGGCTTTCTGGGATCGCTGTTCGGAGGGTCC
190 200 210 220 230 240
180 190 200 210 220 230

u39412.gb_pr TCCAAAATAGAGGAAGCATGCGAAATCTACGCCAGAGCAGCAAACATGTTCAAAATGGCC
||| | ||||| || ||| |||| | || | |||||||| || ||| ||
DMU09374 AACAAGGTGGAGGACGCCATCGAGTGCTACCAGCGGGCGGGCAACATGTTTAAGATGTCC
250 260 270 280 290 300
240 250 260 270 280 290

u39412.gb_pr AAAAACTGGAGTGCTGCTGGAAACGCGTTCTGCCAGGCTGCACAGCTGCACCTGCAGCTC
|||||||||| ||||| | |||||| |||| ||| || ||| || |
DMU09374 AAAAACTGGACAAAGGCTGGGGAGTGCTTCTGCGAGGCGGCAACTCTACACGCGCGGGCT
310 320 330 340 350 360
FASTA on the web
Many websites offer FASTA searches
Various databases and various other services
Be sure to use FASTA 3
Each server has its limits
Be aware that you are depending on the kindness of
strangers.
BLAST
[BLAST= Basic Local Alignment Search Tool]
The central idea of the BLAST algorithm is that a statistically
significant alignment is likely to contain a high-scoring pair
of aligned words.
Uses word matching like FASTA
Similarity matching of words (3 aa’s, 11 bases)
Does not require identical words.
If no words are similar, then no alignment
won’t find matches for very short sequences
Original BLAST does not handle gaps well; the “gapped”
blast is better
Word match
Extend hits
BLAST: word matching
MEAAVKEEISVEDEAVDKNI
MEA
EAA Break query
AAV
AVK into words:
VKE
KEE
EEI
EIS
ISV Break database
...
sequences
into words:
Compare word lists
by Hashing
(allow near matches)
BLAST: extend hits
Use two word matches as anchors to build an alignment between

the query and a database sequence.
Gapped BLAST algorithm
The NCBIs BLAST website now both use “gapped BLAST”
This algorithm is more complex than the original BLAST
It requires two word matches close to each other on a pair of

sequences (i.e., with a gap) before it creates an alignment
allow gaps (using Smith-Waterman algorithm constrained by a
band width).
The region of the path graph explored
when seeded by the alignment of a
pair of residues
Ref: Gapped BLAST and PSI-BLAST: a

new generation of protein database search
programs. NAR 25(17):3389, 1997
Statistical tests
Evaluate the probability of an event taking place by chance.
Very little is known about the random distribution of
optimal global alignment scores
Statistics for the scores of local alignments, are well
understood (particularly true for local alignments lacking
gaps)
P-value
Randomized data
Distribution under the same setup
BLAST statistics
E-value is equivalent to standard P-value (based on Karlin-Altschul
theorem)
Karlin-Alschul equation is probably one of the most recognized
equation in bioinformatics.
E-value represents the likelihood that the observed alignment is due to
chance alone
E-value: the number of alignments expected by chance (E) during a
sequence database search. E = 1 indicates that an alignment this good
would happen by chance with any random sequence searched against
the same database.
How to compute BLAST E-value
Raw score (S)
Bit score (S’)

m: the effective query length
n: the effective database length
E
P (the probability of getting at least one HSP)

Which BLAST?
Program Query Database Comparison Common use
blastn DNA DNA DNA level Seek identical DNA

sequences and
splicing patterns
blastp Protein Protein Protein level Find homologous

proteins
blastx DNA Protein Protein level Analyze new DNA

to find genes and
seek homologous
proteins
tblastn Protein DNA Protein level Search for genes in

unannotated DNA
tblastx DNA DNA Protein level Discover gene

structure
BLAST+ package has different user interface

Which database?
For blastp (blastx)
Non-redundant protein sequences (nr)
Reference proteins (refseq_protein)
Swissprot protein sequences (swissprot)
Patented protein sequences (pat)
Protein Data Bank proteins (pdb)
Environmental samples (env_nr)
For blastn (tblastx)
Human genomic plus transcript (not for tblastx)
Mouse genomic plus transcript (not available for tblastx)
Nucleotide collection (nr/nt)
Etc.
BLAST is approximate
BLAST makes similarity searches very quickly because it
takes shortcuts.
looks for short, nearly identical “words” (11 bases)
It also makes errors

misses some important similarities
makes many incorrect matches
easily fooled by repeats or skewed composition
Interpretation of BLAST hits
very low E values (e-100) are homologs
moderate E values are related genes
long list of gradually declining of E values indicates a large

gene family
long regions of moderate similarity are more significant than

short regions of high identity
Is the hit with smallest E-value the
closest sequence to the query?
Not necessarily
Some people argue that stricter phylogeny analysis is needed

for further conclusion
Why? Different evolutionary rates, etc.
Biological relevance
It is up to you, the biologist to scrutinize these alignments
and determine if they are significant.
Were you looking for a short region of nearly identical
sequence or a larger region of general similarity?
Are the mismatches conservative ones?
Are the matching regions important structural components
of the genes or just introns and flanking regions?
20 tips to improve your BLAST searches
§ “Don't Use the Default Parameters” (but what parameters?)
§ “Treat BLAST Searches as Scientific Experiments
§ “View BLAST Reports Graphically” (depends!)
§ “Be Skeptical of Hypothetical Proteins”
§ “Look for Stop Codons and Frame-Shifts to find Pseudo-Genes”
§ “Parse BLAST Reports with Bioperl” (your own python, or perl,
program)
§ “How to Lie with BLAST Statistics” (“Lies, dammed lies, and
statistics”)
§ …
Ref: BLAST by Ian Korf; Mark Yandell; Joseph Bedell, 2003.

Borderline similarity
What to do with matches with E values that are not so
impressive? (e.g., E values > 0.01, smaller numbers are
more significant)
this is the “Twilight Zone”
retest these sequences and look for related hits (not just your
original query sequence)
similarity is (is not) transitive:
if A~B and B~C, then A~C
Distant homology detection
Use more data (MSA)
Use advanced approaches (like HMM)
PSI-BLAST
Position Specific Iterated BLAST (PSI-BLAST)
Basic idea
Use results from BLAST query to construct a profile matrix
(or PSSM)
Search database with profile instead of query sequence
Iterate
Position-specific scoring matrices are an extension of
substitution scoring matrices
PSSM--Position Specific Scoring Matrix
3.0 Match A to position 3
Match D to position 3
-1.0
Match D to position 2
distribution of amino acids + pseudo-counts

Representing a profile as a logo
Logos are used to show the residue preferences or
conservation at particular positions
Based on information theory
helix-turn-helix motif from the CAP family of homodimeric DNA binding proteins
http://weblogo.berkeley.edu/examples.html
Profile-profile alignment (alignment
of alignments)
Inputs: two MSAs (profiles)

Alignment of two MSAs (similar to pairwise sequence
alignment)
Many existing programs
Scoring functions vary
Dot product (FFAS)
FFAS scoring function
Dot product of two amino acid “frequency” vectors

Sequence identity scoring zones
>25-30%: homology zone
15-25%: twilight zone
<15%: midnight zone
Weak sequence similarity detection is still not solved!
And sequence similarity != structural similarity (for proteins)

!= functional similarity
Readings
Chapter 4 (Pairwise Sequence Alignment and Database
Searching)

W8 Lecture

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

W8 Lecture

Uploaded by

Copyright:

Available Formats

CME 4425

Introduction to Bioinformatics Algorithms

 Input: query sequence & database

 Output: hits (pairwise alignments)

MSA Psi-Blast Sequence

 BLAST is fastest and easily accessed on the Web

 Computes alignments in regions of best diagonals

SW:PPI1_HUMAN Begin: 1 End: 269

120 130 140 150 160 170

180 190 200 210 220 230

240 250 260 270 280 290

Use two word matches as anchors to build an alignment between

 This algorithm is more complex than the original BLAST

 It requires two word matches close to each other on a pair of

Ref: Gapped BLAST and PSI-BLAST: a

Bit score (S’)

P (the probability of getting at least one HSP)

blastn DNA DNA DNA level Seek identical DNA

blastp Protein Protein Protein level Find homologous

blastx DNA Protein Protein level Analyze new DNA

tblastn Protein DNA Protein level Search for genes in

tblastx DNA DNA Protein level Discover gene

BLAST+ package has different user interface

 It also makes errors

 moderate E values are related genes

 long list of gradually declining of E values indicates a large

 long regions of moderate similarity are more significant than

 Some people argue that stricter phylogeny analysis is needed

Ref: BLAST by Ian Korf; Mark Yandell; Joseph Bedell, 2003.

3.0 Match A to position 3

distribution of amino acids + pseudo-counts

 Inputs: two MSAs (profiles)

Dot product of two amino acid “frequency” vectors

Weak sequence similarity detection is still not solved!

And sequence similarity != structural similarity (for proteins)

You might also like

Input: query sequence & database

Output: hits (pairwise alignments)

BLAST is fastest and easily accessed on the Web

Computes alignments in regions of best diagonals

This algorithm is more complex than the original BLAST

It requires two word matches close to each other on a pair of

It also makes errors

moderate E values are related genes

long list of gradually declining of E values indicates a large

long regions of moderate similarity are more significant than

Some people argue that stricter phylogeny analysis is needed

Inputs: two MSAs (profiles)