
The Blast and FastA algorithms

BLOSUM 62
•Global alignments that do not include gaps: a 200 PAM matrix for
sequences that are thought to be related.
•"Unknown sequences": a 120 PAM matrix was the best compromise.
•Local alignment methods: PAM40, PAM120 and PAM250. The lower PAM
matrices (40-120) find short alignments of highly similar sequences,
while the higher PAM matrices (120-250) find longer, weaker local
alignments.
•Standard Blast: overall, the BLOSUM 62 matrix is the most effective.
•Each of the other substitution matrices nevertheless performs better
than BLOSUM 62 for a proportion of the families.
Algorithms

Comparing sequences by dot matrix display, or by any other standard
method of sequence comparison, is a very slow process. Therefore the
Blast and FastA approximation algorithms are most commonly used.
Blast and Fasta create alignments

•In an optimal alignment, non-identical characters and gaps are placed
so as to bring as many identical or similar characters as possible into
columns.
•Two types of sequence alignment are used: global and local.
Global alignment.
An attempt is made to align the entire sequences, matching as many
characters as possible.
The alignment is stretched over the entire
sequence lengths to include as many matching
amino acids as possible up to and including the
sequence ends. Although there is an obvious
region of identity in this example (the sequence
FGKG), a global alignment may not align such
regions in order to favour matching more amino
acids along the entire sequence lengths.
LGPSTKQFGKGSSSRIWDN
|      ||||   |  |     global alignment
LNQIERSFGKGAIMRLGDA
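The global strategy described above can be sketched with the classic Needleman-Wunsch dynamic programming scheme. This is a minimal illustration, not the method used by any particular program: the function name `global_align` and the scoring values (match +1, mismatch -1, linear gap -2) are assumptions chosen for the example, where a real search would use a substitution matrix such as PAM or BLOSUM.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch sketch: returns (score, aligned_a, aligned_b).

    Scoring values are illustrative assumptions, not a real matrix.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j] end to end
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,   # align two residues
                              score[i - 1][j] + gap,     # gap in b
                              score[i][j - 1] + gap)     # gap in a
    # Trace back from the bottom-right corner to the top-left corner,
    # so both sequence ends are always included.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + s:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = global_align("LGPSTKQFGKGSSSRIWDN", "LNQIERSFGKGAIMRLGDA")
print(score)
print(top)
print(bottom)
```

Because the traceback always runs from one corner of the table to the other, the result spans both sequences completely, which is the defining property of a global alignment.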
Local alignment.
The alignment tends to stop at the ends of
regions of identity or strong similarity. A much
higher priority is given to finding these local
regions than to extending the alignment to
include more neighbouring amino acid pairs.
Dashes indicate sequence not included in the
alignment. This type of alignment favours
finding conserved amino acid motifs in related
protein sequences.
-------FGKG--------
       ||||             local alignment
-------FGKG--------
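For comparison, a local alignment can be sketched with the Smith-Waterman scheme, which differs from the global case in two ways: a cell score never drops below zero, and the traceback starts at the highest-scoring cell and stops at the first zero. The function name `local_align` and the scoring values (match +2, mismatch -1, gap -2) are assumptions for illustration only.

```python
def local_align(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman sketch: returns (score, aligned_a, aligned_b).

    Scoring values are illustrative assumptions, not a real matrix.
    """
    n, m = len(a), len(b)
    h = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 lets an alignment restart instead of carrying a bad prefix.
            h[i][j] = max(0,
                          h[i - 1][j - 1] + s,
                          h[i - 1][j] + gap,
                          h[i][j - 1] + gap)
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)
    # Trace back from the best cell until the score reaches zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and h[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if h[i][j] == h[i - 1][j - 1] + s:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif h[i][j] == h[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(local_align("LGPSTKQFGKGSSSRIWDN", "LNQIERSFGKGAIMRLGDA"))
```

With these toy scores the two example sequences yield only the conserved FGKG motif; the flanking residues are left out because including them would lower the score.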
Global alignment is appropriate for sequences that are
known to share similarity over their whole length.
Algorithm FASTA

Step 1 Preprocessing
•FASTA finds regions of similarity by making an index of the amino acid
positions in each sequence, e.g. a C at position 1, an S at position 2, etc.
Step 2 Heuristic searching
•These indexes are used to find out whether runs of the same characters
occur in the same order in the two sequences being compared.
•If these runs are long enough, the sequences are considered similar.
•An alignment is shown for the best-matching sequences in the database.
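The two steps above can be sketched as a word index followed by a count of hits per diagonal. This is a simplified illustration rather than the FASTA implementation: the helper names `ktup_index` and `diagonal_hits` are hypothetical, and the word length of 2 is an assumption for the example. Words that occur in the same order in both sequences fall on the same diagonal (query position minus target position), so a diagonal with many hits marks a candidate region.

```python
from collections import Counter, defaultdict


def ktup_index(seq, k=2):
    """Step 1: map every word of length k to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(seq) - k + 1):
        index[seq[pos:pos + k]].append(pos)
    return index


def diagonal_hits(query, target, k=2):
    """Step 2: count shared words per diagonal (query_pos - target_pos)."""
    index = ktup_index(query, k)
    hits = Counter()
    for t_pos in range(len(target) - k + 1):
        for q_pos in index.get(target[t_pos:t_pos + k], ()):
            hits[q_pos - t_pos] += 1
    return hits


hits = diagonal_hits("LGPSTKQFGKGSSSRIWDN", "LNQIERSFGKGAIMRLGDA")
print(hits.most_common(1))
```

For the two example sequences the words FG, GK and KG all land on diagonal 0, while the single shared word LG lands on a distant diagonal and is effectively noise.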
FastA
[Figure: FastA output with PAM250, top 10 sequences. Annotations:]
•init1: scores used to rank the database sequences
•initn: sum of init1 scores minus penalty for gaps (20)
•opt: NW opt score
Characteristics of FASTA :

•Local alignments: FASTA tries to find patches of regional similarity,
rather than trying to find the best alignment between your entire query
and an entire database sequence.
•Gapped alignments: alignments generated with FASTA can contain gaps.
•Rapid.
•Heuristic: FASTA is not guaranteed to find the best alignment between
your query and the database; it may miss matches. This is because it uses
a strategy which is expected to find most matches, but sacrifices complete
sensitivity in order to gain speed.
Scores

•initn = init1 = opt indicates 100% identity over the matched stretch.
•initn > init1 indicates that there is more than one matching region in the
database sequence, with poorly matching region(s) separating them.
•opt > initn shows that the matching regions are greatly improved by the
addition of gaps in one or both of the sequences. Such differences in score are
indicative of non-homologous sequences.
•opt < initn: FASTA only optimizes within a narrow band along the same diagonal
as the init1 region (the best single region of match). If any of the other (n-1)
regions lie outside the band, they are excluded from the optimized score, i.e.
there is too large a separation between the good scoring regions for FASTA to
join them.
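The rules above can be written as a small lookup, which may help when scanning many hits. This is only a restatement of the slide's rules of interpretation; the function name `interpret_fasta_scores` and the wording of the returned notes are hypothetical.

```python
def interpret_fasta_scores(init1, initn, opt):
    """Return the interpretation notes that apply to one FASTA hit."""
    notes = []
    if initn == init1 == opt:
        notes.append("one region, identical over the matched stretch")
    if initn > init1:
        notes.append("more than one matching region")
    if opt > initn:
        notes.append("score improved by adding gaps")
    if opt < initn:
        notes.append("some regions lie outside the optimization band")
    return notes


# Made-up scores, for illustration only
print(interpret_fasta_scores(init1=60, initn=95, opt=80))
```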
Finding a local alignment: BLAST algorithm

•With the BLAST algorithm a substitution matrix is used during all phases
of protein searches (BLASTP, BLASTX, TBLASTN).
•FASTA, in contrast, uses a substitution matrix only for the extension
phase. To reduce the penalty of using a substitution matrix for only the
second phase, set the k-tuple parameter to a low value (1). However, this
makes the search significantly slower.
Algorithms BLAST

•BLAST makes an index of the query sequence showing the positions of each
possible amino acid triplet, e.g. CCC occurs at position 1, YTL at
position 23, etc.
•Triplets are ordered according to how often they will occur by chance in
two related proteins, the most rarely found being the most significant.
•A matrix (for instance BLOSUM62) is used to determine these significances.
•Each database sequence is searched for these unusual triplets first.
•An alignment is shown for the best-matching sequences in the database.
•This is a heuristic (rule-of-thumb) method which usually works well.
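The triplet index can be sketched as follows. To keep the example short, a toy scoring function stands in for a real substitution matrix: `toy_score` (identity +4, anything else -1), the threshold of 7, and the name `neighborhood_index` are all assumptions for illustration. The idea shown is that each query triplet is indexed together with every similar triplet whose score against it reaches the threshold, so that related but non-identical words in a database sequence can still seed a match.

```python
from collections import defaultdict
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_score(a, b):
    # Assumption: stand-in for a substitution matrix such as BLOSUM62.
    return 4 if a == b else -1


def neighborhood_index(query, w=3, threshold=7, score=toy_score):
    """Map each word scoring >= threshold against a query word to its positions."""
    index = defaultdict(list)
    for pos in range(len(query) - w + 1):
        word = query[pos:pos + w]
        for candidate in product(AMINO_ACIDS, repeat=w):
            total = sum(score(x, y) for x, y in zip(word, candidate))
            if total >= threshold:
                index["".join(candidate)].append(pos)
    return index


index = neighborhood_index("FGKG")
print(index["FGK"], index["FGR"], "AAA" in index)
```

With the toy scores, a threshold of 7 admits the exact triplet and every triplet that differs from it at one position. A real matrix would instead admit conservative substitutions and reject unlikely ones.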
BLAST (Basic Local Alignment Search Tool).
Characteristics :
•Local alignments: BLAST tries to find patches of regional similarity,
rather than trying to find the best alignment between your entire query
and an entire database sequence.
•Ungapped alignments: alignments generated with the original BLAST
algorithm do not contain gaps. BLAST's speed and statistical model depend
on this, but in theory it reduces sensitivity. However, BLAST will report
multiple local alignments between your query and a database sequence.
•Rapid: BLAST is extremely fast.
•Heuristic: BLAST is not guaranteed to find the best alignment between
your query and the database; it may miss matches. This is because it uses
a strategy which is expected to find most matches, but sacrifices complete
sensitivity in order to gain speed. However, in practice few biologically
significant matches that can be found with other sequence search programs
are missed by BLAST.
•BLAST searches the database in two phases. First it looks for short
subsequences which are likely to produce significant matches, and then it
tries to extend these subsequences.
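The second phase, extending a short match without gaps, can be sketched like this. The example assumes a seed has already been found (here the triplet GKG at position 8 in both example sequences, counting from 0). The names `extend_seed` and `_extend`, the toy scores (+2 / -2), and the drop-off limit `x_drop=3` are assumptions for illustration, not BLAST's actual parameters. Extension in each direction stops once the running score has fallen more than `x_drop` below the best score seen.

```python
def toy_score(a, b):
    # Assumption: stand-in for a real substitution matrix.
    return 2 if a == b else -2


def _extend(query, target, q, t, step, x_drop, score):
    """Walk in one direction from (q, t); return (best gain, residues kept)."""
    best = run = 0
    kept = length = 0
    while 0 <= q < len(query) and 0 <= t < len(target):
        run += score(query[q], target[t])
        length += 1
        if run > best:
            best, kept = run, length
        elif best - run > x_drop:
            break  # score dropped too far below the best: give up
        q += step
        t += step
    return best, kept


def extend_seed(query, target, q_pos, t_pos, w=3, x_drop=3, score=toy_score):
    """Ungapped extension of a seed of length w; returns (score, start, end).

    start and end are query coordinates, so the segment is query[start:end].
    """
    seed = sum(score(query[q_pos + i], target[t_pos + i]) for i in range(w))
    right, right_len = _extend(query, target, q_pos + w, t_pos + w, 1, x_drop, score)
    left, left_len = _extend(query, target, q_pos - 1, t_pos - 1, -1, x_drop, score)
    return seed + left + right, q_pos - left_len, q_pos + w + right_len


query = "LGPSTKQFGKGSSSRIWDN"
target = "LNQIERSFGKGAIMRLGDA"
score, start, end = extend_seed(query, target, q_pos=8, t_pos=8)
print(score, query[start:end])
```

Here the seed GKG grows by one residue to the left (F) and not at all to the right, giving the FGKG segment of the earlier local alignment example.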
•BLASTP searches a protein sequence against a protein database.
•BLASTN searches a nucleotide sequence against a nucleotide database.
•TBLASTN searches a protein sequence against a nucleotide database, by
translating each database nucleotide sequence in all 6 reading frames.
Especially good for EST databases.
•BLASTX searches a nucleotide sequence against a protein database, by
first translating the query nucleotide sequence in all 6 reading frames.
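Translation in all 6 reading frames, as used by TBLASTN and BLASTX, means three frames on the given strand plus three on its reverse complement. The sketch below assumes the standard genetic code and an unambiguous A/C/G/T alphabet; the names `translate` and `six_frames` are hypothetical, stop codons are written as `*`, and any codon containing another character becomes `X`.

```python
from itertools import product

# Standard genetic code in the conventional TCAG order (third base varies fastest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons from the start of dna; unknown codons give X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Return the six translations, keyed +1..+3 (forward) and -1..-3 (reverse)."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    return {f"{sign}{offset + 1}": translate(strand[offset:])
            for sign, strand in (("+", dna), ("-", reverse))
            for offset in range(3)}


for frame, protein in six_frames("ATGGCCTAA").items():
    print(frame, protein)
```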
Finally some rules of thumb: Homology
•Protein sequence comparisons typically double the evolutionary
look-back time over DNA sequence comparisons.
•The requirement for a common folded structure in homologous
proteins usually causes these proteins to be similar over the
entire length of the gene product (or domain). Therefore, most
sequences that share statistically significant similarity throughout
their entire lengths are homologous.
•Matches that are more than 50% identical in a 20-40 amino acid region occur frequently by
chance.
•Distantly related homologs may lack significant similarity. Two or more
homologous sequences may have very few absolutely conserved residues.
•If homology has been inferred due to significant similarity scores between
two proteins, A and B, that align over their entire lengths and between protein
B and a third protein, C, then proteins A and C must also be homologous, even
if they share no significant similarity.
•Low complexity regions, transmembrane regions and coiled-coil regions
frequently display significant similarity in the absence of homology. Low
complexity regions can be filtered out using the default parameters of BLAST.
Transmembrane and coiled-coil regions should be identified and masked (by
eliminating these regions from the query) by the user.
Significance
Results of searches using different scoring systems may be
compared directly using normalized scores.
If S is the (raw) score for a local alignment, the normalized score S'
(in bits) is calculated by the formula S' = (lambda*S - ln K) / ln 2, where
lambda and K are parameters associated with a given scoring system.
A normalized score S', with E value = E, is statistically significant
if it exceeds log2(N/E), where N is the size of the search space. As the
evolutionary distance between two sequences increases, the length
of a local alignment required to achieve a statistically significant
score also increases.
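The two formulas can be checked numerically with a few lines of code. The function names are hypothetical, the expected number of chance hits is taken as E = N / 2^S' (which is the relation implied by the significance threshold), and the parameter values in the usage lines (lambda = 0.318, K = 0.134, a search space of 10^9) are assumptions chosen only to have something to compute.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalized score S' = (lambda*S - ln K) / ln 2, in bits."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits, search_space):
    """Expected number of chance alignments with at least this bit score."""
    return search_space / 2 ** bits


def min_significant_bits(search_space, e_threshold):
    """Smallest S' that is significant at e_threshold: log2(N / E)."""
    return math.log2(search_space / e_threshold)


# Illustrative values only
bits = bit_score(raw_score=100, lam=0.318, k=0.134)
print(round(bits, 1))
print(e_value(bits, search_space=1e9))
print(round(min_significant_bits(search_space=1e9, e_threshold=0.01), 1))
```

Because the bit score already absorbs lambda and K, two hits scored with different matrices can be compared directly once both are expressed in bits.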
Summary of previous

•Global alignment is appropriate for sequences that are
known to share similarity over their whole length.
•Local alignment is appropriate when the sequences may
show isolated regions of similarity, for example multiple
domains or repeats.
•Local alignment is best applied when scanning a database to
find similarities or when there is no a priori knowledge that
the protein sequences are similar.
Database artifacts and Low complexity filters
Database Artifacts
Vector sequences
A number of authors have identified and catalogued the contamination of
sequence databases with vectors. Among the studies are:
•Claverie. Genomics 12:838. 1992.
•Lamperti et al. Nucleic Acids Res. 20:2741. 1992.
Of particular note in this paper is the finding of short apparent vector
sequences in the middle of non-vector sequence. The authors speculate that
these may be due to errors in the editing of sequences or to rearranged
plasmids.
•Lopez, Kristensen & Prydz. Nature 355:211. 1992.
•Kristensen, Lopez & Prydz. An estimate of the sequencing error frequency
in the DNA sequence databases. DNA Seq 2:343. 1989.
Heterologous sequences
White, O. et al. Nucleic Acids Res. 21:2829.
Describes a statistical method to compare sequence sets (but not individual
sequences). Shows that several sets of cDNAs show bulk properties
different than human cDNAs. Sequence comparisons are used to show that
this is due to contamination of the anomalous libraries with yeast and
bacterial sequences.
Rearranged & deleted sequences
Repetitive element contamination
cDNA cloning methods may sometimes capture retroelements such as Alus.
In some cases, chimaeras between cellular transcripts and Alus may form.
Derived protein sequences which appear to contain Alu-derived sequences
were cataloged by Claverie (Genomics 12:838).

Sequencing errors / Natural polymorphisms

Sequence Pre-Filters
Reducing matches due to biased amino acid composition
Many amino acid sequences are highly repetitive in nature,
especially naive translations of genomic DNA. Matches
between such segments are more likely to be due to these
local amino acid composition biases than to common
descent. Filters have been developed to mask out regions
showing highly-biased local composition.
•SEG (Wootton & Federhen, Computers & Chemistry 17:149. 1993)
•XNU (Claverie & States, Computers & Chemistry 17:191. 1993)
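The idea of masking regions of biased composition can be illustrated with a sliding window and Shannon entropy. This is a deliberately simplified sketch and is neither the SEG nor the XNU algorithm: the function names, the window length of 12, the entropy cutoff of 2.2 bits and the mask character X are all assumptions for the example.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Entropy in bits of the residue composition of a window."""
    n = len(window)
    return -sum(c / n * math.log2(c / n) for c in Counter(window).values())


def mask_low_complexity(seq, window=12, min_entropy=2.2, mask_char="X"):
    """Replace every window whose entropy is below min_entropy with mask_char."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            masked[start:start + window] = mask_char * window
    return "".join(masked)


# A made-up sequence with a serine-rich stretch in the middle
print(mask_low_complexity("MKTAYIAKQR" + "S" * 15 + "LGPSTKQFGKGD"))
```

Windows that only partly overlap the repetitive stretch can also fall below the cutoff, so a few flanking residues may be masked as well; published filters treat the boundaries with more care.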
The end
Thank you for your attention
