Database Searches FASTA

Database Searches
FASTA
Database searches: Why?

To discover or verify identity of a newly
sequenced gene
To find other members of a multigene
family
To classify groups of genes
Database searching
In practice, we cannot use Smith-Waterman to
search for sequences in a database:
Databases are huge (GenBank ~30 million sequences, SwissProt >> 100,000 sequences)
S-W is slow: time is proportional to N·n², where n = sequence
length and N = number of sequences in the database
Instead, use faster heuristic approaches

FASTA
BLAST
Tradeoff: Sensitivity vs. false positives

Smith-Waterman is slower, but more sensitive
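To get a feel for the N·n² cost, the arithmetic can be sketched as below. The sequence length of 1,000 and the rate of 10^9 table cells per second are illustrative assumptions, not figures from the slides.

```python
def smith_waterman_cells(num_sequences: int, seq_length: int) -> int:
    """Approximate number of dynamic programming cells: N * n^2."""
    return num_sequences * seq_length * seq_length


# N ~ 30 million (GenBank figure quoted above); n = 1,000 is an assumed length.
cells = smith_waterman_cells(30_000_000, 1_000)

# Hypothetical throughput, chosen only to make the estimate concrete.
CELLS_PER_SECOND = 1e9
hours = cells / CELLS_PER_SECOND / 3600

print(f"{cells:.1e} cells, about {hours:.1f} hours per query")
```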
Dot Plots
[Figure: dot plot of GATCAACTGACGTA (horizontal axis) against GTTCAGCTGCGTAC (vertical axis)]
Dot Plots
[Figure: dot plot of GATCAACTGACGTA (horizontal axis) against GTTCAGCTGCGTAC (vertical axis), filtered with a 4-base window and 75% identity]
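A minimal sketch of how such a dot plot might be computed, using the two sequences from the figure. The function names and the convention of marking a window by its top-left cell are assumptions made for illustration.

```python
def dot_plot(seq_x, seq_y, window=1, min_identity=1.0):
    """Return the set of (row, col) cells where the window starting at
    seq_y[row] and seq_x[col] matches at >= min_identity."""
    dots = set()
    for row in range(len(seq_y) - window + 1):
        for col in range(len(seq_x) - window + 1):
            matches = sum(seq_y[row + k] == seq_x[col + k] for k in range(window))
            if matches / window >= min_identity:
                dots.add((row, col))
    return dots


def render(dots, seq_x, seq_y):
    """Draw the plot as text: '*' for a dot, '.' for an empty cell."""
    lines = ["  " + seq_x]
    for row, base in enumerate(seq_y):
        cells = "".join("*" if (row, col) in dots else "." for col in range(len(seq_x)))
        lines.append(base + " " + cells)
    return "\n".join(lines)


horizontal = "GATCAACTGACGTA"
vertical = "GTTCAGCTGCGTAC"

# Unfiltered plot: one dot per identical base pair.
print(render(dot_plot(horizontal, vertical), horizontal, vertical))
print()
# Filtered plot: 4-base window, at least 3 of 4 positions identical (75%).
print(render(dot_plot(horizontal, vertical, window=4, min_identity=0.75), horizontal, vertical))
```

The filter removes most isolated dots, so the diagonals that correspond to locally similar regions stand out.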
FASTA
Originally developed ~1985 by Lipman
and Pearson
Goal: Perform fast, approximate local
alignments to find sequences in the
database that are related to the query
sequence
Based on dot plot idea
FASTA: Step 1
Look for exact matches between words
in query and test sequence
Words are short
DNA words are usually 6 bases
Protein words are 1 or 2 amino acids
Ktup denotes word length

Use hash tables to locate words quickly
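Step 1 can be sketched as follows. This is a simplified illustration, not FASTA's actual code: it uses a Python dictionary as the hash table, and the name find_hot_spots is hypothetical.

```python
from collections import defaultdict


def find_hot_spots(query, subject, ktup):
    """Return (query_pos, subject_pos) pairs where words of length ktup
    match exactly between the two sequences."""
    # Index every word of the query by its start position.
    positions = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        positions[query[i:i + ktup]].append(i)

    # Scan the subject once, looking each word up in the table.
    hot_spots = []
    for j in range(len(subject) - ktup + 1):
        for i in positions.get(subject[j:j + ktup], ()):
            hot_spots.append((i, j))
    return hot_spots


print(find_hot_spots("GATCAACT", "TTGATCGG", ktup=4))
```

Each returned pair is one dot in the dot plot; pairs with the same difference query_pos - subject_pos lie on the same diagonal.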
FASTA: Details
Hashing: Map a string of characters to
an integer, e.g.,
AAA 0
AAC 1
...
TTT 63 (oversimplified)
Preprocess the database and create a table
that stores locations of each possible k-tuple:
20^k for amino acids (400 if k = 2),
4^k for DNA (4096 if k = 6).
Use hash code computed from query sequence
k-tuples for quick lookup
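The mapping from k-tuples to integers can be sketched by reading each DNA word as a base-4 number, which reproduces AAA = 0, AAC = 1, ..., TTT = 63. The names below are hypothetical, and the sketch assumes the sequence contains only A, C, G and T.

```python
# Each base is one base-4 digit (alphabetical order, as in the example above).
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def ktuple_code(word):
    """Interpret a DNA word as a base-4 integer."""
    code = 0
    for base in word:
        code = code * 4 + CODE[base]
    return code


def build_table(sequence, k):
    """Table with 4^k entries; entry c lists the start positions of the
    k-tuple whose code is c."""
    table = [[] for _ in range(4 ** k)]
    for pos in range(len(sequence) - k + 1):
        table[ktuple_code(sequence[pos:pos + k])].append(pos)
    return table


table = build_table("ACGTACGT", k=3)
print(len(table))                    # 4^3 entries
print(table[ktuple_code("ACG")])     # positions where ACG starts
```

Because the code is used directly as an array index, looking up all locations of a query k-tuple takes constant time.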
FASTA
FASTA: Step 2
Find 10 best diagonal runs (sequence of
nearby hot spots on same diagonal)
Give each hot spot a positive score, and each
space between consecutive hot spots a
negative score that grows more negative with distance,
similar to affine gap costs in S-W
Each diagonal run is composed of matches
(hot spots themselves) and mismatches
(interspot regions) but no indels
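One way to sketch Step 2: group hot spots by diagonal, then on each diagonal pick the best-scoring stretch of consecutive hot spots. The score of 5 per hot spot and the penalty of 1 per base of separation are assumed values chosen for illustration; hot spots are (query_pos, subject_pos) pairs.

```python
from collections import defaultdict


def best_diagonal_runs(hot_spots, ktup, spot_score=5, gap_cost=1, top=10):
    """Return up to `top` runs as (score, diagonal, query_start, query_end),
    best first. The diagonal is query_pos - subject_pos."""
    by_diagonal = defaultdict(list)
    for i, j in hot_spots:
        by_diagonal[i - j].append(i)

    runs = []
    for diagonal, starts in by_diagonal.items():
        starts.sort()
        best_score, best_span = 0, None
        cur_score, cur_start, prev = 0, None, None
        for i in starts:
            # Penalty for the interspot region (zero if the words overlap).
            penalty = 0 if prev is None else gap_cost * max(0, i - (prev + ktup))
            if prev is not None and cur_score - penalty > 0:
                cur_score += spot_score - penalty   # extend the current run
            else:
                cur_score, cur_start = spot_score, i  # start a new run
            if cur_score > best_score:
                best_score, best_span = cur_score, (cur_start, i + ktup)
            prev = i
        runs.append((best_score, diagonal, best_span[0], best_span[1]))

    runs.sort(reverse=True)
    return runs[:top]


print(best_diagonal_runs([(0, 2), (1, 3), (10, 12), (5, 0)], ktup=2))
```

In the example, the distant hot spot at query position 10 costs more to reach than it adds, so the best run on that diagonal stops after the first two hot spots.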
FASTA: Step 3
Evaluate each diagonal run using an
appropriate scoring matrix and find
best scoring run
Discard runs with low scores (filtration)
The highest-scoring diagonal is reported
as init1
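Step 3 might look like the sketch below. A simple match/mismatch score (+5/-4) stands in for a real scoring matrix, and runs are given as hypothetical (diagonal, query_start, query_end) triples with diagonal = query_pos - subject_pos; all of these are assumptions for illustration.

```python
def rescore_run(query, subject, diagonal, start, end, match=5, mismatch=-4):
    """Best-scoring ungapped sub-segment of query[start:end] along a diagonal."""
    best = current = 0
    for i in range(start, end):
        j = i - diagonal
        step = match if query[i] == subject[j] else mismatch
        current = max(0, current + step)   # drop a prefix that scores below zero
        best = max(best, current)
    return best


def init1(query, subject, runs, cutoff=0):
    """Rescore every run, discard those at or below cutoff, return the best score."""
    scores = [rescore_run(query, subject, d, s, e) for d, s, e in runs]
    kept = [score for score in scores if score > cutoff]
    return max(kept, default=0)


query = "ACGTAC"
subject = "GGACCTAC"
print(init1(query, subject, runs=[(-2, 0, 6), (0, 0, 6)]))
```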
FASTA: Step 4
After all diagonals found, try to join diagonals by
adding gaps
Use a weighted directed acyclic graph whose edges connect
segments that could be combined using indels
Find a maximum weight path in this graph; it corresponds
to a local alignment, reported as initn
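The joining step can be sketched as a longest-path computation over segments. Here each segment is a hypothetical tuple (query_start, query_end, subject_start, subject_end, score), an edge exists when one segment ends before another begins in both sequences, and every join pays a fixed penalty of 20; the tuple layout and the penalty are assumptions.

```python
def initn(segments, join_penalty=20):
    """Maximum-weight path through a DAG of compatible diagonal segments."""
    # Sorting by query start gives a valid topological order, because an
    # edge a -> b requires a to end (hence start) before b starts.
    order = sorted(segments, key=lambda seg: (seg[0], seg[2]))
    best_ending = []   # best path score ending at each segment
    for b_index, b in enumerate(order):
        best = b[4]    # path consisting of this segment alone
        for a_index in range(b_index):
            a = order[a_index]
            if a[1] <= b[0] and a[3] <= b[2]:   # a precedes b in both sequences
                best = max(best, best_ending[a_index] + b[4] - join_penalty)
        best_ending.append(best)
    return max(best_ending, default=0)


segments = [
    (0, 10, 0, 10, 50),
    (12, 20, 15, 23, 40),
    (5, 15, 30, 40, 30),
]
print(initn(segments))
```

In the example the first two segments can be joined across a gap, while the third conflicts with both, so the best path uses the first two.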
Adding gaps
FASTA: Step 5
If score reaches a threshold value,
compute an alternative local alignment
Form a band around init1 in dynamic
programming table
Width depends on ktup
Use Smith-Waterman to find best
alignment restricted to that band.
Result is called opt
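A banded Smith-Waterman can be sketched as below. For brevity it uses a linear gap penalty and simple match/mismatch scores rather than affine gaps and a scoring matrix, and the band is given directly as a half-width; these are simplifying assumptions, not FASTA's actual parameters.

```python
def banded_smith_waterman(query, subject, diagonal, half_width,
                          match=5, mismatch=-4, gap=-8):
    """Best local alignment score using only cells (i, j) with
    |(i - j) - diagonal| <= half_width."""
    best = 0
    prev = {}   # column -> score in the previous row (band cells only)
    for i in range(1, len(query) + 1):
        cur = {}
        lo = max(1, i - diagonal - half_width)
        hi = min(len(subject), i - diagonal + half_width)
        for j in range(lo, hi + 1):
            sub = match if query[i - 1] == subject[j - 1] else mismatch
            score = max(
                0,                          # start a new local alignment
                prev.get(j - 1, 0) + sub,   # match or mismatch
                prev.get(j, 0) + gap,       # gap in subject
                cur.get(j - 1, 0) + gap,    # gap in query
            )
            cur[j] = score
            best = max(best, score)
        prev = cur
    return best


query = "ACGTACGT"
subject = "ACGTTACGT"   # query with one inserted T
print(banded_smith_waterman(query, subject, diagonal=0, half_width=0))
print(banded_smith_waterman(query, subject, diagonal=0, half_width=1))
```

With a half-width of 0 the band is a single diagonal and no gap can be placed; widening it by one cell lets the alignment step across the inserted base. Only the cells inside the band are filled, so the work per database sequence grows with the band width times the sequence length rather than with the full table.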
FASTA: Final Steps

Rank database sequences according to
opt scores
Use full Smith-Waterman method to align
query sequence against each of the highest
ranking sequences from the database
Perform statistical analysis
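As a rough illustration of ranking by opt and attaching a z-score, the sketch below uses plain standardization: (score - mean) / standard deviation. This is an assumption made for simplicity; FASTA's own statistics are more involved (for example, they take library sequence length into account). The sequence names and scores are made up.

```python
from statistics import mean, pstdev


def rank_by_opt(opt_scores):
    """Return (name, opt, z) triples sorted by opt score, best first.
    z is a plain standardized score, not FASTA's actual z-score."""
    values = list(opt_scores.values())
    mu, sigma = mean(values), pstdev(values)
    ranked = sorted(opt_scores.items(), key=lambda item: item[1], reverse=True)
    return [(name, opt, (opt - mu) / sigma if sigma else 0.0)
            for name, opt in ranked]


# Hypothetical opt scores for three database sequences.
for name, opt, z in rank_by_opt({"seq_a": 10, "seq_b": 20, "seq_c": 30}):
    print(f"{name}\topt={opt}\tz={z:.2f}")
```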
!!SEQUENCE_LIST 1.0
(Nucleotide) FASTA of: b2.seq from: 1 to: 693 December 9, 2002 14:02
TO: /u/browns02/Victor/Searchset/*.seq Sequences: 2,050 Symbols: 913,285 Word Size: 6
Searching with both strands of the query.
Scoring matrix: GenRunData:fastadna.cmp
Constant pamfactor used
Gap creation penalty: 16 Gap extension penalty: 4
Histogram Key:
Each histogram symbol represents 4 search set sequences
Each inset symbol represents 1 search set sequences
z-scores computed from opt scores
z-score  obs  exp
         (=)  (*)
< 20       0    0:
  22       0    0:
  24       3    0:=
  26       2    0:=
  28       5    0:==
  30      11    3:*==
  32      19   11:==*==
  34      38   30:=======*==
  36      58   61:===============*
  38      79  100:====================*
  40     134  140:==================================*
  42     167  171:==========================================*
  44     205  189:===============================================*====
  46     209  192:===============================================*=====
  48     177  184:=============================================*
List
The best scores are: init1 initn opt z-sc E(1018780)..
SW:PPI1_HUMAN Begin: 1 End: 269
! Q00169 homo sapiens (human). phosph... 1854 1854 1854 2249.3 1.8e-117
SW:PPI1_RABIT Begin: 1 End: 269
! P48738 oryctolagus cuniculus (rabbi... 1840 1840 1840 2232.4 1.6e-116
SW:PPI1_RAT Begin: 1 End: 270
! P16446 rattus norvegicus (rat). pho... 1543 1543 1837 2228.7 2.5e-116
SW:PPI1_MOUSE Begin: 1 End: 270
! P53810 mus musculus (mouse). phosph... 1542 1542 1836 2227.5 2.9e-116
SW:PPI2_HUMAN Begin: 1 End: 270
! P48739 homo sapiens (human). phosph... 1533 1533 1533 1861.0 7.7e-96
SPTREMBL_NEW:BAC25830 Begin: 1 End: 270
! Bac25830 mus musculus (mouse). 10,... 1488 1488 1522 1847.6 4.2e-95
SP_TREMBL:Q8N5W1 Begin: 1 End: 268
! Q8n5w1 homo sapiens (human). simila... 1477 1477 1522 1847.6 4.3e-95
SW:PPI2_RAT Begin: 1 End: 269
! P53812 rattus norvegicus (rat). pho... 1482 1482 1516 1840.4 1.1e-94
Alignments
SCORES Init1: 1515 Initn: 1565 Opt: 1687 z-score: 1158.1 E(): 2.3e-58
>>GB_IN3:DMU09374 (2038 nt)
initn: 1565 init1: 1515 opt: 1687 Z-score: 1158.1 expect(): 2.3e-58
66.2% identity in 875 nt overlap
(83-957:151-1022)
60 70 80 90 100 110
u39412.gb_pr CCCTTTGTGGCCGCCATGGACAATTCCGGGAAGGAAGCGGAGGCGATGGCGCTGTTGGCC
||||||||||||||||||||
DMU09374 AGGCGGACATAAATCCTCGACATGGGTGACAACGAACAGAAGGCGCTCCAACTGATGGCC
130 140 150 160 170 180
120 130 140 150 160 170
u39412.gb_pr GAGGCGGAGCGCAAAGTGAAGAACTCGCAGTCCTTCTTCTCTGGCCTCTTTGGAGGCTCA
|||||||||||||||||||||||||||||||||
DMU09374 GAGGCGGAGAAGAAGTTGACCCAGCAGAAGGGCTTTCTGGGATCGCTGTTCGGAGGGTCC
190 200 210 220 230 240
180 190 200 210 220 230
u39412.gb_pr TCCAAAATAGAGGAAGCATGCGAAATCTACGCCAGAGCAGCAAACATGTTCAAAATGGCC
|||||||||||||||||||||||||||||||||||||
DMU09374 AACAAGGTGGAGGACGCCATCGAGTGCTACCAGCGGGCGGGCAACATGTTTAAGATGTCC
250 260 270 280 290 300
240 250 260 270 280 290
u39412.gb_pr AAAAACTGGAGTGCTGCTGGAAACGCGTTCTGCCAGGCTGCACAGCTGCACCTGCAGCTC
|||||||||||||||||||||||||||||||||||||
DMU09374 AAAAACTGGACAAAGGCTGGGGAGTGCTTCTGCGAGGCGGCAACTCTACACGCGCGGGCT
310 320 330 340 350 360
