FASTA
Database searching
In practice, we cannot use Smith-Waterman to
search for sequences in a database:
Databases are huge (GenBank ~30 million sequences, SwissProt >> 100,000 sequences)
S-W is slow: time is proportional to N n^2, where n = sequence
length and N = number of sequences in the database
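To make the N n^2 estimate concrete, the short sketch below counts dynamic programming cell updates for a full database scan. The sequence length of 1,000 is an assumed round number for illustration, not a figure from the slides, and the function name is hypothetical.

```python
def sw_cell_updates(num_sequences: int, seq_length: int) -> int:
    """Rough count of table cells filled when running Smith-Waterman
    between one query and every database sequence (N * n^2)."""
    return num_sequences * seq_length ** 2


# N is the GenBank figure quoted above; n = 1,000 is an assumption.
cells = sw_cell_updates(30_000_000, 1_000)
print(f"{cells:.1e} cell updates")
```

Even at a billion cell updates per second, this assumed workload would take several hours per query.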
Dot Plots
[Figure: dot plot of GATCAACTGACGTA (horizontal axis) against GTTCAGCTGCGTAC (vertical axis)]
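The dot plot for these two sequences can be regenerated with a few lines of code. This is a minimal sketch: the function name and the use of `*` and `.` as plot symbols are choices made here, not part of the slides.

```python
def dot_plot(horizontal: str, vertical: str) -> list[str]:
    """Return one text row per base of `vertical`; a '*' marks every
    column where that base equals the base of `horizontal`."""
    rows = []
    for v in vertical:
        rows.append("".join("*" if v == h else "." for h in horizontal))
    return rows


horizontal = "GATCAACTGACGTA"
vertical = "GTTCAGCTGCGTAC"
print("  " + horizontal)
for base, row in zip(vertical, dot_plot(horizontal, vertical)):
    print(base + " " + row)
```

Runs of `*` along a diagonal correspond to stretches where the two sequences match without gaps.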
FASTA
Originally developed ~1985 by Lipman
and Pearson
Goal: Perform fast, approximate local
alignments to find sequences in the
database that are related to the query
sequence
Based on dot plot idea
FASTA: Step 1
Look for exact matches between words
in query and test sequence
Words are short (the word length is the ktup parameter)
DNA words are usually 6 bases
Protein words are 1 or 2 amino acids
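The word-matching step can be sketched with a lookup table built from the query. The function name and the (query position, target position, diagonal) tuple layout are assumptions made for illustration; the example uses 3-base words so that the short dot plot sequences produce matches.

```python
from collections import defaultdict


def find_hot_spots(query: str, target: str, ktup: int):
    """Return (i, j, diagonal) for every exact word match, where i and j
    are start positions in query and target and diagonal = j - i."""
    # Index every word of length ktup in the query by its start position.
    index = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        index[query[i:i + ktup]].append(i)

    # Scan the target once and look each of its words up in the index.
    hot_spots = []
    for j in range(len(target) - ktup + 1):
        for i in index.get(target[j:j + ktup], ()):
            hot_spots.append((i, j, j - i))
    return hot_spots


for spot in find_hot_spots("GATCAACTGACGTA", "GTTCAGCTGCGTAC", ktup=3):
    print(spot)
```

Matches that share the same diagonal value lie on one diagonal of the dot plot, which is what the later steps exploit.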
FASTA: Details
Hashing: Map strings of characters to
integers, e.g.,
AAA 0
AAC 1
...
TTT 63 (oversimplified)
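One simple way to obtain the mapping above is to read each word as a base-4 number. This is a sketch of that idea only; the ordering A, C, G, T and the function name are assumptions, and the slide itself notes that the real scheme is more involved.

```python
# Assumed base ordering: A < C < G < T.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def word_to_int(word: str) -> int:
    """Interpret a DNA word as a base-4 number."""
    value = 0
    for base in word:
        value = value * 4 + BASE_CODE[base]
    return value


for word in ("AAA", "AAC", "TTT"):
    print(word, word_to_int(word))
```

With integers as keys, the word table can be a plain array of size 4^ktup instead of a string-keyed dictionary.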
FASTA: Step 2
Find 10 best diagonal runs (sequence of
nearby hot spots on same diagonal)
Give each hot spot a positive score, and each
space between consecutive hot spots a
negative score that grows more negative with distance
(similar to affine gap costs in S-W)
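The diagonal-run scoring described above can be sketched as follows. The specific values are assumptions for illustration: each hot spot scores +ktup and each unmatched position between consecutive hot spots costs 1. Overlapping hot spots are each counted in full, which is a simplification. Both function names are hypothetical.

```python
from collections import defaultdict
import heapq


def best_run_on_diagonal(starts, ktup, gap_cost=1):
    """Return (score, run_start, run_end) for the best run of hot spots
    on one diagonal. `starts` are query start positions of the hot spots."""
    best = (0, None, None)
    current, run_start, prev_end = 0, None, None
    for s in sorted(starts):
        if prev_end is None:
            extended = float("-inf")
        else:
            # Extending the run pays for the space since the last hot spot.
            extended = current - gap_cost * max(0, s - prev_end) + ktup
        if extended > ktup:
            current = extended
        else:
            # Starting a fresh run at this hot spot is at least as good.
            current, run_start = ktup, s
        prev_end = s + ktup
        if current > best[0]:
            best = (current, run_start, prev_end)
    return best


def top_diagonal_runs(hot_spots, ktup, keep=10):
    """Group (i, j) hot spots by diagonal j - i and keep the best runs."""
    by_diagonal = defaultdict(list)
    for i, j in hot_spots:
        by_diagonal[j - i].append(i)
    runs = []
    for diagonal, starts in by_diagonal.items():
        score, start, end = best_run_on_diagonal(starts, ktup)
        runs.append((score, diagonal, start, end))
    return heapq.nlargest(keep, runs)


print(top_diagonal_runs([(0, 0), (4, 4), (30, 30), (2, 7)], ktup=4))
```

In the example, the two adjacent hot spots on diagonal 0 form one run, while the distant hot spot at position 30 is not worth the cost of the space in between.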
FASTA: Step 3
Evaluate each diagonal run using an
appropriate scoring matrix and find the
best-scoring run (its score is reported as init1)
Discard runs with low scores (filtration)
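Rescoring along a diagonal can be sketched as a search for the best-scoring ungapped segment. The match and mismatch values of +5 and -4 stand in for a real scoring matrix and are assumptions, as are the function names and the cutoff used in the example.

```python
def rescore_diagonal(query, target, diagonal, match=5, mismatch=-4):
    """Best score of any ungapped segment on the diagonal j - i == diagonal."""
    best = current = 0
    i = max(0, -diagonal)
    j = i + diagonal
    while i < len(query) and j < len(target):
        step = match if query[i] == target[j] else mismatch
        # Drop the running segment as soon as its score falls below zero.
        current = max(0, current + step)
        best = max(best, current)
        i += 1
        j += 1
    return best


def filter_diagonals(query, target, diagonals, cutoff):
    """Keep (score, diagonal) pairs whose rescored value reaches the cutoff."""
    scored = [(rescore_diagonal(query, target, d), d) for d in diagonals]
    return sorted((s, d) for s, d in scored if s >= cutoff)


query, target = "GATCAACTGACGTA", "GTTCAGCTGCGTAC"
print(filter_diagonals(query, target, diagonals=[0, -1, 5], cutoff=15))
```

The highest score that survives the filter plays the role of init1 in this sketch.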
FASTA: Step 4
After all diagonal runs are found, try to join them by
adding gaps (the joined score is reported as initn)
Use a weighted directed acyclic graph over the segments, with
edges between those that could be combined using indels
Adding gaps
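Joining segments can be sketched as a longest-path computation over a directed acyclic graph: an edge goes from one segment to another when the second starts after the first ends in both sequences. The fixed joining penalty of 20 and the segment tuple layout are assumptions made for illustration.

```python
def best_chain_score(segments, join_penalty=20):
    """segments: (q_start, q_end, t_start, t_end, score) with half-open ends.
    Returns the best total score of any chain of compatible segments."""
    # Sorting by query start puts every possible predecessor first.
    segs = sorted(segments)
    best_ending = []
    for k, (qs, qe, ts, te, score) in enumerate(segs):
        best = score  # chain consisting of this segment alone
        for m in range(k):
            _, prev_qe, _, prev_te, _ = segs[m]
            # Edge m -> k exists if k lies after m in both sequences.
            if prev_qe <= qs and prev_te <= ts:
                best = max(best, best_ending[m] + score - join_penalty)
        best_ending.append(best)
    return max(best_ending, default=0)


segments = [
    (0, 10, 0, 10, 50),
    (15, 30, 18, 33, 60),
    (5, 12, 40, 47, 30),
]
print(best_chain_score(segments))
```

The third segment cannot be chained with the second because it lies later in the target but earlier in the query, so only the first two are joined.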
FASTA: Step 5
If the score reaches a threshold value,
compute an alternative local alignment (its score is reported as opt)
Form a band around the init1 diagonal in the dynamic
programming table
Width depends on ktup
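The banded computation can be sketched as a Smith-Waterman recurrence that only fills cells within a fixed distance of one diagonal. For brevity this sketch uses a linear gap cost rather than the affine creation and extension penalties; the scores +5, -4, -8 and the function name are assumptions.

```python
def banded_local_score(query, target, diagonal, half_width,
                       match=5, mismatch=-4, gap=-8):
    """Best local alignment score using only cells (i, j) with
    |(j - i) - diagonal| <= half_width. Cells outside the band count as 0."""
    best = 0
    prev = {}  # scores of the previous row, keyed by column j
    for i in range(1, len(query) + 1):
        cur = {}
        lo = max(1, i + diagonal - half_width)
        hi = min(len(target), i + diagonal + half_width)
        for j in range(lo, hi + 1):
            sub = match if query[i - 1] == target[j - 1] else mismatch
            score = max(
                0,
                prev.get(j - 1, 0) + sub,  # align query[i-1] with target[j-1]
                prev.get(j, 0) + gap,      # gap in the target
                cur.get(j - 1, 0) + gap,   # gap in the query
            )
            cur[j] = score
            best = max(best, score)
        prev = cur
    return best


query, target = "GATCAACTGACGTA", "GTTCAGCTGCGTAC"
print(banded_local_score(query, target, diagonal=0, half_width=0))
print(banded_local_score(query, target, diagonal=0, half_width=2))
```

With a half-width of 0 the band collapses to a single diagonal and no gaps are possible; widening the band lets the alignment move between neighbouring diagonals. The work grows with the band width times the sequence length rather than with the full table.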
!!SEQUENCE_LIST 1.0
(Nucleotide) FASTA of: b2.seq from: 1 to: 693 December 9, 2002 14:02
TO: /u/browns02/Victor/Searchset/*.seq Sequences: 2,050 Symbols: 913,285 Word Size: 6
Searching with both strands of the query.
Scoring matrix: GenRunData:fastadna.cmp
Constant pamfactor used
Gap creation penalty: 16 Gap extension penalty: 4
Histogram Key:
Each histogram symbol represents 4 search set sequences
Each inset symbol represents 1 search set sequences
z-scores computed from opt scores
z-score  obs  exp
         (=)  (*)
   < 20    0    0:
     22    0    0:
     24    3    0:=
     26    2    0:=
     28    5    0:==
     30   11    3:*==
     32   19   11:==*==
     34   38   30:=======*==
     36   58   61:===============*
     38   79  100:====================*
     40  134  140:==================================*
     42  167  171:==========================================*
     44  205  189:===============================================*====
     46  209  192:===============================================*=====
     48  177  184:=============================================*
List
The best scores are:                     init1 initn   opt    z-sc  E(1018780)..
SW:PPI1_HUMAN Begin: 1 End: 269
! Q00169 homo sapiens (human). phosph...   1854  1854  1854  2249.3  1.8e-117
SW:PPI1_RABIT Begin: 1 End: 269
! P48738 oryctolagus cuniculus (rabbi...   1840  1840  1840  2232.4  1.6e-116
SW:PPI1_RAT Begin: 1 End: 270
! P16446 rattus norvegicus (rat). pho...   1543  1543  1837  2228.7  2.5e-116
SW:PPI1_MOUSE Begin: 1 End: 270
! P53810 mus musculus (mouse). phosph...   1542  1542  1836  2227.5  2.9e-116
SW:PPI2_HUMAN Begin: 1 End: 270
! P48739 homo sapiens (human). phosph...   1533  1533  1533  1861.0  7.7e-96
SPTREMBL_NEW:BAC25830 Begin: 1 End: 270
! Bac25830 mus musculus (mouse). 10,...    1488  1488  1522  1847.6  4.2e-95
SP_TREMBL:Q8N5W1 Begin: 1 End: 268
! Q8n5w1 homo sapiens (human). simila...   1477  1477  1522  1847.6  4.3e-95
SW:PPI2_RAT Begin: 1 End: 269
! P53812 rattus norvegicus (rat). pho...   1482  1482  1516  1840.4  1.1e-94
Alignments
SCORES  Init1: 1515  Initn: 1565  Opt: 1687  z-score: 1158.1  E(): 2.3e-58
>>GB_IN3:DMU09374 (2038 nt)
 initn: 1565 init1: 1515 opt: 1687 Z-score: 1158.1 expect(): 2.3e-58
66.2% identity in 875 nt overlap
(83-957:151-1022)

                   60        70        80        90       100       110
u39412.gb_pr CCCTTTGTGGCCGCCATGGACAATTCCGGGAAGGAAGCGGAGGCGATGGCGCTGTTGGCC
                                           || |||  | ||||| |    ||| |||||
DMU09374     AGGCGGACATAAATCCTCGACATGGGTGACAACGAACAGAAGGCGCTCCAACTGATGGCC
                    130       140       150       160       170       180

                  120       130       140       150       160       170
u39412.gb_pr GAGGCGGAGCGCAAAGTGAAGAACTCGCAGTCCTTCTTCTCTGGCCTCTTTGGAGGCTCA
             |||||||||   ||  |||   |   | ||  |||  |       || || ||||| ||
DMU09374     GAGGCGGAGAAGAAGTTGACCCAGCAGAAGGGCTTTCTGGGATCGCTGTTCGGAGGGTCC
                    190       200       210       220       230       240

                  180       190       200       210       220       230
u39412.gb_pr TCCAAAATAGAGGAAGCATGCGAAATCTACGCCAGAGCAGCAAACATGTTCAAAATGGCC
               |||  | ||||| ||   |||   ||||    | || |  |||||||| || ||| ||
DMU09374     AACAAGGTGGAGGACGCCATCGAGTGCTACCAGCGGGCGGGCAACATGTTTAAGATGTCC
                    250       260       270       280       290       300

                  240       250       260       270       280       290
u39412.gb_pr AAAAACTGGAGTGCTGCTGGAAACGCGTTCTGCCAGGCTGCACAGCTGCACCTGCAGCTC
             ||||||||||     |||||  |    |||||| |||| |||   || |||  || |
DMU09374     AAAAACTGGACAAAGGCTGGGGAGTGCTTCTGCGAGGCGGCAACTCTACACGCGCGGGCT
                    310       320       330       340       350       360