
Database Searches

FASTA

Database searches: Why?


To discover or verify the identity of a newly sequenced gene
To find other members of a multigene family
To classify groups of genes

Database searching
In practice, we cannot use Smith-Waterman to search for sequences in a database:
Databases are huge (GenBank ~30 million sequences, SwissProt >> 100,000 sequences)
S-W is slow: time is proportional to N n^2, where n = sequence length and N = number of sequences in the database
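To get a feel for the N n^2 cost, the arithmetic can be sketched as below. The database size is the GenBank figure quoted above; the sequence length of 1,000 is an assumed round number used only for illustration, and `sw_cell_updates` is a hypothetical helper name.

```python
def sw_cell_updates(num_sequences: int, seq_length: int) -> int:
    """Dynamic-programming cells filled when a query of length n is aligned
    against N database sequences of length n, i.e. N * n^2."""
    return num_sequences * seq_length ** 2


# N = 30 million (GenBank figure from the slide), n = 1,000 (assumed).
cells = sw_cell_updates(30_000_000, 1_000)
print(f"{cells:.1e} cell updates for a single query")
```

Under these assumptions a single query needs on the order of 10^13 cell updates, which is why a heuristic pre-filter is attractive.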

Instead, use faster heuristic approaches


FASTA
BLAST

Tradeoff: Sensitivity vs. false positives


Smith-Waterman is slower, but more sensitive

Dot Plots
[Figure: dot plot of GATCAACTGACGTA (horizontal axis) against GTTCAGCTGCGTAC (vertical axis)]

Dot Plots
[Figure: dot plot of GATCAACTGACGTA (horizontal axis) against GTTCAGCTGCGTAC (vertical axis), filtered with a 4-base window and 75% identity]
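A minimal sketch of how such a dot plot could be computed, with and without the window filter. The two sequences are the ones on the slide axes; the function names `dot_plot` and `render` are hypothetical, and the window rule (at least 3 of 4 positions identical) is one plausible reading of "4-base window and 75% identity".

```python
def dot_plot(seq_x, seq_y, window=1, min_matches=1):
    """Return the set of (row, col) cells that receive a dot.

    A dot is placed at (i, j) when at least `min_matches` of the `window`
    positions starting at seq_y[i] and seq_x[j] are identical.
    """
    dots = set()
    for i in range(len(seq_y) - window + 1):
        for j in range(len(seq_x) - window + 1):
            matches = sum(seq_y[i + k] == seq_x[j + k] for k in range(window))
            if matches >= min_matches:
                dots.add((i, j))
    return dots


def render(seq_x, seq_y, dots):
    """Draw the plot as text: seq_x across the top, seq_y down the side."""
    lines = ["  " + seq_x]
    for i, base in enumerate(seq_y):
        row = "".join("*" if (i, j) in dots else "." for j in range(len(seq_x)))
        lines.append(base + " " + row)
    return "\n".join(lines)


x = "GATCAACTGACGTA"   # horizontal axis on the slide
y = "GTTCAGCTGCGTAC"   # vertical axis on the slide

raw = dot_plot(x, y)                                 # every single-base match
filtered = dot_plot(x, y, window=4, min_matches=3)   # 4-base window, 75% identity
print(render(x, y, raw))
print()
print(render(x, y, filtered))
```

The filtered plot keeps far fewer dots, so diagonal stretches of similarity stand out from the background of chance single-base matches.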

FASTA
Originally developed ~1985 by Lipman and Pearson
Goal: perform fast, approximate local alignments to find sequences in the database that are related to the query sequence
Based on the dot plot idea

FASTA: Step 1
Look for exact matches between words in the query and the test sequence
Words are short
DNA words are usually 6 bases
Protein words are 1 or 2 amino acids
Ktup denotes word length
Use hash tables to locate words quickly

FASTA: Details
Hashing: map a string of characters to an integer, e.g.,

AAA 0
AAC 1
...
TTT 63 (oversimplified)
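One simple way to obtain the codes in this example is to read a DNA word as a base-4 number. The sketch below assumes the digit assignment A=0, C=1, G=2, T=3 (consistent with AAA = 0, AAC = 1, TTT = 63); `ktuple_code` is a hypothetical name, not taken from the FASTA source.

```python
# Assumed digit assignment; it reproduces the codes listed in the example.
DNA_DIGIT = {"A": 0, "C": 1, "G": 2, "T": 3}


def ktuple_code(word: str) -> int:
    """Interpret a DNA word as a base-4 integer."""
    code = 0
    for base in word:
        code = code * 4 + DNA_DIGIT[base]
    return code


print(ktuple_code("AAA"), ktuple_code("AAC"), ktuple_code("TTT"))  # 0 1 63
```

With this encoding a word of length k always maps to a distinct integer in the range 0 to 4^k - 1, so the code can be used directly as a table index.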

Preprocess the database and create a table that stores the locations of each possible k-tuple:
20^k for amino acids (400 if k = 2),
4^k for DNA (4096 if k = 6)

Use the hash codes computed from the query sequence k-tuples for quick lookup
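The table-and-lookup idea can be sketched as follows. For brevity this sketch lets a Python dictionary hash the k-tuple strings directly instead of computing explicit integer codes; `build_index` and `hot_spots` are hypothetical names, and the example sequences are made up for illustration.

```python
from collections import defaultdict


def build_index(sequence, k):
    """Map every k-tuple in `sequence` to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(sequence) - k + 1):
        index[sequence[pos:pos + k]].append(pos)
    return index


def hot_spots(query, index, k):
    """Return (query_pos, target_pos) pairs where a k-tuple matches exactly."""
    spots = []
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            spots.append((i, j))
    return spots


index = build_index("GATCAACTGACGTA", k=3)
print(hot_spots("ACTGA", index, k=3))  # [(0, 5), (1, 6), (2, 7)]
```

All three matches have the same difference between query and target position, which means they lie on one diagonal of the dot plot.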

FASTA

FASTA: Step 2
Find the 10 best diagonal runs (a run is a sequence of nearby hot spots on the same diagonal; a hot spot is an exact word match from Step 1)
Give each hot spot a positive score, and each space between consecutive hot spots a negative score that becomes more negative as the distance grows, similar to affine gap costs in S-W

Each diagonal run is composed of matches (the hot spots themselves) and mismatches (interspot regions) but no indels
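A rough sketch of this scoring scheme is shown below. It groups hot spots by diagonal (query position minus target position), then walks along each diagonal adding a bonus per hot spot and subtracting a distance-dependent penalty between consecutive hot spots. The numeric values of `spot_score` and `gap_penalty` and the name `best_diagonal_runs` are assumptions for illustration, not FASTA's actual parameters.

```python
from collections import defaultdict


def best_diagonal_runs(spots, k, spot_score=4, gap_penalty=1, keep=10):
    """Score runs of hot spots on each diagonal and return the `keep` best.

    spots: list of (query_pos, target_pos) exact k-tuple matches.
    Returns tuples (score, diagonal, first_query_pos, last_query_pos).
    """
    by_diag = defaultdict(list)
    for i, j in spots:
        by_diag[i - j].append(i)

    runs = []
    for diag, starts in by_diag.items():
        starts.sort()
        best = (0, 0, 0)
        score, first, prev = 0, starts[0], None
        for i in starts:
            if prev is not None:
                # Penalty grows with the unmatched distance between hot spots.
                score -= gap_penalty * max(0, i - (prev + k))
            if prev is None or score <= 0:
                score, first = 0, i      # start a fresh run at this hot spot
            score += spot_score
            prev = i
            if score > best[0]:
                best = (score, first, i + k - 1)
        runs.append((best[0], diag, best[1], best[2]))

    runs.sort(reverse=True)
    return runs[:keep]


spots = [(0, 5), (1, 6), (2, 7), (3, 0)]
print(best_diagonal_runs(spots, k=3))  # [(12, -5, 0, 4), (4, 3, 3, 5)]
```

Because a run never leaves its diagonal, it can contain matches and mismatches but no insertions or deletions.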

FASTA: Step 3
Evaluate each diagonal run using an appropriate scoring matrix and find the best-scoring run
Discard runs with low scores (filtration)
The highest-scoring diagonal is reported as init1
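The rescoring step can be sketched as a search for the best-scoring ungapped segment along a diagonal. This simplified version scans the whole diagonal rather than only the region of a run, and it uses an assumed match/mismatch scheme (+5/-4) in place of a real scoring matrix; `rescore_diagonal`, `dna_score`, and `init1_score` are hypothetical names.

```python
def dna_score(a, b, match=5, mismatch=-4):
    """Assumed stand-in for a scoring matrix lookup."""
    return match if a == b else mismatch


def rescore_diagonal(query, target, diag, score=dna_score):
    """Best-scoring ungapped segment on the diagonal diag = i - j.

    Returns (best_score, first_query_pos, last_query_pos).
    """
    i = max(0, diag)
    j = i - diag
    best = (0, i, i - 1)
    running, start = 0, i
    while i < len(query) and j < len(target):
        if running <= 0:
            running, start = 0, i    # drop a non-positive prefix
        running += score(query[i], target[j])
        if running > best[0]:
            best = (running, start, i)
        i += 1
        j += 1
    return best


def init1_score(query, target, diagonals, score=dna_score):
    """Highest rescored value over the candidate diagonals."""
    return max(rescore_diagonal(query, target, d, score)[0] for d in diagonals)


print(rescore_diagonal("ACTGA", "GATCAACTGACGTA", diag=-5))  # (25, 0, 4)
```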

FASTA: Step 4
After all diagonals are found, try to join diagonals by adding gaps
Use a weighted directed acyclic graph between segments, with edges representing those which could be combined using an indel
Find a maximum weight path in this graph; it corresponds to a local alignment, reported as initn
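A sketch of the maximum weight path computation is given below. Each segment is a node weighted by its score, and an edge joins two segments when the first ends before the second starts in both sequences. The flat `join_penalty` of 20 and the function name `initn_score` are assumptions made for illustration.

```python
def initn_score(segments, join_penalty=20):
    """Maximum-weight chain of compatible ungapped segments.

    segments: list of (score, q_start, q_end, t_start, t_end).
    Segment a may precede segment b when a ends before b starts in both
    the query and the target; each join costs `join_penalty`.
    """
    # Sorting by query start gives a topological order of the DAG.
    segs = sorted(segments, key=lambda s: (s[1], s[3]))
    best = []   # best[b] = best chain score ending with segment b
    for b, (score_b, qs_b, _, ts_b, _) in enumerate(segs):
        best_b = score_b
        for a in range(b):
            _, _, qe_a, _, te_a = segs[a]
            if qe_a < qs_b and te_a < ts_b:
                best_b = max(best_b, best[a] + score_b - join_penalty)
        best.append(best_b)
    return max(best, default=0)


segments = [
    (50, 0, 9, 0, 9),
    (40, 15, 24, 18, 27),   # can follow the first segment
    (30, 5, 12, 40, 47),    # overlaps the first segment in the query
]
print(initn_score(segments))  # 70 = 50 + 40 - 20
```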

Adding gaps

FASTA: Step 5
If the score reaches a threshold value, compute an alternative local alignment
Form a band around init1 in the dynamic programming table
Width depends on ktup
Use Smith-Waterman to find the best alignment restricted to that band
Result is called opt
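Restricting Smith-Waterman to a band can be sketched as follows. Only cells within `half_width` diagonals of a central diagonal are filled; everything outside the band is treated as if an alignment could only start there. For brevity the sketch uses a single linear gap penalty rather than affine gap costs, and all score values and the name `banded_smith_waterman` are assumptions.

```python
def banded_smith_waterman(query, target, center_diag, half_width,
                          match=5, mismatch=-4, gap=-8):
    """Best local alignment score using only cells (i, j) with
    |(i - j) - center_diag| <= half_width (i, j are 1-based here)."""
    best = 0
    prev = {}   # band cells of row i - 1, keyed by column
    for i in range(1, len(query) + 1):
        cur = {}
        lo = max(1, i - center_diag - half_width)
        hi = min(len(target), i - center_diag + half_width)
        for j in range(lo, hi + 1):
            sub = match if query[i - 1] == target[j - 1] else mismatch
            cell = max(0,
                       prev.get(j - 1, 0) + sub,   # align the two residues
                       prev.get(j, 0) + gap,       # gap in the target
                       cur.get(j - 1, 0) + gap)    # gap in the query
            cur[j] = cell
            best = max(best, cell)
        prev = cur
    return best


# A one-base insertion in the target: a band of width 1 can absorb it.
print(banded_smith_waterman("ACGTACGT", "ACGTTACGT", center_diag=0, half_width=1))  # 32
print(banded_smith_waterman("ACGTACGT", "ACGTTACGT", center_diag=0, half_width=0))  # 20
```

The band keeps the work proportional to the band width times the sequence length instead of the full table, at the cost of missing alignments that stray outside it.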

FASTA: Final Steps


Rank database sequences according to opt scores
Use the full Smith-Waterman method to align the query sequence against each of the highest-ranking sequences from the database
Perform statistical analysis

!!SEQUENCE_LIST 1.0
(Nucleotide) FASTA of: b2.seq from: 1 to: 693 December 9, 2002 14:02
TO: /u/browns02/Victor/Searchset/*.seq Sequences: 2,050 Symbols: 913,285 Word Size: 6
Searching with both strands of the query.
Scoring matrix: GenRunData:fastadna.cmp
Constant pamfactor used
Gap creation penalty: 16 Gap extension penalty: 4
Histogram Key:
Each histogram symbol represents 4 search set sequences
Each inset symbol represents 1 search set sequences
z-scores computed from opt scores
z-score obs exp
        (=) (*)
< 20      0   0:
  22      0   0:
  24      3   0:=
  26      2   0:=
  28      5   0:==
  30     11   3:*==
  32     19  11:==*==
  34     38  30:=======*==
  36     58  61:===============*
  38     79 100:====================*
  40    134 140:==================================*
  42    167 171:==========================================*
  44    205 189:===============================================*====
  46    209 192:===============================================*=====
  48    177 184:=============================================*

List
The best scores are:                            init1 initn   opt   z-sc  E(1018780)..
SW:PPI1_HUMAN Begin: 1 End: 269
! Q00169 homo sapiens (human). phosph...         1854  1854  1854 2249.3  1.8e-117
SW:PPI1_RABIT Begin: 1 End: 269
! P48738 oryctolagus cuniculus (rabbi...         1840  1840  1840 2232.4  1.6e-116
SW:PPI1_RAT Begin: 1 End: 270
! P16446 rattus norvegicus (rat). pho...         1543  1543  1837 2228.7  2.5e-116
SW:PPI1_MOUSE Begin: 1 End: 270
! P53810 mus musculus (mouse). phosph...         1542  1542  1836 2227.5  2.9e-116
SW:PPI2_HUMAN Begin: 1 End: 270
! P48739 homo sapiens (human). phosph...         1533  1533  1533 1861.0  7.7e-96
SPTREMBL_NEW:BAC25830 Begin: 1 End: 270
! Bac25830 mus musculus (mouse). 10,...          1488  1488  1522 1847.6  4.2e-95
SP_TREMBL:Q8N5W1 Begin: 1 End: 268
! Q8n5w1 homo sapiens (human). simila...         1477  1477  1522 1847.6  4.3e-95
SW:PPI2_RAT Begin: 1 End: 269
! P53812 rattus norvegicus (rat). pho...         1482  1482  1516 1840.4  1.1e-94

Alignments
SCORES Init1: 1515 Initn: 1565 Opt: 1687 z-score: 1158.1 E(): 2.3e-58
>>GB_IN3:DMU09374 (2038 nt)
initn: 1565 init1: 1515 opt: 1687 Z-score: 1158.1 expect(): 2.3e-58
66.2% identity in 875 nt overlap
(83-957:151-1022)

                   60        70        80        90       100       110
u39412.gb_pr CCCTTTGTGGCCGCCATGGACAATTCCGGGAAGGAAGCGGAGGCGATGGCGCTGTTGGCC
                                           || |||  | ||||| |    ||| |||||
DMU09374     AGGCGGACATAAATCCTCGACATGGGTGACAACGAACAGAAGGCGCTCCAACTGATGGCC
                    130       140       150       160       170       180

                  120       130       140       150       160       170
u39412.gb_pr GAGGCGGAGCGCAAAGTGAAGAACTCGCAGTCCTTCTTCTCTGGCCTCTTTGGAGGCTCA
             |||||||||   ||  |||   |   | ||  |||  |       || || ||||| ||
DMU09374     GAGGCGGAGAAGAAGTTGACCCAGCAGAAGGGCTTTCTGGGATCGCTGTTCGGAGGGTCC
                    190       200       210       220       230       240

                  180       190       200       210       220       230
u39412.gb_pr TCCAAAATAGAGGAAGCATGCGAAATCTACGCCAGAGCAGCAAACATGTTCAAAATGGCC
               |||  | ||||| ||   |||   ||||    | || |  |||||||| || ||| ||
DMU09374     AACAAGGTGGAGGACGCCATCGAGTGCTACCAGCGGGCGGGCAACATGTTTAAGATGTCC
                    250       260       270       280       290       300

                  240       250       260       270       280       290
u39412.gb_pr AAAAACTGGAGTGCTGCTGGAAACGCGTTCTGCCAGGCTGCACAGCTGCACCTGCAGCTC
             ||||||||||     |||||  |    |||||| |||| |||   || |||  || |
DMU09374     AAAAACTGGACAAAGGCTGGGGAGTGCTTCTGCGAGGCGGCAACTCTACACGCGCGGGCT
                    310       320       330       340       350       360
