0 RELEASE NOTES
* Introduction
* Blast Family of Programs
* Gaps in Blast
* Blast Query Format
* Blast Report
* Blast Statistics and Scores
* Stand-Alone Blast
* Compiling Blast
* Database Format
* PSI-Blast
* PHI-Blast
* References
* Release History
Introduction
The WWW BLAST server can be accessed through the NCBI home page at
www.ncbi.nlm.nih.gov. Stand-alone BLAST binaries can be obtained from the
NCBI FTP site. See the Stand-Alone Blast section for details.
The BLAST 2.0 release has significant differences from the BLAST 1.4
release. These include significant performance enhancements, the addition of
'gapping' routines and position-specific iterated BLAST (see the PSI-Blast
section), extensive changes to the text report (see the Blast Report
section), and a new database format (see the Stand-Alone Blast section). The
options available and their command-line appearance have also changed
substantially.
The BLAST 2.0 programs are described in a Nucleic Acids Research article.
Please cite this reference if you publish the results of your BLAST query.
Blast Family of Programs
The BLAST family of programs allows all combinations of DNA or protein query
sequences with searches against DNA or protein databases:
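As a rough summary, the five programs named throughout these notes can be
written as a small lookup table. The sketch below is in Python; the table
reflects the commonly documented behaviour of each program, and the names
PROGRAMS and programs_for are illustrative only, not part of BLAST.

```python
# Query and database types for each program, before any translation.
# For blastx, tblastn and tblastx the nucleotide side is translated
# in all six reading frames and compared at the protein level.
PROGRAMS = {
    "blastp":  ("protein", "protein"),
    "blastn":  ("nucleotide", "nucleotide"),
    "blastx":  ("nucleotide", "protein"),
    "tblastn": ("protein", "nucleotide"),
    "tblastx": ("nucleotide", "nucleotide"),
}


def programs_for(query_type, database_type):
    """Return the programs that accept this query/database combination."""
    return [name for name, types in PROGRAMS.items()
            if types == (query_type, database_type)]


print(programs_for("nucleotide", "protein"))
print(programs_for("nucleotide", "nucleotide"))
```

A nucleotide query against a nucleotide database has two candidates: blastn
compares the sequences directly, while tblastx compares their translations.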
Gaps in Blast
The blastn and blastp programs offer fully gapped alignments. blastx and
tblastn have 'in-frame' gapped alignments and use sum statistics to link
alignments from different frames. tblastx provides only ungapped alignments.
Blast Query Format
The sequence sent to the BLAST server should be in FASTA format, described
at http://www.ncbi.nlm.nih.gov/BLAST/fasta.html.
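For readers who want to check a query file before submitting it, here is a
minimal Python sketch of a FASTA reader. It assumes only the basic
convention (a definition line starting with ">" followed by one or more
lines of sequence); the function name read_fasta and the sample record are
illustrative, and the sketch does not validate residue codes.

```python
def read_fasta(lines):
    """Return a list of (definition_line, sequence) pairs."""
    records = []
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            if header is None:
                raise ValueError("sequence data before the first '>' line")
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical two-line record, used only to exercise the reader.
example = [">query1 illustrative definition line", "IKDLLVSSST", "DLDTTLVLVN"]
print(read_fasta(example))
```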
Blast Report
The BLAST report starts with some header information that lists the type of
program (here blastp), the version (here 2.0.1), and a release date. Also
listed are a reference to the BLAST program, the query definition line, and
a summary of the database used.
One-line descriptions of the database matches found are presented next. These
include a database sequence identifier, the corresponding definition line, as
well as the score (in bits) and the statistical significance ('E value') for this
match (please see the section on statistics for an explanation of bits and
significance). Consider the output below, from a gapped blastp comparison of
SwissProt accession P01013 against the SwissProt database.
                                                             High     E
Sequences producing significant alignments:                  Score  Value
The first match, in this case, is the actual query sequence. The identifiers
shown here are all from SwissProt, so they all have 'sp' in the first field,
followed by the accession, and then a locus name. The syntax of these
identifiers is discussed in more detail in the appendices of
ftp://ncbi.nlm.nih.gov/blast/db/README. The definition lines are taken from
the database, with an ellipsis (e.g., for P29508) indicating that the
definition line was too long for the space available.
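The three-field SwissProt identifier described above can be taken apart with
a simple split on the bar character. The Python sketch below handles only
that three-field form; other databases use different layouts (see the
README), and the function name and the locus in the example are
illustrative.

```python
def parse_identifier(identifier):
    """Split a bar-delimited identifier of the form db|accession|locus."""
    fields = identifier.split("|")
    if len(fields) != 3:
        raise ValueError("expected three bar-delimited fields: %r" % identifier)
    database, accession, locus = fields
    return {"database": database, "accession": accession, "locus": locus}


# P01013 is the query accession used in this section; the locus name
# is included for illustration only.
print(parse_identifier("sp|P01013|OVAX_CHICK"))
```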
Ungapped alignments and results from blastx and tblastn will have an
additional column ('N'), displaying the number of different segment pairs
used to produce the alignment, according to the Karlin-Altschul statistics.
Query 2   IKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNS 61
          I+++L  SS D  T +VLVNAI FKG+W+ AF  EDT+ MPF VT+QESKPVQMM
Sbjct 158 IRNVLQPSSVDSQTAMVLVNAIVFKGLWEKAFKDEDTQAMPFRVTEQESKPVQMMYQIGL 217
The last section lists specifics about the database searched as well as
statistical and search parameters used:
Lambda K H
0.317 0.132 0.377
Gapped
Lambda K H
0.255 0.0350 0.190
Matrix: BLOSUM62
Gap Penalties: Existence: 10, Extension: 1
Number of Hits to DB: 8938654
Number of Sequences: 59576
Number of extensions: 335248
Number of successful extensions: 1188
Number of sequences better than 10: 116
Number of HSP's better than 10.0 without gapping: 106
Number of HSP's successfully gapped in prelim test: 10
Number of HSP's that attempted gapping in prelim test: 868
Number of HSP's gapped (non-prelim): 120
length of query: 232
length of database: 21219450
effective HSP length: 52
effective length of query: 180
effective length of database: 18121498
effective search space: -1033097656
T: 11
A: 40
X1: 16 ( 7.3 bits)
X2: 40 (14.7 bits)
X3: 67 (24.6 bits)
S1: 41 (21.7 bits)
S2: 64 (28.4 bits)
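The length figures in this footer are related by simple arithmetic, which
the Python sketch below reproduces from the values printed above. The
relationships are inferred from the numbers themselves, and the negative
"effective search space" appears consistent with the product wrapping
around in a signed 32-bit integer; that reading is an assumption, not
something the report states.

```python
def wrap_int32(value):
    """Interpret value modulo 2**32 as a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value


# Values copied from the report footer above.
query_length = 232
database_length = 21219450
number_of_sequences = 59576
effective_hsp_length = 52

# Each sequence is shortened by the effective HSP length.
effective_query_length = query_length - effective_hsp_length
effective_database_length = (database_length
                             - number_of_sequences * effective_hsp_length)
search_space = effective_query_length * effective_database_length

print(effective_query_length)      # 180
print(effective_database_length)   # 18121498
print(search_space)                # 3261869640
print(wrap_int32(search_space))    # -1033097656
```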
Blast Statistics and Scores
One may judge the results of a BLAST search by two numbers. One is the 'bit'
score, which is defined as:
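A minimal Python sketch of the usual Karlin-Altschul normalization follows:
bit score = (Lambda * S - ln K) / ln 2, where S is the raw score. The
formula is the standard one from the literature rather than a quotation
from these notes, and the function names are illustrative. The E value
helper assumes the common relation E = (search space) * 2^(-bit score).

```python
import math


def bit_score(raw_score, lam, k):
    """Normalize a raw score using the Lambda and K statistical parameters."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits, search_space):
    """Expected number of chance matches scoring at least `bits`."""
    return search_space * 2.0 ** (-bits)


# S1 with the ungapped parameters and S2 with the gapped parameters,
# all taken from the report footer shown in the Blast Report section.
print(round(bit_score(41, 0.317, 0.132), 1))   # 21.7
print(round(bit_score(64, 0.255, 0.0350), 1))  # 28.4
```

Both results agree with the bit values printed next to S1 and S2 in the
report footer.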
Stand-Alone Blast
BLAST binaries are provided for IRIX6.2, Solaris2.6, DEC OSF1 (ver. 4),
LINUX, and Win32 systems. We will attempt to produce binaries for other
platforms upon request.
The source code for BLAST 2.0 is part of the NCBI toolkit. See Compiling
Blast for help in compiling BLAST.
Formatdb
Formatdb should be used to format FASTA databases, both protein and DNA,
for BLAST 2.0. This must be done before blastall or blastpgp
can be run locally. The format of the databases has been changed
substantially from the BLAST 1.4 release. A major improvement in this format
over the old one is that ambiguity information for DNA sequences is now
retrieved from the files produced by formatdb, rather than from the original
FASTA file. The original FASTA file is no longer needed for the BLAST runs.
Formatdb may be obtained with the other BLAST binaries from the executables
directory (see above). The input for formatdb may be either ASN.1 or FASTA.
Use of ASN.1 is advantageous for those sites that might also wish to format
the ASN.1 in different ways, such as a GenBank report. Usage of formatdb may
be obtained by executing formatdb and a dash:
formatdb -
formatdb arguments:
The "-p" option has two different meanings depending on whether the input
database is in FASTA or ASN.1 format. In the case of FASTA, "-p" specifies
the type of the input database. In the case of ASN.1, the option specifies
the type of sequence to be indexed for BLAST.
If the "-o" option is TRUE (and the input database is in FASTA format), then
the database identifiers in the FASTA definition line must follow the
convention described in the appendices of
ftp://ncbi.nlm.nih.gov/blast/db/README.
The "-o" option must be TRUE in the following cases:
1.) ASN.1 is to be produced from blastall or blastpgp.
2.) Master-slave alignments are desired (i.e., the '-m' option with a non-zero
value is used).
3.) The gi's are desired as part of the output (i.e., '-I' is used).
4.) fastacmd is used to fetch sequences from the database by accession or gi.
An input ASN.1 database may be represented in two formats: ASCII text and
binary. The "-b" option, if TRUE, specifies that the input ASN.1 database is
in binary format. The option is ignored for a FASTA input database.
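The four cases that call for "-o" can be folded into a small helper that
assembles a formatdb argument list. This Python sketch only builds the
list and does not run formatdb; the "-i" flag for the input file is assumed
from common formatdb usage (it is not described above), and all function
and parameter names are illustrative.

```python
def needs_parsed_ids(asn1_output=False, master_slave=False,
                     show_gis=False, use_fastacmd=False):
    """True if any of the four cases listed above applies."""
    return asn1_output or master_slave or show_gis or use_fastacmd


def as_flag(value):
    """Render a boolean the way the BLAST tools expect it (T or F)."""
    return "T" if value else "F"


def formatdb_command(fasta_file, protein, **needs):
    """Build an argument list for formatting a FASTA database."""
    return ["formatdb",
            "-i", fasta_file,                        # assumed input-file flag
            "-p", as_flag(protein),                  # T = protein, F = DNA
            "-o", as_flag(needs_parsed_ids(**needs))]


print(formatdb_command("nr", protein=True, use_fastacmd=True))
print(formatdb_command("est", protein=False))
```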
Blastall
Blastall may be used to perform all five flavors of BLAST comparison. One
may obtain the blastall options by executing 'blastall -' (note the dash). A
typical blastall command to perform a blastn search (nucl. vs. nucl.) of a
file called QUERY would be:
blastall -p blastn -d nr -i QUERY -o out.QUERY
The output is placed into the output file out.QUERY and the search is performed
against the 'nr' database. If a protein vs. protein search is desired,
then 'blastn' should be replaced with 'blastp', etc.
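When blastall is driven from a script, it can help to assemble the
arguments in one place. The Python sketch below builds (but does not run)
an argument list for the search just described. The flags "-p", "-i" and
"-o" are assumed from common blastall usage; "-d" and "-F" are the options
described in the argument summary. The function name and defaults are
illustrative.

```python
import shlex


def blastall_command(program, query_file, databases=("nr",), filter_query=True):
    """Build a blastall argument list; output goes to out.<query_file>."""
    return ["blastall",
            "-p", program,                  # blastn, blastp, blastx, ...
            "-d", " ".join(databases),      # several names form one argument
            "-i", query_file,
            "-o", "out." + query_file,
            "-F", "T" if filter_query else "F"]


command = blastall_command("blastn", "QUERY")
print(shlex.join(command))

# Searching two databases as one 'virtual' database (2.0.4 and higher).
print(blastall_command("blastn", "QUERY", databases=("nr", "est")))
```

Passing the list to subprocess.run keeps "nr est" as a single argument
without any manual quoting.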
blastall arguments:
-d Database [String]
default = nr
Version 2.0.4 and higher will accept multiple database names (bracketed by
quotations). An example would be
-d "nr est"
which will search both the nr and est databases, presenting the results as
if one 'virtual' database consisting of all the entries from both were
searched. The statistics are based on the 'virtual' database.
The query should be in FASTA format. If multiple FASTA entries are in the
input file, all queries will be searched.
-F Filter query sequence (DUST with blastn, SEG with others) [T/F]
default = T
Blastpgp
Fastacmd
fastacmd -d nr -s p38398
fastacmd arguments:
-d Database [String]
default = nr
-s Search string (GIs, accessions and locuses may be used, delimited
by comma or space) [String] Optional
-i Input file with GIs/accessions/locuses for batch retrieval [String]
Optional
-a Retrieve duplicated accessions [T/F] Optional
default = F
-l Line length for sequence [Integer] Optional
default = 80
Software requirements
These patches can be obtained by calling SGI customer service or from the web:
http://support.sgi.com/
System recommendations
BLAST uses memory-mapped files (on UNIX and NT systems), so it runs best if
it can read the entire BLAST database into memory and keep using it there.
The resources consumed reading a database into memory can easily outweigh
the cost of a BLAST search, so the memory of a machine is normally more
important than its CPU speed. One should therefore have sufficient memory
for the largest BLAST database one will use, run all the searches against
that database in serial, and only then run queries against another database
in serial. This guarantees that each database is read into memory only
once. As of Aug. 1997 the EST FASTA file is about 500 Meg, which translates
to about 170-200 Meg of BLAST database. At least another 100-200 Meg should
be allowed for memory consumed by the actual BLAST program. All of the
FASTA databases together are about 1.5 Gig; the BLAST databases produced
from them will probably take about another Gig. Allowing 4 Gig of disk
space, to make room for software and output, is probably a good bet.
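The advice to run all searches against one database before moving to the
next amounts to grouping a job list by database. A minimal Python sketch,
assuming each job is a hypothetical (query_file, database) pair:

```python
def order_by_database(jobs):
    """Reorder (query_file, database) pairs so each database is used in one run.

    Databases keep the order in which they first appear, and queries keep
    their relative order within each database.
    """
    grouped = {}
    for query_file, database in jobs:
        grouped.setdefault(database, []).append(query_file)
    return [(query_file, database)
            for database, queries in grouped.items()
            for query_file in queries]


jobs = [("q1", "nr"), ("q2", "est"), ("q3", "nr"), ("q4", "est")]
print(order_by_database(jobs))
```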
Setup
BLAST needs to know where the NCBI data directory and BLAST databases are.
This is specified by the main configuration file for the NCBI toolkit
(".ncbirc" on UNIX systems, ncbi.ini on Windows, analogous names on other
platforms). If BLAST is the ONLY NCBI application that will be used, it is
sufficient to have the following simple configuration file:
[NCBI]
Data=/am/ncbiapdata/data
[BLAST]
BLASTDB=/usr/ncbi/db/disk.blast/blast2
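Because the file uses a plain section/key layout, a script can check it
with Python's standard configparser module before launching a search. This
is only a sketch: it approximates how the file is laid out, not how the
NCBI toolkit itself parses it, and the function name and returned keys are
illustrative.

```python
import configparser

# The configuration shown above, as it might appear in .ncbirc.
NCBIRC_TEXT = """\
[NCBI]
Data=/am/ncbiapdata/data

[BLAST]
BLASTDB=/usr/ncbi/db/disk.blast/blast2
"""


def read_ncbirc(text):
    """Return the data and database directories named in the configuration."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {
        "data_dir": parser["NCBI"]["Data"],
        "blast_db_dir": parser["BLAST"]["BLASTDB"],
    }


print(read_ncbirc(NCBIRC_TEXT))
```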
Low-complexity Filters
BLAST 2.0 uses the dust low-complexity filter for blastn and seg for the
other programs. Both 'dust' and 'seg' are integral parts of the NCBI toolkit
and are accessed automatically.
If one uses "-F T" then normal filtering by seg or dust (for blastn)
occurs (likewise "-F F" means no filtering whatsoever). The seg options
can be changed by using:
-F "S 10 1.0 1.5"
which specifies a window of 10, locut of 1.0 and hicut of 1.5. A coiled-coil
filter, based on the work of Lupas et al. (Science, vol 252, pp. 1162-4
(1991)) and written by John Kuzio (Wilson et al., J Gen Virol, vol. 76,
pp. 2923-32 (1995)), may be invoked by specifying:
-F "C"
There are three parameters for this: window, cutoff (probability of a
coiled coil), and linker (distance between two coiled-coil regions that
should be linked together). These are now set to
window: 22
cutoff: 40.0
linker: 32
One may also run both seg and the coiled-coil filter together by using a ";":
-F "C;S"
BLAST databases
The FASTA files used by the NCBI to produce BLAST databases are available on
the NCBI FTP site in ftp://ncbi.nlm.nih.gov/blast/db/. Please see the README
for details.
Compiling Blast
BLAST is part of the NCBI toolkit and it is necessary to compile the toolkit
to compile BLAST. On DOS or UNIX, read readme.dos or readme.unx
respectively. Both documents are found in the 'make' directory of the NCBI
toolkit and describe how to use scripts to perform the compile. NCBI toolkit
archives may be obtained from
ftp://ncbi.nlm.nih.gov/toolbox/ncbi_tools.
Database Format
The format of the BLAST databases has changed for the 2.0 release and is not
compatible with the databases used in the 1.4 release. The change was made
to eliminate an unpleasant feature of the 1.4 databases: ambiguity
information for nucleotide sequences was not stored in the compressed file,
so the original FASTA file had to be accessed for this information.
This led to significant slow-downs in BLAST comparisons for databases,
such as dbest, that contain a large number of ambiguity characters.
PSI-Blast
PHI-Blast
The syntax for the query sequence is FASTA format as for all other
BLAST queries. The syntax for patterns follows the rules of
PROSITE and is documented in detail below.
The specified pattern is not required to be in the PROSITE list.
Most of the other BLAST flags can be used with PHI-BLAST.
One important exception is that PHI-BLAST requires gapped
alignments (i.e. forbids -g F in the flags) because ungapped
alignments do not make sense for almost all patterns in PROSITE.
in which case the use of the "seedp" option requires the user to
specify the location(s) of the interesting pattern occurrence(s)
in the pattern file. The syntax for how to specify pattern
occurrences is below. When there are multiple pattern occurrences in the
query it may be important to decide how many are of interest because
the E-value for matches is effectively multiplied by the number
of interesting pattern occurrences.
then the first round of searching uses PHI-BLAST and all subsequent
rounds use PSI-BLAST.
In the Web page setting, the user must explicitly invoke one round
at a time, and the PHI-BLAST Web page provides the option to
initiate a PSI-BLAST round with the PHI-BLAST results.
To describe a combined usage, use the term "PHI-PSI-BLAST"
(Pattern-Hit Initiated, Position-Specific Iterated BLAST).
C and Lambda are "constants" depending on the score matrix and the
gap costs.
N is (number of occurrences of pattern in database) * (number of
occurrences of pattern in Q)
e is the base of the natural logarithm.
All other PROSITE codes in the first two columns are allowed,
but only the HI code, described below is relevant to PHI-BLAST.
ID ER_TARGET; PATTERN.
PA [KRHQSA]-[DENQ]-E-L>.
HI (19 22)
HI (201 204)
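To see how a pattern such as the PA line above selects positions in a
sequence, the Python sketch below translates a subset of PROSITE syntax
into a regular expression and reports 1-based (start, end) positions. It
assumes the usual PROSITE conventions ("-" between elements, [..] for
allowed residues, {..} for forbidden residues, x for any residue, (n) or
(n,m) for repeats, "<" and ">" for the sequence ends) and is not
PHI-BLAST's own matcher. The function names and the test sequence are
illustrative.

```python
import re

_ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.strip()
    if pattern.endswith("."):
        pattern = pattern[:-1]           # drop the terminating period
    at_start = pattern.startswith("<")   # anchored at the N-terminus
    at_end = pattern.endswith(">")       # anchored at the C-terminus
    if at_start:
        pattern = pattern[1:]
    if at_end:
        pattern = pattern[:-1]
    parts = []
    for element in pattern.split("-"):
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError("unsupported pattern element: %r" % element)
        residue, low, high = match.groups()
        if residue == "x":
            token = "."                          # any residue
        elif residue.startswith("{"):
            token = "[^" + residue[1:-1] + "]"   # forbidden residues
        else:
            token = residue                      # one residue or an allowed set
        if low is not None:
            token += "{" + low + ("," + high if high else "") + "}"
        parts.append(token)
    return ("^" if at_start else "") + "".join(parts) + ("$" if at_end else "")


def find_pattern(sequence, pattern):
    """Return 1-based inclusive (start, end) pairs, overlaps included."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start(1) + 1, m.end(1)) for m in regex.finditer(sequence)]


print(prosite_to_regex("[KRHQSA]-[DENQ]-E-L>."))
print(find_pattern("MKTAAKDEL", "[KRHQSA]-[DENQ]-E-L>."))
```

Each pair spans four residues for this pattern, the same width as the
ranges on the HI lines above.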
References
Zhang, Zheng, Alejandro A. Schäffer, Webb Miller, Thomas L. Madden,
David J. Lipman, Eugene V. Koonin, and Stephen F. Altschul (1998),
"Protein sequence similarity searches using patterns as seeds", Nucleic
Acids Res. 26:3986-3990.
Release History
Bug fixes:
2.) A problem with very redundant databases and psi-blast was fixed.
3.) A problem with the formatting of the number of identities and positives
was fixed. This affected results on the minus strand only and did not
affect the expect value or scores.
4.) A problem that caused tblastn to core-dump very occasionally was corrected.
6.) A limit on the number of HSP's that were saved (100) was removed.
Enhancements:
1.) PHI-BLAST is included in this release. Please see notes on PHI-BLAST for
details.
2.) SEG has become an integral part of the NCBI toolkit and it is no longer
necessary to install it separately. It is also now supported under non-UNIX
platforms.
If one uses "-F T" then normal filtering by seg or dust (for blastn)
occurs (likewise "-F F" means no filtering whatsoever). The seg options
can be changed by using:
-F "S 10 1.0 1.5"
which specifies a window of 10, locut of 1.0 and hicut of 1.5. One may
also specify coiled-coil filtering by specifying:
-F "C"
There are three parameters for this: window, cutoff (probability of a
coiled coil), and linker (distance between two coiled-coil regions that
should be linked together). These are now set to
window: 22
cutoff: 40.0
linker: 32
One may also run both seg and the coiled-coil filter together by using a ";":
-F "C;S"
4.) BLAST has been changed to reduce the number of redundant hits that a user
may see. This is achieved by keeping track of the number of hits completely
contained in a certain region and eliminating those lower scoring hits that
are redundant with others. This behavior may be controlled with the -K and -L
options:
Setting -K to zero turns off this feature. This is the default only on blastall.
Bug fixes:
1.) There was a problem with the procedure that called the external utility seg.
The need to fix this was obviated by the integration of seg into the toolkit.
This showed up under LINUX.
2.) There was a memory problem with formatdb that has been fixed. This showed up
mostly under NT and LINUX.
3.) A problem with running in multi-processing mode under IRIX6.5 (as a non-root
user) was fixed.
Enhancements:
2.) Multi-database searches no longer require that the -o option be used when
preparing the databases (i.e., with formatdb).
Bugs fixed:
1.) A serious bug with multi-database iterative searches was fixed (thanks to
Steve Brenner for providing an example).
2.) 'lcl' is not formatted in the BLAST report when the sequence identifier
is a local identifier or does not contain a bar ("|").
4.) An unnecessary cast that caused formatdb to fail on Solaris 2.5 machines
if the binary was made under 2.6 was fixed.
6.) Some problems with the sum statistics treatment of the blastx and tblastn
programs reported by D. Rozenbaum were fixed. The number of alignments
involved in a sum group was misrepresented. Also the incorrect length for
the database sequence was used, sometimes causing a slight change in the
value reported.
7.) A problem with blastpgp was fixed that reported incorrect values for
matrices other than BLOSUM62 during iterative searches.
Enhancements:
-d "nr est"
which will search both the nr and est databases, presenting the results as if one
'virtual' database consisting of all the entries from both were searched. The
statistics are based on the 'virtual' database.
3.) The number of identities, positives, and gaps are now printed out before the
alignments for gapped blastx, tblastn, and tblastx. Additionally this feature is
now also enabled for ungapped BLAST.
Bugs fixed:
1.) In blastx, tblastn, and tblastx a codon was incorrectly formatted as a start
codon in some cases.
2.) The last alignment of the last sequence being presented was incorrectly dropped
in some cases. This change could affect the statistical significance of the last
database sequence if the dropped alignment had a lower e-value than any other
alignments from the same database sequence.