Paper No.: 06 Computational Biology

Module : 04 Database Searching- Text Search and BLAST

Principal Investigator: Dr. Vibha Dhawan, Distinguished Fellow and Sr. Director
The Energy and Resources Institute (TERI), New Delhi

Co-Principal Investigator: Prof S K Jain, Professor,


Jamia Hamdard University, New Delhi

Paper Coordinator: Dr. Indira Ghosh, Professor


Jawaharlal Nehru University, New Delhi

Content Writer: Dr. Devapriya Choudhury, Associate Professor


Jawaharlal Nehru University, New Delhi

Paper Reviewer: Dr. Debasisa Mohanty


National Institute of Immunology, New Delhi

Computational Biology
Biotechnology
Database Searching- Text Search and BLAST
Description of Module
Subject Name        Biotechnology
Paper Name          Computational Biology
Module Name/Title   Database Searching - Text Search and BLAST
Module Id           4
Pre-requisites
Objectives
Keywords

Database Searching- Text Search and BLAST

The diversity of biological data types is reflected in the diverse ways in which biological
databases are organized. Many databases specialize in handling one kind of data, be it sequences,
structures or functional annotations, but for optimal use they must also provide other kinds of data,
often in a context-specific manner. This makes the organization of biological databases quite
complex, and the methods of searching through these databases non-trivial. In this tutorial, we will
primarily consider biological sequence databases and the ways to retrieve specific biological
sequences.

A sequence database can be used in many ways. At the simplest level, one may try to retrieve a
particular gene or protein sequence belonging to a particular organism. In such a case,
a simple text-based search on the name of the gene or protein together with the name of the
organism is sufficient to locate the particular sequence entry. In another situation, a user may have a
particular sequence in hand, and may ask the database to retrieve all sequences that are in some way
similar to the query sequence. This situation is quite non-trivial because the notion of similarity
between sequences may be defined differently in different instances. There is also no universally
accepted threshold that decides which sequences can be considered similar. Moreover,
a pre-computed table of similarities between sequences would be difficult to maintain,
given the very large and constantly increasing number of sequences already existing in the
database. Hence, for searches based on sequence similarity, one needs a fast algorithm that can
calculate sequence similarities on the fly, in accordance with the user's definition of similarity.

Before discussing such algorithms in detail it is helpful to first consider certain questions relevant for
sequence searching, which are as follows:

(1) How does one quantify the quality of a sequence search algorithm?
(2) Are gene sequences or the corresponding translated protein sequences equally preferable
for database searching?

To answer the first question, one needs to define two parameters that together describe the
quality of a database search. The parameters are: (a) Sensitivity, defined as the ability to detect true
evolutionary relationships. The more sensitive a search technique, the more distant the evolutionary
relationships it can detect. (b) Specificity, defined as the ability to reject false positives. The
more specific a search technique, the more reliable are the results returned by it. Clearly, a
good search method should be highly sensitive as well as highly specific. Unfortunately, there is
often a trade-off between these parameters: highly sensitive search methods tend to lose out on
specificity and vice versa. An optimum search method attempts to be as sensitive as possible
without taking a significant loss in specificity.
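
The two parameters can be made concrete with a small sketch. It assumes the usual benchmark definitions, sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP), which the text above describes only in words. The function names and the counts are hypothetical and invented purely for illustration.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of true evolutionary relationships that the search detected."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Fraction of unrelated sequences that the search correctly rejected."""
    return true_neg / (true_neg + false_pos)


# Hypothetical benchmark counts for two settings of the same search method.
# These numbers are made up to illustrate the trade-off, not measured.
permissive = {"tp": 90, "fn": 10, "tn": 800, "fp": 200}
strict = {"tp": 60, "fn": 40, "tn": 990, "fp": 10}

for name, c in (("permissive", permissive), ("strict", strict)):
    sens = sensitivity(c["tp"], c["fn"])
    spec = specificity(c["tn"], c["fp"])
    print(f"{name}: sensitivity={sens:.2f} specificity={spec:.2f}")
```

In this made-up example the permissive setting finds more true relatives but also admits more false positives, which is the trade-off described above.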

In considering the answer to the second question posed above, one notes that a nucleic acid
sequence is a string of length n over an alphabet of size 4. Its protein translation product is a string of
length n/3 over an alphabet of size 20. It is easier to get a random match between query and
database substrings when the alphabet size is small. Moreover, the larger the database, the greater the number
of random hits that are possible. Together, these factors make nucleic acid sequence databases more prone to random
hits than databases of their translation products. Protein sequences also tend to be more
conserved than their corresponding gene sequences, which gives greater significance to a possible
match. It would thus appear that it is preferable to search a sequence database on the protein
alphabet, even in cases where the calculated translation product of a nucleic acid sequence does not
really exist (for example, in the case of an rRNA gene) and thus is not biologically meaningful. On the
other hand, translating a nucleic acid sequence entails a certain loss of information, which for some
biological questions may be unacceptable. Hence, it is important to consider the biological purpose
of the search before deciding to run it over the nucleic acid or the protein alphabet.
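
The effect of alphabet size on chance matches can be estimated with a rough sketch. It assumes a deliberately crude model that is not part of the original text: all letters are equally frequent and positions are independent. The function names, the sequence lengths and the database sizes are hypothetical.

```python
def exact_match_probability(alphabet_size: int, word_length: int) -> float:
    """Chance that two random words of the given length are identical,
    assuming uniform and independent letters (a simplifying assumption)."""
    return (1.0 / alphabet_size) ** word_length


def expected_random_word_matches(alphabet_size: int, word_length: int,
                                 query_len: int, db_len: int) -> float:
    """Expected number of identical word pairs between a random query
    and a random database of the given lengths."""
    word_pairs = (query_len - word_length + 1) * (db_len - word_length + 1)
    return word_pairs * exact_match_probability(alphabet_size, word_length)


# A hypothetical 900-nucleotide gene and its 300-residue translation,
# each compared against a database of one million letters.
dna = expected_random_word_matches(4, 6, query_len=900, db_len=1_000_000)
protein = expected_random_word_matches(20, 6, query_len=300, db_len=1_000_000)
print(f"expected chance 6-letter matches, DNA alphabet:     {dna:.3g}")
print(f"expected chance 6-letter matches, protein alphabet: {protein:.3g}")
```

Under this model the expected number of chance word matches also grows in proportion to the database length, in line with the remark about large databases.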

One of the very first algorithms used for sequence-based searching was the FASTA algorithm [1].
Essentially, this algorithm stored a hash table of small substrings (between 2 and 8 letters long) from all the
sequences in the database. The query was matched to sequences in the database through a series
of hash table lookups, followed by a graphical post-processing algorithm that constructed the final
output. Although the FASTA algorithm was highly efficient for its time, it is now nearly obsolete and
has been replaced by another algorithm called BLAST [2] and its derivatives. We will discuss the BLAST
algorithm and its implementation in this tutorial.
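
The hash table idea can be illustrated with a minimal sketch. This is not the FASTA implementation; it only shows how an index of short substrings allows query words to be located by lookup rather than by scanning. The function names and the toy sequences are hypothetical.

```python
from collections import defaultdict


def build_word_index(database, k):
    """Map every k-letter word to the (sequence name, position) pairs
    at which it occurs in the database."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index


def lookup_query_words(query, index, k):
    """Return (sequence name, query position, database position) for every
    query word that also occurs in the database."""
    hits = []
    for qpos in range(len(query) - k + 1):
        for name, dpos in index.get(query[qpos:qpos + k], []):
            hits.append((name, qpos, dpos))
    return hits


# Toy protein database, invented for illustration.
toy_db = {"seq1": "MKVLAAGIV", "seq2": "GGSMKVL"}
toy_index = build_word_index(toy_db, k=3)
for hit in lookup_query_words("AMKVLT", toy_index, k=3):
    print(hit)
```

The index is built once and reused for every query, so the cost of a search depends mainly on the query length and the number of matching words.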

BLAST stands for Basic Local Alignment Search Tool. As the name suggests, the algorithm
finds the best local alignments between a query sequence and a set of database sequences. In the
earliest version of this algorithm, only ungapped alignments were returned, but this limitation has
since been removed. The key achievement of BLAST, other than its immense speed, is that it
allows the use of a user-specified substitution matrix right from the earliest steps of the algorithm;
hence it can be tuned to any specific notion of sequence similarity that the user may
have. Given that the efficiency of sequence search algorithms depends strongly upon the
sequence alphabet used, there are specific implementations of the BLAST algorithm for each combination of
alphabets used for the query and the database, as given in Table 1.

Table 1: The BLAST family of programs

Program   Query     Database   Comparison
blastn    DNA       DNA        DNA level
blastp    Protein   Protein    Protein level
blastx    DNA       Protein    Protein level
tblastn   Protein   DNA        Protein level
tblastx   DNA       DNA        Protein level

Programs in which the query or the database is in the form of DNA, but the sequence comparison is
done at the protein level, automatically carry out the appropriate DNA → protein sequence
translation.
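
The DNA → protein translation step can be sketched as follows. The sketch assumes the standard genetic code and translates all six reading frames (three on each strand), which is how translated searches are generally described. The function names are hypothetical, and incomplete or ambiguous codons are simply written as "X".

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons from position 0; '*' marks a stop codon."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna: str) -> dict:
    """Translate both strands in all three reading frames."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for frame, protein in six_frame_translations("ATGGCCTAA").items():
    print(frame, protein)
```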

Before attempting to understand the BLAST algorithm, certain technical terms must be defined.
These are given below:

• Segment Pair: A pair of equal-length substrings within strings S1 and S2, which are aligned
  without gaps.
• Locally Maximal Segment: A segment pair whose alignment score (without gaps) cannot be
  increased by extending or shortening it. One should note that most similarity matrices
  contain positive and negative values, hence an arbitrary extension of an alignment
  is likely to result in a reduction of the alignment score.
• Maximal Segment Pair (MSP): The segment pair within strings S1 and S2 that has the
  highest score among all Locally Maximal Segments.
• High Scoring Pair (HSP): When a database is being searched, BLAST attempts to find MSPs
  with a score higher than a threshold score S. These MSPs are called HSPs. BLAST chooses the
  threshold S in such a way that random hits are minimized.

[1] Lipman, D.J. and Pearson, W.R. (1985) Rapid and sensitive protein similarity searches. Science 227, 1435-1441.
    Pearson, W.R. and Lipman, D.J. (1988) Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci.
    USA 85, 2444-2448.
[2] Altschul, S.F. et al. (1990) Basic Local Alignment Search Tool. J. Mol. Biol. 215, 403-410.

The BLAST algorithm can now be described in a stepwise fashion as follows:
• Given a length parameter w and a threshold score t, find all substrings of length w in the
  database that align with substrings of the query sequence with a gapless alignment
  score greater than t. Such words are called hits.
    ◦ The recommended word size w is 3-5 for amino acids and ~12 for nucleotides.
    ◦ If t is increased, the total number of hits is reduced and the program runs faster.
      However, a reduction of t allows more distant relationships to be discovered.
• Once a hit is found, it is extended in both directions until a locally maximal segment
  is found. Those locally maximal segments that have an alignment score greater than S
  are called HSPs.
• After BLAST has found all HSPs from the sequences in the database, the alignments
  corresponding to the HSPs are stored in a special data structure.
• A formatter then accepts this data structure and formats the output in one of several
  ways as per user requirement.
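
The seeding and extension steps above can be sketched in a few lines. This is a simplified illustration and not the BLAST implementation. It makes several assumptions that are not in the text: seeds are exact word matches rather than all words scoring above t, scoring uses a hypothetical match/mismatch scheme instead of a substitution matrix, and extension stops once the running score falls more than `x_drop` below the best score seen so far. All names and parameter values are invented.

```python
def score_pair(a, b, match=2, mismatch=-3):
    """Hypothetical scoring scheme standing in for a substitution matrix."""
    return match if a == b else mismatch


def extend(query, subject, q, s, step, x_drop):
    """Extend without gaps from (q, s) in one direction.
    Returns the best score gained and the length at which it was reached."""
    best = running = 0
    best_len = length = 0
    while 0 <= q < len(query) and 0 <= s < len(subject):
        running += score_pair(query[q], subject[s])
        length += 1
        if running > best:
            best, best_len = running, length
        elif best - running > x_drop:
            break  # score has dropped too far; stop extending
        q += step
        s += step
    return best, best_len


def find_hsps(query, subject, w=3, threshold_s=10, x_drop=5):
    """Return sorted (score, query start, subject start, length) tuples
    for extended seeds whose score is greater than threshold_s."""
    words = {}
    for i in range(len(query) - w + 1):
        words.setdefault(query[i:i + w], []).append(i)
    hsps = set()
    for j in range(len(subject) - w + 1):
        for i in words.get(subject[j:j + w], []):
            seed = sum(score_pair(query[i + k], subject[j + k]) for k in range(w))
            right, r_len = extend(query, subject, i + w, j + w, +1, x_drop)
            left, l_len = extend(query, subject, i - 1, j - 1, -1, x_drop)
            total = seed + left + right
            if total > threshold_s:
                hsps.add((total, i - l_len, j - l_len, w + l_len + r_len))
    return sorted(hsps, reverse=True)


print(find_hsps("TTGACCAGTA", "CCGACCAGTCC"))
```

Several seeds on the same diagonal extend to the same segment, so the results are collected in a set to report each segment once.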

Because BLAST can produce its output in several different ways, it is necessary to
understand the various output formats, so that the user can choose the best format
for the biological purpose at hand. The formats that are currently available are
described below:

• The traditional report is a descriptive and highly detailed form of output that can be
  easily read and interpreted by a human user. However, it is technically more difficult
  to design algorithms for parsing this form of the output, and hence it may not be
  suitable if the intention is to parse hundreds of BLAST outputs by computer.
• The hit table form is a spreadsheet-like output that is easily parsed by simple
  computer programs and easily readable by humans. On the negative side, the hit
  table output carries somewhat less information than the other output forms.
• Structured output in formats such as ASN.1 or XML is highly
  information-rich and can be easily parsed by a computer. On the other hand, these
  forms of the output are more difficult for a human user to understand.

Figure 1 shows the header portion of a typical traditional report. As can be seen from the
figure, the header reports the particular version of the program used in the BLAST search.
The query sequence and the target database are identified. The header also provides a unique
Request ID (RID) that can be used to retrieve a copy of the output at some later date. Links
for Help and additional relevant information are also provided.

Figure 1. The header portion of the traditional BLAST output

The header is followed by a pictorial description (Figure 2) where coloured lines denote the
HSPs. The length of each line indicates the length of the alignment between the HSP and the
query sequence, and the colour indicates the alignment score. A colour key at the top
relates each colour to a range of alignment scores. Hovering the mouse over a line on the
web page provides a link to the exact alignment between the HSP and the query sequence.
Further down in the output one finds a series of hyperlinks; clicking on one of them displays
the exact alignment. Associated with the hyperlinks are two scores, viz., the bit-score and
the e-value, which provide a quantitative description of the quality of the alignment. A
detailed description of the bit-score and the e-value and their interpretation is given below.

Other than the traditional output, the hit-table form is also human-readable. This form of
the output resembles a spreadsheet, with structured columns describing the HSPs and the
quality of their alignment with the query sequence. As was mentioned, the advantage of the
hit-table form is that it is easily readable by humans and at the same time can be easily
parsed by an ordinary spreadsheet application. The disadvantage is that detailed
information, like the exact alignment between the query sequence and the HSPs, cannot be
provided.
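
The ease of parsing the hit table can be seen in a short sketch. It assumes the 12-column tab-separated layout commonly produced by the NCBI BLAST+ command-line programs with `-outfmt 6`; other hit-table variants may use different columns. The function name is hypothetical, and the sample line uses invented identifiers and values rather than a real search result.

```python
import csv
import io

# Default column order assumed for the tab-separated hit table.
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]
FLOAT_FIELDS = ("pident", "evalue", "bitscore")
INT_FIELDS = ("length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send")


def parse_hit_table(text):
    """Parse a tab-separated hit table into a list of dictionaries,
    skipping blank lines and comment lines that start with '#'."""
    hits = []
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        hit = dict(zip(COLUMNS, row))
        for key in FLOAT_FIELDS:
            hit[key] = float(hit[key])
        for key in INT_FIELDS:
            hit[key] = int(hit[key])
        hits.append(hit)
    return hits


# Invented example line, for illustration only.
sample = "query1\tsubjA\t98.50\t200\t3\t0\t1\t200\t11\t210\t1e-100\t365\n"
for hit in parse_hit_table(sample):
    print(hit["sseqid"], hit["evalue"], hit["bitscore"])
```

Note that the parsed record contains coordinates and scores but not the aligned residues themselves, which reflects the limitation mentioned above.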

Figure 2. Traditional output in BLAST. Pictorial description of HSPs and their alignment with
the query sequence.

An important problem in sequence searching relates to the statistical significance of the
alignment between the query sequence and an HSP. Since both the query and the strings
in the database are taken from a finite alphabet, there is always the possibility that small
sections of the query and database sequences match by chance. Therefore, the alignment score
between even a pair of randomly generated sequences is unlikely to be exactly zero,
provided the sequences are sufficiently long. Hence, there is a need for some sort of
statistical treatment that can distinguish significant alignments from those that are likely to
be generated simply by the juxtaposition of two randomly generated strings.
One way to judge the significance of a particular alignment empirically is to compare its
alignment score with the scores obtained from a set of random alignments of sequences of
the same length (usually obtained by repeated random shuffling of the sequences used for
the alignment).
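
The shuffling procedure can be sketched as follows. The sketch assumes a very simple alignment score (the best ungapped local segment under a hypothetical +1/-1 scheme) in place of a real alignment program, and it reports the standardized score (S − μ)/σ, where μ and σ are the mean and standard deviation of the shuffled scores. The function names, the number of shuffles and the random seed are arbitrary choices.

```python
import random
import statistics


def best_ungapped_score(a, b, match=1, mismatch=-1):
    """Best-scoring gapless local segment over all relative offsets of a and b."""
    best = 0
    for offset in range(-(len(a) - 1), len(b)):
        running = 0
        for i in range(len(a)):
            j = i + offset
            if 0 <= j < len(b):
                step = match if a[i] == b[j] else mismatch
                running = max(0, running + step)  # restart when the score goes negative
                best = max(best, running)
    return best


def shuffled_score_stats(a, b, n_shuffles=200, seed=0):
    """Compare the real score with scores against shuffled copies of b.
    Returns (real score, mean, standard deviation, standardized score)."""
    rng = random.Random(seed)
    real = best_ungapped_score(a, b)
    letters = list(b)
    scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # same length and composition, random order
        scores.append(best_ungapped_score(a, "".join(letters)))
    mu = statistics.mean(scores)
    sigma = statistics.pstdev(scores)
    z = (real - mu) / sigma if sigma > 0 else float("inf")
    return real, mu, sigma, z


seq = "MKVLAAGIVGLSTEHRW"
print(shuffled_score_stats(seq, seq))
```

Shuffling preserves the length and composition of the sequence, so the comparison isolates the contribution of residue order to the score.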

One can now define what is known as the z-score as follows:
$$z = \frac{S - \mu}{\sigma}$$
where S is the alignment score of the real sequence pair, while $\mu$ and $\sigma$ are the mean and
standard deviation of alignment scores obtained from randomly shuffled sequences.
Unfortunately, the z-score is not a generally valid measure of statistical significance. It is only
meaningful if $\mu$ and $\sigma$ are parameters of a Gaussian distribution, which is generally not the
case with sequence alignment scores. Hence, there is no threshold value of z, justified by
theory, beyond which an alignment can be considered statistically significant. However, it
was shown that, in a database search, the expected number of HSPs with an alignment score
of at least S is given by:
$$E = K m n e^{-\lambda S}$$
where m and n are the lengths of the query and of the database being searched, while K and
$\lambda$ are parameters of the distribution of the scores.
Here E is known as the e-value of the score S. The raw score S, however, depends on the
scoring system used and needs to be normalized to the bit-score S' reported by BLAST. The
normalization is given by:
$$S' = \frac{\lambda S - \ln K}{\ln 2}$$
In terms of the bit-score the e-value becomes:
$$E = m n 2^{-S'}$$
The probability of finding at least one HSP with a score of at least S is now given by:
$$p = 1 - e^{-E}$$

For example, if the value of E is 0.01, then one can expect only one random hit in 100
independent searches. On the other hand, if E = 5, then in each search we
can expect five random hits. By setting a value of E and solving for S, one can calculate a
threshold score above which hits are reported. The chosen value of E affects both the
sensitivity and the specificity of the search. A lower value of E will increase specificity by
decreasing the error rate, but will decrease the sensitivity as well. Typical values of E are in
the range 0.001 to 0.01.
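
The relations between raw score, bit-score, e-value and p-value can be checked numerically with a short sketch. The values of K and λ, the raw score and the lengths used below are purely illustrative assumptions; in practice K and λ depend on the scoring system, and the function names are hypothetical.

```python
import math


def e_value(raw_score, m, n, K, lam):
    """E = K * m * n * exp(-lambda * S)."""
    return K * m * n * math.exp(-lam * raw_score)


def bit_score(raw_score, K, lam):
    """S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def e_value_from_bits(bits, m, n):
    """E = m * n * 2 ** (-S')."""
    return m * n * 2.0 ** (-bits)


def p_value(e):
    """p = 1 - exp(-E), computed with expm1 for accuracy when E is small."""
    return -math.expm1(-e)


def raw_score_threshold(e_cutoff, m, n, K, lam):
    """Solve E = K * m * n * exp(-lambda * S) for S at a chosen E cutoff."""
    return math.log(K * m * n / e_cutoff) / lam


# Illustrative values only: a 300-letter query, a database of ten million letters.
K, lam, m, n = 0.1, 0.3, 300, 10_000_000
S = 100
bits = bit_score(S, K, lam)
print("bit-score:", round(bits, 2))
print("E from raw score:", e_value(S, m, n, K, lam))
print("E from bit-score:", e_value_from_bits(bits, m, n))
print("p-value:", p_value(e_value(S, m, n, K, lam)))
print("raw score needed for E = 0.01:", raw_score_threshold(0.01, m, n, K, lam))
```

The two e-value computations agree because substituting the bit-score into the second formula recovers the first. For small E, the p-value is nearly equal to E.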

The above description refers only to the basic form of the BLAST algorithm
as it was first developed, which is still useful for routine searches. Later developments include
more sensitive algorithms like PSI-BLAST, and algorithms used for special-purpose searches
like RPS-BLAST. Some of these algorithms will be discussed in a later module.
