QUASAR
Report by
Lukas Naef
lnaef@student.ethz.ch
16 June 2008
Q-gram Alignment based on Suffix Array
Index
1 Introduction
1.1 QUASAR
1.2 Abstract
1.3 Similarity and edit distance
2 The QUASAR algorithm
2.1 q-gram filtering
2.2 Blocks
2.3 Hitlist
3 Example
4.1 Complexity
4.2 Experiment
4.3 Conclusion
5 Appendix
5.1 Introduction to suffix array
5.2 References
1 Introduction
1.1 QUASAR…
- was introduced in a paper by Stefan Burkhardt, Andreas Crauser, Paolo Ferragina, Hans-Peter Lenhof, Eric Rivals and Martin Vingron in 1999
- is used to quickly detect sequences with strong local similarity in settings where many searches are conducted on one database
1.2 Abstract
In molecular biology, searching a DNA database for similarity to a query sequence has become a basic operation. Today even the fast algorithms reach their limits when faced with all-versus-all comparisons of large databases. In this report I discuss the search algorithm QUASAR (Q-gram Alignment based on Suffix ARray), which was developed to quickly detect sequences with strong similarity¹ to the query in a setting where many searches are conducted on one database. Have a look at the following example:
on database D = …AGCTATTAACGTCA…
In this case QUASAR will return the subsequence ATTAAC as a result, because it is a string that is not equal to the search query but strongly similar to it. The statement "strong similarity" is examined more closely later in this document.
The algorithm uses q-tuple filtering (see 2.1 q-gram filtering) based on a suffix-array index (see 5.1 Introduction to suffix array). This means that QUASAR first filters all possible positions of similar strings in the database and then passes the positions with a high probability of strong similarity to an algorithm that solves the similarity problem exactly. In QUASAR, the well-known algorithm BLAST generates the exact result after the filtering process.
Tests showed that, when used for searches of this kind, QUASAR is an order of magnitude faster than running BLAST on the whole unfiltered task (see section 4.2 Experiment).
The document is organized as follows. After a short introduction to string similarity and the big picture of how QUASAR works, we go deeper into the algorithm (see 2 The QUASAR algorithm). To make this concrete, we then walk through an example (see 3 Example). At the end of the report I briefly go over the results given by the authors of the paper and give my own view of the QUASAR algorithm.
¹ strong similarity: see section 1.3 Similarity and edit distance
1.3 Similarity and edit distance
At this stage we introduce the edit distance². The edit distance is the minimal number of insertions, deletions and substitutions of single letters needed to transform string s1 into s2.
example
s1 ACTAT
[s2 and the changed positions were shown in a figure]
Two changes is the minimal number of edits that transform s1 into s2, thus the edit distance for this example is 2.
example
s1 ATTGC
[s2 and the changed positions were shown in a figure]
Three changes is the minimal number of edits that transform s1 into s2, thus the edit distance for this example is 3.
² This edit distance is also known as the Levenshtein distance
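This minimal number of edits can be computed with the standard dynamic-programming recurrence. The sketch below is only an illustration of the distance being discussed, not part of QUASAR itself:

```python
def edit_distance(s1, s2):
    """Levenshtein distance: minimal number of insertions, deletions
    and substitutions of single letters turning s1 into s2."""
    m, n = len(s1), len(s2)
    prev = list(range(n + 1))  # distances from the empty prefix of s1
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution or match
        prev = curr
    return prev[n]
```

For instance, edit_distance("kitten", "sitting") returns 3, matching the classic textbook example.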
2 The QUASAR algorithm
example
search query S = [shown in the original figure]
database D = AAAGGGGTTCCCCCTAAACACTGACGAACTGACGAAGTCCAAAAGG
TTTTAACCCCTTTAAAGGGCGACTTGACACCATTGAGAACCCAAAA
GGGGTTTCCCTTTGGGCCCGGAAGGAATTAATTCCBBBAAAAAACC
Step 1: The filter selects subsequences of the database with a high probability of strong similarity to the query; in the original figure these regions of D were highlighted.
2.1 q-gram filtering
A q-gram is simply a substring of length q. To get an impression of the similarity between two strings, we are especially interested in the q-grams the two strings share.
example
we define q = 2
w1 = BRONSON, w2 = BROSNAN
As we can see, the two words share the two q-grams BR and RO.
To approximate the edit distance we are especially interested in the relation between the number of shared q-grams and the edit distance of the two strings. The following lemma gives a threshold t, the minimal number of q-grams that two strings P and S of a given length w must share for a given edit distance k.
lemma 1
Let P and S be strings of given length w and given edit distance k. Then P and S share at least t = w - q + 1 - k·q common q-grams [JU91].
example
we define q = 3
[the two strings of this example were shown in a figure]
→ t = w - q + 1 - k·q = 3
It is important to see that t gives only the minimal number of shared q-grams between P and S. Have a look at the following example, very similar to the one above:
example
[the strings were shown in a figure] The threshold is again t = 3, but there are 5 shared q-grams: ACA, CAC, ACT, CTT, TTA.
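Lemma 1 and the shared-q-gram count are easy to check mechanically. The following sketch illustrates them; the function names are my own, and shared q-grams are counted as a multiset:

```python
from collections import Counter

def qgrams(s, q):
    """All overlapping substrings of length q of s."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def shared_qgrams(p, s, q):
    """Number of q-grams shared between p and s, counted as a multiset
    (the & operator on Counter takes the element-wise minimum)."""
    return sum((Counter(qgrams(p, q)) & Counter(qgrams(s, q))).values())

def threshold(w, q, k):
    """Minimal number of shared q-grams for two strings of length w
    within edit distance k, as stated in lemma 1."""
    return w - q + 1 - k * q
```

With w=8, q=3 and k=1 this reproduces the threshold t=3 from the example above, and for BRONSON/BROSNAN with q=2 it counts the two shared q-grams BR and RO.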
The idea is to go through all subsequences of length w in S and, for each of these windows, check all subsequences of length w in the database for the number of shared q-grams. If we find a subsequence of length w with at least t (the threshold of the lemma) shared q-grams, we have found such an approximate match. We start with the first window of size w=8 in the search query, S1,w³. In the example above we find a window CACTGAGG in D with shared q-grams {CAC, TGA, GAG}; the number of shared q-grams is equal to t=3. So this subsequence is a possible match for edit distance k=1, even though, as we can see, in this situation it is not an actual match; the filter only produces candidates. In the second step of the filtering process we would pass this subsequence to the matching algorithm BLAST.
The problem now is that for each window position in S (S1,w, S2,w+1, …) we have to check all approximate matches in the database D. To do this a little smarter, we apply another approximation, as shown in the following section.
³ Sa,b means the substring of S starting at the character at position a and ending at position b. For example, S = "abcdefg" → S2,4 = "bcd".
2.2 Blocks
A main drawback of this approach is the additional space we allocate for the counters, so the authors decided to combine several substrings of length w into one block and add a counter for each block.
For this reason we partition the database into non-overlapping blocks of fixed size b and add a counter to each block. While this decreases the memory usage, it also decreases the accuracy.
example
In the second block we count 5 shared q-grams {CAC, ACA, TGA, TGA, GAG}
between the search query and the database.
Because the blocks are non-overlapping, we might miss q-grams that cross the block borders. So we use a second array of blocks, shifted by b/2. If we had looked for the q-gram TAA, we would have missed it (it lies on the border between the first and second block). For the same reason we have to choose b ≥ 2w. See the next example to get the whole picture of partitioning and counting:
example
In this case we would pass the two blocks with counters 3 and 5 to BLAST. Because of the lemma, a block with counter < t (in our example t=3) cannot contain a substring of length w with edit distance k=1 to S1,w, so we do not pass it to BLAST.
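The block-counting step for one window can be sketched as follows. This is a simplification: it handles only a single window, uses one unshifted block array, ignores q-gram multiplicity in the window, and scans D directly instead of using the hitlist of section 2.3; the function name is my own:

```python
def candidate_blocks(S, D, w, q, b, t):
    """For the window S[0:w], count how many of its q-grams occur in
    each non-overlapping block of size b of the database D, and return
    the indices of the blocks whose counter reaches the threshold t."""
    window_qgrams = {S[i:i + q] for i in range(w - q + 1)}
    counters = [0] * ((len(D) + b - 1) // b)
    for pos in range(len(D) - q + 1):
        if D[pos:pos + q] in window_qgrams:
            counters[pos // b] += 1  # block of the q-gram's start position
    return [i for i, c in enumerate(counters) if c >= t]
```

Only the returned candidate blocks would then be handed to BLAST; the second, shifted block array of the text would be maintained in exactly the same way.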
2.3 Hitlist
There is still an efficiency problem, because we basically take each q-gram from the window S1,w and, for each q-gram, go through the whole database increasing the counters of every block that contains this q-gram. To skip this scan over the whole database for each q-gram, we build a hitlist, an index that yields all positions of a given q-gram in the database.
This hitlist is a list of all possible q-grams, each pointing to the first occurrence of that q-gram in the suffix array⁴ of the database.
example
This example shows the hitlist pointing into the suffix array of a database. The size of the q-grams in this example is defined as q=4.
To get the positions of a q-gram, for example AAAA, we go to the hitlist entry AAAA and follow its pointer into the suffix array, which gives the first occurrence of AAAA. Afterwards we walk down the suffix array until we reach the entry that the next hitlist pointer points to. This technique lets us find the first position of a q-gram in constant time; the number of occurrences, however, can still be large. Imagine that the whole database consists of a single repeated character: then increasing the counters for one q-gram can take O(n) time, where n is the size of the database. The other important thing about the hitlist is its size, which is equal to the number of symbols in the alphabet to the power of q (the size of the q-grams). In the example above this already makes 4⁴ = 256 entries in the list, and this is just a small example.
⁴ See the appendix, section 5.1 Introduction to suffix array
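The hitlist layout over the suffix array can be sketched as follows. The suffix array here is built naively by sorting the suffixes, not with the efficient construction of [MM93], and all names are my own:

```python
def build_suffix_array(D):
    """Start positions of the suffixes of D, sorted lexicographically
    (naive O(n^2 log n) construction, for illustration only)."""
    return sorted(range(len(D)), key=lambda i: D[i:])

def build_hitlist(D, sa, q):
    """Map each q-gram occurring in D to its first index in the suffix
    array; since the array is sorted, all occurrences of a q-gram form
    a contiguous range of entries."""
    hitlist = {}
    for idx, pos in enumerate(sa):
        g = D[pos:pos + q]
        if len(g) == q and g not in hitlist:
            hitlist[g] = idx
    return hitlist

def occurrences(D, sa, hitlist, g):
    """All start positions of the q-gram g in D, found by entering the
    suffix array at the hitlist pointer and walking down the range."""
    q = len(g)
    start = hitlist.get(g)
    if start is None:
        return []
    end = start
    while end < len(sa) and D[sa[end]:sa[end] + q] == g:
        end += 1
    return sorted(sa[start:end])
```

Entering the array via the hitlist pointer is the constant-time lookup described above; the walk down the range then reports each occurrence.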
In order to avoid redoing the whole procedure for S2,w+1, S3,w+2 and so on, we can just handle the difference from the previous window. For example, to check the search window S2,w+1 we take the result for S1,w and handle only the differences: the first q-gram S1,q is no longer contained in S2,w+1, and the q-gram Sw-q+2,w+1 is new. So we decrease the counters of the blocks which contained S1,q and did not reach the threshold, and increase the counters of the blocks which contain the new q-gram Sw-q+2,w+1.
In this way we go through the whole of S, shifting the window of length w one position at a time. At the end, all blocks whose counter reached at least t are approximate matches. To get the alignment we use BLAST on the corresponding blocks.
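The counter update for a single window shift can be sketched as follows. This is a simplification with one block array, where a counter is frozen once it reaches the threshold t so that the block stays marked as an approximate match; the names are my own:

```python
def slide_update(counters, t, blocks_old, blocks_new):
    """Shift the window by one position: decrement the counters of the
    blocks containing the dropped q-gram (unless already frozen at the
    threshold t) and increment those containing the entering q-gram."""
    for blk in blocks_old:       # blocks containing the q-gram that left
        if counters[blk] < t:    # counters at >= t stay frozen as matches
            counters[blk] -= 1
    for blk in blocks_new:       # blocks containing the new q-gram
        counters[blk] += 1
    return counters
```

The block lists for the dropped and the entering q-gram would come from the hitlist of section 2.3, so each shift costs time proportional to the number of occurrences of two q-grams only.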
3 Example
We first list all suffixes of the database, then order the list lexicographically. We also generate the hitlist for all possible q-grams, each pointing to the first occurrence of the desired q-gram:
After we have adjusted the counters, we shift the window one position to the right (S2,w+1 = ACATGAGA). Now we look at the first q-gram, which we no longer need: CAC. We therefore decrease the counters of all blocks which contain CAC and have not reached the threshold t. Where the threshold t has been reached, as in b2 and b3, we do not decrease the counter, in order to keep the block marked as an approximate match. Then we increase the counters for the new q-gram AGA (b2, b3, b4, b7).
Now we shift the window one position to the right and repeat the procedure.
At the end we see that we would pass the positions of b2, b3 and b4 to the local alignment algorithm BLAST. In b3 there is a string CACCTGAGAA that is very similar to S; this one will be returned as a result.
4.1 Complexity
The construction of the suffix array and the precomputation of the hitlist can be done in O(|D|·log|D|) time [MM93]. Looking up a specific q-gram takes constant time, but the number of reported occurrences can be linear in |D|. There are O(|S|) q-grams, so the whole filtering approach takes O(|D|·|S|). At the end BLAST takes another O(c·b·|S|), where c is the number of blocks reaching the threshold t and b is the block size.
4.2 Experiment
The authors ran a test with 1000 queries on the same database, comparing QUASAR and BLAST in terms of performance and sensitivity. The loading of the data was included in the BLAST measurement but has an impact of less than 1% on the duration. For QUASAR the loading time was not included in the measurement (30 seconds for the first and 114 seconds for the second row; see the table below).
A few results from the initial implementation: QUASAR was run with w=50, q=11, b=1024 bp and t chosen such that windows with at most 6% differences are found (in this case edit distance at most 3).
They used one processor (Sun UltraSPARC II, 333 MHz) of a dedicated Sun Enterprise 10000 with 4 GB of main memory and a local disk array.
Measurements for different block sizes were also taken:
4.3 Conclusion
This algorithm seems to be an efficient way to filter the data first and thereby speed up the BLAST algorithm. But the method also has its drawbacks: for example, if the suffix array does not fit into main memory and has to be stored on disk, the algorithm slows down dramatically.
Furthermore, we have to mention that the measurements above exclude the preparation phase of QUASAR. For the first example with the 73.5 MB of data the preparation needed 30 seconds, and for the second example 114 seconds! This explains why the algorithm is designed for many queries on the same database.
Somehow the paper does not mention the space needed for the hitlist at all. Let's take the data from the example, with an alphabet of size 4 {A,C,G,T} and q-gram size q=11. This leads to 4¹¹ = 4194304 entries in the hitlist. Imagine now taking the lexicographic alphabet of 26 characters: this would give more than 10¹⁵ entries in the hitlist.
In my opinion QUASAR is a very good algorithm to speed up the well-known BLAST algorithm in the case where many searches for strongly similar sequences are conducted on one database.
5 Appendix
5.1 Introduction to suffix array
example
The word "endoplasmatic" has length 13 and has the following suffixes:
1. endoplasmatic
2. ndoplasmatic
3. doplasmatic
4. oplasmatic
5. plasmatic
6. lasmatic
7. asmatic
8. smatic
9. matic
10. atic
11. tic
12. ic
13. c
In front of each suffix you find the position at which the suffix starts in the given word. The suffix array is the array of these positions after sorting the suffixes lexicographically. The suffix array for the string S = "endoplasmatic" is {7,10,13,3,1,12,6,9,2,4,5,8,11}. This is a space-saving method: to store the suffix array we need one entry per character of the given string, and each entry needs 4 bytes (the index as an integer), so we need 4·|D| bytes for the whole suffix array.
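The suffix array above can be reproduced with a naive construction that simply sorts the suffixes, using 1-based positions as in the text; this sketch is only an illustration, not the efficient [MM93] algorithm:

```python
def suffix_array(s):
    """1-based start positions of the suffixes of s,
    sorted lexicographically by the suffix they denote."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])
```

For S = "endoplasmatic" this yields exactly the array {7,10,13,3,1,12,6,9,2,4,5,8,11} given above.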
5.2 References
5.2.1 Papers
[AMS+97] S.F. Altschul, T.L. Madden, A.A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D.J. Lipman. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25:3389-3402, 1997.
[JU91] P. Jokinen and E. Ukkonen. Two algorithms for approximate string matching in static texts. In Proc. of the 16th Symposium on Mathematical Foundations of Computer Science, volume 520 of Lecture Notes in Computer Science, pages 240-248, 1991.
[MM93] U. Manber and E.W. Myers. Suffix Arrays: A new method for on-line string searches. SIAM Journal on Computing, 22(5):935-948, 1993.
5.2.2 Others