
QUASAR

Q-gram Based Database Searching Using a Suffix Array


written by Stefan Burkhardt, Andreas Crauser, Paolo Ferragina, Hans-Peter Lenhof, Eric
Rivals and Martin Vingron

Report by

Lukas Naef
lnaef@student.ethz.ch
16. June 2008
Q-gram Alignment based on Suffix Array

Index

1 Introduction
1.1 QUASAR…
1.2 Abstract
1.3 Similarity and edit distance
1.4 Idea of the QUASAR-algorithm

2 The QUASAR algorithm
2.1 q-gram filtering
2.2 Counting and partitioning
2.3 Hitlist
2.4 Counting process

3 Example
3.1 Define the variables
3.2 Define the suffix-array and the hitlist
3.3 Partitioning and counting

4 Analysis and Evaluation
4.1 Complexity
4.2 Experiment
4.3 Conclusion

5 Appendix
5.1 Introduction to suffix array
5.2 References
5.2.1 Papers
5.2.2 Others

ETHZ - Algorithms for Data Base Systems 2 - 17



1 Introduction

1.1 QUASAR…
- was introduced in a paper by Stefan Burkhardt, Andreas Crauser, Paolo Ferragina,
Hans-Peter Lenhof, Eric Rivals and Martin Vingron in 1999

- stands for: Q-gram Alignment based on Suffix ARray

- is used to quickly detect sequences with strong local similarity, where many
searches are conducted on one database

1.2 Abstract
In molecular biology, searching a DNA database for sequences similar to a query sequence
has become a basic operation. Even today's fast algorithms reach their limits when they
attempt all-versus-all comparisons of large databases. In this report I discuss the search
algorithm QUASAR (Q-gram Alignment based on Suffix ARray), which was developed to quickly
detect sequences with strong similarity (1) to the query in a context where many searches
are conducted on one database. Consider the following example:

search query: S = CCATTAGCTAA

on database: D = …AGCTATTAACGTCA…

In this case QUASAR will return the subsequence ATTAAC as a result, because it is a
string that is not equal to the search query but strongly similar to it. The term
“strong similarity” is explained in more detail later in this document.

This algorithm uses q-tuple filtering (see 2.1 q-gram filtering) based on a suffix-array
index (see 5.1 Introduction to suffix array). QUASAR first filters all possible positions
of similar strings in the database and passes the positions with a high probability of
strong similarity to an algorithm that solves the similarity problem accurately. QUASAR
uses the well-known algorithm BLAST to generate the accurate result after the filtering
process.

Tests showed that, for searches for strongly similar sequences, QUASAR is an order of
magnitude faster than running BLAST on the whole unfiltered task (see section 4.2
Experiment).

The document is organized as follows. After a short introduction to the similarity
between strings and the big picture of how QUASAR works, we go deeper into the algorithm
(see 2 The QUASAR algorithm). To make it concrete, we then look at an example (see 3
Example). At the end of the report I briefly go over the results given by the authors of
the paper and give my own point of view on the QUASAR algorithm.

(1) strong similarity: see section 1.3 Similarity and edit distance


1.3 Similarity and edit distance


To go on with this report we need a closer look at the term “strong similarity”. What
does it mean for two strings to be strongly similar? It means that the two strings look
almost the same, like “ABCDEF” and “ABXDEF”: they differ in only a small number of
characters. To decide whether two strings are strongly similar or not, we have to measure
their similarity more precisely.

At this stage we introduce the edit distance (2). The edit distance is the minimal number
of insertions, deletions and substitutions of single letters needed to turn string s1
into s2.

example

s1= ACTAT, s2= ATTAA

We start with s1 and change it like this:

s1 ACTAT

1st change (replace C with T): ATTAT

2nd change (replace T with A): ATTAA = s2

Two changes are also the minimal number of changes needed to turn s1 into s2, thus the
edit distance for this example is 2.

example

s1= ATTGC, s2= TTGGCA

s1 ATTGC

1st change (delete A) TTGC

2nd change (add G) TTGGC

3rd change (add A) TTGGCA

Three changes are the minimal number of changes needed to turn s1 into s2, thus the edit
distance for this example is 3.

Now we have a metric to measure the similarity of two strings.
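
The two examples above can be checked with a few lines of code. The following is a minimal sketch of the classic dynamic-programming computation of the edit distance; the function name edit_distance is chosen here for illustration and does not come from the QUASAR paper.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Levenshtein distance between s1 and s2 (row-by-row dynamic programming)."""
    # prev[j] = edit distance between the already processed prefix of s1 and s2[:j]
    prev = list(range(len(s2) + 1))
    for i, a in enumerate(s1, start=1):
        curr = [i]  # turning s1[:i] into the empty string takes i deletions
        for j, b in enumerate(s2, start=1):
            curr.append(min(
                prev[j] + 1,              # delete a
                curr[j - 1] + 1,          # insert b
                prev[j - 1] + (a != b),   # replace a with b (free if they are equal)
            ))
        prev = curr
    return prev[-1]


print(edit_distance("ACTAT", "ATTAA"))   # first example
print(edit_distance("ATTGC", "TTGGCA"))  # second example
```

The running time is proportional to |s1|·|s2|, which is one reason why computing this distance directly against every candidate in a large database is too expensive.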

(2) This edit distance is also known as Levenshtein distance.


1.4 Idea of the QUASAR-algorithm


To find strings with a certain similarity, QUASAR uses a filtering technique. The process
has two steps. First the filter selects sequences in the database that have a high
probability of high similarity, so-called approximate matches. In the second step, we
pass these approximate matches to the matching algorithm BLAST [AMS+97] to inspect them
in more depth. Consider the example below:

example

We search a database D for strings that are strongly similar to S.

search query S=
database D= AAAGGGGTTCCCCCTAAACACTGACGAACTGACGAAGTCCAAAAGG
TTTTAACCCCTTTAAAGGGCGACTTGACACCATTGAGAACCCAAAA
GGGGTTTCCCTTTGGGCCCGGAAGGAATTAATTCCBBBAAAAAACC

Step 1: The filter selects the sequences in the database D that have a high probability
of high similarity.

Step 2: Pass these sequences (CACTGACGAACTGACGAAGT, GACTTGACACCATTGAGAAC) to the
matching algorithm BLAST to inspect them in depth.

From here on we concentrate on step 1, where we have to find the subsequences with a
high probability of high similarity. We have to take care that the filter is not so
strict that we lose correct results.


2 The QUASAR algorithm

2.1 q-gram filtering


Calculating the edit distance between the query and every candidate string in the
database is too expensive. QUASAR instead approximates it by so-called q-gram filtering.
A q-gram in our context is a substring s of a word w with fixed length |s|=q.

example

we define q = 2

w1= BRONSON has q-grams : BR RO ON NS SO ON

w2= BROSNAN has q-grams : BR RO OS SN NA AN

To get an impression of the similarity between two strings, we are especially interested
in the q-grams shared by these two strings. As we can see in the example, the two words
w1= BRONSON and w2= BROSNAN share the two q-grams BR and RO.
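
The example can be reproduced with a short sketch. The helper names qgrams and shared_qgrams are illustrative only. Shared q-grams are counted as a multiset here, so a q-gram that occurs twice in both words would count twice; this counting convention is an assumption of the sketch.

```python
from collections import Counter


def qgrams(word: str, q: int) -> list[str]:
    """All substrings of length q, from left to right (duplicates are kept)."""
    return [word[i:i + q] for i in range(len(word) - q + 1)]


def shared_qgrams(a: str, b: str, q: int) -> Counter:
    """Multiset intersection of the q-grams of a and b."""
    return Counter(qgrams(a, q)) & Counter(qgrams(b, q))


print(qgrams("BRONSON", 2))
print(qgrams("BROSNAN", 2))
print(sorted(shared_qgrams("BRONSON", "BROSNAN", 2).elements()))
```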

To approximate the edit distance, we use the relation between the number of shared
q-grams and the edit distance of the two strings. The following lemma gives a threshold
t, the minimal number of q-grams of length q that two strings P and S of length w share
if their edit distance is at most k.

lemma 1

Let P and S be strings of length w with edit distance at most k. Then P and S share at
least t = w-q+1-k·q (t=threshold) common q-grams [JU91].

example

P = ACAGCTTA and S = ACACCTTA, w=|P|=|S|=8

As we can see the edit distance k = 1

we define q = 3

→ t = w-q+1-k·q = 8-3+1-1·3 = 3 (shared q-grams: ACA, CTT, TTA)

It is important to see that t shows only the minimal number of shared q-grams between
P and S. Have a look at the following example very similar to the one above:


example

P=ACACTTAG and S=ACACTTAC, w=|P|=|S|=8

Edit distance is also k=1 and still q=3

→ t = w-q+1-k·q = 8-3+1-1·3 = 3 (same as above)

but there are 5 shared q-grams: ACA, CAC, ACT, CTT, TTA
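
Both lemma examples can be verified with a small sketch. The function names threshold and count_shared are illustrative. Shared q-grams are counted as a multiset, which makes no difference for these two examples because all their q-grams are distinct.

```python
from collections import Counter


def threshold(w: int, q: int, k: int) -> int:
    """Lower bound of lemma 1 on the number of shared q-grams: t = w - q + 1 - k*q."""
    return w - q + 1 - k * q


def count_shared(p: str, s: str, q: int) -> int:
    """Number of q-grams that p and s have in common (multiset intersection)."""
    grams_p = Counter(p[i:i + q] for i in range(len(p) - q + 1))
    grams_s = Counter(s[i:i + q] for i in range(len(s) - q + 1))
    return sum((grams_p & grams_s).values())


t = threshold(w=8, q=3, k=1)
print(t, count_shared("ACAGCTTA", "ACACCTTA", 3))  # bound is met exactly
print(t, count_shared("ACACTTAG", "ACACTTAC", 3))  # more shared q-grams than the bound
```

The second pair shows that t is only a lower bound: the single difference sits at the end of the string and therefore destroys only one q-gram instead of q.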

We have reduced the problem of finding approximate matches in a database D to the
following task: given a maximum edit distance k on a fixed window size w, we get
approximate matches by finding substrings of the database that share at least t (see
lemma 1) q-grams with the window. An example makes this clear:

example

defined variables: k=1, w=8, q=3

→ t = w-q+1-k·q = 3

The idea is to go through all windows of length w in S and, for each of these windows,
to check all substrings of length w in the database for the number of shared q-grams. If
we find a substring of length w with at least t (threshold of lemma 1) shared q-grams,
we have found an approximate match. We start with the first window of size w=8 in the
search query, S_{1,w} (3). In the example above we found a window CACTGAGG in D with
shared q-grams {CAC, TGA, GAG}; the number of shared q-grams equals t=3. So this
substring is a possible match for edit distance k=1, although, as we can see, in this
situation it is not an actual match. In the second step of the filtering process we
would pass this substring to the matching algorithm BLAST.

The problem now is that, for each window position in S (S_{1,w}, S_{2,w+1}, …), we have
to find all approximate matches in the database D. To do this a little smarter, we make
another approximation, as shown in the following section.
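
To see why this direct approach is costly, here is a minimal sketch of the naive filter: every window of S is compared with every window of D. The function name naive_filter and the small demo database are assumptions made for illustration; they are not taken from the paper. Positions are reported 1-based, as in the rest of this report.

```python
from collections import Counter


def qgram_counts(text: str, q: int) -> Counter:
    """Multiset of all q-grams of text."""
    return Counter(text[i:i + q] for i in range(len(text) - q + 1))


def naive_filter(query: str, database: str, w: int, q: int, k: int):
    """Return (query position, database position, shared q-grams) for every pair
    of length-w windows that shares at least t = w - q + 1 - k*q q-grams."""
    t = w - q + 1 - k * q
    matches = []
    for i in range(len(query) - w + 1):
        window = qgram_counts(query[i:i + w], q)
        for j in range(len(database) - w + 1):
            shared = sum((window & qgram_counts(database[j:j + w], q)).values())
            if shared >= t:
                matches.append((i + 1, j + 1, shared))
    return matches


# Hypothetical database containing the string S of the lemma example at position 4.
print(naive_filter("ACAGCTTA", "GGGACACCTTAGGG", w=8, q=3, k=1))
```

The two nested loops make the number of window comparisons grow with |S|·|D|, which motivates the block counters and the hitlist of the following sections.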

(3) S_{a,b} denotes the substring of S starting at position a and ending at position b.
For example, S=“abcdefg” → S_{2,4}=“bcd”


2.2 Counting and partitioning


The goal now is to identify all the substrings in D that share at least t q-grams with
S_{1,w}. A simple approach would be to attach a counter to each substring of length w in
D. In this case we would have |D|-w+1 counters, and we would have to increment every
counter whose substring contains a q-gram shared with S_{1,w}. All substrings with a
counter value greater than or equal to t are approximate matches.

A main drawback is the additional space we allocate for the counters, so the authors
decided to combine several substrings of length w into one block and to use one counter
per block.

For this reason we partition the database into non-overlapping blocks of fixed size b
and attach a counter to each block. While this decreases the memory usage, it also
decreases the accuracy.

example

k=1, w=8, q=3 → t = w-q+1-k·q = 3, b=16 (b≥2w)

In the second block we count 5 shared q-grams {CAC, ACA, TGA, TGA, GAG}
between the search query and the database.

Because these blocks are non-overlapping, we might miss some q-grams that cross the
block borders. So we use a second array of blocks, shifted by b/2. If we had looked for
the q-gram TAA, we would have missed it (it lies on the border between the first and
second block). For the same reason we have to choose b≥2w: with a shift of b/2≥w, every
window of length w lies completely inside at least one block. See the next example to
get the whole picture of partitioning and counting:

example

k=1, w=8, q=3 → t=3, b=16 (b≥2w)

In this case we would pass the two blocks with counters 3 and 5 to BLAST. Because of the
lemma, a block with counter < t (in our example t=3) cannot contain a substring of
length w with edit distance k=1 to S_{1,w}, so we do not pass it to BLAST.
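
The block counting can be sketched as follows. Several things here are assumptions of the sketch rather than details taken from the paper: the two block arrays are numbered as one sequence b1, b2, b3, … in which block i covers positions (i-1)·b/2+1 to (i-1)·b/2+b (this matches the bounds used in section 3.3), a q-gram is assigned to a block by its start position, and the demo database is hypothetical.

```python
from collections import Counter


def blocks_containing(pos: int, b: int) -> list[int]:
    """1-based numbers of the blocks whose range contains the 1-based position pos.
    Block i covers positions (i-1)*b/2 + 1 .. (i-1)*b/2 + b."""
    half = b // 2
    k = (pos - 1) // half  # 0-based index of the half-block that holds pos
    return [k, k + 1] if k >= 1 else [1]


def count_block_hits(database: str, window: str, q: int, b: int) -> Counter:
    """Increase a block counter for every occurrence of a window q-gram in the database."""
    counters = Counter()
    for i in range(len(window) - q + 1):
        gram = window[i:i + q]
        start = database.find(gram)  # plain scan; section 2.3 replaces it by the hitlist
        while start != -1:
            for block in blocks_containing(start + 1, b):
                counters[block] += 1
            start = database.find(gram, start + 1)
    return counters


# Hypothetical database: CACCTGAGAA placed at positions 19..28.
database = "T" * 18 + "CACCTGAGAA" + "T" * 12
counters = count_block_hits(database, "CACATGAG", q=3, b=16)
t = 8 - 3 + 1 - 1 * 3
print(sorted(block for block, count in counters.items() if count >= t))
```

In this run the q-grams CAC, TGA and GAG of the window occur at positions 19, 23 and 24, so the counters of b2 and b3 both reach t=3.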


2.3 Hitlist
There is still an efficiency problem: we take each q-gram from the window S_{1,w} and,
for each of them, scan the whole database, increasing the counter of every block that
contains this q-gram. To avoid scanning the whole database for each q-gram, we build a
hitlist, an index that gives all the positions of a given q-gram in the database.

The hitlist is a table of all possible q-grams, each pointing to the first occurrence of
that q-gram in the suffix array (4) of the database.

example

This example shows the hitlist and its pointers into the suffix array of a database.
The size of the q-grams in this example is q=4.

To get the positions of a q-gram, for example AAAA, we go to the hitlist entry AAAA and
follow the pointer to the entry in the suffix array, which gives the first occurrence of
the q-gram AAAA. Afterwards we walk down the suffix array until we reach the entry that
the next pointer of the hitlist points to. This technique finds the first position of a
q-gram in constant time. Reporting all its positions, however, takes time proportional
to their number: if the whole database consists of one single repeated character,
increasing the counters for one q-gram can take O(n) time, where n is the size of the
database. The other important thing about the hitlist is its size. The number of hitlist
entries equals the number of symbols in the alphabet raised to the power of q (the size
of the q-grams). In the example above this already gives 4^4=256 entries, and this is
just a small example.
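
The lookup described above can be sketched in a few lines. This is a simplified illustration, not the paper's implementation: the suffix array is built by naive sorting, and the hitlist is a dictionary that only holds the q-grams actually occurring in the text instead of a table with one entry per possible q-gram. All function names are illustrative.

```python
def build_suffix_array(text: str) -> list[int]:
    """0-based start positions of all suffixes in lexicographic order (naive sort)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def build_hitlist(text: str, sa: list[int], q: int) -> dict[str, int]:
    """Map each occurring q-gram to the first suffix-array rank that starts with it."""
    hitlist = {}
    for rank, pos in enumerate(sa):
        gram = text[pos:pos + q]
        if len(gram) == q and gram not in hitlist:
            hitlist[gram] = rank
    return hitlist


def occurrences(text: str, sa: list[int], hitlist: dict[str, int], gram: str) -> list[int]:
    """1-based positions of gram: jump via the hitlist, then walk down the suffix array."""
    rank = hitlist.get(gram)
    if rank is None:
        return []
    positions = []
    # Suffixes starting with the same q-gram are adjacent in the suffix array.
    while rank < len(sa) and text[sa[rank]:sa[rank] + len(gram)] == gram:
        positions.append(sa[rank] + 1)
        rank += 1
    return sorted(positions)


text = "ACACAC"
sa = build_suffix_array(text)
hitlist = build_hitlist(text, sa, q=2)
print(occurrences(text, sa, hitlist, "AC"))
print(occurrences(text, sa, hitlist, "CA"))
```

The walk stops by comparing prefixes, whereas the text stops at the next hitlist pointer; both delimit the same run of suffix-array entries.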

(4) See appendix, section 5.1 Introduction to suffix array


2.4 Counting process


To find all approximate matches between the first search window S_{1,w} and D, we have
to find all substrings in D that share at least t q-grams with S_{1,w}. Afterwards we
shift the search window in S one position to the right, so we consider S_{2,w+1}, and so
on, until the end of the search query is reached.

Instead of redoing the whole procedure for S_{1,w}, S_{2,w+1}, … we only handle the
difference to the previous window. For example, to check the search window S_{2,w+1} we
take the result for S_{1,w} and handle the differences: we discard the contribution of
S_{1,q} (the first q-gram, which is no longer part of S_{2,w+1}) and add the contribution
of the new q-gram S_{w-q+2,w+1}. To do so we decrease the counters of the blocks that
contain S_{1,q} and have not reached the threshold, and increase the counters of the
blocks that contain the new q-gram S_{w-q+2,w+1}.

In this way we go through the whole of S, shifting the window of length w by one
position at a time. At the end, all blocks whose counter reached at least t are
approximate matches. To get the alignment we run BLAST on the corresponding blocks.
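
The whole counting process can be put together in a minimal sketch. It rests on several assumptions that are not taken from the paper: a plain dictionary of q-gram positions stands in for the suffix array and hitlist, blocks are numbered b1, b2, … with block i covering positions (i-1)·b/2+1 to (i-1)·b/2+b, a block that has reached t once is kept in a set and never decreased again, and the demo query and database are hypothetical.

```python
from collections import Counter, defaultdict


def blocks_containing(pos: int, b: int) -> list[int]:
    """1-based numbers of the overlapping blocks that contain the 1-based position pos."""
    k = (pos - 1) // (b // 2)
    return [k, k + 1] if k >= 1 else [1]


def quasar_filter(query: str, database: str, w: int, q: int, k: int, b: int) -> list[int]:
    """Return the numbers of all blocks whose counter reaches t for some query window."""
    t = w - q + 1 - k * q
    index = defaultdict(list)  # q-gram -> 1-based positions (stand-in for the hitlist)
    for i in range(len(database) - q + 1):
        index[database[i:i + q]].append(i + 1)

    counters = Counter()
    marked = set()  # blocks that have reached the threshold

    def update(gram: str, delta: int) -> None:
        for pos in index.get(gram, ()):
            for block in blocks_containing(pos, b):
                if delta < 0 and block in marked:
                    continue  # keep marked blocks as approximate matches
                counters[block] += delta
                if counters[block] >= t:
                    marked.add(block)

    for i in range(w - q + 1):  # all q-grams of the first window
        update(query[i:i + q], +1)
    for start in range(1, len(query) - w + 1):  # slide the window by one position
        update(query[start - 1:start - 1 + q], -1)        # q-gram leaving the window
        update(query[start + w - q:start + w], +1)        # q-gram entering the window
    return sorted(marked)


database = "T" * 18 + "CACCTGAGAA" + "T" * 12  # hypothetical database
print(quasar_filter("CACATGAGA", database, w=8, q=3, k=1, b=16))
```

Each shift touches only two q-grams, so the work per window no longer depends on w.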


3 Example

3.1 Define the variables


For the example I have chosen small values to keep it simple:

3.2 Define the suffix-array and the hitlist


To build the suffix-array of the database, we first build all suffixes:

Afterwards we order the list lexicographically. We also generate the hitlist for all
possible q-grams, which points to the first occurrence of the desired q-gram:


3.3 Partitioning and counting


As you can see at the bottom of the following illustration, we increase the counter of
each block in which a q-gram from the first window S_{1,w}=CACATGAG occurs. This means
we look up the positions of each q-gram {CAC, ACA, CAT, ATG, TGA, GAG} in the hitlist
and increase the counters of all blocks containing these positions. For example, the
second q-gram ACA occurs at position 19, so we increase the counter of b2, which covers
positions 9 to 24, and the counter of b3, which covers positions 17 to 32.


After adjusting the counters, we shift the window one position to the right
(S_{2,w+1}=ACATGAGA). Now we only look at the first q-gram of the previous window, CAC,
which we no longer need. We decrease the counters of all blocks that contain CAC and
have not reached the threshold t. If the threshold t has been reached, as for b2 and b3,
we do not decrease the counter, so that the block stays marked as an approximate match.
We then increase the counters for the new q-gram AGA (b2, b3, b4, b7).

Now we can shift the window one position to the right and do the same again.

At the end we can see that we would pass the positions of b2, b3 and b4 to the local
alignment algorithm BLAST. In b3 there is a string CACCTGAGAA that is very similar to S;
this one will be returned as a result.


4 Analysis and Evaluation

4.1 Complexity
The construction of the suffix array and the precomputation of the hitlist can be done
in O(|D|·log|D|) time [MM93]. Looking up a specific q-gram takes constant time, but the
number of reported occurrences can be linear in |D|. There are O(|S|) q-grams in the
query. The whole filtering approach therefore takes O(|D|·|S|) time in the worst case,
and at the end BLAST takes another O(c·b·|S|), where c is the number of blocks reaching
the threshold t and b is the block size.

4.2 Experiment
The authors ran a test with 1000 queries on the same database, comparing QUASAR and
BLAST in terms of performance and sensitivity. The loading of data was included in the
measurement for BLAST, but accounts for less than 1% of the duration. For QUASAR the
loading time was not included in the measurement (30 seconds for the first and 114
seconds for the second row, see the table below).

A few results from the initial implementation: QUASAR was running with w=50, q=11,
b=1024bps and t such that windows with at most 6% differences are found (in that case an
edit distance of at most 3).

They used one processor (SUN Ultra SparcII, 333 MHz) of a dedicated Sun Enterprise
10000 with 4 GB of main memory and a local disk array.

The following measurements for different block sizes were also made:


4.3 Conclusion
This algorithm seems to be an efficient method for filtering the data first in order to
speed up the BLAST algorithm. But the method also has its drawbacks. For example, if the
suffix array does not fit into main memory and has to be stored on disk, this would slow
down the algorithm extremely.

Furthermore, we have to mention that the experiment above excluded the preparation phase
of QUASAR. For the first example with the 73.5Mb data the preparation needed 30 seconds
and for the second example 114 seconds! This explains why the algorithm is made for
several queries on the same database.

Reading the paper, we notice that the space needed for the hitlist is not mentioned at
all. Let’s take the data from the example with an alphabet of size 4 {A,C,G,T} and
q-grams of size q=11. This leads to 4^11 = 4194304 entries in the hitlist. With the
Latin alphabet of 26 characters there would be more than 10^15 entries in the hitlist.
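
The numbers can be checked quickly. The entry size of 4 bytes is an assumption of this sketch (the same integer size that the appendix uses for suffix-array entries); the paper does not state it.

```python
def hitlist_entries(alphabet_size: int, q: int) -> int:
    """Number of possible q-grams, which is the number of hitlist entries."""
    return alphabet_size ** q


BYTES_PER_ENTRY = 4  # assumption: one 32-bit index per hitlist entry

for alphabet_size in (4, 26):
    entries = hitlist_entries(alphabet_size, q=11)
    print(alphabet_size, entries, entries * BYTES_PER_ENTRY, "bytes")
```

Under this assumption the DNA hitlist for q=11 needs 16 MiB, while a 26-letter alphabet would need several petabytes.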

In my opinion QUASAR is a very good algorithm to speed up the well-known BLAST algorithm
if:

- the main memory is big enough to store the suffix array and the hitlist

- we do several searches on the same database

- the database has a small alphabet


5 Appendix

5.1 Introduction to suffix array


This section gives a short introduction to suffix arrays. The suffix array was
introduced to reduce memory consumption compared to a suffix tree. This is one of the
reasons we use this data structure for QUASAR. It is easiest to understand how a suffix
array works by looking at an example.

example

The word “endoplasmatic” has length 13 and has the following suffixes:
1. endoplasmatic
2. ndoplasmatic
3. doplasmatic
4. oplasmatic
5. plasmatic
6. lasmatic
7. asmatic
8. smatic
9. matic
10. atic
11. tic
12. ic
13. c

If you order the suffixes lexicographically you get this:


7. asmatic
10. atic
13. c
3. doplasmatic
1. endoplasmatic
12. ic
6. lasmatic
9. matic
2. ndoplasmatic
4. oplasmatic
5. plasmatic
8. smatic
11. tic

In front of each suffix you still find the position at which the suffix starts in the
given word. The suffix array is the array of these positions. The suffix array for the
string S=“endoplasmatic” is: {7,10,13,3,1,12,6,9,2,4,5,8,11}. This is a space-saving
representation: the array needs one entry for each character of the given string, and
each entry needs 4 bytes (index as integer), so for a database D we need 4·|D| bytes for
the whole suffix array.
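
The example can be reproduced with a naive construction that simply sorts the suffix start positions. This sketch is for illustration only: sorting whole suffixes is much slower than the O(n·log n) construction of [MM93], and the function name suffix_array is illustrative.

```python
def suffix_array(text: str) -> list[int]:
    """1-based start positions of all suffixes of text in lexicographic order."""
    order = sorted(range(len(text)), key=lambda i: text[i:])
    return [i + 1 for i in order]


word = "endoplasmatic"
sa = suffix_array(word)
print(sa)
for pos in sa:
    print(pos, word[pos - 1:])  # position and the suffix that starts there
```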


5.2 References

5.2.1 Papers

[AMS+97] S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and
D. J. Lipman. Gapped BLAST and PSI-BLAST: a new generation of protein database
search programs. Nucleic Acids Res., 25:3389-3402, 1997.

[BCF+99] S. Burkhardt, A. Crauser, P. Ferragina, H.-P. Lenhof, E. Rivals, and M. Vingron.
q-gram Based Database Searching Using a Suffix Array (QUASAR). In RECOMB,
pages 77-83, 1999.

[JU91] P. Jokinen and E. Ukkonen. Two algorithms for approximate string matching
in static texts. In Proc. of the 16th Symposium on Mathematical Foundations
of Computer Science, volume 520 of Lecture Notes in Computer Science,
pages 240-248, 1991.

[MM93] U. Manber and E. W. Myers. Suffix Arrays: A new method for on-line string
searches. SIAM Journal on Computing, 22(5):935-948, 1993.

[Ukk92] E. Ukkonen. Approximate string-matching with q-grams and maximal matches.
Theoretical Computer Science, 92(1):191-211, 1992.

5.2.2 Others

• Institut für Informatik, Fachbereich Mathematik und Informatik, Freie Universität
Berlin, course “Algorithmic Bioinformatics” (April 2008),
http://www.inf.fu-berlin.de/inst/ag-bio/FILES/ROOT/Teaching/Lectures/SS08//ssa/script-05-FastFilteringQuasar.pdf

• Johannes Gutenberg Universität Mainz, Seminar Bioinformatik (April 2008),
http://www.informatik.uni-mainz.de/lehre/BioS/langenberger_loh.pdf
