
2015 Second International Conference on Advances in Computing and Communication Engineering

Efficient Read Alignment Using Burrows Wheeler Transform and Wavelet Tree

Sanjeev Kumar
Department of CSE, MNNIT, Allahabad, U.P., India
skgcp86@gmail.com

Suneeta Agarwal
Department of CSE, MNNIT, Allahabad, U.P., India
suneeta@mnnit.ac.in

Rajesh Prasad
Department of Computer Science, Yobe State University, Damaturu, Yobe State, Nigeria
rajesh_ucer@yahoo.com

Abstract—In the genome sequence alignment problem, a reference string and a number of query strings, referred to as short reads, are given; the goal is to find the occurrences of these query strings in the reference string. The huge number of reads generated by new sequencing technologies (Illumina/Solexa) calls for an efficient algorithm that requires both less memory and less computation time. There are a number of indexing and string matching techniques for aligning short reads to a reference string (genome), but the index of the reference string is large in each of the existing techniques. In this paper, a new self-compressed index technique (BWT-WT) is proposed. The BWT-WT scheme is based on the Burrows-Wheeler Transform (BWT) and the Wavelet Tree (WT), and it supports exact alignment of DNA sequence reads. The performance of BWT-WT is compared with that of other BWT-based short read alignment tools. Experiments show that the BWT-WT based program achieves more compression and faster searching than the other existing tools.

Keywords—Burrows-Wheeler Transform, FM Index, Full Text Index, Wavelet Tree and Sequence Analysis.

I. INTRODUCTION

The next generation sequencing machine Illumina/Solexa generates millions of short DNA sequence reads in a single run of the machine. These reads must be mapped to one or more reference genomes. The orientation of a read relative to the genome is not known. The main problem is how to align the reads to the reference genome, accounting for exact matching, within a reasonable amount of time and memory space. There are a number of applications where short read alignments are used. Examples include assembling reads into a genome, aligning reads to one reference genome for analysis of genomic variation, and aligning a micro-biome to a set of reference genomes for species or functional analysis.

Searching biological sequences in genomes and proteins is important for understanding the genetic blueprint of a living organism. This has resulted in the fast development of new technologies generating vast amounts of sequence data to be analyzed [7]. For this reason, the focus today has changed from data acquisition to efficient data storage and processing methods. To regain the original ordering of the reads, they are often aligned to a reference genome, where the massive number of sequences that need to be processed requires efficient search schemes and data structures.

A lot of effort has been made to develop methods that are both memory efficient and fast. One approach to derive suitable data structures is the Burrows-Wheeler Transform (BWT), which can be understood as a rearrangement of the characters in a sequence. It has therefore been integrated in several bioinformatics tools for read mapping, e.g. SOAP [9], BWA [5], BWA-SW [13] and Bowtie [11].

Four categories of alignment programs are currently used to map short read sequences. The first category is based on hashing of the read sequences, such as RMAP (Smith et al., 2008), MAQ (Li et al., 2008(1)) and ZOOM (Li et al., 2008(2)). These programs have flexible memory requirements but do not support gapped alignment and multithreading. In the second category, programs are based on hashing of the reference genome, such as SOAP (Li et al., 2008(3)) and BFAST [14]. Programs of this category support multithreading for alignment of reads, but the size of the reference genome index is very large. Programs of the third category are based on merge sorting of the reference genome as well as of the read sequences, such as Malhis (Malhis et al., 2009), but these programs are not very popular as they do not support paired-end mapping. The fourth category of programs is based on the Burrows-Wheeler Transform (BWT, 1994), which is efficient in both memory footprint and speed. Some of the software programs of this category are BWA [5], Bowtie [11] and SOAP [9].

Programs of the fourth category have a relatively small memory footprint, are efficient in searching, and support exact matching as well as inexact matching with a bounded number of allowed differences. Exact matching by these programs takes a few seconds to align the reads, but aligning inexact reads takes much longer because all similar substrings must be found. In DNA profiling, where multiple reference genomes are used for analysis and identification of gene behaviour, the size of the index again becomes an issue, so a reduction of the index size is required. As a result, the development of an efficient program requiring less memory and computation time is needed.

BWT [2] based algorithms use a number of external tools, such as move-to-front encoding (which is used to rearrange the characters in similar order), run-length coding and variable-length coding, to compress the reference sequence.

The Wavelet Tree [4] was invented in 2003 by Grossi, Gupta and Vitter as a data structure to represent a sequence and answer some queries on it. It is a milestone in compressed full-text indexing and adapts well to the compressibility of the data. Two key approaches to achieve this are using specific coding (entropy coding) on the bitmaps and modifying the tree shape.

In this paper, a new self-compressed index supporting exact alignment of DNA sequence reads is proposed, which is based on the BWT and the Wavelet Tree. The advantage of BWT-WT is that it provides an index of optimal size and supports a number of

978-1-4799-1734-1/15 $31.00 © 2015 IEEE 133


DOI 10.1109/ICACCE.2015.80

Authorized licensed use limited to: UNIVERSITI TEKNOLOGI MARA. Downloaded on April 05,2022 at 07:43:16 UTC from IEEE Xplore. Restrictions apply.
queries such as access, rank and select in constant time. The performance of BWT-WT for DNA sequence alignment is compared with other BWT-based tools. Experiments show that the BWT-WT based program achieves more compression and efficient searching in comparison to other techniques.

This paper is organized as follows. Sec. II describes the related concepts. Sec. III presents the proposed compression and indexing techniques based on the Burrows-Wheeler Transform and the Wavelet Tree. Sec. IV presents the experimental setup and the analysis of the results. Finally, Sec. V concludes the paper.

II. RELATED CONCEPTS

A. Suffix Tree and Suffix Array

The suffix tree [1] has been used as an important data structure in string processing. This data structure plays a prominent role in algorithms but is not as prevalent in actual implementations of software tools. There are two major reasons for this. The first reason is the space consumption, as the suffix tree requires quite a large space, although its performance is asymptotically linear. The second reason is that the suffix tree demonstrates poor locality of memory reference, which causes a significant loss of efficiency on cached processor architectures.

The suffix array was introduced by Manber & Myers [6] as a simple, space-efficient indexing method alternative to suffix trees. It is a key data structure for solving a number of problems in data compression and information retrieval for biological sequence analysis and pattern discovery. It is defined as the permutation of index numbers giving the starting positions of the suffixes of a given string in alphabetical order. Table I shows the suffix array for the string "AGCAGT$".

TABLE I. SUFFIX ARRAY FOR TEXT T=AGCAGT$

Suffixes              Ordered Suffixes
i    S[i]             i    S[i]    Ssuf
0    AGCAGT$          0    6       $
1    GCAGT$           1    0       AGCAGT$
2    CAGT$            2    3       AGT$
3    AGT$             3    2       CAGT$
4    GT$              4    1       GCAGT$
5    T$               5    4       GT$
6    $                6    5       T$

B. Burrows Wheeler Transform

DNA sequencing algorithms based on the Burrows Wheeler Transform (BWT) [2] are widely used in genome sequencing analysis. The main concept of the BWT is to sort all rotations of a given string in lexical order in the form of the BWM (Burrows Wheeler Matrix) and then return the last column as the result. This last column, i.e., the BWT string, can be easily compressed because it has many repeated characters together. The BWT also allows fast string matching on compressed text. It is implemented by the following steps:

1. Derive a conceptual matrix M whose rows are the n cyclic shifts of the text T, n being the length of the text.
2. Sort the rows of M lexicographically; the resulting matrix is called the BWM.
3. Construct the transformed text Tbwt by taking the last column of the BWM.

The transformed text Tbwt in the last column is also denoted as L (last). In particular, the first column of the BWM, F (first), is obtained by lexicographically sorting the characters of T. Fig. 1 shows the construction of the BWT.

Index    Cyclic Shifting      Index    BWT Matrix (after sorting)
0        AGCAGT$              0        $AGCAGT
1        GCAGT$A              1        AGCAGT$
2        CAGT$AG              2        AGT$AGC
3        AGT$AGC              3        CAGT$AG
4        GT$AGCA              4        GCAGT$A
5        T$AGCAG              5        GT$AGCA
6        $AGCAGT              6        T$AGCAG

Fig. 1. Construction of Burrows Wheeler Transform Matrix for Text T=AGCAGT$. TBWT=T$CGAAG

C. FM-Index

In 2000, six years after the BWT appeared, Paolo Ferragina and Giovanni Manzini [3] published a paper describing how the BWT, together with some small auxiliary data structures, can be used as a space-efficient index of a reference string T. They named it the FM Index. Just as the Last to First Mapping [3, 8] was the key to understanding how the BWT is reversible, it is also the key to how it can be used as an index.

D. Wavelet Tree

A wavelet tree [4] is a binary tree of bit strings that represents a given text T. For an alphabet Σ and a text of length n, the tree needs O(n log2 |Σ|) bits of storage and supports the determination of the character at a specified position in O(log |Σ|) time. In addition, it allows the number of occurrences of a given character up to a specified position to be obtained in O(log |Σ|) time. Fig. 2 shows the wavelet tree for AGCAGT$.

E. Existing Technique

There are a number of techniques for short read alignment to a reference genome, such as MAQ, BWA, Bowtie and SOAP. The Burrows Wheeler Aligner (BWA) [5] performs short read alignment based on the Burrows Wheeler Transform (BWT) and the FM Index [3]. In BWA alignment, an index based on the BWT and the suffix array is created. To search efficiently, BWA uses the FM Index [3, 7], which is based on the backward search method. The FM Index uses a number of auxiliary data structures, such as the count and occurrence tables, for performing the search operation. The count table stores, for each character c, the number of characters in the string that are smaller than c; the occurrence table stores the rank of each character. The suffix array and the occurrence table are too large to store in full, so only sampled values are stored and the other values are calculated on demand. In order to perform exact matching, the count and locate functions are used. The count function returns the number of occurrences of a pattern P in the text T, whereas the locate function returns the locations of pattern P in the text T.
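
As an illustration of the structures described in this section, the following minimal Python sketch builds the suffix array, the BWT, the count table and the occurrence table for the running example T = AGCAGT$. The function names are hypothetical, the construction is the naive one (sorting full suffixes and rotations), and no sampling is done, so it is not how BWA or any other tool builds its index.

```python
def suffix_array(text):
    """Starting positions (0-based) of the suffixes of text in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt(text):
    """Last column of the sorted cyclic-shift matrix; text must end with '$'."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(row[-1] for row in rotations)


def count_table(text):
    """C[c] = number of characters in text that are strictly smaller than c."""
    table, total = {}, 0
    for c in sorted(set(text)):
        table[c] = total
        total += text.count(c)
    return table


def occurrence_table(bwt_text):
    """occ[c][i] = number of occurrences of c in the first i BWT characters."""
    occ = {c: [0] for c in set(bwt_text)}
    for ch in bwt_text:
        for c, column in occ.items():
            column.append(column[-1] + (1 if c == ch else 0))
    return occ


T = "AGCAGT$"
SA = suffix_array(T)          # [6, 0, 3, 2, 1, 4, 5], the S[i] column of Table I
L = bwt(T)                    # "T$CGAAG", the last column in Fig. 1
C = count_table(T)            # {'$': 0, 'A': 1, 'C': 3, 'G': 4, 'T': 6}
OCC = occurrence_table(L)     # full (unsampled) occurrence table
print(SA, L, C)
```

The BWT can also be read off the suffix array, since L[i] is the character that precedes suffix SA[i] in T; a real index would store only sampled rows of SA and OCC, as described above.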

Fig. 2. Wavelet Tree for text T= AGCAGT$

III. PROPOSED ALIGNMENT TECHNIQUE USING BWT-WT


BWA uses Burrows Wheeler based indexing, which takes space proportional to the size of the reference string (about 3 billion characters in the case of the human genome). The main disadvantage of the above techniques is that the BWT itself does not offer compression but only arranges the text in a compressible form. For compression, it relies on external techniques such as Move to Front encoding (MTF), Run Length Encoding and Variable Length prefix coding. This requires a lot of computation overhead and a high peak memory usage. In order to overcome these problems, we introduce a new indexing technique, BWT-WT, based on the BWT and the Wavelet Tree (WT). The wavelet tree compresses the string itself and can also be used as a component of other compression tools. It also uses the binary succinct data structure RRR [14], which represents each character in optimal space and gives very fast rank/select operations, to compress the WT nodes and answer rank/select queries in constant time. Another advantage of the WT is that it can be extended by changing its shape (Huffman shape and its variants) and by using a compression booster algorithm to reach a high level of compression. The block diagram of the proposed technique is shown in Fig. 3.

Fig. 3. Block Diagram of Wavelet Tree based Indexing and Searching Technique

A. Index based on BWT-WT:

A wavelet tree encodes a given text T as a binary tree. The tree is constructed by defining a subtext for each node, which is then encoded as a bit string generated by comparing the elements of the subtext to a pivot element p. Each character c smaller than p is represented by a '0', while characters greater than or equal to p are encoded by a '1'.

The bit string in turn defines the strings of the child nodes: all characters represented by '0' form the substring of the left child, and all characters encoded by '1' form the substring of the right child. A wavelet tree for the BWT of "AGCAGCAGACT$" and its index are shown in Fig. 4 and Table II respectively.

B. String Searching in the BWT-WT index

Before searching for the pattern P in the reference string, the following operations are required. Given the wavelet tree for the text T, the algorithms given in Fig. 5 are used to search for a string P in the text T.
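
The pivot-based construction of Section III-A can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class and function names are hypothetical, the pivot is taken as the middle character of each node's sorted alphabet (the text does not fix how p is chosen), and plain prefix sums stand in for the RRR-compressed bitmaps, so the sketch shows the navigation logic but not the compression.

```python
class WaveletNode:
    """One node of a pointer-based wavelet tree over a sorted alphabet."""

    def __init__(self, text, alphabet):
        self.alphabet = alphabet
        self.left = self.right = None
        if len(alphabet) == 1:            # leaf: represents a single character
            return
        mid = len(alphabet) // 2
        pivot = alphabet[mid]             # assumed pivot: middle of the alphabet
        # '0' for characters smaller than the pivot, '1' otherwise
        self.bits = [0 if c < pivot else 1 for c in text]
        # ones[i] = number of 1 bits among the first i bits (uncompressed
        # stand-in for a succinct rank structure such as RRR)
        self.ones = [0]
        for b in self.bits:
            self.ones.append(self.ones[-1] + b)
        self.left = WaveletNode([c for c in text if c < pivot], alphabet[:mid])
        self.right = WaveletNode([c for c in text if c >= pivot], alphabet[mid:])

    def rank_bit(self, bit, i):
        """Number of occurrences of bit among the first i bits of this node."""
        return self.ones[i] if bit else i - self.ones[i]


def build(text):
    """Build a wavelet tree for a non-empty text."""
    return WaveletNode(text, sorted(set(text)))


def access(root, i):
    """Character at 0-based position i of the encoded text."""
    node = root
    while node.left is not None:
        bit = node.bits[i]
        i = node.rank_bit(bit, i)         # position inside the chosen child
        node = node.right if bit else node.left
    return node.alphabet[0]


def rank(root, c, i):
    """Occurrences of c among the first i characters (c must be in the alphabet)."""
    node = root
    while node.left is not None:
        bit = 0 if c < node.alphabet[len(node.alphabet) // 2] else 1
        i = node.rank_bit(bit, i)
        node = node.right if bit else node.left
    return i


bwt_text = "TGCC$GGAAAAC"                 # BWT of AGCAGCAGACT$
root = build(bwt_text)
print(root.bits)                          # bitmap stored at the root
print(rank(root, "G", 7))                 # number of G in the first 7 characters
```

Each query visits one node per tree level, which is where the O(log |Σ|) bound comes from; with constant-time rank on the bitmaps, the cost per level is constant.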

Fig. 4. Wavelet Tree Index Based on BWT

TABLE II. INDEX FOR STRING T = AGCAGCAGACT$

i     Suffix#    BWT(T)    Sorted Suffixes
1     12         T         $
2     9          G         ACT$
3     7          C         AGACT$
4     4          C         AGCAGACT$
5     1          $         AGCAGCAGACT$
6     6          G         CAGACT$
7     3          G         CAGCAGACT$
8     10         A         CT$
9     8          A         GACT$
10    5          A         GCAGACT$
11    2          A         GCAGCAGACT$
12    11         C         T$

Following functions are used for backward searching:

// access(i): the character at position i in text T
WTBWT-access(i)
(1) v ← root; r ← i
(2) while v is not a leaf do
(3)     if Bv[r] = 0 then
(4)         r ← rank0(Bv, r)
(5)         v ← leftchild(v)
(6)     else
(7)         r ← rank1(Bv, r)
(8)         v ← rightchild(v)
(9) return label of v

// rankc(T, i): the number of occurrences of character c at or before position i in text T
WTBWT-rank(c, i)
(1) v ← root; r ← i
(2) while v is not a leaf do
(3)     if c is in the left sub-tree of v then
(4)         r ← rank0(Bv, r)
(5)         v ← leftchild(v)
(6)     else
(7)         r ← rank1(Bv, r)
(8)         v ← rightchild(v)
(9) return r

// selectc(T, i): the position of the i-th occurrence of c in text T
WTBWT-select(c, i)
(1) v ← leaf representing c; r ← i
(2) while v is not root do
(3)     p ← parent(v)
(4)     if v is the left child of p then
(5)         r ← select0(Bp, r)
(6)     else
(7)         r ← select1(Bp, r)
(8)     v ← p
(9) return r

WTsearchT(c, (st, ed))
1. Let C[c] be the total number of characters in T that are alphabetically smaller than c
2. st ← C[c] + rank(c, st − 1) + 1
3. ed ← C[c] + rank(c, ed)
4. return (st, ed)

Fig. 5. String searching functions used in the BWT-WT index

Using the functions in Fig. 5, the suffix interval of a pattern P is derived. For any pattern P specified by its suffix range (st, ed) in the Suffix Array (SA), the operation WTsearchT(c, (st, ed)) returns the suffix range in SA of the string cP, where c is any character, st is the start of the range (initially st = 1) and ed is the end of the range (initially the size of the text T). The suffix interval of a pattern P in the reference string T is found by applying this step repeatedly, from the last character of P to the first. Example 1 explains the algorithm.

Example 1:
For the text T = AGCAGCAGACT$, TBWT = TGCC$GGAAAAC. Let the pattern to be searched in T be P = "GCA". Here i = 3 (size of the pattern), st = 1 (initially) and ed = 12 (size of the text T).

Step 1: for i = 3
c = P[3] = A
st = C[A] + rank(A, 0) + 1 = 1 + 0 + 1 = 2
ed = C[A] + rank(A, 12) = 1 + 4 = 5
i.e. the string A occupies the suffix interval [2, 5] in Table II.

Step 2: for i = 2
c = P[2] = C
st = C[C] + rank(C, 1) + 1 = 5 + 0 + 1 = 6
ed = C[C] + rank(C, 5) = 5 + 2 = 7
i.e. the string CA occupies the suffix interval [6, 7] in Table II.

Step 3: for i = 1
c = P[1] = G
st = C[G] + rank(G, 5) + 1 = 8 + 1 + 1 = 10
ed = C[G] + rank(G, 7) = 8 + 3 = 11, so [st, ed] = [10, 11]
i.e. the pattern GCA occupies the suffix interval [10, 11] in Table II.

So the suffix numbers corresponding to the suffix interval [10, 11] of pattern P are 5 and 2 in Table II. Hence the pattern GCA occurs twice in the string T = AGCAGCAGACT$, and its starting positions are the 2nd and the 5th in text T.

IV. EXPERIMENTAL SETUP & RESULTS ANALYSIS

The experiments were conducted on an HP Pavilion g series with a 2.8 GHz four-core Intel Core i3-860 chip with 4 MB L3 cache, but no parallelism was used. The machine runs the 64-bit Ubuntu 12.04 operating system and has 4 GB of internal memory and one 500 GB Serial ATA hard drive (7,200 RPM). The following real-world biological and non-biological data are used to test the efficiency and usability of the proposed method:

1. The human genome sequences from NCBI.
2. Protein data from the Pizza & Chili Corpus.
3. English texts from the Wikipedia dump.
4. Simulated data of a DNA sequence (Arabidopsis thaliana) and its short read archives (http://plants.ensembl.org/), used to compare the CPU time depicted in TABLE V.

G++ 4.7.3 is used to build all the source code for the experiments, which relies on the Succinct Data Structure Library (SDSL).

TABLE III shows the space required for the index prepared for use in BWT-WT. The index size of the proposed approach BWT-WT is compared with the other tools BWA [5], SOAP [9] and Bowtie [11] in TABLE IV and Figure 5. The CPU time of the proposed scheme is compared with the others in TABLE V.

TABLE III. PROPOSED INDEX (BWT-WT) SPACE ANALYSIS


Sequence   Input Size   Index size          Count Array    Number of Suffixes   Size of Auxiliary   Construction
           (N bytes)    (bytes)             size (bytes)   (bytes)              data (bytes)        Time (Sec.)
English    32619430     36326025 ~ 1.11N    1028           2038716              1876136             39.94
DNA        52428801     33881529 ~ 0.65N    1028           3276804              3063596             52.567
Protein    52428801     43720341 ~ 0.83N    1028           3276804              3063444             54.801

TABLE IV. INDEX SIZE COMPARISON OF PROPOSED BWT-WT INDEX WITH OTHERS

File      File Size   Bowtie     BWA        SOAP        BWT-WT
Genome    50 MB       61.5 MB    62.1 MB    114.5 MB    32.3 MB
E-coli    15.3 MB     16 MB      16 MB      30.7 MB     9.9 MB

TABLE V. CPU TIME COMPARISON OF DIFFERENT ALIGNMENT TECHNIQUES FOR SIMULATED DATA

Program    Read Length single pair end read (bp)    CPU Time (s)
Bowtie     36                                       375
SOAP       36                                       249
BWA        36                                       289
BWT-WT     36                                       284

Figure 5. Comparison of index size of proposed scheme with others (vertical axis: index size in MB)

V. CONCLUSION

In this paper, it is shown how to extend the BWT-based approach to a WT-based data structure for compressed indexes. BWT-WT is a simple and fast scheme for short read alignment. Experiments show that the BWT-WT based program achieves more compression and also efficient searching speed in comparison to the BWT-based approach. As future work, one can consider approximate matches (insertions, deletions, gaps).

REFERENCES

[1] D. Adjeroh, T. Bell, and A. Mukherjee. The Burrows-Wheeler Transform: Data Compression, Suffix Arrays, and Pattern Matching. Springer, 1st edition, 2008.
[2] M. Burrows and D. J. Wheeler. A block-sorting lossless data compression algorithm. Systems Research Center, Research Report 124, pp. 1–24, 1994.
[3] P. Ferragina, G. Manzini, V. Mäkinen, and G. Navarro. Compressed representations of sequences and full-text indexes. ACM Trans. Algorithms, 3, 2007.
[4] R. Grossi, A. Gupta, and J. S. Vitter. High-order entropy-compressed text indexes. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '03, pp. 841–850, Philadelphia, PA, USA, 2003. Society for Industrial and Applied Mathematics.
[5] H. Li and R. Durbin. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics, 25(14): pp. 1754–1760, 2009.
[6] U. Manber and G. Myers. Suffix arrays: a new method for on-line string searches. In Proceedings of the First Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '90, pp. 319–327, Philadelphia, PA, USA, 1990. Society for Industrial and Applied Mathematics.
[7] D. Zhang and Q. Liu. Compression and indexing based on BWT: A survey. Web Information System and Application Conference, 2013.
[8] M. Schindler. A fast block-sorting algorithm for lossless data compression. In Proceedings of the Conference on Data Compression (Vol. 469), IEEE Computer Society, March 1997.
[9] R. Li, Y. Li, K. Kristiansen, and J. Wang. SOAP: short oligonucleotide alignment program. Bioinformatics, 24(5): pp. 713–714, 2008.
[10] B. Langmead, C. Trapnell, M. Pop, and S. Salzberg. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology, 10(3): Article R25, 2009.
[11] Succinct Data Structure Library: https://github.com/simongog/sdsl-lite
[12] H. Li and R. Durbin. Fast and accurate long read alignment with Burrows–Wheeler transform. Bioinformatics, 26(5): pp. 589–595, 2010.
[13] R. Raman, V. Raman, and S. Srinivasa Rao. Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. In SODA, pp. 233–242, 2002.
