Title: SAGE: String-overlap Assembly of GEnomes

Authors:
Lucian Ilie, Bahlul Haider, Michael Molnar, Roberto Solis-Oba
Department of Computer Science, University of Western Ontario
London, ON, N6A 5B7, Canada

Corresponding author:
Lucian Ilie
Professor
Department of Computer Science
University of Western Ontario
London, Ontario, N6A 5B7
Canada
e-mail: ilie@csd.uwo.ca
phone: 1 519 661 2111 x 86848
fax: 1 519 661 3515

Running title: SAGE - genome assembly

Keywords: De novo genome assembly, DNA sequencing, String-overlap graph


October 22, 2013

Abstract
We present a new program, SAGE, for de novo genome assembly. As opposed to most assemblers, which are de
Bruijn graph based, SAGE uses the string-overlap graph. SAGE benefits from improvements in almost every aspect of
the assembly process: error correction of input reads, string-overlap graph construction, read copy counts estimation,
overlap graph analysis and reduction, contig extraction, and scaffolding. The assemblies produced by SAGE compare
favourably with those of existing leading assemblers.

Next-generation sequencing (NGS) technologies are causing an unprecedented revolution in biological
sciences. The ability to obtain the genome sequence of a species quickly and at a relatively low cost
has tremendous biological applications to cancer research, genetic disorders, disease control,
neurological research, personalized medicine, etc. NGS technologies such as Illumina, 454, APG,
Helicos, Pacific Biosciences, and Ion Torrent (Metzker, 2010) produce huge outputs at ever decreasing
costs, enabling ambitious projects such as the Genome 10K Project, Haussler et al. (2009)
(www.genome10k.org) whose goal is to obtain the genomes of 10,000 vertebrate species, the 1000
Genomes Project, Siva (2008) (www.1000genomes.org) that proposes to obtain the genomes of 1000
genetically varying humans, or the Human Microbiome project, Turnbaugh et al. (2007)
(commonfund.nih.gov/Hmp) whose aim is to characterize the microbial communities found at several
different sites on the human body.

The ability to de novo assemble genomic data is crucial for the success of such projects as well as
many other applications. There is increased demand in both computing power and algorithmic ideas in
order to cope with the increasingly popular NGS data. Great work has been done on creating improved
assembly programs, e.g., Butler et al. (2008); Dohm et al. (2007); Li (2012); Li et al. (2010);
Luo et al. (2012); Simpson and Durbin (2012); Simpson et al. (2009); Zerbino and Birney (2008), as
well as surveying various techniques, Pop (2009), or critically and thoroughly evaluating the existing
assemblers (Earl et al., 2011; Salzberg et al., 2012); we refer the reader to the latter three papers
for references to other genome assembly programs or related work. In spite of considerable advances,
much improvement is still needed in the current state-of-the-art technology for de novo genome
assembly.

NGS technologies produce short pieces of DNA, called reads. Reads that overlap significantly offer a
good indication that they may come from the same region of the genome. Two main approaches are used
in building assemblies. Both use overlaps between the reads but in different ways. In the
string-overlap graph approach (Myers, 1995, 2005) sufficiently long overlaps between reads are used
as edges to connect vertices that represent reads. In the de Bruijn graph approach (Idury and
Waterman, 1995; Pevzner et al., 2001) reads are broken into k-mers that are used as vertices
connected by edges representing overlaps of length k − 1. The NP-hard Hamiltonian Path Problem
related to the former approach is replaced in the latter by the polynomial-time solvable Eulerian
Path Problem. However, regardless of the model, the problems underlying both approaches can be shown
to be NP-hard (Medvedev et al., 2007). The de Bruijn graph approach seems counterintuitive, as it
breaks the reads into shorter k-mers, which appears contrary to the technological efforts to produce
longer reads. Nevertheless, the most successful assemblers to date use the de Bruijn graph approach.

We propose a new assembler, SAGE (String-overlap Assembly of GEnomes), that is string-overlap graph
based. SAGE includes improvements in almost every aspect of an assembler: error correction of input
reads, string-overlap graph construction, read copy counts estimation, overlap graph analysis and
reduction, contig extraction, and scaffolding. Extensive testing against several of the very best
assemblers, ABySS (Simpson et al., 2009), SGA (Simpson and Durbin, 2012), SOAPdenovo2 (Luo et al.,
2012), and Velvet (Zerbino and Birney, 2008) shows that SAGE often has superior performance.

Results

Leading assemblers

We have chosen for comparison several leading de novo assembly programs, according to their ranking
by the Assemblathon 1 competition (Earl et al., 2011) and the GAGE survey (Salzberg et al., 2012).
ALLPATHS (Butler et al., 2008) and SOAPdenovo (Li et al., 2010), both de Bruijn graph based, are
ranked at the top; we have tested against the improved SOAPdenovo2 (Luo et al., 2012). We could not
use ALLPATHS since it is unable to assemble single-library real datasets. We have also included
ABySS (Simpson et al., 2009) and Velvet (Zerbino and Birney, 2008), both highly ranked and apparently
the most widely used programs, also de Bruijn graph based.

The most notable exception from the domination of the de Bruijn graph-based assemblers is the recent
SGA program (Simpson and Durbin, 2012), which is string-overlap graph based. SGA uses compressed
data structures to keep the memory requirements low, with the main aim of being able to run on
low-end computing clusters. Nevertheless, SGA is a successful assembler, ranked third in the
Assemblathon 1 competition (Earl et al., 2011). Our efforts are complementary to those of Simpson
and Durbin (2012) in the attempt of producing better assemblers that are string-overlap graph based.

Datasets

We compare the assemblers on real datasets only. We have downloaded a number of datasets from the
NCBI web site (www.ncbi.nlm.nih.gov), with varied read length and genome size, together with a
C. elegans dataset, from www.wormbase.org, that has been previously used by Simpson and Durbin
(2012). As noticed by Simpson and Durbin (2012), the genome of C. elegans is an excellent test case
for assembly algorithms due to its long (100 Mbp) and accurate reference sequence, free of SNPs and
structural variants. The accession numbers for all datasets and their reference genomes are given in
Table 1, together with all their parameters.

Evaluation method

The most recent version of each assembler has been used and each program was run for all possible
k-mers or minimum overlap sizes on each dataset. The k-mer or minimum overlap size producing the
highest scaffold N50 value is reported. N50 is representative of the quality of the assembly;
however, a high N50 value is not sufficient by itself, as it says nothing about the quality of the
contigs. In particular, misassembled contigs can artificially increase N50. Therefore, we have used
the QUAST comprehensive evaluation tool (Gurevich et al., 2013) to compare the quality of the
assemblies.

The most relevant parameter is NGA50, which is computed by aligning the contigs to the reference
genome, splitting them at misassembly breakpoints, eliminating unaligned parts, and then computing
the N50 of the obtained contigs with respect to the length of the reference genome.

We present also the NGA75, correspondingly defined, the length of the largest alignment, the fraction
of genome covered, the number of unaligned contigs, and the average number of indels and mismatches
for each 100kbp, as the most important parameters computed by QUAST. All the details given by QUAST
are included in the Supplementary material. Details on how each program was run and its output was
evaluated by QUAST, including the precise commands used, are also given in the Supplementary
material.

Comparison

The best assemblies produced by all the assemblers considered are compared with respect to the
following parameters as computed by QUAST: NGA50, NGA75, largest alignment, genome coverage,
unaligned contigs, and average indels and mismatches, presented in Tables 2-7, respectively.
Whenever meaningful, we present also the average of the results in the last row. SAGE has the best
NGA50, NGA75, and length of the longest aligned contig for almost all datasets. For NGA50, the
average of SAGE is 50% better than the second one, from Velvet. The genome coverage of SAGE is the
best, tied with that of ABySS. The number of unaligned contigs is similar for all assemblers, with
ABySS performing better on the bacterial datasets but significantly worse on C. elegans; SAGE
performed the best on C. elegans. Concerning the average number of indels and mismatches per 100kbp,
significant differences are seen between datasets, with datasets 2, 3, and 4 having many more errors
than the other ones, with similar values for all assemblers. For the remaining ones, SGA
performed the best. However, for NGA50, NGA75, and the longest aligned contig, SGA performance came
last.

The time and space comparison is presented in Table 8. In order to facilitate comparison we present
the time (seconds) and space (megabytes) per input mega base pairs. This way we can also compute
averages. ABySS uses the least amount of space and SOAPdenovo2 is the fastest, with SAGE coming
closely in second place for both time and space. Actual time and space values are presented in the
Supplementary material.

Table 1: The datasets used for evaluation, sorted increasingly by total number of base pairs. All
datasets and reference genome sequences are obtained from the NCBI, except C.elegans that is from
www.wormbase.org.

Dataset  Organism                        Accession  Reference    Genome       Read    Number of   Number of      Coverage
                                         number     genome       length       length  reads       base pairs
1        Bacillus subtilis               DRR000852  NC_000964.3    4,215,606      75   3,519,504    263,962,800     62.62
2        Chlamydia trachomatis           ERR021957  NC_000117.1    1,042,519      37   7,825,944    289,559,928    277.75
3        Streptococcus pseudopneumoniae  SRR387784  NC_015875.1    2,190,731     100   4,407,248    440,724,800    201.18
4        Francisella tularensis          SRR063416  NC_006570.2    1,892,775     101   6,907,220    697,629,220    368.57
5        Leptospira interrogans          SRR397962  NC_005823.1    4,277,185     100   7,127,250    712,725,000    166.63
6        Porphyromonas gingivalis        SRR413299  NC_002950.2    2,343,476     100   9,497,946    949,794,600    405.29
7        Escherichia coli                SRR072099  NC_000913.2    4,639,675      36  30,355,432  1,092,795,552    235.53
8        Clostridium thermocellum        SRR400550  NC_009012.1    3,843,301      36  31,994,160  1,151,789,760    299.69
9        Caenorhabditis elegans          SRR065390  WS222        100,286,070     100  67,617,092  6,761,709,200     67.42

Table 2: NGA50 comparison; best results in bold.

Dataset      ABySS        SGA      SOAP2     Velvet       SAGE
1          423,890     68,419    551,507    860,106    924,197
2          301,840     97,593    225,668    152,094    669,089
3           23,245     21,876     26,356     23,245     30,232
4           25,749     23,314     23,294     23,292     23,961
5          117,711     83,128    132,993    108,841    182,864
6           35,564     37,013     42,835     41,199     54,125
7          101,741     10,038     98,665     94,618     96,980
8           52,944     23,747     54,744     48,286     54,883
9           18,210     20,436     31,973     25,676     32,442
Avg.       122,322     42,840    132,004    153,040    229,864

Table 3: NGA75 comparison; best results in bold.

Dataset      ABySS        SGA      SOAP2     Velvet       SAGE
1          162,208     40,124    306,202    190,441    306,386
2          160,704     51,570    125,082     90,503    307,765
3            9,847      7,570      6,785      8,128     10,040
4           14,491     13,117     13,117     13,117     13,377
5           58,556     40,333     64,594     60,408     87,232
6           20,005     18,062     19,982     19,855     25,176
7           56,943      5,270     54,790     42,438     54,784
8           28,805      8,618     25,243     23,657     29,529
9            7,126      7,596     13,232     10,281     14,095
Avg.        57,632     21,362     69,892     50,981     94,265

Table 4: Max alignment; best results in bold.

Dataset      ABySS        SGA      SOAP2     Velvet       SAGE
1          800,991    241,307  1,014,436  1,016,217  1,016,322
2          359,339    210,791    339,457    328,063    669,089
3          125,616    125,616    125,563    125,616    125,616
4           87,729     87,426     87,417     87,416     87,862
5          413,583    319,895    320,270    409,578    550,746
6          172,567    167,699    167,686    167,731    172,565
7          326,073     54,214    325,634    242,502    326,332
8          186,547    106,016    186,433    186,372    186,424
9          213,835    239,959    382,096    238,636    383,476
Avg.       298,476    172,547    327,666    311,348    390,937

Table 5: Genome coverage (%); best results in bold.

Dataset      ABySS        SGA      SOAP2     Velvet       SAGE
1            99.04      98.67      98.63      98.80      98.95
2            98.57      98.04      94.65      98.57      99.49
3            83.30      82.82      81.54      82.78      83.19
4            95.61      93.07      92.58      93.03      93.56
5            99.49      98.77      98.75      99.08      99.67
6            97.97      95.08      95.62      95.87      97.77
7            95.62      94.10      94.80      94.99      95.42
8            95.78      92.63      92.81      93.23      95.85
9            95.49      95.19      95.28      94.45      96.94
Avg.         95.65      94.26      93.85      94.53      95.65

Discussion

Myers (2005) advocates that string-overlap graph based assemblers should perform better than those
based on the de Bruijn graph and our work aims at supporting his prediction. SAGE builds upon great
existing work and brings an important number of new ideas, such as the efficient computation of the
transitive reduction of the graph, the use of (generalized) edge multiplicity statistics for improved
estimation of copy counts, and the improved use of mate pairs and flow for supporting edge merging.
SAGE shows that the potential of string-overlap graph-based assemblers is higher than previously
thought. We hope that
some of these ideas will be used also by others in order to boost the development of this type of
assemblers and further improve the current state of the art. As read length is going to grow, we
expect that string-overlap graph-based assemblers will have a better chance to improve.

Table 6: Unaligned contigs; best results in bold.

Dataset      ABySS        SGA      SOAP2     Velvet       SAGE
1                1          0          8          0          0
2               10         16         16         18         19
3                3         26         29         27         14
4                0          1          2          1          1
5                0          1          0          0          0
6                0          0          0          0          0
7                1          4          1          0          2
8                1          0         11          1          3
9              978        304        272        274        267

Table 7: Average number of indels and mismatches per 100kbp; best results in bold.

Dataset      ABySS        SGA      SOAP2     Velvet       SAGE
1             8.65       2.18       7.14      19.95       5.61
2           825.09     833.32     817.64     831.27     893.48
3          2407.49    2417.37    2387.42    2406.82    2402.27
4           518.38     522.02     527.28     503.81     530.76
5            19.01      18.89      45.86      15.81      15.58
6            10.8       35.81      13.83      17.41      26.75
7            36.2        8.82      29.77      29.95      47.86
8            14.45       3.15      73.19      17.96      46.04
9            50.08      34.33      34.84      60.89      47.63

Table 8: Time(s)/Space(MB) comparison; time in seconds and space in megabytes, both per input mega
base pairs. The last row gives the average values.

Dataset  ABySS        SGA           SOAP2         Velvet        SAGE
1        2.17/3.96    11.68/8.78    0.72/17.98    10.31/8.46    0.55/3.15
2        2.68/8.13    14.67/12.58   0.57/23.11    19.99/11.83   0.60/3.18
3        1.08/1.07    13.50/9.67    0.37/15.94    7.75/2.82     0.39/1.87
4        1.48/1.13    13.99/10.24   0.56/14.35    7.57/1.94     0.36/1.52
5        1.19/1.03    14.35/10.35   0.36/11.47    7.79/2.09     0.45/2.00
6        2.15/1.96    13.30/13.96   0.54/9.99     23.59/7.04    0.54/3.41
7        1.87/1.71    11.56/12.14   0.47/8.68     20.50/6.12    0.47/2.96
8        2.69/1.46    4.42/4.67     0.38/6.94     19.35/7.76    3.19/6.27
9        4.19/2.33    20.84/10.38   1.54/5.18     7.97/4.71     0.85/2.67
Avg.     2.17/2.53    13.15/10.31   0.61/12.63    13.87/5.86    0.82/3.00

Methods

The algorithms used in SAGE are described here. Some of the existing ideas are included as well in
order to give a complete description.

Error correction

All NGS datasets contain errors that make any usage of such data, and genome assembly in particular,
very difficult. We have used a new program, RACER (Ilie and Molnar, 2013), that consistently exceeds
the error correcting performance of existing programs. All datasets have been corrected with RACER
before being assembled with SAGE.

Bidirected graph

Assume a dataset of n input reads of length ℓ each, sequenced from a genome of length L. The
string-overlap graph (Myers, 2005) has the reads as vertices. There is an edge between two reads if
there is an overlap between their sequences (or reverse complements) of length higher than a given
threshold, M, the minimum overlap size. In order to avoid the complication due to double strandedness
of DNA, Kececioglu (1991) introduced the bidirected overlap graph, where a read and its reverse
complement are represented by the same vertex and an edge has an orientation at each end point,
depending on whether the read or its reverse complement is used in producing the overlap defining the
edge. Three possible types of edges are thus obtained. Each edge has a string associated with it,
obtained from the strings of the reads according to their overlaps. For instance, two strings xy and
yz (y is the overlap) produce the string xyz for the edge. Assuming no errors in the reads, a
consistent path through the graph (for each vertex, the orientation of the ingoing edge must match
the orientation of the outgoing edge) spells a substring of the genome. That is also the way of
associating a string with a path in the graph.

String-overlap graph construction

In order to efficiently find all overlaps of length at least M between reads, we make the following
observation. Whenever two reads share an overlap of length M or more bases, there exists a prefix or
suffix of one of the reads that occurs as a substring of the other read. Therefore, we build a hash
table with all prefixes and suffixes of all reads (and reverse complements) of length min{64, M}. A
fast computation of these is enabled by a 2-bit representation of the DNA bases and computation of
the prefixes and suffixes as 64-bit integers by fast bit operations.

After the hash table is built, a search is performed for all substrings of length min{64, M} of all
reads. This is done in one pass through all reads with expected constant time search per substring
and fast computation of the next substring, again using efficient bit operations. Each successful
search is followed by a fast extension, to check for a valid overlap. Whenever an overlap is found,
the corresponding edge is inserted in the graph.

Space-efficient transitive reduction

The string-overlap graph can be very large and a transitive reduction is performed to significantly
decrease its size. An edge e = (r1, r2) is transitive if there is another read r3 and edges
e1 = (r1, r3), e2 = (r3, r2) such that the string of the edge e is the same as the one of the path
(e1, e2). Noticing that the overlaps producing the edges e1 and e2 are longer, and thus more
reliable than the one producing e, the transitive edge e can be eliminated.

Myers (2005) gave a linear expected time algorithm for transitive reduction. While Myers' algorithm
is very efficient, the graph has to be built before being reduced, thus creating a space bottleneck.
We have modified Myers' algorithm to reduce the graph as it is being built. Myers' essential
observation is that the edges adjacent to a vertex have to be considered in increasing order of
their lengths. We maintain this order and in addition attempt to reduce the graph as locally as
possible. That is, we build only the part of the graph necessary to determine the transitively
reducible edges for a given vertex v. These edges are marked for elimination but not yet removed.
Once all vertices whose transitive reduction can be influenced by the edges incident with v have been
investigated, the edges of v marked for elimination can be removed, thus reducing the space during
construction. The running time for building the transitively reduced graph remains the same as if
the complete string-overlap graph is first constructed and then transitive edges are removed;
however, the space required decreases considerably. This is essential for the entire SAGE algorithm
since graph construction is the most space consuming step.

Graph simplification

The next step is further simplifying the graph. First, any path consisting exclusively of vertices
of indegree and outdegree one is compressed to a single edge, subsequently called composite edge;
edges that are not composite are called simple. The string spelled by the composite path is stored
with the new edge. We also store with the new edge the information concerning the reads
corresponding to the collapsed vertices.

Our error correction procedure is very effective; however, errors remain in the corrected dataset.
The correcting step does not remove any reads in order to avoid breaks in coverage. Overlaps between
a correct read and one containing errors most likely result in short "dead-end" paths in the graph.
As composite path compression is performed, dead-end paths consist of single edges. Whenever the
number of reads in such an edge is lower than an experimentally determined threshold, the edge is
removed.

Sometimes such erroneous paths can connect back into the graph, resulting in "bubbles." A bubble
consists of two disjoint single paths between two vertices such that their strings are highly
similar but their number of reads is very different. In such a case, if one of the two paths has
much lower coverage, then it is removed.

Edge multiplicity

Assuming complete coverage and no errors, the genome would be represented as a path in the graph. The
number of times an edge is traversed by this path equals the number of times the string associated
with the edge occurs in the genome. Therefore, for an edge e containing k reads and such that its
associated string has length d, assuming the reads are sampled uniformly from the genome, the
probability that e has multiplicity m, that is, its string occurs m times in the genome, is:

\[
\Pr(e, m) = \binom{n}{k} \left(\frac{md}{L}\right)^{k} \left(1 - \frac{md}{L}\right)^{n-k}
\approx \frac{1}{k!} \left(\frac{mdn}{L}\right)^{k} e^{-\frac{mdn}{L}} .
\]

In order to estimate the actual number of copies e has in the genome, we define the logarithm of the
ratio between the probability of e having m or m + 1 copies:

\[
R(e, m) = \ln\left(\frac{\Pr(e, m)}{\Pr(e, m+1)}\right)
\approx \frac{dn}{L} - k \ln\left(\frac{m+1}{m}\right) .
\]

For m = 1, this log-odds ratio was already used by Myers (2005) (called also A-statistics) to
identify unique, or single-copy, edges. Our general definition is essential in obtaining accurate
estimates of the read copy counts.

Genome and insert size estimation

When defining the edge multiplicity probability and log-odds ratios, the genome length L is necessary
but unknown. The values R(e, 1) are used to estimate L using the bootstrap algorithm of Myers (2005).
The arrival rate for a unique edge of length d and k reads is expected to
be close to that of the entire genome, which gives the estimate: L ≈ nd/k. Edges of length 1000 or
more are initially assumed unique. Subsequent estimates of L are used for computing the R(e, 1)
ratios and only those edges with R(e, 1) ≥ 20 are kept as unique in future iterations. This bootstrap
procedure is repeated until the set of unique edges does not change. In practice this happened in
five iterations or less and a very good estimate for L was obtained.

Often the dataset does not include information concerning the insert size, that is, the distance
between the two reads of the same mate pair. The mean, µ, and standard deviation, σ, of the insert
size distribution are estimated by considering only those mate pairs that belong to the same edge in
the current graph where the distance between the reads in the edge is known.

Maximum likelihood assembly

Assembling a genome as the shortest string that contains the given reads suffers from an important
drawback, that of overcompressing the genome or overcollapsing the repeats. This was already noticed
by Myers (1995), where the maximum likelihood reconstruction of the genome was proposed, that is,
instead of the shortest genome, the one that is most likely to have produced the observed reads is
searched for. Medvedev and Brudno (2009) considered a very interesting approach to maximum likelihood
assembly that is suitable for our purpose.

Our goal is to produce good estimates for the read copy counts, that is, for each read, the number of
times it appears in the genome. For a read r_i, assume its copy count is c_i. The c_i values are
unknown. What is known are the observed values, say x_i, and the likelihood, ℒ, that needs to be
maximized; maximizing ℒ is the same as minimizing its negative logarithm:

\[
\mathcal{L} = \prod_{i} \binom{n}{x_i} \left(\frac{c_i}{L}\right)^{x_i}
\left(1 - \frac{c_i}{L}\right)^{n - x_i} ,
\]

\[
-\log(\mathcal{L}) = \sum_{i} \bigl( -x_i \log c_i - (n - x_i) \log(L - c_i) \bigr) + C ,
\]

where C is a constant independent of the c_i's.

For each i, the path in the graph spelling the genome sequence traverses the edge containing the read
r_i exactly c_i times. Since −log(ℒ) is separable convex, maximizing the likelihood is reduced to a
convex min-cost bidirected flow problem in a network built on the string-overlap graph. First, each
convex cost function is given a three-piece linear approximation. Then the bidirected flow problem is
reduced to a directed flow problem that is solved using the CS2 algorithm (Goldberg, 1997) downloaded
from www.igsystems.com/cs2. We refer to Medvedev and Brudno (2009) for details.

Read copy counts estimation

As explained in the previous section, the solution to the bidirected flow problem gives an estimation
of the copy counts c_i of the reads. However, a crucial element is the set of bounds we place on the
capacities of the edges. For vertices, we must set a lower bound of 1 since the reads in vertices
must be included in the assembly.

For edges, we use the R(e, m) statistics that we introduced previously. We consider only long edges
(1000 bp or more) for which the statistics should work well. For one such edge e, if R(e, 1) ≥ T, for
some threshold T (T = 3 works well in practice), then we set the lower and upper bounds on the
capacity of the edge as l(e) = u(e) = 1, thus forcing the flow in these edges to be 1. If
R(e, 1) < T, then we find the smallest m such that R(e, m − 1) ≤ −T and R(e, m) ≥ T and set the lower
bound l(e) = m and the upper bound to some large value, u(e) = ∞. In case the above procedure fails
to assign lower bounds, we set l(e) = 1 and u(e) = ∞. For the composite edges shorter than 1000bp but
containing at least 30 reads, we assign l(e) = 1 and u(e) = ∞. The remaining edges receive the
trivial l(e) = 0, u(e) = ∞.

The estimation of the copy counts obtained using the above procedure is very good. Table 9 gives the
comparison between our procedure and the one of Medvedev and Brudno (2009). The latter one works well
for synthetic datasets, as reported in Medvedev and Brudno (2009), but not for real data.

Table 9: Predicted read copy count comparison between the algorithm of Medvedev and Brudno (2009),
denoted MB09, and the procedure used by SAGE. The values given are the percentages of correctly
predicted copy counts. The MB09 algorithm could not process the last 3 datasets.

Data     MB09     SAGE
1        3.19    82.71
2        4.85    71.87
3        7.24    59.91
4       13.76    51.86
5       10.15    54.86
6        8.03    56.85
7       10.02    55.78
8        9.37    58.61
9        5.31    65.90
10      11.11    53.87
11       9.40    56.26
12          -    65.40
13          -    66.67
14          -    73.22
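
The log-odds statistic, the bootstrap estimate of L, and the capacity bounds described above can be
illustrated with a short Python sketch. This is a minimal illustration, not the SAGE implementation:
the function names, the (k, d) tuple layout for edges, the pooled estimate n * sum(d) / sum(k) over
the edges currently considered unique, and the search cap max_m are all assumptions introduced here.
Only the formulas and the thresholds (1000 bp, R(e, 1) ≥ 20, T = 3, 30 reads) come from the text.

```python
import math

def log_odds(k, d, n, L, m=1):
    """Approximate R(e, m) = d*n/L - k*ln((m+1)/m) for an edge with k reads
    whose string has length d, given n reads and genome length L."""
    return d * n / L - k * math.log((m + 1) / m)

def estimate_genome_length(edges, n, min_len=1000, cutoff=20.0, max_iter=100):
    """Bootstrap estimate of L from edges given as (k, d) pairs.

    Assumption: the per-edge estimate L ~ n*d/k is pooled over all edges
    currently considered unique as n * sum(d) / sum(k)."""
    unique = {i for i, (k, d) in enumerate(edges) if d >= min_len and k > 0}
    if not unique:
        raise ValueError("no edge of length >= min_len containing reads")
    L = None
    for _ in range(max_iter):
        total_d = sum(edges[i][1] for i in unique)
        total_k = sum(edges[i][0] for i in unique)
        L = n * total_d / total_k
        kept = {i for i in unique if log_odds(*edges[i], n, L) >= cutoff}
        if kept == unique or not kept:
            break  # the unique set is stable (or would become empty)
        unique = kept
    return L

def capacity_bounds(k, d, n, L, T=3.0, min_len=1000,
                    composite=False, min_reads=30, max_m=1000):
    """Return (lower, upper) flow bounds for an edge, following the rules in
    the text. max_m is a hypothetical cap on the multiplicity search."""
    if d >= min_len:
        if log_odds(k, d, n, L, 1) >= T:
            return 1, 1  # confidently single-copy: force flow 1
        for m in range(2, max_m + 1):
            if log_odds(k, d, n, L, m - 1) <= -T and log_odds(k, d, n, L, m) >= T:
                return m, math.inf
        return 1, math.inf  # the statistic gave no confident answer
    if composite and k >= min_reads:
        return 1, math.inf  # short composite edge with enough reads
    return 0, math.inf      # trivial bounds

# Toy data (made up): 100,000 reads; the third edge is a two-copy repeat.
edges = [(200, 2000), (150, 1500), (400, 2000), (40, 500)]
L_est = estimate_genome_length(edges, n=100_000)
print(L_est, [capacity_bounds(k, d, 100_000, L_est) for k, d in edges])
```

On this toy input the repeat edge is dropped from the unique set after the first iteration, which
shows why the bootstrap is needed: treating it as unique would underestimate L.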
Further graph simplification

Based on the flow computed above, several further simplifications are performed to the graph. For a
vertex, r, with only one incoming edge, (s, r) and outgoing edges (r, r_i), 1 ≤ i ≤ k, we remove the
vertex r and its adjacent edges and add (s, r_i), 1 ≤ i ≤ k. The flow on the edge (s, r_i) is the
same as it was on the edge (r, r_i). A similar modification is done for vertices with only one
outgoing edge.

For a vertex r that has a self loop (r, r) and only two adjacent edges, (s, r) and (r, t), we remove
the vertex r and its edges and replace them with (s, t). Note that the flow on (s, t) is the same as
it was on either (s, r) or (r, t).

Mate pair support

After the above modification, a vertex having more than one incoming edge and more than one outgoing
edge is ambiguous. Mate pairs are used in connection with the flow to solve some of these
ambiguities. For a vertex r, an incoming edge (s, r) and an outgoing edge (r, t), we say that a mate
pair (r1, r2) supports the path (s, r, t) through r if all paths of length within the range µ ± 3σ
from r1 to r2 include the path (s, r, t). Note that this is significantly more general than simply
having r1 on the edge (s, r) and r2 on the edge (r, t). There may be many paths of length µ ± 3σ
between r1 and r2. The support of (r1, r2) for the path (s, r, t) means that all these paths must go
through (s, r, t).

The support required in SAGE for merging a pair of adjacent edges is at least 5. When edges (s, r)
and (r, t) are merged, an edge (s, t) is added with flow equal to the minimum of the flows on (s, r)
and (r, t). The edge with lower flow out of (s, r) and (r, t) is then deleted and the other has its
flow decreased by the minimum of the two. If both have the same flow, then both are deleted. Often, r
has only four adjacent edges and so it is completely resolved by this procedure.

In order to save space, we compute the paths of length up to µ + 3σ from each node and then consider
the reads on all edges incident to that node.

Due to lack of coverage in some regions or errors in the reads, some mate pairs may have no path to
connect them in the graph. We can still use their support in such cases as follows. Consider a mate
pair (r1, r2) such that r1 belongs to the edge e1 and r2 belongs to the edge e2. If the sum of the
distances from r1 to the end of the edge e1 and from the corresponding end of e2 to r2 is less than
µ + 3σ, then (r1, r2) supports the edges (e1, e2) to be merged. This type of support from mate pairs
is less reliable and we strengthen it by considering only edges with non-zero flow or having
sufficiently long associated paths (at least 100bp).

Assuming sufficient support is accumulated to merge the edges e1 and e2, the distance between their
ends is estimated based on µ. If they overlap, there will be no gap, otherwise the gap is filled
with N's.

Finally, the set of output contigs consists of the strings of minimum required length (default is
100) that are associated with edges of non-zero flow.

SAGE overview

We present here a brief overview of the main stages of SAGE using all the procedures presented above.

SAGE Algorithm
1. Use RACER to correct the input dataset
2. Build transitively reduced string-overlap graph
3. Compress composite paths
4. Remove dead-ends and bubbles
5. Compute edge statistics
6. Estimate genome size
7. Estimate mean and std. dev. of insert size
8. Compute min-cost bidirected flow
9. Reduce single-edge and loop vertices
10. Compute mate pair support
11. Resolve ambiguous vertices
12. Merge contigs without connecting paths
13. Output assembly

Data access

All datasets are available from the NCBI, except C.elegans which is from www.wormbase.org. The
genome assemblers are available as follows: ABySS at www.bcgsc.ca/platform/bioinfo/software/abyss,
SGA at github.com/jts/sga, SOAPdenovo2 at http://soapdenovo2.sourceforge.net/, and Velvet at
www.ebi.ac.uk/~zerbino/velvet. CS2 is available at www.igsystems.com/cs2.
SAGE is available at www.csd.uwo.ca/~ilie/SAGE/.

Acknowledgements

We would like to thank Jared Simpson for helping with the installation of SGA and Paul Medvedev for
sharing his code for copy count estimates. Performance evaluation has been performed using the
facilities of the Shared Hierarchical Academic Research Computing Network (SHARCNET:
www.sharcnet.ca) and Compute/Calcul Canada. L.I., B.H., R.S.-O., and M.M. have been partially
supported by grants from the Natural Sciences and Engineering Research Council of Canada (NSERC).

Author contributions

L.I. proposed the string-overlap graph approach, the generalized edge multiplicity statistics, and
the efficient transitive reduction, B.H. implemented the SAGE algorithm, and R.S.-O. proposed various
algorithmic improvements. L.I., B.H., and R.S.-O. met regularly and discussed various ideas that were
implemented and tested by B.H. M.M. installed the competing programs and the evaluation software,
performed all final tests and comparisons, and wrote the SAGE manual. L.I. wrote the manuscript that
was read and approved by all authors.

References

Butler, J., MacCallum, I., Kleber, M., Shlyakhter, I. A., Belmonte, M. K., Lander, E. S., Nusbaum,
C., and Jaffe, D. B. (2008). ALLPATHS: De novo assembly of whole-genome shotgun microreads. Genome
Res., 18(5), 810–820.

Dohm, J. C., Lottaz, C., Borodina, T., and Himmelbauer, H. (2007). SHARCGS, a fast and highly
accurate short-read assembly algorithm for de novo genomic sequencing. Genome Res., 17(11),
1697–1706.

Li, R., Zhu, H., Ruan, J., Qian, W., Fang, X., Shi, Z., Li, Y., Li, S., Shan, G., Kristiansen, K.,
Li, S., Yang, H., Wang, J., and Wang, J. (2010). De novo assembly of human genomes with massively
parallel short read sequencing. Genome Res., 20, 265–272.

Luo, R., Liu, B., Xie, Y., Li, Z., Huang, W., Yuan, J., He, G., Chen, Y., Pan, Q., Liu, Y., et al.
(2012). SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler.
GigaScience, 1(1), 18.

Medvedev, P. and Brudno, M. (2009). Maximum likelihood genome assembly. Journal of Computational
Biology, 16(8), 1101–1116.

Medvedev, P., Georgiou, K., Myers, G., and Brudno, M. (2007). Computability of models for sequence
assembly. Algorithms in Bioinformatics, pages 289–301.

Metzker, M. L. (2010). Sequencing technologies - the next generation. Nat Rev Genet, 11(1), 31–46.

Myers, E. W. (1995). Toward simplifying and accurately formulating fragment assembly. Journal of
Computational Biology, 2(2), 275–290.

Myers, E. W. (2005). The fragment assembly string graph. Bioinformatics, 21(suppl 2), ii79–ii85.

Pevzner, P. A., Tang, H., and Waterman, M. S. (2001). An Eulerian path approach to DNA fragment
assembly. Proceedings of the National Academy of Sciences, 98(17), 9748–9753.

Pop, M. (2009). Genome assembly reborn: recent computational challenges. Briefings in
Bioinformatics, 10(4), 354–366.

Salzberg, S. L., Phillippy, A. M., Zimin, A., Puiu, D., Magoc, T., Koren, S., Treangen, T. J.,
Schatz, M. C., Delcher, A. L., Roberts, M., et al. (2012). GAGE: A critical evaluation of genome
assemblies and assembly algorithms. Genome Res., 22(3), 557–567.

Simpson, J. T. and Durbin, R. (2012). Efficient de novo assembly of large


Earl, D., Bradnam, K., John, J. S., Darling, A., Lin, D., Fass, J., Yu, H.
genomes using compressed data structures. Genome Res., 22(3), 549–
O. K., Buffalo, V., Zerbino, D. R., Diekhans, M., et al. (2011). Assem-
556.
blathon 1: A competitive assessment of de novo short read assembly
methods. Genome Res., 21(12), 2224–2241.
Simpson, J. T., Wong, K., Jackman, S. D., Schein, J. E., Jones, S. J., and
Birol, İ. (2009). ABySS: a parallel assembler for short read sequence
Goldberg, A. V. (1997). An efficient implementation of a scaling data. Genome Res., 19(6), 1117–1123.
minimum-cost flow algorithm. Journal of algorithms, 22(1), 1–29.
Siva, N. (2008). 1000 Genomes project. Nat Biotech, 26(3), 256.
Gurevich, A., Saveliev, V., Vyahhi, N., and Tesler, G. (2013). QUAST:
quality assessment tool for genome assemblies. Bioinformatics, 29(8), Turnbaugh, P. J., Ley, R. E., Hamady, M., Fraser-Liggett, C. M., Knight,
1072–1075. R., and Gordon, J. I. (2007). The human microbiome project. Nature,
449(7164), 804–810.
Haussler, D., O’Brien, S., and Ryder, O. et al. (2009). Genome 10K:
A proposal to obtain whole-genome sequence for 10,000 vertebrate Zerbino, D. R. and Birney, E. (2008). Velvet: algorithms for de novo
species. Journal of Heredity, 100(6), 659–674. short read assembly using de bruijn graphs. Genome Res., 18(5),
821–829.
Idury, R. M. and Waterman, M. S. (1995). A new algorithm for DNA
sequence assembly. Journal of Computational Biology, 2(2), 291–
306.

Ilie, L. and Molnar, M. (2013). RACER: Rapid and accurate correction


of errors in reads. Bioinformatics, 29(19), 2490–2493.

Kececioglu, J. D. (1991). Exact and approximation algorithms for DNA


sequence reconstruction. Ph.D. thesis, The University of Arizona.

Li, H. (2012). Exploring single-sample SNP and INDEL calling with


whole-genome de novo assembly. Bioinformatics, 28(14), 1838–1844.
