
Hindawi Publishing Corporation

International Journal of Genomics


Volume 2015, Article ID 563482, 10 pages
http://dx.doi.org/10.1155/2015/563482

Research Article
RECORD: Reference-Assisted Genome Assembly for
Closely Related Genomes

Krisztian Buza, Bartek Wilczynski, and Norbert Dojer


Faculty of Mathematics, Informatics and Mechanics (MIM), University of Warsaw, Banacha 2, 02-097 Warsaw, Poland

Correspondence should be addressed to Krisztian Buza; buza@biointelligence.hu

Received 18 March 2015; Revised 27 May 2015; Accepted 31 May 2015

Academic Editor: Chun-Yuan Lin

Copyright © 2015 Krisztian Buza et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background. Next-generation sequencing technologies are now producing multiple times the genome size in total reads from a single
experiment. This is enough information to reconstruct at least some of the differences between the individual genome studied in
the experiment and the reference genome of the species. However, in most typical protocols, this information is disregarded and
the reference genome is used. Results. We provide a new approach that allows researchers to reconstruct genomes very closely
related to the reference genome (e.g., mutants of the same species) directly from the reads used in the experiment. Our approach
applies de novo assembly software to experimental reads and so-called pseudoreads and uses the resulting contigs to generate a
modified reference sequence. In this way, it can very quickly, and at no additional sequencing cost, generate new, modified reference
sequence that is closer to the actual sequenced genome and has a full coverage. In this paper, we describe our approach and test its
implementation called RECORD. We evaluate RECORD on both simulated and real data. We made our software publicly available
on sourceforge. Conclusion. Our tests show that on closely related sequences RECORD outperforms more general assisted-assembly
software.

1. Background

The emergence of population genomic projects leads to an ever-growing need for software and methods that facilitate studying closely related organisms with next-generation sequencing technologies. This includes determination of the genomic sequences of individuals in the presence of the more generic reference genome of the species. This task is known as reference-assisted genome assembly, and many ongoing research projects depend on an accurate solution to this problem.

In recent years, next-generation sequencing technologies have brought us the possibility to simultaneously sequence millions of short DNA fragments in a DNA library prepared from almost any biochemical experiment [1]. Great improvement in the quality and amount of short reads obtained from a single experiment allowed for the development of many more biochemical assays [2] such as MNase-seq [3], DNase-seq [4], or ChIA-PET [5] in addition to the more standard ChIP-seq [6] or RNA-seq [7]. Similarly, the next-generation sequencing techniques may be applied to metagenomic samples returning short reads originating from multiple genomes including some potentially unknown species.

Importantly, many of these techniques require the prior knowledge of the reference genome of the species for which the experiment was performed. This genome sequence is used to map the reads and obtain the final readout of the experiment as the read counts per base pair. Such procedures are guaranteed to work very well only under the assumption that we know the exact sequence of the genome under study. There are, however, many biologically relevant cases when this assumption cannot be satisfied. For example, in quickly growing cell populations such as cancer cell lines or microbial colonies, even rare mutations can get fixed in the population very quickly. This leads to situations where sampled sequences can significantly differ from the original reference genome. Similarly, many lab experiments involve genetically modified cells or organisms. While these modifications are usually controlled as much as possible, the researchers frequently do not know the exact landing site of the introduced

sequence or the number of copies in which it was integrated into the host genome. This naturally has serious implications for the accuracy of the results because any difference between the reference genome and the sampled one will lead to differences in the expected number of reads mappable to the reference genome at the differing position. This in turn can interfere with the measurement of the real abundance of this DNA region in the sample.

This problem can be, at least theoretically, alleviated by introducing an additional step into the process: instead of directly mapping the reads to the reference genome, we can create a “modified” assembly of the genome based on the reads from the sample and the reference genome. Then, we can use this assembly to map the reads and measure their abundance. This approach can be broken down into two major steps:

(I) Assembling a genome of the sampled population based on the obtained reads and the reference genome.
(II) Assessing the abundance of reads in genomic regions using such an improved reference sequence.

In the early years of next-generation sequencing, the first step of such an approach was impractical, as the number of reads used for an experiment like ChIP-seq was far too low and their quality was not high enough to attempt assembly of a better reference genome than the one deposited in the databases by the relevant genome consortium. However, now it is commonplace that the total of sequencing reads generated for a single experiment such as ChIA-PET covers the genome multiple times and, at least in the case of model organisms such as D. melanogaster or C. elegans, the read lengths might be large enough to attempt an assembly.

This approach also has another limitation. If the reference sequence is very different from the one used in the experiment, it contributes more to the problem than to the solution. Any attempt to use a completely unrelated sequence as a reference in such an approach is bound to introduce errors. Therefore, in order to ensure that the output of the assembly is useful, when we provide a method of generating reference-assisted assemblies, it is crucial to validate that the reference is actually close enough to the target genome.

In this paper, we focus on developing an approach for reference-assisted genome assembly. We assume that the actual genome of the organism and the reference genome are close to each other; for example, the reference genome of the species under consideration is given, but not the genome of the particular mutant. We point out that currently used straightforward solutions produce suboptimal or, in some cases, even misleading results. For example, when simply assembling the genome from the given reads, due to the low coverage of those reads, we may obtain contigs that are too short, leading to an assembly useless in practical applications. Consequently, we need an assembly technique which fulfills the following criteria. First of all, it should output sequences that are long enough even in cases when the coverage of the genome sequence by the experimental reads is relatively low. Second, not only should the output sequences be large enough individually, but, together, they should cover as much as possible of the genome in order to allow detection of the abundance of reads in any region of the genome. Third, the result of the assembly should be accurate; that is, the assembled genome should be as close as possible to the actual genome of the studied organism. Last, but not least, we aim to provide a simple assembly approach. By simplicity, we mean both low computational time (in order to keep the entire process computationally tractable) and the clarity of the method needed for ease of reproducibility and reuse of the presented ideas in the context of different specific protocols. We believe the adaptation of the ideas presented in this paper may be straightforward in some applications, including RNA-Seq and ChIP-Seq. In other cases, it would be possible after substantial effort. For example, metagenomic sequencing might potentially benefit from some ideas presented in this paper. However, the currently presented approach would need to accommodate multiple genomes and reads originating from different, related species that may be present in the sample at the same time.

The growing interest in genome assembly is also reflected by recent publications. For example, Peng and Smith [8] studied genome assembly from the theoretical point of view and showed that various combinatorial problems related to genome assembly are NP-hard. On the other hand, various methods have been proposed for reference-assisted genome assembly, such as Amos [9], RACA [10], ARACHNE [11, 12], IMR/DENOM [13], RAGOUT [14], AlignGraph [15], and the pipeline of Gnerre et al. [16], which was developed inside the framework provided by ARACHNE. Similarly to our approach, Gnerre et al. used a de novo assembler as a component. They mapped reads to several reference genomes and used the resulting mapping information to improve the output of the de novo assembly in subsequent steps. In contrast to Gnerre et al., we only use one reference genome, and, more importantly, we use the reference genome to provide enriched input for the de novo assembler. Furthermore, we assume that the reference is closely related to the target genome, and therefore the reference is directly used to determine order and orientation of the assembly contigs. In contrast, RACA focused on reliable order and orientation of the contigs. Amos, one of the most popular assisted assembly tools, aligns reads to the reference genome and uses alignment and layout information to generate a new consensus sequence [9]. We note that the techniques presented in this paper are orthogonal to the ones used in the aforementioned works; that is, as future work, RECORD may be combined with other assisted assembly tools. In this paper, we focus on experimentally evaluating the power of the relatively simple techniques of our pipeline. We will show that, despite their simplicity, they may achieve surprisingly good results.

In the next section we describe our approach, a simple but surprisingly effective reference-assisted assembly technique, and the software that implements it. By design, this approach is most useful in cases when the reference and target genomes are closely related, and the coverage of the target genome by the experimental reads is relatively low, such as multiplexing scenarios where multiple experimental DNA libraries are barcoded and pooled in a single sequencing lane. Subsequently we present the results of the experimental evaluation

[Figure 1 graphic: the experimental reads and the reference genome (inputs), the pseudoreads (step 1), the assembly contigs (step 2), and the edited reference (step 3), connected by arrows.]

Figure 1: RECORD: Reference-Assisted Genome Assembly for Closely Related Genomes. The inputs of the pipeline, that is, the experimental
reads and the reference genome, are illustrated in the top left and top right of the figure, respectively. Intermediate results produced in various
steps of the analysis process are depicted. The dependency between these intermediate results is shown by arrows. In the illustration of the
3rd step, we underlined those segments of the edited reference which were replaced by one of the assembly contigs.

of RECORD and compare it to Amos [9], one of the most popular assisted assembly tools. We show that, under realistic conditions of approximately 1 percent divergence between reference genome and the studied sequence, our approach outperforms naive approaches and Amos (which excels in situations where the divergence is much higher). To ensure reproducibility and extensibility of our work, we evaluate our approach on several collections of publicly available next-generation sequencing data sets originating from various model organisms such as yeast (S. pombe), fruit fly (D. melanogaster), and plant (A. thaliana).

[Figure 2 graphic: an example reference sequence (Chr1) with pairs of pseudoreads drawn from it; the quantities 𝑛, 𝑚, and 𝑑 are marked on the sequence.]

Figure 2: Generation of pseudoreads from the reference genome.

2. Implementation

We propose RECORD, Reference-Assisted Genome Assembly for Closely Related Genomes. Our approach consists of the following steps (see Figure 1):

(1) We generate pseudoreads from the reference genome. We generate pseudoreads in order to ensure that the coverage of the genome is large enough.

(2) We obtain the contigs of the actual genome of the organism using a genome assembler, such as Velvet [17]. As input of the assembler, we propose to use the pseudoreads generated in the previous step together with the experimental reads.

(3) We create an edited reference genome. The contigs obtained in the previous step may not cover the actual genome of the organism entirely, and, more importantly, the genome obtained in the previous steps may be fragmented into a relatively large number of contigs. Therefore, the contigs obtained in the previous step will be mapped to the reference genome with MUMmer [18]. Using the reference genome and the mapped contigs, we produce a new genome, called edited reference, the segments of which are replaced according to the mapping. This step ensures that the edited reference is close to the true genome of the organism, while it covers as many regions of the genome as possible.

Below we give a detailed description of the above steps.

2.1. Generation of Pseudoreads from the Reference. While generating pseudoreads from the reference, we make sure that these pseudoreads have uniform coverage and large enough overlaps so that they can “assist” the genome assembler while it joins reads to contigs. In particular, we generate reads of length 𝑚 from each chromosome beginning at positions 0, 𝑛, 2⋅𝑛, ..., 𝑘⋅𝑛, ..., where 𝑚 and 𝑛 are parameters that can be set by the user. We generate paired-end reads; the first mate of the paired-end reads is generated directly from the reference, while the second mate is generated from its reverse complement, so that the resulting data has a similar character to the paired-end reads in NGS experiments. The distance between the ends of the mates of the paired-end reads is 𝑑. This is illustrated in Figure 2.

Additionally, we associate each position of these pseudoreads with a relatively low quality score 𝑞 in order to ensure that real reads have higher priority during the assembly process.

Unless stated otherwise, we set 𝑚 = 100, 𝑛 = 30, 𝑑 = 1000, and 𝑞 = 10. The quality score is on the Phred scale from 0 to 93. We store the pseudoreads together with the quality scores as FastQ files [19] so that they can be used as input for the genome assembler Velvet.
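The pseudoread generation described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the RECORD implementation: the function names are hypothetical, 𝑑 is assumed to be the outer distance between the two mates (from the start of the first mate to the end of the second), pairs that would run past the end of a chromosome are assumed to be skipped, and the quality string is assumed to use the Phred+33 FastQ encoding.

```python
from typing import Iterator, Tuple

# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def pseudoread_pairs(
    chrom: str, m: int = 100, n: int = 30, d: int = 1000
) -> Iterator[Tuple[int, str, str]]:
    """Yield (start, mate1, mate2) for pseudoread pairs of one chromosome.

    Pairs start at positions 0, n, 2n, ...; mate1 is read directly from
    the reference, mate2 is the reverse complement of the last m bases
    of the d-long fragment (assumption: d is the outer distance).
    """
    if m <= 0 or n <= 0 or d < m:
        raise ValueError("require m > 0, n > 0 and d >= m")
    # Assumption: fragments that do not fit into the chromosome are skipped.
    for start in range(0, len(chrom) - d + 1, n):
        fragment = chrom[start:start + d]
        yield start, fragment[:m], reverse_complement(fragment[-m:])


def fastq_record(name: str, seq: str, q: int = 10) -> str:
    """Format one FastQ record with a constant Phred quality q (Phred+33)."""
    if not 0 <= q <= 93:
        raise ValueError("Phred score must be between 0 and 93")
    return f"@{name}\n{seq}\n+\n{chr(q + 33) * len(seq)}\n"
```

With the defaults reported above (𝑚 = 100, 𝑛 = 30), consecutive pseudoreads overlap by 70 bases, which is what lets them bridge regions that the experimental reads cover only sparsely.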
Figure 3: Resolution of ambiguity. First, for each contig, its best mapping is determined, and then the remaining ambiguity is resolved in greedy fashion by giving priority to the beginning of the contigs as shown in the figure.

2.2. Assisted Assembly. The second step of our approach generates the assisted assembly contigs. To this aim, we combine the pseudoreads generated in the previous step and the experimental reads in one data set. Next, this data set is used as input for a genome assembler. In principle, any assembler can be applied, but we use Velvet with its default parameters. However, the user may set the values of the parameters according to his or her needs.

2.3. Editing the Reference. While editing the reference based on the alignment of the contigs produced by MUMmer, we have to take into account that contigs may be mapped ambiguously to the reference; that is, the same contig may be mapped to several segments of the reference. Moreover, the regions covered by different contigs may overlap and therefore some segments of the genome may be covered by several contigs. We resolve this ambiguity in two steps.

First, for each contig, we search for its best mapping to the reference. Conceptually, we can measure the quality of a mapping by the number of identical bases between the contig and the corresponding segment of the reference. This is estimated as

𝑄map = 𝐿 × idy(ref),    (1)

where 𝐿 denotes the length of the mapped segment of the contig and idy(ref) is the percentile identity between the contig and the corresponding reference segment as reported by MUMmer. For each contig, out of its several mappings, we select the one that has the highest 𝑄map score.

Even though there is no theoretical guarantee that a particular contig corresponds to that segment of the genome to which it was mapped with the highest 𝑄map score, we argue that, on one hand, the higher the identity is, the more likely it is that the mapping is correct (i.e., the contig really originates from that segment of the genome to which it is mapped); on the other hand, the longer the mapped subsequence of the contig is, the more likely it is that the mapping is correct. Therefore, the higher the above quality score is, the higher the likelihood of a correct mapping is. Thus, we select for each contig the segment of the genome that has the highest 𝑄map score.

As one can see in Figures 5 and 7, the ratio of ambiguously mapped contigs varies between 5% and 12% in most of our experiments. An exception is the case of A. thaliana, for which the proportion of ambiguously mapped contigs is between 30% and 40%. After selecting the best mapping for each contig, the remaining ambiguity may only arise from overlapping contigs as illustrated in Figure 3. According to our observations, selecting the best mapping for each contig greatly reduces the number of those genomic positions that are covered by multiple contigs. In particular, for both A. thaliana and D. melanogaster, the selection of the best mapping of each contig reduced the number of multiply covered genomic positions by ≈90%. Furthermore, the overlapping segments of two contigs typically contain exactly the same or very similar genomic sequences. Therefore, the selection of the best mapping for each contig is able to eliminate the vast majority of the ambiguity. In the light of these observations, Figure 3 shows an exceptional situation, in which two contigs overlap and the overlapping segments correspond to notably different genomic sequences. Despite the fact that such situations are exceptionally rare, in order to produce the edited reference, such ambiguity must be resolved. One possibility to resolve such ambiguity is to use the aforementioned quality scores and to prioritize the contig with the higher 𝑄map score. In our prototypical implementation of the pipeline, we used an even simpler method: we resolved the ambiguity remaining after the selection of the best mapping in a greedy fashion by preferring the beginning of the contigs to the ends of the contigs as illustrated in Figure 3.

After resolving the ambiguity, the edited reference is produced by replacing the segments of the reference by the mapped contigs (or their segments).

2.4. Software. We implemented RECORD in the Perl and Java programming languages. The main program is implemented in Perl. It calls Velvet and the modules for the generation of pseudoreads and for reference editing.

3. Results and Discussion

Our approach does not aim to reproduce the reference genome (which is used as input anyway), but we aim to recover the true genome of the organism, which is unknown in case of real experiments. Consequently, the evaluation of any assembly software is inherently difficult. Therefore, in the following sections, we present evaluation on both simulated and real data. In case of simulated data, a gold standard is

available, while the experiments on real data will show that our approach may be useful in real applications.

Next, we present the results of the experimental evaluation of our approach.

3.1. Baselines. In the experiments presented in the subsequent sections, we used two genome assemblers, Velvet [17] and Amos [9], as baselines. Velvet is a de novo genome assembler; that is, it assembles the genome directly from the experimental reads, whereas Amos is one of the most popular assisted genome assembly software tools; that is, Amos uses both the experimental reads and the reference genome of a genetically related organism in order to reconstruct the genome of the studied organism. Throughout the description of the experiments, by Velvet we refer to using Velvet as a standalone application, even though our approach, referred to as RECORD, uses Velvet by default as a component of the proposed pipeline.

We also tried to use further assisted genome assemblers, such as ARACHNE [11, 12] and IMR/DENOM [13]. While these tools may excel in various general settings (such as using the reference genome of a species to reconstruct the genome of another species), as far as we can judge, they do not seem to fit our special setting of relatively low coverage (i.e., few experimental reads) and very closely related genomes. For example, in some cases, the output genome was the reference genome, which, on one hand, may be considered reasonable if the actual genome and the reference genome are highly similar (i.e., they are almost the same); on the other hand, this is a trivial solution for the assisted assembly problem as the reference is one of the inputs of reference-assisted assembly methods.

3.2. Evaluation on Simulated Data. We simulate the scenario that the reference genome is given and we aim to reconstruct the actual genome of the studied organism, which we call target genome. In particular, we used the Evolver software tool [20] to generate the target genome. We used the genome from the example that comes with Evolver. This is an artificial mammalian genome with a size of 30 megabases (Mb). The genome has three chromosomes. In order to allow for an unbiased evaluation, we produced the evolved genome following the example attached with Evolver. We used the original genome, that is, the ancestral genome, as the reference genome, and we considered the evolved genome as the target genome. We generated one million paired-end short reads of length 70 with wgsim [21] from the target genome. Subsequently, we tried to reconstruct the target genome from the generated paired-end reads and the reference genome both with our approach and with two other state-of-the-art genome assemblers. Throughout the experiments on simulated data, we used Velvet with a 𝑘-mer size of 𝑘 = 21. Finally, we compared the outputs of the assemblers with the target genome and quantitatively measured the quality of each of the assemblers according to the following criteria:

(1) TL, the total length of the assembly in Mb.
(2) N50; that is, we consider the set of largest contigs that together cover at least 50% of the assembly, and out of these contigs the length of the shortest one is denoted as N50.
(3) Error = 100% − IDY, where IDY is the percentile identity between the target genome and the genome reconstructed by the assembler. (Please note that IDY is different from idy(ref). While idy(ref) denotes the identity between an assembly contig and the corresponding segment of the reference genome, we use IDY to denote the identity between the output of the assembly and the target genome.) In order to calculate IDY, we map the genome reconstructed by the assembler to the target genome using the MUMmer software tool [18], and we calculate the weighted average of the percentile identities between the mapped segments and the target genome as reported by MUMmer. In the weighted average, we use the lengths of the mapped segments as weights.
(4) Number of identical bases, which we calculated as IDY × TL.

For both our approach and the baselines, we evaluated both the contigs and the edited reference resulting from using the contigs. To evaluate the edited reference for the baselines, we simply used the contigs output by the baselines in the third step of our approach and produced the edited reference.

Table 1 summarizes our results. The columns of the table show the total length (TL) of the assembly, N50, error, and the number of identical bases. As one can see, our approach, RECORD, is competitive with the other assemblers: the contigs produced by our pipeline have the highest N50 and the lowest error rate, while the edited reference produced by RECORD has the overall highest number of identical bases with the target genome.

Table 1: Evaluation on simulated data.

Assembly            TL (Mb)   N50       Error (in %)   Id. Bases (Mb)
Contigs
  Velvet            18.20     213 b     0.85           18.05
  Amos              28.82     1834 b    2.09           28.22
  RECORD            25.81     2055 b    0.41           25.70
Edited reference
  Velvet            30.00     10 Mb     1.39           29.58
  Amos              30.00     10 Mb     1.03           29.69
  RECORD            30.00     10 Mb     0.59           29.82

In a subsequent experiment, we varied the number of reads used for the assembly and evaluated the resulting contigs. These results are shown in Figure 4. The diagram (a) shows the number of bases in the target genome that are covered by the assembly contigs as a function of the number of reads that were used. It is important to note that while Amos can provide overall better coverage of the sequence, it requires more reads (>500 k) for that. In the lower range of the number of reads available, it is outperformed by RECORD. It may be

[Figure 4 graphic: three line plots, each comparing Velvet, Amos, and RECORD. Horizontal axis in each panel: number of reads (thousands), from 100 to 1000. Vertical axes: (a) number of covered bases (millions), from 0 to 35; (b) accuracy (%), from 96 to 100; (c) Cov50∗, from 0 to 90000.]

Figure 4: Comparison of the proposed approach (RECORD) with two state-of-the-art genome assemblers on data simulated with wgsim. In
this experiment, we consider the evolved genome produced by Evolver as the target genome; the reference genome is the ancestral genome.
The diagrams show the performance of the examined approaches according to various criteria as the function of the number of simulated
reads that were used for the assembly. The diagram (a) shows the number of covered bases of the target genome; the diagram (b) shows the
accuracy, that is, overall percentile identity between the assembly contigs and the corresponding segments of the target genome, while the
diagram (c) shows the number of those largest contigs that together cover at least 50% of the target genome.
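The quantities plotted here and reported in Section 3.2 can be illustrated with a short Python sketch. It is an assumption-laden illustration rather than the evaluation code used for the paper: the function names are hypothetical, contig and segment lengths are assumed to be already extracted from the assembler and MUMmer output, and overlaps between contigs are ignored when counting how many contigs are needed to cover half of the target genome.

```python
from typing import Iterable, Optional, Tuple


def n50(contig_lengths: Iterable[int]) -> int:
    """Length of the shortest contig among the largest contigs that
    together cover at least 50% of the total assembly length."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    return 0  # empty assembly


def contigs_to_cover(
    contig_lengths: Iterable[int], target_size: int, fraction: float = 0.5
) -> Optional[int]:
    """Number of largest contigs whose summed length reaches the given
    fraction of the target genome, or None if they never reach it.
    Assumption: contigs do not overlap on the target genome."""
    running = 0
    ordered = sorted(contig_lengths, reverse=True)
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= fraction * target_size:
            return count
    return None


def weighted_identity(segments: Iterable[Tuple[int, float]]) -> float:
    """Length-weighted average identity (in percent) over mapped
    segments given as (segment_length, identity_percent) pairs."""
    segs = list(segments)
    total = sum(length for length, _ in segs)
    if total == 0:
        raise ValueError("no mapped segments")
    return sum(length * idy for length, idy in segs) / total
```

For example, two mapped segments of 100 and 300 bases with identities of 99% and 97% give a weighted identity of 97.5%, that is, an error of 2.5% in the sense of Section 3.2.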

relevant for practical applications as the cost of the experiment usually depends on the number of reads produced. While in this simulated case the number of reads is relatively low for today's NGS technology standards, it might still be relevant in multiplexing scenarios where multiple experimental DNA libraries are barcoded and pooled in a single sequencing lane.

The second diagram (b) shows the overall percentile identity between the target genome and the contigs of the assembly. The third diagram (c) shows the number of those largest contigs that together cover at least 50% of the target genome. As one can see, if only relatively few reads are available, our approach, RECORD, systematically outperforms the baselines by producing larger contigs and the most accurate and most complete assembly.

[Figure 5 graphic: proportion of ambiguously mapped contigs (vertical axis, from 0.05 to 0.09) against the number of reads in thousands (horizontal axis, from 100 to 900).]

Figure 5: Proportion of ambiguously mapped contigs (before the selection of the best mapping for each contig) in case of various numbers of simulated reads.

In order to analyze our approach in more detail, we show in Figure 5 the proportion of ambiguously mapped contigs (before the selection of the best mapping for each contig). As one can see, the proportion of ambiguously mapped contigs varies between 7% and 8.5%.

We note that, from the point of view of applications, there is a substantial difference between the execution times of RECORD and Amos. For example, when using 300 thousand reads, producing the edited reference took approximately 1 hour for our approach, whereas it took 16 hours for Amos. We emphasize that this observation refers to the practical application of the software but not to the overall (theoretical) computational costs: much of the observed difference may be attributed to the fact that Velvet, which is used by default as the assembler in the proposed pipeline, is able to run in parallel on multiple cores, whereas Amos can be used on one core at a time. Due to the fact that RECORD uses a de novo assembler as a component of the proposed pipeline, our approach is limited to middle-sized genomes that are closely related to the reference genome; therefore it is currently not applicable to the human and comparable genomes.

3.3. Evaluation on Real Data. The primary goal of the evaluation on real data was to show that our approach can be applied in real experiments.

3.3.1. Assessment of the Accuracy in Comparison to the Baseline. As mentioned previously, in real-world settings, there is usually no gold standard available. Therefore, the assessment of the accuracy of the genome produced by any assembler is inherently difficult. For this reason, in the subsequent experiment, we evaluate the accuracy of the proposed method on real data indirectly. In particular, we assess the quality of the contigs and, more importantly, we compare our approach to the baselines in the following setting: we examine how well

[Figure 6 graphic: three line plots, each comparing Velvet, Amos, and RECORD. Horizontal axis in each panel: number of reads (millions), from 1 to 4. Vertical axes: (a) number of covered bases (millions), from 0 to 14; (b) accuracy (%), from 97 to 100; (c) Cov50∗, from 0 to 40000.]

Figure 6: Comparison of the proposed approach (RECORD) with two state-of-the-art genome assemblers on real data. In this experiment,
we compared assemblies resulting from various number of experimental reads to the assembly which is produced by Amos using all the
experimental reads; that is, the target genome is the assembly produced by Amos using all the reads. In this case, the reference genome
exhibits 99.7 percent identity with the result of Amos which is used as the gold standard. The diagrams follow the same structure as the one
in Figure 4.

we can reconstruct the genome using relatively small subsets of all the available reads. These subsets are uniform random samples taken from the set of all the reads: each read has the same probability of being included in the sample. Paired-end reads are sampled together with their mates; that is, either both sequences corresponding to a particular paired-end read are selected or neither of them is selected.
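The sampling scheme described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name is hypothetical, the read pairs are assumed to be already loaded into memory as (mate 1, mate 2) tuples, and a fixed sample size is drawn without replacement, which gives every pair the same probability of being included.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

Read = TypeVar("Read")


def subsample_pairs(
    pairs: Sequence[Tuple[Read, Read]], k: int, seed: int = 0
) -> List[Tuple[Read, Read]]:
    """Draw k read pairs uniformly at random without replacement.

    Sampling whole pairs guarantees that either both mates of a
    paired-end read are selected or neither is.
    """
    if not 0 <= k <= len(pairs):
        raise ValueError("k must be between 0 and the number of pairs")
    rng = random.Random(seed)  # fixed seed for reproducible subsets
    return rng.sample(pairs, k)
```

Calling the function with increasing values of `k` on the same set of pairs yields subsets of different sizes, such as the one to four million reads shown in Figure 6.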
In the aforementioned context, as gold standard, that is, target genome, we consider the genome produced by Amos when using all the reads for the assembly. We note that this leads to an evaluation in which Amos has an inherent advantage over our approach, as unfortunately we cannot have an unbiased reference.

We used real-world experimental reads graciously provided by Dr. Andrzej Dziembowski's group, coming from an unpublished ChIP-seq experiment in a yeast species. The data contained approximately 4.5 million paired-end short reads of length 100.

Figure 6 shows the results. The diagrams follow the same structure as the ones presented at the end of Section 3.2; that is, the diagram (a) shows the number of bases in the target genome that are covered by the assembly contigs as a function of the number of reads that were used. The second diagram (b) shows the accuracy, that is, the overall percentile identity between the target genome and the contigs of the assembly. The third diagram (c) shows the number of those largest contigs that together cover at least 50% of the target genome. In all the three diagrams, the horizontal axis shows the size of the

[Figure 7 graphic: proportion of ambiguously mapped contigs (vertical axis, from 0 to 0.5) for the data sets A.1 to A.3, D.1 to D.3, and P.1 to P.7.]

Figure 7: Proportion of ambiguously mapped contigs (before the selection of the best mapping for each contig) in case of experiments on publicly available data sets.

3.3.2. Characteristics of the Assembly of Publicly Available Data Sets. In order to assist reproducibility of our results, we used publicly available real short read data from the NCBI Short Read Archive. We used data originating from three different species: plant (A. thaliana), fly (D. melanogaster), and yeast (S. pombe). The identifiers of the short read collections are shown in the third column of Tables 2 and 3.

We set the 𝑘-mer size for the assembly, that is, the second step of our approach, in accordance with the length of the short reads in the archive and the (approximate) size of the target genome. In particular, similarly to the previous experiments,
sample (i.e., the number of paired-end reads) used to assem- we set 𝑘 = 21 for yeast (short read length = 44), while we used
ble the genome. As one can see, our approach, RECORD, slightly larger settings for the other two species: we set 𝑘 = 45
systematically outperforms the baselines in terms of accuracy in case of flower (short read length = 80) and 𝑘 = 25 for fly
and coverage of the genome. Note that, in case of using very (short read length = 36).
few reads, Velvet achieves as good accuracy as our approach; Tables 2 and 3 show the most important characteristics
however, the contigs it produces have very low coverage. of the resulting assembly contigs and the edited reference.
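The k-mer settings listed above can be collected in a small configuration table and sanity-checked before assembly. The sketch below is a hypothetical helper and not part of RECORD: the names `KMER_SETTINGS`, `check_kmer_setting`, and `kmers_per_read` are assumptions made for illustration. It encodes only the values stated in the text and checks two constraints: Velvet expects an odd k, and k must be smaller than the read length for a read to contribute any k-mers.

```python
# Hypothetical configuration: k-mer sizes and read lengths as stated in the text.
KMER_SETTINGS = {
    "S. pombe": {"k": 21, "read_length": 44},
    "A. thaliana": {"k": 45, "read_length": 80},
    "D. melanogaster": {"k": 25, "read_length": 36},
}


def check_kmer_setting(k: int, read_length: int) -> None:
    """Raise ValueError if k cannot sensibly be used with reads of this length."""
    if k % 2 == 0:
        raise ValueError(f"k={k} is even; Velvet expects an odd k-mer size")
    if k >= read_length:
        raise ValueError(f"k={k} must be smaller than the read length {read_length}")


def kmers_per_read(k: int, read_length: int) -> int:
    """Number of overlapping k-mers that a single read contributes."""
    return read_length - k + 1


if __name__ == "__main__":
    for species, cfg in KMER_SETTINGS.items():
        check_kmer_setting(cfg["k"], cfg["read_length"])
        print(species, cfg["k"], kmers_per_read(cfg["k"], cfg["read_length"]))
```

One consequence visible in this sketch is that the fly reads, being the shortest, contribute the fewest k-mers per read, which is consistent with the difficulties reported for D. melanogaster later in this section.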

Table 2: Assembly of real experimental reads (contigs).

Species          Number  Experimental reads      Length (Mb)  # ctgs  N50    Cov
A. thaliana      A.1     [SRR402840, SRR402839]  113.2        250148  19464  39.6x
A. thaliana      A.2     [SRR402842, SRR402841]  112.4        296438  18916  24.4x
A. thaliana      A.3     [SRR402844, SRR402843]  113.1        243375  20494  30.5x
D. melanogaster  D.1     [SRR066834, SRR066831]  122.6        460842  58648  2.6x
D. melanogaster  D.2     [SRR066835, SRR066832]  122.6        461044  58535  2.1x
D. melanogaster  D.3     [SRR066836, SRR066833]  122.6        460200  58415  2.4x
S. pombe         P.1     [SRR948260, SRR948250]  15.2         10384   8934   32.3x
S. pombe         P.2     [SRR948261, SRR948251]  12.2         8129    11132  18.3x
S. pombe         P.3     [SRR948262, SRR948252]  12.1         8571    6696   26.5x
S. pombe         P.4     [SRR948266, SRR948272]  12.1         7589    10144  21.5x
S. pombe         P.5     [SRR948267, SRR948273]  12.0         9214    4927   31.6x
S. pombe         P.6     [SRR948268, SRR948274]  12.0         8964    5696   28.1x
S. pombe         P.7     [SRR948269, SRR948275]  12.1         8269    8918   27.4x
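As a rough illustration of two of the summary statistics reported in Table 2, the following sketch computes the read coverage according to equation (2) and the N50 of a list of contig lengths. The function names and the toy inputs are assumptions made for this example and do not come from the RECORD scripts. The N50 definition used here is the common one, which the text does not spell out: the largest length L such that contigs of length at least L together contain at least half of the assembled bases.

```python
from typing import Iterable


def read_coverage(read_length: int, number_of_reads: int, genome_size: int) -> float:
    """Coverage as in equation (2): read length x number of reads / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return read_length * number_of_reads / genome_size


def n50(contig_lengths: Iterable[int]) -> int:
    """Largest length L such that contigs of length >= L hold half of all bases."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return lengths[-1]  # not reached for non-empty input; kept for completeness


if __name__ == "__main__":
    # Toy numbers, chosen only to make the arithmetic easy to follow.
    print(read_coverage(read_length=100, number_of_reads=3_000_000,
                        genome_size=12_000_000))
    print(n50([10, 20, 30, 40]))
```

With the toy inputs, 3 million reads of length 100 over a 12 Mb genome give a coverage of 25, and the contig lengths 10, 20, 30, 40 give an N50 of 30, since the two longest contigs already hold 70 of the 100 bases.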

Table 3: Assembly of real experimental reads (edited reference).

Species          Number  ed.len. (Mb)  % ref  % asm  # ctgs  % IDY
A. thaliana      A.1     109.9         91.8   97.3   51769   99.967
A. thaliana      A.2     106.4         89.0   95.5   58052   99.918
A. thaliana      A.3     109.8         91.8   97.3   54185   99.972
D. melanogaster  D.1     117.4         82.1   95.8   50598   99.985
D. melanogaster  D.2     117.0         81.8   95.4   50606   99.986
D. melanogaster  D.3     117.2         82.0   95.6   50630   99.986
S. pombe         P.1     12.0          95.1   78.9   3548    99.995
S. pombe         P.2     12.0          95.2   98.4   2931    99.994
S. pombe         P.3     12.0          95.0   99.2   4249    99.996
S. pombe         P.4     12.0          95.3   99.2   3054    99.994
S. pombe         P.5     11.9          94.6   99.2   5287    99.996
S. pombe         P.6     12.0          94.9   100.0  4762    99.996
S. pombe         P.7     12.0          95.3   99.2   3465    99.994

In particular, the fourth column of Table 2 shows the total length of the assembly contigs; the fifth column shows the number of all the contigs, while the sixth column shows the N50 of the contigs. The last column shows the coverage of experimental reads, calculated as follows:

\[ \mathrm{Cov} = \frac{\text{read length} \times \text{number of reads}}{\text{genome size}}. \tag{2} \]

The third column of Table 3 shows the total length of the replaced segments, while the fourth and fifth columns show, in percent, the length of the replaced segments relative to the length of the reference and to the total length of the assembly contigs, denoted as % ref and % asm, respectively. The sixth column of Table 3 shows the number of contigs that were used while editing the reference. The last column of Table 3 shows the overall percent identity between the edited reference and the original reference.

Figure 7 shows the proportion of ambiguously mapped contigs (before the selection of the best mapping for each contig) for each experiment shown in this section. As one can see, the proportion of ambiguously mapped contigs varies between 4% and 8% in the case of S. pombe; it is around 10% in the case of D. melanogaster; and it is remarkably higher, around 35%, for A. thaliana.

The results in Table 3 show that the total length of the assembly is close to the genome size, indicating the completeness of the assembly. However, the relatively large number of contigs in the raw assembly output can be seen as an indication that the assembler had difficulties with particular regions of the genome, and therefore a large number of short fragments may have been produced. This is especially visible in the case of D. melanogaster, where two factors influencing the quality of assembly are combined: low read coverage and low read length.

According to the proposed procedure of editing the reference, a contig may be left out if it cannot be mapped to the reference or if MUMmer considers it too short to produce a useful alignment. As we can see, the number of contigs contributing to the edited reference is substantially less than the total number of contigs. However, in terms of length, almost the entire assembly is used; for example, in each of the D. melanogaster data sets the edited assembly utilizes ∼11% of all contigs, covering over 95% of the assisted assembly. This shows that reference editing relies on a moderate number of long contigs rather than on a bulk of short ones.

The edited part of the genome in A. thaliana is slightly smaller than in D. melanogaster, but the number of contributing contigs is slightly larger. Therefore, contigs obtained for the former organism are generally shorter than those obtained for the latter (this is also in accordance with the observation that N50 of A. thaliana is ∼3× lower than N50 of D. melanogaster). Shorter contigs are more likely to be nonuniquely mapped, as we can observe in Figure 7; the proportion of ambiguously mapped contigs is similar to the proportion obtained in simulated data for S. pombe and D. melanogaster, while it is remarkably higher for A. thaliana.

Percentages of replaced segments in genome editing (% ref and % asm) are also similar to those observed in simulated data for two of our species (A. thaliana and S. pombe), while

they are slightly lower for D. melanogaster. This behavior is explained by the difference in coverage, which in D. melanogaster is an order of magnitude lower than in the two other species. The results indicate that the output genomes are closely related to the reference. This is expected, since the genomes of individuals are close to the reference genome of the respective species.

Overall, the results on real-world data are similar to those on simulated data (in some respects, e.g., N50, even better). Visibly more variability is observed between results on real data sets with different characteristics: read length, coverage, and so forth.

4. Conclusions

In this paper, we proposed a new approach for reference-assisted assembly of closely related genomes. Our approach takes into account that the actual genome of the studied organism may be slightly different from the reference genome of that species, leading to potentially fewer errors in downstream analyses of the sequenced read abundances.

We have assessed the performance of our method on an artificially simulated mutated eukaryotic genome, showing that RECORD produces contigs with a very low error rate (less than 0.5 percent) and that, after merging them with the original assembly, it leads to error rates smaller than those of a simpler de novo assembly technique (Velvet) as well as a more general assisted assembly approach (Amos).

Further examination of the results in comparison to Amos and plain Velvet indicated that our approach is most useful when relatively few reads are at our disposal; both of the competing tools struggled with the data sets where the number of reads was low.

The same seems to be true in the case of the real data set that we analyzed in Section 3.3.1. Even though the numbers of reads are much higher, we can still see the difference between our method and the more traditional approaches. Even though the genome size is small, we can see that RECORD shows clearly superior accuracy with up to 3 million reads, and all measures are clearly better at approximately 1 million reads.

Finally, we applied RECORD to more than 10 publicly available data sets from the NCBI Short Read Archive to show its applicability in practical situations. We can see in all cases that not only is RECORD able to produce results for much larger genomes (up to 140 Mb) but also the estimated divergence between the examined genome and the reference is close to one percent, where we can expect RECORD to perform better than its examined alternatives.

We provide a prototype implementation of this approach as a set of scripts. It is available for download at our supplementary website together with most of the data published in the study, allowing readers to replicate our results and adapt the method for specific applications.

Availability and Requirements

Project name: RECORD Genome Assembler
Project home page: http://sourceforge.net/projects/record-genome-assembler/
Operating system(s): Linux
Programming language: Perl, Java
Other requirements: Velvet, MUMmer
License: Open Source
Any restrictions to use by nonacademics: no.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Authors' Contribution

The idea of using assisted assembly in the context presented in the paper originates from Norbert Dojer. Krisztian Buza implemented RECORD, performed the experiments, and wrote the first draft of the paper. Bartek Wilczyński supervised the implementation of RECORD and the experiments. All three authors contributed to the discussions around the paper, proofread the paper, and contributed to the final text of the paper.

Acknowledgments

This work was partially supported by a research grant (Grant no. ERA-NET-NEURON/10/2013) from the Polish National Centre for Research and Development and by the Polish Ministry of Science and Education (Grant no. [N N519 6562 740]). Krisztian Buza acknowledges the Warsaw Center of Mathematics and Computer Science (WCMCS) for funding his position. The authors are thankful to Andrzej Dziembowski and Aleksandra Siwaszek from the Institute of Biochemistry and Biophysics, Polish Academy of Sciences, for providing some of their unpublished ChIP-Seq input data from S. pombe, which the authors used in their evaluation.