Research Article
RECORD: Reference-Assisted Genome Assembly for
Closely Related Genomes
Copyright © 2015 Krisztian Buza et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Background. Next-generation sequencing technologies are now producing multiple times the genome size in total reads from a single
experiment. This is enough information to reconstruct at least some of the differences between the individual genome studied in
the experiment and the reference genome of the species. However, in most typical protocols, this information is disregarded and
the reference genome is used. Results. We provide a new approach that allows researchers to reconstruct genomes very closely
related to the reference genome (e.g., mutants of the same species) directly from the reads used in the experiment. Our approach
applies de novo assembly software to experimental reads and so-called pseudoreads and uses the resulting contigs to generate a
modified reference sequence. In this way, it can very quickly, and at no additional sequencing cost, generate a new, modified reference
sequence that is closer to the actual sequenced genome and has full coverage. In this paper, we describe our approach and test its
implementation called RECORD. We evaluate RECORD on both simulated and real data. We made our software publicly available
on SourceForge. Conclusion. Our tests show that on closely related sequences RECORD outperforms more general assisted-assembly
software.
sequence or the number of copies in which it was integrated into the host genome. This naturally has serious implications for the accuracy of the results because any difference between the reference genome and the sampled one will lead to differences in the expected number of reads mappable to the reference genome at the differing position. This in turn can interfere with the measurement of the real abundance of this DNA region in the sample.
This problem can be, at least theoretically, alleviated by introducing an additional step into the process: instead of directly mapping the reads to the reference genome, we can create a “modified” assembly of the genome based on the reads from the sample and the reference genome. Then, we can use this assembly to map the reads and measure their abundance. This approach can be broken down into two major steps:
(I) Assembling a genome of the sampled population based on the obtained reads and the reference genome.
(II) Assessing the abundance of reads in genomic regions using such an improved reference sequence.
In the early years of next-generation sequencing, the first step of such an approach was impractical, as the number of reads used for an experiment like ChIP-seq was far too low and their quality was not high enough to attempt assembly of a better reference genome than the one deposited in the databases by the relevant genome consortium. However, it is now commonplace that the sequencing reads generated for a single experiment such as ChIA-PET together cover the genome multiple times and, at least in the case of model organisms such as D. melanogaster or C. elegans, the read lengths may be large enough to attempt an assembly.
This approach also has another limitation. If the reference sequence is very different from the one used in the experiment, it contributes more to the problem than to the solution. Any attempt to use a completely unrelated sequence as a reference in such an approach is bound to introduce errors. Therefore, in order to ensure that the output of the assembly is useful, when we provide a method of generating reference-assisted assemblies, it is crucial to validate that the reference is actually close enough to the target genome.
In this paper, we focus on developing an approach for reference-assisted genome assembly. We assume that the actual genome of the organism and the reference genome are close to each other; for example, the reference genome of the species under consideration is given, but not the genome of the particular mutant. We point out that currently used straightforward solutions produce suboptimal or, in some cases, even misleading results. For example, when simply assembling the genome from the given reads, due to the low coverage of those reads, we may obtain contigs that are too short, leading to an assembly that is useless in practical applications. Consequently, we need an assembly technique which fulfills the following criteria. First of all, it should output sequences that are long enough even in cases when the coverage of the genome sequence by the experimental reads is relatively low. Second, not only should the output sequences be large enough individually, but, together, they should cover as much as possible of the genome in order to allow detection of the abundance of reads in any region of the genome. Third, the result of the assembly should be accurate; that is, the assembled genome should be as close as possible to the actual genome of the studied organism. Last, but not least, we aim to provide a simple assembly approach. By simplicity, we mean both short computational time (in order to keep the entire process computationally tractable) and the clarity of the method needed for ease of reproducibility and reuse of the presented ideas in the context of different specific protocols. We believe the adaptation of the ideas presented in this paper may be straightforward in some applications, including RNA-Seq and ChIP-Seq. In other cases, it would be possible after substantial effort. For example, metagenomic sequencing might potentially benefit from some ideas presented in this paper. However, the currently presented approach would need to accommodate multiple genomes and reads originating from different, related species that may be present in the sample at the same time.
The growing interest in genome assembly is also reflected by recent publications. For example, Peng and Smith [8] studied genome assembly from the theoretical point of view and showed that various combinatorial problems related to genome assembly are NP-hard. On the other hand, various methods have been proposed for reference-assisted genome assembly, such as Amos [9], RACA [10], ARACHNE [11, 12], IMR/DENOM [13], RAGOUT [14], AlignGraph [15], and the pipeline of Gnerre et al. [16], which was developed inside the framework provided by ARACHNE. Similarly to our approach, Gnerre et al. used a de novo assembler as a component. They mapped reads to several reference genomes and used the resulting mapping information to improve the output of the de novo assembly in subsequent steps. In contrast to Gnerre et al., we only use one reference genome, and, more importantly, we use the reference genome to provide enriched input for the de novo assembler. Furthermore, we assume that the reference is closely related to the target genome, and therefore the reference is directly used to determine order and orientation of the assembly contigs. In contrast, RACA focused on reliable order and orientation of the contigs. Amos, one of the most popular assisted assembly software tools, aligns reads to the reference genome and uses alignment and layout information to generate a new consensus sequence [9]. We note that the techniques presented in this paper are orthogonal to the ones used in the aforementioned works; that is, as future work, RECORD may be combined with other assisted assembly tools. In this paper, we focus on experimentally evaluating the power of the relatively simple techniques of our pipeline. We will show that, despite their simplicity, they may achieve surprisingly good results.
In the next section we describe our approach, a simple but surprisingly effective reference-assisted assembly technique, and the software that implements it. By design, this approach is most useful in cases when the reference and target genomes are closely related, and the coverage of the target genome by the experimental reads is relatively low, such as multiplexing scenarios where multiple experimental DNA libraries are barcoded and pooled in a single sequencing lane. Subsequently we present the results of the experimental evaluation
International Journal of Genomics 3
[Figure 1, surviving panels. Assembly contigs (step 2): GCATGCGTAT, TACGATCTTACG, CGGAGAACGTA. Edited reference (step 3): ACGCATGCGTATCGAGCTACTACG, CGCGTACGATCTTACGTAGGCACG, AATTAAATTCGGAGAACGTAATAAC.]
Figure 1: RECORD: Reference-Assisted Genome Assembly for Closely Related Genomes. The inputs of the pipeline, that is, the experimental
reads and the reference genome, are illustrated in the top left and top right of the figure, respectively. Intermediate results produced in various
steps of the analysis process are depicted. The dependency between these intermediate results is shown by arrows. In the illustration of the
3rd step, we underlined those segments of the edited reference which were replaced by one of the assembly contigs.
of RECORD and compare it to Amos [9], one of the most popular assisted assembly tools. We show that, under realistic conditions of approximately 1 percent divergence between the reference genome and the studied sequence, our approach outperforms naive approaches and Amos (which excels in situations where the divergence is much higher). To ensure reproducibility and extensibility of our work, we evaluate our approach on several collections of publicly available next-generation sequencing data sets originating from various model organisms such as yeast (S. pombe), fruit fly (D. melanogaster), and plant (A. thaliana).
[Figure 2 illustration. Reference genome, Chr1: ACTCACGCGATACGAGCTACTACGGAGGATC... Pseudoreads (shown in pairs): ACTCACG / AGCTACT, CACGCGA / TACTACG, GCGATAC / TACGGAG. Dimension labels: n, m, d.]
Figure 2: Generation of pseudoreads from the reference genome.
2. Implementation
We propose RECORD, Reference-Assisted Genome Assembly for Closely Related Genomes. Our approach consists of the following steps (see Figure 1):
(1) We generate pseudoreads from the reference genome. We generate pseudoreads in order to ensure that the coverage of the genome is large enough.
edited reference, the segments of which are replaced according to the mapping. This step ensures that the edited reference is close to the true genome of the organism, while it covers as many regions of the genome as possible.
Below we give a detailed description of the above steps.
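To make step (1) more concrete, the following is a minimal Python sketch of pseudoread generation that reproduces the three pseudoread pairs drawn in Figure 2. It is an illustration under stated assumptions, not RECORD's actual implementation: the function name and the parameters `read_len`, `gap`, and `step` are hypothetical, their values (7, 7, 3) are read off the figure's example, and the figure labels n, m, and d are not defined in this excerpt, so no mapping to them is claimed. Both mates are taken from the same strand, as drawn in the figure.

```python
from typing import List, Tuple


def paired_pseudoreads(reference: str, read_len: int, gap: int, step: int) -> List[Tuple[str, str]]:
    """Cut paired pseudoreads from a reference by sliding a window along it.

    All parameter names are hypothetical (assumptions for illustration):
    read_len is the length of each mate, gap is the number of bases
    between the two mates, and step is the shift between consecutive pairs.
    """
    if read_len <= 0 or gap < 0 or step <= 0:
        raise ValueError("read_len and step must be positive, gap non-negative")
    span = 2 * read_len + gap  # bases covered by one pair, including the gap
    pairs = []
    for start in range(0, len(reference) - span + 1, step):
        left = reference[start:start + read_len]
        right_start = start + read_len + gap
        right = reference[right_start:right_start + read_len]
        pairs.append((left, right))
    return pairs


# Reference fragment taken from Figure 2.
reference = "ACTCACGCGATACGAGCTACTACGGAGGATC"
pairs = paired_pseudoreads(reference, read_len=7, gap=7, step=3)
for left, right in pairs[:3]:
    print(left, right)
```

With these assumed values, the first three pairs match the pseudoreads shown in Figure 2; a smaller `step` yields more pseudoreads and therefore higher coverage of the reference.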
available, while the experiments on real data will show that our approach may be useful in real applications.
Next, we present the results of the experimental evaluation of our approach.
3.1. Baselines. In the experiments presented in the subsequent sections, we used two genome assemblers, Velvet [17] and Amos [9], as baselines. Velvet is a de novo genome assembler; that is, it assembles the genome directly from the experimental reads, whereas Amos is one of the most popular assisted genome assembly software tools; that is, Amos uses both the experimental reads and the reference genome of a genetically related organism in order to reconstruct the genome of the studied organism. Throughout the description of the experiments, by Velvet we refer to the case of using Velvet as a standalone application, even though our approach, referred to as RECORD, uses Velvet by default as a component of the proposed pipeline.
We also tried to use further assisted genome assemblers, such as ARACHNE [11, 12] and IMR/DENOM [13]. While these software tools may excel in various general settings (such as using the reference genome of a species to reconstruct the genome of another species), as far as we can judge, they do not seem to fit our special setting of relatively low coverage (i.e., few experimental reads) and very closely related genomes. For example, in some cases, the outputted genome was the reference genome, which, on one hand, may be considered reasonable if the actual genome and the reference genome are highly similar (i.e., they are almost the same); on the other hand, this is a trivial solution for the assisted assembly problem, as the reference is one of the inputs of reference-assisted assembly methods.
3.2. Evaluation on Simulated Data. We simulate the scenario that the reference genome is given and we aim to reconstruct the actual genome of the studied organism, which we call the target genome. In particular, we used the Evolver software tool [20] to generate the target genome. We used the genome from the example that comes with Evolver. This is an artificial mammalian genome of 30 megabases (Mb) in size. The genome has three chromosomes. In order to allow for an unbiased evaluation, we produced the evolved genome following the example attached with Evolver. We used the original genome, that is, the ancestral genome, as the reference genome, and we considered the evolved genome as the target genome. We generated one million paired-end short reads of length 70 with wgsim [21] from the target genome. Subsequently, we tried to reconstruct the target genome from the generated paired-end reads and the reference genome both with our approach and with two other state-of-the-art genome assemblers. Throughout the experiments on simulated data, we used Velvet with a 𝑘-mer size of 𝑘 = 21. Finally, we compared the outputs of the assemblers with the target genome and quantitatively measured the quality of each of the assemblers according to the following criteria:
(1) TL, the total length of the assembly in Mb.
(2) N50; that is, we consider the set of largest contigs that together cover at least 50% of the assembly, and out of these contigs the length of the shortest one is denoted as N50.
(3) Error = 100% − IDY, where IDY is the percent identity between the target genome and the genome reconstructed by the assembler. (Please note that IDY is different from idy(ref). While idy(ref) denotes the identity between an assembly contig and the corresponding segment of the reference genome, we use IDY to denote the identity between the output of the assembly and the target genome.) In order to calculate IDY, we mapped the genome reconstructed by the assembler to the target genome using the MUMmer software tool [18], and we calculated the weighted average of the percent identities between the mapped segments and the target genome as outputted by MUMmer. In the weighted average, we used the length of the mapped segments as weights.
(4) Number of identical bases, which we calculated as IDY × TL.
Both in case of our approach and in case of the baselines, we evaluated both the contigs and the edited reference resulting from using the contigs. In case of evaluating the edited reference for the baselines, we simply used the contigs outputted by the baselines in the third step of our approach and produced the edited reference.
Table 1: Evaluation on simulated data.
Assembly            TL (Mb)   N50       Error (in %)   Id. Bases (Mb)
Contigs
  Velvet            18.20     213 b     0.85           18.05
  Amos              28.82     1834 b    2.09           28.22
  RECORD            25.81     2055 b    0.41           25.70
Edited reference
  Velvet            30.00     10 Mb     1.39           29.58
  Amos              30.00     10 Mb     1.03           29.69
  RECORD            30.00     10 Mb     0.59           29.82
Table 1 summarizes our results. The columns of the table show the total length (TL) of the assembly, N50, error, and the number of identical bases. As one can see, our approach, RECORD, is competitive with the other assemblers: the contigs produced by our pipeline have the highest N50 and the lowest error rate, while the edited reference produced by RECORD has the overall highest number of identical bases with the target genome.
In a subsequent experiment, we varied the number of reads used for the assembly and evaluated the resulting contigs. These results are shown in Figure 4. Diagram (a) shows the number of bases in the target genome that are covered by the assembly contigs as a function of the number of reads that were used. It is important to note that while Amos can provide overall better coverage of the sequence, it requires more reads (>500 k) for that. In the lower range of the number of reads available, it is outperformed by RECORD. It may be
[Figure 4 plots. Panels: (a) Number of covered bases (millions), (b) Accuracy (%), (c) Cov50∗. Horizontal axis in all panels: Number of reads (thousands), 100 to 1000. Series: Velvet, Amos, RECORD.]
Figure 4: Comparison of the proposed approach (RECORD) with two state-of-the-art genome assemblers on data simulated with wgsim. In
this experiment, we consider the evolved genome produced by Evolver as the target genome; the reference genome is the ancestral genome.
The diagrams show the performance of the examined approaches according to various criteria as a function of the number of simulated
reads that were used for the assembly. Diagram (a) shows the number of covered bases of the target genome; diagram (b) shows the
accuracy, that is, the overall percent identity between the assembly contigs and the corresponding segments of the target genome;
diagram (c) shows the number of those largest contigs that together cover at least 50% of the target genome.
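The evaluation criteria of Section 3.2 (N50, the length-weighted identity IDY, Error = 100% − IDY, and the number of identical bases IDY × TL) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation code: the function names are hypothetical, and the mapped segments are assumed to be already available as (length, percent identity) pairs, for example extracted from MUMmer output by some separate step that is not shown here.

```python
from typing import Sequence, Tuple


def n50(contig_lengths: Sequence[int]) -> int:
    """Length of the shortest contig in the smallest set of largest
    contigs that together cover at least 50% of the assembly."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs given")


def weighted_identity(segments: Sequence[Tuple[int, float]]) -> float:
    """Length-weighted mean of percent identities.

    segments holds (mapped segment length, percent identity) pairs;
    this input format is an assumption made for the sketch.
    """
    total_len = sum(length for length, _ in segments)
    if total_len == 0:
        raise ValueError("no mapped segments given")
    return sum(length * idy for length, idy in segments) / total_len


def error_percent(idy: float) -> float:
    """Error = 100% - IDY."""
    return 100.0 - idy


def identical_bases(idy: float, total_length: float) -> float:
    """Number of identical bases = IDY x TL, with IDY given in percent."""
    return idy / 100.0 * total_length


# Toy inputs, not data from the paper.
toy_n50 = n50([10, 8, 6, 4, 2])
toy_idy = weighted_identity([(100, 99.0), (300, 100.0)])

# Consistency check against the RECORD contigs row of Table 1:
# TL = 25.81 Mb and Error = 0.41% should give about 25.70 Mb identical bases.
record_identical = identical_bases(100.0 - 0.41, 25.81)
print(toy_n50, toy_idy, round(record_identical, 2))
```

Applying `identical_bases` to the TL and error values of the other contig rows of Table 1 gives the reported Id. Bases values after rounding to two decimals as well.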
relevant for practical applications as the cost of the experiment usually depends on the number of reads produced. While in this simulated case the number of reads is relatively low by the standards of today's NGS technology, it may still be relevant in multiplexing scenarios where multiple experimental DNA libraries are barcoded and pooled in a single sequencing lane.
The second diagram (b) shows the overall percent identity between the target genome and the contigs of the assembly. The third diagram (c) shows the number of those largest contigs that together cover at least 50% of the target genome. As one can see, if only relatively few reads are available, our approach, RECORD, systematically outperforms the baselines by producing larger contigs and the most accurate and most complete assembly.
[Figure 5 plot. Vertical axis ticks: 0.05 to 0.07. Horizontal axis: Number of reads (thousands), 100 to 900.]
Figure 5: Proportion of ambiguously mapped contigs (before the selection of the best mapping for each contig) in case of various numbers of simulated reads.
We note that, from the point of view of applications, there is a substantial difference between the execution times of RECORD and Amos. For example, when using 300 thousand reads, producing the edited reference took approximately 1 hour for our approach, whereas it took 16 hours for Amos. We emphasize that this observation refers to the practical application of the software but not to the overall (theoretical) computational costs: much of the observed difference may be attributed to the fact that Velvet, which is used by default as the assembler in the proposed pipeline, is able to run in parallel on multiple cores, whereas Amos can be used on one core at a time. Due to the fact that RECORD uses a de novo assembler as a component of the proposed pipeline, our approach is limited to middle-sized genomes that are closely related to the reference genome; therefore it is currently not applicable to the human and comparable genomes.
3.3. Evaluation on Real Data. The primary goal of the evaluation on real data was to show that our approach can be applied in real experiments.
3.3.1. Assessment of the Accuracy in Comparison to the Baseline. As mentioned previously, in real-world settings, there is usually no gold standard available. Therefore, the assessment of the accuracy of the genome produced by any assembler is inherently difficult. For this reason, in the subsequent experiment, we evaluate the accuracy of the proposed method on real data indirectly. In particular, we assess the quality of the contigs and, more importantly, we compare our approach to the baselines in the following setting: we examine how well
[Figure 6 plots. Panels: (a) Number of covered bases (millions), (b) Accuracy (%), (c) Cov50∗. Horizontal axis in all panels: Number of reads (millions), 1 to 4.]
Figure 6: Comparison of the proposed approach (RECORD) with two state-of-the-art genome assemblers on real data. In this experiment,
we compared assemblies resulting from various numbers of experimental reads to the assembly which is produced by Amos using all the
experimental reads; that is, the target genome is the assembly produced by Amos using all the reads. In this case, the reference genome
exhibits 99.7 percent identity with the result of Amos, which is used as the gold standard. The diagrams follow the same structure as those
in Figure 4.
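The experiment of Figure 6 relies on subsets of the paired-end reads in which both mates of a pair are either kept together or dropped together (Section 3.3.1). A minimal Python sketch of such a pair-preserving subsample is given below. How the subsets were actually drawn is not stated in this excerpt, so uniform random sampling without replacement with a fixed seed is an assumption, and the function and parameter names are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def subsample_pairs(mates1: Sequence[str], mates2: Sequence[str],
                    n_pairs: int, seed: int = 0) -> Tuple[List[str], List[str]]:
    """Select n_pairs read pairs so that both mates of a pair are kept or dropped together.

    mates1[i] and mates2[i] are assumed to be the two mates of pair i.
    Uniform sampling without replacement is an assumption of this sketch.
    """
    if len(mates1) != len(mates2):
        raise ValueError("mate lists must have the same length")
    if not 0 <= n_pairs <= len(mates1):
        raise ValueError("n_pairs must be between 0 and the number of pairs")
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    chosen = sorted(rng.sample(range(len(mates1)), n_pairs))
    return [mates1[i] for i in chosen], [mates2[i] for i in chosen]


# Toy read identifiers, not data from the paper.
m1 = [f"pair{i}/1" for i in range(10)]
m2 = [f"pair{i}/2" for i in range(10)]
sub1, sub2 = subsample_pairs(m1, m2, n_pairs=4)
print(list(zip(sub1, sub2)))
```

Sampling indices once and applying them to both mate lists is what guarantees that no read ends up in the subset without its mate.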
either both sequences corresponding to a particular paired-end read are selected or none of the sequences of that paired-end read is selected.
In the aforementioned context, as gold standard, that is, target genome, we consider the genome produced by Amos when using all the reads for the assembly. We note that this leads to an evaluation in which Amos has an inherent advantage over our approach, as unfortunately we cannot have an unbiased reference.
We used real-world experimental reads graciously provided by Dr. Andrzej Dziembowski’s group, coming from an unpublished ChIP-seq experiment in a yeast species. The data contained approximately 4.5 million paired-end short reads of length 100.
Figure 6 shows the results. The diagrams follow the same structure as the ones presented at the end of Section 3.2; that is, diagram (a) shows the number of bases in the target genome that are covered by the assembly contigs as a function of the number of reads that were used. The second diagram (b) shows the accuracy, that is, the overall percent identity between the target genome and the contigs of the assembly. The third diagram (c) shows the number of those largest contigs that together cover at least 50% of the target genome. In all three diagrams, the horizontal axis shows the size of the sample (i.e., the number of paired-end reads) used to assemble the genome. As one can see, our approach, RECORD, systematically outperforms the baselines in terms of accuracy and coverage of the genome. Note that, in case of using very few reads, Velvet achieves as good accuracy as our approach; however, the contigs it produces have very low coverage.
[Figure 7 plot. Vertical axis ticks: 0 to 0.3. Horizontal axis: data sets A.1, A.2, A.3, D.1, D.2, D.3, P.1, P.2, P.3, P.4, P.5, P.6, P.7.]
Figure 7: Proportion of ambiguously mapped contigs (before the selection of the best mapping for each contig) in case of experiments on publicly available data sets.
3.3.2. Characteristics of the Assembly of Publicly Available Data Sets. In order to assist reproducibility of our results, we used publicly available real short read data from the NCBI Short Read Archive. We used data originating from three different species: plant (A. thaliana), fly (D. melanogaster), and yeast (S. pombe). The identifiers of the short read collections are shown in the third column of Tables 2 and 3.
We set the 𝑘-mer size for the assembly, that is, the second step of our approach, in accordance with the length of the short reads in the archive and the (approximate) size of the target genome. In particular, similarly to the previous experiments, we set 𝑘 = 21 for yeast (short read length = 44), while we used slightly larger settings for the other two species: we set 𝑘 = 45 for the plant (short read length = 80) and 𝑘 = 25 for fly (short read length = 36).
Tables 2 and 3 show the most important characteristics of the resulting assembly contigs and the edited reference.
Table 3: Assembly of real experimental reads (edited reference).
Species           Number   ed.len. (Mb)   % ref   % asm   # ctgs   % IDY
A. thaliana       A.1      109.9          91.8    97.3    51769    99.967
A. thaliana       A.2      106.4          89.0    95.5    58052    99.918
A. thaliana       A.3      109.8          91.8    97.3    54185    99.972
D. melanogaster   D.1      117.4          82.1    95.8    50598    99.985
D. melanogaster   D.2      117.0          81.8    95.4    50606    99.986
D. melanogaster   D.3      117.2          82.0    95.6    50630    99.986
S. pombe          P.1      12.0           95.1    78.9    3548     99.995
S. pombe          P.2      12.0           95.2    98.4    2931     99.994
S. pombe          P.3      12.0           95.0    99.2    4249     99.996
S. pombe          P.4      12.0           95.3    99.2    3054     99.994
S. pombe          P.5      11.9           94.6    99.2    5287     99.996
S. pombe          P.6      12.0           94.9    100.0   4762     99.996
S. pombe          P.7      12.0           95.3    99.2    3465     99.994
In particular, the fourth column of Table 2 shows the total length of the assembly contigs; the fifth column shows the number of all the contigs, while in the sixth column the N50 of the contigs is shown. The last column shows the coverage of experimental reads calculated as follows:
\[ \mathrm{Cov} = \frac{\text{read length} \times \text{number of reads}}{\text{genome size}}. \tag{2} \]
The third column of Table 3 shows the total length of the replaced segments, while the fourth and fifth columns show in percent the ratio of the length of the replaced segments relative to the length of the reference and the total length of the assembly contigs, denoted as % ref and % asm, respectively. The sixth column of Table 3 shows the number of contigs that were used while editing the reference. The last column of Table 3 shows the overall percent identity between the edited reference and the original reference.
Figure 7 shows the proportion of ambiguously mapped contigs (before the selection of the best mapping for each contig) for each experiment shown in this section. As one can see, the proportion of ambiguously mapped contigs varies between 4% and 8% in case of S. pombe; it is around 10% in case of D. melanogaster; and it is remarkably higher, around 35%, for A. thaliana.
The results in Table 3 show that the total length of the assembly is close to the genome size, indicating the completeness of the assembly. However, the relatively large number of contigs in the raw assembly output can be seen as an indication that the assembler had difficulties with particular regions of the genome, and therefore a large number of short fragments may have been produced. This is especially visible in the case of D. melanogaster, where two factors influencing the quality of assembly are combined: low read coverage and low read length.
According to the proposed procedure of editing the reference, a contig may be left out if it cannot be mapped to the reference or if MUMmer considers it too short to produce a useful alignment. As we can see, the number of contigs contributing to the edited reference is substantially less than the total number of contigs. However, in terms of length, almost the entire assembly is used; for example, in each of the D. melanogaster data sets the edited assembly utilizes ∼11% of all contigs, covering over 95% of the assisted assembly. This shows that reference editing relies on a moderate number of long contigs rather than on a bulk of short ones.
The edited part of the genome in A. thaliana is slightly smaller than in D. melanogaster, but the number of contributing contigs is slightly larger. Therefore, contigs obtained for the former organism are generally shorter than those obtained for the latter (this is also in accordance with the observation that N50 of A. thaliana is ∼3× lower than N50 of D. melanogaster). Shorter contigs are more likely to be nonuniquely mapped, as we can observe in Figure 7; the proportion of ambiguously mapped contigs is similar to the proportion obtained in simulated data for S. pombe and D. melanogaster, while it is remarkably higher for A. thaliana.
Percentages of replaced segments in genome editing (% ref and % asm) are also similar to those observed in simulated data for two of our species (A. thaliana and S. pombe), while
they are slightly lower for D. melanogaster. This behavior is explained by the difference in the coverage, which in D. melanogaster is an order of magnitude lower than in the two other species. The results indicate that the outputted genomes are closely related to the reference. This is expected, since the genomes of individuals are close to a reference genome of the respective species.
Overall, the results on real-world data are similar to those on simulated data (in some respects, e.g., N50, even better). Visibly more variability is observed between results on real data sets with different characteristics: read length, coverage, and so forth.
Operating system(s): Linux
Programming language: Perl, Java
Other requirements: Velvet, MUMmer
License: Open Source
Any restrictions to use by nonacademics: no.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.