doi: 10.1093/bioinformatics/btx524
Advance Access Publication Date: 21 August 2017
Original Paper
Sequence analysis
Abstract
Motivation: Contigs assembled from the second generation sequencing short reads may contain
misassemblies, and thus complicate downstream analysis or even lead to incorrect analysis re-
sults. Fortunately, with more and more sequenced species available, it becomes possible to use
the reference genome of a closely related species to detect misassemblies. In addition, long reads
of the third generation sequencing technology have been more and more widely used, and can
also help detect misassemblies.
Results: Here, we introduce ReMILO, a reference assisted misassembly detection algorithm that
uses both short reads and PacBio SMRT long reads. ReMILO aligns the initial short reads to both
the contigs and reference genome, and then constructs a novel data structure called red-black
multipositional de Bruijn graph to detect misassemblies. In addition, ReMILO also aligns the contigs to
long reads and finds their differences from the long reads to detect more misassemblies. In our
performance test on short read assemblies of human chromosome 14 data, ReMILO can detect
41.8–77.9% extensive misassemblies and 33.6–54.5% local misassemblies. On hybrid short and
long read assemblies of S.pastorianus data, ReMILO can also detect 60.6–70.9% extensive misas-
semblies and 28.6–54.0% local misassemblies.
Availability and implementation: The ReMILO software can be downloaded for free under Artistic
License 2.0 from this site: https://github.com/songc001/remilo.
Contact: baoe@bjtu.edu.cn
Supplementary information: Supplementary data are available at Bioinformatics online.
1 Introduction
The second generation sequencing technology can produce short reads to sequence a species. The short reads are several hundred bp long and cost below $0.1 per Mbp, and can be assembled into contigs of the target genome. However, the assembled contigs usually contain errors. For example, the contigs and published genomes of S.aureus and P.falciparum assembled from short reads were found to contain 2.0% and 5.1% errors, respectively (Hunt et al., 2013). The introduced errors in contigs are mainly due to genomic repeats and multiploidy, which make it difficult to distinguish short reads from similar genome regions. The errors can be divided into two categories: small errors, i.e. mismatches and small indels, and misassemblies, i.e. rearrangements and/or significantly large indels. The former can directly affect downstream single nucleotide polymorphism (SNP) analysis, since such errors are difficult to distinguish from SNPs, while the latter affect structural variation (SV) analysis (Feuk et al., 2006). Compared to small errors, misassemblies are more challenging to detect, because they are usually much larger than the initial short reads and are also more difficult to distinguish from SVs.
Several algorithms have been published to detect misassemblies (Hunt et al., 2013; Muggli et al., 2015; Walker et al., 2014; Zhu et al., 2015). Depending on the input, they can be divided into the
© The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com
ReMILO 25
following two categories. (1) Some algorithms take only the initial short reads as input. Both REAPR (Hunt et al., 2013) and Pilon (Walker et al., 2014) align the short reads to contigs and calculate the following statistics to detect misassemblies: (i) read coverage, (ii) number of incorrectly oriented reads (i.e. mate pairs align in an inverted orientation) and (iii) number of partially aligned reads. REAPR calculates differences between expected and observed read coverage for detection, while Pilon calculates changes of read cover-

difficult to find the good enzyme choice. Therefore, we use the reference genome and long reads as alternative data sources to detect misassemblies. Although the misassembly detection performance is affected by similarity of the reference genome and coverage of the long reads, it is much more stable, and a wide range of reference genomes and long read sets can be used to achieve good performance. In addition, obtaining the reference genome and long reads costs very little: the former could be downloaded from a public re-
2.1 Short read alignments to contigs and reference genome
We align the short reads to both the contigs and reference genome. To align the short reads to the contigs, Bowtie2 is used, because it is specifically designed for short read alignment to long sequences of the same species (Langmead and Salzberg, 2012). To align the short reads to the reference genome, we apply a contig guided approach, since there has rarely been an aligner specifically designed for short

2.3 Coloring of red-black multipositional de Bruijn graph
For each vertex, we calculate the total number of reads d and the number of incorrectly oriented reads (i.e. mate pairs align in an inverted orientation) c generating its joined vertices as indicators of its reliability. We color a vertex red, if d ∉ [D, K] or c > C, where [D, K] is the acceptable range of read coverage, and C is the maximum number of incorrectly oriented reads; otherwise, we
alternative black path in (2.3), but it simplifies the algorithm and can help detect more misassemblies.
Compared to the misassembly detection approach of the misSEQuel algorithm, our approach not only checks the reliability of vertices, but also checks the inconsistent vertices and the alternative reliable paths, so could achieve higher sensitivity and accuracy. Figure 1 shows an illustration on advantage of the red-black multipositional de Bruijn graph over the red-black positional de Bruijn

2.6 Implementation of the ReMILO software
The ReMILO software is implemented in C++ for Linux platform. ReMILO’s input includes the contigs, reference genome, short reads and long reads, and its output includes a file recording locations of the detected misassemblies and another file containing split contigs at the misassembly locations. ReMILO can work with or without long reads, depending on whether they are inputted.
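The split-contig output mentioned in Section 2.6 can be pictured with a minimal Python sketch. The function name `split_contig`, the 0-based cut positions and the in-memory string representation are assumptions made for illustration; ReMILO's actual file formats are not described here.

```python
def split_contig(sequence, cut_positions):
    """Split a contig sequence at the given 0-based cut positions.

    Positions outside the open interval (0, len(sequence)) are ignored,
    and duplicates are collapsed, so no empty pieces are produced.
    """
    cuts = sorted({p for p in cut_positions if 0 < p < len(sequence)})
    bounds = [0] + cuts + [len(sequence)]
    return [sequence[a:b] for a, b in zip(bounds, bounds[1:])]


# A toy contig with two hypothetical misassembly locations.
pieces = split_contig("ACGTACGT", [3, 5])
```

A contig with no detected misassembly is returned unchanged as a single piece.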
(A) Red-Black Positional de Bruijn Graph
Short reads (sequence, contig position, multiplicity): (ACGA,0,4), (CGAG,1,4), (GAGC,2,4), (AGCA,3,4), (GCAT,4,4), ..., (ACGT,593,2), (CGTG,594,2), (GTGC,595,2), (TGCA,596,2), (ACGT,0,2), (CGTG,1,2), (GTGC,2,2), (TGCA,3,2)
(B) Red-Black Multipositional de Bruijn Graph
Short reads (sequence, contig position, reference position, multiplicity): (ACGA,0,112,4), (CGAG,1,113,4), (GAGC,2,114,4), (AGCA,3,115,4), (GCAA,-1,116,4), (CAAC,-1,117,4), (AACG,-1,118,4), (ACGA,-1,119,4), (CGAG,-1,120,4), (GAGC,-1,121,4), (AGCA,-1,122,4), (GCAT,4,116,4), ..., (ACGT,593,712,2), (CGTG,594,713,2), (GTGC,595,714,2), (TGCA,596,715,2), (ACGT,0,112,2), (CGTG,1,113,2), (GTGC,2,114,2), (TGCA,3,115,2)
Fig. 1. Advantage of the red-black multipositional de Bruijn graph over the positional de Bruijn graph. Compared to the target genome (unknown in dashed rectangle), the contig has a misassembly not containing genome region A′. Contig/genome regions A and B are similar to each other. (A) Short reads are listed with their alignment positions to the contig and multiplicity, constructing a red-black positional de Bruijn graph. The short reads from genome region A′ need to be aligned to contig region A, resulting in excessive coverage in region A and a corresponding red path in the graph, to detect the misassembly. However, due to some alignment issue, the short reads are not aligned, resulting in a black path from vertex (ACG, 0) to (CGA, 1) to (GCA, 4), so the misassembly is not detected. In addition, many short reads from genome region B are aligned to contig region A, resulting in insufficient coverage in region B and a corresponding red path in the graph from vertex (ACG, 593) to (GCA, 597), so a false detection occurs in region B. (B) Short reads are listed with their alignment positions to the contig, to the reference genome and multiplicity, constructing a red-black multipositional de Bruijn graph. The short reads from genome region A′ are aligned to reference genome region A′, resulting in inconsistent vertices (GCA, 4, 116) and (CAT, 5, 124) (|5 − 4| = 1 but |124 − 116| = 8) and an alternative black path connecting them from vertex (CAA, 5, 117) to (GCA, 1, 123) (shaded), so the misassembly is detected. In addition, no false detection occurs, because there does not exist any other inconsistent vertices.
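The multipositional vertices and the inconsistency check described in the caption can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the ReMILO implementation: the function names are hypothetical, k = 3 is taken from the three-letter vertices in the caption, a contig position of -1 is read as "not aligned to the contig", and two vertices are treated as inconsistent when their contig distance and reference distance differ.

```python
UNALIGNED = -1  # assumed marker for a read with no contig alignment


def read_to_vertices(read, contig_pos, ref_pos, k=3):
    """List the (k-mer, contig position, reference position) vertices of a read."""
    vertices = []
    for i in range(len(read) - k + 1):
        cpos = contig_pos + i if contig_pos >= 0 else UNALIGNED
        vertices.append((read[i:i + k], cpos, ref_pos + i))
    return vertices


def is_inconsistent(u, v):
    """True when the contig and reference distances between u and v disagree."""
    _, cu, ru = u
    _, cv, rv = v
    if cu == UNALIGNED or cv == UNALIGNED:
        return False  # no contig position to compare against
    return abs(cv - cu) != abs(rv - ru)


# The first read of panel (B) and the vertex pair named in the caption.
first_read = read_to_vertices("ACGA", 0, 112)
flagged = is_inconsistent(("GCA", 4, 116), ("CAT", 5, 124))
```

For the caption's pair the contig distance is 1 while the reference distance is 8, so the pair is flagged; ReMILO additionally looks for an alternative black path connecting such vertices, which this sketch does not attempt.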
28 E.Bao et al.
length × amount: 1851 × 591.2k bp). The long reads were error corrected by HALC, a long read error corrector designed by ourselves (Bao and Lan, 2017). (ii) We also compared ReMILO to the existing algorithms on synthetic contigs. The synthetic contigs were generated from the initial contigs by combining them and adding relatively large indels. Compared to misassemblies in the initial contigs, the introduced misassemblies are a little more explicit, but have known locations, so can be used to compare the algorithms in a more accurate manner. In the following discussion, the synthetic contigs are referred to as syntigs to distinguish from the initial ones. (iii) In addition, on the contigs, we varied the reference chromosomes inputted to ReMILO, including the chromosomes of gorilla, orangutan, gibbon and macaque [from the Ensembl FTP site (release 85)], to see impact of the reference genomes upon misassembly detection results. Similarity of the chromosomes to the human chromosome 14 drops from chimpanzee to macaque, and is quantified as percentages of the human chromosome 14 short reads alignable to the chromosomes.

3.1.2 Test on short read assemblies of japonica rice data
(i) In order to compare ReMILO with misSEQuel with or without an additional data source inputted (long reads to ReMILO and optical mapping data to misSEQuel), we used a japonica rice data (genome type: diploid; size: 374.5 Mbp; downloaded from NCBI accession GCA_001623365.1). The contigs were assembled from short reads of 55× coverage (read length × amount: 76 × 268.9 M bp; from NCBI accession SRX032913) by short read assemblers IDBA (Peng et al., 2012), SPAdes (Bankevich et al., 2012) and Velvet (Zerbino and Birney, 2008). The selection of assemblers was consistent with Muggli et al. (2015), excluding ABySS and SOAPdenovo2 whose contigs contain few misassemblies. All the short reads were used for misassembly detection. The reference genome was the indica rice genome [from the Ensembl Plants FTP site (release 33)]. The long reads were of 20× coverage (read length × amount: 2950 × 2520.6k bp; from NCBI accession SRX1897300) and error corrected by HALC. The optical mapping data was queried and obtained from Kawahara et al. (2013). (ii) In addition, we varied the long read coverage from 10× to 40× to see impact of the long read coverage upon misassembly detection results.

3.1.3 Test on short and long read hybrid assemblies of S.pastorianus data
(i) Because long reads of relatively higher coverage can not only be inputted to ReMILO to detect misassemblies but also be assembled together with short reads, we also compared ReMILO to REAPR, Pilon and misFinder on hybrid S.pastorianus contigs assembled from both short and long reads (genome type: triploid; size: 18.7 Mbp; downloaded from NCBI accession AZCJ00000000.1). The contigs were assembled from short reads of 72× coverage (read length × amount: 300 × 2.7 M bp; from NCBI accession DRX036591) and long reads of 37× coverage (read length × amount: 2942 × 244k bp) by assembler SPAdes, which is a typical assembler supporting the hybrid assembly inputted with either the uncorrected long reads or corrected ones. The long reads were simulated by PacBio reads simulator PBSIM (Ono et al., 2013), and then inputted to SPAdes directly, and alternatively, corrected by HALC and then inputted to SPAdes. All the short and corrected long reads were used for misassembly detection. The reference genome was the S.cerevisiae genome (from NCBI accessions NC_001133.9-NC_001148.4). (ii) In addition, again, we compared ReMILO to the existing algorithms on syntigs generated from the contigs by combining them and adding relatively large indels.

An additional test on short and long read hybrid assemblies of A.thaliana data is described in Supplementary Section S2, and the results are shown in Supplementary Section S3. All the software above was in default settings. Assembly statistics are listed in Supplementary Table S1. The statistics may have some differences from those reported previously in Salzberg et al. (2012) and Muggli et al. (2015), probably because of the updated assembler versions.
There are two things to note in these tests. (i) Although we used quite a few assemblers, the purpose of this paper is not to compare these assemblers, but to show ReMILO’s performance working on contigs by different assemblers. (ii) The reference genomes or chromosome inputted to ReMILO can also be used to extend the assembled contigs (Bao et al., 2014) or build scaffolds (Kim et al., 2013; Kolmogorov et al., 2014), but the extended contigs are usually high quality ones with limited misassemblies and the scaffolds do not change the initial contigs, so we did not run ReMILO on the extended contigs or scaffolds.
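As a quick check of the coverage figures quoted for the japonica rice data, coverage can be estimated as read length times number of reads divided by genome size. The sketch below assumes that "read length amount" in the data descriptions denotes that product; the function name `coverage` is hypothetical.

```python
def coverage(read_length_bp, read_count, genome_size_bp):
    """Estimate sequencing depth as total sequenced bases over genome size."""
    return read_length_bp * read_count / genome_size_bp


RICE_GENOME_BP = 374.5e6  # japonica rice genome size given in the text

# Short reads: 76 bp reads, 268.9 M reads.
short_cov = coverage(76, 268.9e6, RICE_GENOME_BP)
# Long reads: 2950 bp average length, 2520.6k reads.
long_cov = coverage(2950, 2520.6e3, RICE_GENOME_BP)
```

Under this reading the two estimates round to 55 and 20, in line with the coverages stated for the short and long reads of the rice data.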
Note: The contigs are assembled by various short read assemblers ALLPATHS-LG, MaSuRCA and SOAPdenovo2, and the syntigs are synthesized from them.
The performance of ReMILO is compared to REAPR, Pilon and MisFinder. TPR (extensive) is the true positive rate of extensive misassemblies, TPR (local) is the
true positive rate of local misassemblies, and FPR is the false positive rate of misassemblies. The best value for each column is shown in boldface.
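The rates named in this note follow the definitions of Section 3.2 and can be sketched in Python as below. The function names and the example counts are hypothetical and chosen only to show the arithmetic; they are not results from the paper.

```python
def true_positive_rate(correctly_detected, total_true):
    """Correctly detected misassemblies over all true misassemblies."""
    return correctly_detected / total_true if total_true else 0.0


def max_possible_misassemblies(total_contig_bases, avg_distance_bp):
    """Total contig bases over the average distance between two misassemblies."""
    return total_contig_bases / avg_distance_bp


def false_positive_rate(incorrectly_detected, total_contig_bases, avg_distance_bp):
    """Incorrectly detected misassemblies over the maximum possible number."""
    return incorrectly_detected / max_possible_misassemblies(
        total_contig_bases, avg_distance_bp
    )


# Hypothetical counts: 30 of 40 true misassemblies found, 10 false calls,
# 1 Mbp of contigs, one misassembly every 10 kbp on average.
tpr = true_positive_rate(30, 40)
fpr = false_positive_rate(10, 1_000_000, 10_000)
```

The contig-level rates (TPRC and FPRC) use the same ratios with misassembled contigs in place of misassemblies, and with the number of correct contigs as the FPRC denominator.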
3.2 Performance measurements
We used QUAST to locate both true extensive misassemblies [MA (extensive)] and local misassemblies [MA (local)] in the contigs (Gurevich et al., 2013). QUAST aligns the contigs to the corresponding target genome or chromosome (i.e. human chromosome 14, japonica rice genome, S.pastorianus genome or A.thaliana genome), and checks flanking subcontigs aligned with distances [see Gurevich et al. (2013) for detailed definitions of the misassemblies]. Accordingly, a contig is an extensively misassembled contig [MC (extensive)], if it contains at least one extensive misassembly; a contig is a locally misassembled contig [MC (local)], if it contains at least one local misassembly. Note that one contig could be both an extensively misassembled contig and a locally misassembled contig.
Compared to the located errors and contigs, we made the following measurements. Here, a correctly detected misassembly (or misassembled contig) is a misassembly (or contig) located by QUAST, while an incorrectly detected misassembly (or misassembled contig) is a misassembly (or contig) not located by QUAST or any other algorithm. (i) True positive rate of extensive misassemblies [TPR (extensive)] is the number of correctly detected extensive misassemblies over the total number of extensive misassemblies. (ii) True positive rate of local misassemblies [TPR (local)] is the number of correctly detected local misassemblies over the total number of local misassemblies. (iii) True positive rate of misassemblies (TPR) is the number of correctly detected misassemblies over the total number of misassemblies (combining both extensive and local misassemblies). (iv) False positive rate of misassemblies (FPR) is the number of incorrectly detected misassemblies over the maximum number of possible misassemblies. The maximum number of possible misassemblies is estimated as the total number of contig bases over the average distance between two misassemblies. (v) True positive rate of extensively misassembled contigs [TPRC (extensive)] is the number of correctly detected extensively misassembled contigs over the total number of such contigs. (vi) True positive rate of locally misassembled contigs [TPRC (local)] is the number of correctly detected locally misassembled contigs over the total number of such contigs. (vii) False positive rate of contigs (FPRC) is the number of incorrectly detected misassembled contigs over the total number of correct contigs without misassemblies.
For the syntigs, we had known locations of the introduced true MA (extensive) and MA (local). Therefore, we made the same measurements as above, except that a correctly detected misassembly is a misassembly overlapping an introduced true error, while an incorrectly detected misassembly is a misassembly not overlapping any introduced true error or detected in the corresponding contigs.

3.3 Results
3.3.1 Results on short read assemblies of human chromosome 14 data
The results on various contig/syntig sets assembled by different short read assemblers or further synthesized are listed in Table 1. On the contigs, ReMILO can detect 41.8–77.9% extensive misassemblies and 33.6–50.8% local misassemblies with 11.1–21.8% false
detections. Compared to the existing algorithms, ReMILO detects more misassemblies with fewer false detections. On the syntigs, ReMILO can detect 58.0–75.1% extensive misassemblies and 49.3–

of the plot). Both rates are relatively stable despite the decrease in reference chromosome similarity from chimpanzee to macaque. Only with the least similar macaque chromosome, the true positive rates show dramatic drops. These results indicate ReMILO is not so much dependent on the reference genome, so given the contigs of

[Figure 3. Panel A: TPR and FPR with various reference chromosomes; x-axis: Chimpanzee (94.5%), Gorilla (91.6%), Orangutan (88.9%), Gibbon (49.9%), Macaque (29.9%); series: ALLPATHS-LG, MaSuRCA and SOAPdenovo2. Panel B: TPRC and FPRC with various long read coverage; x-axis: 0, 10x, 20x, 30x, 40x; series: IDBA, SPAdes and Velvet.]
Fig. 3. ReMILO’s true positive rate and false positive rate of misassemblies (TPR and FPR, respectively) on contigs with various reference chromosomes of the human chromosome 14 data (A) and true positive rate and false positive rate of misassembled contigs (TPRC and FPRC, respectively) with various long read coverage of the japonica rice data (B). Similarity of the chromosomes to the human chromosome 14 (quantified as percentages of the human chromosome 14 short reads alignable to the chromosomes) is listed together with the species names

tions. Compared to misSEQuel with optical mapping data inputted, ReMILO detects more misassembled contigs but it also makes more false detections; compared to misSEQuel without optical mapping data inputted, ReMILO detects fewer misassembled contigs but it also makes fewer false detections. These results indicate ReMILO keeps a balance between sensitivity and accuracy in misassembly detection. In addition, compared to itself without long reads inputted, ReMILO obtains 4.2–5.8% higher true positive rates of extensively misassembled contigs and 6.4–10.2% higher true positive rates of
Note: The contigs are assembled by various short read assemblers IDBA, SPAdes and Velvet. The performance of ReMILO with or without long reads inputted
(ReMILO-) is compared to misSEQuel with or without optical mapping data inputted (misSEQuel-). TPRC (extensive) is the true positive rate of extensively mis-
assembled contigs, TPRC (local) is the true positive rate of locally misassembled contigs, and FPRC is the false positive rate of contigs. The best value for each col-
umn excluding that of misSEQuel- is shown in boldface.
Note: The contigs are assembled by short and long read hybrid assembler SPAdes, and the syntigs are synthesized from them. In (a), the contigs are assembled
by SPAdes inputted with the uncorrected long reads, while in (b), by SPAdes inputted with the corrected ones [SPAdes (cor)]. The performance of ReMILO is com-
pared to REAPR and Pilon. Again, TPR (extensive) is the true positive rate of extensive misassemblies, TPR (local) is the true positive rate of local misassemblies,
and FPR is the false positive rate of misassemblies. The best value for each column is shown in boldface.
locally misassembled contigs, despite 1.1–2.4% higher false positive rates of contigs. These results indicate ReMILO is more sensitive using the long reads. Note that here we compared misassembled contigs rather than misassemblies, because misSEQuel has limited support for reporting misassemblies.
With the various long read coverage, ReMILO’s true positive and false positive rates of misassembled contigs are plotted in Figure 3B (see Supplementary Table S3 for raw numbers of the plot). Both true positive rates and false positive rates increase with the long read coverage, but the former increase to a much larger extent than the latter. These results indicate ReMILO is more sensitive without losing much accuracy with more long reads inputted. Hence, long reads of relatively high coverage are preferred for misassembly detection.

3.3.3 Results on short and long read hybrid assemblies of S.pastorianus data
The results on various contig/syntig sets assembled by SPAdes or further synthesized are listed in Table 3. On the contigs, ReMILO can detect 60.6–64.8% extensive misassemblies and 28.6–41.7% local misassemblies with 4.7–8.8% false detections; on the syntigs, ReMILO can detect 70.0–70.9% extensive misassemblies and 47.5–54.0% local misassemblies with 5.4–9.4% false detections. Overall, compared to the existing algorithms, ReMILO detects more misassemblies with a comparable amount of false detections. These results indicate ReMILO is also sensitive and accurate in detecting misassemblies in hybrid contigs/syntigs generated from both short and long reads.

3.3.4 Running time and memory usage
On contigs of the human chromosome 14, japonica rice and S.pastorianus data, ReMILO’s running time is 11.1–11.5, 17.5–21.1 and 7.3–7.7 h, respectively, and ReMILO’s memory usage is 4.7–4.9, 12.9–14.0 and 2.5–2.7 GB, respectively. About 40% of the total running time is for short read alignments to contigs and reference genome by Bowtie2 and BWA-MEM, about 20% is for constructing red-black multipositional de Bruijn graph to detect misassemblies, about 30% is for contig alignment to long reads by BWA-MEM, and about 10% is for using long reads to detect more misassemblies. For memory usage, the peak memory usage usually appears during contig alignment to long reads. These results indicate ReMILO is sufficiently fast and memory efficient working on various data, and can thus be practically used.

4 Conclusions
This paper introduces ReMILO, a reference assisted misassembly detection algorithm that uses both short and long reads. ReMILO constructs a red-black multipositional de Bruijn graph of simplicity and completeness from short read alignments to the contigs and reference genome. ReMILO checks the graph for inconsistent vertices and alternative reliable paths connecting them to detect misassemblies. In addition, ReMILO also uses long reads to detect more misassemblies. Experimental results demonstrate that ReMILO is sensitive and accurate. In the future, we will expand ReMILO in the following aspects. (i) We will provide support for additional variant-aware aligners for both short reads and contigs. (ii) The contig splitting at misassembly locations will be extended to further correctly join the split contigs. (iii) Additional data sources such as 10x Genomics linked reads will be supported (Zheng et al., 2016).

Acknowledgements
We thank Martin Muggli from the Colorado State University and Alexey Gurevich from the St. Petersburg Academic University for the discussions about misassemblies. We thank Takeshi Itoh from the National Institute of Agrobiological Sciences for providing us the optical mapping data of japonica rice, and also thank Shiguo Zhou from the Genome Surveillance, Inc. for answering our questions about the data. We thank Thomas Girke and Tao Jiang from the University of California, Riverside for the suggestions during
improvement of this work. We acknowledge the support of core facilities at the Institute for Integrative Genome Biology (IIGB), the University of California, Riverside.

Funding
This work has been supported by grants from the National Science Foundation of China [61502027 to E.B.], and the Fundamental Research

Kolmogorov,M. et al. (2014) Ragout a reference-assisted assembly tool for bacterial genomes. Bioinformatics, 30, i302–i309.
Koren,S. et al. (2012) Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nat. Biotechnol., 30, 693–700.
Langmead,B. and Salzberg,S.L. (2012) Fast gapped-read alignment with Bowtie 2. Nat. Methods, 9, 357–359.
Li,H. and Durbin,R. (2009) Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics, 25, 1754–1760.
Luo,R. et al. (2012) SOAPdenovo2: an empirically improved memory-efficient