ReMILO: Reference Assisted Misassembly Detection Algorithm Using Short and Long Reads

Bioinformatics, 34(1), 2018, 24–32
doi: 10.1093/bioinformatics/btx524
Advance Access Publication Date: 21 August 2017
Original Paper
Sequence analysis

Ergude Bao1,2,*,†, Changjin Song1,† and Lingxiao Lan1
1Software Engineering Research Center, School of Software Engineering, Beijing Jiaotong University, Beijing 100044, China and 2Department of Botany and Plant Sciences, University of California, Riverside, CA 92521, USA
*To whom correspondence should be addressed.
†The authors wish it to be known that, in their opinion, the first two authors should be regarded as Joint First Authors.
Associate Editor: Inanc Birol
Received on April 5, 2017; revised on August 1, 2017; editorial decision on August 14, 2017; accepted on August 15, 2017
Abstract
Motivation: Contigs assembled from the second generation sequencing short reads may contain misassemblies, and thus complicate downstream analysis or even lead to incorrect analysis results. Fortunately, with more and more sequenced species available, it becomes possible to use the reference genome of a closely related species to detect misassemblies. In addition, long reads of the third generation sequencing technology have been more and more widely used, and can also help detect misassemblies.
Results: Here, we introduce ReMILO, a reference assisted misassembly detection algorithm that uses both short reads and PacBio SMRT long reads. ReMILO aligns the initial short reads to both the contigs and reference genome, and then constructs a novel data structure called red-black multipositional de Bruijn graph to detect misassemblies. In addition, ReMILO aligns the contigs to long reads and finds their differences from the long reads to detect more misassemblies. In our performance test on short read assemblies of human chromosome 14 data, ReMILO can detect 41.8–77.9% extensive misassemblies and 33.6–54.5% local misassemblies. On hybrid short and long read assemblies of S.pastorianus data, ReMILO can also detect 60.6–70.9% extensive misassemblies and 28.6–54.0% local misassemblies.
Availability and implementation: The ReMILO software can be downloaded for free under Artistic
License 2.0 from this site: https://github.com/songc001/remilo.
Contact: baoe@bjtu.edu.cn
Supplementary information: Supplementary data are available at Bioinformatics online.
1 Introduction
The second generation sequencing technology can produce short reads to sequence a species. The short reads are several hundred bp long and cost below $0.1 per Mbp, and can be assembled into contigs of the target genome. However, the assembled contigs usually contain errors. For example, the contigs and published genomes of S.aureus and P.falciparum assembled from short reads were found to contain 2.0% and 5.1% errors, respectively (Hunt et al., 2013). The introduced errors in contigs are mainly due to genomic repeats and multiploidy, which make it difficult to distinguish short reads from similar genome regions. The errors can be divided into two categories: small errors, i.e. mismatches and small indels, and misassemblies, i.e. rearrangements and/or significantly large indels. The first category could directly affect downstream single nucleotide polymorphism (SNP) analysis in the genome, since the errors are difficult to distinguish from SNPs, while the second category affects structural variation (SV) analysis in the genome (Feuk et al., 2006). Compared to small errors, misassemblies are more challenging to detect, because they are usually much larger than the initial short reads and are also more difficult to distinguish from SVs.
Several algorithms have been published to detect misassemblies (Hunt et al., 2013; Muggli et al., 2015; Walker et al., 2014; Zhu et al., 2015). Depending on the input, they can be divided into the
© The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com
following two categories. (1) Some algorithms take only the initial short reads as input. Both REAPR (Hunt et al., 2013) and Pilon (Walker et al., 2014) align the short reads to contigs and calculate the following statistics to detect misassemblies: (i) read coverage, (ii) number of incorrectly oriented reads (i.e. mate pairs align in an inverted orientation) and (iii) number of partially aligned reads. REAPR calculates differences between expected and observed read coverage for detection, while Pilon calculates changes of read coverage along contigs. (2) Some other algorithms take the initial short reads and an additional data source as input. misSEQuel (Muggli et al., 2015) aligns the short reads to contigs and reassembles the contigs to detect misassemblies. It also aligns the contigs to the corresponding optical mapping data to reduce false detections. misFinder (Zhu et al., 2015) aligns the contigs to the reference genome of a closely related species to detect possible misassemblies. It then aligns the short reads to the contigs and uses similar statistics as above to confirm the misassemblies.
Currently, new opportunities, such as the growing number of sequenced species and the third generation sequencing technology, make it possible to further improve the quality of assembled contigs, although most research seeks to generate longer and more complete contigs rather than to detect misassemblies.
• Because of the low cost of the second generation sequencing technology, more and more species have been sequenced. Hence, when the short reads of a species are being assembled, the reference genome of a closely related species could sometimes be found and used to improve assembly quality. Schneeberger et al. (2011) assemble short reads by aligning them to a reference genome, AlignGraph designed by ourselves (Bao et al., 2014) extends and joins preassembled contigs with a reference genome, and RACA (Kim et al., 2013) and Ragout (Kolmogorov et al., 2014) build scaffolds for preassembled contigs with a single reference genome and multiple reference genomes, respectively. misFinder (Zhu et al., 2015) is the only algorithm to detect misassemblies with a reference genome.
• In order to overcome the read length limitation of the second generation sequencing technology, the PacBio SMRT sequencing technology, as a representative of the third generation sequencing technology, was commercially released in 2010, producing long reads of about 5–15 kbp at about $0.4–0.8 per Mbp (Eid et al., 2009). Considering the relatively higher cost than short reads, long reads of low to moderate coverage are usually used together with short reads for quality assemblies. When the coverage of long reads is low, PBJelly2 (English et al., 2012) fills scaffold gaps using long reads; when the coverage is moderate, Celera (Koren et al., 2012), SPAdes (Bankevich et al., 2012), Cerulean (Deshpande et al., 2013) and dbg2olc (Ye et al., 2014) can assemble long reads together with short reads to generate longer and more complete contigs. However, there has not been any algorithm specifically designed to detect misassemblies using long reads.
Therefore in this paper, we propose a novel algorithm ReMILO, which detects misassemblies in contigs using the corresponding short reads and the reference genome of a closely related species, as well as the corresponding long reads. Muggli et al. (2015) ‘highlight the need to use another source of information . . . to identify misassemblies’. They used optical mapping data as the data source and showed good performance. A limitation of this data source is that the misassembly detection performance is largely dependent on the enzymes used to generate the optical mapping data, and it is complex or difficult to find a good enzyme choice. Therefore, we use the reference genome and long reads as alternative data sources to detect misassemblies. Although the misassembly detection performance is affected by the similarity of the reference genome and the coverage of the long reads, it is much more stable, and a wide range of reference genomes and long read sets can be used to achieve good performance. In addition, the reference genome and long reads cost very little to obtain: the former could be downloaded from a public repository, and low coverage of the latter is usually sufficient. In particular, the latter are readily available without additional cost for those sequencing projects assembling both short and long reads. The novelty of ReMILO is as below.
• ReMILO combines the different data sources for the best performance. (i) The reference genome is long but contains SVs compared to the target genome of the sequencing species, so can be used to detect misassemblies with high sensitivity but relatively low accuracy. (ii) The short reads contain limited differences from the target genome but are short, so can be used to detect misassemblies with high accuracy but relatively low sensitivity. (iii) The long reads are long and rarely contain SVs, but because of the higher cost than short reads, they are usually provided with low to moderate coverage in sequencing projects, so can be used to detect misassemblies with moderate sensitivity and accuracy. Therefore, ReMILO uses the reference genome and short reads together to guarantee sufficient sensitivity and accuracy of misassembly detection, and then uses the long reads to further improve the sensitivity.
• ReMILO constructs a novel data structure called red-black multipositional de Bruijn graph from short read alignments to contigs and reference genome to detect misassemblies. This graph is a variant of the conventional de Bruijn graph (Pevzner et al., 2001; Zerbino and Birney, 2008). The special feature of this graph is that the alignment positions to both the contigs and reference genome are incorporated in its vertices, and can thus help avoid many branched paths in the graph [see Ronen et al. (2012) and Bao et al. (2014) for detailed discussions/illustrations of the incorporated alignment positions for reducing branched paths in de Bruijn graph]. The contigs and this graph have the following correspondences. (i) A contig corresponds to a path in the graph. (ii) A misassembly in the contig corresponds to inconsistent vertices in the path, which are adjacent vertices with close alignment positions to the contig but distant alignment positions to the reference genome. (iii) The true connection of split subcontigs at the misassembly location corresponds to an alternative reliable path connecting the inconsistent vertices. Because of the limited branched paths in the graph, the alternative reliable path connecting the inconsistent vertices is specific enough as an indicator of the misassembly. Therefore, ReMILO uses not only the inconsistent vertices, but also the alternative reliable paths connecting them as indicators of misassemblies.

2 Materials and methods
We align the initial short reads to both the contigs and reference genome. Then we construct a red-black multipositional de Bruijn graph to detect misassemblies. We also align the contigs to long reads and find differences to detect more misassemblies. Finally, we combine the detected misassemblies from both data sources. Below are the details of the algorithm (see Supplementary Section S1 for background knowledge of the red-black positional de Bruijn graph).
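As a rough illustration of vertices that carry alignment positions, the sketch below cuts one aligned read into k-mer vertices, each holding a contig position and a reference genome position, and tests whether two vertices may be joined (the constraints are formalized in Section 2.2). It assumes gap-free alignments, so positions advance by one per base. The names `Vertex`, `read_to_vertices`, `can_join` and the parameter `max_shift` are hypothetical and are not taken from the ReMILO source code.

```python
from typing import List, NamedTuple, Optional


class Vertex(NamedTuple):
    kmer: str        # k read bases
    contig_pos: int  # first contig position, or -1 if aligned directly to the reference
    ref_pos: int     # first reference genome position


def read_to_vertices(read: str, k: int, contig_pos: Optional[int], ref_pos: int) -> List[Vertex]:
    """Cut an aligned read of length l into l - k + 1 vertices.

    Assumes a gap-free alignment (an illustrative simplification).
    Pass contig_pos=None for a read aligned directly to the reference.
    """
    vertices = []
    for i in range(len(read) - k + 1):
        pc = -1 if contig_pos is None else contig_pos + i
        vertices.append(Vertex(read[i:i + k], pc, ref_pos + i))
    return vertices


def can_join(a: Vertex, b: Vertex, max_shift: int) -> bool:
    """Join test: same k-mer and nearby positions (max_shift is a hypothetical name)."""
    same_kmer = a.kmer == b.kmer
    contig_close = (abs(a.contig_pos - b.contig_pos) < max_shift
                    or a.contig_pos == -1 or b.contig_pos == -1)
    ref_close = abs(a.ref_pos - b.ref_pos) < max_shift
    return same_kmer and contig_close and ref_close


# A 7-base read aligned at contig position 0 and reference position 112, with k = 4.
vertices = read_to_vertices("ACGAGCA", k=4, contig_pos=0, ref_pos=112)
print(len(vertices))  # 4
```

Keeping `-1` as the contig position of reads aligned directly to the reference lets such reads be joined with contig-aligned reads through the reference position alone.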
2.1 Short read alignments to contigs and reference genome
We align the short reads to both the contigs and reference genome. To align the short reads to the contigs, Bowtie2 is used, because it is specifically designed for short read alignment to long sequences of the same species (Langmead and Salzberg, 2012). To align the short reads to the reference genome, we apply a contig guided approach, since there has rarely been an aligner specifically designed for short read alignment to long sequences of a closely related species (Bao et al., 2014). The contig guided approach aligns the contigs to the reference genome, so that the short reads aligned to the contigs can be further aligned to the reference genome. Although the contigs are not from the same species as the reference genome, they are much longer than the short reads, so can be aligned to the reference genome while tolerating many differences. As a result, a sufficient number of short reads can be aligned to the reference genome guided by the contigs. Then we align the unaligned short reads directly to the reference genome. The unaligned short reads include those not aligned to the contigs, and those aligned to contigs that are themselves not aligned to the reference genome. BWA-MEM is used to align the contigs to the reference genome, because in our experience it has a good balance between alignment sensitivity and running time among several options (Li and Durbin, 2009). Bowtie2 is used to align the remaining short reads to the reference genome with more relaxed identity settings, e.g. relatively large insert length.
If a short read has multiple alignments of similar identity to the contigs or to the reference genome, we randomly choose one of them. This avoids the alignment bias in which short reads from multiple repeat regions are aligned to a single region. In addition, if a short read is not aligned to the reference genome, we simply discard it. This does not affect the misassembly detection performance much, as long as the majority of the short reads can be aligned (see Section 3.3.1).

2.2 Construction of red-black multipositional de Bruijn graph
We construct $l - k + 1$ connected vertices from an aligned read of length $l$. Each vertex is a k-mer $(s, p_c, p_g)$, where $s$ is $k$ read bases, $p_c$ is the first contig position $s$ is aligned to, and $p_g$ is the first reference genome position $s$ is aligned to. $p_c$ is set to $-1$ if $s$ is aligned directly to the reference genome. We join two vertices $(s, p_c, p_g)$ and $(s', p'_c, p'_g)$ if constraints (1.1)–(1.3) below are met.
(1.1) $s = s'$;
(1.2) $|p_c - p'_c|$ is smaller than the allowable number of shifts, or $p_c = -1$, or $p'_c = -1$;
(1.3) $|p_g - p'_g|$ is smaller than the allowable number of shifts.
Constraints (1.2)–(1.3) guarantee simplicity of the de Bruijn graph, i.e. avoidance of unnecessary branched paths. This is because they avoid false joins of short reads from different genome positions, while also allowing joins of short reads with alignment differences from the same genome position. In addition, constraint (1.2) guarantees completeness of the de Bruijn graph, i.e. existence of necessary paths. This is because it allows joins of short reads aligned directly to the reference genome with those aligned to contigs. Note that even if some short reads are aligned to the reference genome guided by two adjacently aligned contigs in the previous step, there usually exist short reads aligned directly to the reference genome in between that overlap both contig ends, so all of them can be joined to form a complete path. The simplicity and completeness of the de Bruijn graph are crucial for misassembly detection (see Section 2.4).

2.3 Coloring of red-black multipositional de Bruijn graph
For each vertex, we calculate the total number of reads $d$ and the number of incorrectly oriented reads (i.e. mate pairs align in an inverted orientation) $c$ generating its joined vertices as indicators of its reliability. We color a vertex red if $d \notin [D, K]$ or $c > C$, where $[D, K]$ is the acceptable range of read coverage, and $C$ is the maximum number of incorrectly oriented reads; otherwise, we color it black. Depending on whether the vertex is constructed from reads aligned to contigs ($p_c \neq -1$) or from reads aligned directly to the reference genome ($p_c = -1$), the thresholds $D$, $K$ and $C$ are different, because usually more reads can be aligned to contigs than to the reference genome. We calculate the thresholds by applying a sampling based approach (Muggli et al., 2015). This approach samples contigs or reference genome regions with directly aligned reads, finds the distribution of read coverage to calculate $D$ and $K$, and also finds the distribution of the number of incorrectly oriented reads to calculate $C$. Both distributions are normal distributions, so $D$ and $K$ are calculated as the mean minus and plus three times the standard deviation, respectively, and $C$ is calculated as the mean plus three times the standard deviation [see Muggli et al. (2015) for more details].

2.4 Misassembly detection with red-black multipositional de Bruijn graph
After the red-black multipositional de Bruijn graph is constructed and colored, each contig position $p_c$ has a corresponding vertex $(s, p_c, p_g)$ in the graph, so we check adjacent contig positions one by one. A misassembly is detected between two adjacent contig positions $p_c$ and $p'_c = p_c + 1$ with corresponding vertices $(s, p_c, p_g)$ and $(s', p'_c, p'_g)$, respectively, if constraint (2.1) is met and (2.2) or (2.3) is also met.
(2.1) The two vertices are inconsistent, i.e. $|p_g - p'_g| > U$, where $U$ is the distance indicating a possible misassembly;
(2.2) The two vertices are red;
(2.3) The two vertices are connected with an alternative black path of length at least $U$.
The default value of $U$ is set to 85, which is QUAST’s minimum distance deciding a misassembly (Gurevich et al., 2013, see Section 3.3). Constraint (2.1) checks inconsistent vertices with close alignment positions to contigs and distant alignment positions to the reference genome to detect a possible misassembly. Constraint (2.2) checks reliability of the inconsistent vertices to confirm the misassembly. This constraint is based on the simplicity of the red-black multipositional de Bruijn graph, because by avoiding unnecessary branched paths, the vertices can be accurately colored (Muggli et al., 2015). Constraint (2.3) checks an alternative reliable path connecting the inconsistent vertices to confirm the misassembly. This constraint is also based on the simplicity of the red-black multipositional de Bruijn graph, because only by avoiding unnecessary branched paths is the alternative reliable path specific enough as an indicator of the misassembly. In addition, this constraint is also based on the completeness of the red-black multipositional de Bruijn graph, which guarantees existence of the alternative reliable path. In practice, it might be difficult to find the complete path connecting the inconsistent vertices, especially when the path is long, so constraint (2.3) could be relaxed to (2.3’) below.
(2.3’) Each of the two vertices is connected with a black path, and the total length of the two paths is at least $U$.
This relaxed constraint may result in false detections of misassemblies, since the two black paths in (2.3’) may not be subpaths of the
alternative black path in (2.3), but it simplifies the algorithm and can help detect more misassemblies.
Compared to the misassembly detection approach of the misSEQuel algorithm, our approach not only checks the reliability of vertices, but also checks the inconsistent vertices and the alternative reliable paths, so could achieve higher sensitivity and accuracy. Figure 1 illustrates with an example the advantage of the red-black multipositional de Bruijn graph over the red-black positional de Bruijn graph in misassembly detection.

2.5 Misassembly detection using long reads
We also align the long reads to contigs by BWA-MEM. If the long reads are of relatively low quality (Eid et al., 2009), various long read error correctors could be applied to improve the quality (Bao and Lan, 2017; Koren et al., 2012; Salmela and Rivals, 2014), so that they could fit the input requirement of BWA-MEM. Then we check adjacent contig positions one by one. A misassembly is detected between two adjacent positions $p_c$ and $p'_c = p_c + 1$ if constraints (3.1)–(3.2) below are met.
(3.1) $p_c$ and $p'_c$ are aligned to long read positions $p_r$ and $p'_r$, respectively, and $|p_r - p'_r| > W$, where $W$ is the distance indicating a possible misassembly;
(3.2) Constraint (3.1) is met for at least $C$ long reads.
The default value of $W$ is also set to 85. Figure 2 illustrates locating misassemblies of several types using long reads.
Finally, we combine the misassemblies detected using long reads and those detected using the reference genome as the final results. Two detected misassemblies at contig positions close to each other are treated as the same error, and the contig position in the middle is reported as the misassembly location.

2.6 Implementation of the ReMILO software
The ReMILO software is implemented in C++ for the Linux platform. ReMILO’s input includes the contigs, reference genome, short reads and long reads, and its output includes a file recording locations of the detected misassemblies and another file containing split contigs at the misassembly locations. ReMILO can work with or without long reads, depending on whether they are provided.

3 Evaluation
3.1 Experimental design
3.1.1 Test on short read assemblies of human chromosome 14 data
(i) We compared ReMILO to REAPR (Hunt et al., 2013), Pilon (Walker et al., 2014) and misFinder (Zhu et al., 2015) on contigs of human chromosome 14, which is from the GAGE evaluation (Salzberg et al., 2012) (chromosome type: diploid; size: 107.3 Mbp; downloaded from the GAGE website). misSEQuel was not compared in this test, because the optical mapping data was not obtained. The contigs were assembled by short read assemblers ALLPATHS-LG (Gnerre et al., 2011), MaSuRCA (Zimin et al., 2013) and SOAPdenovo2 (Luo et al., 2012), which are typical assemblers supporting short read assemblies. The corresponding short reads of 34× coverage (read length × amount: 101 × 36.5 M bp; from the GAGE website) in fragment library were used for misassembly detection. The reference genome was the chimpanzee chromosome 14 [from the Ensembl FTP site (release 85)]. The long reads from human chromosome 14 were not available, so we aligned long reads of several libraries from the whole human genome (from NCBI accession SRX2010823) to human chromosome 14 by BWA-MEM, and obtained the aligned ones of 10× coverage (read
[Figure 1. Panel (A): Red-Black Positional de Bruijn Graph. Panel (B): Red-Black Multipositional de Bruijn Graph.]
Short reads listed in (A): (ACGA,0,4), (CGAG,1,4), (GAGC,2,4), (AGCA,3,4), (GCAT,4,4), ..., (ACGT,593,2), (CGTG,594,2), (GTGC,595,2), (TGCA,596,2), (ACGT,0,2), (CGTG,1,2), (GTGC,2,2), (TGCA,3,2)
Short reads listed in (B): (ACGA,0,112,4), (CGAG,1,113,4), (GAGC,2,114,4), (AGCA,3,115,4), (GCAA,-1,116,4), (CAAC,-1,117,4), (AACG,-1,118,4), (ACGA,-1,119,4), (CGAG,-1,120,4), (GAGC,-1,121,4), (AGCA,-1,122,4), (GCAT,4,116,4), ..., (ACGT,593,712,2), (CGTG,594,713,2), (GTGC,595,714,2), (TGCA,596,715,2), (ACGT,0,112,2), (CGTG,1,113,2), (GTGC,2,114,2), (TGCA,3,115,2)
Fig. 1. Advantage of the red-black multipositional de Bruijn graph over the positional de Bruijn graph. Compared to the target genome (unknown in dashed rectangle), the contig has a misassembly not containing genome region A'. Contig/genome regions A and B are similar to each other. (A) Short reads are listed with their alignment positions to the contig and multiplicity, constructing a red-black positional de Bruijn graph. The short reads from genome region A' need to be aligned to contig region A, resulting in excessive coverage in region A and a corresponding red path in the graph, to detect the misassembly. However, due to some alignment issue, the short reads are not aligned, resulting in a black path from vertex (ACG, 0) to (CGA, 1) to (GCA, 4), so the misassembly is not detected. In addition, many short reads from genome region B are aligned to contig region A, resulting in insufficient coverage in region B and a corresponding red path in the graph from vertex (ACG, 593) to (GCA, 597), so a false detection occurs in region B. (B) Short reads are listed with their alignment positions to the contig, to the reference genome and multiplicity, constructing a red-black multipositional de Bruijn graph. The short reads from genome region A' are aligned to reference genome region A', resulting in inconsistent vertices (GCA, 4, 116) and (CAT, 5, 124) (|5 − 4| = 1 but |124 − 116| = 8) and an alternative black path connecting them from vertex (CAA, 5, 117) to (GCA, −1, 123) (shaded), so the misassembly is detected. In addition, no false detection occurs, because there are no other inconsistent vertices.
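The inconsistency test illustrated in Figure 1 can be sketched as follows: walk over vertices ordered by contig position and report adjacent pairs whose reference genome positions are far apart. This is a minimal sketch, not the ReMILO implementation. The function name `inconsistent_pairs`, the parameter `max_ref_gap` and the toy threshold of 5 are assumptions, chosen so that the small example in the caption (reference positions 116 and 124) triggers; the text gives 85 as the default distance.

```python
from typing import List, Tuple


def inconsistent_pairs(positions: List[Tuple[int, int]], max_ref_gap: int) -> List[Tuple[int, int]]:
    """Find adjacent contig positions with distant reference positions.

    positions: (contig_pos, ref_pos) per vertex, sorted by contig_pos.
    Returns pairs (pc, pc + 1) whose reference positions differ by more
    than max_ref_gap (a hypothetical name for the distance threshold).
    """
    hits = []
    for (pc, pg), (pc_next, pg_next) in zip(positions, positions[1:]):
        if pc_next == pc + 1 and abs(pg - pg_next) > max_ref_gap:
            hits.append((pc, pc_next))
    return hits


# Contig path of panel (B): contig positions 0..4 map to reference
# positions 112..116, while contig position 5 maps to 124.
path = [(0, 112), (1, 113), (2, 114), (3, 115), (4, 116), (5, 124)]
print(inconsistent_pairs(path, max_ref_gap=5))  # [(4, 5)]
```

A pair reported here is only a candidate; in the text it is confirmed either by the colors of the two vertices or by an alternative black path between them.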
[Figure 2: panels (A), (B), (C) and (D)]
Fig. 2. Illustrations on locating misassemblies of several types using long reads (only one long read is used in each illustration for simplicity). (A) A transposition of contig regions A and B results in two misassemblies between $B^-$ and $A^+$ and between $A^-$ and $B'^+$. These misassemblies can be detected with adjacent contig positions p1 and p2 (or p3 and p4) aligned to long read positions p1 and p2 (or p3 and p4) of a distance, respectively. (B) An inversion of contig region B results in two misassemblies between $A^-$ and $B^-$ and between $B^+$ and $B'^+$. These misassemblies can be detected with adjacent contig positions p1 and p2 (or p3 and p4) aligned to long read positions p1 and p2 (or p3 and p4) of a distance, respectively. (C) A collapsed contig region B with B' results in one misassembly between $B^-$ and $C^+$, which can be detected with adjacent contig positions p1 and p2 aligned to long read positions p1 and p2 of a distance, respectively. (D) An expanded contig region B'' from B' results in one misassembly between $B'^-$ and $B''^+$, which can be detected with adjacent contig positions p1 and p2 aligned to long read positions p1 and p2 of a distance, respectively.
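The long read check behind these illustrations can be sketched as below: for each pair of adjacent contig positions, count the long reads on which the two positions land far apart, and report the pair when enough reads agree. This is an illustrative sketch under stated assumptions. Each long read alignment is represented as a dictionary from contig position to long read position, which is a simplification rather than the format ReMILO uses, and the names `detect_with_long_reads`, `max_read_gap` and `min_reads` (with its default of 2) are hypothetical; only the distance default of 85 comes from the text.

```python
from typing import Dict, List


def detect_with_long_reads(alignments: List[Dict[int, int]], contig_length: int,
                           max_read_gap: int = 85, min_reads: int = 2) -> List[int]:
    """Return contig positions pc where (pc, pc + 1) looks misassembled.

    alignments: one dict per long read, mapping contig position to the
    long read position it aligns to (assumed representation).
    """
    hits = []
    for pc in range(contig_length - 1):
        # Count long reads on which the two adjacent contig positions are far apart.
        support = sum(
            1 for aln in alignments
            if pc in aln and pc + 1 in aln
            and abs(aln[pc] - aln[pc + 1]) > max_read_gap
        )
        if support >= min_reads:
            hits.append(pc)
    return hits


# Two long reads see a jump of 191 between contig positions 9 and 10;
# a third read aligns without any jump.
broken = {i: i for i in range(10)}
broken.update({i: i + 190 for i in range(10, 20)})
linear = {i: i for i in range(20)}
print(detect_with_long_reads([broken, dict(broken), linear], contig_length=20))  # [9]
```

Requiring agreement from several reads is what keeps a single chimeric or misaligned long read from producing a detection.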
length × amount: 1851 × 591.2k bp). The long reads were error corrected by HALC, a long read error corrector designed by ourselves (Bao and Lan, 2017). (ii) We also compared ReMILO to the existing algorithms on synthetic contigs. The synthetic contigs were generated from the initial contigs by combining them and adding relatively large indels. Compared to misassemblies in the initial contigs, the introduced misassemblies are a little more explicit, but have known locations, so can be used to compare the algorithms in a more accurate manner. In the following discussion, the synthetic contigs are referred to as syntigs to distinguish them from the initial ones. (iii) In addition, on the contigs, we varied the reference chromosomes inputted to ReMILO, including the chromosomes of gorilla, orangutan, gibbon and macaque [from the Ensembl FTP site (release 85)], to see the impact of the reference genomes upon misassembly detection results. Similarity of the chromosomes to the human chromosome 14 drops from chimpanzee to macaque, and is quantified as percentages of the human chromosome 14 short reads alignable to the chromosomes.

3.1.2 Test on short read assemblies of japonica rice data
(i) In order to compare ReMILO with misSEQuel with or without an additional data source inputted (long reads to ReMILO and optical mapping data to misSEQuel), we used a japonica rice data set (genome type: diploid; size: 374.5 Mbp; downloaded from NCBI accession GCA_001623365.1). The contigs were assembled from short reads of 55× coverage (read length × amount: 76 × 268.9 M bp; from NCBI accession SRX032913) by short read assemblers IDBA (Peng et al., 2012), SPAdes (Bankevich et al., 2012) and Velvet (Zerbino and Birney, 2008). The selection of assemblers was consistent with Muggli et al. (2015), excluding ABySS and SOAPdenovo2 whose contigs contain few misassemblies. All the short reads were used for misassembly detection. The reference genome was the indica rice genome [from the Ensembl Plants FTP site (release 33)]. The long reads were of 20× coverage (read length × amount: 2950 × 2520.6k bp; from NCBI accession SRX1897300) and error corrected by HALC. The optical mapping data was queried and obtained from Kawahara et al. (2013). (ii) In addition, we varied the long read coverage from 10× to 40× to see the impact of the long read coverage upon misassembly detection results.

3.1.3 Test on short and long read hybrid assemblies of S.pastorianus data
(i) Because long reads of relatively higher coverage can not only be inputted to ReMILO to detect misassemblies but also be assembled together with short reads, we also compared ReMILO to REAPR, Pilon and misFinder on hybrid S.pastorianus contigs assembled from both short and long reads (genome type: triploid; size: 18.7 Mbp; downloaded from NCBI accession AZCJ00000000.1). The contigs were assembled from short reads of 72× coverage (read length × amount: 300 × 2.7 M bp; from NCBI accession DRX036591) and long reads of 37× coverage (read length × amount: 2942 × 244k bp) by assembler SPAdes, which is a typical assembler supporting hybrid assembly with either uncorrected or corrected long reads as input. The long reads were simulated by the PacBio read simulator PBSIM (Ono et al., 2013), and then inputted to SPAdes directly, and alternatively, corrected by HALC and then inputted to SPAdes. All the short and corrected long reads were used for misassembly detection. The reference genome was the S.cerevisiae genome (from NCBI accessions NC_001133.9-NC_001148.4). (ii) In addition, again, we compared ReMILO to the existing algorithms on syntigs generated from the contigs by combining them and adding relatively large indels.

An additional test on short and long read hybrid assemblies of A.thaliana data is described in Supplementary Section S2, and the results are shown in Supplementary Section S3. All the software above was run with default settings. Assembly statistics are listed in Supplementary Table S1. The statistics may have some differences from those reported previously in Salzberg et al. (2012) and Muggli et al. (2015), probably because of the updated assembler versions.
There are two things to note in these tests. (i) Although we used quite a few assemblers, the purpose of this paper is not to compare these assemblers, but to show ReMILO’s performance on contigs produced by different assemblers. (ii) The reference genomes or chromosome inputted to ReMILO can also be used to extend the assembled contigs (Bao et al., 2014) or build scaffolds (Kim et al., 2013; Kolmogorov et al., 2014), but the extended contigs are usually high quality ones with limited misassemblies and the scaffolds do not change the initial contigs, so we did not run ReMILO on the extended contigs or scaffolds.
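The coverage values quoted in this section agree with the usual definition of sequencing coverage, read length times number of reads divided by genome size. The short check below applies that standard definition (it is not stated explicitly in the text) to the human chromosome 14 and japonica rice figures; the function name `coverage` is hypothetical.

```python
def coverage(read_length_bp: float, num_reads: float, genome_size_bp: float) -> float:
    """Sequencing coverage as total sequenced bases over genome size."""
    return read_length_bp * num_reads / genome_size_bp


# Human chromosome 14 (107.3 Mbp): short reads 101 bp x 36.5 M, long reads 1851 bp x 591.2 k.
human_short = coverage(101, 36.5e6, 107.3e6)
human_long = coverage(1851, 591.2e3, 107.3e6)
# Japonica rice (374.5 Mbp): short reads 76 bp x 268.9 M.
rice_short = coverage(76, 268.9e6, 374.5e6)

print(round(human_short), round(human_long), round(rice_short))  # 34 10 55
```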
Table 1. Evaluation of misassembly detection performance on the human chromosome 14 data
Algorithm TPR (extensive) TPR (local) FPR
(a) Contigs assembled by ALLPATHS-LG

REAPR 39.8%(39/98) 25.8%(34/132) 27.6%(2324/8434)
Pilon 35.7%(35/98) 18.9%(25/132) 24.2%(2037/8434)
misFinder 38.8%(38/98) 24.2%(32/132) 23.2%(1953/8434)
ReMILO 41.8%(41/98) 35.6%(47/132) 21.8%(1836/8434)

(b) Contigs assembled by MaSuRCA
REAPR 77.4%(1129/1459) 19.9%(86/432) 23.2%(1789/7711)
Pilon 64.6%(943/1459) 20.6%(89/432) 28.5%(2196/7711)
misFinder 71.7%(1046/1459) 8.1%(35/432) 11.5%(886/7711)
ReMILO 77.9%(1136/1459) 33.6%(145/432) 11.1%(859/7711)
(c) Contigs assembled by SOAPdenovo2
REAPR 54.0%(2876/5327) 33.8%(1376/4067) 30.3%(2647/8740)
Pilon 47.8%(2544/5327) 35.5%(1443/4067) 26.6%(2329/8740)
misFinder 52.7%(2805/5327) 32.6%(1324/4067) 23.2%(2030/8740)
ReMILO 61.5%(3276/5327) 50.8%(2065/4067) 15.7%(1376/8740)
(a’) Syntigs generated from (a)
REAPR 55.8%(122/219) 47.0%(103/219) 0.4%(34/8434)
Pilon 53.9%(118/219) 45.2%(99/219) 0.5%(39/8434)
misFinder 58.9%(129/219) 41.6%(91/219) 0.5%(41/8434)
ReMILO 58.0%(127/219) 51.1%(112/219) 0.6%(48/8434)
(b’) Syntigs generated from (b)
REAPR 73.7%(701/951) 46.8%(445/951) 1.4%(109/7711)
Pilon 68.8%(654/951) 46.2%(439/951) 1.6%(121/7711)
misFinder 71.1%(676/951) 37.7%(359/951) 1.5%(116/7711)
ReMILO 75.1%(714/951) 49.3%(469/951) 1.3%(99/7711)
(c’) Syntigs generated from (c)
REAPR 66.7%(362/543) 58.7%(319/543) 0.6%(51/8740)
Pilon 68.3%(371/543) 55.6%(302/543) 1.0%(87/8740)
misFinder 66.5%(361/543) 53.2%(289/543) 0.8%(71/8740)
ReMILO 71.8%(390/543) 54.5%(296/543) 0.7%(65/8740)
Note: The contigs are assembled by the short read assemblers ALLPATHS-LG, MaSuRCA and SOAPdenovo2, and the syntigs are synthesized from them.
The performance of ReMILO is compared to REAPR, Pilon and misFinder. TPR (extensive) is the true positive rate of extensive misassemblies, TPR (local) is the
true positive rate of local misassemblies, and FPR is the false positive rate of misassemblies. The best value for each column is shown in boldface.
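Each cell in Table 1 pairs a percentage with the counts behind it, so the rates can be recomputed directly from the counts. The following is a minimal sketch of that arithmetic, not code from ReMILO: the class and function names are hypothetical, and the combined TPR is not reported in the table but follows the definition given in Section 3.2 (correct detections over all misassemblies, extensive and local combined).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectionCounts:
    """Counts for one algorithm on one contig set (hypothetical container)."""
    correct_extensive: int   # correctly detected extensive misassemblies
    total_extensive: int     # extensive misassemblies located by QUAST
    correct_local: int       # correctly detected local misassemblies
    total_local: int         # local misassemblies located by QUAST
    incorrect: int           # incorrectly detected misassemblies
    max_possible: int        # estimated maximum number of possible misassemblies


def percent(numerator: int, denominator: int) -> float:
    """Return numerator/denominator as a percentage rounded to one decimal."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return round(100.0 * numerator / denominator, 1)


def rates(c: DetectionCounts) -> dict:
    """Compute TPR (extensive), TPR (local), combined TPR and FPR."""
    return {
        "TPR (extensive)": percent(c.correct_extensive, c.total_extensive),
        "TPR (local)": percent(c.correct_local, c.total_local),
        "TPR": percent(c.correct_extensive + c.correct_local,
                       c.total_extensive + c.total_local),
        "FPR": percent(c.incorrect, c.max_possible),
    }


# Counts taken from the ReMILO row of Table 1(a), contigs by ALLPATHS-LG.
remilo_allpaths = DetectionCounts(41, 98, 47, 132, 1836, 8434)
print(rates(remilo_allpaths))
```

The FPR denominator is passed in as a count because the table already lists it; how that estimate is derived from contig length is described in Section 3.2 and is not reimplemented here.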
3.2 Performance measurements
We used QUAST to locate both true extensive misassemblies [MA (extensive)] and local misassemblies [MA (local)] in the contigs (Gurevich et al., 2013). QUAST aligns the contigs to the corresponding target genome or chromosome (i.e. human chromosome 14, japonica rice genome, S.pastorianus genome or A.thaliana genome), and checks flanking subcontigs aligned with distances [see Gurevich et al. (2013) for detailed definitions of the misassemblies]. Accordingly, a contig is an extensively misassembled contig [MC (extensive)], if it contains at least one extensive misassembly; a contig is a locally misassembled contig [MC (local)], if it contains at least one local misassembly. Note that one contig could be both an extensively misassembled contig and a locally misassembled contig.
Compared to the located errors and contigs, we made the following measurements. Here, a correctly detected misassembly (or misassembled contig) is a misassembly (or contig) located by QUAST, while an incorrectly detected misassembly (or misassembled contig) is a misassembly (or contig) not located by QUAST or any other algorithm. (i) True positive rate of extensive misassemblies [TPR (extensive)] is the number of correctly detected extensive misassemblies over the total number of extensive misassemblies. (ii) True positive rate of local misassemblies [TPR (local)] is the number of correctly detected local misassemblies over the total number of local misassemblies. (iii) True positive rate of misassemblies (TPR) is the number of correctly detected misassemblies over the total number of misassemblies (combining both extensive and local misassemblies). (iv) False positive rate of misassemblies (FPR) is the number of incorrectly detected misassemblies over the maximum number of possible misassemblies. The maximum number of possible misassemblies is estimated as the total number of contig bases over the average distance between two misassemblies. (v) True positive rate of extensively misassembled contigs [TPRC (extensive)] is the number of correctly detected extensively misassembled contigs over the total number of such contigs. (vi) True positive rate of locally misassembled contigs [TPRC (local)] is the number of correctly detected locally misassembled contigs over the total number of such contigs. (vii) False positive rate of contigs (FPRC) is the number of incorrectly detected misassembled contigs over the total number of correct contigs without misassemblies.
For the syntigs, we had known locations of the introduced true MA (extensive) and MA (local). Therefore, we made the same measurements as above, despite that a correctly detected misassembly is a misassembly overlapping an introduced true error, while an incorrectly detected misassembly is a misassembly not overlapping any introduced true error or detected in the corresponding contigs.

3.3 Results
3.3.1 Results on short read assemblies of human chromosome 14 data
The results on various contig/syntig sets assembled by different short read assemblers or further synthesized are listed in Table 1. On the contigs, ReMILO can detect 41.8–77.9% extensive misassemblies and 33.6–50.8% local misassemblies with 11.1–21.8% false
detections. Compared to the existing algorithms, ReMILO detects more misassemblies with fewer false detections. On the syntigs, ReMILO can detect 58.0–75.1% extensive misassemblies and 49.3–54.5% local misassemblies with 0.6–1.3% false detections. Overall, ReMILO detects more misassemblies with a compatible amount of false detections, and all the algorithms’ performance is a little better than that on the contigs, mainly because the introduced misassemblies are more explicit. These results indicate ReMILO is sensitive and accurate in detecting misassemblies in contigs/syntigs generated from short reads.
With the various reference chromosomes, on the contigs, ReMILO’s true positive and false positive rates of misassemblies are plotted in Figure 3A (see Supplementary Table S2 for raw numbers of the plot). Both rates are relatively stable despite the decrease in reference chromosome similarity from chimpanzee to macaque. Only with the least similar macaque chromosome, the true positive rates show dramatic drops. These results indicate ReMILO is not so much dependent on the reference genome, so given the contigs of one species, the reference genomes of a relatively large range of its closely related species can be used to detect misassemblies.

[Figure 3. Panel A: TPR and FPR with various reference chromosomes; series for ALLPATHS-LG, MaSuRCA and SOAPdenovo2; x-axis: Chimpanzee (94.5%), Gorilla (91.6%), Orangutan (88.9%), Gibbon (49.9%), Macaque (29.9%); y-axis: 0–100%. Panel B: TPRC and FPRC with various long read coverage; series for IDBA, SPAdes and Velvet; x-axis: 0, 10x, 20x, 30x, 40x; y-axis: 0–100%.]
Fig. 3. ReMILO’s true positive rate and false positive rate of misassemblies (TPR and FPR, respectively) on contigs with various reference chromosomes of the human chromosome 14 data (A) and true positive rate and false positive rate of misassembled contigs (TPRC and FPRC, respectively) with various long read coverage of the japonica rice data (B). Similarity of the chromosomes to the human chromosome 14 (quantified as percentages of the human chromosome 14 short reads alignable to the chromosomes) is listed together with the species names

3.3.2 Results on short read assemblies of japonica rice data
The results on various contig sets assembled by different short read assemblers are listed in Table 2. With long reads inputted, ReMILO can detect 42.3–57.5% extensively misassembled contigs and 34.1–43.8% locally misassembled contigs with 7.3–19.6% false detections. Compared to misSEQuel with optical mapping data inputted, ReMILO detects more misassembled contigs but it also makes more false detections; compared to misSEQuel without optical mapping data inputted, ReMILO detects fewer misassembled contigs but it also makes fewer false detections. These results indicate ReMILO keeps a balance between sensitivity and accuracy in misassembly detection. In addition, compared to itself without long reads inputted, ReMILO obtains 4.2–5.8% higher true positive rates of extensively misassembled contigs and 6.4–10.2% higher true positive rates of
Table 2. Evaluation of misassembly detection performance on the japonica rice data
Algorithm TPRC (extensive) TPRC (local) FPRC
(a) Contigs assembled by IDBA

misSEQuel- 100.0%(1336/1336) 100.0%(434/434) 93.6%(15 713/16 791)
misSEQuel 22.3%(298/1336) 26.0%(113/434) 11.3%(1896/16 791)
ReMILO- 45.7%(610/1336) 27.7%(120/434) 16.2%(2716/16 791)
ReMILO 49.9%(666/1336) 34.1%(148/434) 17.3%(2910/16 791)
(b) Contigs assembled by SPAdes
misSEQuel- 100.0%(1958/1958) 100.0%(144/144) 95.2%(15 037/15 804)
misSEQuel 20.7%(405/1958) 22.9%(33/144) 12.4%(1952/15 804)
ReMILO- 51.7%(1013/1958) 36.1%(52/144) 17.2%(2711/15 804)
ReMILO 57.5%(1125/1958) 43.8%(63/144) 19.6%(3090/15 804)
(c) Contigs assembled by Velvet
misSEQuel- 100.0%(638/638) 100.0%(49/49) 96.8%(5518/5700)
misSEQuel 10.2%(65/638) 14.3%(7/49) 3.9%(222/5700)
ReMILO- 37.8%(241/638) 30.6%(15/49) 6.2%(351/5700)
ReMILO 42.3%(270/638) 40.8%(20/49) 7.3%(418/5700)
Note: The contigs are assembled by various short read assemblers IDBA, SPAdes and Velvet. The performance of ReMILO with or without long reads inputted
(ReMILO-) is compared to misSEQuel with or without optical mapping data inputted (misSEQuel-). TPRC (extensive) is the true positive rate of extensively mis-
assembled contigs, TPRC (local) is the true positive rate of locally misassembled contigs, and FPRC is the false positive rate of contigs. The best value for each col-
umn excluding that of misSEQuel- is shown in boldface.
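The effect of adding long reads can be read off Table 2 by subtracting each ReMILO- row from the matching ReMILO row. The sketch below does that subtraction on the percentages copied from the table; it is an illustration only, the function names are hypothetical, and the differences are percentage points (absolute differences between two rates), not relative improvements.

```python
# Percentages copied from Table 2 as (TPRC extensive, TPRC local, FPRC).
TABLE2 = {
    "IDBA":   {"ReMILO-": (45.7, 27.7, 16.2), "ReMILO": (49.9, 34.1, 17.3)},
    "SPAdes": {"ReMILO-": (51.7, 36.1, 17.2), "ReMILO": (57.5, 43.8, 19.6)},
    "Velvet": {"ReMILO-": (37.8, 30.6, 6.2),  "ReMILO": (42.3, 40.8, 7.3)},
}


def long_read_gain(table):
    """Per assembler, ReMILO minus ReMILO- for each of the three columns."""
    gains = {}
    for assembler, rows in table.items():
        without_lr, with_lr = rows["ReMILO-"], rows["ReMILO"]
        # Round to one decimal to match the table's precision.
        gains[assembler] = tuple(
            round(w - wo, 1) for w, wo in zip(with_lr, without_lr)
        )
    return gains


def column_range(gains, column):
    """Smallest and largest gain across assemblers for one column index."""
    values = [g[column] for g in gains.values()]
    return min(values), max(values)


gains = long_read_gain(TABLE2)
for name, index in (("TPRC (extensive)", 0), ("TPRC (local)", 1), ("FPRC", 2)):
    low, high = column_range(gains, index)
    print(f"{name}: +{low} to +{high} percentage points")
```

Working from the rounded percentages rather than the raw counts keeps the sketch short; recomputing from the counts in the table would be the more careful choice if the last decimal mattered.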
Table 3. Evaluation of misassembly detection performance on the S.pastorianus data
Algorithm TPR (extensive) TPR (local) FPR
(a) Contigs assembled by SPAdes

REAPR 55.4%(413/746) 25.0%(21/84) 7.7%(126/1636)
Pilon 53.4%(398/746) 16.7%(14/84) 4.6%(76/1636)
misFinder 46.4%(346/746) 4.8%(4/84) 5.4%(89/1636)
ReMILO 60.6%(452/746) 28.6%(24/84) 8.8%(144/1636)

(b) Contigs assembled by SPAdes (cor)
REAPR 51.9%(28/54) 35.4%(34/96) 3.7%(54/1462)
Pilon 57.4%(31/54) 33.3%(32/96) 4.0%(58/1462)
misFinder 37.0%(20/54) 16.7%(16/96) 2.3%(33/1462)
ReMILO 64.8%(35/54) 41.7%(40/96) 4.7%(68/1462)
(a’) Syntigs generated from (a)
REAPR 71.2%(301/423) 44.7%(189/423) 5.6%(92/1636)
Pilon 63.6%(269/423) 41.6%(176/423) 6.2%(101/1636)
misFinder 59.3%(251/423) 34.3%(145/423) 4.6%(76/1636)
ReMILO 70.9%(300/423) 47.5%(201/423) 9.4%(153/1636)
(b’) Syntigs generated from (b)
REAPR 64.0%(96/150) 52.0%(78/150) 4.6%(68/1462)
Pilon 66.7%(100/150) 47.3%(71/150) 3.6%(53/1462)
misFinder 57.3%(86/150) 46.0%(69/150) 2.9%(42/1462)
ReMILO 70.0%(105/150) 54.0%(81/150) 5.4%(79/1462)
Note: The contigs are assembled by the short and long read hybrid assembler SPAdes, and the syntigs are synthesized from them. In (a), the contigs are assembled
by SPAdes inputted with the uncorrected long reads, while in (b), by SPAdes inputted with the corrected ones [SPAdes (cor)]. The performance of ReMILO is compared
to REAPR, Pilon and misFinder. Again, TPR (extensive) is the true positive rate of extensive misassemblies, TPR (local) is the true positive rate of local misassemblies,
and FPR is the false positive rate of misassemblies. The best value for each column is shown in boldface.
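For the syntig rows (a’) and (b’), Section 3.2 counts a detection as correct when it overlaps an introduced error. The sketch below shows one way such overlap matching could be done; it is not ReMILO's evaluation code. It assumes that detections and introduced errors are half-open intervals on a single syntig, counts each introduced error at most once, and ignores the additional clause about detections also found in the corresponding contigs. All names are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on one syntig (assumption)


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two half-open intervals share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def score_syntig(detected: List[Interval],
                 introduced: List[Interval]) -> Tuple[int, int]:
    """Return (introduced errors hit by a detection, detections hitting none)."""
    found = [t for t in introduced
             if any(overlaps(d, t) for d in detected)]
    false_calls = [d for d in detected
                   if not any(overlaps(d, t) for t in introduced)]
    return len(found), len(false_calls)


# Toy example: two introduced errors, two reported detections.
introduced = [(1000, 1200), (5000, 5100)]
detected = [(1150, 1300), (3000, 3050)]
found, false_calls = score_syntig(detected, introduced)
print(found, "of", len(introduced), "introduced errors found;",
      false_calls, "false detection(s)")
```

The quadratic scan is adequate for the few hundred introduced errors per syntig set listed in Table 3; sorted intervals or an interval tree would be the usual replacement at larger scale.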
locally misassembled contigs, despite 1.1–2.4% higher false positive rates of contigs. These results indicate ReMILO is more sensitive using the long reads. Note that here we compared misassembled contigs rather than misassemblies, because misSEQuel has limited support on reporting misassemblies.
With the various long read coverage, ReMILO’s true positive and false positive rates of misassembled contigs are plotted in Figure 3B (see Supplementary Table S3 for raw numbers of the plot). Both true positive rates and false positive rates increase with the long read coverage, but the former increase to a much larger extent than the latter. These results indicate ReMILO is more sensitive without losing much accuracy with more long reads inputted. Hence, long reads of relatively high coverage are preferred for misassembly detection.

3.3.3 Results on short and long read hybrid assemblies of S.pastorianus data
The results on various contig/syntig sets assembled by SPAdes or further synthesized are listed in Table 3. On the contigs, ReMILO can detect 60.6–64.8% extensive misassemblies and 28.6–41.7% local misassemblies with 4.7–8.8% false detections; on the syntigs, ReMILO can detect 70.0–70.9% extensive misassemblies and 47.5–54.0% local misassemblies with 5.4–9.4% false detections. Overall, compared to the existing algorithms, ReMILO detects more misassemblies with a compatible amount of false detections. These results indicate ReMILO is also sensitive and accurate in detecting misassemblies in hybrid contigs/syntigs generated from both short and long reads.

3.3.4 Running time and memory usage
On contigs of the human chromosome 14, japonica rice and S.pastorianus data, ReMILO’s running time is 11.1–11.5, 17.5–21.1 and 7.3–7.7 h, respectively, and ReMILO’s memory usage is 4.7–4.9, 12.9–14.0 and 2.5–2.7 GB, respectively. About 40% of the total running time is for short read alignments to contigs and reference genome by Bowtie2 and BWA-MEM, about 20% is for constructing red-black multipositional de Bruijn graph to detect misassemblies, about 30% is for contig alignment to long reads by BWA-MEM, and about 10% is for using long reads to detect more misassemblies. For memory usage, the peak memory usage usually appears during contig alignment to long reads. These results indicate ReMILO is sufficiently fast and memory efficient working on various data, and can thus be practically used.

4 Conclusions
This paper introduces ReMILO, a reference assisted misassembly detection algorithm that uses both short and long reads. ReMILO constructs a red-black multipositional de Bruijn graph of simplicity and completeness from short read alignments to the contigs and reference genome. ReMILO checks the graph for inconsistent vertices and alternative reliable paths connecting them to detect misassemblies. In addition, ReMILO also uses long reads to detect more misassemblies. Experimental results demonstrate that ReMILO is sensitive and accurate. In the future, we will expand ReMILO in the following aspects. (i) We will provide support for additional variant-aware aligners for both short reads and contigs. (ii) The contig splitting at misassembly locations will be extended to further correctly join the split contigs. (iii) Additional data sources such as 10x Genomics linked reads will be supported (Zheng et al., 2016).

Acknowledgements
We thank Martin Muggli from the Colorado State University and Alexey Gurevich from the St. Petersburg Academic University for the discussions about misassemblies. We thank Takeshi Itoh from the National Institute of Agrobiological Sciences for providing us the optical mapping data of japonica rice, and also thank Shiguo Zhou from the Genome Surveillance, Inc. for answering our questions about the data. We thank Thomas Girke and Tao Jiang from the University of California, Riverside for the suggestions during
improvement of this work. We acknowledge the support of core facilities at the Institute for Integrative Genome Biology (IIGB), the University of California, Riverside.

Funding
This work has been supported by grants from the National Science Foundation of China [61502027 to E.B.], and the Fundamental Research Funds for the Central Universities [2015RC045 to E.B.].
Conflict of Interest: none declared.

References
Bankevich,A. et al. (2012) Spades: a new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol., 19, 455–477.
Bao,E. and Lan,L. (2017) Halc: High throughput algorithm for long read error correction. BMC Bioinformatics, 18, 204.
Bao,E. et al. (2014) Aligngraph: algorithm for secondary de novo genome assembly guided by closely related references. Bioinformatics, 30, i319–i328.
Deshpande,V. et al. (2013) Cerulean: a hybrid assembly using high throughput short and long reads. In: Darling,A., Stoye,J. (eds.) Algorithms in Bioinformatics. Springer, Berlin Heidelberg, pp. 349–363.
Eid,J. et al. (2009) Real-time DNA sequencing from single polymerase molecules. Science, 323, 133–138.
English,A.C. et al. (2012) Mind the gap: upgrading genomes with pacific biosciences rs long-read sequencing technology. PloS One, 7, e47768.
Feuk,L. et al. (2006) Structural variation in the human genome. Nat. Rev. Genet., 7, 85–97.
Gnerre,S. et al. (2011) High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proc. Natl. Acad. Sci. USA, 108, 1513–1518.
Gurevich,A. et al. (2013) Quast: quality assessment tool for genome assemblies. Bioinformatics, btt086.
Hunt,M. et al. (2013) Reapr: a universal tool for genome assembly evaluation. Genome Biol., 14, 1.
Kawahara,Y. et al. (2013) Improvement of the oryza sativa nipponbare reference genome using next generation sequence and optical map data. Rice, 6, 1.
Kim,J. et al. (2013) Reference-assisted chromosome assembly. Proc. Natl. Acad. Sci. USA, 110, 1785–1790.
Kolmogorov,M. et al. (2014) Ragout a reference-assisted assembly tool for bacterial genomes. Bioinformatics, 30, i302–i309.
Koren,S. et al. (2012) Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nat. Biotechnol., 30, 693–700.
Langmead,B. and Salzberg,S.L. (2012) Fast gapped-read alignment with bowtie 2. Nat. Methods, 9, 357–359.
Li,H. and Durbin,R. (2009) Fast and accurate short read alignment with burrows–wheeler transform. Bioinformatics, 25, 1754–1760.
Luo,R. et al. (2012) Soapdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience, 1, 18.
Muggli,M.D. et al. (2015) Misassembly detection using paired-end sequence reads and optical mapping data. Bioinformatics, 31, i80–i88.
Ono,Y. et al. (2013) Pbsim: Pacbio reads simulator toward accurate genome assembly. Bioinformatics, 29, 119–121.
Peng,Y. et al. (2012) Idba-ud: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics, 28, 1420–1428.
Pevzner,P. et al. (2001) An eulerian path approach to dna fragment assembly. Proc. Natl. Acad. Sci. USA, 98, 9748.
Ronen,R. et al. (2012) Sequel: improving the accuracy of genome assemblies. Bioinformatics, 28, i188–i196.
Salmela,L. and Rivals,E. (2014) Lordec: accurate and efficient long read error correction. Bioinformatics, btu538.
Salzberg,S.L. et al. (2012) Gage: A critical evaluation of genome assemblies and assembly algorithms. Genome Res., 22, 557–567.
Schneeberger,K. et al. (2011) Reference-guided assembly of four diverse arabidopsis thaliana genomes. Proc. Natl. Acad. Sci. USA, 108, 10249–10254.
Walker,B.J. et al. (2014) Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PloS One, 9, e112963.
Ye,C. et al. (2014) Dbg2olc: Efficient assembly of large genomes using the compressed overlap graph. arXiv preprint arXiv: 1410.2801.
Zerbino,D. and Birney,E. (2008) Velvet: algorithms for de novo short read assembly using de bruijn graphs. Genome Res., 18, 821–829.
Zheng,G.X. et al. (2016) Haplotyping germline and cancer genomes with high-throughput linked-read sequencing. Nat. Biotechnol., 34, 303–311.
Zhu,X. et al. (2015) misfinder: identify mis-assemblies in an unbiased manner using reference and paired-end reads. BMC Bioinformatics, 16, 1.
Zimin,A.V. et al. (2013) The masurca genome assembler. Bioinformatics, 29, 2669–2677.
