DNA Sequencing Technologies: Sequencing Data Protocols and Bioinformatics Tools

Article in ACM Computing Surveys · September 2019
DOI: 10.1145/3340286


DNA Sequencing Technologies: Sequencing Data Protocols
and Bioinformatics Tools
KA-CHUN WONG∗ , City University of Hong Kong, Hong Kong SAR
JIAO ZHANG, SHANKAI YAN, XIANGTAO LI, QIUZHEN LIN, and SAM KWONG, City
University of Hong Kong, Hong Kong SAR
CHENG LIANG, Shandong Normal University, China
The recent advances in DNA sequencing technology, from first generation sequencing (FGS) to third
generation sequencing (TGS), have constantly transformed the genome research landscape. Their data
throughput is unprecedented and several-fold higher than that of past technologies. DNA sequencing
technologies generate sequencing data that are big, sparse, and heterogeneous. This has led to the rapid
development of various data protocols and bioinformatics tools for handling sequencing data.
In this review, a historical snapshot of DNA sequencing is taken with an emphasis on data manipulation
and tools. The technological history of DNA sequencing is described and reviewed in thorough detail. To
manipulate the sequencing data generated, different data protocols are introduced and reviewed. In particular,
data compression methods, which have been largely ignored by most of the existing reviews, are highlighted
and discussed to give readers a practical perspective on real-world settings. A large variety of
bioinformatics tools are also reviewed to help readers extract the most from their sequencing data in different
aspects such as sequencing quality control, genomic visualization, single nucleotide variant calling, INDEL
calling, structural variation calling, and integrative analysis. Towards the end, we critically discuss the
pitfalls of the existing DNA sequencing technologies and their potential solutions.
CCS Concepts: • Applied computing → Life and medical sciences;
Additional Key Words and Phrases: DNA Sequencing, Third Generation Sequencing (TGS), History, Technology,
Data Protocols, Bioinformatics, Computational Biology, Tools, Software
ACM Reference Format:
Ka-Chun Wong, Jiao Zhang, Shankai Yan, Xiangtao Li, Qiuzhen Lin, Sam Kwong, and Cheng Liang. 2099.
DNA Sequencing Technologies: Sequencing Data Protocols and Bioinformatics Tools. ACM Comput. Surv. 9, 4,
Article 39 (March 2099), 32 pages. https://doi.org/0000001.0000001

1 INTRODUCTION
DNA sequencing technologies have dramatically driven and changed the genome research field in
recent years. Notably, the recent technologies have enabled massive sequencing data generation
for different species. In particular, global projects (e.g., the 1000 Genomes Project and Genotype-Tissue
Expression (GTEx)) have been completed, leading to high-throughput sequencing data accumulation
at an unprecedented level.
∗ This is the corresponding author

Authors’ addresses: Ka-Chun Wong, kc.w@cityu.edu.hk, City University of Hong Kong, Kowloon Tong, Hong Kong SAR;
Jiao Zhang; Shankai Yan; Xiangtao Li; Qiuzhen Lin; Sam Kwong, City University of Hong Kong, Kowloon Tong, Hong Kong
SAR; Cheng Liang, Shandong Normal University, Shandong, China.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the
full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires
prior specific permission and/or a fee. Request permissions from permissions@acm.org.
© 2009 Copyright held by the owner/author(s). Publication rights licensed to Association for Computing Machinery.
0360-0300/2099/3-ART39 $15.00
https://doi.org/0000001.0000001

ACM Computing Surveys, Vol. 9, No. 4, Article 39. Publication date: March 2099.

Nonetheless, the sequencing data are unique and different from the traditional structured data;
for instance, the NGS data are sparse, noisy, and discontinuous. Special care has to be taken to
mitigate those challenges and, where possible, turn them into advantages. In addition, the raw sequencing
data are of large volume (in the millions or more per dataset), which makes it difficult to
apply existing computing facilities for statistical analysis. Therefore, customized sequencing
data protocols with their accompanying bioinformatics facilities have been implemented for users to
tackle the above-mentioned problems and gain insights from the data.
In this article, we will briefly review DNA sequencing technology in three aspects: technological
history (section 2), data protocols (section 3), and bioinformatics tools (section 4). In section 2,
we will cover the technological history from the first generation sequencing (FGS) to the third
generation sequencing (TGS). Furthermore, in section 3, we will discuss the sequencing data
protocols in terms of data storage formats, data quality protocols, and data compression methods.
Alongside the data protocols, bioinformatics tools are needed for in-depth analysis. Hence an
overview of the related bioinformatics tools is given in section 4.
An acronym table (Table S2) is provided for readers to refer to and comprehend the content.

2 SEQUENCING TECHNOLOGIES
Rapid progress in sequencing technologies has enabled biologists to produce massive
high-quality sequencing reads at decreasing costs [43, 96]. The main differences among various
DNA sequencing platforms lie in library generation, amplification technologies, and the recording
methodology in identifying sequences [83]. The platforms generally follow the data workflow
in Figure S1. The first stage is the preparation of tissue material, in which biosamples are dissected
and extracted for DNA sequencing.
The second part of the workflow is to extract the DNA fragments of interest as the representative
sources of sequencing. For amplicon sequencing, we try to extract the DNA fragments from a
specific genomic region. For whole genome sequencing, we randomly cut whole genomes of the
samples into DNA fragments which will be sequenced in parallel.
Thirdly, we can add library preparation reagents to build a library of DNA fragment templates.
Fourthly, the DNA fragment templates can be amplified using an amplification technology; for
instance, the Roche (454), SOLiD-4, and Ion Torrent platforms rely on emulsion PCR (emPCR) while
the Illumina platform is based on bridge PCR. In contrast, rolling circle amplification (RCA) is adopted
for the Complete Genomics platform. Sanger sequencing also requires PCR amplification when it
is adopted for deep sequencing of a genomic region of interest (i.e., the amplicon approach)¹. Fifthly,
after the amplification stage, the DNA fragments in the amplified library are mixed with removal
reagents to clean up unused primers. After that, DNA sequencing can be carried out using DNA
sequencing machines, which can be classified into FGS (First Generation Sequencing), NGS (Next
Generation Sequencing), and TGS (Third Generation Sequencing) as described in the following
sections (additional details can be found in the supplementary sections).

2.1 Next Generation Sequencing (NGS)


Following the success of First Generation Sequencing (FGS) in the 1980s and 1990s (a detailed survey
of FGS can be found in the supplementary sections), genome research has transitioned to a big
data era since the invention of Next Generation Sequencing (NGS). To overcome the limitations
of FGS, researchers developed new methods to sequence multiple samples in parallel after 2004.
The high throughput capacity and low cost of massively parallel analysis are the major differences
1 https://www.thermofisher.com/hk/en/home/life-science/sequencing/sanger-sequencing/sanger-dna-sequencing/pcr-sanger-sequencing.html


Fig. 1. Overview of Roche 454 pyrosequencing. First, a long DNA sequence is fragmented into small DNA
fragments and denatured to ssDNA. An sstDNA library is created by ligating adaptors, which are used as primers
binding to beads. The sstDNA samples are then loaded onto beads in a water-in-oil mixture. In the emPCR
amplification step, the sstDNA molecules are clonally amplified and attached to beads. Finally, sequencing and
genome analysis are conducted to call DNA bases based on different light intensities.

between NGS and FGS [137]. NGS is claimed to shorten the sequencing time
from years to weeks. In addition, backed by extensive sequencing coverage, the data generated by
NGS can be statistically sound with high confidence [97]. However, the sheer size of the data also
imposes a big data challenge, which has to be tackled by specialized data manipulation techniques
and bioinformatics tools such as the ones introduced in the sections below.
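
To make the notion of sequencing coverage concrete, the average depth is commonly estimated as C = N × L / G, where N is the number of reads, L the read length, and G the genome size. The following is a minimal sketch of that arithmetic; the function names and the example numbers are illustrative assumptions, not values taken from this review.

```python
import math


def expected_depth(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average sequencing depth C = N * L / G (all lengths in bases)."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return num_reads * read_length / genome_size


def reads_needed(target_depth: float, read_length: int, genome_size: int) -> int:
    """Smallest number of reads whose expected depth reaches target_depth."""
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    return math.ceil(target_depth * genome_size / read_length)


if __name__ == "__main__":
    # Hypothetical example: one million 150 bp reads over a 5 Mb genome.
    print(expected_depth(1_000_000, 150, 5_000_000))
    print(reads_needed(30, 150, 5_000_000))
```

This estimate assumes reads are spread uniformly over the genome; real libraries show amplification and composition biases, so the realized depth varies from region to region.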

Roche 454 System. The very first commercial NGS system is the Roche 454 system (introduced
in 2005), which adopts pyrosequencing [98]. As shown in Figure 1, the first step is sample
preparation. A single-stranded template DNA (sstDNA) library is created by ligating specific adaptors
to both ends of each DNA fragment. Emulsion PCRs (emPCRs) are then used to amplify and isolate DNA
molecules captured on bead surfaces within a water-in-oil mixture. On a picotiter plate (PTP), one
type of dNTP at a time is incorporated wherever it complements the next base on the template strand.
Meanwhile, unmatched nucleotides are degraded by apyrase. The target DNA sequence can then
be deduced by detecting the light signal intensities of nucleotides complementary to the template
sequence, emitted as luciferin is converted into oxyluciferin.
Based on a similar methodology, another efficient sequencer, the Genome Sequencer FLX System (GS FLX),
was developed. In the succeeding year, the whole genome of a single human individual was sequenced;
the project was completed within two months and cost about one-hundredth of traditional capillary
electrophoresis techniques [155]. Nowadays, 454 GS FLX Titanium XL+ claims to achieve sequencing
reads up to thousands of base pairs with an accuracy of 99.997% and a typical throughput of 700Mb
within 23 hours; reads of 454 GS FLX Titanium XLR70 can scale up to 600bp with 99.995% accuracy
and a typical throughput of 450Mb within 10 hours [40]. The significant benefit of the GS FLX+ system
is its high speed. However, the reagent cost is still a challenge. In addition, INDEL sequencing
errors are commonly seen [95].


Fig. 2. Overview of Illumina sequencing. An sstDNA library is prepared by ligating specialized adaptors to both
fragment ends, and the fragments bind randomly to the surface of the flow cell channels. Double-stranded
bridges are formed using enzymes. The double-stranded bridges are then denatured to sstDNA, which can be
clonally amplified into clusters. Sequencing reagents, including DNA polymerase enzyme, nucleotide-labeled
reversible terminators, and primers, are incorporated. The fluorescence emissions are imaged and recognized.
The cycle is repeated for n iterations to read n bases.

Illumina system. In 2006, the second generation sequencing platform Genome Analyzer (GA) was
developed. The GA sequencer uses Sequencing By Synthesis (SBS) technology for automated sample
preparation and large-scale parallel sequencing, which is a major difference compared with Roche
454 [95]. Once the DNA fragments of interest have been retrieved, specific adaptors (made from
oligonucleotides) are bound to the fragment ends. A single-stranded DNA library is prepared and
grafted to the flow cell by several cycles of denaturation and PCR. Bridge amplification then forms
clusters that contain clonal DNA fragments. A linearization enzyme is used to cut the library
of clusters into single strands. In each iteration, four kinds of labeled reversible-terminator nucleotides
with different colors and fluorescent dyes are added and one is paired with the complementary base; the
visible light signal intensities are then captured using a charge-coupled device (CCD) [89]. Figure 2
shows the overall process of SBS.
After GA, several benchtop-scale platforms have been developed, such as MiniSeq, MiSeq, and
NextSeq. For production scale, HiSeq and NovaSeq have been developed. In early 2010, Illumina
released HiSeq 2000 using SBS, which was originally used in GA. HiSeq 2000 offers higher output
at a relatively lower cost than Roche 454 and SOLiD. HiSeq 2500, also using SBS chemistry, is an upgrade
of HiSeq 2000 and has two modes: the high output run mode can generate up to 1Tb within 6 days,
while the rapid run mode can generate up to 300Gb within 60 hours. To deliver high throughput at a
lower price per data point than HiSeq 2500, HiSeq 3000/4000 was built upon the HiSeq 2500 system.
In 2014, Illumina released the HiSeq X Ten System, which is composed of 10 HiSeq X instruments. It is
the first platform that can sequence about 18,000 human genomes within one year [24].
SOLiD system. Sequencing by Oligo Ligation Detection (SOLiD) was released in 2006 [95]. Its
novelty lies in the two-base sequencing strategy with high accuracy. Figure 3 shows the SOLiD
sequencing process. The sample preparation has two modes: fragment library for single DNA and


Fig. 3. Overview of SOLiD sequencing. A universal sequencing primer is annealed to sstDNA with adaptor.
Then the probe is hybridized and ligated to the primer. Nucleotides are imaged using the four-color fluorescent
dyes at the 5’ end of the probes. Nucleotides with fluorescent dye are washed out, opening a free 5’ end for a new
ligation. In this way, N ligation cycles are carried out for full sequence coverage. This process is repeated
5 times, each time with a new sequencing primer shifted by one base, in order to cover the whole sequence in 5
reading frames.

mate-paired library for two DNA fragments. The target DNA sequence is cut into small fragments and
adaptors are ligated to both ends. Fragment and mate-paired libraries contain, respectively, single-piece
DNA and two-piece DNA with a known distance in the sample of interest. Thus the libraries contain
millions of unique subsequences representing the entire target sequence. DNA is clonally amplified
and enriched on beads during the emPCR reaction. Each glass slide is covered with covalently bound
beads, enabling the flexibility to analyze 1, 4, or 8 samples per slide. With barcoding, it can
reach up to 16 libraries, which can increase the throughput to 256 samples in a single run. Four fluorescent
dyes label the related probes. The complementary probe hybridizes to the template sequence. The dye
is removed after fluorescence detection so that the 5’ end is free again. This process is repeated for
n iterations, and additional iterations can be added to extend the read length. For the next primer
round, the synthesized strand is cleaved. Another primer, offset by one base, is then hybridized.
The ligation cycles are repeated as before. Such a process is repeated for 5 more rounds, which
can provide a dual measurement for each base, increasing the sequencing accuracy. The open slide
format of the SOLiD system increases bead densities, which can also increase the throughput. By using two


independent flow cells, the system can run two different experiments in a single run, thus increasing
the productivity. The unique Di-Base coding process is designed to enhance the accuracy, which
can reach 99.94% with measurement error corrections. In 2010, after five releases, the ABI
SOLiD system was upgraded to SOLiD 5500xl with extended read lengths and high accuracy. A
shortcoming of SOLiD is the high cost of its computational infrastructure.
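
The two-base (Di-Base) encoding can be illustrated with a small sketch. It assumes the commonly described color-space scheme in which each of four colors encodes the transition between two adjacent bases, so that with A=0, C=1, G=2, T=3 the color equals the XOR of the two base codes. The function names are hypothetical, and the known first base (in practice taken from the adaptor) is passed explicitly.

```python
from typing import List, Tuple

# Assumed 2-bit base codes; under this choice color = code(a) XOR code(b).
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def encode_colors(seq: str) -> Tuple[str, List[int]]:
    """Return the first base plus one color (0-3) per adjacent base pair."""
    if not seq:
        raise ValueError("sequence must not be empty")
    colors = [BASE_TO_BITS[a] ^ BASE_TO_BITS[b] for a, b in zip(seq, seq[1:])]
    return seq[0], colors


def decode_colors(first_base: str, colors: List[int]) -> str:
    """Rebuild the base sequence from a known first base and the colors."""
    bases = [first_base]
    for color in colors:
        previous = BASE_TO_BITS[bases[-1]]
        bases.append(BITS_TO_BASE[previous ^ color])
    return "".join(bases)


if __name__ == "__main__":
    first, colors = encode_colors("ATGGC")
    print(first, colors)
    print(decode_colors(first, colors))
    # A true single-base variant (G -> C at the third base) changes two
    # adjacent colors, whereas a measurement error changes only one.
    print(encode_colors("ATCGC")[1])
```

Because every base contributes to two consecutive colors, an isolated color change is more likely a measurement error than a real variant, which is the intuition behind the dual measurement described above.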
Ion Torrent System. The Ion Torrent sequencing system is the first commercial sequencer that
is not based on dye-labeled oligonucleotides and camera scanning. Instead, it detects the hydrogen ions
that are released during each DNA polymerization step.
Thanks to its semiconductor-based detection, real-time sequencing can be completed within hours. Briefly,
Ion Torrent sequencing can deduce whether a nucleotide is added or not by detecting the pH change.
The process involves capturing a DNA sequence in a micro-well and flowing unmodified dNTPs
across the wells one type at a time. A hydrogen ion is released whenever the polymerase
incorporates a nucleotide; it changes the pH of the solution, which results in a
voltage change that can be detected by ion sensors [132].
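
The flow-based read-out described above can be sketched as a toy simulation. It assumes an idealized, noise-free instrument in which the signal of each flow equals the number of bases incorporated (the homopolymer length), and it uses a hypothetical flow order "TACG"; real instruments use their own flow orders and produce noisy, real-valued signals.

```python
from typing import List


def simulate_flowgram(read: str, flow_order: str = "TACG", max_flows: int = 400) -> List[int]:
    """Ideal signal per flow for the strand being synthesized.

    Each flow offers one nucleotide type; the signal is the number of
    consecutive bases of that type incorporated during the flow.
    """
    signals = []
    position = 0
    flow = 0
    while position < len(read) and flow < max_flows:
        base = flow_order[flow % len(flow_order)]
        run_length = 0
        while position < len(read) and read[position] == base:
            run_length += 1
            position += 1
        signals.append(run_length)
        flow += 1
    return signals


def call_bases(signals: List[int], flow_order: str = "TACG") -> str:
    """Turn integer flow signals back into a base sequence."""
    return "".join(
        flow_order[i % len(flow_order)] * count for i, count in enumerate(signals)
    )


if __name__ == "__main__":
    flows = simulate_flowgram("TTACGG")
    print(flows)
    print(call_bases(flows))
```

In this picture a homopolymer such as "TT" is read from a single signal of roughly double intensity, so a small intensity error adds or drops a base. This is consistent with INDELs being the primary error type listed for the 454 and Ion Torrent platforms in Table 1.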
In particular, there are four machine types available: S5, S5 XL, Personal Genome Machine (PGM),
and Proton. Ion S5 and Ion S5 XL are designed for targeted sequencing while Ion PGM and Ion Proton
are designed for whole-genome sequencing. In 2011, the detection of E. coli O104:H4, which infected
more than 3000 people, proved the usefulness of the PGM sequencer [101, 128]. In addition, the Ion
Torrent PGM system and HiSeq 2000 have also helped scientists to identify antibiotic resistance [18].

2.2 Third Generation Sequencing (TGS)


Third generation sequencing (TGS) has gained considerable attention in recent years. There is
much debate over the definitions that separate the second and third generation
sequencing [110]. Here are some examples: SMRT from Pacific Biosciences has been
classified as 2.5th generation since it satisfies the definitions of both NGS and TGS (definitions
can be found in the supplementary sections) [47]. The SMRT sequencer, Complete Genomics (CG),
and Oxford Nanopore sequencing were classified as NGS by Fei et al. [36]. Schadt et al. pointed
out that Ion Torrent’s semiconductor sequencer and Helicos Genetic Analysis Platform were
intermediates between NGS and TGS [135]. Nonetheless, there is a general consensus that NGS is
characterized by its high throughput, while TGS is distinguished by its single molecule sequencing
(SMS) ability. For illustration purposes, the technological history of NGS and TGS is
visualized in Figure 4.

Fig. 4. Timeline of the technological history of DNA sequencing over the past 10 years.


Helicos true Single Molecule Sequencing (tSMS) system. The Single Molecule Sequencing (SMS) technique
was proposed in 2003 [11]. The technique was then developed to sequence a viral genome
in 2008 [50]. Later, it was commercially launched by Helicos BioSciences [50]. The true Single
Molecule Sequencing (tSMS) system is the first technology to allow SMS without any amplification.
The methodology resembles that of Illumina, but it avoids PCR amplification bias and errors [10]. In
addition, the fluorescence readers can reduce sequencing sample mislabeling. Figure 5a
shows the sample preparation process of Helicos tSMS [115, 116]. However, tSMS is relatively slow
and produces short reads. Its reliance on consumable reagents is another limitation. Various types
of sequencing errors have also been reported; for instance, one study reported 1.5% insertion errors, 3%
deletion errors, and 0.2% substitution errors after its sequencing experiment [7].
PacBio Single Molecule Real Time (SMRT) system. Single molecule real time (SMRT) sequencing was released
in 2011 [150]. SMRT enables dynamic data visualization during sequencing. The Zero-Mode Waveguide
(ZMW) is the major component of SMRT technology; it consists of small pores, tens of
nanometers in diameter, surrounded by a 100nm metal film and silicon dioxide. Each ZMW acts as
a visualization chamber that enables the detection of a single molecule [75]. The single nucleotide
extension can be detected in real time by washing the DNA library with fluorescent dNTPs. Only
incorporated fluorescent nucleotides emit detectable fluorescence. Figure 5b shows the most
important step of PacBio SMRT sequencing. The first commercial SMRT instrument had around
75,000 ZMWs, which enables the detection of about 75,000 SMS reactions in parallel. The advantage
of the SMRT sequencing instrument is its minimal usage of reagents and sample preparation, which
lowers the sequencing cost and reduces the running time from days to minutes. In addition, no
PCR step is needed in the sample preparation, which saves time and avoids amplification biases and errors.
Furthermore, the average read length (1,300bp) of the first sequencer delivered by PacBio (PacBio
RS) is longer than that of the existing NGS technologies [30].

Table 1. Raw facts about FGS, NGS, and TGS platforms [36, 41, 89, 114]. Run time indicates the running
time for sequencing the nucleotides under the respective read size limits. File size denotes data sizes
transferred from instruments. Sequencing depth is reported as the average sequencing depth with or above 95%
sequencing read accuracy [29, 56]. Reference details could be found in
http://www.molecularecologist.com/next-gen-fieldguide-2016/

| Generation | Sequencer | Platform | Sequencing Mechanism | Bases per Read | Throughput per Run | Reagent Cost per Mb | Run Time | File Size (GB) | Primary Errors | Error Rate | Sequencing Depth |
|---|---|---|---|---|---|---|---|---|---|---|---|
| FGS | ABI 3730xl | Sanger | Dideoxy chain termination | 650 | 62Kb | $1500 | 2h | 0.03 | Substitution | 0.1%-1% | 6-10X |
| NGS | 454 FLX+ | 454 | Synthesis, emPCR | 650 | 0.65Gb | $9.5 | 20h | 40 images, 8 sff | INDEL | 1% | 30-100X |
| NGS | 454 GS Jr. Titanium | 454 | Synthesis, emPCR | 400 | 0.05Gb | $19.5 | 10h | < 3 images, < 1 sff | INDEL | 1% | 30-100X |
| NGS | 454 FLX Titanium | 454 | Synthesis, emPCR | 400 | 0.4Gb | $15.5 | 10h | 20 images, 4 sff | INDEL | 1% | 30-100X |
| NGS | Illumina MiSeq | Illumina | Synthesis, bridgePCR | 600 | 15Gb | $0.1 | 56h | 1-15 | Substitution | 0.1% | 30-100X |
| NGS | Illumina NextSeq 500 | Illumina | Synthesis, bridgePCR | 300 | 120Gb | $0.035 | 29h | ≤ 40/lane | Substitution | 0.1% | 30-100X |
| NGS | Illumina HiSeq 2500 | Illumina | Synthesis, bridgePCR | 250 | 500Gb | $0.029 | 6 days | ≤ 40/lane | Substitution | 0.1% | 30-100X |
| NGS | Illumina HiSeq X | Illumina | Synthesis, bridgePCR | 300 | 900Gb | $0.007 | < 3 days | 132-136/lane | Substitution | 0.1% | 30-100X |
| NGS | SOLiD - 5500xl | SOLiD | Ligation, emPCR | 110 | 155Gb | $0.068 | 8 days | 148 | A-T bias | 0.06% | 30-100X |
| NGS | Ion Torrent - PGM | Ion Torrent | SBS, emPCR | 400 | 0.2-2.2Gb | $0.4-$2.2 | 4-7h | 1-10 sff, 0.2-2.5 fastq | INDEL | 1% | 30-100X |
| NGS | Ion Torrent - Proton | Ion Torrent | SBS, emPCR | 200 | 16Gb | $0.06 | 4h | 120 sff, 30 fastq | INDEL | 1% | 30-100X |
| TGS | Helicos tSMS | HeliScope | SMS by synthesis | 35 | 28Gb | N/A | 8 days | 11 images | Substitution | 3%-5% | 28X |
| TGS | PacBio RS II | PacBio | SMRT by synthesis | 12000 | 0.66Gb | $0.3 | ≤ 6h | 2 (basecalls, QV) | INDEL | ≤ 1% | 8X |
| TGS | Oxford Nanopore MinION | Oxford Nanopore | Nanopore sequencing | 10000 | 6-260Gb | $0.15 | ≤ 6h | N/A | Deletion | 4% | 60X |

Oxford Nanopore Sequencing System. In 1996, nanopore sequencing was proposed to exploit the
fact that only single molecules of nucleotides can pass through tiny nanopores [60]. The basic
principle behind nanopore sequencing relies on a voltage bias that drives single-stranded
DNA through a nanopore. The passage of the molecule can be detected as ionic current
changes across each nanopore. An example of Oxford Nanopore sequencing is shown in Figure 5c.
Double-stranded DNA molecules could also be detected using non-biological solid-state materials
in the near future [27, 84]. In addition, each nanopore has the potential to be used multiple times.
The nanopore sequencing method is claimed to have the lowest sequencing cost since it does not

Table 2. Commercial comparison among FGS, NGS and TGS [41]. Costs are given in thousands of US dollars.
Reference details can be found at http://www.molecularecologist.com/next-gen-fieldguide-2016/

| Type | Sequencer | Platform | Current Company | Former Company | Sequencing Mechanism | Instrument Cost | Additional Cost | Service Cost | Computing Device | Advantages | Disadvantages |
|---|---|---|---|---|---|---|---|---|---|---|---|
| FGS | ABI 3730xl | Sanger | Thermo Fisher Scientific | Applied Biosystems | Dideoxy chain termination | $376 | N/A | $19.8 | Desktop | High accuracy; long reads | Low throughput; high cost |
| NGS | 454 FLX+ | 454 | Roche | 454 | Synthesis, emPCR | $450 | $30 | $50 | Desktop | Long read length | High cost per Mb; support of instrument ending in mid-2016 |
| NGS | 454 GS Jr. Titanium | 454 | Roche | 454 | Synthesis, emPCR | $108 | $16 | $12.6 | Desktop | Low cost per run | High cost per Mb; few reads; retired in mid-2016 |
| NGS | Illumina MiSeq | Illumina | Illumina | Solexa | Synthesis, bridgePCR | $99 | N/A | $14 | Cloud | High speed; long read lengths | Fewer reads; higher cost per Mb compared to NextSeq and HiSeq |
| NGS | Illumina NextSeq 500 | Illumina | Illumina | Solexa | Synthesis, bridgePCR | $250 | N/A | $32 | Cloud | Moderate instrument and run costs | Short reads; long run time |
| NGS | Illumina HiSeq 2500 | Illumina | Illumina | Solexa | Synthesis, bridgePCR | $690 | $55 | $80.5 | Cluster | Low cost per Mb; high output | Expensive instrument and running costs |
| NGS | Illumina HiSeq X | Illumina | Illumina | Solexa | Synthesis, bridgePCR | $1200 | $55 | $98.7 | Cluster | Lowest cost per Mb | Huge data storage requirement |
| NGS | SOLiD - 5500xl | SOLiD | Thermo Fisher Scientific | Applied Biosystems | Ligation, emPCR | $251 | $54 | $44.4 | Cluster | High accuracy | Relatively short reads; high capital cost |
| NGS | Ion Torrent - PGM | Ion Torrent | Thermo Fisher Scientific | Ion Torrent | SBS, emPCR | $49 | $18-$32 | $4.3-$9.9 | Desktop | Low cost instrument | More hands-on time; high cost per Mb; small user community |
| NGS | Ion Torrent - Proton | Ion Torrent | Thermo Fisher Scientific | Ion Torrent | SBS, emPCR | $224 | $19 | $19.9-$32.8 | Cluster | Moderate cost instrument for medium throughput applications | More hands-on time; high cost per Mb; small user community |
| TGS | Helicos tSMS | HeliScope | Helicos | N/A | SMS by synthesis | $999 | N/A | N/A | Cluster | Single molecule | Expensive instrument; high error rates |
| TGS | PacBio RS | PacBio | Pacific Biosciences | N/A | SMRT by synthesis | $695 | N/A | $84 | Cluster | Long read length; fast; modest cost per sample | High error rates; high cost per Mb |
| TGS | Oxford Nanopore MinION | Oxford Nanopore | Oxford Nanopore Technologies | N/A | Nanopore sequencing | $1000 | $0 | $0 | Laptop | Small portable instrument; low cost instrument; long reads | High cost per read; biased errors |


Fig. 5. (a) Overview of sample preparation of tSMS for Helicos. The DNA sequence is isolated from cells and
fragmented into small pieces. It is denatured to single strands while dATP is added. Then fluorescent
nucleotides are added as terminators. After sample preparation is finished, samples are loaded onto the flow cell.
The next steps are similar to the imaging steps of Illumina sequencing as shown in Figure 2. (b) Pacific
Biosciences SMRT sequencing. The DNA template and adaptor, bound to a polymerase, are immobilized at the
bottom of a ZMW micro-well. Fluorescently labeled nucleotides with different colors are introduced to each
chamber. Light emission is captured in the ZMW micro-well during SMRT sequencing. (c) Nanopore sequencing.
Double-stranded DNA is denatured to single strands and driven through each nanopore. DNA sequence
information is detected by the change of electrical current as nucleotides pass through.

involve the use of fluorescent reagents and other optical chips [21]. Errors can be computationally
corrected in Nanopore sequencing, providing a read accuracy of 99.8% [131].


2.3 Technological Comparison


FGS applies SBS or other degradation technology to separate DNA fragments; therefore, FGS can
achieve long sequencing reads with high raw read accuracy, but it can be extremely time consuming
and costly. NGS was therefore developed to overcome the limitations of FGS by gaining
speed and accuracy at the expense of read length. NGS uses washing and scanning SBS, which allows high raw
read accuracy, but the reads are shorter than those of FGS. Such short sequencing read lengths may limit our
understanding of structural variations and de novo read assembly over repetitive genomic regions.
Therefore, TGS is designed as an alternative for long sequencing reads. TGS uses single molecule
fluorescent or physical inspection technology, which allows a moderate raw read accuracy but
generates long reads (e.g. 1000bp). TGS provides the lowest cost per base and per run. However,
TGS is still being developed at the time of writing. Nonetheless, most of the NGS techniques studied
here can be readily adapted to TGS in the future. Table 2 details the comparisons among FGS, NGS,
and TGS.

2.4 Commercial Availability


These technologies have undergone intense commercial competition. For FGS, the chemical degradation
method developed by Maxam and Gilbert [99] and the chain-termination approach proposed
by Sanger [133] are considered the most important breakthroughs. However, Maxam
and Gilbert’s method was laborious and was not continuously improved after it was proposed;
therefore, it was gradually replaced by Sanger sequencing. For NGS, 454 Life Sciences was shut
down by Roche in October 2013 when its technology lost competitiveness after the success of
Illumina. Roche still produced 454 sequencers until 2015 but stopped supporting products
such as Genome Sequencer FLX+ and Genome Sequencer Junior by the middle of 2016 [53]. Life
Technologies and Applied Biosystems have been acquired by Thermo Fisher Scientific, which still
supports various sequencing platforms. With the rise of TGS, the battle for sequencing supremacy
will go on, benefiting the scientific community [31].

3 DATA PROTOCOLS
3.1 Data Repositories
The public availability of sequencing data is essential for scientific reproducibility. Each sequencing
experiment can generate several gigabytes or even terabytes of raw sequencing data and its related
annotations [6]. Most of them are archived in well-established repositories accessible to the
general public; for instance, the Sequence Read Archive (SRA) [73] at NCBI, DNA Data Bank of
Japan (DDBJ) [147], European Nucleotide Archive (ENA) [74] at EBI, GenBank, and 1000 Genomes
Project [3]. The Encyclopedia of DNA Elements (ENCODE) Consortium [23] includes the DNA
sequencing data related to the functional elements in the human genome, while the European Genome-Phenome
Archive (EGA) also includes DNA sequencing data for research in molecular medicine, subject to donor
consent, privacy, and authorised access [69].
In particular, the Sequence Read Archive (SRA) of NCBI supports raw and early-analyzed data
from second generation sequencing platforms. However, NCBI announced that it might slowly
phase out support for the SRA database as a result of budgetary constraints [148]. Although funding was
eventually secured, the announcement raised concerns over its stability. Therefore, alternatives to the SRA
are being developed; for instance, the Genome Sequence Archive (GSA) has recently been developed by
the BIG Data Center (BIGD) [102], accommodating new and upcoming sequencing data from
different sequencing platforms [151].

ACM Computing Surveys, Vol. 9, No. 4, Article 39. Publication date: March 2099.

3.2 Data Formats


Sequencing data generated by different instruments are stored in various formats according to their
specific applications. Standard formats such as FASTA/FASTQ [22, 120], SAM/BAM [77], and VCF
[25] are widely used. The FASTA/FASTQ formats are designed for storing raw genomic data before
alignment, whereas the SAM/BAM formats are designed for both aligned and unaligned sequence
data. Figure 6 shows the widely used sequencing formats as well as their relationships.

Fig. 6. Overview of widely used sequencing formats. Usually, sequencers generate FASTA or FASTQ files.
A SAM file is obtained after aligning a FASTQ file against a reference genome / sequence; a SAM file can
also store unaligned reads. The BAM format is the binary counterpart of the SAM format, saving storage
space. The VCF format is used to store the variants (or variations) observed in the sequence information.

FASTA format. FASTA format is the de facto standard for sequence data. Each sequence is
represented by two successive parts. The first part is a header line starting with the symbol
“>”; it usually contains identification information. The lines following the header store the
raw sequence data as the second part. This layout is repeated for each sequence of interest.
Figure 7a shows an example from NCBI [39]. In that file, the odd-numbered lines are descriptions
while the even-numbered lines are the target DNA sequence data. Note that the sequence data of a
record may span more than one line, as long as those lines follow its header line.
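
The header/sequence layout described above can be illustrated with a minimal Python sketch. The function name `parse_fasta` and the toy records are illustrative assumptions of ours, not part of any cited tool, and the input is not the NCBI record of Figure 7a.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each header (without '>') to its sequence, joining wrapped sequence lines."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is None:
            raise ValueError("sequence data found before the first header line")
        else:
            records[header] += line  # a sequence may span several lines
    return records


# Hypothetical toy input (not the NCBI record shown in Figure 7a).
toy_fasta = [
    ">seq1 toy record",
    "ACGTAC",
    "GGTA",
    ">seq2 another toy record",
    "TTGACA",
]
fasta_records = parse_fasta(toy_fasta)
print(fasta_records)
```

The sketch keeps whole files in memory for brevity; a streaming generator would be preferable for genome-scale inputs.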
FASTQ format. FASTQ extends the FASTA format with the addition of quality scores. It acts as a
common format covering Sanger standard, Solexa/Illumina, Roche 454, SOLiD and others. It is
widely used for sequencing data exchange because of its simplicity. There are four lines for each
sequence. The first line begins with the “@” character, followed by descriptive information
including the unique identifier. The target DNA sequence is in the second line. The third line
starts with the “+” character and introduces the quality description, and the quality scores for
the sequence follow in the fourth line. Such a format is repeated for each sequence successively.
Figure 7b shows an example.
Both the nucleotides and the quality scores are encoded as ASCII characters. A quality score is
assigned to each nucleotide, indicating the base-calling accuracy of that nucleotide with respect
to its sequencing platform. Quality scores are typically Phred-scaled base error probabilities,
each stored as a single printable ASCII character. Different sequencing instruments adopt
different quality score encoding methods.
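
As a hedged illustration of the four-line FASTQ record structure, the following Python sketch groups lines into records and checks the “@” and “+” markers. The names `FastqRecord` and `parse_fastq` and the toy reads are our own assumptions; the sketch also assumes that sequences are not wrapped across lines.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str
    sequence: str
    quality: str


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield one record per group of four lines (assumes unwrapped sequence lines)."""
    cleaned = [line.rstrip("\r\n") for line in lines]
    cleaned = [line for line in cleaned if line]  # drop empty lines
    if len(cleaned) % 4 != 0:
        raise ValueError("FASTQ input must contain a multiple of four lines")
    for i in range(0, len(cleaned), 4):
        header, sequence, separator, quality = cleaned[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record number {i // 4 + 1}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality strings differ in length")
        yield FastqRecord(header[1:], sequence, quality)


# Hypothetical toy reads (not taken from SRR5099898).
toy_fastq = [
    "@read1 toy",
    "ACGTN",
    "+",
    "IIII#",
    "@read2 toy",
    "GGCA",
    "+read2 toy",
    "@@II",  # a quality string may itself start with '@'
]
fastq_records = list(parse_fastq(toy_fastq))
print(fastq_records)
```

Records are located by counting lines rather than by searching for “@”, because “@” is also a valid quality character, as the second toy read shows.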
Fig. 7. Examples of different sequencing data formats: (a) FASTA file from NCBI nucleotide database (b)
FASTQ file of SRR5099898 from SRA.

SAM/BAM format. Sequence Alignment/Map (SAM) is a text format which contains sequence
alignment information [79]. It can also store unaligned sequence information. Normally, a SAM
file stores reads aligned against a reference sequence, as generated by various read
aligners [45]. SAM can accommodate both single-end reads and paired-end reads. Single-end
reads are individual sequencing reads while paired-end reads are pairs of reads coupled by a known
distance in bp. Figure 8a shows an example of SAM format. A SAM file has two parts: a header
section, whose lines start with “@”, and an alignment section, whose lines do not. All lines in
SAM format are tab-delimited [79].
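
The two-part, tab-delimited layout can be sketched in Python as follows. The eleven mandatory field names and the flag bits (0x1 paired, 0x4 unmapped, 0x10 reverse strand) follow the public SAM specification rather than the text above; the function name `parse_sam` and the toy lines are illustrative assumptions.

```python
from typing import Dict, List, Tuple

# Mandatory alignment columns as named in the SAM specification.
SAM_FIELDS = ("QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL")
INT_FIELDS = ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN")


def parse_sam(lines: List[str]) -> Tuple[List[str], List[Dict]]:
    """Split SAM text into header lines ('@...') and parsed alignment records."""
    headers, alignments = [], []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        if line.startswith("@"):
            headers.append(line)
            continue
        columns = line.split("\t")
        record: Dict = dict(zip(SAM_FIELDS, columns[:11]))
        for key in INT_FIELDS:
            record[key] = int(record[key])
        record["is_paired"] = bool(record["FLAG"] & 0x1)
        record["is_unmapped"] = bool(record["FLAG"] & 0x4)
        record["is_reverse"] = bool(record["FLAG"] & 0x10)
        record["optional"] = columns[11:]  # TAG:TYPE:VALUE fields, left unparsed
        alignments.append(record)
    return headers, alignments


# Hypothetical toy input (not the ENCODE file of Figure 8a).
toy_sam = [
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:1000",
    "\t".join(["read1", "16", "chr1", "101", "60", "4M",
               "*", "0", "0", "ACGT", "IIII", "NM:i:0"]),
]
sam_headers, sam_alignments = parse_sam(toy_sam)
print(sam_alignments[0]["RNAME"], sam_alignments[0]["POS"])
```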
Binary Alignment/Map (BAM) is designed to enhance the data reading and writing performance
of SAM file manipulation. From the computational perspective, BAM is the binary representation
of SAM. In addition, BAM files can be indexed for fast sequence extraction [79]. Therefore,
BAM can store the same information as the original SAM file, but compactly, using the
BGZF library. Sorting by position and indexing allow specific regions to be processed without
loading the whole BAM file into main memory.
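
The benefit of block-wise compression for random access can be sketched with Python's standard `gzip` module. This is only an illustration of the idea: real BGZF blocks carry an extra header field and are addressed through virtual file offsets, which this sketch does not reproduce, so its output is not a valid BAM/BGZF file. The function names and block size are illustrative assumptions.

```python
import gzip
from typing import List, Tuple


def compress_in_blocks(data: bytes, block_size: int) -> Tuple[bytes, List[int]]:
    """Compress fixed-size chunks independently and record where each block starts."""
    blocks, offsets, position = [], [], 0
    for start in range(0, len(data), block_size):
        member = gzip.compress(data[start:start + block_size])
        offsets.append(position)
        blocks.append(member)
        position += len(member)
    return b"".join(blocks), offsets


def read_block(blob: bytes, offsets: List[int], index: int) -> bytes:
    """Decompress a single block without touching the rest of the file."""
    start = offsets[index]
    end = offsets[index + 1] if index + 1 < len(offsets) else len(blob)
    return gzip.decompress(blob[start:end])


payload = b"ACGT" * 1000                      # hypothetical uncompressed content
blob, block_offsets = compress_in_blocks(payload, 1024)
third_block = read_block(blob, block_offsets, 2)
print(len(blob), len(block_offsets))
```

Because the concatenation of gzip members is itself a valid gzip stream, the whole file can still be decompressed sequentially, while the offset table plays the role of a (much simplified) index.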
Fig. 8. Examples of different sequencing data formats: (a) SAM file dumped from the BAM file of ENCODE
(i.e. wgEncodeSydhTfbsHuvecGata2UcdAlnRep1.bam) (b) VCF file for chromosome 1 variations from the
1000 Genomes Project (phase 3 pipeline).

VCF format. Variant Call Format (VCF) stores genomic variations (or variants) of sequencing
data such as Single Nucleotide Polymorphisms (SNPs), insertions/deletions (INDELs), structural
variations, and other variant annotations. VCF was developed for the 1000 Genomes Project and can
also be applied to other projects. Since different genomes of the same species usually share a
significant portion of nucleotides, storing only the variants makes VCF scalable enough to hold
massive genotype information. Figure 8b shows an example of the VCF file format. A VCF file
contains two parts: header and data. In the header section, the “##fileformat” and “#CHROM”
lines are mandatory while the rest are optional. In the data section, each
line indicates a variant at a specific genomic position. The first eight columns of the data section
are mandatory and tab-delimited, followed by optional columns. “CHROM” identifies the chromosome
number or sequence identifier (e.g. an angle-bracketed ID string pointing to a contig in a genome
assembly file); “POS” denotes the first base position of the variant within CHROM; “ID” is the
identifier of the variant, which should be unique; “REF” is the reference allele; “ALT” denotes the
alternate non-reference allele(s); “QUAL” is the Phred-scaled quality score; “FILTER” is the filter
status; “INFO” holds semicolon-separated, user-extensible annotations; the optional “FORMAT” column
indicates that genotype information follows [25, 45].
For genotype phasing, one can tell whether the genotypes are phased by looking at the separator
symbols: the symbol "|" indicates phased genotypes while the symbol "/" indicates unphased
genotypes. Therefore, the VCF example in Figure 8b contains phased genotypes.
To manipulate VCF files, the VCFtools software suite has been developed [25]. In particular, it can
perform six main operations (see footnote 2): variant filtering, file comparison, variant
summarization, file type conversion, format validation with merging ability, and set operations on variants.
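
The column layout and the phasing separators can be illustrated with a minimal Python sketch. It is not VCFtools and does not validate the full VCF grammar; the function names, the sample names `S1`/`S2`, and the toy variant are illustrative assumptions.

```python
from typing import Dict, List, Tuple

VCF_COLUMNS = ("CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO")


def parse_vcf_line(line: str, sample_names: List[str]) -> Dict:
    """Parse one data line; genotypes are read from the GT key when FORMAT is present."""
    columns = line.rstrip("\r\n").split("\t")
    record: Dict = dict(zip(VCF_COLUMNS, columns[:8]))
    record["POS"] = int(record["POS"])
    record["ALT"] = record["ALT"].split(",")  # several alternate alleles are allowed
    genotypes = {}
    if len(columns) > 9:
        keys = columns[8].split(":")
        if "GT" in keys:
            gt_index = keys.index("GT")
            for name, field in zip(sample_names, columns[9:]):
                gt = field.split(":")[gt_index]
                genotypes[name] = {
                    "alleles": gt.replace("|", "/").split("/"),
                    "phased": "|" in gt,  # '|' = phased, '/' = unphased
                }
    record["genotypes"] = genotypes
    return record


def parse_vcf(lines: List[str]) -> Tuple[List[str], List[Dict]]:
    """Return (sample names, records), skipping '##' meta-information lines."""
    samples: List[str] = []
    records: List[Dict] = []
    for line in lines:
        if line.startswith("##") or not line.strip():
            continue
        if line.startswith("#CHROM"):
            samples = line.rstrip("\r\n").split("\t")[9:]
            continue
        records.append(parse_vcf_line(line, samples))
    return samples, records


# Hypothetical toy input (not the 1000 Genomes file of Figure 8b).
toy_vcf = [
    "##fileformat=VCFv4.2",
    "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL",
               "FILTER", "INFO", "FORMAT", "S1", "S2"]),
    "\t".join(["1", "10177", "rs_toy", "A", "AC", "100",
               "PASS", "AF=0.42", "GT", "1|0", "0/1"]),
]
vcf_samples, vcf_records = parse_vcf(toy_vcf)
print(vcf_records[0]["genotypes"])
```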

3.3 Data Quality Scoring Schemes


A Phred quality score is assigned to each base after reading DNA sequencing trace files [33, 34].
Since the FASTQ format was first developed and used for Sanger sequencing, Phred quality scores
were widely adopted in the early days. To make quality scores legible and printable, Phred
qualities in the range of 0-93 are encoded as ASCII characters in the range of 33-126.
The Sanger FASTQ file is suitable for both raw reads and post-processing assembly where high
quality scores are diverse in distribution. In addition, the SAM format and Illumina 1.8+ also
adopt that Phred quality scoring scheme. Phred quality scores used in the Sanger FASTQ file can be
defined with respect to the base error probability $P_e$ as:
$$Q_{\mathrm{Sanger}} = -10 \times \log_{10}(P_e)$$

2 http://vcftools.sourceforge.net/

Solexa released their own version of the FASTQ format in 2004, which is incompatible with the
Sanger FASTQ format [8]. FASTQ only stores a single quality score per base, while Solexa generates
quality scores for all four bases in other files. The Solexa quality scoring scheme ranges from -5
to 62, encoded as ASCII characters in the range of 59-126. Solexa adopts a variant logarithmic
mapping to handle low-quality data. The Solexa quality score can be defined with respect to the
base error probability $P_e$ as:
$$Q_{\mathrm{Solexa}} = -10 \times \log_{10}\left(\frac{P_e}{1 - P_e}\right)$$
After Solexa was purchased by Illumina in 2006, Illumina continued to use $Q_{\mathrm{Solexa}}$ in
Solexa FASTQ files before Illumina 1.3. However, Illumina then introduced a third, incompatible
FASTQ format variant, used from Illumina 1.3 until 1.8, which encodes Phred scores in the range of
0 to 62 as ASCII characters in the range of 64 to 126 [54]. From Illumina 1.5 until 1.8, the Phred
quality scores of 0, 1, and 2 have different meanings from the former versions: the values 0 and 1
are unused, while the value 2, encoded as the ASCII character “B”, is also used as the read segment
quality control indicator at the end of reads.
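
The two scoring formulas and the ASCII offsets quoted in this section can be checked with a short Python sketch. The conversion from Solexa to Phred scores follows algebraically from the two definitions; the function and constant names are illustrative assumptions.

```python
import math
from typing import List

SANGER_OFFSET = 33        # Phred 0-93  <-> ASCII 33-126
SOLEXA_OFFSET = 64        # Solexa -5-62 <-> ASCII 59-126
ILLUMINA_13_OFFSET = 64   # Phred 0-62  <-> ASCII 64-126


def phred_from_error(p_e: float) -> float:
    return -10 * math.log10(p_e)


def solexa_from_error(p_e: float) -> float:
    return -10 * math.log10(p_e / (1 - p_e))


def error_from_phred(q: float) -> float:
    return 10 ** (-q / 10)


def error_from_solexa(q: float) -> float:
    # Inverting Q = -10 log10(P / (1 - P)) gives P = 1 / (1 + 10^(Q / 10)).
    return 1 / (1 + 10 ** (q / 10))


def solexa_to_phred(q_solexa: float) -> float:
    return phred_from_error(error_from_solexa(q_solexa))


def decode_quality(quality: str, offset: int) -> List[int]:
    """Turn a quality string into integer scores for a given ASCII offset."""
    return [ord(char) - offset for char in quality]


print(decode_quality("II5#", SANGER_OFFSET))
print(round(solexa_to_phred(-5), 2), round(solexa_to_phred(40), 2))
```

The two scales nearly coincide for high-quality bases and diverge only at the low end, which is why the Solexa scale can take negative values while Phred scores cannot.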

3.4 Data Compression


It was estimated that genomics had even surpassed astronomy in data size [142]. According
to statistics on high-throughput sequencers worldwide (http://omicsmaps.com/), the demand for
output data storage is approximately 50-100 PB per year. DNA sequencing data throughput has
increased almost tenfold every year, which is even faster than Moore’s Law. In addition, the
DNA sequencing costs of TGS are much lower than those of FGS. Therefore, Kahn suggested that the
development of data manipulation hardware will lag behind sequencing data generation [59].
Researchers have been developing different compression methods, which can be broadly
classified into two classes: quality score compression and sequence compression. Details
can be found in the supplementary sections.
Quality Score Compression. The direct storage of raw sequencing data is infeasible due to its
enormous size and redundancy. Since sequencing instruments generally store output data in
FASTQ/SAM files, each nucleotide character has a corresponding quality score. Thus, the quality
scores comprise almost half of the FASTQ/SAM data. Many compression algorithms have been
proposed to reduce the storage space taken by the quality scores. In particular,
Yorukoglu et al. have proposed a quality score compression method (Quartz) which can even improve
downstream genotyping accuracy [164]. Other methods are described in the supplementary sections.
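
To give a feel for lossy quality score compression, the following Python sketch coarsens scores into fixed-width bins so that fewer distinct symbols need to be encoded. This generic binning is our own illustrative assumption; it is not the Quartz algorithm [164], which works differently, and the bin width of 10 is arbitrary.

```python
def bin_qualities(quality: str, offset: int = 33, bin_width: int = 10) -> str:
    """Round each score down to the start of its bin (lossy, illustrative only)."""
    binned = []
    for char in quality:
        score = ord(char) - offset
        coarse = (score // bin_width) * bin_width
        binned.append(chr(coarse + offset))
    return "".join(binned)


raw_quality = "IIHGFEDCBA@?>=<;"          # hypothetical Sanger-encoded scores
coarse_quality = bin_qualities(raw_quality)
print(raw_quality)
print(coarse_quality)
print(len(set(raw_quality)), "distinct symbols before,",
      len(set(coarse_quality)), "after")
```

A smaller alphabet with longer runs is easier for a general-purpose entropy coder to compress, at the cost of discarding score resolution.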
Sequence Compression. The key insight of sequence compression is that sequence data are highly
redundant, both within a genome (e.g. low complexity regions) and across genomes; for instance,
the genomes of two human individuals share almost 99% identical sequence content, so only
the 1% of discrepancy needs to be stored [155]. There are many algorithms to
compress sequence data to reduce storage space. In particular, the CRAM format has been designed
by EBI to replace the BAM format. However, it has been demonstrated that CRAM can be slower
than BAM in decoding, although CRAM can be faster than BAM in encoding and save an additional
34-55% of file space [9]. Other methods are described in the supplementary sections.
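
The idea of storing only the discrepancy against a reference can be sketched in Python as follows. This is a substitution-only toy under our own assumptions (equal-length sequences, no INDELs); it does not reproduce the actual CRAM encoding, and the function names are hypothetical.

```python
from typing import List, Tuple


def encode_against_reference(reference: str, sample: str) -> List[Tuple[int, str]]:
    """Keep only the (0-based position, base) pairs where the sample differs."""
    if len(reference) != len(sample):
        raise ValueError("this sketch handles substitutions only")
    return [(i, b) for i, (a, b) in enumerate(zip(reference, sample)) if a != b]


def decode_with_reference(reference: str, differences: List[Tuple[int, str]]) -> str:
    """Rebuild the sample by applying the stored differences to the reference."""
    bases = list(reference)
    for position, base in differences:
        bases[position] = base
    return "".join(bases)


toy_reference = "ACGTACGTAC"
toy_sample = "ACGTTCGTAA"
toy_differences = encode_against_reference(toy_reference, toy_sample)
print(toy_differences)
print(decode_with_reference(toy_reference, toy_differences) == toy_sample)
```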

Fig. 9. Bioinformatics workflow for handling DNA sequencing data: alignment, quality control, variant
discovery, and integrative analysis. The main skeleton of the pipeline is referenced from OMICtools [51].

To search sequencing reads against genomes, algorithms such as MAQ [81], BWA [77], and
Bowtie [67] were developed as efficient methods for short read alignment. However, when one
conducts a search on reads of unknown origins against the existing sequence database, the search
becomes tedious and time-consuming. Therefore, Loh et al. [91] proposed CaBLAST and CaBLAT
to accelerate BLAST [5] and BLAT [61] for sequence search in the compressed space respectively.

4 BIOINFORMATICS TOOLS
The rapid development of DNA sequencing technologies has the potential to have a
revolutionary impact on genomics [63, 139]. Given the prevalence of DNA sequencing technologies,
a very large number of sequencing reads have been generated (e.g. over 35 petabases per year
according to [142]). We are therefore facing a number of big data challenges in handling those
sequencing data, such as data cleaning, storage, indexing, search, statistical computing, and
visualization [121].

To address those challenges, we review the classical bioinformatics workflow for DNA sequencing
analysis. The overall workflow is summarized in Figure 9, which depicts each procedure together
with the related bioinformatics tools.

4.1 Quality Control


Although sequencing technologies have enabled the rapid production of large volumes of sequencing
data at reduced costs, some defects in the sequencing results can propagate to downstream analysis.
Those sequencing artifacts include low-quality reads [168], which can greatly deteriorate the
completeness and the accuracy of downstream analyses such as sequence assembly, variant
calling, gene expression analysis [119], and other genomic annotations.
The reasons behind low-quality reads are varied, including the lack of sample purification before
sequencing, complex genomic regions in the samples, and platform-specific sequencing biases.
Besides, the errors and bias of PCR amplification can also cause artificially repetitive sequences.
Sequencing samples can also be contaminated with DNA from other species [168]. Experimental faults
can be the principal reasons, while some unexpected nucleotides from unknown species may also cause
contamination.
According to the quality scores generated in each sequencing run, we can choose a quality
control tool to trim or substitute the low-quality base calls. Due to PCR amplification
bias, duplicated reads also arise in the raw sequencing data. The removal of those duplicated
reads can reduce the bias and the computational cost of downstream analysis. Therefore, a set of
programs for duplicate removal before sequence alignment were developed, such as cd-hit-454 [111],
QUASR [153], FastUniq [159], and PyroTrimmer [112]. Nonetheless, we note that such a removal
process is still under scrutiny, as evidenced by a recent study [118]. After sequence alignment,
several steps can also be executed to refine the alignments; for instance, SAMBLASTER [35] can
efficiently mark duplicates in the output of an aligner. After that, a comprehensive parallel
analysis toolkit for sequencing data such as GATK [100] can be used to locally realign the
sequences around INDELs and recalibrate the base quality scores using the empirical error rates.
The most frequently used programs are listed in Table S3. These programs provide command
lines or web interfaces. They can take the sequence data (e.g. FASTA, FASTQ, etc.) or the alignment
data (e.g. BAM, SAM, etc.) as input. Most of them share similar quality control functions.
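
Two of the quality control operations mentioned in this section, trimming low-quality bases and removing duplicated reads, can be sketched in Python as follows. The sketch is a deliberately naive illustration under our own assumptions (Sanger offset 33, a threshold of 20, exact-match duplicates only); it does not reproduce any of the cited tools.

```python
from typing import List, Tuple


def trim_low_quality_tail(sequence: str, quality: str,
                          min_score: int = 20, offset: int = 33) -> Tuple[str, str]:
    """Remove trailing bases whose quality score falls below min_score."""
    end = len(sequence)
    while end > 0 and ord(quality[end - 1]) - offset < min_score:
        end -= 1
    return sequence[:end], quality[:end]


def remove_exact_duplicates(reads: List[str]) -> List[str]:
    """Keep the first occurrence of each read sequence, preserving input order."""
    seen = set()
    unique = []
    for read in reads:
        if read not in seen:
            seen.add(read)
            unique.append(read)
    return unique


trimmed = trim_low_quality_tail("ACGTAC", "IIII##")   # '#' encodes score 2
deduplicated = remove_exact_duplicates(["ACGT", "ACGT", "TTGA"])
print(trimmed, deduplicated)
```

Dedicated tools additionally handle near-identical duplicates, paired-end reads, and adapter sequences, which this sketch ignores.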

4.2 Experimental Design


After quality control, sequencing reads can be assembled or mapped according to the type of sample
from which the input DNA fragments come. As depicted in Figure 10, if the sequencing reads come
from environmental samples, they could involve multiple species, calling for de novo metagenomic
assembly. In contrast, if the sequencing reads come from a purified sample, the reads can be de
novo assembled in a standard manner. More simply, the sequencing reads can also be mapped onto
a reference genome if one is available. For population studies, one may repeat the above process
for different purified samples from the same species, resulting in multiple individual genomes.
Those genomes can be aligned and called for variants, enabling the downstream analysis on the
called variants.

4.3 De Novo Genome Assembly


De novo genome assembly aligns and merges DNA fragments from different genomic regions to
reconstruct the original genome. This technique emerged because of the increasing need to assemble
the vast amount of short reads generated by high-throughput sequencers. All draft genomes have
been reconstructed using genome assembly, including those of species in danger of extinction (e.g.
the giant panda [86] and Tasmanian Devil [108]). The genome assembly tools (i.e. genome assemblers)
output the result as overlapping sequence data named contigs or scaffolds.

Fig. 10. Different paths of DNA sequencing read analysis after quality control.

A contig is a contiguous consensus sequence constructed from a set of sequencing reads which
overlap with each other, while a scaffold is an ordered set of contigs linked by gaps of
known length (e.g. mate-pair/paired-end distance). A genome draft is the collection of scaffolds
which have been ordered on the respective physical map of chromosomes, each scaffold being
built from contigs assembled from sequencing reads.
A basic method for de novo genome assembly can be implemented with a greedy algorithm that
aims to find the shortest common superstring of all the input fragment sequences [167]. The
challenge is to compute the pairwise alignments of all sequencing reads in an efficient manner.
Many genome assembly methods have been proposed in the past. In particular, graph-based genome
assemblers are commonly used. Briefly, those assemblers represent the input
sequencing reads as individual nodes, which are connected by edges if the two corresponding reads
overlap in sequence. Based on such an idea, graph-based genome assemblers
have been developed for various purposes [105]; for instance, MIRA is an assembler designed for
small genomes and transcribed mRNA assembly [17]. It makes use of additional information such as
the known SNP sites and repeat stretches within the whole assembly context to implement a robust
method for sequence assembly while ensuring the quality of the final result. Velvet is another
assembler that is suitable for assembling very short reads from small genomes such as prokaryotic
data [166]. SOAP [88] and ALLPATHS-LG [42] are recommended for large and repeat-rich vertebrate
genomes (e.g. human and mouse). SOAP has been used to assemble the genomes of human, plants, and
other animals, achieving remarkable results, which create novel opportunities for building reference
sequences for downstream analysis. ALLPATHS-LG has been shown to achieve long-range connectivity,
high accuracy, continuity, and good coverage for different genomes.
In summary, different approaches have their own strengths and fields of application. An appropriate
method should be selected according to the specific user requirements. Even now, genome assembly
methods are still struggling to reduce memory consumption, due to the increasing volumes of
fragmented genome reads from massively parallel sequencing.
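
The greedy strategy mentioned above can be sketched in Python as follows: the pair of sequences with the longest suffix-prefix overlap is merged repeatedly until no sufficient overlap remains. The sketch assumes error-free reads from a single strand and compares all pairs exhaustively, which is exactly the cost that practical assemblers avoid; it is not the algorithm of any cited assembler, and the function names and toy reads are hypothetical.

```python
from typing import List


def overlap_length(left: str, right: str, min_length: int = 1) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for k in range(min(len(left), len(right)), min_length - 1, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def greedy_assemble(reads: List[str], min_overlap: int = 3) -> List[str]:
    """Repeatedly merge the best-overlapping pair; returns the remaining contigs."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Discard reads fully contained in another read.
    contigs = [r for r in contigs if not any(r != o and r in o for o in contigs)]
    while True:
        best_k, best_i, best_j = 0, -1, -1
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i == j:
                    continue
                k = overlap_length(left, right, min_overlap)
                if k > best_k:
                    best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]


toy_reads = ["ATGGCGT", "GCGTGCA", "TGCAATG"]
assembled = greedy_assemble(toy_reads)
print(assembled)
```

Greedy merging can mis-assemble repeats, since a repeated segment produces equally good overlaps between reads from different loci; this is one motivation for the graph-based formulations discussed above.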

4.4 Reference Genome Alignment


Given an available reference genome (e.g. mouse, zebrafish, and human from the Genome Reference
Consortium, or other species from other consortia as listed on Ensembl; see footnote 3), sequence
alignment can be adopted to align and map sequencing reads onto the reference genome for further
analysis. We would like to note that "alignment" aims at the correct placement of each base of
sequencing reads on a given reference sequence with gaps while "mapping" aims at the correct
placement of entire sequencing reads on a given reference sequence.
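
The notion of mapping, i.e. placing an entire read on the reference, can be illustrated with a minimal seed-and-verify sketch in Python. It uses a plain k-mer hash index and ungapped mismatch counting, so it performs mapping rather than gapped base-level alignment; it is not the indexing scheme of BWA, Bowtie, or any other cited tool, and all names and parameters are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def build_kmer_index(reference: str, k: int) -> Dict[str, List[int]]:
    """Record every start position of every k-mer in the reference."""
    index: Dict[str, List[int]] = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def map_read(read: str, reference: str, index: Dict[str, List[int]], k: int,
             max_mismatches: int = 1) -> Optional[Tuple[int, int]]:
    """Return (0-based start, mismatches) of the best ungapped placement, or None."""
    candidates = set()
    for offset in range(len(read) - k + 1):
        for hit in index.get(read[offset:offset + k], []):
            start = hit - offset
            if 0 <= start <= len(reference) - len(read):
                candidates.add(start)
    best = None
    for start in sorted(candidates):
        window = reference[start:start + len(read)]
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (start, mismatches)
    return best


toy_reference = "ACGTTGCATGCCGATAGCTA"
toy_index = build_kmer_index(toy_reference, 4)
exact_hit = map_read("GCATGCCG", toy_reference, toy_index, 4)
noisy_hit = map_read("GCATGCCT", toy_reference, toy_index, 4)  # last base altered
print(exact_hit, noisy_hit)
```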
Since sequencing reads of varying lengths are expected under noisy experimental conditions,
different software tools have been developed, such as BWA [78], Bowtie [68], CUSHAW [90], and
MAQ [82]. Different types of sequencing libraries and technologies can be used, such as single-end,
paired-end, and strand-specific libraries; these result in different alignment modes of which users
should be aware.
Thanks to the rapid development of parallel computing, a growing number of parallel sequence
alignment programs have emerged in the field of bioinformatics, such as CUSHAW2-GPU [162],
CloudBurst [136], and Stampy [93]. Those software tools were built upon various parallel platforms
such as CUDA, MapReduce, and Pthreads. With parallel computing, the speed of sequence alignment
could be significantly accelerated on massive sequencing data. A list of sequence alignment programs
is tabulated in Table S4.

4.5 Sequencing Visualization


With the rapid growth of DNA sequencing technology, sequencing data analysis has become a
challenging task. To better understand the output data of sequence alignment software tools, a
collection of visualization tools has been developed, including SAMtools tview [80], IGV [149],
Tablet [106], and IGB [37]. Those programs mostly have graphical user interfaces and offer a set
of functions besides visualization; some of them are tabulated in Table S6.
The reason for using visualization tools is that most output representations of sequence alignment
are rows of records with aligned characters in each column, which are not easily understandable.
Furthermore, manual editing and curation of alignments require versatile visualization techniques,
especially for users who are not familiar with programming.
A good visualization tool should possess a number of traits such as usability, ease of deployment,
efficiency, and performance robustness. In particular, usability is the most important trait in
bioinformatics tools since the tools are usually developed in an interdisciplinary manner where
people from different backgrounds come together. Generally, a tool’s usability refers to how easily
and quickly a user can learn to use the tool.
A simple installation or deployment procedure greatly shortens the preparation time for the
tool. Usually, a web-based interface rather than a command line interface is preferred by
biologists. The ability to handle numerous short read alignments is important for NGS data. An
efficient genome visualization tool, such as the UCSC Genome Browser, can relieve the burden on
analysts who face large batches of sequencing data. In addition, modern technologies such as
JavaScript and HTML5 can also enhance existing genome visualization platforms; examples
can be found in Table S6.

4.6 Variant Discovery


Statistical methods can also contribute greatly to the discovery / detection of statistically
significant variations in genome-wide population-based studies [28]. Some of the variant discovery
tools are tabulated in Table S7, covering SNV calling, INDEL calling, and Structural Variation (SV)
calling [117].

3 http://www.ensembl.org/info/about/species.html

SNV calling. SNV calling refers to methods for the identification of single nucleotide variants
(SNVs). If an SNV is common enough in a given population (e.g. allele frequency > 1%), it is also
called a single nucleotide polymorphism (SNP).
Generally, we can categorize those techniques into two groups: somatic mutation calling and
germline mutation calling. The alignment of sequencing reads from multiple somatic cells of an
individual can be used to identify somatic mutations. In contrast, the reads from germ cells are
used to identify germline mutations [44]. A range of methods have been developed to handle those
two mutation types, such as SAMtools [80], VarScan [62], MuTect [20], Strelka [134], and
SomaticSniper [70]. Normally, SNVs occur in individuals at low frequency. A great number of
bioinformatics approaches have been proposed in the literature, such as GATK [28], MAQ [82],
SOAPsnp [87], Syzygy [126], SAMtools [76], SNPdryad [157], and Cortex [55].
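
The basic idea behind SNV calling from aligned reads can be sketched with a naive pileup-based caller in Python: at each reference position, the most frequent non-reference base is reported if it is supported by a sufficient fraction of reads. The thresholds, the function name, and the toy pileups are our own assumptions; the cited callers rely on statistical models rather than fixed cut-offs.

```python
from collections import Counter
from typing import Dict, List, Tuple


def call_snvs(reference: str, pileups: Dict[int, str],
              min_depth: int = 4,
              min_alt_fraction: float = 0.2) -> List[Tuple[int, str, str, float]]:
    """Return (0-based position, ref base, alt base, alt fraction) for candidate SNVs."""
    calls = []
    for position, bases in sorted(pileups.items()):
        depth = len(bases)
        if depth < min_depth:
            continue  # too few reads to make a call
        ref_base = reference[position]
        alt_counts = Counter(base for base in bases if base != ref_base)
        if not alt_counts:
            continue
        alt_base, count = alt_counts.most_common(1)[0]
        fraction = count / depth
        if fraction >= min_alt_fraction:
            calls.append((position, ref_base, alt_base, fraction))
    return calls


toy_reference = "ACGT"
# Bases observed in the reads covering each reference position (hypothetical).
toy_pileups = {0: "AAAA", 1: "CCTT", 2: "GA", 3: "TTTTTTTTTC"}
snv_calls = call_snvs(toy_reference, toy_pileups)
print(snv_calls)
```

Position 2 is skipped for insufficient depth and position 3 because a single discordant base out of ten is more plausibly a sequencing error than a variant under these thresholds.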
INDEL Calling. INDEL calling identifies the insertions or deletions (INDELs) of nucleotides.
Unlike SNVs, INDELs can shift the reading frame of codons. This kind of mutation has different
impacts on human diseases [107]. Therefore, INDEL detection has become an essential task in
bioinformatics. A great number of algorithms have been proposed for INDEL detection such as GATK
[28], Pindel [160], BreakDancer [15], Syzygy [126], Dindel [2], and MoDIL [72]. Instead of
genome-wide hash-table or suffix tree indexing, Pindel relies on pattern growth for unique
substring search so as to remain efficient in memory and running time. The experiments reveal that
it can detect deletions as large as 10kb. Dindel realigns reads to candidate haplotypes for INDEL
calling [2]. The candidate haplotypes are generated by merging the SNVs as well as INDELs detected
by other methods or the read mapper. It is suitable for genome-wide INDEL detection since it
explicitly accounts for the sequencing errors in long homopolymer runs and addresses the problem of
ambiguous INDEL definition. Apart from INDELs, there still exist various types of structural
variations. A comprehensive evaluation of the INDEL calling software can be found in [109].
Structural Variation. Structural variation (SV) is generally defined as regional variation on
chromosomes. It is a general term that covers multiple variation types such as INDELs,
translocations, inversions, duplications, and copy-number variants (CNVs).
It is observed that SVs are associated with genetic diseases; they may even exert selection pressure
on molecular evolution [145, 154]. Many approaches to identify SVs have been proposed in the past;
for instance, Hydra-sv [123], BreakSeq [66], FusionMap [38], SVMerge [156], and APOLLOH [48].
In particular, Hydra-sv is a genome-wide method for the detection of SV breakpoints by paired-end
mapping [123]. It can accurately map diverse types of SV, especially in regions of segmental
duplication. However, incomplete reference genome assemblies are the main source of noise affecting
its accuracy. In particular, CNV is a type of SV defined as a variable number of repeats of
genomic sections larger than 1,000 base pairs [124]. CNVs can affect the polymorphism of various
species, influencing different traits of organisms such as susceptibility to disease. A growing
number of CNV detection methods have been proposed; for example, BreakDancer [15], mrCaNaVaR
[4], RDXplorer [163], CNVnator [1], CNV-seq [158], and PEMer [64].
Pathogenic Variant Identification. Pathogenic variant identification refers to the identification
of pathogenic variants, such as the somatic abnormalities in cancer genomes, whereas passenger
mutations are somatic mutations which do not have any functional consequence during cell division
[144]. To elucidate the molecular mechanisms of genetic diseases such as cancer, it is crucial to
distinguish pathogenic variants from the millions of observed mutations. Several methods have been
developed; for instance, SNPdryad [157], PANTHER [104], MutSig [71], MutationTaster [138],
InVEx [52], SNAP [13], and SNPs3D
[165]. To investigate those methods’ performance, we have benchmarked the widely used methods
(namely, MutationTaster, MutationAssessor, SIFT, LRT, FATHMM, PROVEAN, MetaSVM, MetaLR,
and M-CAP) on the pathogenic non-synonymous SNPs (nsSNPs) found in the ClinVar dataset
(fileDate=20170206). Specifically, we have retrieved the non-synonymous SNPs from the ClinVar
dataset in Feb 2017, resulting in 237,647 variants with clinical evidence support. Among them, 95,524
nsSNPs with clinical evidence support are retrieved using dbNSFP (version 2.9.2). The 95,524 nsSNPs
are subdivided and filtered further into two groups, 17,754 pathogenic nsSNPs (’clinvar_clnsig’=5)
and 3,924 benign nsSNPs (’clinvar_clnsig’=2) based on the ’clinvar_clnsig’ labels. Based on the
resultant nsSNP dataset, we have run those methods using their default parameter settings and
benchmarked their pathogenic nsSNP annotation ability (prediction for 2 classes) based on Receiver
Operating Characteristic (ROC) curves as depicted in Figure 11; it can be observed that the ensemble
methods (e.g. MetaLR and MetaSVM) performed the best while PROVEAN is the best-performing
individual method according to the Area Under Curve (AUC) values.
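
For readers unfamiliar with the AUC metric used in this benchmark, the following Python sketch computes it through its rank-based interpretation: the probability that a randomly chosen pathogenic variant receives a higher score than a randomly chosen benign one, with ties counted as one half. The scores and labels are hypothetical and unrelated to the ClinVar benchmark; the function name is our own.

```python
from typing import List


def roc_auc(scores: List[float], labels: List[int]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes are required to compute the AUC")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical predictor scores; label 1 = pathogenic, 0 = benign.
toy_scores = [0.9, 0.8, 0.4, 0.3]
toy_labels = [1, 0, 1, 0]
toy_auc = roc_auc(toy_scores, toy_labels)
print(toy_auc)
```

The pairwise loop is quadratic in the number of variants; for datasets of the size used above, a sort-based rank-sum computation would be preferable.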

Fig. 11. Receiver Operating Characteristic (ROC) curves for Pathogenic non-synonymous SNPs (nsSNPs) Iden-
tification Performance Comparison between Different Methods on the ClinVar dataset (fileDate=20170206).
The bracketed numbers inside the legend denote the Area Under Curve (AUC) values.

To investigate the performance further, we have visualized their ranked score distributions
for the pathogenic nsSNPs and benign nsSNPs in Figure 12 according to dbNSFP (version
2.9.2). A wide performance spectrum across different methods can be observed. Several observations
can be made: (1) The methods score the pathogenic nsSNP variants well. (2) However,
their performance on benign nsSNP variants varies across different methods. In particular, we can
observe that a noticeable number of benign nsSNPs are annotated as pathogenic by all methods except
MetaLR, whose performance is close to the ideal setting. (3) Ensemble methods (e.g. MetaLR and
MetaSVM) always perform better than the individual methods. Given those observations, we suggest
that bioinformatics users directly adopt the ensemble methods for pathogenic nsSNP identification
in most cases.

Fig. 12. Violin Plots for Pathogenic non-synonymous SNPs (nsSNPs) Identification Performance Comparison
between Different Methods on the ClinVar dataset (fileDate=20170206). The violin widths have been scaled
to be bound by the same width in pixels.

Hypothesis and Limitations. The aforementioned variant calling tools usually could not work
alone to achieve the most accurate results. It requires a number of capabilities for accurate variant
calling; for instance, the sequencing read quality would be one factor that can affect the accuracy of
variant calling [12]. Mapping reads to the reference genome will be another crucial computation step
before variant calling [82, 85]. Especially, the reads that contain INDELs could cause misalignment
due to the insertion or deletion of nucleotides. Therefore, the stages in NGS data analysis will
influence each other in different ways. The tools for different types of variant calling should usually
be used together to reduce the error propagation. Most of the variant calling tools have assumed
hypothesis and limitations with respect to different NGS data analysis tasks; for instance, SAMtools
and GATK are the two popular tools for variant calling. SAMtools is a comprehensive utility set for
manipulation of alignment data in SAM/BAM format with variant calling functions. On the other
hand, GATK is the de facto analysis toolkit that is competent at different types of variant calling.
Both tools rely on Bayesian statistics to call variants. The major difference lies in their SNP
genotype likelihood models: GATK assumes that sequencing errors are independent of each other,
whereas SAMtools assumes a first-order Markov relation between them. In terms of functionality,
GATK claims support for multi-allelic cases, which SAMtools does not include. The two programs
also differ in how they treat reads with low mapping quality: SAMtools uses all reads by default,
whereas GATK considers only the reads with high quality. Furthermore, SAMtools uses manually
tuned filters to select sequencing reads, while GATK trains models to learn the filters from the
data for robust analysis. In addition, other subtle differences in their implementations can
lead to different variant calling results. Users are advised to run both of them and compare the
results to obtain more reliable variant calls [28, 76]. On the other hand, there
are other types of methods which can handle both normal and abnormal samples as pairs using a
joint diploid genotype likelihood (e.g., VarScan, Strelka, and MuTect). VarScan has been reported
to be the best-performing tool [141], thanks to its robust heuristic which can tolerate extreme sequencing
read depth, pooled samples, and contaminated samples. In particular, it can overcome the side
effects caused by the inaccurate mapping of short reads that contain INDELs. However, other studies
indicated that VarScan is less accurate than other tools [16, 161]. MuTect uses
Bayesian statistics to detect rare somatic mutations. It can achieve high sensitivity and specificity
with only a few supporting reads, but it requires careful filter tuning. Strelka also applies a Bayesian
approach, modeling the allele frequencies of the normal and tumor samples. This model
structure maintains the sensitivity of variant calls on impure samples. Similar to VarScan, Strelka
computes a statistical test between the normal and tumor allele frequencies. However, one study
shows that Strelka tends to return fewer germline polymorphisms than other methods
[127]. As general guidance, the choice of variant calling tools
should be considered and adjusted carefully in advance, with the help of visualization
tools such as IGV [149] and BamView [14]. A complete pipeline should be adopted to accomplish
the sophisticated task of variant calling. In this regard, different frameworks for pipeline
automation have been proposed, such as BioBlend [140], SeqMule [46], FamPipe [19], RUbioSeq
[130], and P3BSseq [94]. Users are recommended to take advantage of those frameworks
for expert assistance and guidance in building bioinformatics tool pipelines.
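
To make the independence assumption mentioned above concrete, the following is a minimal Python sketch of a Bayesian diploid genotype call at a single site. It is a simplified textbook model and not the implementation used by GATK or SAMtools; the function names, the prior value `het_prior`, and the even split of errors over the three other bases are all illustrative assumptions.

```python
import math

def base_likelihood(observed, allele, error):
    """P(observed base | true allele), errors spread evenly over the 3 other bases."""
    return 1.0 - error if observed == allele else error / 3.0

def genotype_posteriors(bases, quals, ref, alt, het_prior=1e-3):
    """Posterior over the diploid genotypes ref/ref, ref/alt and alt/alt.

    bases: observed base of each read at one site; quals: Phred qualities.
    het_prior is an assumed (hypothetical) prior for a heterozygous site.
    """
    genotypes = {"ref/ref": (ref, ref), "ref/alt": (ref, alt), "alt/alt": (alt, alt)}
    priors = {
        "ref/ref": 1.0 - 1.5 * het_prior,
        "ref/alt": het_prior,
        "alt/alt": het_prior / 2.0,
    }
    log_post = {}
    for name, (a1, a2) in genotypes.items():
        total = math.log(priors[name])
        for base, q in zip(bases, quals):
            # Phred quality to error probability, capped at a random guess.
            error = min(10.0 ** (-q / 10.0), 0.75)
            # Each read samples one of the two chromosomes with equal probability.
            p = 0.5 * base_likelihood(base, a1, error) \
                + 0.5 * base_likelihood(base, a2, error)
            # Independent errors: per-read terms multiply (add in log space).
            total += math.log(p)
        log_post[name] = total
    # Normalise in log space to avoid underflow at high depth.
    top = max(log_post.values())
    norm = sum(math.exp(v - top) for v in log_post.values())
    return {name: math.exp(v - top) / norm for name, v in log_post.items()}

if __name__ == "__main__":
    post = genotype_posteriors(list("AAAAAGGGGG"), [30] * 10, ref="A", alt="G")
    print(max(post, key=post.get))  # ref/alt
```

A model with correlated errors, such as the first-order Markov relation attributed to SAMtools above, would replace the simple sum of per-read log terms with a term that depends on the preceding read, so the two approaches can rank genotypes differently at the same site.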

4.7 TGS Tools


The rapid development of TGS technologies has enabled new applications of genome sequencing
[58]. PacBioToCA [65] provides a correction algorithm and assembly approach that uses short
reads to control the errors in PacBio RS reads; it can achieve 99.9% read correction accuracy.
However, the reads processed by PacBioToCA are shorter than the others. PBSIM [113] and SimLoRD
[143] are both TGS read simulators built by analyzing PacBio data. However, PBSIM is unable to
provide mapping information or alignments of the simulated reads. Poretools [92] is one of the first
toolkits designed for MinION data manipulation. Proovread [49] has improved the flexibility of
adapting to the available hardware by providing a hybrid correction pipeline. Sequencing data
manipulation is a vital task in TGS; PoreSeq [146] and poRe [152] are packages that let users
manipulate and visualize MinION sequencing data. However, PoreSeq requires multiple overlapping
reads when considering the final base calls and the sequence of events. Nanocall [26] is an open
source offline basecaller for Nanopore data, which has a significant impact on applications where
there is no stable internet access, such as the recent Ebola virus surveillance. A list of
TGS-specific bioinformatics tools and links is given in Table S8.
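
The idea behind hybrid correction, in which accurate short reads are used to repair an error-prone long read, can be illustrated with a small Python sketch. This is not the algorithm of PacBioToCA or proovread: it assumes the short reads are already aligned to the long read without gaps and corrects substitutions only, whereas real long-read errors are dominated by insertions and deletions. The function name and the `min_support` threshold are hypothetical.

```python
from collections import Counter

def correct_long_read(long_read, short_alignments, min_support=2):
    """Correct substitutions in a long read by short-read majority vote.

    short_alignments: list of (offset, short_read) pairs, where offset is the
    0-based position on the long read at which the short read starts.
    A base is replaced only when a strict majority of at least min_support
    short reads agree on the same base at that column.
    """
    columns = [Counter() for _ in long_read]
    for offset, short in short_alignments:
        for i, base in enumerate(short):
            pos = offset + i
            if 0 <= pos < len(long_read):
                columns[pos][base] += 1

    corrected = []
    for original, votes in zip(long_read, columns):
        if votes:
            base, count = votes.most_common(1)[0]
            if count >= min_support and 2 * count > sum(votes.values()):
                corrected.append(base)
                continue
        # Too little or conflicting evidence: keep the long-read base.
        corrected.append(original)
    return "".join(corrected)

if __name__ == "__main__":
    noisy = "ACGTTCGA"  # position 4 carries a substitution error
    shorts = [(2, "GTAC"), (3, "TACG"), (1, "CGTACG")]
    print(correct_long_read(noisy, shorts))  # ACGTACGA
```

Positions covered by fewer than `min_support` short reads are left untouched, which reflects why correction quality in hybrid pipelines depends on short-read coverage.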

5 DISCUSSION
Although high-throughput DNA sequencing technologies have proved to be a great success,
technical issues remain to be resolved. For instance, the NGS technologies
are known to contain sequencing biases that vary across sequencing platforms [103], arising
from the library construction settings, diverse error rates in different genomic regions, and
other systematic coverage biases. Ross et al. have therefore proposed methods for characterizing
and measuring the biases in sequencing data [129]. A highly repeatable deep sequencing study
indicates six-fold sequencing coverage variation across mtDNA [32]. For the TGS technologies,
the raw base calling error rate of the MinION was estimated to be 38.2%, which is much higher
than that of the existing NGS technologies. We therefore expect continuous improvements in DNA
sequencing technologies in the near future; for instance, Jain et al. have developed an SNV
detection tool for MinION that achieves a precision and recall of up to 99% [57].
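
To put the 38.2% raw error rate in perspective, it can be expressed on the Phred quality scale, Q = -10 log10(p), which is commonly used for base qualities. The short Python sketch below performs the conversion; the function names are illustrative, and Q30 is used only as a familiar reference point rather than as a figure taken from the studies cited here.

```python
import math

def error_to_phred(p_error):
    """Phred quality score for a per-base error probability."""
    return -10.0 * math.log10(p_error)

def phred_to_error(q):
    """Per-base error probability for a Phred quality score."""
    return 10.0 ** (-q / 10.0)

if __name__ == "__main__":
    # Error rate quoted in the text for raw MinION base calls.
    print(round(error_to_phred(0.382), 2))  # 4.18
    # Reference point: Q30 corresponds to one error in 1000 bases.
    print(phred_to_error(30))
```

An error rate of 38.2% therefore corresponds to roughly Q4, which illustrates why error correction and consensus approaches are needed before such reads can be used for variant detection.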
From the computational perspective, the wide adoption of high-throughput DNA sequencing
technologies imposes a big data challenge. Sequencing throughput has been increasing almost
tenfold every year, which is much faster than Moore’s Law. Therefore, Kahn suggested that the development
of hardware would lag behind the sequencing data generation [59]. Similarly, Stephens et al.
foresaw that sequencing technologies would move genome research in a big-data-driven
direction, similar to the modernization that has already taken strong effect in
astrophysics research [142].
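
The size of the gap between the two growth rates can be worked out with simple arithmetic. The Python sketch below compares a tenfold yearly increase, as stated in the text, with a doubling of hardware capacity every two years; the two-year doubling period is an assumption chosen for illustration, and the function names are hypothetical.

```python
def sequencing_growth(years, factor_per_year=10.0):
    """Fold increase in sequencing throughput after a number of years."""
    return factor_per_year ** years

def hardware_growth(years, doubling_period_years=2.0):
    """Fold increase in hardware capacity under an assumed doubling period."""
    return 2.0 ** (years / doubling_period_years)

def growth_gap(years, factor_per_year=10.0, doubling_period_years=2.0):
    """How many times faster data has grown than hardware over the period."""
    return sequencing_growth(years, factor_per_year) / hardware_growth(
        years, doubling_period_years
    )

if __name__ == "__main__":
    for years in (1, 2, 4):
        print(years, sequencing_growth(years), hardware_growth(years))
    print(growth_gap(4))  # 2500.0
```

Under these assumptions, four years of growth yield a 10,000-fold increase in data against a 4-fold increase in hardware capacity, which is the kind of widening gap the cited authors warn about.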
Nonetheless, the TGS technologies still hold tremendous potential. For instance, the long
sequencing reads of TGS can help us understand structural variations and close existing sequencing
gaps on genomes, since existing studies using NGS are limited by short read sizes [125]. In
addition, the mobility of the portable MinION (Oxford Nanopore Technologies, Oxford, UK) has
contributed to the recent real-time Ebola surveillance in Africa [122]. Therefore, the benefits of
TGS technologies still hold, although considerable effort has to be spent on sequencing error
correction.
From a broad perspective, the advances in DNA sequencing technologies can accelerate
the development of precision medicine (personalized medicine) solutions. In the past, genetic
information was seldom used for medical prescriptions. This situation is changing
because a human genome can now be sequenced for less than USD 1000 [142]. Human
genome information will enable us to care for each patient individually, so that
personalized medicine solutions can be developed.

6 ACKNOWLEDGEMENT
The authors would like to thank Sesugh Samuel Nder, Ajay Rajnikanth, Prashant Sridhar, Nikunj
Agarwal, Rui-Ci Lin, Aditya Kumar Kedia, and Stacy Lee for their careful manuscript proofreading.

7 FUNDING
The work described in this article was substantially supported by three grants from the Research
Grants Council of the Hong Kong Special Administrative Region [CityU 21200816], [CityU 11203217],
and [CityU 11200218]. The donation support of the Titan Xp GPU from the NVIDIA Corporation is
appreciated.

REFERENCES
[1] A. Abyzov, A. E. Urban, M. Snyder, and M. Gerstein. 2011. CNVnator: An approach to discover, genotype, and
characterize typical and atypical CNVs from family and population genome sequencing. Genome Research 21, 6 (jun
2011), 974–984. https://doi.org/10.1101/gr.114876.110
[2] C. A. Albers, G. Lunter, D. G. MacArthur, G. McVean, W. H. Ouwehand, and R. Durbin. 2011. Dindel: Accurate indel
calls from short-read data. Genome Research 21, 6 (jun 2011), 961–973. https://doi.org/10.1101/gr.112326.110
[3] Susan Aldridge, Brady Huggett, KS Jayaraman, Lisa Melton, Mark Ratner, and Nayanah Siva. 2008. 1000 Genomes
project. Nature Biotechnology 26, 3 (mar 2008), 256–256. https://doi.org/10.1038/nbt0308-256b
[4] Can Alkan, Jeffrey M Kidd, Tomas Marques-Bonet, Gozde Aksay, Francesca Antonacci, Fereydoun Hormozdiari,
Jacob O Kitzman, Carl Baker, Maika Malig, Onur Mutlu, S Cenk Sahinalp, Richard A Gibbs, and Evan E Eichler. 2009.
Personalized copy number and segmental duplication maps using next-generation sequencing. Nature Genetics 41, 10
(oct 2009), 1061–1067. https://doi.org/10.1038/ng.437
[5] Stephen F Altschul, Warren Gish, Webb Miller, Eugene W Myers, and David J Lipman. 1990. Basic local alignment
search tool. Journal of molecular biology 215, 3 (1990), 403–410.
[6] Riyue Bao, Lei Huang, Jorge Andrade, Wei Tan, Warren A Kibbe, Hongmei Jiang, and Gang Feng. 2014. Review of
current methods, applications, and data management for the bioinformatics analysis of whole exome sequencing.
Cancer informatics (2014), 67–83.
[7] Robert W Bauman. 2013. Microbiology with diseases by taxonomy. Pearson Higher Ed.
[8] S Bennett. 2004. Solexa Ltd. Pharmacogenomics 5 (2004), 433–438.
[9] James K Bonfield. 2014. The Scramble conversion tool. Bioinformatics (2014), btu390.
[10] Jayson Bowers, Judith Mitchell, Eric Beer, Philip R Buzby, Marie Causey, J William Efcavitch, Mirna Jarosz, Edyta
Krzymanska-Olejnik, Li Kung, Doron Lipson, et al. 2009. Virtual terminator nucleotides for next-generation DNA
sequencing. Nature methods 6, 8 (2009), 593–595.

[11] Ido Braslavsky, Benedict Hebert, Emil Kartalov, and Stephen R Quake. 2003. Sequence information can be obtained
from single DNA molecules. Proceedings of the National Academy of Sciences 100, 7 (2003), 3960–3964.
[12] William Brockman, Pablo Alvarez, Sarah Young, Manuel Garber, Georgia Giannoukos, William L Lee, Carsten Russ,
Eric S Lander, Chad Nusbaum, and David B Jaffe. 2008. Quality scores and SNP detection in sequencing-by-synthesis
systems. Genome research 18, 5 (may 2008), 763–70. https://doi.org/10.1101/gr.070227.107
[13] Yana Bromberg and Burkhard Rost. 2007. SNAP: predict effect of non-synonymous polymorphisms on function.
Nucleic Acids Research 35, 11 (2007), 3823–3835. https://doi.org/10.1093/nar/gkm238
[14] Tim Carver, Simon R. Harris, Thomas D. Otto, Matthew Berriman, Julian Parkhill, and Jacqueline A. McQuillan. 2013.
BamView: visualizing and interpretation of next-generation sequencing read alignments. Briefings in Bioinformatics
14, 2 (2013), 203–212. https://doi.org/10.1093/bib/bbr073
[15] Ken Chen, John W Wallis, Michael D McLellan, David E Larson, Joelle M Kalicki, Craig S Pohl, Sean D McGrath,
Michael C Wendl, Qunyuan Zhang, Devin P Locke, Xiaoqi Shi, Robert S Fulton, Timothy J Ley, Richard K Wilson,
Li Ding, and Elaine R Mardis. 2009. BreakDancer: an algorithm for high-resolution mapping of genomic structural
variation. Nature Methods 6, 9 (sep 2009), 677–681. https://doi.org/10.1038/nmeth.1363
[16] A. Y. Cheng, Y.-Y. Teo, and R. T.-H. Ong. 2014. Assessing single nucleotide variant detection and genotype calling
on whole-genome sequenced individuals. Bioinformatics 30, 12 (jun 2014), 1707–1713. https://doi.org/10.1093/bioinformatics/btu067
[17] Bastien Chevreux, Thomas Pfisterer, Bernd Drescher, Albert J Driesel, Werner E G Müller, Thomas Wetter, and Sándor
Suhai. 2004. Using the miraEST assembler for reliable and automated mRNA transcript assembly and SNP detection
in sequenced ESTs. Genome research 14, 6 (jun 2004), 1147–59. https://doi.org/10.1101/gr.1917404
[18] Chen-Shan Chin, Jon Sorenson, Jason B Harris, William P Robins, Richelle C Charles, Roger R Jean-Charles, James
Bullard, Dale R Webster, Andrew Kasarskis, Paul Peluso, et al. 2011. The origin of the Haitian cholera outbreak strain.
New England Journal of Medicine 364, 1 (2011), 33–42.
[19] R. H. Chung, W. Y. Tsai, C. Y. Kang, P. J. Yao, H. J. Tsai, and C. H. Chen. 2016. FamPipe: An Automatic Analysis
Pipeline for Analyzing Sequencing Data in Families for Disease Studies. PLoS Comput. Biol. 12, 6 (Jun 2016), e1004980.
[20] Kristian Cibulskis, Michael S Lawrence, Scott L Carter, Andrey Sivachenko, David Jaffe, Carrie Sougnez, Stacey Gabriel,
Matthew Meyerson, Eric S Lander, and Gad Getz. 2013. Sensitive detection of somatic point mutations in impure and
heterogeneous cancer samples. Nature Biotechnology 31, 3 (feb 2013), 213–219. https://doi.org/10.1038/nbt.2514
[21] James Clarke, Hai-Chen Wu, Lakmal Jayasinghe, Alpesh Patel, Stuart Reid, and Hagan Bayley. 2009. Continuous base
identification for single-molecule nanopore DNA sequencing. Nature nanotechnology 4, 4 (2009), 265–270.
[22] Peter JA Cock, Christopher J Fields, Naohisa Goto, Michael L Heuer, and Peter M Rice. 2010. The Sanger FASTQ file
format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic acids research 38, 6 (2010),
1767–1771.
[23] ENCODE Project Consortium et al. 2004. The ENCODE (ENCyclopedia of DNA elements) project. Science 306, 5696
(2004), 636–640.
[24] David Cyranoski. 2016. China’s bid to be a DNA superpower. Nature 534, 7608 (2016), 462–463.
[25] Petr Danecek, Adam Auton, Goncalo Abecasis, Cornelis A Albers, Eric Banks, Mark A DePristo, Robert E Handsaker,
Gerton Lunter, Gabor T Marth, Stephen T Sherry, et al. 2011. The variant call format and VCFtools. Bioinformatics 27,
15 (2011), 2156–2158.
[26] Matei David, Lewis Jonathan Dursi, Delia Yao, Paul C Boutros, and Jared T Simpson. 2016. Nanocall: an open source
basecaller for Oxford Nanopore sequencing data. Bioinformatics (2016), btw569.
[27] Cees Dekker. 2007. Solid-state nanopores. Nature nanotechnology 2, 4 (2007), 209–215.
[28] Mark A DePristo, Eric Banks, Ryan Poplin, Kiran V Garimella, Jared R Maguire, Christopher Hartl, Anthony A
Philippakis, Guillermo del Angel, Manuel A Rivas, Matt Hanna, Aaron McKenna, Tim J Fennell, Andrew M Kernytsky,
Andrey Y Sivachenko, Kristian Cibulskis, Stacey B Gabriel, David Altshuler, and Mark J Daly. 2011. A framework for
variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics 43, 5 (apr 2011),
491–498. https://doi.org/10.1038/ng.806
[29] Aarti Desai, Veer Singh Marwah, Akshay Yadav, Vineet Jha, Kishor Dhaygude, Ujwala Bangar, Vivek Kulkarni, and
Abhay Jere. 2013. Identification of optimum sequencing depth especially for de novo genome assembly of small
genomes using next generation sequencing data. PloS one 8, 4 (2013), e60204.
[30] John Eid, Adrian Fehr, Jeremy Gray, Khai Luong, John Lyle, Geoff Otto, Paul Peluso, David Rank, Primo Baybayan,
Brad Bettman, et al. 2009. Real-time DNA sequencing from single polymerase molecules. Science 323, 5910 (2009),
133–138.
[31] Michael Eisenstein. 2012. The battle for sequencing supremacy. Nature biotechnology 30, 11 (2012), 1023.
[32] R. Ekblom, L. Smeds, and H. Ellegren. 2014. Patterns of sequencing coverage bias revealed by ultra-deep sequencing
of vertebrate mitochondria. BMC Genomics 15 (2014), 467.

[33] Brent Ewing and Phil Green. 1998. Base-calling of automated sequencer traces using phred. II. Error probabilities.
Genome research 8, 3 (1998), 186–194.
[34] Brent Ewing, LaDeana Hillier, Michael C Wendl, and Phil Green. 1998. Base-calling of automated sequencer traces
using phred. I. Accuracy assessment. Genome research 8, 3 (1998), 175–185.
[35] Gregory G Faust and Ira M Hall. 2014. SAMBLASTER: fast duplicate marking and structural variant read extraction.
Bioinformatics (Oxford, England) 30, 17 (sep 2014), 2503–5. https://doi.org/10.1093/bioinformatics/btu314
[36] Y Fei. 2014. DNA sequencing, sanger and next-generation sequencing. Applications of Molecular genetics in Personalized
Medicine. USA: OMICS Group eBooks (2014).
[37] Nowlan H. Freese, David C. Norris, and Ann E. Loraine. 2016. Integrated genome browser: visual analytics platform
for genomics. Bioinformatics 32, 14 (jul 2016), 2089–2095. https://doi.org/10.1093/bioinformatics/btw069
[38] Huanying Ge, Kejun Liu, Todd Juan, Fang Fang, Matthew Newman, and Wolfgang Hoeck. 2011. FusionMap: detecting
fusion genes from next-generation sequencing data at base-pair resolution. Bioinformatics (Oxford, England) 27, 14
(jul 2011), 1922–8. https://doi.org/10.1093/bioinformatics/btr310
[39] Lewis Y Geer, Aron Marchler-Bauer, Renata C Geer, Lianyi Han, Jane He, Siqian He, Chunlei Liu, Wenyao Shi, and
Stephen H Bryant. 2009. The NCBI biosystems database. Nucleic acids research (2009), gkp858.
[40] André Gilles, Emese Meglécz, Nicolas Pech, Stéphanie Ferreira, Thibaut Malausa, and Jean-François Martin. 2011.
Accuracy and quality assessment of 454 GS-FLX Titanium pyrosequencing. BMC genomics 12, 1 (2011), 245.
[41] Travis C Glenn. 2011. Field guide to next-generation DNA sequencers. Molecular ecology resources 11, 5 (2011),
759–769.
[42] Sante Gnerre, Iain MacCallum, Dariusz Przybylski, Filipe J. Ribeiro, Joshua N. Burton, Bruce J. Walker, Ted Sharpe,
Giles Hall, Terrance P. Shea, Sean Sykes, Aaron M. Berlin, Daniel Aird, Maura Costello, Riza Daza, Louise Williams,
Robert Nicol, Andreas Gnirke, Chad Nusbaum, Eric S. Lander, and David B. Jaffe. 2011. High-quality draft assemblies
of mammalian genomes from massively parallel sequence data. Proceedings of the National Academy of Sciences 108, 4
(2011), 1513–1518. https://doi.org/10.1073/PNAS.1017351108
[43] Sara Goodwin, John D McPherson, and W Richard McCombie. 2016. Coming of age: ten years of next-generation
sequencing technologies. Nature Reviews Genetics 17, 6 (2016), 333–351.
[44] Anthony JF Griffiths, Jeffrey H Miller, David T Suzuki, Richard C Lewontin, and William M Gelbart. 2000. Somatic
versus germinal mutation. In An Introduction to Genetic Analysis (7th ed.). W. H. Freeman.
[45] SAM/BAM Format Specification Working Group et al. 2013. Sequence alignment/map format specification. Available
at https://github.com/samtools/hts-specs (2013).
[46] Y. Guo, X. Ding, Y. Shen, G. J. Lyon, and K. Wang. 2015. SeqMule: automated pipeline for analysis of human
exome/genome sequencing data. Sci Rep 5 (Sep 2015), 14283.
[47] Ivo Glynne Gut. 2013. New sequencing technologies. Clinical and Translational Oncology 15, 11 (2013), 879–881.
[48] G. Ha, A. Roth, D. Lai, A. Bashashati, J. Ding, R. Goya, R. Giuliany, J. Rosner, A. Oloumi, K. Shumansky, S.-F. Chin, G.
Turashvili, M. Hirst, C. Caldas, M. A. Marra, S. Aparicio, and S. P. Shah. 2012. Integrative analysis of genome-wide loss
of heterozygosity and monoallelic expression at nucleotide resolution reveals disrupted pathways in triple-negative
breast cancer. Genome Research 22, 10 (oct 2012), 1995–2007. https://doi.org/10.1101/gr.137570.112
[49] Thomas Hackl, Rainer Hedrich, Jörg Schultz, and Frank Förster. 2014. proovread: large-scale high-accuracy PacBio
correction through iterative short read consensus. Bioinformatics 30, 21 (2014), 3004–3011.
[50] Timothy D Harris, Phillip R Buzby, Hazen Babcock, Eric Beer, Jayson Bowers, Ido Braslavsky, Marie Causey, Jennifer
Colonell, James DiMeo, J William Efcavitch, et al. 2008. Single-molecule DNA sequencing of a viral genome. Science
320, 5872 (2008), 106–109.
[51] V. J. Henry, A. E. Bandrowski, A. S. Pepin, B. J. Gonzalez, and A. Desfeux. 2014. OMICtools: an informative directory
for multi-omic data analysis. Database (Oxford) 2014 (2014).
[52] Eran Hodis, Ian R. Watson, Gregory V. Kryukov, Stefan T. Arold, Marcin Imielinski, Jean-Philippe Theurillat, Elizabeth
Nickerson, Daniel Auclair, Liren Li, Chelsea Place, Daniel DiCara, Alex H. Ramos, Michael S. Lawrence, Kristian
Cibulskis, Andrey Sivachenko, Douglas Voet, Gordon Saksena, Nicolas Stransky, Robert C. Onofrio, Wendy Winckler,
Kristin Ardlie, Nikhil Wagle, Jennifer Wargo, Kelly Chong, Donald L. Morton, Katherine Stemke-Hale, Guo Chen,
Michael Noble, Matthew Meyerson, John E. Ladbury, Michael A. Davies, Jeffrey E. Gershenwald, Stephan N. Wagner,
Dave S.B. Hoon, Dirk Schadendorf, Eric S. Lander, Stacey B. Gabriel, Gad Getz, Levi A. Garraway, and Lynda Chin. 2012.
A Landscape of Driver Mutations in Melanoma. Cell 150, 2 (2012), 251–263. https://doi.org/10.1016/j.cell.2012.06.024
[53] Mark Hollmer. 2013. Roche to close 454 Life Sciences as it reduces gene sequencing focus. Available
at http://www.fiercebiotech.com/medical-devices/roche-to-close-454-life-sciences-as-it-reduces-gene-sequencing-focus
(2013).
[54] Inc Illumina. 2008. Sequencing Analysis Software User Guide for Pipeline version 1.3 and CASAVA version 1.0
Illumina Inc. San Diego, CA, USA (2008).

[55] Zamin Iqbal, Mario Caccamo, Isaac Turner, Paul Flicek, and Gil McVean. 2012. De novo assembly and genotyping of
variants using colored de Bruijn graphs. Nature Genetics 44, 2 (jan 2012), 226–232. https://doi.org/10.1038/ng.1028
[56] Miten Jain, Ian T Fiddes, Karen H Miga, Hugh E Olsen, Benedict Paten, and Mark Akeson. 2015. Improved data
analysis for the MinION nanopore sequencer. Nature methods 12, 4 (2015), 351–356.
[57] Miten Jain, Ian T Fiddes, Karen H Miga, Hugh E Olsen, Benedict Paten, and Mark Akeson. 2015. Improved data analysis
for the MinION nanopore sequencer. Nature Methods 12, 4 (feb 2015), 351–356. https://doi.org/10.1038/nmeth.3290
[58] Miten Jain, Hugh E Olsen, Benedict Paten, and Mark Akeson. 2016. The Oxford Nanopore MinION: delivery of
nanopore sequencing to the genomics community. Genome Biology 17, 1 (2016), 239.
[59] Scott D Kahn. 2011. On the Future of Genomic Data. Science 331, 6018 (2011), 728–729. https://doi.org/10.1126/science.1197891
[60] John J Kasianowicz, Eric Brandin, Daniel Branton, and David W Deamer. 1996. Characterization of individual
polynucleotide molecules using a membrane channel. Proceedings of the National Academy of Sciences 93, 24 (1996),
13770–13773.
[61] W James Kent. 2002. BLAT-the BLAST-like alignment tool. Genome research 12, 4 (2002), 656–664.
[62] Daniel C. Koboldt, David E. Larson, and Richard K. Wilson. 2013.
Using VarScan 2 for Germline Variant Calling and Somatic Mutation Detection. In Current Protocols in Bioinformatics.
John Wiley and Sons, Inc., Hoboken, NJ, USA, 15.4.1–15.4.17. https://doi.org/10.1002/0471250953.bi1504s44
[63] Daniel C Koboldt, Karyn Meltz Steinberg, David E Larson, Richard K Wilson, and Elaine R Mardis. 2013. The
next-generation sequencing revolution and its impact on genomics. Cell 155, 1 (2013), 27–38.
[64] Jan O Korbel, Alexej Abyzov, Xinmeng Mu, Nicholas Carriero, Philip Cayting, Zhengdong Zhang, Michael Snyder,
and Mark B Gerstein. 2009. PEMer: a
computational framework with simulation-based error models for inferring genomic structural variants from massive
paired-end sequencing data. Genome Biology 10, 2 (2009), R23. https://doi.org/10.1186/gb-2009-10-2-r23
[65] Sergey Koren, Michael C Schatz, Brian P Walenz, Jeffrey Martin, Jason T Howard, Ganeshkumar Ganapathy, Zhong
Wang, David A Rasko, W Richard McCombie, Erich D Jarvis, et al. 2012. Hybrid error correction and de novo assembly
of single-molecule sequencing reads. Nature biotechnology 30, 7 (2012), 693–700.
[66] Hugo Y K Lam, Xinmeng Jasmine Mu, Adrian M Stütz, Andrea Tanzer, Philip D Cayting, Michael Snyder, Philip M
Kim, Jan O Korbel, and Mark B Gerstein. 2010. Nucleotide-resolution analysis of structural variants using BreakSeq
and a breakpoint library. Nature Biotechnology 28, 1 (jan 2010), 47–55. https://doi.org/10.1038/nbt.1600
[67] Ben Langmead, Cole Trapnell, Mihai Pop, Steven L Salzberg, et al. 2009. Ultrafast and memory-efficient alignment of
short DNA sequences to the human genome. Genome biol 10, 3 (2009), R25.
[68] Ben Langmead, Cole Trapnell, Mihai Pop, and Steven L Salzberg. 2009. Ultrafast and memory-efficient alignment
of short DNA sequences to the
human genome. Genome Biology 10, 3 (2009), R25. https://doi.org/10.1186/gb-2009-10-3-r25
[69] Ilkka Lappalainen, Jeff Almeida-King, Vasudev Kumanduri, Alexander Senf, John Dylan Spalding, Gary Saunders,
Jag Kandasamy, Mario Caccamo, Rasko Leinonen, Brendan Vaughan, et al. 2015. The European Genome-phenome
Archive of human data consented for biomedical research. Nature genetics 47, 7 (2015), 692–695.
[70] David E. Larson, Christopher C. Harris, Ken Chen, Daniel C. Koboldt, Travis E. Abbott, David J. Dooling, Timothy J.
Ley, Elaine R. Mardis, Richard K. Wilson, and Li Ding. 2012. SomaticSniper: identification of somatic point mutations in
whole genome sequencing data. Bioinformatics 28, 3 (feb 2012), 311–7. https://doi.org/10.1093/bioinformatics/btr665
[71] Michael S. Lawrence, Petar Stojanov, Paz Polak, Gregory V. Kryukov, Kristian Cibulskis, Andrey Sivachenko, Scott L.
Carter, Chip Stewart, Craig H. Mermel, Steven A. Roberts, Adam Kiezun, Peter S. Hammerman, Aaron McKenna,
Yotam Drier, Lihua Zou, Alex H. Ramos, Trevor J. Pugh, Nicolas Stransky, Elena Helman, Jaegil Kim, Carrie Sougnez,
Lauren Ambrogio, Elizabeth Nickerson, Erica Shefler, Maria L. Cortés, Daniel Auclair, Gordon Saksena, Douglas
Voet, Michael Noble, Daniel DiCara, Pei Lin, Lee Lichtenstein, David I. Heiman, Timothy Fennell, Marcin Imielinski,
Bryan Hernandez, Eran Hodis, Sylvan Baca, Austin M. Dulak, Jens Lohr, Dan-Avi Landau, Catherine J. Wu, Jorge
Melendez-Zajgla, Alfredo Hidalgo-Miranda, Amnon Koren, Steven A. McCarroll, Jaume Mora, Ryan S. Lee, Brian
Crompton, Robert Onofrio, Melissa Parkin, Wendy Winckler, Kristin Ardlie, Stacey B. Gabriel, Charles W. M. Roberts,
Jaclyn A. Biegel, Kimberly Stegmaier, Adam J. Bass, Levi A. Garraway, Matthew Meyerson, Todd R. Golub, Dmitry A.
Gordenin, Shamil Sunyaev, Eric S. Lander, and Gad Getz. 2013. Mutational heterogeneity in cancer and the search for
new cancer-associated genes. Nature 499, 7457 (jun 2013), 214–218. https://doi.org/10.1038/nature12213
[72] Seunghak Lee, Fereydoun Hormozdiari, Can Alkan, and Michael Brudno. 2009. MoDIL: detecting small indels from
clone-end sequencing with mixtures of distributions. Nature Methods 6, 7 (jul 2009), 473–474. https://doi.org/10.1038/nmeth.f.256

[73] R. Leinonen, H. Sugawara, and M. Shumway. 2010. The Sequence Read Archive. Nucleic Acids Research 39, Database
(jan 2010), D19–D21. https://doi.org/10.1093/nar/gkq1019
[74] R. Leinonen, H. Sugawara, and M. Shumway. 2011. The Sequence Read Archive. Nucleic Acids Research 39, Database
(jan 2011), D19–D21. https://doi.org/10.1093/nar/gkq1019
[75] Michael J Levene, Jonas Korlach, Stephen W Turner, Mathieu Foquet, Harold G Craighead, and Watt W Webb. 2003.
Zero-mode waveguides for single-molecule analysis at high concentrations. Science 299, 5607 (2003), 682–686.
[76] Heng Li. 2011. A statistical framework for SNP calling, mutation discovery, association mapping and population
genetical parameter estimation from sequencing data. Bioinformatics 27, 21 (2011), 2987–2993. https://doi.org/10.1093/bioinformatics/btr509
[77] Heng Li and Richard Durbin. 2009. Fast and accurate short read alignment with Burrows–Wheeler transform.
Bioinformatics 25, 14 (2009), 1754–1760.
[78] Heng Li and Richard Durbin. 2009. Fast and accurate short read alignment with Burrows-Wheeler transform.
Bioinformatics (Oxford, England) 25, 14 (jul 2009), 1754–60. https://doi.org/10.1093/bioinformatics/btp324
[79] Heng Li, Bob Handsaker, Alec Wysoker, Tim Fennell, Jue Ruan, Nils Homer, Gabor Marth, Goncalo Abecasis, Richard
Durbin, et al. 2009. The sequence alignment/map format and SAMtools. Bioinformatics 25, 16 (2009), 2078–2079.
[80] Heng Li, Bob Handsaker, Alec Wysoker, Tim Fennell, Jue Ruan, Nils Homer, Gabor Marth, Goncalo Abecasis,
Richard Durbin, and 1000 Genome Project Data Processing Subgroup. 2009.
The Sequence Alignment/Map format and SAMtools. Bioinformatics (Oxford, England) 25, 16 (aug 2009), 2078–9.
https://doi.org/10.1093/bioinformatics/btp352
[81] Heng Li, Jue Ruan, and Richard Durbin. 2008. Mapping short DNA sequencing reads and calling variants using
mapping quality scores. Genome research 18, 11 (2008), 1851–1858.
[82] Heng Li, Jue Ruan, and Richard Durbin. 2008. Mapping short DNA sequencing reads and calling variants using
mapping quality scores. Genome research 18, 11 (nov 2008), 1851–8. https://doi.org/10.1101/gr.078212.108
[83] Jian Li, Aarif Mohamed Nazeer Batcha, Björn Grüning, and Ulrich R Mansmann. 2015. An NGS Workflow Blueprint
for DNA Sequencing Data and Its Application in Individualized Molecular Oncology. Cancer informatics 14, Suppl 5
(2015), 87.
[84] Jiali Li, Derek Stein, Ciaran McMullan, Daniel Branton, Michael J Aziz, and Jene A Golovchenko. 2001. Ion-beam
sculpting at nanometre length scales. Nature 412, 6843 (2001), 166–169.
[85] M. Li, Magnus Nordborg, and Lei M. Li. 2004. Adjust quality scores from alignment and improve sequencing accuracy.
Nucleic Acids Research 32, 17 (sep 2004), 5183–5191. https://doi.org/10.1093/nar/gkh850
[86] Ruiqiang Li, Wei Fan, Geng Tian, Hongmei Zhu, Lin He, Jing Cai, Quanfei Huang, Qingle Cai, Bo Li, Yinqi Bai, Zhihe
Zhang, Yaping Zhang, Wen Wang, Jun Li, Fuwen Wei, Heng Li, Min Jian, Jianwen Li, Zhaolei Zhang, Rasmus Nielsen,
Dawei Li, Wanjun Gu, Zhentao Yang, Zhaoling Xuan, Oliver A. Ryder, Frederick Chi-Ching Leung, Yan Zhou, Jianjun
Cao, Xiao Sun, Yonggui Fu, Xiaodong Fang, Xiaosen Guo, Bo Wang, Rong Hou, Fujun Shen, Bo Mu, Peixiang Ni,
Runmao Lin, Wubin Qian, Guodong Wang, Chang Yu, Wenhui Nie, Jinhuan Wang, Zhigang Wu, Huiqing Liang,
Jiumeng Min, Qi Wu, Shifeng Cheng, Jue Ruan, Mingwei Wang, Zhongbin Shi, Ming Wen, Binghang Liu, Xiaoli
Ren, Huisong Zheng, Dong Dong, Kathleen Cook, Gao Shan, Hao Zhang, Carolin Kosiol, Xueying Xie, Zuhong Lu,
Hancheng Zheng, Yingrui Li, Cynthia C. Steiner, Tommy Tsan-Yuk Lam, Siyuan Lin, Qinghui Zhang, Guoqing Li, Jing
Tian, Timing Gong, Hongde Liu, Dejin Zhang, Lin Fang, Chen Ye, Juanbin Zhang, Wenbo Hu, Anlong Xu, Yuanyuan
Ren, Guojie Zhang, Michael W. Bruford, Qibin Li, Lijia Ma, Yiran Guo, Na An, Yujie Hu, Yang Zheng, Yongyong Shi,
Zhiqiang Li, Qing Liu, Yanling Chen, Jing Zhao, Ning Qu, Shancen Zhao, Feng Tian, Xiaoling Wang, Haiyin Wang,
Lizhi Xu, Xiao Liu, Tomas Vinar, Yajun Wang, Tak-Wah Lam, Siu-Ming Yiu, Shiping Liu, Hemin Zhang, Desheng
Li, Yan Huang, Xia Wang, Guohua Yang, Zhi Jiang, Junyi Wang, Nan Qin, Li Li, Jingxiang Li, Lars Bolund, Karsten
Kristiansen, Gane Ka-Shu Wong, Maynard Olson, Xiuqing Zhang, Songgang Li, Huanming Yang, Jian Wang, and Jun
Wang. 2010. The sequence and de novo assembly of the giant panda genome. Nature 463, 7279 (jan 2010), 311–317.
https://doi.org/10.1038/nature08696
[87] R. Li, Y. Li, X. Fang, H. Yang, J. Wang, K. Kristiansen, and J. Wang. 2009. SNP detection for massively parallel
whole-genome resequencing. Genome Research 19, 6 (jun 2009), 1124–1132. https://doi.org/10.1101/gr.088013.108
[88] Ruiqiang Li, Hongmei Zhu, Jue Ruan, Wubin Qian, Xiaodong Fang, Zhongbin Shi, Yingrui Li, Shengting Li, Gao Shan,
Karsten Kristiansen, Songgang Li, Huanming Yang, Jian Wang, and Jun Wang. 2009. De novo assembly of human
genomes with massively parallel short read sequencing. Genome Research (2009). https://doi.org/10.1101/GR.097261.109
[89] Lin Liu, Yinhu Li, Siliang Li, Ni Hu, Yimin He, Ray Pong, Danni Lin, Lihua Lu, and Maggie Law. 2012. Comparison of
next-generation sequencing systems. BioMed Research International 2012 (2012).
[90] Yongchao Liu, Bernt Popp, and Bertil Schmidt. 2014. CUSHAW3: Sensitive
and Accurate Base-Space and Color-Space Short-Read Alignment with Hybrid Seeding. PLoS ONE 9, 1 (jan 2014),
e86869. https://doi.org/10.1371/journal.pone.0086869
[91] Po-Ru Loh, Michael Baym, and Bonnie Berger. 2012. Compressive genomics. Nature biotechnology 30, 7 (2012),
627–630.
[92] Nicholas J Loman and Aaron R Quinlan. 2014. Poretools: a toolkit for analyzing nanopore sequence data. Bioinformatics
30, 23 (2014), 3399–3401.
[93] G. Lunter and M. Goodson. 2011. Stampy: A statistical algorithm for sensitive and fast mapping of Illumina sequence
reads. Genome Research 21, 6 (jun 2011), 936–939. https://doi.org/10.1101/gr.111120.110
[94] P. L. Luu, D. Gerovska, M. Arrospide-Elgarresta, S. Retegi-Carrion, H. R. Scholer, and M. J. Arauzo-Bravo. 2017.
P3BSseq: parallel processing pipeline software for automatic analysis of bisulfite sequencing data. Bioinformatics 33,
3 (Feb 2017), 428–431.
[95] Elaine R Mardis. 2008. The impact of next-generation sequencing technology on genetics. Trends in genetics 24, 3
(2008), 133–141.
[96] Elaine R Mardis. 2011. A decade’s perspective on DNA sequencing technology. Nature 470, 7333 (2011), 198–203.
[97] Elaine R Mardis. 2013. Next-generation sequencing platforms. Annual review of analytical chemistry 6 (2013), 287–303.
[98] Marcel Margulies, Michael Egholm, William E Altman, Said Attiya, Joel S Bader, Lisa A Bemben, Jan Berka, Michael S
Braverman, Yi-Ju Chen, Zhoutao Chen, et al. 2005. Genome sequencing in microfabricated high-density picolitre
reactors. Nature 437, 7057 (2005), 376–380.
[99] Allan M Maxam and Walter Gilbert. 1977. A new method for sequencing DNA. Proceedings of the National Academy
of Sciences 74, 2 (1977), 560–564.
[100] A. McKenna, M. Hanna, E. Banks, A. Sivachenko, K. Cibulskis, A. Kernytsky, K. Garimella, D. Altshuler, S. Gabriel, M.
Daly, and M. A. DePristo. 2010. The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation
DNA sequencing data. Genome Research 20, 9 (sep 2010), 1297–1303. https://doi.org/10.1101/gr.107524.110
[101] Alexander Mellmann, Dag Harmsen, Craig A Cummings, Emily B Zentz, Shana R Leopold, Alain Rico, Karola Prior,
Rafael Szczepanowski, Yongmei Ji, Wenlan Zhang, et al. 2011. Prospective genomic characterization of the German
enterohemorrhagic Escherichia coli O104:H4 outbreak by rapid next generation sequencing technology. PloS one 6, 7
(2011), e22751.
[102] BIG Data Center Members. 2017. The BIG Data Center: from deposition to integration to translation. Nucleic Acids
Research 45, Database issue (2017), D18.
[103] C. A. Meyer and X. S. Liu. 2014. Identifying and mitigating bias in next-generation sequencing methods for chromatin
biology. Nat. Rev. Genet. 15, 11 (Nov 2014), 709–721.
[104] Huaiyu Mi, Sagar Poudel, Anushya Muruganujan, John T. Casagrande, and Paul D. Thomas. 2016. PANTHER version
10: expanded protein families and functions, and analysis tools. Nucleic Acids Research 44, D1 (jan 2016), D336–D342.
https://doi.org/10.1093/nar/gkv1194
[105] Jason R. Miller, Sergey Koren, and Granger Sutton. 2010. Assembly algorithms for next-generation sequencing data.
Genomics 95, 6 (2010), 315–327. https://doi.org/10.1016/j.ygeno.2010.03.001
[106] Iain Milne, Gordon Stephen, Micha Bayer, Peter J A Cock, Leighton Pritchard, Linda Cardle, Paul D Shaw, and David
Marshall. 2013. Using Tablet for visual exploration of second-generation sequencing data. Briefings in bioinformatics
14, 2 (mar 2013), 193–202. https://doi.org/10.1093/bib/bbs012
[107] S. B. Montgomery, D. L. Goode, E. Kvikstad, C. A. Albers, Z. D. Zhang, X. J. Mu, G. Ananda, B. Howie, K. J. Karczewski,
K. S. Smith, V. Anaya, R. Richardson, J. Davis, D. G. MacArthur, A. Sidow, L. Duret, M. Gerstein, K. D. Makova, J.
Marchini, G. McVean, and G. Lunter. 2013. The origin, evolution, and functional impact of short
insertion-deletion variants identified in 179 human genomes. Genome Research 23, 5 (may 2013), 749–761.
https://doi.org/10.1101/gr.148718.112
[108] Elizabeth P. Murchison, Ole B. Schulz-Trieglaff, Zemin Ning, Ludmil B. Alexandrov, Markus J. Bauer, Beiyuan
Fu, Matthew Hims, Zhihao Ding, Sergii Ivakhno, Caitlin Stewart, Bee Ling Ng, Wendy Wong, Bronwen Aken, Simon
White, Amber Alsop, Jennifer Becq, Graham R. Bignell, R. Keira Cheetham, William Cheng, Thomas R. Connor,
Anthony J. Cox, Zhi-Ping Feng, Yong Gu, Russell J. Grocock, Simon R. Harris, Irina Khrebtukova, Zoya Kingsbury,
Mark Kowarsky, Alexandre Kreiss, Shujun Luo, John Marshall, David J. McBride, Lisa Murray, Anne-Maree Pearse,
Keiran Raine, Isabelle Rasolonjatovo, Richard Shaw, Philip Tedder, Carolyn Tregidgo, Albert J. Vilella, David C.
Wedge, Gregory M. Woods, Niall Gormley, Sean Humphray, Gary Schroth, Geoffrey Smith, Kevin Hall, Stephen M.J.
Searle, Nigel P. Carter, Anthony T. Papenfuss, P. Andrew Futreal, Peter J. Campbell, Fengtang Yang, David R.
Bentley, Dirk J. Evers, and Michael R. Stratton. 2012. Genome Sequencing and Analysis of the Tasmanian Devil
and Its Transmissible Cancer. Cell 148, 4 (2012), 780–791. https://doi.org/10.1016/j.cell.2011.11.065
[109] Joseph A Neuman, Ofer Isakov, and Noam Shomron. 2013. Analysis of insertion-deletion from deep-sequencing data:
software evaluation for optimal detection. Briefings in bioinformatics 14, 1 (jan 2013), 46–55.
https://doi.org/10.1093/bib/bbs013

[110] Thomas P Niedringhaus, Denitsa Milanova, Matthew B Kerby, Michael P Snyder, and Annelise E Barron. 2011.
Landscape of next-generation sequencing technologies. Analytical chemistry 83, 12 (2011), 4327–4341.
[111] Beifang Niu, Limin Fu, Shulei Sun, and Weizhong Li. 2010. Artificial and natural duplicates in pyrosequencing reads of
metagenomic data. BMC Bioinformatics 11, 1 (2010), 187. https://doi.org/10.1186/1471-2105-11-187
[112] Jeongsu Oh, Byung Kwon Kim, Wan-Sup Cho, Soon Gyu Hong, and Kyung Mo Kim. 2012. PyroTrimmer: a software
with GUI for pre-processing 454 amplicon sequences. Journal of Microbiology 50, 5 (oct 2012), 766–769.
https://doi.org/10.1007/s12275-012-2494-6
[113] Yukiteru Ono, Kiyoshi Asai, and Michiaki Hamada. 2013. PBSIM: PacBio reads simulator toward accurate genome
assembly. Bioinformatics 29, 1 (2013), 119–121.
[114] Fatih Ozsolak. 2012. Third-generation sequencing techniques and applications to drug discovery. Expert opinion on
drug discovery 7, 3 (2012), 231–243.
[115] Fatih Ozsolak, Philipp Kapranov, Sylvain Foissac, Sang Woo Kim, Elane Fishilevich, A Paula Monaghan, Bino John,
and Patrice M Milos. 2010. Comprehensive polyadenylation site maps in yeast and human reveal pervasive alternative
polyadenylation. Cell 143, 6 (2010), 1018–1029.
[116] Fatih Ozsolak, Adam R Platt, Dan R Jones, Jeffrey G Reifenberger, Lauryn E Sass, Peter McInerney, John F Thompson,
Jayson Bowers, Mirna Jarosz, and Patrice M Milos. 2009. Direct RNA sequencing. Nature 461, 7265 (2009), 814–818.
[117] Stephan Pabinger, Andreas Dander, Maria Fischer, Rene Snajder, Michael Sperk, Mirjana Efremova, Birgit Krabichler,
Michael R Speicher, Johannes Zschocke, and Zlatko Trajanoski. 2014. A survey of tools for variant analysis of
next-generation genome sequencing data. Briefings in bioinformatics 15, 2 (2014), 256–278.
[118] Swati Parekh, Christoph Ziegenhain, Beate Vieth, Wolfgang Enard, and Ines Hellmann. 2016. The impact of amplification
on differential expression analyses by RNA-seq. Scientific reports 6 (2016).
[119] Ravi K. Patel and Mukesh Jain. 2012. NGS QC Toolkit: A Toolkit for Quality Control of
Next Generation Sequencing Data. PLoS ONE 7, 2 (feb 2012), e30619. https://doi.org/10.1371/journal.pone.0030619
[120] William R Pearson and David J Lipman. 1988. Improved tools for biological sequence comparison. Proceedings of the
National Academy of Sciences 85, 8 (1988), 2444–2448.
[121] Mihai Pop and Steven L. Salzberg. 2008. Bioinformatics challenges of new sequencing technology. Trends in Genetics
24, 3 (2008), 142–149. https://doi.org/10.1016/j.tig.2007.12.006
[122] J. Quick, N. J. Loman, S. Duraffour, J. T. Simpson, E. Severi, L. Cowley, J. A. Bore, R. Koundouno, G. Dudas, A.
Mikhail, N. Ouedraogo, B. Afrough, A. Bah, J. H. Baum, B. Becker-Ziaja, J. P. Boettcher, M. Cabeza-Cabrerizo, A.
Camino-Sanchez, L. L. Carter, J. Doerrbecker, T. Enkirch, I. Garcia-Dorival, N. Hetzelt, J. Hinzmann, T. Holm, L. E.
Kafetzopoulou, M. Koropogui, A. Kosgey, E. Kuisma, C. H. Logue, A. Mazzarelli, S. Meisel, M. Mertens, J. Michel,
D. Ngabo, K. Nitzsche, E. Pallasch, L. V. Patrono, J. Portmann, J. G. Repits, N. Y. Rickett, A. Sachse, K. Singethan, I.
Vitoriano, R. L. Yemanaberhan, E. G. Zekeng, T. Racine, A. Bello, A. A. Sall, O. Faye, O. Faye, N. Magassouba, C. V.
Williams, V. Amburgey, L. Winona, E. Davis, J. Gerlach, F. Washington, V. Monteil, M. Jourdain, M. Bererd, A. Camara,
H. Somlare, A. Camara, M. Gerard, G. Bado, B. Baillet, D. Delaune, K. Y. Nebie, A. Diarra, Y. Savane, R. B. Pallawo,
G. J. Gutierrez, N. Milhano, I. Roger, C. J. Williams, F. Yattara, K. Lewandowski, J. Taylor, P. Rachwal, D. J. Turner, G.
Pollakis, J. A. Hiscox, D. A. Matthews, M. K. O’Shea, A. M. Johnston, D. Wilson, E. Hutley, E. Smit, A. Di Caro, R.
Wolfel, K. Stoecker, E. Fleischmann, M. Gabriel, S. A. Weller, L. Koivogui, B. Diallo, S. Keita, A. Rambaut, P. Formenty,
S. Gunther, and M. W. Carroll. 2016. Real-time, portable genome sequencing for Ebola surveillance. Nature 530, 7589
(Feb 2016), 228–232.
[123] A. R. Quinlan, R. A. Clark, S. Sokolova, M. L. Leibowitz, Y. Zhang, M. E. Hurles, J. C. Mell, and I. M. Hall. 2010.
Genome-wide mapping and assembly of structural variant breakpoints in the mouse genome. Genome Research 20, 5
(may 2010), 623–635. https://doi.org/10.1101/gr.102970.109
[124] Richard Redon, Shumpei Ishikawa, Karen R. Fitch, Lars Feuk, George H. Perry, T. Daniel Andrews, Heike Fiegler,
Michael H. Shapero, Andrew R. Carson, Wenwei Chen, Eun Kyung Cho, Stephanie Dallaire, Jennifer L. Freeman,
Juan R. González, Mònica Gratacòs, Jing Huang, Dimitrios Kalaitzopoulos, Daisuke Komura, Jeffrey R. MacDonald,
Christian R. Marshall, Rui Mei, Lyndal Montgomery, Kunihiro Nishimura, Kohji Okamura, Fan Shen, Martin J.
Somerville, Joelle Tchinda, Armand Valsesia, Cara Woodwark, Fengtang Yang, Junjun Zhang, Tatiana Zerjal, Jane
Zhang, Lluis Armengol, Donald F. Conrad, Xavier Estivill, Chris Tyler-Smith, Nigel P. Carter, Hiroyuki Aburatani,
Charles Lee, Keith W. Jones, Stephen W. Scherer, and Matthew E. Hurles. 2006. Global variation in copy number in
the human genome. Nature 444, 7118 (nov 2006), 444–454. https://doi.org/10.1038/nature05329
[125] A. Rhoads and K. F. Au. 2015. PacBio Sequencing and Its Applications. Genomics Proteomics Bioinformatics 13, 5 (Oct
2015), 278–289.
[126] Manuel A Rivas, Mélissa Beaudoin, Agnes Gardet, Christine Stevens, Yashoda Sharma, Clarence K Zhang, Gabrielle
Boucher, Stephan Ripke, David Ellinghaus, Noel Burtt, Tim Fennell, Andrew Kirby, Anna Latiano, Philippe Goyette,
Todd Green, Jonas Halfvarson, Talin Haritunians, Joshua M Korn, Finny Kuruvilla, Caroline Lagacé, Benjamin Neale,
Ken Sin Lo, Phil Schumm, Leif Törkvist, Marla C Dubinsky, Steven R Brant, Mark S Silverberg, Richard H Duerr,
David Altshuler, Stacey Gabriel, Guillaume Lettre, Andre Franke, Mauro D’Amato, Dermot P B McGovern, Judy H
Cho, John D Rioux, Ramnik J Xavier, and Mark J Daly. 2011. Deep
resequencing of GWAS loci identifies independent rare variants associated with inflammatory bowel disease. Nature
Genetics 43, 11 (oct 2011), 1066–1073. https://doi.org/10.1038/ng.952
[127] N. D. Roberts, R. D. Kortschak, W. T. Parker, A. W. Schreiber, S. Branford, H. S. Scott, G. Glonek, and D. L. Adelson.
2013. A comparative analysis of algorithms for somatic SNV detection in cancer. Bioinformatics 29, 18 (sep 2013),
2223–2230. https://doi.org/10.1093/bioinformatics/btt375
[128] Holger Rohde, Junjie Qin, Yujun Cui, Dongfang Li, Nicholas J Loman, Moritz Hentschke, Wentong Chen, Fei Pu,
Yangqing Peng, Junhua Li, et al. 2011. Open-source genomic analysis of Shiga-toxin–producing E. coli O104:H4. New
England Journal of Medicine 365, 8 (2011), 718–724.
[129] M. G. Ross, C. Russ, M. Costello, A. Hollinger, N. J. Lennon, R. Hegarty, C. Nusbaum, and D. B. Jaffe. 2013. Characterizing
and measuring bias in sequence data. Genome Biol. 14, 5 (2013), R51.
[130] M. Rubio-Camarillo, G. Gomez-Lopez, J. M. Fernandez, A. Valencia, and D. G. Pisano. 2013. RUbioSeq: a suite
of parallelized pipelines to automate exome variation and bisulfite-seq analyses. Bioinformatics 29, 13 (Jul 2013),
1687–1689.
[131] Nicole Rusk. 2009. Cheap third-generation sequencing. Nature Methods 6, 4 (2009), 244–244.
[132] Nicole Rusk. 2011. Torrents of sequence. Nature Methods 8, 1 (2011), 44–44.
[133] Frederick Sanger, Steven Nicklen, and Alan R Coulson. 1977. DNA sequencing with chain-terminating inhibitors.
Proceedings of the National Academy of Sciences 74, 12 (1977), 5463–5467.
[134] Christopher T Saunders, Wendy S W Wong, Sajani Swamy, Jennifer Becq, Lisa J Murray, and R Keira Cheetham. 2012.
Strelka: accurate somatic small-variant calling from sequenced tumor-normal sample pairs. Bioinformatics (Oxford,
England) 28, 14 (jul 2012), 1811–7. https://doi.org/10.1093/bioinformatics/bts271
[135] Eric E Schadt, Steve Turner, and Andrew Kasarskis. 2010. A window into third-generation sequencing. Human
molecular genetics (2010), ddq416.
[136] Michael C Schatz. 2009. CloudBurst: highly sensitive read mapping with MapReduce. Bioinformatics (Oxford, England)
25, 11 (jun 2009), 1363–9. https://doi.org/10.1093/bioinformatics/btp236
[137] Stephan C Schuster. 2008. Next-generation sequencing transforms today’s biology. Nature Methods 5, 1 (2008), 16–18.
[138] Jana Marie Schwarz, Christian Rödelsperger, Markus Schuelke, and Dominik Seelow. 2010. MutationTaster evaluates
disease-causing potential of sequence alterations. Nature Methods 7, 8 (aug 2010), 575–576.
https://doi.org/10.1038/nmeth0810-575
[139] Jay Shendure and Hanlee Ji. 2008. Next-generation DNA sequencing. Nature Biotechnology 26, 10 (oct 2008), 1135–1145.
https://doi.org/10.1038/nbt1486
[140] C. Sloggett, N. Goonasekera, and E. Afgan. 2013. BioBlend: automating pipeline analyses within Galaxy and CloudMan.
Bioinformatics 29, 13 (Jul 2013), 1685–1686.
[141] L. F. Stead, K. M. Sutton, G. R. Taylor, P. Quirke, and P. Rabbitts. 2013. Accurately identifying low-allelic fraction
variants in single samples with next-generation sequencing: applications in tumor subclone resolution. Hum. Mutat.
34, 10 (Oct 2013), 1432–1438.

[142] Zachary D Stephens, Skylar Y Lee, Faraz Faghri, Roy H Campbell, Chengxiang Zhai, Miles J Efron, Ravishankar Iyer,
Michael C Schatz, Saurabh Sinha, and Gene E Robinson. 2015. Big data: astronomical or genomical? PLoS biology 13,
7 (2015), e1002195.
[143] Bianca Stöcker, Johannes Köster, and Sven Rahmann. 2016. SimLoRD–Simulation of Long Read Data. Bioinformatics
(2016), btw286.
[144] Michael R. Stratton, Peter J. Campbell, and P. Andrew Futreal. 2009. The cancer genome. Nature 458, 7239 (apr 2009),
719–724. https://doi.org/10.1038/nature07943
[145] Peter H. Sudmant, Tobias Rausch, Eugene J. Gardner, Robert E. Handsaker, Alexej Abyzov, John Huddleston, Yan
Zhang, Kai Ye, Goo Jun, Markus Hsi-Yang Fritz, Miriam K. Konkel, Ankit Malhotra, Adrian M. Stütz, Xinghua Shi,
Francesco Paolo Casale, Jieming Chen, Fereydoun Hormozdiari, Gargi Dayama, Ken Chen, Maika Malig, Mark J. P.
Chaisson, Klaudia Walter, Sascha Meiers, Seva Kashin, Erik Garrison, Adam Auton, Hugo Y. K. Lam, Xinmeng Jasmine
Mu, Can Alkan, Danny Antaki, Taejeong Bae, Eliza Cerveira, Peter Chines, Zechen Chong, Laura Clarke, Elif Dal, Li
Ding, Sarah Emery, Xian Fan, Madhusudan Gujral, Fatma Kahveci, Jeffrey M. Kidd, Yu Kong, Eric-Wubbo Lameijer,
Shane McCarthy, Paul Flicek, Richard A. Gibbs, Gabor Marth, Christopher E. Mason, Androniki Menelaou, Donna M.
Muzny, Bradley J. Nelson, Amina Noor, Nicholas F. Parrish, Matthew Pendleton, Andrew Quitadamo, Benjamin Raeder,
Eric E. Schadt, Mallory Romanovitch, Andreas Schlattl, Robert Sebra, Andrey A. Shabalin, Andreas Untergasser,
Jerilyn A. Walker, Min Wang, Fuli Yu, Chengsheng Zhang, Jing Zhang, Xiangqun Zheng-Bradley, Wanding Zhou,
Thomas Zichner, Jonathan Sebat, Mark A. Batzer, Steven A. McCarroll, Ryan E. Mills, Mark B. Gerstein, Ali Bashir,
Oliver Stegle, Scott E. Devine, Charles Lee, Evan E. Eichler, and Jan O. Korbel. 2015. An integrated map of
structural variation in 2,504 human genomes. Nature 526, 7571 (sep 2015), 75–81. https://doi.org/10.1038/nature15394
[146] Tamas Szalay and Jene A Golovchenko. 2015. De novo sequencing and variant calling with nanopores using PoreSeq.
Nature biotechnology 33, 10 (2015), 1087–1091.
[147] Y. Tateno, T. Imanishi, S. Miyazaki, K. Fukami-Kobayashi, N. Saitou, H. Sugawara, and T. Gojobori. 2002. DNA
Data Bank of Japan (DDBJ) for genome scale research in life science. Nucleic Acids Research 30, 1 (jan 2002), 27–30.
https://doi.org/10.1093/nar/30.1.27
[148] GB Editorial Team. 2011. Closure of the NCBI SRA and implications for the long-term future of genomics data storage.
(2011).
[149] Helga Thorvaldsdóttir, James T. Robinson, and Jill P. Mesirov. 2013. Integrative Genomics Viewer (IGV):
high-performance genomics data visualization and exploration. Briefings in Bioinformatics 14, 2 (2013), 178–192.
https://doi.org/10.1093/bib/bbs017
[150] Erwin L van Dijk, Hélène Auger, Yan Jaszczyszyn, and Claude Thermes. 2014. Ten years of next-generation sequencing
technology. Trends in genetics 30, 9 (2014), 418–426.
[151] Yanqing Wang, Fuhai Song, Junwei Zhu, Sisi Zhang, Yadong Yang, Tingting Chen, Bixia Tang, Lili Dong, Nan Ding,
Qian Zhang, et al. 2017. GSA: Genome Sequence Archive. Genomics, Proteomics and Bioinformatics (2017).
[152] Mick Watson, Marian Thomson, Judith Risse, Richard Talbot, Javier Santoyo-Lopez, Karim Gharbi, and Mark Blaxter.
2015. poRe: an R package for the visualization and analysis of nanopore sequencing data. Bioinformatics 31, 1 (2015),
114–115.
[153] Simon J Watson, Matthijs R A Welkers, Daniel P Depledge, Eve Coulter, Judith M Breuer, Menno D de Jong, and Paul
Kellam. 2013. Viral population analysis and minority-variant detection using short read next-generation
sequencing. Philosophical transactions of the Royal Society of London. Series B, Biological sciences 368, 1614 (mar 2013),
20120205. https://doi.org/10.1098/rstb.2012.0205
[154] Joachim Weischenfeldt, Orsolya Symmons, François Spitz, and Jan O. Korbel. 2013. Phenotypic impact of genomic
structural variation: insights from and for human disease. Nature Reviews Genetics 14, 2 (jan 2013), 125–138.
https://doi.org/10.1038/nrg3373
[155] David A Wheeler, Maithreyan Srinivasan, Michael Egholm, Yufeng Shen, Lei Chen, Amy McGuire, Wen He, Yi-Ju
Chen, Vinod Makhijani, G Thomas Roth, et al. 2008. The complete genome of an individual by massively parallel
DNA sequencing. Nature 452, 7189 (2008), 872–876.
[156] K. Wong, T. M. Keane, J. Stalker, and D. J. Adams. 2010. Enhanced structural variant and breakpoint detection using
SVMerge by integration of multiple detection methods and local assembly. Genome Biol. 11, 12 (2010), R128.

[157] Ka-Chun Wong and Zhaolei Zhang. 2014. SNPdryad: predicting deleterious non-synonymous human SNPs using
only orthologous protein sequences. Bioinformatics (Oxford, England) 30, 8 (jan 2014), 1112–1119.
https://doi.org/10.1093/bioinformatics/btt769
[158] Chao Xie and Martti T Tammi. 2009. CNV-seq, a new method to detect copy number variation using
high-throughput sequencing. BMC Bioinformatics 10, 1 (2009), 80. https://doi.org/10.1186/1471-2105-10-80
[159] Haibin Xu, Xiang Luo, Jun Qian, Xiaohui Pang, Jingyuan Song, Guangrui Qian, Jinhui Chen, and Shilin Chen. 2012.
FastUniq: A Fast De Novo Duplicates Removal Tool for
Paired Short Reads. PLoS ONE 7, 12 (dec 2012), e52249. https://doi.org/10.1371/journal.pone.0052249
[160] Kai Ye, Marcel H. Schulz, Quan Long, Rolf Apweiler, and Zemin Ning. 2009. Pindel: a pattern growth approach to
detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics 25, 21
(2009), 2865–2871. https://doi.org/10.1093/bioinformatics/btp394
[161] Ming Yi, Yongmei Zhao, Li Jia, Mei He, Electron Kebebew, and Robert M. Stephens. 2014. Performance comparison of
SNP detection tools with Illumina exome sequencing data: an assessment using both family pedigree information and
sample matched SNP array data. Nucleic Acids Research 42, 12 (jul 2014), e101–e101.
https://doi.org/10.1093/nar/gku392
[162] Yongchao Liu and Bertil Schmidt. 2014. CUSHAW2-GPU: Empowering Faster Gapped Short-Read Alignment
Using GPU Computing. IEEE Design and Test 31, 1 (feb 2014), 31–39. https://doi.org/10.1109/MDAT.2013.2284198
[163] S. Yoon, Z. Xuan, V. Makarov, K. Ye, and J. Sebat. 2009. Sensitive and accurate detection of copy number variants
using read depth of coverage. Genome Research 19, 9 (sep 2009), 1586–1592. https://doi.org/10.1101/gr.092981.109
[164] Y William Yu, Deniz Yorukoglu, Jian Peng, and Bonnie Berger. 2015. Quality score compression improves genotyping
accuracy. Nature biotechnology 33, 3 (2015), 240–243.
[165] Peng Yue, Eugene Melamud, and John Moult. 2006. SNPs3D: Candidate gene and SNP selection for association studies.
BMC Bioinformatics 7, 1 (2006), 166. https://doi.org/10.1186/1471-2105-7-166
[166] Daniel R Zerbino and Ewan Birney. 2008. Velvet: algorithms for de novo short read assembly using de Bruijn graphs.
Genome research 18, 5 (may 2008), 821–9. https://doi.org/10.1101/gr.074492.107
[167] Zheng Zhang, Scott Schwartz, Lukas Wagner, and Webb Miller. 2000. A Greedy Algorithm for Aligning DNA
Sequences. Journal of Computational Biology 7, 1-2 (feb 2000), 203–214. https://doi.org/10.1089/10665270050081478
[168] Qian Zhou, Xiaoquan Su, Anhui Wang, Jian Xu, and Kang Ning. 2013. QC-Chain: Fast and Holistic Quality Control
Method for Next-Generation Sequencing Data. PLoS ONE 8, 4 (apr 2013), e60234.
https://doi.org/10.1371/journal.pone.0060234

Received February 2099; revised March 2099; accepted June 2099

ACM Computing Surveys, Vol. 9, No. 4, Article 39. Publication date: March 2099.
