You are on page 1of 53

Algorithms for Next-Generation

Sequencing 1st Edition Wing-Kin Sung


Visit to download the full and correct content document:
https://textbookfull.com/product/algorithms-for-next-generation-sequencing-1st-edition
-wing-kin-sung/
More products digital (pdf, epub, mobi) instant
download maybe you interests ...

Computational Methods for Next Generation Sequencing


Data Analysis 1st Edition Ion Mandoiu

https://textbookfull.com/product/computational-methods-for-next-
generation-sequencing-data-analysis-1st-edition-ion-mandoiu/

Next of Kin 1st Edition Elton Skelter

https://textbookfull.com/product/next-of-kin-1st-edition-elton-
skelter/

Next of Kin 1st Edition Elton Skelter

https://textbookfull.com/product/next-of-kin-1st-edition-elton-
skelter-2/

Fundamentals of Deep Learning Designing Next Generation


Machine Intelligence Algorithms 1st Edition Nikhil
Buduma

https://textbookfull.com/product/fundamentals-of-deep-learning-
designing-next-generation-machine-intelligence-algorithms-1st-
edition-nikhil-buduma/
Security and Privacy for Next-Generation Wireless
Networks Sheng Zhong

https://textbookfull.com/product/security-and-privacy-for-next-
generation-wireless-networks-sheng-zhong/

Next Generation DNA Led Technologies 1st Edition


Sharada Avadhanam

https://textbookfull.com/product/next-generation-dna-led-
technologies-1st-edition-sharada-avadhanam/

Network Programmability and Automation Skills for the


Next Generation Network Engineer 1st Edition Jason
Edelman

https://textbookfull.com/product/network-programmability-and-
automation-skills-for-the-next-generation-network-engineer-1st-
edition-jason-edelman/

Materials and Processes for Next Generation Lithography


1st Edition Alex Robinson And Richard Lawson (Eds.)

https://textbookfull.com/product/materials-and-processes-for-
next-generation-lithography-1st-edition-alex-robinson-and-
richard-lawson-eds/

Energetic Materials - Advanced Processing Technologies


for Next-Generation Materials 1st Edition Mark J.
Mezger (Ed.)

https://textbookfull.com/product/energetic-materials-advanced-
processing-technologies-for-next-generation-materials-1st-
edition-mark-j-mezger-ed/
ALGORITHMS FOR
NEXT-GENERATION SEQUENCING
ALGORITHMS FOR
NEXT-GENERATION SEQUENCING

Wing-Kin Sung
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2017 by Taylor & Francis Group, LLC


CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works

Printed on acid-free paper


Version Date: 20170421

International Standard Book Number-13: 978-1-4665-6550-0 (Hardback)

This book contains information obtained from authentic and highly regarded sources. Reasonable
efforts have been made to publish reliable data and information, but the author and publisher cannot
assume responsibility for the validity of all materials or the consequences of their use. The authors and
publishers have attempted to trace the copyright holders of all material reproduced in this publication
and apologize to copyright holders if permission to publish in this form has not been obtained. If any
copyright material has not been acknowledged please write and let us know so we may rectify in any
future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or
hereafter invented, including photocopying, microfilming, and recording, or in any information
storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access
www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc.
(CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization
that provides licenses and registration for a variety of users. For organizations that have been granted
a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and
are used only for identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
Contents

Preface xi

1 Introduction 1
1.1 DNA, RNA, protein and cells . . . . . . . . . . . . . . . . . . 1
1.2 Sequencing technologies . . . . . . . . . . . . . . . . . . . . . 3
1.3 First-generation sequencing . . . . . . . . . . . . . . . . . . . 4
1.4 Second-generation sequencing . . . . . . . . . . . . . . . . . 6
1.4.1 Template preparation . . . . . . . . . . . . . . . . . . 6
1.4.2 Base calling . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4.3 Polymerase-mediated methods based on reversible
terminator nucleotides . . . . . . . . . . . . . . . . . . 7
1.4.4 Polymerase-mediated methods based on unmodified
nucleotides . . . . . . . . . . . . . . . . . . . . . . . . 10
1.4.5 Ligase-mediated method . . . . . . . . . . . . . . . . . 11
1.5 Third-generation sequencing . . . . . . . . . . . . . . . . . . 12
1.5.1 Single-molecule real-time sequencing . . . . . . . . . . 12
1.5.2 Nanopore sequencing method . . . . . . . . . . . . . . 13
1.5.3 Direct imaging of DNA using electron microscopy . . 15
1.6 Comparison of the three generations of sequencing . . . . . . 16
1.7 Applications of sequencing . . . . . . . . . . . . . . . . . . . 17
1.8 Summary and further reading . . . . . . . . . . . . . . . . . 19
1.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

2 NGS file formats 21


2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2 Raw data files: fasta and fastq . . . . . . . . . . . . . . . . . 22
2.3 Alignment files: SAM and BAM . . . . . . . . . . . . . . . . 24
2.3.1 FLAG . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3.2 CIGAR string . . . . . . . . . . . . . . . . . . . . . . . 26
2.4 Bed format . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.5 Variant Call Format (VCF) . . . . . . . . . . . . . . . . . . . 29
2.6 Format for representing density data . . . . . . . . . . . . . 31
2.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

v
vi Contents

3 Related algorithms and data structures 35


3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2 Recursion and dynamic programming . . . . . . . . . . . . . 35
3.2.1 Key searching problem . . . . . . . . . . . . . . . . . . 36
3.2.2 Edit-distance problem . . . . . . . . . . . . . . . . . . 37
3.3 Parameter estimation . . . . . . . . . . . . . . . . . . . . . . 38
3.3.1 Maximum likelihood . . . . . . . . . . . . . . . . . . . 39
3.3.2 Unobserved variable and EM algorithm . . . . . . . . 40
3.4 Hash data structures . . . . . . . . . . . . . . . . . . . . . . 43
3.4.1 Maintain an associative array by simple hashing . . . 43
3.4.2 Maintain a set using a Bloom filter . . . . . . . . . . . 45
3.4.3 Maintain a multiset using a counting Bloom filter . . . 46
3.4.4 Estimating the similarity of two sets using minHash . 47
3.5 Full-text index . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.5.1 Suffix trie and suffix tree . . . . . . . . . . . . . . . . 49
3.5.2 Suffix array . . . . . . . . . . . . . . . . . . . . . . . . 50
3.5.3 FM-index . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.5.3.1 Inverting the BWT B to the original text T 53
3.5.3.2 Simulate a suffix array using the FM-index . 54
3.5.3.3 Pattern matching . . . . . . . . . . . . . . . 55
3.5.4 Simulate a suffix trie using the FM-index . . . . . . . 55
3.5.5 Bi-directional BWT . . . . . . . . . . . . . . . . . . . 56
3.6 Data compression techniques . . . . . . . . . . . . . . . . . . 58
3.6.1 Data compression and entropy . . . . . . . . . . . . . 58
3.6.2 Unary, gamma, and delta coding . . . . . . . . . . . . 59
3.6.3 Golomb code . . . . . . . . . . . . . . . . . . . . . . . 60
3.6.4 Huffman coding . . . . . . . . . . . . . . . . . . . . . . 60
3.6.5 Arithmetic code . . . . . . . . . . . . . . . . . . . . . 62
3.6.6 Order-k Markov Chain . . . . . . . . . . . . . . . . . . 64
3.6.7 Run-length encoding . . . . . . . . . . . . . . . . . . . 65
3.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

4 NGS read mapping 69


4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.2 Overview of the read mapping problem . . . . . . . . . . . . 70
4.2.1 Mapping reads with no quality score . . . . . . . . . . 70
4.2.2 Mapping reads with a quality score . . . . . . . . . . . 71
4.2.3 Brute-force solution . . . . . . . . . . . . . . . . . . . 72
4.2.4 Mapping quality . . . . . . . . . . . . . . . . . . . . . 74
4.2.5 Challenges . . . . . . . . . . . . . . . . . . . . . . . . 75
4.3 Align reads allowing a small number of mismatches . . . . . 76
4.3.1 Mismatch seed hashing approach . . . . . . . . . . . . 77
4.3.2 Read hashing with a spaced seed . . . . . . . . . . . . 78
4.3.3 Reference hashing approach . . . . . . . . . . . . . . . 82
4.3.4 Suffix trie-based approaches . . . . . . . . . . . . . . . 84
Contents vii

4.3.4.1 Estimating the lower bound of the number of


mismatches . . . . . . . . . . . . . . . . . . . 87
4.3.4.2 Divide and conquer with the enhanced pigeon-
hole principle . . . . . . . . . . . . . . . . . . 89
4.3.4.3 Aligning a set of reads together . . . . . . . 92
4.3.4.4 Speed up utilizing the quality score . . . . . 94
4.4 Aligning reads allowing a small number of mismatches
and indels . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.4.1 q-mer approach . . . . . . . . . . . . . . . . . . . . . . 97
4.4.2 Computing alignment using a suffix trie . . . . . . . . 99
4.4.2.1 Computing the edit distance using a suffix trie 100
4.4.2.2 Local alignment using a suffix trie . . . . . . 103
4.5 Aligning reads in general . . . . . . . . . . . . . . . . . . . . 105
4.5.1 Seed-and-extension approach . . . . . . . . . . . . . . 107
4.5.1.1 BWA-SW . . . . . . . . . . . . . . . . . . . . 108
4.5.1.2 Bowtie 2 . . . . . . . . . . . . . . . . . . . . 109
4.5.1.3 BatAlign . . . . . . . . . . . . . . . . . . . . 110
4.5.1.4 Cushaw2 . . . . . . . . . . . . . . . . . . . . 111
4.5.1.5 BWA-MEM . . . . . . . . . . . . . . . . . . . 112
4.5.1.6 LAST . . . . . . . . . . . . . . . . . . . . . . 113
4.5.2 Filter-based approach . . . . . . . . . . . . . . . . . . 114
4.6 Paired-end alignment . . . . . . . . . . . . . . . . . . . . . . 116
4.7 Further reading . . . . . . . . . . . . . . . . . . . . . . . . . 117
4.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118

5 Genome assembly 123


5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
5.2 Whole genome shotgun sequencing . . . . . . . . . . . . . . . 124
5.2.1 Whole genome sequencing . . . . . . . . . . . . . . . . 124
5.2.2 Mate-pair sequencing . . . . . . . . . . . . . . . . . . 126
5.3 De novo genome assembly for short reads . . . . . . . . . . . 126
5.3.1 Read error correction . . . . . . . . . . . . . . . . . . 128
5.3.1.1 Spectral alignment problem (SAP) . . . . . . 129
5.3.1.2 k-mer counting . . . . . . . . . . . . . . . . . 133
5.3.2 Base-by-base extension approach . . . . . . . . . . . . 138
5.3.3 De Bruijn graph approach . . . . . . . . . . . . . . . . 141
5.3.3.1 De Bruijn assembler (no sequencing error) . 143
5.3.3.2 De Bruijn assembler (with sequencing errors) 144
5.3.3.3 How to select k . . . . . . . . . . . . . . . . . 146
5.3.3.4 Additional issues of the de Bruijn graph
approach . . . . . . . . . . . . . . . . . . . . 147
5.3.4 Scaffolding . . . . . . . . . . . . . . . . . . . . . . . . 150
5.3.5 Gap filling . . . . . . . . . . . . . . . . . . . . . . . . . 153
5.4 Genome assembly for long reads . . . . . . . . . . . . . . . . 154
viii Contents

5.4.1 Assemble long reads assuming long reads have a low


sequencing error rate . . . . . . . . . . . . . . . . . . . 155
5.4.2 Hybrid approach . . . . . . . . . . . . . . . . . . . . . 157
5.4.2.1 Use mate-pair reads and long reads to improve
the assembly from short reads . . . . . . . . 160
5.4.2.2 Use short reads to correct errors in long reads 160
5.4.3 Long read approach . . . . . . . . . . . . . . . . . . . 161
5.4.3.1 MinHash for all-versus-all pairwise alignment 162
5.4.3.2 Computing consensus using Falcon Sense . . 163
5.4.3.3 Quiver consensus algorithm . . . . . . . . . . 165
5.5 How to evaluate the goodness of an assembly . . . . . . . . . 168
5.6 Discussion and further reading . . . . . . . . . . . . . . . . . 168
5.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170

6 Single nucleotide variation (SNV) calling 175


6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
6.1.1 What are SNVs and small indels? . . . . . . . . . . . 175
6.1.2 Somatic and germline mutations . . . . . . . . . . . . 178
6.2 Determine variations by resequencing . . . . . . . . . . . . . 178
6.2.1 Exome/targeted sequencing . . . . . . . . . . . . . . . 179
6.2.2 Detection of somatic and germline variations . . . . . 180
6.3 Single locus SNV calling . . . . . . . . . . . . . . . . . . . . 180
6.3.1 Identifying SNVs by counting alleles . . . . . . . . . . 181
6.3.2 Identify SNVs by binomial distribution . . . . . . . . 182
6.3.3 Identify SNVs by Poisson-binomial distribution . . . . 184
6.3.4 Identifying SNVs by the Bayesian approach . . . . . . 185
6.4 Single locus somatic SNV calling . . . . . . . . . . . . . . . . 187
6.4.1 Identify somatic SNVs by the Fisher exact test . . . . 187
6.4.2 Identify somatic SNVs by verifying that the SNVs
appear in the tumor only . . . . . . . . . . . . . . . . 188
6.4.2.1 Identify SNVs in the tumor sample by
posterior odds ratio . . . . . . . . . . . . . . 188
6.4.2.2 Verify if an SNV is somatic by the posterior
odds ratio . . . . . . . . . . . . . . . . . . . . 191
6.5 General pipeline for calling SNVs . . . . . . . . . . . . . . . 192
6.6 Local realignment . . . . . . . . . . . . . . . . . . . . . . . . 193
6.7 Duplicate read marking . . . . . . . . . . . . . . . . . . . . . 195
6.8 Base quality score recalibration . . . . . . . . . . . . . . . . 195
6.9 Rule-based filtering . . . . . . . . . . . . . . . . . . . . . . . 198
6.10 Computational methods to identify small indels . . . . . . . 199
6.10.1 Split-read approach . . . . . . . . . . . . . . . . . . . 199
6.10.2 Span distribution-based clustering approach . . . . . . 200
6.10.3 Local assembly approach . . . . . . . . . . . . . . . . 203
6.11 Correctness of existing SNV and indel callers . . . . . . . . . 204
6.12 Further reading . . . . . . . . . . . . . . . . . . . . . . . . . 205
Contents ix

6.13 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206

7 Structural variation calling 209


7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
7.2 Formation of SVs . . . . . . . . . . . . . . . . . . . . . . . . 211
7.3 Clinical effects of structural variations . . . . . . . . . . . . . 214
7.4 Methods for determining structural variations . . . . . . . . 215
7.5 CNV calling . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
7.5.1 Computing the raw read count . . . . . . . . . . . . . 218
7.5.2 Normalize the read counts . . . . . . . . . . . . . . . . 219
7.5.3 Segmentation . . . . . . . . . . . . . . . . . . . . . . . 219
7.6 SV calling pipeline . . . . . . . . . . . . . . . . . . . . . . . . 222
7.6.1 Insert size estimation . . . . . . . . . . . . . . . . . . . 222
7.7 Classifying the paired-end read alignments . . . . . . . . . . 223
7.8 Identifying candidate SVs from paired-end reads . . . . . . . 226
7.8.1 Clustering approach . . . . . . . . . . . . . . . . . . . 227
7.8.1.1 Clique-finding approach . . . . . . . . . . . . 228
7.8.1.2 Confidence interval overlapping approach . . 229
7.8.1.3 Set cover approach . . . . . . . . . . . . . . . 233
7.8.1.4 Performance of the clustering approach . . . 236
7.8.2 Split-mapping approach . . . . . . . . . . . . . . . . . 236
7.8.3 Assembly approach . . . . . . . . . . . . . . . . . . . . 237
7.8.4 Hybrid approach . . . . . . . . . . . . . . . . . . . . . 238
7.9 Verify the SVs . . . . . . . . . . . . . . . . . . . . . . . . . . 239
7.10 Further reading . . . . . . . . . . . . . . . . . . . . . . . . . 242
7.11 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242

8 RNA-seq 245
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
8.2 High-throughput methods to study the transcriptome . . . . 247
8.3 Application of RNA-seq . . . . . . . . . . . . . . . . . . . . . 248
8.4 Computational Problems of RNA-seq . . . . . . . . . . . . . 250
8.5 RNA-seq read mapping . . . . . . . . . . . . . . . . . . . . . 250
8.5.1 Features used in RNA-seq read mapping . . . . . . . . 250
8.5.1.1 Transcript model . . . . . . . . . . . . . . . . 250
8.5.1.2 Splice junction signals . . . . . . . . . . . . . 252
8.5.2 Exon-first approach . . . . . . . . . . . . . . . . . . . 253
8.5.3 Seed-and-extend approach . . . . . . . . . . . . . . . . 256
8.6 Construction of isoforms . . . . . . . . . . . . . . . . . . . . 260
8.7 Estimating expression level of each transcript . . . . . . . . . 261
8.7.1 Estimating transcript abundances when every read
maps to exactly one transcript . . . . . . . . . . . . . 261
8.7.2 Estimating transcript abundances when a read maps to
multiple isoforms . . . . . . . . . . . . . . . . . . . . . 264
8.7.3 Estimating gene abundance . . . . . . . . . . . . . . . 266
x Contents

8.8 Summary and further reading . . . . . . . . . . . . . . . . . 268


8.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268

9 Peak calling methods 271


9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
9.2 Techniques that generate density-based datasets . . . . . . . 271
9.2.1 Protein DNA interaction . . . . . . . . . . . . . . . . . 271
9.2.2 Epigenetics of our genome . . . . . . . . . . . . . . . . 273
9.2.3 Open chromatin . . . . . . . . . . . . . . . . . . . . . 274
9.3 Peak calling methods . . . . . . . . . . . . . . . . . . . . . . 274
9.3.1 Model fragment length . . . . . . . . . . . . . . . . . . 276
9.3.2 Modeling noise using a control library . . . . . . . . . 279
9.3.3 Noise in the sample library . . . . . . . . . . . . . . . 280
9.3.4 Determination if a peak is significant . . . . . . . . . . 281
9.3.5 Unannotated high copy number regions . . . . . . . . 283
9.3.6 Constructing a signal profile by Kernel methods . . . 284
9.4 Sequencing depth of the ChIP-seq libraries . . . . . . . . . . 285
9.5 Further reading . . . . . . . . . . . . . . . . . . . . . . . . . 286
9.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287

10 Data compression techniques used in NGS files 289


10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
10.2 Strategies for compressing fasta/fastq files . . . . . . . . . . 290
10.3 Techniques to compress identifiers . . . . . . . . . . . . . . . 290
10.4 Techniques to compress DNA bases . . . . . . . . . . . . . . 291
10.4.1 Statistical-based approach . . . . . . . . . . . . . . . . 291
10.4.2 BWT-based approach . . . . . . . . . . . . . . . . . . 292
10.4.3 Reference-based approach . . . . . . . . . . . . . . . . 295
10.4.4 Assembly-based approach . . . . . . . . . . . . . . . . 297
10.5 Quality score compression methods . . . . . . . . . . . . . . 299
10.5.1 Lossless compression . . . . . . . . . . . . . . . . . . . 300
10.5.2 Lossy compression . . . . . . . . . . . . . . . . . . . . 301
10.6 Compression of other NGS data . . . . . . . . . . . . . . . . 302
10.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304

References 307

Index 339
Preface

Next-generation sequencing (NGS) is a recently developed technology enabling


us to generate hundreds of billions of DNA bases from the samples. We can
use NGS to reconstruct the genome, understand genomic variations, recover
transcriptomes, and identify the transcription factor binding sites or the epi-
genetic marks.
The NGS technology radically changes how we collect genomic data from
the samples. Instead of studying a particular gene or a particular genomic re-
gion, NGS technologies enable us to perform genome-wide study unbiasedly.
Although more raw data can be obtained from sequencing machines, we face
computational challenges in analyzing such a big dataset. Hence, it is impor-
tant to develop efficient and accurate computational methods to analyze or
process such datasets. This book is intended to give an in-depth introduction
to such algorithmic techniques.
The primary audiences of this book include advanced undergraduate stu-
dents and graduate students who are from mathematics or computer science
departments. We assume that readers have some training in college-level bi-
ology, statistics, discrete mathematics and algorithms.
This book was developed partly from the teaching material for the course
on Combinatorial Methods in Bioinformatics, which I taught at the National
University of Singapore, Singapore. The chapters in this book are classified
based on the application domains of the NGS technologies. In each chapter, a
brief introduction to the technology is first given. Then, different methods or
algorithms for analyzing such NGS datasets are described. To illustrate each
algorithm, detailed examples are given. At the end of each chapter, exercises
are given to help readers understand the topics.
Chapter 1 introduces the next-generation sequencing technologies. We
cover the three generations of sequencing, starting from Sanger sequencing
(first generation). Then, we cover second-generation sequencing, which in-
cludes Illumina Solexa sequencing. Finally, we describe the third-generation
sequencing technologies which include PacBio sequencing and nanopore se-
quencing.
Chapter 2 introduces a few NGS file formats, which facilitate downstream
analysis and data transfer. They include fasta, fastq, SAM, BAM, BED, VCF
and WIG formats. Fasta and fastq are file formats for describing the raw
sequencing reads generated by the sequencers. SAM and BAM are file formats

xi
xii Preface

for describing the alignments of the NGS reads on the reference genome. BED,
VCF and WIG formats are annotation formats.
To develop methods for processing NGS data, we need efficient algorithms
and data structures. Chapter 3 is devoted to briefly describing these tech-
niques.
Chapter 4 studies read mappers. Read mappers align the NGS reads on
the reference genome. The input is a set of raw reads in fasta or fastq files.
The read mapper will align each raw read on the reference genome, that is,
identify the region in the reference genome which is highly similar to the read.
Then, the read mapper will output all these alignments in a SAM or BAM
file. This is the basic step for many NGS applications. (It is the first step for
the methods in Chapters 6 9.)
Chapter 5 studies the de novo assembly problem. Given a set of raw reads
extracted from whole genome sequencing of some sample genome, de novo
assembly aims to stitch the raw reads together to reconstruct the genome.
It enables us to reconstruct novel genomes like plants and bacteria. De novo
assembly involves a few steps: error correction, contig assembly (de Bruijn
graph approach or base-by-base extension approach), scaffolding and gap fill-
ing. This chapter describes techniques developed for these steps.
Chapter 6 discusses the problem of identifying single nucleotide variations
(SNVs) and small insertions/deletions (indels) in an individual genome. The
genome of every individual is highly similar to the reference human genome.
However, each genome is still different from the reference genome. On average,
there is 1 single nucleotide variation in every 3000 bases and 1 small indel in
every 1000 bases. To discover these variations, we can first perform whole
genome sequencing or exome sequencing of the individual genome to obtain
a set of raw reads. After aligning the raw reads on the reference genome, we
use SNV callers and indel callers to call SNVs and small indels. This chapter
is devoted to discussing techniques used in SNV callers and indel callers.
Apart from SNVs and small indels, copy number variations (CNVs) and
structural variations (SVs) are the other types of variations that appear in our
genome. CNVs and SVs are not as frequent as SNVs and indels. Moreover, they
are more prone to change the phenotype. Hence, it is important to understand
them. Chapter 7 is devoted to studying techniques used in CNV callers and
SV callers.
All above technologies are related to genome sequencing. We can also se-
quence RNA. This technology is known as RNA-seq. Chapter 8 studies meth-
ods for analyzing RNA-seq. By applying computational methods on RNA-seq,
we can recover the transcriptome. More precisely, RNA-seq enables us to iden-
tify exons and split junctions. Then, we can predict the isoforms of the genes.
We can also determine the expression of each transcript and each gene.
By combining Chromatin immunoprecipitation and next-generation se-
quencing, we can sequence genome regions that are bound by some transcrip-
tion factors or with epigenetic marks. Such technology is known as ChIP-
seq. The computational methods that identify those binding sites are known
Preface xiii

as ChIP-seq peak callers. Chapter 9 is devoted to discussing computational


methods for such purpose.
As stated earlier, NGS data is huge; and the NGS data files are usually
big. It is difficult to store and transfer NGS files. One solution is to com-
press the NGS data files. Nowadays, a number of compression methods have
been developed and some of the compression formats are used frequently in
the literatures like BAM, bigBed and bigWig. Chapter 10 aims to describe
these compression techniques. We also describe techniques that enable us to
randomly access the compressed NGS data files.
Supplementary material can be found at
http://www.comp.nus.edu.sg/ ksung/algo in ngs/.
I would like to thank my PhD supervisors Tak-Wah Lam and Hing-
Fung Ting and my collaborators Francis Y. L. Chin, Kwok Pui Choi, Ed-
win Cheung, Axel Hillmer, Wing Kai Hon, Jansson Jesper, Ming-Yang Kao,
Caroline Lee, Nikki Lee, Hon Wai Leong, Alexander Lezhava, John Luk, See-
Kiong Ng, Franco P. Preparata, Yijun Ruan, Kunihiko Sadakane, Chialin Wei,
Limsoon Wong, Siu-Ming Yiu, and Louxin Zhang. My knowledge of NGS and
bioinformatics was enriched through numerous discussions with them. I would
like to thank Ramesh Rajaby, Kunihiko Sadakane, Chandana Tennakoon,
Hugo Willy, and Han Xu for helping to proofread some of the chapters. I
would also like to thank my parents Kang Fai Sung and Siu King Wong, my
three brothers Wing Hong Sung, Wing Keung Sung, and Wing Fu Sung, my
wife Lily Or, and my three kids Kelly, Kathleen and Kayden for their support.
Finally, if you have any suggestions for improvement or if you identify any
errors in the book, please send an email to me at ksung@comp.nus.edu.sg. I
thank you in advance for your helpful comments in improving the book.

Wing-Kin Sung
Chapter 1
Introduction

DNA stands for deoxyribonucleic acid. It was first discovered in 1869 by


Friedrich Miescher [58]. However, it was not until 1944 that Avery, MacLeod
and McCarty [12] demonstrated that DNA is the major carrier of genetic in-
formation, not protein. In 1953, James Watson and Francis Crick discovered
the basic structure of DNA, which is a double helix [310]. After that, people
started to work on DNA intensively.
DNA sequencing sprang to life in 1972, when Frederick Sanger (at the Uni-
versity of Cambridge, England) began to work on the genome sequence using
a variation of the recombinant DNA method. The full DNA sequence of a viral
genome (bacteriophage φX174) was completed by Sanger in 1977 [259, 260].
Based on the power of sequencing, Sanger established genomics,1 which is the
study of the entirety of an organism’s hereditary information, encoded in DNA
(or RNA for certain viruses). Note that it is different from molecular biology
or genetics, whose primary focus is to investigate the roles and functions of
single genes.
During the last decades, DNA sequencing has improved rapidly. We can
sequence the whole human genome within a day and compare multiple individ-
ual human genomes. This book is devoted to understanding the bioinformatics
issues related to DNA sequencing. In this introduction, we briefly review DNA,
RNA and protein. Then, we describe various sequencing technologies. Lastly,
we describe the applications of sequencing technologies.

1.1 DNA, RNA, protein and cells


Deoxyribonucleic acid (DNA) is used as the genetic material (with the
exception that certain viruses use RNA as the genetic material). The basic
building block of DNA is the DNA nucleotide. There are 4 types of DNA
nucleotides: adenine (A), guanine (G), cytosine (C) and thymine (T). The DNA

1 The actual term “genomics” is thought to have been coined by Dr. Tom Roderick, a

geneticist at the Jackson Laboratory (Bar Harbor, ME) at a meeting held in Maryland on
the mapping of the human genome in 1986.

1
2 Algorithms for Next-Generation Sequencing

50 A C G T A G C T 30
jj jjj jjj jj jj jjj jjj jj
30 T G C A T C G A 50

FIGURE 1.1: The double-stranded DNA. The two strands show a comple-
mentary base pairing.

nucleotides can be chained together to form a strand of DNA. Each strand of


DNA is asymmetric. It begins from 50 end and ends at 30 end.
When two opposing DNA strands satisfy the Watson-Crick rule, they can
be interwoven together by hydrogen bonds and form a double-stranded DNA.
The Watson-Crick rule (or complementary base pairing rule) requires that the
two nucleotides in opposing strands be a complementary base pair, that is,
they must be an (A, T) pair or a (C, G) pair. (Note that A = T and C G are
bound with the help of two and three hydrogen bonds, respectively.) Figure 1.1
gives an example double-stranded DNA. One strand is ACGTAGCT while the
other strand is its reverse complement, i.e., AGCTACGT.
The double-stranded DNAs are located in the nucleus (and mitochondria)
of every cell. A cell can contain multiple pieces of double-stranded DNAs, each
is called a chromosome. As a whole, the collection of chromosomes is called a
genome; the human genome consists of 23 pairs of chromosomes, and its total
length is roughly 3 billion base pairs.
The genome provides the instructions for the cell to perform daily life
functions. Through the process of transcription, the machine RNA polymerase
transcribes genes (the basic functional units) in our genome into transcripts
(or RNA molecules). This process is known as gene expression. The complete
set of transcripts in a cell is denoted as its transcriptome.
Each transcript is a chain of 4 different ribonucleic acid (RNA) nucleotides:
adenine (A), guanine (G), cytosine (C) and uracil (U). The main difference be-
tween the DNA nucleotide and the RNA nucleotide is that the RNA nucleotide
has an extra OH group. This extra OH group enables the RNA nucleotide to
form more hydrogen bonds. Transcripts are usually single stranded instead of
double stranded.
There are two types of transcripts: non-coding RNA (ncRNA) and message
RNA (mRNA). ncRNAs are transcripts that do not translate into proteins.
They can be classified into transfer RNAs (tRNAs), ribosomal RNAs (rRNAs),
short ncRNAs (of length < 30 bp, includes miRNA, siRNA and piRNA) and
long ncRNAs (of length > 200 bp, example includes Xist, and HOTAIR).
mRNA is the intermediate between DNA and protein. Each mRNA con-
sists of three parts: a 5’ untranslated region (a 5’ UTR), a coding region and
a 3’ untranslated region (3’ UTR). The length of the coding region is of a
multiple of 3. It is a sequence of triplets of nucleotides called codons. Each
codon corresponds to an amino acid.
Through translation, the machine ribosome translates each mRNA into a
Introduction 3

protein, which is the sequence of amino acids corresponding to the sequence of


codons in the mRNA. Protein forms complex 3D structures. Each protein is
a biological nanomachine that performs a specialized function. For example,
enzymes are proteins that work as catalysts to promote chemical reactions
for generating energy or digesting food. Other proteins, called transcription
factors, interact with the genome to turn on or off the transcriptions. Through
the interaction among DNA, RNA and protein, our genome dictates which
cells should grow, when cells should die, how cells should be structured, and
creates various body parts.
All cells in our body are developed from a single cell through cell division.
When a cell divides, the double helix genome is separated into single-stranded
DNA molecules. An enzyme called DNA polymerase uses each single-stranded
DNA molecule as the template to replicate the genome into two identical
double helixes. By this replication process, all cells within the same individual
will have the same genome. However, due to errors in copying, some variations
(called mutations) might happen in some cells. Those variations or mutations
may cause diseases such as cancer.
Different individuals have similar genomes, but they also have genome
variations that contribute to different phenotypes. For example, the color of
our hairs and our eyes are controlled by the differences in our genomes. By
studying and comparing genomes of different individuals, researchers develop
an understanding of the factors that cause different phenotypes and diseases.
Such knowledge ultimately helps to gain insights into the mystery of life and
contributes to improving human health.

1.2 Sequencing technologies


DNA sequencing is a process that determines the order of the nucleotide
bases. It translates the DNA of a specific organism into a format that is deci-
pherable by researchers and scientists. DNA sequencing has allowed scientists
to better understand genes and their roles within our body. Such knowledge
has become indispensable for understanding biological processes, as well as in
application fields such as diagnostic or forensic research. The advent of DNA
sequencing has significantly accelerated biological research and discovery.
To facilitate the genomics study, we need to sequence the genomes of differ-
ent species or different individuals. A number of sequencing technologies have
been developed during the last decades. Roughly speaking, the development
of the sequencing technologies consists of three phases:

First-generation sequencing: Sequencing based on chemical degradation


and gel electrophoresis.
4 Algorithms for Next-Generation Sequencing

Second-generation sequencing: Sequencing many DNA fragments in par-


allel. It has higher yield, lower cost, but shorter reads.

Third-generation sequencing: Sequencing a single DNA molecule with-


out the need to halt between read steps.

In this section, we will discuss the three phases in detail.

1.3 First-generation sequencing


Sanger and Coulson proposed the first-generation sequencing in 1975 [259,
260]. It enables us to sequence a DNA template of length 500 1000 within a
few hours. The detailed steps are as follows (see Figure 1.3).

1. Amplify the DNA template by cloning.

2. Generate all possible prefixes of the DNA template.

3. Separation by electrophoresis.

4. Readout with fluorescent tags.

Step 1 amplifies the DNA template. The DNA template is inserted into
the plasmid vector; then the plasmid vector is inserted into the host cells for
cloning. By growing the host cells, we obtain many copies of the same DNA
template.
Step 2 generates all possible prefixes of the DNA template. Two tech-
niques have been proposed for this step: (1) the Maxam-Gilbert technique [194]
and (2) the chain termination methodology (Sanger method) [259, 260]. The
Maxam-Gilbert technique relies on the cleaving of nucleotides by chemical.
Four different chemicals are used and generate all sequences ending with A, C, G
and T, respectively. This allows us to generate all possible prefixes of the tem-
plate. This technique is most efficient for short DNA sequences. However, it
is considered unsafe because of the extensive use of toxic chemicals.
The chain termination methodology (Sanger method) is a better alter-
native. Given a single-stranded DNA template, the method performs DNA
polymerase-dependent synthesis in the presence of (1) natural deoxynu-
cleotides (dNTPs) and (2) dideoxynucleotides (ddNTPs). ddNTPs serve as
non-reversible synthesis terminators (see Figure 1.2(a,b)). The DNA synthesis
reaction is randomly terminated whenever a ddNTP is added to the growing
oligonucleotide chain, resulting in truncated products of varying lengths with
an appropriate ddNTP at their 3’ terminus.
After we obtain all possible prefixes of the DNA template, the product is
a mixture of DNA fragments of different lengths. We can separate these DNA
Introduction 5

C G T A A C G T A
C A C G T T C T G C A T C A C G T T C T G C A T
à
+ dATP + H+ + PPi
(a)

C G T A A C G T A
C A C G T T C T G C A T C A C G T T C T G C A T
à
+ ddATP + H+ + PPi
(b)

FIGURE 1.2: (a) The chemical reaction for the incorporation of dATP into
the growing DNA strand. (b) The chemical reaction for the incorporation of
ddATP into the growing DNA strand. The vertical bar behind A indicates
that the extension of the DNA strand is terminated.

3’-GCATCGGCATATG...-5’
5’-CGTA
CGTA G - +
CGTAG C
CGTAGC C
CGTAGCC G
CGTAGCCG T
DNA Insert Insert
CGTAGCCGT A
template into into GCCGTATAC
CGTAGCCGTA T
vector host cell Cloning CGTAGCCGTAT A
CGTAGCCGTATA C Electrophoresis
Cyclic sequencing & readout

FIGURE 1.3: The steps of Sanger sequencing.

fragments by their lengths using gel electrophoresis (Step 3). Gel electrophore-
sis is based on the fact that DNA is negatively charged. When an electrical
field is applied to a mixture of DNA on a gel, the DNA fragments will move
from the negative pole to the positive pole. Due to friction, short DNA frag-
ments travel faster than long DNA fragments. Hence, the gel electrophoresis
separates the mixture into bands, each containing DNA molecules of the same
length.
Using the fluorescent tags attached to the terminal ddNTPs (we have
4 different colors for the 4 different ddNTPs), the DNA fragments ending
with different nucleotides will be labeled with different fluorescent dyes. By
detecting the light emitted from different bands, the DNA sequence of the
template will be revealed (Step 4).
In summary, the Sanger method can generate sequences of length 800 bp.
The process can be fully automated and hence it was a popular DNA sequenc-
6 Algorithms for Next-Generation Sequencing

ing method in 1970 2000. However, it is expensive and the throughput is


slow. It can only process a limited number of DNA fragments per unit of time.

1.4 Second-generation sequencing


Second-generation sequencing can generate hundreds of millions of short
reads per instrument run. When compared with first-generation sequencing,
it has the following advantages: (1) it uses clone-free amplification, and (2) it
can sequence many reads in parallel. Some commercially available technologies
include Roche/454, Illumina, ABI SOLiD, Ion Torrent, Helicos BioSciences
and Complete Genomics.
In general, second-generation sequencing involves the following two main
steps: (1) Template preparation and (2) base calling in parallel. The following
Section 1.4.1 describes Step 1 while Section 1.4.2 describes Step 2.

1.4.1 Template preparation

Given a set of DNA fragments, the template preparation step first gener-
ates a DNA template for each DNA fragment. The DNA template is created
by ligating adaptor sequences to the two ends of the target DNA fragment (see
Figure 1.4(a)). Then, the templates are amplified using PCR. There are two
common methods for amplifying the templates: (1) emulsion PCR (emPCR)
and (2) solid-phase amplification (Bridge PCR).
emPCR amplifies each DNA template by a bead. First of all, one piece of
DNA template and a bead are inserted within a water drop in oil. The surface
of every bead is coated with a primer corresponding to one type of adaptor.
The DNA template will hybridize with one primer on the surface of the bead.
Then, it is PCR amplified within a water drop in oil. Figure 1.4(b) illustrates
the emPCR. emPCR is used by 454, Ion Torrent and SOLiD.
For bridge PCR, the amplification is done on a flat surface (say, glass),
which is coated with two types of primers, corresponding to the adaptors.
Each DNA template is first hybridized to one primer on the flat surface.
Amplification proceeds in cycles, with one end of each bridge tethered to the
surface. Figure 1.4(c) illustrates the bridge PCR process. Bridge PCR is used
by Illumina.
Although PCR can amplify DNA templates, there is amplification bias.
Experiments revealed that templates that are AT-rich or GC-rich have a lower
amplification efficient. This limitation creates uneven sequencing of the DNA
templates in the sample.
Introduction 7

(a)

templates

beads

water drop in oil template binds PCR for


to the bead a few rounds
(b)

(c)

FIGURE 1.4: (a) From the DNA fragments, DNA template is created by
attaching the two ends with adaptor sequences. (b) Amplifying the template
by emPCR. (c) Amplifying the template by bridge PCR.

1.4.2 Base calling


Now we have many PCR clones of amplified templates (see Figure 1.5(a)).
This step aims to read the DNA sequences from the amplified templates in
parallel. This method is called the cyclic-array method. There are two ap-
proaches: the polymerase-mediated method (also called sequencing by syn-
thesis) and the ligase-mediated method (also called sequencing by ligation).
The polymerase-mediated method is further divided into methods based on re-
versible terminator nucleotides and methods based on unmodified nucleotides.
Below, we will discuss these approaches.

1.4.3 Polymerase-mediated methods based on reversible ter-


minator nucleotides
A reversible terminator nucleotide is a modified nucleotide. Similar to
ddNTPs, during the DNA polymerase-dependent synthesis, if a reversible ter-
minator nucleotide is incorporated onto the DNA template, the DNA synthesis
is terminated. Moreover, we can reverse the termination and restart the DNA
synthesis.
Figure 1.5(b) demonstrates how we use reversible terminator nucleotides
for sequencing. First, we hybridize the primer on the adaptor of the template.
Then, by DNA polymerase, a reversible terminator nucleotide is incorporated
onto the template. After that, we scan the signal of the dye attached to the
8 Algorithms for Next-Generation Sequencing

PCR clone
C C C
T C
T T T
G G G
C G
C C C
A A AC
T A
C C C C
T T T G T
G G G C G
C C C A C
A A A A

(a)

G
C C G C G C
T T T T ……
G G G G
Add After Repeat the
C C C C
reversible scanning, steps to
A A A A
terminator reverse the sequence
dGTP termination other bases

Wash &
scan
(b)

FIGURE 1.5: Polymerase-mediated sequencing methods based on reversible


terminator nucleotides. (a) PCR clones of the DNA templates are evenly dis-
tributed on a flat surface. Each PCR clone contains many DNA templates of
the same type. (b) The steps of polymerase-mediated methods are based on
reversible terminator nucleotides.

reversible terminator nucleotide by imaging. After imaging, the termination


is reversed by cleaving the dye-nucleotide linker. By repeating the steps, we
can sequence the complete DNA template.
Two commercial sequencers use this approach. They are Illumina and He-
licos BioSciences.
The Illumina sequencer amplifies the DNA templates by bridge PCR.
Then, all PCR clones are distributed on the glass plate. By using the four-
color cyclic reversible termination (CRT) cycle (see Figure 1.6(b)), we can
sequence all the DNA templates in parallel.
The major error of Illumina sequencing is substitution error, with a higher
portion of errors occurring when the previous incorporated nucleotide is a
base G.
Another major error of Illumina sequencing is that the accuracy decreases
with increasing nucleotide addition steps. The errors accumulate due to the
failure in cleaving off the fluorescent tags or due to errors in controlling the
Introduction 9

A C
C G
T T
G A
T C

(a)

(b)

A C G T A C G T
(c)

FIGURE 1.6: Polymerase-mediated sequencing methods based on reversible


terminator nucleotides. (a) A flat surface with many PCR clones. In particu-
lar, we show the DNA templates for two clones. (b) Four-color cyclic reversible
termination (CRT) cycle. Within each cycle, we extend the template of each
PCR clone by one base. The color indicates the extended base. Precisely, the
four colors, dark gray, black, white and light gray, correspond to the four nu-
cleotides A, C, G and T, respectively. (c) One-color cyclic reversible termination
(CRT) cycle. Each cycle tries to extend the template of each PCR clone by
one particular base. If the extension is successful, the white color is lighted
up.
10 Algorithms for Next-Generation Sequencing

reversible terminator nucleotides. Then, bases fail to get incorporated to the


template strand or extra bases might get incorporated [190].
Helicos BioSciences does not perform PCR amplification. It is a single
molecular sequencing method. It first immobilizes the DNA template on the
flat surface. Then, all DNA templates on the surface are sequenced in par-
allel by using a one-color cyclic reversible termination (CRT) cycle (see Fig-
ure 1.6(c)). Note that this technology can also be used to sequence RNA
directly by using reverse transcriptase instead of DNA polymerase. However,
the reads generated by Helicos BioSciences are very short ( 25 bp). It is also
slow and expensive.

1.4.4 Polymerase-mediated methods based on unmodified


nucleotides
The previous methods require the use of modified nucleotides. Actually,
we can sequence the DNA templates using unmodified nucleotides. The basic
observation is that the incorporation of a deoxyribonucleotide triphosphate
(dNTP) into a growing DNA strand involves the formation of a covalent bond
and the release of pyrophosphate and a positively charged hydrogen ion (see
Figure 1.2). Hence, it is possible to sequence the DNA template by detecting
the concentration change of pyrophosphate or hydrogen ion. Roche 454 and
Ion Torrent are two sequencers which take advantage of this principle.
The Roche 454 sequencer performs sequencing by detecting pyrophos-
phates. It is called pyrosequencing. First, the 454 sequencer uses emPCR to
amplify the templates. Then, amplified beads are loaded into an array of wells.
(Each well contains one amplified bead which corresponds to one DNA tem-
plate.) In each iteration, a single type of dNTP flows across the wells. If the
dNTP is complementary to the template in a well, polymerase will extend by
one base and relax pyrophosphate. With the help of enzymes sulfurylase and
luciferase, the pyrophosphate is converted into visual light. The CDC camera
detects the light signal from all wells in parallel. For each well, the light inten-
sity generated is recorded as a flowgram. For example, if the DNA template
in a well is TCGGTAAAAAACAGTTTCCT, Figure 1.7 is the corresponding
flowgram. Precisely, the light signal can be detected only when the dNTP that
flows across the well is complementary to the template. If the template has
a homopolymer of length k, the light intensity detected is k-fold higher. By
interpreting the flowgram, we can recover the DNA sequence.
However, when the homopolymer is long (say longer than 6), the detec-
tor is not sensitive enough to report the correct length of the homopolymer.
Therefore, the Roche 454 sequencer gives higher rate of indel errors.
Ion Torrent was created by the person as Roche 454. It is the first semi-
conductor sequencing chip available on the commercial market. Instead of
detecting pyrophosphate, it performs sequencing by detecting hydrogen ions.
The basic method of Ion Torrent is the same as that of Roche 454. It also uses
emPCR to amplify the templates and the amplified beads are also loaded into
Introduction 11

6
5

intensity
4
3
2
1
ACGTACGTACGTACGTACGT

FIGURE 1.7: The flowgram for the DNA sequence TCG-


GTAAAAAACAGTTTCCT.

a high-density array of wells, and each well contains one template. In each
iteration, a single type of dNTP flows across the wells. If the dNTP is comple-
mentary to the template, polymerase will extend by one base and relax H+.
The relaxation of H+ changes the pH of the solution in the well and an IS-
FET sensor at the bottom of the well measures the pH change and converts it
into electric signals [251]. The sensor avoids the use of optical measurements,
which require a complicated camera and laser. This is the main difference
between Ion Torrent sequencing and 454 sequencing. The unattached dNTP
molecules are washed out before the next iteration. By interpreting the flow-
gram obtained from the ISFET sensor, we can recover the sequences of the
templates.
Since the method used by Ion Torrent is similar to that of Roche 454, it
also has the disadvantage that it cannot distinguish long homopolymers.

1.4.5 Ligase-mediated method


Instead of extending the template base by base using polymerase, ligase-
mediated methods use probes to check the bases on the template. ABI SOLiD
is the commercial sequencer that uses this approach. In SOLiD, the templates
are first amplified by emPCR. After that, millions of templates are placed on
a plate. SOLiD then tries to probe the bases of all templates in parallel. In
every iteration, SOLiD probes two adjacent bases of each template, i.e., it uses
two-base color encoding. The color coding scheme is shown in the following
table. For example, for the DNA template AT GGA, it is coded as A3102.

A C G T
A 0 1 2 3
C 1 0 3 2
G 2 3 0 1
T 3 2 1 0
12 Algorithms for Next-Generation Sequencing

The primary advantage of the two-base color encoding is that it improves


the single nucleotide variation (SNV) calling. Since every base is covered by
two color bases, it reduces the error rate for calling SNVs. However, conversion
from color bases to nucleotide bases is not simple. Errors may be generated
during the conversion process.
In summary, second-generation sequencing enables us to generate hundreds
of billions of bases per run. However, each run takes days to finished due to a
large number of scanning and washing cycles. Adding of a base per cycle is not
100% correct. This causes sequencing errors. Furthermore, base extensions of
some strands may be lag behind or lead forward. Hence, errors accumulate
as the reads get long. This is the reason why second-generation sequencing
cannot get very long read. Furthermore, due to the PCR amplification bias,
this approach may miss some templates with high or low GC content.

1.5 Third-generation sequencing


Although many of us are still using second-generation sequencing, third-
generation sequencing is coming. There is no fixed definition for third-
generation sequencing yet. Here, we define it as a single molecule sequencing
(SMS) technology without the need to halt between read steps (whether enzy-
matic or otherwise). A number of third-generation sequencing methods have
been proposed. They include:

Single-molecule real-time sequencing

Nanopore-sequencing technologies

Direct imaging of individual DNA molecules using advanced microscopy


techniques

1.5.1 Single-molecule real-time sequencing


Pacific BioSciences released their PacBio RS sequencing platform [71].
Their approach is called single-molecule real-time (SMRT) sequencing. It mim-
ics what happens in our body as cells divide and copies their DNA with the
DNA polymerase machine. Precisely, PacBio RS immobilizes DNA polymerase
molecules on an array slide. When the DNA template gets in touch with the
DNA polymerase, DNA synthesis happens with four fluorescently labeled nu-
cleotides. By detecting the light emitted, PacBio RS reconstructs the DNA
sequences. Figure 1.8 illustrates the SMRT sequencing approach.
PacBio RS sequencing requires no prior amplification of the DNA template.
Hence, it has no PCR bias. It can achieve more uniform coverage and lower GC
bias when compared with Illumina sequencing [79]. It can read long sequences
Introduction 13

Immobilized
polymerase

FIGURE 1.8: The illustration of PacBio sequencing. On an array slide,


there are a number of immobilized DNA polymerase molecules. When a DNA
template gets in touch with the DNA polymerase (see the polymerase at the
lower bottom right), DNA synthesis happens with the fluorescently labeled
nucleotides. By detecting the emitted light signal, we can reconstruct the
DNA sequence.

of length up to 20, 000 bp, with an average read length of about 10, 000 bp.
Another advantage of PacBio RS is that it can sequence methylation status
simultaneously.
However, PacBio sequencing is more costly. It is about 3 4 times more
expensive than short read sequencing. Also, PacBio RS has a high error rate,
up to 17.9% errors [46]. The majority of the errors are indel errors [71]. Luckily,
the error rate is unbiased and almost constant throughout the entire read
length [146]. By repeatedly sequencing the same DNA template, we can reduce
the error rate.

1.5.2 Nanopore sequencing method


A nanopore is a pore of nano size on a thin membrane. When a voltage
is applied across the membrane, charged molecules that are small enough can
move from the negative well to the positive well. Moreover, molecules with
different structures will have different efficiencies in passing through the pore
and affect the electrical conductivity. By studying the electrical conductivity,
we can determine the molecules that pass through the pore.
This idea has been used in a number of methods for sequencing DNA.
These methods are called the nanopore sequencing method. Since nanopore
methods use unmodified DNA, it requires an extremely small amount of input
material. They also have the potential to sequence long DNA reads efficiently
at low cost. There are a number of companies working on the nanopore se-
quencing method. They include (1) Oxford Nanopore, (2) IBM Transistor-
mediated DNA sequencing, (3) Genia and (4) NABsys.
Oxford nanopore technology detects nucleotides by measuring the ionic
current flowing through the pore. It allows the single-strand DNA sequence to
14 Algorithms for Next-Generation Sequencing

FIGURE 1.9: An illustration of the sequencing technique of Oxford


nanopore.

flow through the pore continuously. As illustrated in Figure 1.9, DNA material
is placed in the top chamber. The positive charge draws a strand of DNA
moving from the top chamber to the bottom chamber flowing through the
nanopore. By detecting the difference in the electrical conductivity in the
pore, the DNA sequence is decoded. (Note that IBM’s DNA transistor is a
prototype which uses a similar idea.)
The approach has difficulty in calling the individual base accurately. In-
stead, Oxford nanopore technology will read the signal of k (say 5) bases
in each round. Then, using a hidden Markov model, the DNA base can be
decoded base by base.
Oxford nanopore technology has announced two sequencers: MiniION and
GridION. MiniION is a disposable USB-key sequencer. GridION is an ex-
pandable sequencer. Oxford nanopore technology claimed that GridION can
sequence 30x coverage of a human genome in 6 days at US$2200 $3600. It
has the potential to decode a DNA fragment of length 100, 000 bp. Its cost is
about US$25 $40 per gigabyte. Although it is not expensive, the error rate is
about 17.8% (4.9% insertion error, 7.8% deletion error and 5.1% substitution
error) [115].
Unlike Oxford nanopore technology, Genia suggested combining nanopore
and the DNA polymerase to sequence a single-strand DNA template. In Genia,
the DNA polymerase is tethered with a biological nanopore. When a DNA
template gets in touch with the DNA polymerase, DNA synthesis happens
with four engineered nucleotides for A, C, G and T , each attached with a
different short tag. When a nucleotide is incorporated into the DNA template,
the tag is cleaved and it will travel through the biological nanopore and an
electric signal is measured. Since different nucleotides have different tags, we
can reconstruct the DNA template by measuring the electric signals.
NABsys is another nanopore sequencer. It first chops the genome into DNA
fragments of length 100, 000 bp. The DNA fragments are hybridized with a
particular probe so that specific short DNA sequences on the DNA fragments
Another random document with
no related content on Scribd:
lines. Downright fraud is possible, as in the concealment or false
weighing of dutiable articles, the publication of false statements
regarding the financial condition of fiduciary institutions, the covering
up of defects in tenement house building, and so on. Practices of this
character are extremely dangerous, however, as they are subject to
detection and consequent punishment by the first more than
ordinarily inquisitive inspector. Criminal chances are materially
improved by the bribery of officials who are thus bound to
concealment both by money interest and their own fear of exposure.
Vigilant and honest administration of the laws and the infliction of
sufficient penalties, particularly if they involve the imprisonment of
principals, may be trusted to reduce such practices to a minimum.
There are, however, other and more open methods of attack upon
the regulation of business by the state. Appeal may be made to the
courts in the hope that laws of this character may be declared
unconstitutional. Business interests may also seek their repeal or
amendment at the hands of the legislature. No one who accepts the
fundamental principles of our government can quarrel with either of
these two modes of procedure. While doubtless intended to secure
the public interest, attempts to regulate industry by the state may in
given cases really defeat this end. And the latter likelihood would be
increased almost to the point of certainty if legitimate protests from
the businesses affected were stifled. On the other hand attempts
may be made by business interests to influence courts or
legislatures corruptly. Assuming that the honesty of all the
departments of government would be proof against such attempts
there is still another possibility. A business affected by some form of
state regulation may endeavour to call to its aid the influence of
party. This final method of procedure may also be conducted in a
perfectly legitimate fashion. Public sentiment may be honestly
converted to the view that the amendment or repeal of the law, as
demanded by the business interest, is also in the interest of the state
as a whole. On the other hand the effort may be made, either by
large contributions to campaign funds, or by other still more
objectionable means, to enlist the support of party regardless of the
public welfare. Carrying out these various methods of procedure to
their logical conclusion brings us, therefore, face to face with the
question which underlies the whole fabric of political corruption,
namely how shall our party organisations be supported and
financed?
Reserving this issue for subsequent discussion, certain general
features of the reaction of industry against state regulation must be
noted as of immense importance. Black as it is corruption after all is
a mere incident in this struggle. The broad lines of development
which the Republic will follow in the near future would seem to
depend largely upon the outcome of this great process. Certain
social prophets tell us insistently that there are but two possibilities: if
the state wins the upper hand the inevitable result will be socialism; if
on the other hand business triumphs we must resign ourselves to a
more or less benevolent financial oligarchy. Imagination is not
lacking to embellish or render repulsive this pair of alternatives. From
Plato to Herbert George Wells all social prophets have been thus
gifted with the power of depicting finely if not correctly the minutiæ of
the world as it is to be. Time, the remorseless confuter of all earlier
forecasters of this sort, has shown them to be singularly lacking in
the ability to anticipate the great divergent highways of development,
—sympodes in Ward’s phrase,—which have opened up before the
march of human progress and determined the subsequent lines of
movement. So it may well be in the present case. The social futures,
one or the other of which we are bidden by present day prophets to
choose, are like two radii of a circle. They point in very different
directions, it is true, but between them are many possible yet
uncharted goals. And if social development may be likened to
movement in three instead of only two dimensions the lines open to
the future are enormously more divergent than our seers are wont to
conceive. Employing a less mathematical figure, prophecy usually
proceeds from the assumption that cataclysmic changes are
immediately impending. Otherwise, by the way, the prophets would
utterly fail to attract attention. But this presupposes an extremely
plastic condition of society. If social structures, however, really are so
plastic, they may then be remoulded not into one or two forms
merely, but into many other forms conceivable or inconceivable at
the present time. Current and widely accepted forecasts to the
contrary, therefore, one may still venture to doubt that our through
ticket to 1950 or 2000 a.d. is inscribed either “socialism” or “financial
oligarchy,” and stamped “non-transferable.”
Whatever the future may bring forth faith has not yet been lost in
the efficacy of state regulation. It is certainly within the bounds of
possibility that some working balance between government and
industry may be established through this means which will continue
indefinitely. As late as 1870, according to Mr. Dicey, individualistic
opinions dominated English law-making thought.[57] Reaction from
the doctrine of laisser faire was, if anything, even later in the United
States. For a considerable period after the turning point had been
passed measures of state regulation were weakened by survivals of
old habits of thought. Even to-day the period of construction and
extension along the line of state regulation seems far from finished.
Certain inevitable errors have required correction, but no major
portion of the system has been abandoned. Further progress should
be made easier by accumulating experience and precedents.
It is significant of the temper of the American people on this
question that an increasing number of the great successes of our
political life are being made by the men who have shown themselves
strongest and most resourceful in correcting the abuses of business.
Under present conditions no public man suspected of weakness on
this issue has the remotest chance of election to the presidency or to
the governorship of any of our larger states. On the purely
administrative side of the system of state regulation new and more
powerful agencies,—such as the Bureau of Corporations and the
various state Public Service Commissions,—have been devised
recently to grapple with the situation.
Considering the extreme importance of the work which such
agencies have to perform it is improbable that either their position, or
the position of government in general, is as yet sufficiently strong
with reference to corporate interests. Under modern conditions
success in business means very large material rewards. Important
as are industry and commerce, however, certain parts of the work
which the state is now performing are, from the social point of view,
immensely more valuable. The commonweal requires that our best
intellect be applied to these tasks. Any condition which favours the
drafting of our ablest men from the service of the state into the
service of business is a point in favour of the latter wherever the
conflict between them is joined. Yet not infrequently we witness the
promotion of judges to attorneyships for great corporations, the
translation of men who have won their spurs in administrative
supervision of certain kinds of business to high managerial positions
in these same businesses. As long as our morals remain as
mercenary as those of a Captain Dugald Dalgetty there may seem to
be little to criticise in such transactions. Doubtless loyalty is shown
by the men concerned to the government, their original employer,
and to the corporation, their ultimate employer. If, however, there is
to be an exchange of labourers between these fields it would be
vastly better for the state if the man who had succeeded brilliantly in
business should normally expect promotion to high governmental
position. As things are at present the glittering prizes known to be
obtained by ex-government officials who have gone into corporation
service cannot have the most favourable effect upon the minds and
activities of officials remaining in public positions requiring them to
exercise supervision over business activities.
It is high time that there should be a reversal of policy in this
connection. We need urgently greater security of tenure, greater
social esteem, much higher salaries, and ample retiring pensions for
those public officials who are on the fighting line of modern
government. Incomes as large as those of our insurance presidents
or trust magnates are not needed, although in many cases they
would be much more richly deserved. There is recompense which
finer natures will always recognise in the knowledge that they are
performing a vitally important public work. In spite of the loss which
political corruption causes the state it is probably more than made up
by the devoted, and in part unrequited, work of its good servants.
Still the labourer is worthy of his hire. It is both deplorable and
disastrous that the current rewards of good service should be so
meagre while the rewards of betrayal are so large.[58]
But it may be objected that public sentiment in a democracy will
not support high salaries even for the most important public services,
that democracies notoriously remunerate their higher officials very
much less adequately than monarchies. The time is ripe, however,
for challenging this attitude. As long as government work is looked
upon as a dull, soulless, and not extremely useful routine, popular
opposition to high salaries is not only comprehensible but
praiseworthy. Once convinced of the value of certain services,
however (and there is plentiful opportunity to do this), it is doubtful if
self-governing peoples will show themselves less intelligent than
kings. Even now engineers directing great and unusual public
undertakings are frequently paid larger salaries than high officers of
government charged with the execution of ordinary functions.
Recognition of the importance of the work of the specialist is much
more common now than formerly. The development of science and
large scale industry is daily enforcing this lesson. As regards the
unwillingness of democracy to pay well, a change that has recently
come over the policy of certain labour unions is worth noting. Instead
of ordering strikes they have on occasion taken a lesson from the
procedure of their employers, and resorted to the courts for the
redress of grievances. And in doing so they have called in the very
ablest lawyers they could find, paying the latter out of the funds of
the union fees as large as they could have earned on the opposing
side. A similar illustration is afforded by the policy of the Social-
democratic party in Germany which for a considerable period has
recognised and met the necessity of remunerating its leaders in
editorial, parliamentary, and propagandist work at professional rates
far higher than the average wages received by the party rank and
file.[59] Demos may despise aristocracy of birth but he is perhaps not
so incapable of comprehending the aristocracy of service as many of
his critics suppose.

By the system of regulation, as we have seen, government is


brought into close contact with business along many lines. But the
state is also one of the largest sellers and buyers in the markets of
the world, and as such has many other intimate points of contact
with economic affairs. Like any individual or corporation under the
same circumstances it is liable to be victimised by practices
designed fundamentally to make it sell cheaply or buy dearly,—both
to the advantage of corrupt outside interests. Franchise grabbing is
perhaps the most magnificent single example of the difficulties the
state encounters when it appears as a vendor. Popular ignorance of
the real nature of such valuable rights has been responsible for
enormous losses on this score. A city which for any reason had to
dispose of a parcel of land would find itself safeguarded in some
measure by the fact that a large number of its citizens were familiar
with real estate values and methods. Knowledge of intangible
property is very much less common. There has been a great deal of
effective educational work along this line, however, and that
community is indeed backward which at the present time does not
understand perfectly that perpetual franchises without proper
safeguards of public interests are fraudulent on their face.
Administrative agencies such as were referred to in connection with
business regulation, and particularly public service commissions, are
doing extremely valuable work in cutting down the possibilities of
franchise corruption. The grabbing of alleys, the seizure of water
fronts, and the occupation of sidewalks are minor forms of the same
sort of evil which can no longer be practised with the impunity
characteristic of the good old free and easy days of popular
ignorance and carelessness.
In disposing of its public domain, although here, of course, the
price obtained was not the major consideration on the part of the
government, notorious cases of corruption were of common
occurrence. Recent developments, particularly in connection with
timber, mineral, and oil lands, reveal the stiffer attitude which the
public and the government are taking on this question. Time was
also when deposits of public funds yielded little or no interest to cities
and counties. At bottom such loan transactions were sales, i.e., the
sale of the temporary use of public monies. Quite commonly
treasurers were in the habit of considering themselves responsible
for the return of capital sums only, and any interest received was
regarded as a perquisite of office. The system had all the support of
tradition and general usage; it was frequently practised with no effort
at concealment and without protest on moral or business grounds.
Nowadays its existence would be regarded as a sign of political
barbarism, and would furnish opportunity for charges of corruption
about which effective reform effort would speedily gather.
The element of selling is not of major importance in many services
performed by the state which nevertheless require constant watching
in order to prevent corrupt misuse. Efforts to use the mails
improperly are continually being made by get-rich-quick schemes,
swindlers, gamblers, and touts of all descriptions. Crop reports are
furnished without charge, but extraordinary precautions are
necessary to prevent them from leaking out.
Instances such as the foregoing also serve to illustrate the point
that any extension of the functions of government results in an
enlargement of the opportunities for political corruption. Government
railways, for example, would afford venal public officials many
crooked opportunities (as e.g., for false billing, charges, and
discrimination) which do not now exist as direct menaces to public
administrative integrity. Private management of railways, however,
has not been so free from such evils as to be able to use this
argument effectively against nationalisation. Municipalities selling
water, gas, or electricity, are notoriously victimised on a large scale
by citizens whose self-interest as consumers overcomes all thought
of civic duty or common honesty.[60] There is one other closely
related phenomenon connected with the extension of the economic
functions of government which will perhaps not so readily be thought
of as corrupt, and which yet deserves consideration from that point
of view. Public service enterprises under government ownership and
management are always exposed to what some writers stigmatise as
“democratic finance,”—that is strong popular pressure to reduce
rates. Cost of production and the general financial condition of a
given city may make the reduction of the prices charged for water,
electricity, or gas, frankly contrary to public policy. Yet the self-
interest of the consumer suggests the use of his vote and political
influence to compel such reductions. Of course his action is not
consciously corrupt, but it has every other characteristic feature of
this evil.
The state is a much larger buyer than seller, and manifold are the
possibilities of malpractice in connection with its enormous
purchases of land, materials, supplies, and labour. Yet nothing in our
contemporary political life is more marvellous than the light-hearted
indifference with which the public regards the voting and expenditure
of sums running into the millions. We recall vaguely that the great
constitutional issues of English history were fought out on fiscal
grounds, but we find the budget of our own city a deadly bore. We
rise in heated protest whenever the tax rate is advanced by a
fraction of a mill, but we regard with indifference the new forms either
of necessary expenditure or needless extravagance which make
such increases inevitable. There is one general advantage which
state buying has over state selling, however. A great many of the
purchases of government are made in competitive markets where
fair price rates are ascertainable with comparative ease. Even large
public contracts such as for erecting public buildings or for road
construction are usually resolvable into a number of comparatively
simple processes and purchases. Of course there are exceptions, as
for example government contracts with railroad companies for
carrying mail. But as a rule the ordinary purchasing operations of the
state are simpler and more easily comprehensible than such acts of
sale as franchise grants,—to mention the one of most importance
from the view point of possible corruption. It is precisely at this
vulnerable point on the buying side of governmental operations that
the New York Bureau of Municipal Research has struck home. With
a skill that amounts to positive genius this voluntary agency has
placed before the people the ruling market prices and the
enormously higher prices actually paid by officials for public supplies.
Taking the purchasing departments of our best organised private
corporations as a model it has drawn practical plans for the
installation of similar methods as part of our municipal machinery.
Equipped with field glasses and mechanical registering devices its
agents have kept tab upon the flaccid activities of labourers in the
public service and have contrasted the long distance results thus
obtained with the suddenly energised performances of the same
men when they knew themselves to be under observation. It has co-
operated quietly and effectively with all willing officials in improving
the methods of work in their offices, in installing more logical
accounting systems and better methods of recording work done; and
it has fought effectively, with the penalty of discharge by the
Governor in two cases, those officials who were not amenable to
proper corrective influences. And finally the Bureau has attacked the
city budget and has even succeeded in making that dry and
formidable document the object of active and intelligent public
interest. Yet the cost of the Bureau’s work has been out of all
proportion small in comparison with the benefits obtained. “Less than
$30,000 was spent in 1908 in securing for four million people the
beginnings of a method of recording work done when done, and
money spent when spent, which will henceforth make inefficiency
harder than efficiency, and corruption more difficult than honesty.”[61]
It is extremely gratifying to note the widespread interest which this
work is arousing. Philadelphia and Cincinnati have established
bureaus under the supervision of the parent organisation, and other
cities are earnestly considering similar action. There would seem to
be ample opportunity for the employment of the same sort of agency
in connection with our state governments, and possibly in certain
fields of national administration. Altogether it is by far the most
noteworthy recent effort to stamp out governmental inefficiency and
corruption. Other efforts are being made to the same end,
particularly where extravagance is concerned. In many cities
business men’s clubs, improvement associations, and taxpayers’
leagues, are watching expenditure as it has never been watched
before. Public officials have co-operated loyally in a number of
cases. Occasional discoveries have been made of safeguards and
powers in the law which had been long forgotten or unused. The
progress of uniform bookkeeping is rapid and highly significant in this
connection. Aided largely by the latter development census reports,
particularly the special issues dealing with statistics of cities of over
30,000 inhabitants, are making it possible to compare municipalities
on strictly quantitative lines as to the cost and efficiency of various
services. Altogether we are developing a new financial alertness and
intelligence that should materially cut down various forms of political
inefficiency and corruption, particularly on the purchasing side of
governmental operations. It is not too much to say that the methods
by which these results are to be effected are either at hand or
capable of being worked out on demand. The question now is simply
one of funds and the training of specialists to do the work.

Efforts by the state to regulate or suppress vice and crime bring


about reactions on the part of the interests affected not unlike those
which follow from the attempt to regulate legitimate business.[62] The
corrupt practices thus occasioned are among the commonest and
most objectionable that society has to contend with. Very large
money returns in the aggregate are obtainable by political
organisations which protect vice. It is doubtful, however, whether
such sources of income are ordinarily so productive financially as the
various corrupt relationships between business and politics
described above. On the other hand the alliance between politics
and the underworld has the advantage, from an unmoral point of
view, that it yields not only money but also strong support at the
polls. And as some of the voters whose support is thus secured may
be depended upon for work as repeaters, ballot-box staffers, and
thugs, the net political power which they wield is much greater than
their number alone would indicate. There is one other important
difference between the corruption arising from the attempt to
regulate legitimate business and that which arises from the
regulation of vice. In the former case a “deal” can frequently be
consummated between two principals, the leader of a centralised
business interest on the one hand and the leader of a centralised
political organisation on the other. Something of the secrecy and the
binding personal obligation of a “gentleman’s agreement” are thus
obtained. Of course to carry out the compact the political leader must
control the action of his dependents in public office, but the latter,
whatever they may suspect, know nothing definite about the
agreement into which the boss has entered. The protection of vice
cannot be managed by so small a number of conspirators.
Compliance by the whole police force with orders from the “front” is
necessary. Every patrolman knows what is going on when he is told
that he must not molest dives, gambling hells, or brothels. Members
of the force may or may not be used to collect protection money and
pass it “higher up,” but since there are many establishments which
must pay tribute the employment of a considerable corps of
collectors is necessary. In any event the number of those who have
knowledge amounting to legal evidence is likely to be much larger in
the case of the protection of vice than in the case, for example, of a
corrupt franchise grant. Exposure either by some disgruntled police
officer or by some overtaxed purveyor of vice is likely to come at any
moment.
When revelations of this kind are made the moral uprising which
follows seldom lacks force. Cynics may sneer at such popular
manifestations as paroxysms of virtue, but practical political leaders
are not likely to underestimate the damage which they are capable of
inflicting. Even if the machine emerges comparatively unharmed
from such a period of attack the heads of some of the leaders are
likely to fall before the popular fury. Remembering Fernando Wood’s
caution about the necessity of “pandering a little to the moral
element in the community,” farsighted bosses are therefore more
and more inclined to limit their operations in the field of vice
protection and to seek support from their relations with legitimate
business interests. If this process is carried on further, as now seems
likely, the general problem we are considering will be simplified by
the reduction to a minimum of corruption by the vicious element,
although, of course, we would still have on our hands the large
question of corruption in the interests of otherwise legitimate
business.
With all their mistakes and wasted energy moral uprisings will help
this development materially. After a “spasm” things seldom fall back
into the lowest depths of their former state. We need more frequent
spasms, at least until we shall have learned to live continuously and
steadily on a higher plane. The gravest evil of the present
intermittent situation would seem to be that in our occasional fits of
indignation we write requirements into the law which, considering the
prevalence of minor vicious habits among large masses of the
people, are impossible of execution. One of the lessons which we
have learned in attempting to regulate business is that legislation of
this sort must be supplemented by special administrative agencies if
it is to have any effect whatever. Admirable child labour laws, for
example, remain absolutely futile without labour commissioners and
a force of inspectors created particularly to see them enforced. Yet in
the regulation of vice we rush on heedlessly piling one new duty after
another upon an already overburdened and underpaid police force.
As a consequence its chiefs become habituated to the idea that they
can exercise discretion as to what laws are to be enforced and what
laws may with impunity be neglected. Usually the police decision of
such questions is that crime must be dealt with vigilantly and
severely, vice, on the other hand, with large tolerance. No better
situation could be devised for the schemes of corruptionists.
Moreover the latter are materially aided in their nefarious work by the
high standards and the severe penalties which the moral element
has written into the law. Using these as clubs the corrupt politician
can extort much larger contributions than would otherwise be
possible. There are only two ways in which this situation may be
met. Either we must reduce our too ambitious programmes for the
regulation of vice, or we must strengthen our police forces until they
are adequate to meet the demands placed upon them.
In whatever way this question may be decided there is ample
ground for demanding the application of the most thoroughgoing civil
service reform rules to our police establishments. It is well known
that the rank and file of our police forces despise the dirty work
forced upon them by the political wretches “higher up.” No man
capable of the heroic deeds which are commonplace in the annals of
the police service rings the doors of bawdy houses and collects a
tithe from the sorrowful wages of shame except under the
compulsion of the sternest bread and butter necessity. Usually it is
necessary to select the yellow dogs of the force for this particular
devil’s work, and the promotion of such creatures for their pliancy is
a stench in the nostrils of their honest comrades. If one doubts that
this is the real spirit of the police force let him consider the character
of the fire departments of nearly all of our cities. The men in the latter
service are chosen from the same class, face dangers of the same
magnitude, and receive wages of about the same amount as
policemen. In the main they deserve and receive high praise for their
efficiency and honesty. It is true that they are not exposed to the
corrupt temptations which surround policemen, and also that their
work is held up to exacting standards by business men and
insurance companies. The fact remains that our police forces are
constructed of the same human material as our fire fighting forces. It
is difficult to avoid the conclusion that if we adapt our requirements in
the regulation of vice to the capacities of our forces, (or vice versa); if
we criticise and appreciate their work in the same way as we do the
work of their sister department; if we place their appointment, tenure,
and promotion, upon grounds of merit and utterly exclude the
influence of the corrupt “higher-ups,” we will have gone a long ways
toward drying up the most despicable of the sources of political
corruption.

Certain of the moral aspects of tax dodging have been dealt with
in an earlier connection.[63] Contrary as it may seem to the principle
of economic interest there is a good deal of carelessness and
stupidity in this field, and cases are occasionally found of property
that has been over-assessed. On the other hand more or less
deliberate dodging is indulged in very largely. So far as this practice
deserves the stigma of corruption it is a stigma which rests to a very
considerable extent upon the so-called class of “good citizens.”
Certain residents of “swell suburbs” who daily thank God that their
government is not so corrupt as that of the neighbouring city are
among the worst offenders of this kind. As directors of corporations
men of the same standing are sometimes responsible for
monumental evasions. Of course what was said with regard to the
reaction of business against state regulation holds good here also.
There are plenty of legitimate methods of protesting against unjust or
oppressive taxes,—before the courts, before legislatures, and by
open propaganda designed to influence party organisations. But the
furtive concealment or misrepresentation of taxable values is a
matter of entirely different moral colour, and it is the latter practice
that we now have to consider.
Tax dodging is so common and so well established that it has built
up an exculpatory system of its own. Instead of bringing his
conscience up to the standards set by the law the ordinary property
owner inquires into the practice of his neighbours, and governs
himself accordingly. Wherever business competition is involved the
threat of possible bankruptcy practically forces him into this course.
Yet withal our citizen is likely to be somewhat troubled in his mind
about his conduct, at least until he learns how shamelessly John
Smith who lives just a little way down the street has behaved. This
information quite restores his equanimity, and at the next
assessment he may outdo Smith himself. Thus the extra legal, if not
frankly illegal, neighbourhood standard tends constantly to become
lower. And not only taxpayers but tax officials themselves are
affected by local feeling and fall into the habit of closing their eyes to
certain kinds of property and expecting only a certain percentage of
the valuation fixed by law to be returned.
Back of such neighbourhood ideas on taxation there are at least
two lines of defence. One is that our present tax system is highly
illogical, bothersome, and burdensome. To a considerable extent,
unfortunately, this is the case, but it does not justify illegal methods
of seeking redress. Not much argument is required to convince the
ordinary property owner that all the inequities of the existing system
fall with cumulative weight upon his devoted head. His pocketbook
affirms this view more strongly perhaps than he would care to admit,
but anyhow the net result is that he feels himself more or less
pardonable for evading the burden as far as he can. The other
common excuse for tax dodging is supplied by the conviction that
much of the money raised by government is wasted or stolen. The
logic based upon this premise is most curiously inverted, but none
the less it sways the action of great multitudes. Politicians are a set
of grafters, says our taxpayer. The more money they get the more
they will waste and steal. I will punish the rascals as they deserve by
dodging my taxes, that is by becoming a grafter myself. Seldom,
however, is the latter clause clearly expressed. Yet the existence of
corruption on one side of the state’s activities is thus made the
excuse for corruption on another side whereby the state is mulcted
of much revenue which it might receive. Of course evasion results in
higher tax rates, which, by the way, fall crushingly upon the citizen
who makes full and honest returns, and thus part of the loss in
income to the government is made up. It seems clear, however, that
in the long run government income is materially reduced by the
feeling prevalent among taxpayers which has just been described
and the procedure based upon it. On the whole this is by far the
most serious single economic consequence of political corruption. It
is bad enough that public money should be stolen, that public work
should be badly done, and that politicians and contractors should
grow rich in consequence, but it is far worse that the state should be
starved of the funds necessary to perform its existing functions
properly and to extend its activities into new fields. In the regulation
of industry, in education, philanthropy, sanitation, and art, American
government is very far from achieving what it should do in the public
interest and what it could do more efficiently and cheaply than any
other agency. Yet our progress toward this goal is perpetually
hindered by the existence of waste and corruption on the one hand,
and the consequent peremptory shutting of the taxpayers’
pocketbooks on the other hand.
Consideration of the theory underlying tax dodging reveals certain
broad lines of correction. In proportion as our tax system
approximates greater justice, the evasion which defends itself on the
ground of the inequities of the present system will tend to disappear.
To show how this may be done is beyond the limits of the present
study. Suffice it to say that the work which reformers and students of
public finance are now doing on inheritance, income, and corporation
taxes, on the taxation of unearned increment, on various applications
of the progressive principle, and so on, should eventuate in the
tapping of much needed new sources of income, and thus facilitate
the correction of present unjust burdens. The introduction of
economical methods and the elimination of the grosser forms of
corruption in the field of government expenditure will weaken the
excuses offered for evasion on the ground of public waste and graft.
There is an overwhelming mass of evidence to show that the
American taxpayer, once convinced of the necessity of a given public
work and further assured that it will be honestly executed, is
generous to the point of munificence. The annals of our cities are full
of the creation of appointive state boards composed of men of the
highest local standing and intrusted with the carrying out of some
great single project,—the erection of a city hall building, the
construction of a water works and filtration plant, or of a park system.
Very few such commissions are grudged the large sums of money
necessary for their undertakings. If every ordinary branch of our
government enjoyed public confidence to a similar degree such
special boards would no longer be needed, and ample funds would
be forthcoming from taxpayers for all our present functions and for
other new and worthy functions which might be undertaken greatly to
the public advantage. Finally our system of tax administration, like
our system of government regulation of business, needs
strengthening. Larger districts and centralised power in the hands of
the assessors will help to lift them above the neighbourhood feeling
which is responsible for so much evasion. Higher salaries will bring
expert talent and backbone sufficient to resist the pressure brought
to bear by large interests. As a matter of fact the first threat of a virile
execution of the laws wipes out a considerable part of the tax
dodging in a community.
There would also seem to be much virtue in the plan for the
reduction of the tax rate advocated by the Ohio State Board of
Commerce.[64] Given a community with a high general tax rate it
may be the case, for example, that only about fifty per cent of true
value is being returned. Everybody knows it; everybody is sneakingly
ashamed of it. Still fundamental honesty is supposed to exist in the
man who does not cut the percentage below say, forty; the really
mean taxpayer is he who risks twenty-five per cent on what he feels
obliged to return and conceals whatever he can into the bargain.
Under such circumstances it is proposed to take the heroic step of
reducing the tax rate to one-half its existing figure or even less.[65] If
this can be done it is hoped that taxpayers can be induced to return
their property at full value, and that much personal property will
come out of the concealment into which it has fled under the menace
of a high general property tax rate. In support of this argument cases
are cited where a reduction of the personal property tax rate alone
has brought in much larger returns and converted into a source of
revenue a form of taxation which everywhere under high rates shows
a tendency to dwindle to practically nothing.
As against the plan it may be said that taxpayers are habitually
suspicious and extremely likely to regard any change whatever as
certain to increase their burdens. Many of them would see nothing in
the proposition beyond a tricky device to screw up their valuations
under the pretext of low rates with the intention of falling upon them
later with rates as high or higher than before. Unless the reform were
fully understood and then backed up by the most vigorous
administration there would be grave danger that revenues would
materially suffer. Given these conditions essential to success,
however, the plan would seem decidedly worth trying. A community
willing to take the plunge, it is pointed out, would enjoy large
advantages over its less enterprising competitors in the advertising
value of a low tax rate, particularly as a means of inducing new
industries to locate in its midst. The reform would probably not
increase revenue, indeed it does not aim to do so, but it would yield
a moral return of immense value. Self respect would be restored to
many otherwise thoroughly good citizens, and public opinion, to the
tone of which the taxpaying class contributes largely, would be lifted
to higher planes. Under the present vicious system there must be a
considerable number who realise secretly and more or less vaguely
the inherent kinship between their conduct and that of the corrupt
political organisation under which they live, and who in consequence
remain inactive and ashamed when movements for better things in
city, state, or nation, are on foot.

In our earlier study of the nature of political corruption an effort


was made to distinguish between bribery and auto-corruption. As
examples of the latter, legislators may employ knowledge gathered
on the floor of the chamber or in committee rooms for the purpose of
speculating to their own advantage on stock or produce exchanges;
administrative officials who by virtue of their position learn in
advance any information which may affect the market (i.e., crop
reports) can do likewise; knowledge of important judicial decisions
before they are handed down may be exploited in a similar fashion;
insiders who have access to plans for projected public improvements
are in a position to acquire quietly the needed real estate for sale at
much advanced prices to the city or state when condemnation
proceedings are actually begun. In addition to large and striking
transactions of this sort there are innumerable smaller possibilities
for auto-corruption which present themselves to public functionaries
ranging in rank from the highest places in the official hierarchy to that
of the policeman on his beat. Logically the practices under
consideration are not separable from the general forms of political
corruption as the term is applied in the present discussion. Instead
they would seem to be a special method of carrying on corrupt
transactions of many different kinds. All cases of auto-corruption,
however, have the common feature that bribery by an outside
interest is excluded, and this absence of bribery has sometimes led
to the assertion that the practices included under this heading are
“honest grafts.” An illusory appearance of innocence is also
conferred by the difficulty of tracing the incidence of the burdens
resulting from such transactions. There is no moral ground for such
favourable distinctions, however. On the contrary, auto-corruption is
clearly worse than bribery in that an entire transaction of this
character must be guiltily designed and executed wholly by one
person, and that person an official charged with knowledge which
should be used only in the public interest. Moreover the profits of his
secret treachery may be turned entirely into his private bank
account. If so he cannot even plead in justification that he has been
acting in behalf of a party organisation by gathering and contributing
the funds necessary to its management. The latter feature of auto-
corruption stands in need of special consideration.
It may be laid down as a principle of fairly general applicability that
political corruption as such is disadvantageous to the party
organisation permitting it. From a purely tactical point of view it would
be the extreme of bad party management to tolerate loose conditions
under which a large number of officials could graft freely on their own
account and place the entire proceeds in their own pockets. Whether
they confined themselves to auto-corruption or accepted bribes
singly or in combinations the case would not be materially altered. A
day of reckoning at the polls would surely come, and it would find the
party treasury unprovided with the means of defence. Yet conditions
of this character prevailed pretty generally prior to the advent of
strongly centralised political machines. Corruption was extremely
diffuse, personal responsibility for it widely scattered, party
responsibility not so clear. Manipulation required whole corps of
lobbyists, “third houses,” “black horse cavalry,” and the like. Under
present conditions the party organisation with strongly centralised
management has a direct interest in limiting corruption and also in
seeing that contributions which arise from such practices as it
tolerates actually flow into the party war chest. Corruption in purely
selfish interest becomes treachery to the party as well as treachery
to the state. Of course even under strong centralised and permanent
party control cases of auto-corruption still occur, and at times officials
receive bribes without turning in their quotas. If so, however, it may
fairly be presumed that they make their peace with the organisation
in other ways. They may be exempt from paying tribute, but they are
nevertheless obliged to deliver votes or influence. Many sins are laid
at the door of the machine. It has at least the advantage of enabling
us to centralise responsibility for all corrupt practices which occur
under its management.
From this point of view one may undertake an outline of the form
of corruption connected with political control. All the preceding forms
of political corruption may be considered the obverse, this is the
reverse of the die. None of the practices earlier considered can be
carried on without danger; the corruption of political control is the
crooked means of avoiding the cumulative effects of these practices.
It is not popular, it is not good politics even in the narrowest practical
acceptance of that term, for a political organisation to grant corrupt
favours to business, to wink at the violation of the law by vice, to
allow its partisans in office to sell government property cheap and
buy government supplies dear. If any of these things are permitted
the organisation, like the common criminal, must take care to lay
aside “fall money” against a day of trial.
One might feel greater confidence in the restraining influence of
party centralisation were it not for the fact that the more dangerous
to party success are the forms of corruption which an organisation
tolerates the more lucrative they are apt to be. Though its sins be as
scarlet still they produce funds sufficient to buy indulgences and to
leave a handsome profit over. In connection with business
regulation, for example, bribery in any considerable amount is not
possible until legislation is enacted or close at hand. And legislation
of this kind is not likely to be passed or threatened unless a strong
public sentiment demands action. Political manipulation which
attempts to frustrate regulation at such a juncture must sooner or
later prepare itself to reckon with the public sentiment which it has
flouted. Vice cannot be tolerated except in contravention of laws
against it, and to do so means to offend the moral sentiment in the
community which placed such laws on the statute books. Franchise
grabbing is not profitable on a large scale until the experience of
earlier public service corporations has impressed upon the public
mind the great value of such grants. If, nevertheless, grabs are
permitted by the machine, the boodle must be sufficient to pay both
for the personal services involved and to repair any resulting
damage to party prestige at the next election. Of course many
citizens are apathetic with regard to such abuses or even ignorant of
their existence, and there are others who are so involved in corrupt
practices, particularly in connection with tax dodging, meter fixing,
and the protection of vice, that they feel themselves allied in interest
with the party organisation and accordingly vote its ticket. Always,
however, there is a contingent, and frequently it is large enough to
hold the balance of power, which is neither ignorant nor apathetic,
and which, although perhaps too quiescent ordinarily, will rise in
revolt against any organisation which grafts too boldly and too
widely.
The situation of the venal machine is, therefore, substantially this:
more money can be obtained at any time if certain practices

You might also like