Computational Systems Biology
Tao Huang
Methods in
Molecular Biology 1754

Tao Huang Editor

Computational
Systems Biology
Methods and Protocols
METHODS IN MOLECULAR BIOLOGY

Series Editor
John M. Walker
School of Life and Medical Sciences
University of Hertfordshire
Hatfield, Hertfordshire, AL10 9AB, UK

For further volumes:


http://www.springer.com/series/7651
Computational Systems Biology

Methods and Protocols

Edited by

Tao Huang
Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
Editor
Tao Huang
Shanghai Institutes for Biological Sciences
Chinese Academy of Sciences
Shanghai, China

ISSN 1064-3745 ISSN 1940-6029 (electronic)


Methods in Molecular Biology
ISBN 978-1-4939-7716-1 ISBN 978-1-4939-7717-8 (eBook)
https://doi.org/10.1007/978-1-4939-7717-8
Library of Congress Control Number: 2018935135

© Springer Science+Business Media, LLC, part of Springer Nature 2018


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction
on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation,
computer software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply,
even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations
and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to
be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty,
express or implied, with respect to the material contained herein or for any errors or omissions that may have been made.
The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Printed on acid-free paper

This Humana Press imprint is published by the registered company Springer Science+Business Media, LLC part of
Springer Nature.
The registered company address is: 233 Spring Street, New York, NY 10013, U.S.A.
Preface

With the rapid development of high-throughput technologies, such as next-generation
sequencing and single-cell sequencing, many tough biomedical questions can be answered
since it is no longer impossible to get a whole picture of the biological system. Complex
diseases, such as tuberculous meningitis and leukemia, involve dysfunctions on multiple
levels, including DNA variants, mRNA differential expression, and protein fluctuation.
Accurately measuring these molecules is the first step of understanding the biological
system.
But even if we can get all these multi-omics data, the bioinformatics analysis of such big
data is still very challenging. There are two types of analysis for deciphering the mechanisms
hidden behind biomedical big data. One is machine learning, which can analyze various
features and build a predictive model that predicts the response of a biological system to
a perturbation or classifies samples into subtypes. In recent years, one family of machine
learning methods, deep learning, has become extremely popular and is now a powerful tool for
big data analysis.
The other effective method is network analysis based on graph theory. Networks are how
we understand the complex world. A network starts from a node, and a connection in real life is
abstracted as an edge. It can grow fast and become more and more complex. Eventually, it
exhibits unique properties and reflects the complex system. It has inspired the development of
many algorithms, such as the neural network in deep learning. In biomedicine, it is a
wonderful way of integrating diverse big data and transforming biological questions into
mathematical questions, especially graph theory questions. Graph theory empowers
network analysis to see the hidden truth underneath the hairy ball we see. Visualizing
a large-scale network can help us get a sense of the network, but on its own it cannot give us the
useful information that we are interested in, such as which genes are the key drivers and
which genes are novel disease genes or possible drug targets.
In this book, we introduce the latest experimental and bioinformatics methods for DNA
sequencing, RNA sequencing, cell-free tumor DNA sequencing, single-cell sequencing, and
single-cell proteomics and metabolomics. Then, we review the advanced analysis methods,
such as genome-wide association studies (GWAS), machine learning, reconstruction and
analysis of gene regulatory networks, and differential coexpression network analysis, and
give a practical guide for how to choose and use the right algorithm or software to handle
specific high-throughput data or multi-omics data. A powerful novel RNA-seq data analysis
and visualization tool, iSeq, is released in this book. The last parts of the book are the
applications of these high-throughput technologies and advanced analysis methods in
complex diseases, such as tuberculous meningitis and leukemia.
We hope that after reading this book, readers will understand how biomedical big
data are generated, which tools can be used to process them, which advanced machine
learning and network analysis methods are available for data integration and knowledge
discovery, and what has been achieved to date.

Shanghai, China Tao Huang

Contents

Preface
Contributors
1 DNA Sequencing Data Analysis
Keyi Long, Lei Cai, and Lin He
2 Transcriptome Sequencing: RNA-Seq
Hong Zhang, Lin He, and Lei Cai
3 Capture Hybridization of Long-Range DNA Fragments
for High-Throughput Sequencing
Xing Chen, Gang Ni, Kai He, Zhao-Li Ding, Gui-Mei Li,
Adeniyi C. Adeola, Robert W. Murphy, Wen-Zhi Wang, and Ya-Ping Zhang
4 The Introduction and Clinical Application of Cell-Free Tumor DNA
Jun Li, Renzhong Liu, Cuihong Huang, Shifu Chen, and Mingyan Xu
5 Bioinformatics Analysis for Cell-Free Tumor DNA Sequencing Data
Shifu Chen, Ming Liu, and Yanqing Zhou
6 An Overview of Genome-Wide Association Studies
Michelle Chang, Lin He, and Lei Cai
7 Integrative Analysis of Omics Big Data
Xiang-Tian Yu and Tao Zeng
8 The Reconstruction and Analysis of Gene Regulatory Networks
Guangyong Zheng and Tao Huang
9 Differential Coexpression Network Analysis for Gene Expression Data
Bao-Hong Liu
10 iSeq: Web-Based RNA-seq Data Analysis and Visualization
Chao Zhang, Caoqi Fan, Jingbo Gan, Ping Zhu, Lei Kong, and Cheng Li
11 Revisit of Machine Learning Supported Biological and Biomedical Studies
Xiang-tian Yu, Lu Wang, and Tao Zeng
12 Identifying Interactions Between Long Noncoding RNAs
and Diseases Based on Computational Methods
Wei Lan, Liyu Huang, Dehuan Lai, and Qingfeng Chen
13 Survey of Computational Approaches for Prediction
of DNA-Binding Residues on Protein Surfaces
Yi Xiong, Xiaolei Zhu, Hao Dai, and Dong-Qing Wei
14 Computational Prediction of Protein O-GlcNAc Modification
Cangzhi Jia and Yun Zuo
15 Machine Learning-Based Modeling of Drug Toxicity
Jing Lu, Dong Lu, Zunyun Fu, Mingyue Zheng, and Xiaomin Luo
16 Metabolomics: A High-Throughput Platform for Metabolite
Profile Exploration
Jing Cheng, Wenxian Lan, Guangyong Zheng, and Xianfu Gao
17 Single-Cell Protein Assays: A Review
Beiyuan Fan, Junbo Wang, Ying Xu, and Jian Chen
18 Data Analysis in Single-Cell Transcriptome Sequencing
Shan Gao
19 Applications of Single-Cell Sequencing for Multiomics
Yungang Xu and Xiaobo Zhou
20 Progress on Diagnosis of Tuberculous Meningitis
Yi-yi Wang and Bing-di Xie
21 Insights of Acute Lymphoblastic Leukemia with Development
of Genomic Investigation
Heng Xu and Yang Shu

Index
Contributors

ADENIYI C. ADEOLA  State Key Laboratory of Genetic Resources and Evolution, Kunming,
Yunnan, China; China-Africa Centre for Research and Education & Yunnan Laboratory
of Molecular Biology of Domestic Animals, Kunming, Yunnan, China; Animal Branch
of the Germplasm Bank of Wild Species, Kunming Institute of Zoology, Chinese Academy
of Sciences, Kunming, Yunnan, China
LEI CAI  Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders
(Ministry of Education), Collaborative Innovation Center for Genetics and Development,
Bio-X Institutes, Shanghai Jiao Tong University, Shanghai, China
MICHELLE CHANG  Key Laboratory for the Genetics of Developmental and Neuropsychiatric
Disorders (Ministry of Education), Collaborative Innovation Center of Genetics and
Development, Bio-X Institutes, Shanghai Jiao Tong University, Shanghai, China
JIAN CHEN  State Key Laboratory of Transducer Technology, Institute of Electronics, Chinese
Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing,
China
QINGFENG CHEN  School of Computer, Electronics and Information, Guangxi University,
Nanning, China; State Key Laboratory for Conservation and Utilization of Subtropical
Agro-bioresources, Guangxi University, Nanning, China
SHIFU CHEN  HaploX Biotechnology, Shenzhen, Guangdong, China
XING CHEN  State Key Laboratory of Genetic Resources and Evolution, Kunming, Yunnan,
China
JING CHENG  Department of Medical Instrument, Shanghai University of Medicine
and Health Sciences, Shanghai, China
HAO DAI  School of Life Sciences and Biotechnology, Shanghai Jiao Tong University,
Shanghai, China
ZHAO-LI DING  Kunming Biological Diversity Regional Centre of Large Apparatus
and Equipments, Kunming, Yunnan, China; Public Technology Service Centre,
Kunming, Yunnan, China
BEIYUAN FAN  State Key Laboratory of Transducer Technology, Institute of Electronics,
Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences,
Beijing, China
CAOQI FAN  Peking-Tsinghua Center for Life Sciences, Academy for Advanced
Interdisciplinary Studies; Center for Bioinformatics, School of Life Sciences, Peking
University, Beijing, China
ZUNYUN FU  State Key Laboratory of Drug Research, Drug Discovery and Design Center,
Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai, China
JINGBO GAN  Peking-Tsinghua Center for Life Sciences, Academy for Advanced
Interdisciplinary Studies; Center for Bioinformatics, School of Life Sciences, Peking
University, Beijing, China
SHAN GAO  College of Life Sciences, Nankai University, Tianjin, People’s Republic of China;
Institute of Statistics, Nankai University, Tianjin, People’s Republic of China
XIANFU GAO  Key Laboratory of Systems Biology, Institute of Biochemistry and Cell Biology,
Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China


KAI HE  State Key Laboratory of Genetic Resources and Evolution, Kunming, Yunnan,
China
LIN HE  Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders
(Ministry of Education), Collaborative Innovation Center for Genetics and Development,
Bio-X Institutes, Shanghai Jiao Tong University, Shanghai, China
CUIHONG HUANG  HaploX Biotechnology, Shenzhen, Guangdong, China
LIYU HUANG  Information and Network Center, Guangxi University, Nanning, China
TAO HUANG  Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences,
Shanghai, China
CANGZHI JIA  Department of Mathematics, Dalian Maritime University, Dalian, China
LEI KONG  Peking-Tsinghua Center for Life Sciences, Academy for Advanced
Interdisciplinary Studies; Center for Bioinformatics, School of Life Sciences, Peking
University, Beijing, China
DEHUAN LAI  School of Computer, Electronics and Information, Guangxi University,
Nanning, China
WEI LAN  School of Computer, Electronics and Information, Guangxi University, Nanning,
China
WENXIAN LAN  State Key Laboratory of Bio-Organic and Natural Product Chemistry,
Shanghai Institute of Organic Chemistry, Chinese Academy of Sciences, Shanghai, China
CHENG LI  Peking-Tsinghua Center for Life Sciences, Academy for Advanced
Interdisciplinary Studies; Center for Bioinformatics, School of Life Sciences, Peking
University, Beijing, China; Center for Statistical Science, Peking University, Beijing,
China
GUI-MEI LI  Kunming Biological Diversity Regional Centre of Large Apparatus and
Equipments, Kunming, Yunnan, China; Public Technology Service Centre, Kunming,
Yunnan, China
JUN LI  HaploX Biotechnology, Shenzhen, Guangdong, China
BAO-HONG LIU  State Key Laboratory of Veterinary Etiological Biology; Key Laboratory of
Veterinary Parasitology of Gansu Province; Lanzhou Veterinary Research Institute, Chinese
Academy of Agricultural Sciences, Lanzhou, Gansu, People’s Republic of China; Jiangsu
Co-Innovation Center for Prevention and Control of Animal Infectious Diseases and
Zoonoses, Yangzhou, People’s Republic of China
MING LIU  HaploX Biotechnology, Nanshan District, Shenzhen, Guangdong, China
RENZHONG LIU  HaploX Biotechnology, Shenzhen, Guangdong, China
KEYI LONG  Key Laboratory for the Genetics of Developmental and Neuropsychiatric
Disorders (Ministry of Education), Collaborative Innovation Center for Genetics and
Development, Bio-X Institutes, Shanghai Jiao Tong University, Shanghai, China
DONG LU  State Key Laboratory of Drug Research, Drug Discovery and Design Center,
Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai, China;
University of Chinese Academy of Sciences, Beijing, China
JING LU  Key Laboratory of Molecular Pharmacology and Drug Evaluation (Yantai
University), Ministry of Education, Collaborative Innovation Center of Advanced Drug
Delivery System and Biotech Drugs in Universities of Shandong, School of Pharmacy, Yantai
University, Yantai, China
XIAOMIN LUO  Drug Discovery and Design Center, State Key Laboratory of Drug Research,
Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai, China
ROBERT W. MURPHY  State Key Laboratory of Genetic Resources and Evolution, Kunming,
Yunnan, China; Centre for Biodiversity and Conservation Biology, Royal Ontario
Museum, Toronto, ON, Canada
GANG NI  State Key Laboratory of Genetic Resources and Evolution, Kunming, Yunnan,
China; Yunnan Laboratory of Molecular Biology of Domestic Animals, Kunming,
Yunnan, China
YANG SHU  Precision Medicine Center, State Key Laboratory of Biotherapy, Precision
Medicine Key Laboratory of Sichuan Province, West China Hospital, Sichuan University,
Chengdu, Sichuan, China
JUNBO WANG  State Key Laboratory of Transducer Technology, Institute of Electronics,
Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences,
Beijing, China
LU WANG  Key Laboratory of Systems Biology, Institute of Biochemistry and Cell Biology,
Chinese Academy Science, Shanghai, China
WEN-ZHI WANG  State Key Laboratory of Genetic Resources and Evolution, Kunming
Institute of Zoology, Chinese Academy of Sciences, Kunming, Yunnan, China; Animal
Branch of the Germplasm Bank of Wild Species, Kunming Institute of Zoology, Chinese
Academy of Sciences, Kunming, Yunnan, China; Wildlife Forensics Science Services,
Kunming, Yunnan, China; Guizhou Academy of Testing and Analysis, Guiyang,
Guizhou, China
YI-YI WANG  Department of Neurology, Tianjin Haihe Hospital, Tianjin, P.R. China
DONG-QING WEI  School of Life Sciences and Biotechnology, Shanghai Jiao Tong University,
Shanghai, China
BING-DI XIE  Department of Neurology, Tianjin Medical University General Hospital,
Tianjin, P.R. China
YI XIONG  School of Life Sciences and Biotechnology, Shanghai Jiao Tong University,
Shanghai, China
HENG XU  State Key Laboratory of Biotherapy, Precision Medicine Key Laboratory of
Sichuan Province, Precision Medicine Center, West China Hospital, Sichuan University,
Chengdu, Sichuan, China
MINGYAN XU  HaploX Biotechnology, Shenzhen, Guangdong, China
YING XU  Key Laboratory of Cell Differentiation and Apoptosis of Ministry of Education,
Department of Pathophysiology, Shanghai Jiao-Tong University School of Medicine,
Shanghai, China
YUNGANG XU  Center for Systems Medicine, School of Biomedical Informatics, UTHealth at
Houston, Houston, TX, USA; Center for Bioinformatics and Systems Biology, Wake Forest
School of Medicine, Winston-Salem, NC, USA
XIANG-TIAN YU  Key Laboratory of Systems Biology, Institute of Biochemistry and Cell
Biology, Chinese Academy Science, Shanghai, China
TAO ZENG  Key Laboratory of Systems Biology, Institute of Biochemistry and Cell Biology,
Chinese Academy Science, Shanghai, China
CHAO ZHANG  PKU-Tsinghua-NIBS Graduate Program, School of Life Sciences, Peking
University, Beijing, China
HONG ZHANG  Key Laboratory for the Genetics of Developmental and Neuropsychiatric
Disorders (Ministry of Education), Collaborative Innovation Center for Genetics and
Development, Bio-X Institutes, Shanghai Jiaotong University, Shanghai, China
YA-PING ZHANG  State Key Laboratory of Genetic Resources and Evolution, Kunming,
Yunnan, China; Yunnan Laboratory of Molecular Biology of Domestic Animals,
Kunming, Yunnan, China; Animal Branch of the Germplasm Bank of Wild Species,
Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, Yunnan, China;
Laboratory for Conservation and Utilization of Bio-resource and Key Laboratory for
Microbial Resources of the Ministry of Education, Yunnan University, Kunming, Yunnan,
China
GUANGYONG ZHENG  Key Laboratory of Computational Biology, Bio-Med Big Data Center,
CAS-MPG Partner Institute for Computational Biology, Shanghai Institutes for Biological
Sciences, Chinese Academy of Sciences, Shanghai, China
MINGYUE ZHENG  State Key Laboratory of Drug Research, Drug Discovery and Design
Center, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai,
China
XIAOBO ZHOU  Center for Systems Medicine, School of Biomedical Informatics, UTHealth at
Houston, Houston, TX, USA; Center for Bioinformatics and Systems Biology, Wake Forest
School of Medicine, Winston-Salem, NC, USA
YANQING ZHOU  HaploX Biotechnology, Nanshan District, Shenzhen, Guangdong, China
PING ZHU  Peking-Tsinghua Center for Life Sciences, Academy for Advanced
Interdisciplinary Studies; Center for Bioinformatics, School of Life Sciences, Peking
University, Beijing, China
XIAOLEI ZHU  School of Life Sciences and Biotechnology, Shanghai Jiao Tong University,
Shanghai, China
YUN ZUO  Department of Mathematics, Dalian Maritime University, Dalian, China
Chapter 1

DNA Sequencing Data Analysis


Keyi Long, Lei Cai, and Lin He

Abstract
Among various biological data, the DNA sequence is doubtlessly fundamental. By obtaining and
analyzing particular DNA sequence data, biologists come to understand life science more precisely.
This chapter is an overview of DNA sequencing technology and its data analysis methods, providing
information about DNA sequencing, several different methods, and the tools applied in data analysis.
Both advantages and disadvantages are discussed.

Key words DNA sequence, DNA sequencing, Data analysis, Sequence comparison, Methods and
tools

1 DNA Sequencing

Three essential elements of life science are DNA, RNA, and protein;
they lay the foundation of all living creatures. Many scientists
have made joint efforts to understand the mystery of life, and a great deal of
work has been done to figure out the relations between structures and
their properties.
For molecular biologists, the information encoded in the
sequences of nucleic acid molecules is of vital importance, since it
not only passes genetic information from generation to generation
but also influences function through transcription and translation.
Research at the frontiers of life science cannot be done without
obtaining and analyzing certain DNA sequences, which means
determining the particular order and number of the four bases
(adenine, guanine, cytosine, and thymine) in a strand of DNA.
Advances in recombinant DNA technology have allowed the isolation
of large numbers of biologically interesting fragments of
DNA [1].

Tao Huang (ed.), Computational Systems Biology: Methods and Protocols, Methods in Molecular Biology, vol. 1754,
https://doi.org/10.1007/978-1-4939-7717-8_1, © Springer Science+Business Media, LLC, part of Springer Nature 2018

1.1 Methods of DNA Sequencing

With the help of restriction endonucleases, large DNA molecules
can be cut into small fragments in an orderly fashion. Also, recombinant
DNA techniques aid in purifying and characterizing
individual restriction fragments from mixtures. Most importantly,
DNA sequencing requires at least three steps:
cloning, sequencing, and analyzing.
In 1970, Ray Wu at Cornell University first applied a
location-specific primer extension strategy to determining
DNA sequences. Since 1977, when Sanger and his
colleagues established the chain termination method and accomplished
the first full genome sequencing of bacteriophage
ϕX174, DNA sequencing methods have been continually developed and
improved.

1.1.1 Traditional Methods

There are two basic methods in DNA sequencing: Maxam-Gilbert
sequencing (also known as chemical sequencing) and
the chain termination method (also known as Sanger sequencing).
The former attaches radioactive labels to the 5′ end of
DNA and uses chemical treatment to generate
breaks at particular bases. Autoradiography then yields a series of
dark bands, which represent the radiolabeled DNA fragments.
Sanger’s method, on the other hand, requires modified
di-deoxynucleoside triphosphates (ddNTPs). DNA polymerase I
cannot distinguish normal deoxynucleoside
triphosphates (dNTPs) from ddNTPs, and new strands that incorporate
a ddNTP lack the 3′-OH group required for the formation of a
phosphodiester bond between two nucleotides, so the
elongation of DNA stops. By labeling the ddNTPs, we get to know the DNA
sequence [2].
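The chain termination principle can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: termination is assumed to occur at every position of the newly synthesized strand, fragment length stands in for gel migration, and the function names and the example strand are hypothetical.

```python
from collections import defaultdict


def chain_termination_fragments(new_strand):
    """Group fragment lengths by the labeled ddNTP that ended each fragment.

    A fragment of length n ends with the base at position n of the new strand.
    """
    fragments = defaultdict(list)
    for length, base in enumerate(new_strand, start=1):
        fragments[base].append(length)
    return dict(fragments)


def read_from_fragments(fragments):
    """Recover the sequence by ordering all fragments from shortest to longest."""
    base_by_length = {
        length: base
        for base, lengths in fragments.items()
        for length in lengths
    }
    return "".join(base_by_length[length] for length in sorted(base_by_length))


strand = "GATTACA"  # hypothetical newly synthesized strand
fragments = chain_termination_fragments(strand)
print(fragments)
print(read_from_fragments(fragments))
```

Sorting the fragments by length and reading the terminal label of each one reproduces the strand, which is the idea behind reading a gel or a capillary trace.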
Although Sanger’s method is effective in many respects, it can only
read 450 bp in a single reaction, and the process is time-consuming,
limiting its use in large fragment sequencing. After it had prevailed for
decades, other methods were invented on the basis of this work and
became widely used, like the shotgun strategy and bridge PCR. More
importantly, with the rapid development of science and technology,
high-throughput sequencing methods were established; they now
play an essential role in modern DNA sequencing with the ability to
process massive data in a short time.

1.1.2 High-Throughput (HTP) Sequencing Methods

Since the 1990s, a handful of new DNA sequencing methods
have been invented; 454 pyrosequencing, Illumina (Solexa) sequencing,
and SOLiD sequencing are the three most used technologies.
Other methods include massively parallel signature sequencing
(MPSS), polony sequencing, DNA nanoball sequencing, etc.
These methods all share the common characteristics of high throughput
and low cost, and together they are known as “next-generation”
sequencing (NGS) methods. The core idea of
HTP methods is to do DNA sequencing while synthesizing the
new strand.
Nowadays, genomic questions are so complex that a depth of
information is needed. In ultra-high-throughput sequencing, as
many as 500,000 sequencing-by-synthesis operations may be run
in parallel [3]. With its unprecedented throughput, speed, and
scalability compared with traditional DNA sequencing, NGS
enables researchers to study biological problems at a new level
and has been widely implemented in commercial DNA sequencers.
Table 1 makes comparisons between several high-throughput
sequencing methods [4].
Among those NGS methods, 454 pyrosequencing is doubtlessly
the most classic one. It does not require ddNTPs for chain
termination. Instead, it mainly relies on emulsion PCR to amplify
the template DNA. By detecting the pyrophosphate released
during nucleotide incorporation, the sequencer can determine the
sequence. Data are stored in standard flowgram format (SFF)
files for downstream analysis.
The process can be divided into the following steps:
1. Library construction. The library DNAs with 454-specific
adaptors are denatured to be single strand.
2. Surface attachment and bridge amplification.
3. Denaturation and complete amplification, for example by
emulsion PCR.
4. Single base extension and sequencing.
The principle can be summarized as follows:
When a dNTP (dATP, dGTP, dCTP, or dTTP) complementary to
the base of the template strand is incorporated by DNA polymerase,
one pyrophosphate (PPi) is released. Catalyzed by ATP sulfurylase,
PPi reacts with adenosine-5′-phosphosulfate (APS) to generate
ATP. With luciferase, the ATP drives the conversion of luciferin into oxyluciferin
and generates visible light, which is then captured by a CCD system.
The signal is then analyzed by computer to finally give the
exact DNA sequence.
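The last step, turning light signals into bases, can be sketched as follows. The sketch assumes that nucleotides are flowed over the template in a fixed repeating order and that each light intensity has already been normalized so that a value near 1.0 means one incorporated base. The function name, the flow order "TACG", and the signal values are illustrative assumptions, not taken from the chapter or from any instrument software.

```python
def decode_flowgram(flow_order, intensities):
    """Convert normalized light intensities into a base sequence.

    Each intensity is rounded to the nearest whole number, which is taken as
    the number of identical bases incorporated during that nucleotide flow.
    """
    bases = []
    for index, signal in enumerate(intensities):
        nucleotide = flow_order[index % len(flow_order)]
        incorporated = int(signal + 0.5)  # round a non-negative signal to nearest integer
        bases.append(nucleotide * incorporated)
    return "".join(bases)


# Hypothetical normalized signals for two cycles of the flow order T, A, C, G.
signals = [1.02, 0.05, 0.0, 2.1, 0.03, 0.97, 1.1, 0.0]
print(decode_flowgram("TACG", signals))  # TGGAC
```

Because a run of identical bases is inferred from a single intensity value, small signal errors on long runs change the called length, which is consistent with the homopolymer errors listed for pyrosequencing in Table 1.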
Although the next-generation sequencing methods are still the
most prevailing technologies, the third-generation sequencing
(TGS), also known as the single molecule sequencing (SMS), is
developing rapidly. This kind of technology depends on detecting
single molecule signal and no longer needs PCR, aiming to increase
throughput and decrease the time to result and cost by eliminating
the need for excessive reagents and harnessing the processivity of
DNA polymerase [5].

2 Methods for DNA Sequencing Data Analysis

After obtaining the exact sequences of the nucleic acid, it is usually
necessary to identify the quality of the outcome, to extract target
fragments, and to compare the sequence with a reference genome.
Also, biologists pay attention to other characteristics of the
sequence that might determine its biological features. That is why
the work of data analysis should be done for further study.

Table 1
Comparison of several high-throughput sequencing methods

Pyrosequencing (454)
  Read length: 700 bp
  Accuracy (single read, not consensus): 99.9%
  Reads per run: 1 million
  Time per run: 24 h
  Cost per 1 million bases (in US$): $10
  Advantages: Long read size. Fast
  Disadvantages: Runs are expensive. Homopolymer errors

Sequencing by synthesis (Illumina)
  Read length: MiniSeq, NextSeq, 75–300 bp; MiSeq, 50–600 bp; HiSeq 2500, 50–500 bp;
    HiSeq 3/4000, 50–300 bp; HiSeq X, 300 bp
  Accuracy (single read, not consensus): 99.9% (Phred30)
  Reads per run: MiniSeq/MiSeq, 1–25 million; NextSeq, 130–400 million;
    HiSeq 2500, 300 million–2 billion; HiSeq 3/4000, 2.5 billion; HiSeq X, 3 billion
  Time per run: 1–11 days, depending upon sequencer and specified read length
  Cost per 1 million bases (in US$): $0.05–0.15
  Advantages: Potential for high sequence yield, depending upon sequencer model and
    desired application
  Disadvantages: Equipment can be very expensive. Requires high concentrations of DNA

Sequencing by ligation (SOLiD sequencing)
  Read length: 50 + 35 or 50 + 50 bp
  Accuracy (single read, not consensus): 99.9%
  Reads per run: 1.2–1.4 billion
  Time per run: 1–2 weeks
  Cost per 1 million bases (in US$): $0.13
  Advantages: Low cost per base
  Disadvantages: Slower than other methods. Has issues sequencing palindromic sequences

Nanopore sequencing
  Read length: Dependent on library prep, not the device, so user chooses read length
    (up to 500 kb reported)
  Accuracy (single read, not consensus): ~92–97% single read (up to 99.96% consensus)
  Reads per run: Dependent on read length selected by user
  Time per run: Data streamed in real time. Choose 1 min to 48 h
  Cost per 1 million bases (in US$): $500–999 per flow cell, base cost dependent on expt
  Advantages: Very long reads, portable (palm sized)
  Disadvantages: Lower throughput than other machines, single read accuracy in 90s

Table source: https://en.wikipedia.org/wiki/DNA_sequencing

2.1 General Steps of DNA Sequencing Data Analysis

Generally, DNA sequencing data analysis includes these four steps:
- Trimming of overlapping sequences.
- Multiple alignments of template sequences.
- Consistency check between reading text and chromatogram
  peak data.
- Review and correction of software misreads.
To be more precise, by using DNA sequencing technology,
especially Sanger sequencing, we obtain data in the form of a
chromatogram: a series of four differently colored peaks. Usually,
after opening the result file in software such as Chromas Lite,
red, black, green, and blue peaks are shown, each color
corresponding to a different DNA base. At both ends of the
chromatogram, there are about 50 bases that are difficult to
recognize. This is caused by impurities and is a normal
phenomenon.
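Unreliable bases at the two ends of a read are commonly removed before further analysis. The sketch below assumes that a per-base quality score is available for each called base (for example, a Phred score from the base caller). The cutoff of 20 and the function name are assumptions chosen for illustration only.

```python
def trim_ends(sequence, qualities, min_quality=20):
    """Trim low-quality bases from both ends of a read.

    Returns the trimmed sequence and the (start, end) slice that was kept.
    Low-quality bases inside the read are left untouched.
    """
    if len(sequence) != len(qualities):
        raise ValueError("sequence and qualities must have the same length")
    start, end = 0, len(sequence)
    while start < end and qualities[start] < min_quality:
        start += 1
    while end > start and qualities[end - 1] < min_quality:
        end -= 1
    return sequence[start:end], start, end


# Hypothetical read with poorly resolved bases at both ends.
read = "NNACGTACGTNN"
scores = [5, 8, 30, 35, 40, 40, 38, 36, 33, 30, 9, 4]
print(trim_ends(read, scores))  # ('ACGTACGT', 2, 10)
```

A fixed-length trim (for instance, dropping a set number of bases from each end) would be a simpler alternative; the quality-based version keeps more data when the ends happen to be clean.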
When screening the chromatogram, we are likely to find two
overlapping peaks. It may seem that such a spot represents a heterozygous
locus. However, things get more complicated when the two
overlapping peaks have different axes or when the two peaks share
one axis but are of the same height. Such a spot is not a heterozygous
locus, since one peak is an interference peak. Mostly, an
interference peak whose height is approximately half that of a big
base peak exists one or two spots before it. The closer they are,
the more they interfere. Under these circumstances,
the computer often makes mistakes; that is where humans step in
and correct those misreads.
After checking the software output on a large number of samples, we can summarize some rules that help determine whether the results are accurate:
1. The main peak mostly sits on the right side of the interference peak.
2. The interference peak can be higher than, lower than, or of the same height as the main peak.
To reduce misreads, we therefore often follow several procedures:
1. Check the consistency among the read text, the results in the gene pool, and the chromatogram peak data.
6 Keyi Long et al.

2. When finding a possible spot, compare it with multiple samples.
3. Calculate the mutation rate of your finding, and compare it with data in authoritative publications or databases.

2.2 Procedure for NGS Data Analysis

2.2.1 Quality Control

Analyzing next-generation sequencing (NGS) data is more complicated, because the results depend on the particular DNA library construction and adaptor-adding processes. Since modern high-throughput sequencers can generate hundreds of millions of sequences in a single run, it is advisable to perform some simple quality control checks before analyzing the sequences to draw biological conclusions, to ensure that the raw data look good and that there are no problems or biases in the data.
Although many sequencers generate a QC report, this is usually not enough, since it only focuses on identifying problems generated by the sequencer itself. FastQC is a widely used program that aims to provide a more detailed QC report, which can spot problems originating either in the sequencer or in the starting library material. Using FastQC involves the following steps:
1. On a Linux system, install FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/).
2. Type in the command “fastqc [-o output dir] [--(no)extract] [-f fastq|bam|sam] [-c contaminant file] seqfile1 .. seqfileN”. Here “output dir” is the output path, the parameter “extract” determines whether the zipped output is unpacked, the parameter “-f” specifies the format of the input, and the sequence files to check are listed at the end.
3. Run FastQC and read the result files:
• The HTML report shows a summary of the modules that were run and a quick evaluation of whether the results of each module seem entirely normal (green tick), slightly abnormal (orange triangle), or very unusual (red cross).
• View the per base sequence quality. Quality is expressed as a Phred score, Q = −10 × log10(p), where “p” stands for the probability that the base call is wrong. The lower quartile and the median should be considered. If the lower quartile exceeds 30, the quality can be regarded as very good.
• View the per sequence quality scores. Normally, if 90% of the reads have a quality score above 35, the quality can be regarded as very good.
• View the distribution of A, T, G, and C. In most cases, the proportion of A and T (about 28% each) outweighs that of G and C (about 22% each).
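The quality and base-composition checks above can be illustrated with a minimal Python sketch. It is not part of FastQC; the function names are hypothetical, and the quality string is assumed to use the Phred+33 encoding (common for recent Illumina FASTQ files, but worth confirming for your own data).

```python
import math
from collections import Counter


def phred_to_error_prob(q):
    """Error probability p for a Phred score, from Q = -10 * log10(p)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score Q for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


def decode_qualities(quality_line, offset=33):
    """Decode one FASTQ quality line into integer Phred scores.

    Assumption: Phred+33 encoding (each character's ASCII code minus 33).
    """
    return [ord(ch) - offset for ch in quality_line]


def base_composition(seq):
    """Fraction of A, C, G, and T among the unambiguous bases of a read."""
    counts = Counter(seq.upper())
    total = sum(counts[b] for b in "ACGT")
    if total == 0:
        raise ValueError("sequence contains no A/C/G/T bases")
    return {b: counts[b] / total for b in "ACGT"}


if __name__ == "__main__":
    scores = decode_qualities("IIII5555")
    print(scores)                        # per-base Phred scores
    print(phred_to_error_prob(30))       # Q30 corresponds to 1 error in 1000
    print(base_composition("AATTGCAT"))  # per-base fractions
```

A score of 30 therefore means one expected error per 1000 base calls, which is why a lower quartile above 30 is read as very good quality.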

2.2.2 Data Analysis

For data analysis, we take the Illumina system as an example. Illumina offers a variety of next-generation sequencing (NGS) data analysis software tools, including push-button tools for DNA sequence alignment, variant calling, and data visualization. Data generated on Illumina sequencing instruments are automatically transferred to and stored securely in BaseSpace Sequence Hub. The analysis should proceed as follows:

Primary Analysis

1. Judge the quality of the results. If the output is not of good quality, further analysis will be meaningless.
2. Search for your target fragments.
3. Real-time analysis and base calling are performed by the Illumina system.

Secondary Analysis

1. After real-time analysis (RTA) in the primary analysis, use MiSeq Reporter, an online software, to analyze the data.
2. After opening MiSeq Reporter, click “analysis” to see the different modules, including A (assembly), E (enrichment), G (generate FASTQ), M (metagenomics), R (resequencing), etc.
3. Choose the analysis module you need and run the procedure.
4. Read the MiSeq Reporter report. For example, if you choose module R, after the resequencing procedure has run, the detailed report will show a list of samples, a table of targets, a list of SNPs and their corresponding scores, the Q score, and the depth of sequencing.
5. The output is in demultiplex (*.demux) and FASTQ (*.fastq) formats. You can use third-party software programs to analyze the data further.
6. Compare the results with the reference genome.

2.3 Several Tools to Facilitate Data Analysis

2.3.1 Artemis R5

Artemis is a DNA sequence viewer and annotation tool written in Java. Users can download it for free and run it on systems including UNIX, GNU/Linux, Macintosh, and Windows.
First, import information from EMBL and GenBank, as well as files in FASTA format. Artemis then visualizes sequence features, next-generation data, and the results of analyses within the context of the sequence and its six-frame translation.

2.3.2 Arlequin

Arlequin is an integrated software package for population genetics data analysis. It provides methods to analyze patterns of genetic diversity within and between population samples [6].
The software is freely available at http://cmpg.unibe.ch/software/arlequin3. It accepts data including DNA sequences, standard multilocus genotypes, RFLP data, microsatellite data, etc. It is a powerful package with many functions, including analysis of molecular diversity, mismatch distribution, computation of standard genetic diversity indices, and estimation of allele and haplotype frequencies. It can also test for departure from linkage equilibrium and perform thorough analyses of population subdivision under the AMOVA framework.
When the imported data are RFLP data:
“1” means a restriction site is present, “0” means none, and “-” means a lack of restriction sites.
When the imported data are DNA sequences:
“-” stands for a lack of nucleotide, while “?” stands for an unknown nucleotide.
“R” means A/G (purine), while “Y” means C/T (pyrimidine).
“M” means A/C; “W” means A/T; “S” means C/G; “K” means G/T; “B” means C/G/T; “D” means A/G/T; “H” means A/C/T; “V” means A/C/G; “N” means A/C/G/T.
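As a small illustration of how such ambiguity codes can be handled in a script, the sketch below encodes the standard IUPAC nucleotide codes as a lookup table. The helper names `expand` and `compatible` are hypothetical and are not part of Arlequin; the gap (“-”) and unknown (“?”) symbols are deliberately left out.

```python
# Standard IUPAC nucleotide ambiguity codes mapped to the bases they allow.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "W": "AT", "S": "CG", "K": "GT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand(code):
    """Return the set of bases that a single IUPAC code can stand for."""
    return set(IUPAC[code.upper()])


def compatible(code_a, code_b):
    """True if the two codes share at least one possible base."""
    return bool(expand(code_a) & expand(code_b))


if __name__ == "__main__":
    print(sorted(expand("R")))   # purines
    print(compatible("R", "Y"))  # purine vs pyrimidine: no shared base
    print(compatible("N", "T"))  # N is compatible with every base
```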

2.3.3 DnaSP

DnaSP is software for comprehensive analysis of DNA polymorphism data. As a powerful tool, it helps us understand the evolutionary process and establish the functional significance of particular genomic regions [7].
Notably, DnaSP v5 can handle and analyze multiple data files in batch. It can identify conserved DNA regions, which is useful for phylogenetic footprinting-based analysis. It also allows exhaustive DNA polymorphism analysis, and the results can be presented graphically and in text format.

2.3.4 SSAHA2 (Sequence Search and Alignment by Hashing Algorithm)

SSAHA2 is a pairwise sequence alignment program designed for the efficient mapping of sequencing reads onto genomic reference sequences.
It supports a range of output formats, including SAM, CIGAR, PSL, etc., and it reads data from most sequencing platforms, such as ABI-Sanger, Roche 454, and Illumina-Solexa.
Many other tools are available to help researchers analyze the data they generate. Table 2 lists tools of different kinds.

3 Extension: Methods and Tools for DNA Sequence Analysis

3.1 Background

In the past decades, many manual methods have been applied to analyzing DNA sequence data. However, the drawback of these methods is apparent: when the amount of data is very large, they take a great deal of time and energy. Fortunately, computers are well suited to solving this problem. By establishing DNA sequence databases that store massive amounts of data, researchers are able to adopt statistical approaches for analysis.

Table 2
Several tools for data analysis

Function Name Site


Plot ggplot2 http://docs.ggplot2.org/current/
circos http://circos.ca/
Mapping BWA http://bio-bwa.sourceforge.net/
Bowtie2 http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
SNP/indel calling samtools http://samtools.sourceforge.net/samtools.shtml
gatk http://www.broadinstitute.org/gatk/
pindel http://gmt.genome.wustl.edu/pindel/0.2.4/index.html
Analysis tools plink http://pngu.mgh.harvard.edu/~purcell/plink/index.shtml
ngsTools https://github.com/mfumagalli/ngsTools
Structure analysis frappe http://med.stanford.edu/tanglab/software/frappe.html
structure http://pritchardlab.stanford.edu/structure.html
ngsAdmix http://www.popgen.dk/software/index.php/NgsAdmix
Databases DDBJ http://www.ddbj.nig.ac.jp/index-e.html
ENA http://www.ebi.ac.uk/ena/home
KEGG http://www.genome.jp/kegg/
ensembl http://asia.ensembl.org/index.html

The key to data analysis is data mining, which is based on sequence similarity. The most common approach to studying similarity is DNA sequence alignment, which finds the optimal match between sequences according to a given similarity matrix, as well as probable insertions, deletions, and mutations.

3.1.1 Two Stages of DNA Sequence Analysis

Analyzing nucleic acid sequences with computer programs can be divided into two stages:
1. The first stage is the straightforward search for sequences with known properties, which involves position determination.
2. The second stage aims to detect subtle, less straightforward sequence patterns, including controlling elements such as promoters. The results can be presented as catalogs of sequence patterns.

3.1.2 Two Categories of Computational Approaches

Computational approaches to sequence alignment generally fall into two categories: global alignments and local alignments.
1. Calculating a global alignment is a form of global optimization that “forces” the alignment to span the entire length of all query sequences.
2. Local alignments identify regions of similarity within long sequences that are often widely divergent overall. Local alignments are often preferable but can be more difficult to calculate because of the additional challenge of identifying the regions of similarity.
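The difference between the two categories can be made concrete with a short dynamic-programming sketch. It uses the classic Needleman-Wunsch (global) and Smith-Waterman (local) recurrences, which the text does not name, and an illustrative scoring scheme (match +1, mismatch −1, linear gap −2) chosen here as an assumption. Only the optimal score is computed, not the alignment itself.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch score: the alignment must span both sequences."""
    prev = [j * gap for j in range(len(b) + 1)]  # row for the empty prefix of a
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against nothing costs i gaps
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Smith-Waterman score: best-scoring pair of substrings."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The extra 0 lets an alignment restart anywhere, which is what
            # makes the result local rather than global.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


if __name__ == "__main__":
    x = "GGGGACGTCCCC"
    y = "TTTTACGTAAAA"
    print(global_score(x, y))  # penalized by the divergent flanks
    print(local_score(x, y))   # rewards only the shared ACGT core
```

For two sequences that share only a short core, the global score is dragged down by the divergent flanks, while the local score reflects the conserved region alone.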

3.2 Methods and Tools

There are various widely used DNA sequencing data analysis tools; some are more familiar to us than others.

3.2.1 Two Types of DNA Sequence Alignment

DNA sequence alignment can be divided into two types:
1. Pairwise alignment: it compares only two sequences.
2. Multiple sequence alignment: it is an extension of pairwise alignment that incorporates more than two sequences at a time.
Several software tools are discussed below.

3.2.2 BLAST

BLAST, the Basic Local Alignment Search Tool (site: blast.ncbi.nlm.nih.gov/Blast.cgi), is an algorithm for comparing primary biological sequence information. Usually, you do not have to download and install it; visiting the website stated above is enough.
BLAST is actually a family of programs widely used in bioinformatics; it enables us to compare a query sequence with a database of sequences. Those sequences can be DNA, RNA, or protein. By selecting a particular BLAST tool and setting a threshold, we can identify sequences that resemble the input sequence. For nucleic acids, there is nucleotide-nucleotide BLAST (blastn). After entering a DNA query and setting certain parameters, we get results showing the most similar DNA sequences.
Blastn does its job by locating short matches. Usually, there is a threshold score T. If the score of an alignment is higher than the predetermined T, the alignment is included in the results given by BLAST; otherwise it is not. Therefore, choosing a proper value of T means getting a proper number of results.
This tool is highly sensitive and can be used for several purposes: species identification, domain location, phylogeny establishment, etc.
1. Visit the site blast.ncbi.nlm.nih.gov/Blast.cgi and choose blastn.
2. Upload your DNA sequence in a proper format such as FASTA.
3. Set proper parameters, including T.
4. Click BLAST.
5. Review your alignment results; mismatches may indicate a frameshift in the query sequence.
6. If any error exists, go back, check the sequence file, change the parameter values, and run BLAST again.
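The idea of “locating short matches” can be sketched as follows. This is a deliberately simplified illustration, not the BLAST implementation: it indexes exact words of length k in the subject, looks up each query word, and extends a hit only while bases match exactly (real BLAST tolerates mismatches during extension and scores the result). The function names and the default k = 4 are assumptions made for the example.

```python
from collections import defaultdict


def index_words(subject, k):
    """Map every length-k word of the subject to its start positions."""
    index = defaultdict(list)
    for i in range(len(subject) - k + 1):
        index[subject[i:i + k]].append(i)
    return index


def find_seeds(query, subject, k=4):
    """Return (query_pos, subject_pos) pairs where a length-k word matches exactly."""
    index = index_words(subject, k)
    seeds = []
    for q in range(len(query) - k + 1):
        for s in index.get(query[q:q + k], []):
            seeds.append((q, s))
    return seeds


def extend_seed(query, subject, q, s, k):
    """Extend an exact seed in both directions without gaps while bases match.

    Returns (query_start, subject_start, length) of the extended match.
    """
    left = 0
    while (q - left - 1 >= 0 and s - left - 1 >= 0
           and query[q - left - 1] == subject[s - left - 1]):
        left += 1
    right = 0
    while (q + k + right < len(query) and s + k + right < len(subject)
           and query[q + k + right] == subject[s + k + right]):
        right += 1
    return q - left, s - left, k + left + right


if __name__ == "__main__":
    query = "ACGTAC"
    subject = "TTACGTACGG"
    seeds = find_seeds(query, subject, k=4)
    print(seeds)
    print(extend_seed(query, subject, *seeds[0], 4))
```

Indexing words first is the design choice that makes database searches fast: only positions sharing a word with the query are examined further.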

3.2.3 CLUSTAL

Clustal is an effective tool for multiple alignment of nucleic acid and protein sequences. After downloading the software, we input data containing DNA sequences, set certain parameters, and wait for the results. When multiple sequence alignment is needed, we use Clustal X.
Accepted input formats include NBRF/PIR, FASTA, EMBL/Swiss-Prot, Clustal, GCG/MSF, GCG9 RSF, and GDE, while the output format can be Clustal, NBRF/PIR, GCG/MSF, PHYLIP, GDE, or NEXUS.
When using Clustal for data analysis, the bigger the input file is, the longer the alignment takes. The results obtained from Clustal can be used further by loading the output file into other software such as MEGA, which is discussed below.
1. Download the desktop application and open it.
2. Load a file containing DNA sequences in a proper format; at this stage, you can have a look at the colored bases.
3. Select different tools for different purposes:
Select “do complete alignment” to run the complete alignment (pairwise alignments, guide tree, and multiple alignment) in one pass.
Select “produce guide tree only” to create a guide tree.
Select “do alignment from guide tree and phylogeny” to use an existing guide tree (or a user-defined tree) to carry out a multiple alignment.
4. Review the results and save them in a suitable format.
5. The results can be used for further studies.

3.2.4 MEGA

MEGA is short for Molecular Evolutionary Genetics Analysis. As a desktop application released in 1993, it has continuously helped users conduct statistical analysis of biological macromolecules to study molecular evolution and construct phylogenetic trees [8].
MEGA is multifunctional. In addition to constructing sequence alignments, it performs outstandingly in distance estimation and tree-making. More precisely, MEGA has historically included likelihood methods for estimating evolutionary distances between sequence pairs, as well as distance-based and maximum parsimony methods for inferring phylogenetic trees.
1. Download the desktop application MEGA and open it.
2. Load a file containing DNA sequences in a proper format, or open a file produced by Clustal or any other tool whose results MEGA can recognize and analyze further.
3. Click “do complete alignment” if you loaded DNA sequences; otherwise skip this step.
4. Select tools such as “compare pairwise distances” or “construct neighbor-joining tree” for particular purposes.
5. Save the results in the required format.
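To show what a pairwise distance between aligned sequences looks like numerically, here is a minimal sketch of two common measures: the p-distance (proportion of differing sites) and the Jukes-Cantor correction, d = −(3/4) ln(1 − 4p/3). The choice of these two measures and the function names are assumptions made for illustration; this is not MEGA's code, and MEGA offers many other substitution models.

```python
import math


def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences.

    Sites where either sequence has a gap or an ambiguous base are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper())
             if x in "ACGT" and y in "ACGT"]
    if not pairs:
        raise ValueError("no comparable sites")
    diffs = sum(1 for x, y in pairs if x != y)
    return diffs / len(pairs)


def jukes_cantor(p):
    """Jukes-Cantor distance for an observed p-distance (requires p < 0.75)."""
    if p >= 0.75:
        raise ValueError("Jukes-Cantor distance is undefined for p >= 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


def distance_matrix(seqs):
    """Symmetric matrix of Jukes-Cantor distances for a list of aligned sequences."""
    n = len(seqs)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = jukes_cantor(p_distance(seqs[i], seqs[j]))
            matrix[i][j] = matrix[j][i] = d
    return matrix


if __name__ == "__main__":
    aligned = ["ACGTACGT", "ACGTACGA", "AC-TTCGA"]
    for row in distance_matrix(aligned):
        print(["%.4f" % d for d in row])
```

A matrix of this kind is the usual input to distance-based tree methods such as neighbor joining.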

4 Applications of DNA Sequencing Data Analysis

DNA sequencing data analysis is of vital importance for multiple reasons [9–12]; one strong point is that it can be applied in many settings:
1. Obtaining information encoded in genes.
We can compare sequences, predict the sequence of promoters
and enhancers, and identify the order of amino acids in certain
proteins.
2. Discovering new genes.
We can discover new genes by analyzing EST (expressed
sequence tag) sequences and using DNA chip technology.
3. Analyzing gene polymorphism.
We can analyze gene polymorphism, especially SNP (single-
nucleotide polymorphism) to identify and locate functional
genes, which can be targets of human evolution or diseases.
4. Predicting advanced structures.
We can use the information of the primary structure to predict
advanced structures of nucleic acids and proteins, thus predict-
ing their functions.
5. Achieving personalized medicine.
With the soaring need for personalized medicine, health-care
providers are capable of using DNA sequencing data to give
medical suggestions to patients.
Besides the applications above, next-generation sequencing data analysis distinguishes itself in the following aspects:
1. It sequences whole genomes rapidly and can zoom in to sequence target regions deeply.
2. It analyzes genome-wide methylation or DNA-protein interactions.
3. It helps researchers dig into microbial diversity in humans or in the environment.
Although using computers for data analysis has obvious advantages, weaknesses remain:
1. Processing DNA sequencing data requires time and experience.
2. Although the tools present the results, researchers still have to analyze the DNA sequencing results themselves, judge whether the outcome is reasonable, and plan future experiments.
3. We cannot depend entirely on computer analysis; software also makes mistakes.

References
1. Gingeras TR, Roberts RJ (1980) Steps toward computer analysis of nucleotide sequences. Science 209(4463):1322–1328
2. Sanger F, Air GM, Barrell BG, Brown NL, Coulson AR, Fiddes CA, Hutchison CA, Slocombe PM, Smith M (1977) Nucleotide sequence of bacteriophage phi X174 DNA. Nature 265(5596):687–695. https://doi.org/10.1016/0022-2836(78)90346-7
3. ten Bosch JR, Grody WW (2008) Keeping up with the next generation: massively parallel sequencing in clinical diagnostics. J Mol Diagn 10(6):484–492. https://doi.org/10.2353/jmoldx.2008.080027
4. Liu L, Li Y, Li S, Hu N, He Y, Pong R, Lin D, Lu L, Law M (2012) Comparison of next-generation sequencing systems. J Biomed Biotechnol 2012:251364. https://doi.org/10.1155/2012/251364
5. Schadt EE, Turner S, Kasarskis A (2010) A window into third-generation sequencing. Hum Mol Genet 19(R2):R227–R240. https://doi.org/10.1093/hmg/ddq416
6. Excoffier L, Laval G, Schneider S (2005) Arlequin (version 3.0): an integrated software package for population genetics data analysis. Evol Bioinformatics Online 1:47–50
7. Librado P, Rozas J (2009) DnaSP v5: a software for comprehensive analysis of DNA polymorphism data. Bioinformatics 25(11):1451–1452
8. Kumar S, Nei M, Dudley J, Tamura K (2008) MEGA: a biologist-centric software for evolutionary analysis of DNA and protein sequences. Brief Bioinform 9(4):299–306. https://doi.org/10.1093/bib/bbn017
9. Cai L, Yuan W, Zhang Z, He L, Chou GC (2016) In-depth comparison of somatic point mutation callers based on different tumor next-generation sequencing depth data. Sci Rep 6:36540. https://doi.org/10.1038/srep36540
10. Huang T, Liu CL, Li LL, Cai MH, Chen WZ, Xu YF, O’Reilly PF, Cai L, He L (2016) A new method for identifying causal genes of schizophrenia and anti-tuberculosis drug-induced hepatotoxicity. Sci Rep 6:32571. https://doi.org/10.1038/srep32571
11. Fang S, Zhang Y, Xu M, Xue C, He L, Cai L, Xing X (2016) Identification of damaging nsSNVs in human ERCC2 gene. Chem Biol Drug Des 88(3):441–450. https://doi.org/10.1111/cbdd.12772
12. Cai L, Deng SL, Liang L, Pan H, Zhou J, Wang MY, Yue J, Wan CL, He G, He L (2013) Identification of genetic associations of SP110/MYBBP1A/RELA with pulmonary tuberculosis in the Chinese Han population. Hum Genet 132:265–273. https://doi.org/10.1007/s00439-012-1244-5
Chapter 2

Transcriptome Sequencing: RNA-Seq


Hong Zhang, Lin He, and Lei Cai

Abstract
RNA sequencing (RNA-seq) can not only be used to identify the expression of common or rare transcripts
but also in the identification of other abnormal events, such as alternative splicing, novel transcripts, and
fusion genes. In principle, RNA-seq can be carried out by almost all of the next-generation sequencing
(NGS) platforms, but the libraries of different platforms are not exactly the same; each platform has its own
kit to meet the special requirements of the instrument design.

Key words Next-generation sequencing, RNA sequencing, Messenger RNA, Library construction,
Data analysis

1 Introduction

In a broad sense, the transcriptome refers to the collection of all transcripts under a certain physiological condition, including messenger RNA (mRNA), ribosomal RNA (rRNA), transfer RNA (tRNA), and noncoding RNA (ncRNA), while in a narrow sense, it refers to the collection of all mRNA transcripts [1]. Transcriptome sequencing, also called RNA-seq or whole-transcriptome shotgun sequencing (WTSS), uses high-throughput sequencing technology to rapidly and comprehensively obtain the transcriptional status of biological samples at a specific time [2]. At present, RNA-seq is mainly used in the study of mRNA, small RNA, noncoding RNA, or microRNAs. Different types of RNA can be obtained by adding separation and enrichment steps before cDNA synthesis. Illumina TruSeq is a method that uses oligo-dT conjugated magnetic beads to capture poly A+ RNA from total RNA and then construct an mRNA library. During the poly A+ enrichment process, non-poly A+ RNA, including miRNA, rRNA, and other noncoding RNA, is removed [3, 4]. mRNA library preparation comprises five steps: (1) RNA fragmentation, (2) reverse transcription, (3) adapter ligation, (4) library cleanup and amplification, and (5) library quantification and quality control [5] (Fig. 1). Here, we show the method of

Tao Huang (ed.), Computational Systems Biology: Methods and Protocols, Methods in Molecular Biology, vol. 1754,
https://doi.org/10.1007/978-1-4939-7717-8_2, © Springer Science+Business Media, LLC, part of Springer Nature 2018


Fig. 1 mRNA library construction workflow for Illumina (from David Corney 2013)

the RNA-seq from total RNA extraction, library construction, and data analysis.

2 Materials

Prepare all solutions using ultrapure water and analytical grade reagents. Prepare and store all reagents at room temperature (unless indicated otherwise).

2.1 Total RNA Extraction

1. Liquid nitrogen.
2. 70% ethanol.
3. Tissue: keep the tissue in the liquid nitrogen until the procedure is completed.
4. TRIzol Reagent (Invitrogen).
5. DEPC-treated water (Ambion).
6. Chloroform: trichloromethane.
7. Isopropanol.
8. Thermo Scientific NanoDrop 2000 spectrophotometer: RNA quantification.
9. Agilent 2100 Bioanalyzer system: RNA quality control.

2.2 mRNA Library Construction

1. RNA Purification Beads: purify the poly-A containing mRNA molecules using oligo-dT attached magnetic beads, stored at 4 °C (Illumina, San Diego, CA).
2. Bead Washing Buffer (BWB), Elution Buffer (ELB), Bead-Binding Buffer (BBB): 1 tube per 48 reactions, stored at −20 °C (Illumina, San Diego, CA).
3. Elute, Prime, Fragment Mix (EPF): 1 tube per 48 reactions, stored at −20 °C (Illumina, San Diego, CA).
4. First-Strand Master Mix (FSM): 1 tube, stored at −20 °C (Illumina, San Diego, CA).
5. SuperScript II Reverse Transcriptase: 1 tube, stored at −20 °C.
6. Second-Strand Master Mix (SSM): 1 tube per 48 reactions, stored at −25 °C to −15 °C (Illumina, San Diego, CA).
7. AMPure XP beads: stored at 4 °C.
8. 80% ethanol.
9. Resuspension Buffer (RSB): 1 tube, stored at −20 °C.
10. End-Repair Mix: adds the 5′-phosphate groups needed for downstream ligation, 1 tube per 48 reactions, stored at −20 °C (Illumina, San Diego, CA).
11. A-Tailing Mix: makes fragments compatible with adapters and prevents self-ligation by adding a 3′-A overhang, 1 tube per 48 reactions, stored at −20 °C (Illumina, San Diego, CA).
12. Ligation Mix: joins 3′-T overhang adapters to 3′-A overhang inserts, 1 tube per 48 reactions, stored at −20 °C (Illumina, San Diego, CA).
13. Stop Ligation Buffer: inactivates the ligation, 1 tube per 48 reactions, stored at −20 °C (Illumina, San Diego, CA).
14. Resuspension Buffer (RSB): 1 tube, stored at −20 °C (Illumina, San Diego, CA).
15. PCR Master Mix (PMM): 1 tube per 48 reactions, stored at −20 °C (Illumina, San Diego, CA).
16. PCR Primer Cocktail (PPC): 1 tube per 48 reactions, stored at −20 °C (Illumina, San Diego, CA).
17. Sequencing chip: flow cell.
18. Illumina HiSeq system.

2.3 Data Analysis

1. Raw data processing: Trimmomatic.
2. Mapping: TopHat (Bowtie).
3. Quality control: RSeQC.
4. Differentially expressed gene analysis: htseq-count, DESeq, DAVID, KEGG.
5. Differential alternative splicing analysis: MISO (a mixture of isoforms).
6. Fusion gene analysis: TopHat-Fusion.
3 Methods

3.1 Total RNA Extraction

1. Remove the tissue sample from the −80 °C refrigerator, and immediately put it in a thermos cup with liquid nitrogen (see Note 1).
2. Remove the sample from the liquid nitrogen and put it into a 1.5 mL EP tube; add 300 μL TRIzol reagent and grind thoroughly with an electric tissue grinder; then add 700 μL TRIzol; and place the tube on ice for 30 min to ensure sufficient disruption of the cells.
3. Add 200 μL chloroform, vortex, and then centrifuge at 13,000 × g for 10 min.
4. Transfer the supernatant to a new EP tube (see Note 2).
5. Add 500 μL isopropanol, vortex, place at −20 °C for 20 min, and then centrifuge at 13,000 × g for 10 min.
6. Discard the supernatant; add 1 mL 70% ethanol solution and shake gently for 10 s; and then centrifuge at 8000 × g for 2 min.
7. Discard the supernatant, and repeat step 6 one time.
8. Discard the supernatant, centrifuge at 8000 × g for 15 s, remove excess liquid, and place the EP tube on ice for 2 min to let the ethanol evaporate fully.
9. According to the size of the precipitate, add 30–200 μL ultrapure water.
10. Determine the concentration of the RNA solution using a NanoDrop 2000 spectrophotometer.
11. Use the Agilent 2100 Bioanalyzer system to assess RNA integrity (see Note 3).
12. Store the RNA solution in the −80 °C refrigerator.
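Library construction starts from 2 μg of total RNA in a final volume of 50 μL, so the measured concentration has to be converted into pipetting volumes. The helper below is a hypothetical convenience sketch, not part of the protocol; it assumes the concentration is reported in ng/μL, and the function name and error message are illustrative.

```python
def input_volumes(concentration_ng_per_ul, target_mass_ug=2.0, final_volume_ul=50.0):
    """Return (RNA volume, water volume) in μL for the library input.

    Assumption: concentration is given in ng/μL.
    """
    if concentration_ng_per_ul <= 0:
        raise ValueError("concentration must be positive")
    rna_ul = target_mass_ug * 1000 / concentration_ng_per_ul  # μg -> ng
    if rna_ul > final_volume_ul:
        # The sample is too dilute to deliver the target mass in the final volume.
        raise ValueError("RNA too dilute: %.1f uL needed, only %.1f uL allowed"
                         % (rna_ul, final_volume_ul))
    return rna_ul, final_volume_ul - rna_ul


if __name__ == "__main__":
    rna, water = input_volumes(400.0)
    print("RNA: %.1f uL, water: %.1f uL" % (rna, water))
```

A sample at 400 ng/μL, for instance, needs 5 μL of RNA topped up with 45 μL of water; anything below 40 ng/μL cannot deliver 2 μg within 50 μL and would need to be concentrated first.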

3.2 Library Construction

1. Add 2 μg total RNA sample (less than 50 μL) to a 200 μL EP tube, dilute to 50 μL, then add 50 μL RNA Purification Beads (see Note 4), and gently pipette the entire volume up and down eight times to mix thoroughly.
2. Place the EP tube on a PCR thermal cycler (65 °C for 5 min, 4 °C hold) to denature the RNA.
3. Place the EP tube at room temperature for 5 min to facilitate binding of the polyA RNA to the beads.

4. Place the EP tube on the magnetic stand for 5 min to separate the polyA RNA beads from the solution.
5. Discard the liquid, wash the beads by adding 200 μL Bead Washing Buffer, gently pipette the entire volume up and down eight times to mix thoroughly, and place the EP tube on the magnetic stand for 5 min.
6. Discard the liquid, add 50 μL of Elution Buffer, gently pipette, and place the EP tube on a PCR thermal cycler (80 °C for 2 min, 25 °C hold).
7. Add 50 μL Bead-Binding Buffer, gently pipette, place the EP tube at room temperature for 5 min, then place the EP tube on the magnetic stand for 5 min, and discard the liquid (see Note 5).
8. Add 200 μL Bead Washing Buffer, gently pipette eight times, and place the tube on the magnetic stand for 5 min.
9. Discard the liquid; add 19.5 μL Elute, Prime, Fragment Mix; gently pipette eight times; and place the EP tube on a PCR thermal cycler (94 °C for 8 min, 4 °C hold) (see Note 6).
10. Place the tube on the magnetic stand for 5 min, and transfer 17 μL of the solution into a new EP tube.
11. Add 1 μL SuperScript II to 79.6 μL First-Strand Master Mix, and mix thoroughly (see Note 7).
12. Add 8 μL of the solution prepared in step 11 to the EP tube from step 10, and mix thoroughly.
13. Place the EP tube on a PCR thermal cycler (25 °C for 10 min, 42 °C for 50 min, 70 °C for 15 min, 4 °C hold).
14. Add 25 μL Second-Strand Master Mix to the EP tube from step 13, mix thoroughly, and place the EP tube on a PCR thermal cycler (16 °C for 1 h, 4 °C hold).
15. Add 90 μL AMPure XP purification beads, gently pipette eight times, place the EP tube at room temperature for 15 min, and place the tube on the magnetic stand for 5 min.
16. Discard the liquid, add 200 μL 80% ethanol solution with the EP tube on the magnetic stand, and incubate the EP tube at room temperature for 30 s.
17. Repeat step 16 one time.
18. Discard the liquid, leave the EP tube at room temperature for about 15 min until the ethanol has fully evaporated, and then remove the EP tube from the magnetic stand.
19. Add 62.5 μL Resuspension Buffer, place the EP tube at room temperature for 2 min, and then place it on the magnetic stand.
20. Transfer 60 μL supernatant to a new EP tube.
21. Add 40 μL End-Repair Mix, mix thoroughly, and incubate the EP tube at 30 °C for 30 min.
The Project Gutenberg eBook of The art of music,
Vol. 07 (of 14), Pianoforte and chamber music
This ebook is for the use of anyone anywhere in the United
States and most other parts of the world at no cost and with
almost no restrictions whatsoever. You may copy it, give it away
or re-use it under the terms of the Project Gutenberg License
included with this ebook or online at www.gutenberg.org. If you
are not located in the United States, you will have to check the
laws of the country where you are located before using this
eBook.

Title: The art of music, Vol. 07 (of 14), Pianoforte and chamber
music

Editor: Leland Hall


Edward Burlingame Hill
Daniel Gregory Mason
César Saerchinger

Release date: December 9, 2023 [eBook #72303]

Language: English

Original publication: New York: National Society of Music, 1915

Credits: Andrés V. Galia, Jude Eylander and the Online


Distributed Proofreading Team at https://www.pgdp.net
(This file was produced from images generously made
available by The Internet Archive)

*** START OF THE PROJECT GUTENBERG EBOOK THE ART OF


MUSIC, VOL. 07 (OF 14), PIANOFORTE AND CHAMBER MUSIC
***
TRANSCRIBER'S NOTES
In the plain text version Italic text is denoted by
_underscores_. Small caps are represented in
UPPER CASE. The sign ^ represents a
superscript; thus e^ represents the lower case
letter “e” written immediately above the level of
the previous character.

The musical files for the musical examples
discussed in the book have been provided by
Jude Eylander. Those examples can be heard by
clicking on the [Listen] tab. This is only possible in
the HTML version of the book. The scores that
appear in the original book have been included as
“jpg” images.

In some cases the scores that were used to
generate the music files differ slightly from the
original scores. Those differences are due to
modifications that were made by the Music
Transcriber during the process of creating the
musical archives in order to make the music play
accurately on modern musical transcribing
programs. These scores are included as PNG
images, and can be seen by clicking on the [PNG]
tag in the HTML version of the book.

There is a list of the illustrations presented in the
book. Some of the captions included in the
transcription were taken from this list as there
were images without any caption. However, the
image facing page 471 is not listed and hence it
was left without caption.

Obvious punctuation and other printing errors
have been corrected.

The book cover has been modified by the
Transcriber and is included in the public domain.
THE ART OF MUSIC
The Art of Music
A Comprehensive Library of Information
for Music Lovers and Musicians

Editor-in-Chief

DANIEL GREGORY MASON


Columbia University

Associate Editors

EDWARD B. HILL LELAND HALL


Harvard University Past Professor, Univ. of
Wisconsin

Managing Editor

CÉSAR SAERCHINGER
Modern Music Society of New York

In Fourteen Volumes
Profusely Illustrated
NEW YORK
THE NATIONAL SOCIETY OF MUSIC
Home Concert

Painting by Fritz von Uhde


THE ART OF MUSIC: VOLUME SEVEN

Pianoforte and Chamber Music

Department Editor:

LELAND HALL, M.A.


Past Professor of Musical History, University of
Wisconsin

Introduction by
HAROLD BAUER
NEW YORK
THE NATIONAL SOCIETY OF MUSIC
Copyright, 1915, by
THE NATIONAL SOCIETY OF MUSIC, Inc.
[All Rights Reserved]
PREFATORY NOTE
The editor has not attempted to give within the limits of this single
volume a detailed history of the development of both pianoforte and
chamber music. He has emphasized but very little the historical
development of either branch of music, and he has not pretended to
discuss exhaustively all the music which might be comprehended
under the two broad titles.

The chapters on pianoforte music are intended to show how the
great masters adapted themselves to the exigencies of the
instrument, and in what manner they furthered the development of
the difficult technique of writing for it. Also, because the piano may
be successfully treated in various ways, and because it lends itself to
the expression of widely diverse moods, there is in these chapters
some discussion of the great masterpieces of pianoforte literature in
detail.

The arrangement of material is perhaps not usual. What little has
been said about the development of the piano, for example, has
been said in connection with Beethoven, who was the first to avail
himself fully of the advantages the piano offered over the
harpsichord. A discussion, or rather an analysis, of the pianoforte
style has been put in the chapter on Chopin, who is even today the
one outstanding master of it.

In the part of the book dealing with chamber music the material has
been somewhat arbitrarily arranged according to combinations of
instruments. The string quartets, the pianoforte trios, quartets, and
quintets, the sonatas for violin and piano, and other combinations
have been treated separately. The selection of some works for a
more or less detailed discussion, and the omission of even the
mention of others, will undoubtedly seem unjustifiable to some; but
the editor trusts at least that those he has chosen for discussion may
illumine somewhat the general progress of chamber music from the
time of Haydn to the present day.

For the chapters on violin music before Corelli and the beginnings of
chamber music we are indebted to Mr. Edward Kilenyi, whose initials
appear at the end of these chapters.

Leland Hall
INTRODUCTION
The term Chamber Music, in its modern sense, cannot perhaps be
strictly defined. In general it is music which is fine rather than broad,
or in which, at any rate, there is a wealth of detail which can be
followed and appreciated only in a relatively small room. It is not, on
the whole, brilliantly colored like orchestral music. The string quartet,
for example, is conspicuously monochrome. Nor is chamber music
associated with the drama, with ritual, pageantry, or display, as are
the opera and the mass. It is—to use a well worn term—very nearly
always absolute music, and, as such, must be not only perfect in
detail, but beautiful in proportion and line, if it is to be effective.

As far as externals are concerned, chamber music is made up of
music for a solo instrument, with or without accompaniment
(excluding, of course, concertos and other like forms, which require
the orchestra, and music for the organ, which can hardly be
dissociated from cathedrals and other large places), and music for
small groups of instruments, such as the string trio and the string
quartet, and combinations of diverse instruments with the piano.
Many songs, too, sound best in intimate surroundings; but one thinks
of them as in a class by themselves, not as a part of the literature of
chamber music.

With very few exceptions, all the great composers have sought
expression in chamber music at one time or another; and their
compositions in this branch seem often to be the finest and the most
intimate presentation of their genius. Haydn is commonly supposed
to have found himself first in his string quartets. Mozart’s great
quartets are almost unique among his compositions as an
expression of his genius absolutely uninfluenced by external
circumstances and occasion. None of Beethoven’s music is more
profound nor more personal than his last quartets. Even among the
works of the later composers, who might well have been seduced
altogether away from these fine and exacting forms by the
intoxicating glory of the orchestra, one finds chamber music of a rich
and special value.

This special value consists in part in the refined and unfailing
musical skill with which the composers have handled their slender
material; but more in the quality of the music itself. The great works
of chamber music, no matter how profound, speak in the language of
intimacy. They show no signs of the need to impress or overwhelm
an audience. Perhaps no truly great music does. But operas and
even symphonies must be written with more or less consideration for
external circumstances, whereas in the smaller forms, composers
seem to be concerned only with the musical inspiration which they
feel the desire to express. They speak to an audience of
understanding friends, as it were, before whom they may reveal
themselves without thought of the effectiveness of their speech.
They seem in them to have consulted only their ideals. They have
taken for granted the sympathetic attention of their audience.

The piano has always played a commanding rôle in the history of
chamber music. From the early days when the harpsichord with its
figured bass was the foundation for almost all music, both vocal and
instrumental, few forms in chamber music have developed
independently of it, or of the piano, its successor. The string quartet
and a few combinations of wind instruments offer the only
conspicuous exceptions. The mass of chamber music is made up of
pianoforte trios, quartets, and quintets, of sonatas for pianoforte and
various other instruments; and, indeed, the great part of pianoforte
music is essentially chamber music.

It may perhaps seem strange to characterize as remarkably fine and
intimate the music which has been written for an instrument often
stigmatized as essentially unmusical. But the piano has attracted
nearly all the great composers, many of whom were excellent
pianists; and the music which they have written for it is indisputably
of the highest and most lasting worth. There are many pianoforte
sonatas which are all but symphonies, not only in breadth of form,
but in depth of meaning. Some composers, notably Beethoven and
Liszt, demanded of the piano the power of the orchestra. Yet on the
whole the mass of pianoforte music remains chamber music.

The pianoforte style is an intricate style, and to be effective must be
perfectly finished. The instrument sounds at its best in a small hall. In
a large one its worst characteristics are likely to come all too clearly
to the surface. And though it is in many ways the most powerful of all
the instruments, truly beautiful playing does not call upon its limits of
sound, but makes it a medium of fine and delicately shaded musical
thought. To regard it as an instrument suited primarily to big and
grandiose effects is grievously to misunderstand it, and is likely,
furthermore, to make one overlook the possibilities of tone color
which, though often denied it, it none the less possesses.

In order to study intelligently the mechanics, or, if you will, the art of
touch upon the piano, and in order to comprehend the variety of
tone-color which can be produced from it, one must recognize at the
outset the fact that the piano is an instrument of percussion. Its
sounds result from the blows of hammers upon taut metal strings.
With the musical sound given out by these vibrating strings must
inevitably be mixed the dull and unmusical sound of the blow that set
them vibrating. The trained ear will detect not only the thud of the
hammer against the string, but that of the finger against the key, and
that of the key itself upon its base. The study of touch and tone upon
the piano is the study of the combination and the control of these two
elements of sound, the one musical, the other unmusical.

The pianist can acquire but relatively little control over the musical
sounds of his instrument. He can make them soft and loud, but he
cannot, as the violinist can, make a single tone grow from soft to
loud and die away to soft again. The violinist or the singer both
makes and controls tone, the one by his bow, the other by his breath;
the pianist, in comparison with them, but makes tone. Having caused
a string to vibrate by striking it through a key, he cannot even sustain
these vibrations. They begin at once to weaken; the sound at once
grows fainter. Therefore he has to make his effects with a volume of
sounds which has been aptly said to be ever vanishing.

On the other hand, these sounds have more endurance than those
of the xylophone, for example; and in their brief span of failing life the
skillful pianist may work somewhat upon them according to his will.
He may cut them exceedingly short by allowing the dampers to fall
instantaneously upon the strings, thus stopping all vibrations. He
may even prolong a few sounds, a chord let us say, by using the
sustaining pedal. This lifts the dampers from all the strings, so that
all vibrate in sympathy with the tones of the chord and reënforce
them, so to speak. This may be done either at the moment the notes
of the chord are struck, or considerably later, after they have begun
appreciably to weaken. In the latter case the ear can detect the
actual reënforcement of the failing sounds.

Moreover, the use of the pedal serves to affect somewhat the color
of the sounds of the instrument. All differences in timbre depend on
overtones; and if the pianist lifts all dampers from the strings by the
pedals, he will hear the natural overtones of his chord brought into
prominence by means of the sympathetic vibrations of other strings
he has not struck. He can easily produce a mass of sound which
strongly suggests the organ, in the tone color of which the shades of
overtones are markedly evident.

The study of such effects will lead him beyond the use of the pedal
into some of the niceties of pianoforte touch. He will find himself able
to suppress some overtones and bring out others by emphasizing a
note here and there in a chord of many notes, especially in an
arpeggio, and by slighting others. Such an emphasis, it is true, may
give to a series of chords an internal polyphonic significance; but if
not made too prominent, will tend rather to color the general sound
than to make an effect of distinct drawing.

It will be observed that in the matter of so handling the volume of
musical sound, prolonging it and slightly coloring it by the use of the
pedal or by skillful emphasis of touch, the pianist’s attention is
directed ever to the after-sounds, so to speak, of his instrument. He
is interested, not in the sharp, clear beginning of the sound, but in
what follows it. He finds in the very deficiencies of the instrument
possibilities of great musical beauty. It is hardly too much to say,
then, that the secret of a beautiful or sympathetic touch, which has
long been considered to be hidden in the method of striking the keys,
may be found quite as much in the treatment of sounds after the
keys have been struck. It is a mystery which can by no means be
wholly solved by a muscular training of the hands; for a great part of
such training is concerned only with the actual striking of the keys.

We have already said that striking the keys must produce more or
less unmusical sounds. These sounds are not without great value.
They emphasize rhythm, for example, and by virtue of them the
piano is second to no instrument in effects of pronounced,
stimulating rhythm. The pianist wields in this regard almost the
power of the drummer to stir men to frenzy, a power which is by no
means to be despised. In martial music and in other kinds of
vigorous music the piano is almost without shortcomings. But
inasmuch as a great part of pianoforte music is not in this vigorous
vein, but rather in a vein of softer, more imaginative beauty, the
pianist must constantly study how to subject these unmusical sounds
to the after-sounds which follow them. In this study he will come
upon the secret of the legato style of playing.

If the violinist wishes to play a phrase in a smooth legato style, he
does not use a new stroke of his bow for each note. If he did so, he
would virtually be attacking the separate notes, consequently
emphasizing them, and punctuating each from the other. Fortunately
for him, he need not do so; but the pianist cannot do otherwise. Each
note he plays must be struck from the strings of his instrument by a
hammer. He can only approximate a legato style—by concealing, in
one way or another, the sounds which accompany this blow.
