Dissertation Martin Hölzer, 2017 PDF
DISSERTATION
FRIEDRICH-SCHILLER-UNIVERSITÄT JENA
Fakultät für Mathematik und Informatik
“Yes, [Genome Biology and Evolution] wants the fonts of the company Y&Y from the last millennium. If I interpret the homepage given in the old files (http://www.yandy.com/) correctly, this company has either given up or at least changed its line of business.” – Erik, 28.02.2014
Abstract
In recent decades, our knowledge of the molecular basis of life, the building blocks known as DNA and RNA, has increased tremendously. Technical breakthroughs in physics, chemistry, biology, and computer science have facilitated the development of new technologies. As a result, the systematic analysis of massive amounts of data has become increasingly important and challenges computer scientists every day. In particular, Next-Generation Sequencing (NGS) has dramatically increased the accessibility of genetic information, generating massive amounts of genomic and transcriptomic data that are rapidly changing the landscape of many life science disciplines.
Nowadays, large amounts of NGS data can be produced rapidly and cost-effectively. However, the novel data need to be processed in a comprehensive, well-documented and transparent way. Unfortunately, the computational analysis often remains an opaque process, comparable to a Black Box. Therefore, the automated analysis of NGS data and the incorporation of different methods and workflows, combined with a clear presentation of the obtained results, is becoming more and more important.
The applications of NGS are manifold and range from the analysis of genomes themselves to how proteins interact with nucleic acids. Various techniques and protocols exist to generate sequencing data from DNA and RNA to answer a wide variety of biological questions. As each bioinformatic analysis can only be as good as the underlying data, the experimental design of an NGS project is of utmost importance. Current NGS techniques such as Illumina produce enormous numbers of sequencing reads, short snippets of nucleotides derived from fragmented DNA or RNA molecules. Therefore, the bioinformatic challenge in many applications is to solve the NGS puzzle and to find the correct connections between the short reads in order to reconstruct a full genomic or transcriptomic representation.
Complete and well-annotated reference sequences form an important basis for successfully tackling various biological problems. The quality of the reference genome of a given species is of great importance for the success of the computational analysis. With the help of NGS data, existing reference sequences can be improved or new ones constructed from scratch. This process is called assembly and involves connecting short sequencing reads appropriately to build up the full genomic sequence. However, the assembly process is not straightforward and can be hampered by DNA contamination in the data or repetitive regions in the sequence. Often, different assembly tools and parameter settings need to be tested and evaluated to construct an appropriate genome assembly. While the assembly process can already be challenging for small bacterial genomes, it is exceedingly difficult for larger eukaryotic genomes. Furthermore, a novel genome assembly needs to be well annotated to be useful in many applications.
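To make the idea of connecting short reads more concrete, the following is a minimal sketch of a greedy overlap merge. It assumes error-free reads from a single strand and is a toy illustration only, not one of the assembly tools used in this thesis; the names `overlap`, `greedy_assemble` and `min_len` are hypothetical.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Return the length of the longest suffix of `a` that is a prefix of `b`.

    Overlaps shorter than `min_len` are ignored and reported as 0.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the longest overlap.

    Assumption (toy model): reads are error-free and from one strand.
    Returns the remaining contigs once no overlap >= min_len is left.
    """
    contigs = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i == j:
                    continue
                k = overlap(a, b, min_len)
                if k > best_len:
                    best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]


reads = ["ACGGACT", "GACTAGA", "TAGATTC"]
print(greedy_assemble(reads))  # ['ACGGACTAGATTC']
```

In this toy model, a read that shares no sufficiently long overlap with the others (for instance a contaminating sequence) simply remains a separate contig, whereas a repeat longer than `min_len` can cause a false join. This hints at why contamination and repetitive regions complicate real assemblies.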
However, NGS does not only allow for the sequencing of DNA; it can also be modified to sequence RNA transcripts (e.g. mRNAs, ncRNAs) that are present in a biological sample at a given moment in time. RNA sequencing (RNA-Seq) has emerged as a powerful method for the discovery, profiling, and quantification of RNA transcripts. Nevertheless, with currently available short-read NGS techniques such as Illumina it is not possible to sequence RNA molecules directly. The RNA has to be reverse transcribed to complementary DNA (cDNA) for sequencing.
The RNA-Seq reads can be used to reconstruct transcripts. If a reference genome is available, the reads can be aligned to this sequence to derive the transcript sequences. This process is also known as mapping. However, in many applications no reference genome is available and its construction is complicated, time-consuming and costly. Instead, RNA-Seq data can be used directly to assemble the transcripts de novo. In the last decade, various tools have emerged to solve the de novo transcriptome assembly problem; however, it is still difficult to say which tool and parameter settings perform best for a given data set. A sensible selection of good assemblies produced by different tools, followed by an appropriate combination of the assembled sequences, is one way to overcome the limitations of current assemblers.
By mapping RNA-Seq reads to an annotated genome or transcriptome, the abundances of transcripts that were present in a biological sample at a given point in time can be measured. If samples from different conditions were sequenced, the measurements can be compared to identify differentially expressed genes (DEGs). With the help of NGS, significant changes in gene expression can be identified quickly and reliably, providing researchers with a comprehensive overview of the genes and pathways that are regulated, for example during a viral infection. Furthermore, RNA-Seq allows for the genome-wide analysis of transcripts at single-nucleotide resolution and thus enables the identification of single nucleotide variants, gene fusions, allele-specific expression and alternative splicing events. However, in many NGS studies there is a gap between how the data are computationally processed and a meaningful interpretation of the final results.
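As a rough illustration of how abundances from two conditions can be compared, the sketch below normalizes raw read counts per gene to counts per million (CPM) and computes a log2 fold change. This is a simplified assumption for illustration: it uses a single sample per condition and performs no statistical test, so it cannot by itself call a gene significantly differentially expressed. The gene names, the counts and the function names are hypothetical.

```python
import math


def counts_per_million(counts: dict[str, int]) -> dict[str, float]:
    """Scale raw read counts so that each sample sums to one million."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample contains no reads")
    return {gene: 1e6 * c / total for gene, c in counts.items()}


def log2_fold_changes(control: dict[str, int],
                      treated: dict[str, int],
                      pseudocount: float = 1.0) -> dict[str, float]:
    """log2(treated / control) per gene on CPM-normalized values.

    The pseudocount avoids division by zero for unexpressed genes.
    Genes missing from `treated` are treated as having zero reads.
    """
    cpm_c = counts_per_million(control)
    cpm_t = counts_per_million(treated)
    return {
        gene: math.log2((cpm_t.get(gene, 0.0) + pseudocount)
                        / (cpm_c[gene] + pseudocount))
        for gene in cpm_c
    }


# Hypothetical read counts for three genes in two conditions.
control = {"gene_a": 100, "gene_b": 100, "gene_c": 800}
treated = {"gene_a": 400, "gene_b": 100, "gene_c": 1500}

for gene, lfc in log2_fold_changes(control, treated).items():
    print(f"{gene}\t{lfc:+.2f}")
```

Note that gene_b has the same raw count in both samples but a negative fold change after normalization, because the treated sample was sequenced more deeply. Such effects are one reason why the processing steps need to be transparent before results are interpreted.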
While broad NGS studies can provide a comprehensive overview of significantly regulated genes and pathways, it is of utmost importance to take a closer look at single genes and single nucleotide positions that were previously identified as significant with NGS. One scenario involves the detection of recombination events and positively selected sites in an alignment of homologous protein-coding genes. Such a gene might have been previously identified in a differential expression study as a key player during a viral infection. If positive selection can be detected, the gene might be involved in an evolutionary ‘arms race’ between host and virus, and the determination of amino acid sites that are under positive selection can help researchers to develop countermeasures against the pathogen.
Combining different approaches from complementary fields, such as genomics, transcriptomics and single nucleotide investigations, has the greatest potential to produce comprehensive and helpful results. Furthermore, the visualization of information is a crucial part that provides researchers with a way to quickly examine large amounts of data, to expose trends and to find patterns and correlations. With this work, we aim to shed some light on the darkness of Next-Generation Sequencing by combining, presenting and discussing fundamental approaches for genomics, transcriptomics, differential gene expression, and beyond.
Summary (Zusammenfassung)
In recent decades, our knowledge of the molecular basis of life, the building blocks known as DNA and RNA, has grown rapidly. Groundbreaking achievements in fields such as physics, chemistry, biology, and computer science have motivated and driven the development of new technologies. In this context, the systematic analysis of huge amounts of data has become increasingly important and confronts bioinformaticians with new challenges every day. With the development of new sequencing technologies, also known as “Next-Generation Sequencing” (NGS), enormous amounts of genomic and transcriptomic data can be generated in a very short time and at ever lower cost.
However, the enormous throughput and comparatively low cost of NGS also have a downside: the steadily growing flood of data must be processed and evaluated so that new findings can be made available in a comprehensive and transparent manner. Far too often, the analysis of NGS data appears to outsiders as an opaque process, comparable to a Black Box. The automated analysis of NGS data, incorporating different methods and combined with a clear presentation of the results, therefore plays an increasingly important role.
The applications of NGS data are manifold. Depending on the biological question, different sequencing techniques and protocols can be applied to sequence both DNA and RNA molecules. However, since any bioinformatic analysis can only be as good as the underlying data allow, the experimental design of an NGS project plays a decisive role. The Illumina sequencing technology, developed by Solexa, is frequently used. With Illumina, sequencing data in the gigabase range can be generated in a short time. Due to technical limitations, however, only short DNA segments (reads) are read in each individual sequencing reaction. One task of bioinformatics is to reassemble these short segments into a complete genome sequence.
For many bioinformatic applications, a good reference sequence together with a comprehensive annotation of the genes it contains is indispensable. The quality of the assembled sequence has a great influence on the success of a computational analysis. The construction of the genomic sequence from NGS reads is called assembly. The short read fragments have to be put back together correctly to produce the complete genome sequence. Problems such as DNA contamination and repetitive sequence regions can cause difficulties during assembly. Often, different assembly programs and parameters have to be tested and evaluated to achieve the best possible assembly result. Even the assembly of a small bacterial genome can be a challenge. The assembly and annotation of larger eukaryotic genomes, however, is many times more difficult.
However, NGS technologies can be used not only for sequencing DNA but also, with slight modifications, for deciphering the nucleotide sequence of RNA molecules (e.g. mRNAs, ncRNAs). The sequencing of RNA (RNA-Seq) has become established as a powerful tool for examining and quantifying RNA transcripts.
RNA-Seq reads can be used to reconstruct the underlying transcripts. If a reference genome is available, the reads can be aligned to the genome to derive the nucleotide sequence of the transcripts. This process is also called mapping. In many applications, however, no reference sequence is available because the organism in question has simply not yet been sequenced and assembled. In such a case, the RNA-Seq reads can be used directly to assemble the transcriptome de novo. In recent years, various programs have been developed to address this assembly problem. However, it is still unclear which program achieves the best results and is best suited for assembling particular RNA-Seq data. The idea of a combined approach involves using several assembly programs and parameters. By selecting and combining the best results, the drawbacks of the individual programs can be compensated for, yielding more complete transcriptome assemblies.
Furthermore, RNA-Seq data can be used to determine the relative abundances of transcripts that were expressed in a biological sample at a given point in time. The reads can be aligned to a reference genome or transcriptome and quantified with the help of an annotation. By comparing the transcript abundances determined in this way between samples from different conditions (e.g. a healthy cell and a virus-infected cell), differentially expressed genes (DEGs) can be identified. NGS methods allow a fast and efficient determination of significant changes in the transcriptome and thus provide comprehensive insight into the regulation of genes. In addition, RNA-Seq enables a genome-wide analysis of individual nucleotides. RNA-Seq data can therefore be used to detect single nucleotide variants, gene fusions, allele-specific expression patterns, and alternative splicing events.
In summary, NGS studies provide comprehensive insight into significantly regulated genes and metabolic pathways. Nevertheless, such analyses should be understood as a useful tool for identifying interesting candidates that must then be investigated further. For example, NGS can help to identify genes in which even a change at a single nucleotide position can have strong biological effects. Accordingly, homologous protein-coding genes can be examined for recombination events and positive selection. A gene expression study might, for instance, already have identified a gene that plays an important role during a viral infection. The additional detection of positively selected regions in this gene could indicate an evolutionary arms race between the virus and its host. Based on the nucleotide sequence of this gene, positively selected amino acids can be identified, which in turn can serve as a starting point for developing antiviral countermeasures.
Combining different approaches from complementary fields, such as genomics, transcriptomics, and the analysis of single nucleotides, has great potential to produce comprehensive and helpful results. Furthermore, the visualization of the data always plays a decisive role. A good and transparent presentation of the results also enables other scientists to draw their own conclusions and gain insights in order to address further questions. In this work, fundamental applications relating to genomics, transcriptomics, differential gene expression, and beyond are presented, combined, and discussed comprehensively in order to shed a little light on the darkness of Next-Generation Sequencing analyses.
Preface
This thesis covers large parts of my research in the field of Next-Generation Sequencing (NGS) over the last four years. Besides my main projects, which dealt mostly with the analysis and interpretation of various NGS data, I was involved in several side projects, some of which are also included in this thesis. During this time, I worked in the RNA Bioinformatics and High Throughput Analysis group of Professor Manja Marz at the Friedrich Schiller University Jena.
During my PhD I came into contact with various kinds of NGS data, which confronted me with different biological questions and computational problems. At the time of writing this thesis, I had already been involved in the experimental design of roughly 50 NGS projects, all aiming to answer a wide range of biological questions and involving different species such as human, mouse, bats, fungi, algae, bacteria and also viruses. For most of the projects presented here, I was more or less involved from the very start (experimental setup) through the sequencing design (technology, parameters, costs) to the bioinformatic analyses of the obtained data and the final interpretation of the results. With this thesis, I want to encourage other researchers involved in similar NGS projects to get in touch with cooperation partners as early as possible to discuss the experimental design and to get the most out of their NGS run.
Most of the results presented in this work have been published and were achieved in cooperation with my supervisor Manja Marz, my great colleagues and many amazing collaborators (details are given at the beginning of each chapter). During my PhD I was primarily responsible for, or at least involved as a co-author in, a total of 19 publications (see the two publication lists below), of which eight have already been published [1–8], two are currently submitted [9, 10], and nine are in preparation [11–19] at the time of submitting this thesis. The already published work comprises one first authorship, three joint first authorships and two second authorships.
Since it is not possible to present all topics I was involved in over the last four years in this thesis, I will focus on [1, 2, 4–7, 9, 16, 17] (see the first publication list below), complemented with data and results from [3, 8, 10–15, 18, 19] (second list).
It was a deliberate decision to also include unpublished work in this thesis where it added benefit to the presented topics. The side projects I was involved in deal with different biological and computational problems; some of them are therefore touched on briefly in this thesis, while others are only mentioned here.
In one of those projects, I had the great opportunity to support my former colleague Abdullah in his research on tRNA remolding events in metazoan mitochondrial genomes. Here, my main contribution was the calculation of tRNA alignments and the implementation of a novel maximum-likelihood-based algorithm called MLRD (Maximum Likelihood Remolding Detection), which identifies the position of a remolding event by utilizing a previously calculated phylogenetic tree. Furthermore, I was mainly responsible for the visualization of the alignments, trees and detected remolding events. I really appreciate this joint work, which in the end found its way into a great publication [3] and Abdullah's thesis [20].
I was further involved in two broad annotation studies, dealing with the detection of non-coding RNAs (ncRNAs) in bats [12] and the lift-over annotation of ncRNAs of various nematode species [13]. In particular, the extended annotation of bat ncRNAs made it possible to improve my further research involving different bat species [4, 7, 15].
In September 2015 I attended a workshop in Copenhagen that aimed to collect and describe bioinformatic tools related to the RNA world. The main goal was to generate a community-driven catalog of RNA bioinformatics resources and their relationships [11].
Together with Dr. Daniel Steinbach from the Universitätsklinikum Jena I am working on whole-exome sequencing data (a special kind of NGS data) to identify somatic single nucleotide variants between the exomes of bladder cancer patients at different tumor states [14].
Another RNA-Seq project in which I have been involved, from the very beginning of the sequencing design to the currently ongoing bioinformatic analysis of the obtained data, concerns the transcriptional response of Myotis daubentonii cells to interferon treatment and a Rift Valley fever virus infection [15]. The project is conducted together with Prof. Friedemann Weber from the Justus-Liebig University Gießen and is only included in this thesis as a short detour to compare the different NGS setups between Sec. 5.2 and 5.3.
Another ongoing project, which I am currently supervising, deals with the implementation of a web server to perform interactive principal component analyses (PCA) on RNA-Seq quantification data [19]. The project is implemented by one of our master's students, Ruman Gerst. The idea of such an interactive tool, allowing for the visualization of 2- and 3-dimensional PCAs with flexible variance cutoffs, was born during my work on the transcriptional response of human monocytes to various infections under vitamin treatment [5, 6].
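For readers unfamiliar with PCA on expression data, the following is a minimal sketch of the general technique and not the implementation of the web server mentioned above. It assumes that a "variance cutoff" means keeping only the most variable genes before the analysis, and it relies on the third-party library NumPy; the function name `pca_top_variable` and the example matrix are hypothetical.

```python
import numpy as np


def pca_top_variable(expression: np.ndarray,
                     top_variable_genes: int,
                     n_components: int = 2):
    """PCA on a samples x genes matrix of (e.g. log-transformed) values.

    Assumption: the variance cutoff keeps the `top_variable_genes` genes
    with the highest variance across samples. Returns the sample
    coordinates on the first components and the fraction of variance
    explained by each of them.
    """
    variances = expression.var(axis=0)
    keep = np.argsort(variances)[::-1][:top_variable_genes]
    selected = expression[:, keep]
    centered = selected - selected.mean(axis=0)
    # SVD of the centered matrix yields the principal components.
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    explained = (s ** 2) / np.sum(s ** 2)
    return scores, explained[:n_components]


# Hypothetical data: 4 samples (2 per condition) x 3 genes.
expression = np.array([
    [1.0, 5.0, 2.0],
    [1.1, 5.0, 2.1],
    [9.0, 5.0, 2.0],
    [9.2, 5.0, 2.1],
])
scores, explained = pca_top_variable(expression, top_variable_genes=2)
print(scores.shape, explained.round(3))
```

With this hypothetical matrix, the first component separates the two conditions because it is dominated by the single strongly varying gene; changing `top_variable_genes` shows how the cutoff influences which genes drive the projection.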
This thesis consists of seven chapters. The main results are presented in Chapters 3–6.
In Chapters 3 and 4, fundamental approaches for genome and transcriptome assembly, two applications of NGS, are presented. Chapter 3 deals with the simpler genome assemblies of two bacterial species [1, 2]. In Chapter 4 a comprehensive across-species comparison of different de novo transcriptome assembly tools is presented [16]. Our idea of a merged assembly of various tools and parameter settings is presented and evaluated as a proof of concept [17].
Chapter 5 deals with another application of NGS data: the identification of differentially expressed genes between certain conditions. I selected two exemplary RNA-Seq studies for this thesis. The first one draws a comprehensive picture of the transcriptional response of human monocytes to fungal and bacterial infections under vitamin treatment [5, 6]. In contrast, the second, computationally much more complicated project deals with the transcriptional response of human and bat cells to Ebola and Marburg virus infections at different time points [4].
After the comprehensive discussion of different high-throughput NGS projects, special use cases for single genes are presented in Chapter 6. The presented approaches deal with the detection of positive selection and recombination events in homologous protein-coding sequences. Although the analyses are not directly based on NGS data, NGS can help to identify target genes. While working on differential gene expression during certain types of infections [4–7, 15], I found that positive selection is often the result of an evolutionary host-virus ‘arms race’. To comprehensively detect and visualize positively selected sites, I developed a web server called PoSeiDon [9], which has already been applied to the Mx1 gene of 13 bat species [7].
For almost all of the projects I conducted during my PhD, I generated extensive electronic supplement pages that allow other researchers to easily obtain, interpret and re-use the results in an effective and transparent manner.
Furthermore, I was heavily involved in the organization and execution of two great and very intensive ‘hackathons’ in Jena. The first one, dealing with one of my first NGS projects (presented in Sec. 5.2), took place in the second year of my PhD. After the start of the 2014 Ebola outbreak in West Africa, we decided to speed up our analyses and invited scientists specialized in the field of RNA-Seq to join us for one week in Jena to “Fight against Ebola”. The second one, in whose organization I was involved, took place in April 2017. Under the topic “Stay young or Die trying” we met again with scientists from all over Europe to work on the JenAge data set and tackled different questions related to the field of aging. Although the organization, execution and wrap-up of such workshops is very stressful and time-consuming, the outcome of such intensive weeks, which bring together many great scientists with different expertise, is of invaluable benefit for all participants.
This thesis is first and foremost based on the
following publications:
[1] Petra MöbiusΨ, Martin HölzerΨ, Marius Felder, Gabriele Nordsiek, Marco Groth, Heike Köhler, Kathrin Reichwald, Matthias Platzer, and Manja Marz. “Comprehensive insights in the Mycobacterium avium subsp. paratuberculosis genome using new WGS data of sheep strain JIII-386 from Germany”. In: Genome Biology and Evolution 7.9 (2015), pp. 2585–2601.
[3] Martin Hölzer, Karine Laroucau, Heather Huot Creasy, Sandra Ott, Fabien Vorimore, Patrik M Bavoil, Manja Marz, and Konrad Sachse. “Whole-genome sequence of Chlamydia gallinacea type strain 08-1274/3”. In: Genome Announcements 4.4 (2016), e00708–16.
[4] Konstantin RiegeΨ, Martin HölzerΨ, Tilman E. Klassert, Emanuel Barth, Julia Bräuer, Maximilian Collatz, Franziska Hufsky, Nelly B. Mostajo, Magdalena Stock, Bertram Vogel, Hortense Slevogt, and Manja Marz. “Massive Effect on LncRNAs in Human Monocytes During Fungal and Bacterial Infections and in Response to Vitamins A and D”. In: Scientific Reports 7 (2017), p. 40598.
[5] Tilman E. Klassert, Julia Bräuer, Martin Hölzer, Magdalena Stock, Konstantin Riege, Christina Zubiría-Barrera, Mario M. Müller, Silke Rummler, Christine Skerka, Manja Marz, and Hortense Slevogt. “Differential Effects of vitamins A and D on the Transcriptional Landscape of Human Monocytes during Infection”. In: Scientific Reports 7 (2017), p. 40599.
[6] Jonas Fuchs, Martin Hölzer, Mirjam Schilling, Corinna Patzina, Andreas Schoen, Thomas Hoenen, Gert Zimmer, Manja Marz, Friedemann Weber, Marcel A. Müller, and Georg Kochs. “Evolution and antiviral specificity of interferon-induced Mx proteins of bats against Ebola-, Influenza-, and other RNA viruses.” In: Journal of Virology (2017), JVI–00361.
[7] Martin Hölzer and Manja Marz. “PoSeiDon: a web server for the detection of evolutionary recombination events and positive selection”. In: Bioinformatics (2017), submitted.
[8] Martin Hölzer and Manja Marz. “The Dark Art of de novo Transcriptome Assembly: A Comprehensive Across-species Comparison of short-read RNA-Seq assemblers”. In preparation (2017).
[9] Martin Hölzer and Manja Marz. “GOAssembler: A Method Pipeline for the Construction, Evaluation and Clustering of de novo Transcriptome Assemblies”. In preparation (2018).
... and partially based on and complemented by:
[10] Abdullah H Sahyoun, Martin Hölzer, Frank Jühling, Christian Höner zu Siederdissen, Marwa Al-Arab, Kifah Tout, Manja Marz, Martin Middendorf, Peter F Stadler, and Matthias Bernt. “Towards a comprehensive picture of alloacceptor tRNA remolding in metazoan mitochondrial genomes”. In: Nucleic Acids Research 43.16 (2015), pp. 8044–8056.
[11] Petra Möbius, Elisabeth Liebler-Tenorio, Martin Hölzer, and Heike Köhler. “Evaluation of associations between genotypes of Mycobacterium avium subsp. paratuberculosis and presence of intestinal lesions characteristic of paratuberculosis”. In: Veterinary Microbiology 201 (2017), pp. 188–194.
[12] Petra Möbius, Gabriele Nordsiek, Martin Hölzer, Michael Jarek, Manja Marz, and Heike Köhler. “Complete genome sequence of JII-1961 – a bovine Mycobacterium avium subsp. paratuberculosis field isolate from Germany”. In: Genome Announcements (2017), submitted.
[13] The RNA tools and software consortium. “A community-driven catalog of RNA bioinformatics tools and their ontologies”. In preparation (2017).
[14] Nelly B Mostajo, Martin Hölzer, Abdullah H Sahyoun, Verena Krähling, Stephan Becker, and Manja Marz. “A comprehensive annotation of non-coding RNAs in bats”. In preparation (2017).
[15] Sebastian Bartschat, Clara Bermudez-Santana, Anke Busch, Alexander Donath, Jan Engelhardt, Andreas R Gruber, Jana Hertel, Michael Hiller, Martin Hölzer, Franziska Hufsky, Emanuel Barth, Frank Jühling, et al. “Comparative Analysis of Non-Coding RNAs in Nematodes”. In preparation (2017).
[16] Martin Hölzer, Manja Marz, Marc-Oliver Grimm and Daniel Steinbach. “Elucidation of the molecular mechanisms of progression of the non-muscle invasive urothelial carcinoma of the urinary bladder (NMIBC) and identification of possible prognostic markers and therapeutic targets by exom and 3’/5’ UTR mutation analyzes”. In preparation (2017).
[17] Martin Hölzer, Friedemann Weber, and Manja Marz. “Description of the transcriptomic landscape of the microbat Myotis daubentonii in response to interferon stimulation and an infection with the Rift Valley fever virus”. In preparation (2017).
[18] Barbara Müther, Martin Hölzer, Manja Marz, and Georg Kochs. “Evolution and antiviral specificity of Mx proteins in rodents”. In preparation (2017).
[19] Martin Hölzer, Ruman Gerst, and Manja Marz. “PCAGO: An interactive web service to analyze RNA-Seq data with principal component analysis”. In preparation (2017).
Contents
1 Introduction
1.1 The Dark Art of Next-Generation Sequencing
1.2 Contribution and scope of this thesis
1.3 Comprehensive supplemental materials
3 Genome Assembly
3.1 Assembly of the whole-genome of Chlamydia gallinacea
3.1.1 The genus Chlamydia
3.1.2 Sequencing and assembly
3.2 Comprehensive insights in the MAP genome
3.2.1 The genus Mycobacterium
3.2.2 Sequencing and assembly
3.2.3 Annotation
3.2.4 Phylogenetic reconstruction
3.2.5 Results and discussion
3.2.6 Conclusions
4 Transcriptome Assembly
4.1 The Dark Art of de novo transcriptome assembly
4.1.1 RNA-Seq: a revolution in transcriptomics
4.1.2 Material and methods
4.1.3 Results and discussion
4.1.4 Conclusions and future perspectives
4.2 Cluster de novo transcriptome assemblies: a proof-of-concept
4.2.1 How to improve assemblies?
4.2.2 Cluster approaches
4.2.3 Evaluation of merged assemblies
4.2.4 A possible cluster-assembly pipeline and future work
Bibliography
Appendix
1 Introduction
[Figure 1.1: schematic of sequencing reads (e.g. acggactaga) from a HiSeq 2500 entering a black box labelled Data, Algorithms and Analyses]
Figure 1.1: The comprehensive processing and preparation of huge amounts of data, e.g. obtained from an NGS experiment, involves many tools and methods based on different algorithms in order to perform the various analyses needed to tackle the specific computational and biological problems raised. For many researchers, the way in which sequencing data (or other kinds of biological data) are bioinformatically transformed into final result tables and figures often remains an opaque and obscure process, like a Black Box. This often leads to problems in the correct interpretation of the data, for example if scientists from other fields are simply not able to understand what happened inside the box.
If the biological questions that should be answered with the help of an NGS run are clear, an appropriately chosen experimental design can give great insights into the biological context and, furthermore, a lot of additional information that can also be used by other scientists to answer different questions.
However, the huge amount of obtained data needs to be handled and processed in a comprehensive, well-structured and transparent way. The ongoing development of new sequencing technologies, which offer for example longer reads but also higher error rates, still challenges current bioinformatic approaches for quality control, mapping, assembly and differential gene expression detection. In the fast-moving field of NGS technologies, new algorithms and bioinformatic tools have to be developed and evaluated continuously in order to meet the requirements of novel NGS data. It is often hard for other scientists to follow and completely understand the workflow of huge bioinformatic pipelines producing the output that they are finally expected to interpret. Often, this remains an opaque process, comparable to a Black Box (Fig. 1.1).
1.2. Contribution and scope of this thesis
In the end, obtaining an overall picture by NGS and combining it with more detailed investigations of single genes and nucleotide positions has the greatest potential to generate high-quality and sophisticated insights into manifold biological topics.
In the following chapters of this thesis, I will present a selection of different projects dealing with whole genomes, comprehensive transcriptomes and single nucleotides. Furthermore, a broad variety of species, from bacteria and fungi to other eukaryotes and even viruses, will be the focus of different chapters.
In Chapter 2 (Welcome to the Black Box) I will present some basic mechanics and methods that will be referenced consistently in this thesis. However, it is not possible to give a full and detailed overview of all the biological and computational background on which this thesis is based. Nevertheless, the basic ideas presented in Chapter 2 should be comprehensive enough to understand the following chapters.
In Chapter 3 (Genome Assembly) the studies of two bacterial genome assemblies are discussed. The first one, dealing with the obligate intracellular bacterium Chlamydia gallinacea, presents a basic genomic study, whereas the second one, focusing on the pathogen Mycobacterium avium subsp. paratuberculosis, presents not only the assembly of this bacterium but also a comprehensive genome-wide comparison and annotation study.
In Chapter 4 (Transcriptome Assembly) I switch from the topic of DNA sequencing and genome assembly to the sequencing of RNA and the reconstruction of whole transcriptomes instead of genomes. Different problems and challenges need to be taken into account when constructing a transcriptome instead of a genome. A comprehensive cross-species comparison of different de novo transcriptome assembly tools is presented, complemented by a proof-of-concept study discussing the advantages and disadvantages of merging assemblies of different tools and parameter settings.
Chapter 5 (Differential Gene Expression) combines the genomic and transcriptomic concepts presented in Chapters 3 and 4 and mainly deals with the detection of differentially expressed genes from RNA-Seq data. In the first part, the transcriptional landscape of human cells infected with either the bacterium Escherichia coli or one of the two fungi Candida albicans and Aspergillus fumigatus is presented. The other (much more complicated) study deals with the differential transcriptional responses to Ebola and Marburg virus infections in cells from bats and humans. The transcriptome assembly process presented in Chapter 4 is picked up again here. At the end of this chapter, a short detour compares two different NGS designs and sequencing setups.
Finally, after the comprehensive discussion of different high-throughput NGS projects, in Chapter 6 (Single Nucleotide Investigations) I will go deeper inside the Black Box and investigate more specialized use cases and restricted topics. The following two sections of this chapter are not directly related to NGS data and deal with the detection of positive selection and recombination events in homologous protein-coding genes. Due to my work on differentially expressed genes during certain viral infections [4, 7, 15], I also came into contact with positive selection, often a result of host-virus 'arms races' during evolution. To comprehensively detect and visualize positively selected sites, I developed a web server called PoSeiDon, which is presented in this chapter and applied to the Mx1 gene of 13 bat species.
In Chapter 7 the results of this thesis are summed up, conclusions are drawn and a future outlook is given.
With this thesis, I give deep insights into different bioinformatical techniques and design principles, looking from further away (Chapters 3–5) and also more closely (Chapter 6) at different kinds of data (not exclusively NGS data), aiming to answer various biological questions.
1.3. Comprehensive supplemental materials
Chapter 1. Introduction
Chapter 2
formats. This extended ontology will be used to improve the already available EDAM ontology [21] and the tools with their meta data will be provided to the life-science community via the ELIXIR node [22], with the goal to build a stable and sustainable infrastructure to spread biological information across Europe. Such a community driven database represents an important resource to help also non-bioinformaticians to obtain better insights into the Black Box of bioinformatical methods and tools.
Figure 2.1: In February 2016 there were already 389 tools registered by the ELIXIR RNA community [11]. The ten most frequently used topic terms are shown. Topics, mainly (red) and casually (yellow) tackled in this thesis are marked.
Before we enter the Black Box, some general algorithms, methodologies and workflows are introduced that are used throughout the chapters of this thesis (Fig. 2.2). More specific methods are described separately in the corresponding chapters.
[Figure 2.2, panel titles: (A) Wet Lab Preparation (experimental design) and Sequencing (e.g. Illumina); (B) Preprocessing (quality check and trimming); (C) Mapping & Quantification (genome and/or transcriptome reference); (D) DEGs; (E) Assembly (de novo, reference-based); (F) Gene Annotation (proteins, ncRNAs, database); (G) pathway and GO-term analysis; (H) Alignment (nucleotide, amino acid); (I) Phylogeny; (J) Remolding (tRNAs); (K) Positive Selection (and recombination); (L) Genetic Variation (SNVs, INDELs, isoforms); (M) Visualization; (N) Databases. Tools named in the figure: FastQC, Bowtie, TopHat2, Segemehl, SAMTools, Cufflinks, DESeq, R Bioconductor, KEGG, GO, IGO, Pathview, SPAdes, Trinity, cd-hit-est, Cluster-Assembly, Blast, GORAP, Bacprot, ClustalW, TranslatorX, Mafft, cmalign, RAxML, MrBayes, Newick Utilities, LoFreq, PAML4, CODEML, GARD, IGV, Sashimi, MLRD, PoSeiDon, NCBI, Ensembl.]
Figure 2.2: What is inside the Black Box? In this thesis, a broad range of topics, combining a huge variety of bioinformatical methods and tools, is presented. The figure gives a comprehensive overview of the different topics and the related chapters (CX) in which those topics are discussed. As examples, some tools used throughout the corresponding chapters are mentioned (written in italics). Tools and pipelines I developed during my PhD as parts of different projects are additionally marked blue (e.g. MLRD [3], IGO [4], PoSeiDon [9]). At the start of almost all projects there is an idea about the specific questions that
should be answered. (A) Depending on those questions, an experiment can be designed (Sec. 2.1.3)
to obtain sufficient data that is needed to answer those questions. This could be a Next-Generation
Sequencing (NGS) experiment, where samples are collected, RNA/DNA is extracted (e.g. in three
biological replicates) and specific molecules of interest (like miRNAs) are enriched, libraries are
generated and the sequencing itself is conducted to obtain the raw read data (Sec. 2.1). However,
the data can also be obtained from publicly available databases like Ensembl or NCBI (N). (B) The
raw data enters the Black Box and is preprocessed (Sec. 2.1.4). The quality-checked and adjusted
read data can be passed to mapping (C) (Sec. 2.4) and/or assembly (E) (Sec. 2.2), depending on
the questions raised. The mapped data can be further quantified and normalized (Sec. 2.5.1) to
estimate RNA abundances (C). If different samples and conditions were sequenced, these can be
compared to detect significantly differentially expressed genes (DEGs) (D) (Sec. 2.5). (G) Significant
genes can be used for pathway analyses and GO-term enrichment. (F) A quantification of read
data or the comparison of core genes between different species is only possible if an annotation
is available (Sec. 2.3). Whereas methods shown in C–G aim to give a more general view on
many genes simultaneously and aim to provide various connections between them, taking a closer
look on single genes or even single nucleotide positions is an important task to finally obtain
a comprehensive picture of a particular biological topic. (H) Therefore, building an alignment
of sequences is always a basic and crucial task of many bioinformatical applications. (I) The
differences and similarities in an alignment can be utilized to calculate phylogenetic trees, obtaining
insights into the evolution of species and their genes. (J) It is also possible to focus only on a
special gene family, like tRNAs, where one can search for a particular biological phenomenon like
remolding by combining data of H and I. This joint work is not presented in this thesis; however, the
interested reader can find details in the thesis of my former colleague Dr. Abdullah Sahyoun [20]
and our publication [3]. (K) Another possibility involves the detection of positively selected sites
and putative recombination events in an alignment of multiple protein-coding sequences. (L)
Another topic involves the identification of special genetic variations such as single nucleotide
variants and insertions/deletions, helping to understand what happens in a biological system on the
nucleotide level. (M) Finally, visualization is always an important part of all topics (Sec. 2.6). A
comprehensive and clear visualization of results, for example obtained from huge NGS projects, is
crucial and helps researchers to understand and interpret the data correctly. Shown as examples are
figures visualizing DEGs from far away (MA plots, 2D/3D PCA, heat map) and on a closer level
(box plots of individual genes, scatter plot comparing fold changes) as well as a Mauve alignment
of three bacterial genomes [1]. Importantly, the results can and should be validated again in the
wet lab and can be further used to develop new ideas and hypotheses for upcoming experiments.
The presented figure does not claim to be a complete representation of all possible analysis steps
of different bioinformatical approaches. However, it serves as a comprehensive overview of the
different topics interlinked in this thesis. Figures are partially adapted from our publications [1,
3–7, 9, 14].
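The dependencies between some of the panels described in the caption of Fig. 2.2 can be written down as a small directed graph. The following is only a minimal sketch: the edges are a simplified reading of the caption (for example, annotation is treated as a prerequisite of the combined mapping and quantification panel), the panel labels are shortened, and the helper name `analysis_order` is chosen here purely for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each panel maps to the set of panels it depends on. The edges are an
# illustrative simplification read off the caption of Fig. 2.2.
WORKFLOW = {
    "B: preprocessing": {"A: wet lab and sequencing"},
    "C: mapping and quantification": {"B: preprocessing", "F: annotation"},
    "E: assembly": {"B: preprocessing"},
    "D: differential expression": {"C: mapping and quantification"},
    "G: pathway and GO enrichment": {"D: differential expression"},
    "I: phylogeny": {"H: alignment"},
    "K: positive selection": {"H: alignment"},
}


def analysis_order(workflow):
    """Return one valid order in which the panels can be processed."""
    return list(TopologicalSorter(workflow).static_order())


order = analysis_order(WORKFLOW)
for step in order:
    print(step)
```

Any order returned this way places preprocessing before mapping and assembly, and differential expression before pathway analysis; several equally valid orders exist, so the printed sequence is one possibility among several.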
Chapter 2. Welcome to the Black Box
2.1. Next-Generation Sequencing
assembly of highly repetitive and huge genomes (Sec. 2.2). As we mostly deal with short read data in the following chapters, the concept of long read data is only mentioned and not discussed in more detail in this thesis. However, producing longer reads and sequencing single molecules without an amplification step is surely the direction in which NGS is evolving.
RNA
NGS does not only allow for the sequencing of DNA molecules, in fact it can be also modified for the sequenc- [...] in a biological sample at a given moment in time. RNA sequencing (RNA-Seq) is a powerful method for discovering, profiling, and quantifying RNA
[Figure 2.3, panel labels: (A) DNA molecule (5' to 3'); fragmentation and library preparation; fragments; insert size; NGS; paired-end reads r1 and r2. (B) DNA molecule (5' to 3') with Exon 1 and Exon 2; Translation & Splicing.]
Figure 2.4: Shown are the typical main steps of an RNA-Seq experiment, with the final goal to identify differentially expressed genes between different biological conditions. As we want to measure transcript abundances in our biological samples, we have to reverse transcribe the extracted RNA molecules of interest into a cDNA library for sequencing. This procedure involves steps like the fragmentation of the RNA and adapter ligation. After sequencing, we have to solve the inverse problem: due to the fragmentation of the transcripts, we need to find the true or most likely location from which each short read originated. By counting the reads and estimating RNA abundances, we can then compare different conditions and search for significantly differentially expressed genes, see Sec. 2.5 for details.
Before RNA-Seq emerged, gene expression studies were performed with hybridization-based microarrays. In contrast to microarray technology, RNA-Seq also allows for the identification of novel transcripts and does not necessarily need a sequenced reference genome. Furthermore, RNA-Seq allows for the genome-wide analysis of transcripts at single nucleotide resolution and therefore includes the identification of single nucleotide variants, gene fusions, allele-specific expression and alternative splicing events [33].
Most frequently, the NGS platforms Illumina HiSeq/MiSeq, Ion Torrent and 454 Pyrosequencing are used for RNA-Seq. As most of the projects presented in this thesis are based on the Illumina HiSeq system, this method is described in more detail in the following section.
2.1.2 Illumina
Illumina emerged as one of the most widely used NGS methods for both DNA- and RNA-Seq [29, 34]. The high accuracy and throughput, together with the still decreasing costs, made the Illumina platform suitable for targeting many biological questions, including gene expression studies (Sec. 2.5) and the draft assembly of (rather small) genomes and transcriptomes (Sec. 2.2).
After the DNA is fragmented and adapters are ligated, the fragments need to be immobilized on a plate (Fig. 2.5). The plate, also called flow cell, consists of multiple channels with a dense lawn of primers fixed to the surface, enabling the fragments to bind. Bridge amplification is used to generate clusters on the plate. In this process, the adapter at the free end of an already bound fragment interacts with the complementary primer fixed on the plate. Then, a double-stranded bridge is generated and, after denaturation, two single-stranded templates are produced, which are able to participate in the next amplification step. Following this workflow, clusters of identical sequences are produced, which are needed for the final sequencing step. The sequencing itself involves four fluorescently labeled and terminally blocked nucleotides that are flooded simultaneously over the flow cell in each cycle (Fig. 2.5A). If a nucleotide binds complementary to the template strand, a specific fluorescence color is emitted and the newly added base can be identified (Fig. 2.5B). Before the next cycle, the fluorophores are cleaved and washed from the flow cell and the 3’-OH group of each nucleotide is regenerated (Fig. 2.5C). In each sequencing round the reads are elongated by one nucleotide and finally saved as simple text strings in a FASTQ file [35] for further processing.
Figure 2.5: Workflow of standard Illumina sequencing. (A) Fluorescently labeled and terminally blocked nucleotides are added to the flow cell. Previously, fragments were immobilized on the surface and amplified into identical clusters using bridge amplification. Each cluster on the flow cell can now incorporate a different base. (B) Each cluster emits a nucleotide-specific color, recognized by a sensor. (C) The fluorophores are cleaved and washed away from the flow cell. The 3’-OH group is regenerated and a new cycle begins.
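To make the FASTQ output of a sequencing run more concrete: each read is commonly stored as a record of four lines (a header starting with `@`, the sequence, a separator line starting with `+`, and a quality string of the same length as the sequence). The following minimal sketch reads such records; the function name `read_fastq` and the two example reads are made up for illustration, and multi-line FASTQ variants are deliberately not handled.

```python
import io


def read_fastq(handle):
    """Yield (read_id, sequence, quality) tuples from a four-line FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file reached
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], sequence, quality


# Two made-up reads, used only for illustration.
EXAMPLE = "@read1\nACGGACTAGA\n+\nIIIIIIIIII\n@read2\nTTGCA\n+\nIIHH#\n"

records = list(read_fastq(io.StringIO(EXAMPLE)))
for read_id, sequence, quality in records:
    print(read_id, len(sequence), sequence)
```

For real data, an established parser is usually preferable to a hand-written one; the sketch only illustrates the record structure.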
In addition to the typical single-end reads (a fragment is only sequenced from the 5’ or 3’ end) produced by a standard Illumina run, the technology also allows for the production of so-called paired-end reads. In a paired-end design, each fragment is sequenced from both the 5’ and the 3’ end, so two corresponding reads (e.g. r1 and r2) are produced from each fragment (Fig. 2.3). As the mean fragment size in the library is known, the insert size between the two related reads can be calculated. This additional information can be used to significantly improve the assembly process (Sec. 2.2) and the mapping (Sec. 2.4) of the reads.
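As a small worked example of this calculation, the sketch below assumes that "insert size" is meant as in the text above, namely the unsequenced stretch between the two reads of a pair (terminology varies between tools, and some use the term for the whole fragment). The function name and the fragment and read lengths are hypothetical.

```python
def inner_distance(fragment_length, read_length):
    """Unsequenced gap between r1 and r2; negative if the two reads overlap."""
    return fragment_length - 2 * read_length


# Hypothetical library: 300 bp fragments sequenced with 2 x 100 bp reads.
gap = inner_distance(fragment_length=300, read_length=100)

# Hypothetical short fragments: the two 100 bp reads overlap by 50 bp.
overlap = inner_distance(fragment_length=150, read_length=100)

print(gap, overlap)
```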
Besides the sequencing of single- and paired-end reads, specialized protocols for strand-specific NGS exist, like the TruSeq Stranded Total RNA Library Prep Kit from Illumina. With such kits it is possible to produce strand-specific reads, whereas in a default Illumina run the strand information is lost. Knowing from which strand of the DNA an RNA-Seq read is derived is important information that can greatly improve de novo transcriptome assemblies and read quantification.
Design
First of all, it is of great importance to clarify from the start which biological questions should be tackled and answered with the NGS run. In most cases, many scientists with expertise in different fields are involved, and it needs to be clarified what the goals are and whether they can really be achieved with the designed NGS run. In most cases, it is necessary to collect all issues that might affect the results and to highlight what can and cannot be done to resolve them. How many reads are needed? Which read length is sufficient? Which molecules need to be addressed in the library preparation step?
Replication
Replication is a fundamental step in almost all experiments. Fortunately, technical replicates are not really needed for most well-established NGS technologies like Illumina [36]. Much more important are biological replicates, for example when planning an RNA-Seq experiment to identify differentially expressed genes between a healthy and a diseased (e.g. virus-infected) condition. Three biological replicates should be the minimum to consider for almost all experiments. If one sample fails to generate meaningful results, the outcome of a whole experiment can be useless. Four replicates are already a great improvement in the ability to detect differences between groups, and five add even more power, but beyond six replicates the additional power starts to tail off in many experiments [37, 38]. In most cases, the budget is also an important factor, unfortunately limiting the number of replicates far too often.
Multiplexing
Multiplexing describes the process of sequencing more than one sample in one pool (e.g. an Illumina lane). This can be achieved by adding short oligonucleotide sequences (barcodes) to the fragments. After sequencing, the resulting reads can be demultiplexed according to the known barcodes. The multiplexing factor depends on the number of reads that are needed in the experiment. For example, far fewer reads are needed to achieve sufficient coverage for the assembly of a bacterial genome than for a eukaryotic genome. The estimated coverage of a genome/transcriptome can be calculated as
\[ \mathrm{Coverage} = \frac{r \cdot l}{G} \]
where r is the number of reads, l is the read length and G is the expected genome or transcriptome size. For a standard RNA-Seq sample, we can expect a transcriptome of ∼75 000 000 base pairs. Therefore, a possible sequencing setup might involve the multiplexing of three samples in one Illumina HiSeq 2500 lane, resulting in approximately 60 million reads with a read length of 50 bp per sample. Roughly, this would result in a 40-fold coverage of a eukaryotic transcriptome of the given size.
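The arithmetic of this example can be checked with a few lines of code. The sketch below simply evaluates the coverage formula with the numbers given in the text (60 million reads, 50 bp read length, 75 000 000 bp transcriptome) and also inverts it to estimate how many reads a desired coverage would require; the function names are chosen here for illustration.

```python
def coverage(reads, read_length, target_size):
    """Expected coverage = (number of reads * read length) / target size."""
    return reads * read_length / target_size


def reads_needed(target_coverage, read_length, target_size):
    """Invert the formula: number of reads required for a desired coverage."""
    return target_coverage * target_size / read_length


# Numbers from the text: one of three multiplexed samples on a lane.
per_sample = coverage(reads=60_000_000, read_length=50, target_size=75_000_000)
print(per_sample)  # 40.0

required = reads_needed(target_coverage=40, read_length=50, target_size=75_000_000)
print(required)  # 60000000.0
```

Note that this is an average over the whole target; in a transcriptome, highly expressed transcripts receive far more reads than lowly expressed ones, so the per-transcript coverage varies widely around this figure.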
Protocols
Figure 2.6: Shown is a part of contig GL430013 of the Myotis lucifugus genome assembly. Total RNA was extracted from a cell line of a closely related species (M. daubentonii) and sequenced on an Illumina HiSeq 2500 with the Ribo-Zero (top, blue reads) and smallRNA (bottom, red reads) protocol. Shown are mapped reads of one replicate of a non-infected sample, taken as an example from a currently ongoing study [15]. By combining reads from both protocols, we are able to observe the expression of the UABP2 gene and the SNORD121A ncRNA, that would otherwise be lost when only performing the rRNA depletion protocol. Furthermore, the strand-specificity of the reads from both protocols and the advantage of a splice-aware mapping tool (in this case STAR [39], see Sec. 2.4) for RNA-Seq data are shown.
As we can only analyze those molecules that were previously prepared for the se-
quencing run in the library preparation step, it is clearly important to select an
appropriate protocol for each NGS setup. Currently, a wide variety of different li-
brary preparation protocols exist, targeting different types of nucleotide molecules
(we will focus here mostly on Illumina kits). Whereas for DNA-Seq the selection
of an appropriate protocol might not be that difficult, one of the biggest problems
for RNA-Seq is the high amount of ribosomal RNAs (rRNA) present in total RNA
15
Chapter 2. Welcome to the Black Box
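[Figure 2.7: panel A, length distribution plots for Standard-Gel and BluePippin; panel B, Venn diagram of Standard-Gel versus BluePippin with the values 132, 225 and 2]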
Figure 2.7: Comparison of smallRNA selection with a gel and the BluePippin protocol. (A) Shows
FastQC length distribution plots after adapter clipping. In the BluePippin selection, the ∼22 nt
peak, most likely originating from microRNAs, is lost. The ∼33 nt peak is most likely produced
by small tRNA- and yRNA-derived RNAs [42]. (B) Shows the overlap between miRNAs covered
by at least 10 uniquely mapped reads. Almost all expressed miRNAs are part of the standard gel
size selection, but 132 are completely lost with the BluePippin protocol. The results shown here
as an example are based on smallRNA-Seq data obtained from the FLI in Jena. The BluePippin
protocol is currently being established, so only preliminary results of the comparison of both
protocols are shown.
samples. The rRNA can comprise up to 90 % of the total RNA [40]. Various prior-
to-sequencing procedures, such as mRNA amplification kits, can help to enrich the
yield of mRNA [41]. This is mostly done by a specific selection of RNA molecules
with a poly-A tail. Of course, this procedure loses many RNAs that lack a poly-A tail,
such as certain long non-coding RNAs and many other non-coding RNAs.
Therefore, another kit is increasingly used
for total RNA library preparation: the Ribo-Zero rRNA removal kit. With this
kit, magnetic beads are used to specifically target certain rRNA types in order to
remove them physically. Unfortunately, with this protocol, small RNA molecules
like miRNAs will be lost (Fig. 2.6). Specialized protocols, like the Illumina TruSeq
Small RNA kits, can be used to select small molecules from total RNA. Nevertheless,
huge differences can occur between specialized protocols for the selection of smallRNAs,
especially if the protocols are not yet well established (Fig. 2.7). Therefore,
the adequate selection of a fitting library preparation protocol is a crucial
step in each NGS design.
For a standard RNA-Seq workflow with the goal of identifying differentially expressed
genes in organisms with a reference genome available, we obtained good results
by combining two different Illumina protocols for RNA-Seq: rRNA− (Ribo-Zero)
and smallRNA, to cover most of the transcriptome present in a biological sample
(Fig. 2.6). We also observed that for model organisms a read length of 50 bp is
sufficient for mapping and read quantification and to finally call differentially
expressed genes (Sec. 2.5). It is important to use strand-specific data for the correct
quantification of smallRNAs and antisense transcripts. Of course, if the goal of an
NGS project is the assembly of a new genome, longer paired-end reads should be
preferred.
2.2. Assembly
[Figure 2.8: FastQC adapter content plot; y-axis: % Adapter. Example reads shown in the figure:
TATTGCACTTGTCCCGGCCTGTTGGAATTCTCGGGTGCCAAGGAACTCCAG
TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACTGACCAATCTCGTATGCC
CAACGGAATCCCAAAAGCAGCTGTGGAATTCTCGGGTGCCAAGGAACTCCA]
Figure 2.8: Shown as an example is the FastQC adapter content report for one smallRNA-
sequenced sample of a human HuH7 cell line. The sequencing was done on an Illumina HiSeq2500,
producing strand-specific single-end reads with a length of 51 bp. Clearly, an Illumina small RNA
adapter was sequenced in some cases, because many smallRNAs (like microRNAs) are shorter
than 50 bp. Those adapters need to be removed prior to further analyses of the data. Adapter
sequences are marked. In one case, the full adapter was sequenced. This data was partially used by
Mostajo et al. [12].
[...]tions, bases of low quality and possible [...] calculate certain statistics and a
distribution of quality scores per nucleotide. Based on the quality values, bases with
a comparatively low quality can be removed, for example with PRINSEQ [44].
As the DNA/RNA was fragmented into smaller pieces before sequencing and adapters were
ligated, it can happen that the adapter is also partially sequenced. This especially occurs
when the length of the sequenced fragment was shorter than the applied read
length, which is often the case for smallRNA sequencing data (Fig. 2.8). Adapters
can be removed with CUTADAPT [45].
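To illustrate what adapter removal means for reads like those of Fig. 2.8, the following minimal Python sketch clips a 3' adapter from a read. It is not how CUTADAPT is implemented: it only handles exact matches, whereas dedicated tools also tolerate sequencing errors. The adapter string is taken as the subsequence shared by the three example reads of Fig. 2.8 and is assumed here to be the small RNA 3' adapter; the function name and the min_overlap parameter are hypothetical.

```python
# Assumed 3' adapter: the subsequence shared by the example reads in Fig. 2.8.
ADAPTER = "TGGAATTCTCGGGTGCCAAGG"


def clip_adapter(read, adapter=ADAPTER, min_overlap=5):
    """Remove an exactly matching 3' adapter (full or partial) from a read."""
    # Case 1: the complete adapter occurs somewhere in the read.
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Case 2: only a prefix of the adapter is present at the read end.
    for k in range(min(len(adapter) - 1, len(read)), min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    # No adapter found: return the read unchanged.
    return read


reads = [
    "TATTGCACTTGTCCCGGCCTGTTGGAATTCTCGGGTGCCAAGGAACTCCAG",
    "TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACTGACCAATCTCGTATGCC",
    "CAACGGAATCCCAAAAGCAGCTGTGGAATTCTCGGGTGCCAAGGAACTCCA",
]
clipped = [clip_adapter(r) for r in reads]
for original, insert in zip(reads, clipped):
    print(len(original), len(insert), insert)
```

Under these assumptions, the second read consists of adapter sequence only, so nothing remains after clipping; such reads would typically be discarded by a minimum-length filter.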
2.2 Assembly
Recent advances in NGS technologies make it possible to generate incredible amounts of
sequencing data. Quite recently, Illumina announced two new high-throughput
sequencing systems: the HiSeq X Ten and the NextSeq 500, stating that the latter would
be able to produce a sequencing throughput of 120 Gb (gigabases) per run, including
400 million paired-end reads of up to 150 bp each (footnote 1). In the context of this
continuing evolution of high-throughput sequencing technologies, we are able to produce
large amounts of read data in a cost-effective and time-saving way. In particular,
massively parallel cDNA sequencing (RNA-Seq) has become established as a major tool for
transcriptome quantification and analysis [32] (Sec. 2.1.1).
Nevertheless, as a typical eukaryotic genome is millions of nucleotides in size, no
current sequencing technology can decode such long sequences in one shot. Of course,
recent advances in NGS [29] have led to an increase in read lengths, but short-read
methods like Illumina (Sec. 2.1.2) are still widely used. They are well established
1 https://www.illumina.com/content/dam/illumina-marketing/documents/products/datasheets/datasheet-nextseq-500.pdf
and have a comparatively low error rate combined with still decreasing costs per
sequenced base. As one major step of NGS involves the fragmentation of the DNA
(or RNA), subsequently all small pieces (reads) need to be computationally merged
to rebuild the genome (or transcriptome) sequence.
This can be either done according to an already known reference sequence (refer-
ence based assembly) or without the use of prior knowledge (de novo assembly). If a
reference genome (or transcriptome) is available, the reads can be mapped (Sec. 2.4)
to the reference and subsequently assembled with tools like Cufflinks [46] or
Scripture [47] by clustering overlapping reads which aligned to nearby positions
in the genome. If no reference is available, or if the reference might not be complete,
the direct de novo assembly of RNA-Seq reads is an alternative strategy.
de Bruijn method uses unique subsequences of the reads to represent nodes. These
subsequences are called k-mers and represent substrings of the reads of length k. An
edge in the de Bruijn graph describes an overlap of length k − 1 between two k-mers.
Constructing the de Bruijn graph can be done in linear time because no pairwise
alignments are needed, unlike in the overlap-graph approach. Thus, the assembly
process becomes the problem of finding a path in a graph that visits every edge exactly
once, also known as an Eulerian path, which can be solved efficiently [50]. However,
the entire graph has to be held in memory for assembly, so de Bruijn approaches
can become very memory-intensive for large datasets.
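The graph construction described above can be sketched in a few lines of Python. The sketch assumes the convention in which k-mers are the nodes and an edge connects two k-mers that overlap by k − 1 characters and occur consecutively in a read; real assemblers additionally handle reverse complements, sequencing errors and k-mer counts. The function name and the toy reads are hypothetical.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each k-mer to the set of k-mers that directly follow it in a read."""
    graph = defaultdict(set)
    for read in reads:
        # Consecutive k-mers in a read overlap by exactly k - 1 characters.
        for i in range(len(read) - k):
            left = read[i:i + k]
            right = read[i + 1:i + k + 1]
            graph[left].add(right)
    return graph


# Two toy reads that overlap; one pass over the reads builds the graph.
graph = de_bruijn_graph(["ACGTC", "CGTCA"], k=3)
for kmer in sorted(graph):
    print(kmer, "->", sorted(graph[kmer]))
```

Because every read is scanned once and no read is compared with any other read, the construction time grows linearly with the total read length, which is the advantage over pairwise overlap computation mentioned above.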
Some prominent and still widely used de novo genome assemblers based on
de Bruijn graphs are Velvet [51], ABySS [52] and SOAPdenovo2 [53]. Mira
(http://www.chevreux.org/projects_mira.html), which was already developed in
the early 2000s, is still based on overlap-graphs. Another recently published
de novo genome assembler is SPAdes [54, 55]. SPAdes was originally developed
for smaller bacterial-size genomes and single-cell data, but can also be applied to
standard isolates and other organisms, although it has not been fully tested on larger
genomes.
All of the above-mentioned de novo genome assemblers can, in principle, be applied to
transcriptome assembly. However, the assembly of transcriptomic data has
some special requirements in contrast to genome assembly. Whereas
genomic short reads should be more or less uniformly distributed over the whole
genome, the distribution of transcriptomic reads can differ by many orders of magnitude
and includes transcripts expressed at both low and high levels. Furthermore, de novo
genome assemblers were developed to reconstruct as few contiguous DNA
sequences as possible while simultaneously maximizing their length. However, in a
transcriptome assembly, we can expect thousands of sequences (transcripts) and, on top
of that, different isoforms originating from alternative splicing. Therefore, de novo genome
assemblers can in principle be applied to transcriptomic short read data [56], but
there are additional challenges that need to be taken into account during the
assembly process of transcripts.
mapped back to the respective reference (see Sec. 2.4). This approach is also known
as mapping-first or reference-based assembly. Transforming a large assembly prob-
lem into a smaller one by reducing the number of short reads and the possible
connections between them by first mapping reads to the genome is a big advan-
tage of the reference-based transcriptome assembly. On the other hand, the success
of a reference-based strategy depends heavily on the quality of the used reference
genome. Most genome assemblies contain many errors like misassemblies or
insertions and deletions [59] and can therefore lead to biased or partially assembled
transcripts. Of course, a reference-based assembly is not possible if no reference is
available.
Thus, next to the reference-based approach a second, called de novo tran- [...]
[Figure labels: Gene, Exon, a, b, c]
2.3. Annotation
Finally, not all transcriptome assemblers are based on de Bruijn graphs:
Mira, which is based on overlap-graphs as stated above, can also be applied to
RNA-Seq short reads using a special EST mode [64].
One approach to overcome the problem of unevenly distributed short reads
in RNA-Seq data is the use of multiple k-mer values instead of one to build the
de Bruijn graph [65]. The value of k has a big influence even on de novo genome
assembly, but for transcriptomes in particular it can be used to handle both lowly and
highly expressed transcripts more efficiently. Using different k-mers to represent the
short reads of an RNA-Seq experiment provides a better representation of transcript
isoforms resulting from alternative splicing events. The assembled contigs based on
a range of different k values can be merged to generate longer transcripts
and a better overall assembly [66]. Nevertheless, the efficient merging of assemblies
resulting from different k-mers and/or tools is still a challenging task. On the one
hand, very similar isoforms should not be merged into one transcript, e.g. a smaller
isoform that is simply a subsequence of a larger one. On the other hand, merging the
contigs of different de novo assemblies can introduce high redundancy in the final
assembly if similar transcripts are not efficiently merged, thereby slowing down
and complicating further analyses.
2.3 Annotation
2.4 Mapping
Mapping describes the process of generating alignments of DNA- or RNA-Seq reads
to a reference genome or transcriptome. The general goal when mapping sequenced
reads is to determine for each read (or read pair) the true location (origin) with
respect to the reference. Therefore, mapping tools must be able to align millions
of short sequences to a huge reference sequence in a reasonable amount of time.
Generally, the process of mapping involves three major steps: 1) the creation of an
index of the reference sequence (just once), 2) the independent alignment of the reads
of each sample to the reference, and 3) the conversion of the resulting mapping file
for further downstream analyses like read quantification (Sec. 2.5.1) and differential
gene expression analysis. Many programs have been developed to map reads to a
reference sequence, varying in their algorithms and speed.
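The first two steps, indexing the reference once and then aligning each read independently, can be illustrated with a toy exact-match mapper in Python. This is only a sketch under strong simplifying assumptions: the reference is a short made-up string, the index is a plain hash table of k-mers, and reads must match without mismatches, whereas real mapping tools use compressed indexes and tolerate errors and splits. All names are hypothetical.

```python
from collections import defaultdict


def build_index(reference, k):
    """Step 1: index every k-mer of the reference by its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def map_read(read, reference, index, k):
    """Step 2: look up the first k-mer of the read and verify the full match."""
    hits = []
    for pos in index.get(read[:k], []):
        if reference[pos:pos + len(read)] == read:
            hits.append(pos)
    return hits


reference = "ACGTACGTTTGACCA"  # made-up toy reference
index = build_index(reference, k=4)  # built once, reused for all reads
for read in ["TTGACC", "ACGT", "GGGG"]:
    print(read, map_read(read, reference, index, k=4))
```

In this toy example one read maps to a single position, one maps equally well to two positions, and one does not map at all; the second case corresponds to the ambiguously mapped reads that have to be handled in the downstream analysis.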
Some commonly used tools [70] for DNA-Seq data are BWA [71] and Bowtie [72],
and for RNA-Seq data TopHat [73], HISAT [74], STAR [39], and Segemehl [75].
The distinction between a DNA- and an RNA-Seq mapper is important, because
reads originating from transcripts of higher organisms need to be mapped in relation
to possible exon-intron junctions in the reference sequence (Fig. 2.10). Therefore,
so-called splice-aware mappers like TopHat and Segemehl were developed. Whereas
most splice-aware mapping tools only allow for one split per read, Segemehl is
able to split a read multiple times. This behavior is especially advantageous if
longer reads are mapped to the reference and/or short exons are involved.
Furthermore, it is very likely that some reads map with the same quality
to multiple locations in the reference. This might be due to repetitive regions present
in a genome or to similar transcript isoforms sharing the same exons. Such ambiguously
mapped reads need to be considered in the downstream analysis (Sec. 2.5.1).
The SAMtools software suite [76] is widely used to convert, sort and index
alignment files (SAM/BAM) produced by current mapping tools.
2.5. Gene expression analyses
and additionally treated with the vitamins A and D [5, 6] (Sec. 5.1) or between a
control HuH7 cell and a HuH7 cell infected with the Ebola virus [4] (Sec. 5.2).
By performing such experiments, a broad variety of biological questions can be
addressed, like: How many genes are significantly differentially expressed between
treatments? Can DEGs be clustered in functional groups? Are they involved in the
same pathways?
[Figure 2.12: schematic of Gene A and Gene B with mapped reads under the control condition and the treatment condition]
Figure 2.12: Shown are two genes with a homologous sequence part. Therefore, some of the reads
in the control (top) and treatment (bottom) condition map uniquely to gene A and B, but
some reads are also multi-mapped (light blue). Let the biological truth be that all ambiguous
reads are originally derived from gene A. If we now counted all reads (including the
multi-mapped ones) and compared the read counts between control and treatment, gene
B might appear differentially expressed, because the multi-mapped reads raise the read count in
the treatment condition. We would possibly detect gene B as a false positive DEG. On the other
hand, if we remove the ambiguous reads completely (because we do not know for sure where they
originated from), the expression of gene A in the treatment condition is much lower and so is
the calculated fold change (Sec. 2.5.2).
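The effect described in the caption of Fig. 2.12 can be made concrete with a small numeric sketch in Python. All read counts below are hypothetical and chosen only for illustration; the sketch assumes, as in the caption, that every ambiguous read truly originates from gene A.

```python
import math


def log2_fold_change(control, treatment):
    """log2 ratio of treatment over control read counts."""
    return math.log2(treatment / control)


# Hypothetical counts of uniquely mapped reads per gene and condition.
unique = {"A": {"control": 100, "treatment": 100},
          "B": {"control": 50, "treatment": 50}}
# Hypothetical counts of multi-mapped reads (all truly derived from gene A).
ambiguous = {"control": 20, "treatment": 80}

# Strategy 1: count ambiguous reads for every gene they map to.
lfc_b_all = log2_fold_change(unique["B"]["control"] + ambiguous["control"],
                             unique["B"]["treatment"] + ambiguous["treatment"])

# Strategy 2: discard ambiguous reads completely.
lfc_a_discard = log2_fold_change(unique["A"]["control"],
                                 unique["A"]["treatment"])

# Biological truth under the stated assumption.
lfc_a_truth = log2_fold_change(unique["A"]["control"] + ambiguous["control"],
                               unique["A"]["treatment"] + ambiguous["treatment"])

print(round(lfc_b_all, 3), round(lfc_a_discard, 3), round(lfc_a_truth, 3))
```

With these made-up numbers, counting all reads gives gene B a clearly positive log2 fold change although its own expression is unchanged, while discarding the ambiguous reads hides the real change of gene A.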
need to be normalized according to the library size. Otherwise, a gene could appear
2-fold expressed in one sample merely because twice as many reads were sequenced.
Additionally, the gene length also has an impact on the overall number of sequenced
reads. Statistically, more fragments are derived from a longer transcript during
the library preparation step, and so more reads are sequenced just by chance. One
approach to normalize for the library size and feature length simultaneously is the
calculation of transcripts per million (TPM [79]) values:
\[ \mathrm{TPM}_i = \frac{c_i}{l_i} \cdot \left( \frac{1}{\sum_{j \in N} \frac{c_j}{l_j}} \right) \cdot 10^6 \]
where c_i is the raw read count of gene i, l_i is the length of gene i and N is the
set of all genes in the given annotation. TPM values can be used to filter out
lowly expressed genes with respect to their length.
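The TPM definition translates directly into code. The following Python sketch assumes that raw read counts and gene lengths are given as two parallel lists; the function name and the example values are hypothetical.

```python
def tpm(counts, lengths):
    """Transcripts per million for parallel lists of read counts and lengths."""
    # Length-normalized read rate c_i / l_i for every gene.
    rates = [c / l for c, l in zip(counts, lengths)]
    # Sum over all genes in the annotation (the denominator of the formula).
    total = sum(rates)
    return [rate / total * 1e6 for rate in rates]


# Hypothetical example: three genes with read counts proportional to length.
values = tpm(counts=[10, 20, 30], lengths=[1000, 2000, 3000])
print(values)
print(sum(values))
```

Since the counts in this example scale exactly with gene length, all three genes receive the same TPM value, and the values of a sample always sum to one million (up to floating-point rounding).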
Another widely used normalization unit is called RPKM or, in the case of paired-end
data, FPKM [46]. Both units represent the same quantity: reads (fragments) per kilobase per
million mapped reads (fragments). This is the number of reads aligning
to a feature (like a gene), normalized by the total number of reads mapped (in
millions) and the length of the feature (in kilobases). In contrast, the TPM value is
the number of reads from a particular feature normalized first by the feature length,
and then by sequencing depth (in millions) in the sample. Recently, it was shown
that the widely used R/FPKM measures are inconsistent among samples [80]. This
inconsistency essentially arises from the division by total read counts in the
library normalization step after normalizing by length. Therefore, the TPM value
is more accurate than the R/FPKM measure.
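One way to see the inconsistency is to compare the per-sample sums of both measures. The Python sketch below uses two hypothetical samples with the same total read count but a different distribution of reads over two genes; the function names and numbers are assumptions made for illustration only.

```python
def rpkm(counts, lengths):
    """Reads per kilobase per million mapped reads."""
    total_reads = sum(counts)
    return [c / (l / 1e3) / (total_reads / 1e6)
            for c, l in zip(counts, lengths)]


def tpm(counts, lengths):
    """Transcripts per million (length normalization first)."""
    rates = [c / l for c, l in zip(counts, lengths)]
    total = sum(rates)
    return [rate / total * 1e6 for rate in rates]


lengths = [1000, 2000]   # hypothetical gene lengths in bp
sample_1 = [100, 100]    # hypothetical read counts
sample_2 = [150, 50]     # same total, different distribution

rpkm_sums = [sum(rpkm(s, lengths)) for s in (sample_1, sample_2)]
tpm_sums = [sum(tpm(s, lengths)) for s in (sample_1, sample_2)]
print(rpkm_sums)
print(tpm_sums)
```

In this sketch the RPKM values of the two samples sum to different totals, whereas the TPM values sum to one million in both, so a TPM value represents the same share of the sample in each case.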
However, when searching for differentially expressed genes between different samples,
a normalization based on the gene length is not necessary, because we will only
compare the expression of the same genes in different samples (and therefore only
2.6. Visualization
[Figure: scatter plot; y-axis: log2 fold change (tick labels 0, 2, 4, 6)]
●●
● ●●●
●
●●●●●●● ● ● ●● ●● ● ●●
● ●
●● ●●
●
●● ● ●● ●
●
●●●
● ●●
● ●
●
● ●●● ●
●● ●●●
●●
●
●
●
●●
●●●●●
● ●
●
●
● ●
●●
●
● ●●
●
●●●
●
●
●●
●
●●
●●
● ●●
● ●
●
●
●●●
●
●●
●●
●
● ● ●
●●
●
●
●
●
●
●
●
●●●
●
●●●
●
●●
●
●
●
●●
●
●●
●●
●●
●●●
●
●●
●
●●
●
●●
●
●
●●
●
●●
●●
●
●●
●●
●●
●●
●●●●
●●●
●
●
●
●
●●
●
●●●●
●●●
●
●
●●
●●● ●
● ●●
● ●● ● ●●●●●●●●●● ●●●
●●
●●● ● ●● ● ●●● ● ● ● ●
●● ● ● ●●●●●
● ● ● ● ● ●
●
●
●●●●●●●
●●
●● ●
●●●●
●●
● ●●
●
●● ●● ●●
●●
●●●●
● ●●
●●
●●● ●
●●●
●●
●
●●●●●●
●●●
●●
●●
●
●●●
● ●
● ●
●●●
●●●●
●
●
●
●
●● ●●
●●●●
●
●●●
●●●●
●●
●●
●
●●
●●
●●●
●●●
●
● ●●●
●
●
●●
●
●
●●●●
●●
●●●●
●
●●
●●●
●●●
●
●●
●●
●
●
●●
●●
●●●
●
●●
●
●●
●
●●●
●
●●●●
●●
●●
●●
●●
●●
●
●
●●
●
●
●●
●
●●
●
●
●
●●
●
●●
●
●
●●
●●
●●
●●
●●
●
●
●●
●
●
●●
●
●
●
●●
●
●●
●●
●●
●
●●
●●
●●
●●
●●●
●
●
●
●
●
●
●●
●●
●
●
●
●●
●●
●
●●
●
●●
●●
●
●●
●
●
●
●●●
●
●●
●●
●
●●
●
●●
●
●●●●
●
●
●●
●●
●
●
●
● ●
●
●●●●
●●
●
●●
●
●●
●
●
●
●
●●●
● ●● ●
●
●●
●
●●
● ●
●●●●●●●
●
● ●●
●●●
● ●●
●●● ●●●●● ● ●
●● ● ● ●●
●● ● ● ●● ● ●●
●●● ● ●●
●● ●●●
●●
● ●
● ●●●●
●
● ●● ●
● ●●
●● ●
●
●●●●
●●●
●●●●●
●●●●
●●
●
●●
●
●
●● ●
●●
●●●
●●
● ●●●●
●● ●●
●●●●●
●● ●●
●
●● ●●●
● ●●
●● ● ●●●●
●
● ●●
●●
●● ●●
● ●●●● ● ● ● ●
●
●● ●●● ●● ●●
● ● ●●● ● ●
●
● ● ● ● ● ●● ●●
● ●●● ● ●● ●●
●●●●●
●●
●●●●● ●●
● ●
●●●● ●●
●
●● ●●
●●●●
●●●
●●●
●● ●
●●●
●●●
●●
●●●
●●●●
●● ●
●●
● ● ●
●●●
●●
● ●
●●
●●●
●
●
●●●●
● ●●●
● ●
●●
●
●●●
●●●
●
●●
●
●●
●●
●●
●
●●●●●●●●
●●
●●●●●●●
●●
●
●●
●●
●●
●
●●●
●●
●●
●● ●
●●
●●
●●●
●
●●●
●●
●●
●
●●
● ●●
●●
●●
●
●●
●●●
●●
●
●●●
●●●
●
●
●
●
●
●●●●
●
●
●●
●
●
●●
●
●●
●
●●
●
●●
●
●●
●
●
●●
●●●
●●
●●
●
●●
●●
●
●
●
●
●●
●●
●
●●
●
●
●●
●
●●
●●
●
●●
●
●
●●
●●
●
●
●
●
●
●
● ●
●
●
●
●●
●●
●
●●
●●●
●
●
●●
●
●
●●
●●
●
● ●●
●
● ●
●
●
●
●●
●●●
●
●●
●●●
●
●●
●
●●●
●●
●
●
●●●●
●●●
●●
●
●●
●●
●●●
●
●●●
●
●●
●
● ●●
●●● ●●●●
●●●
●
● ●● ●
●●
●●●●
●●● ● ●●● ●
●● ● ● ● ●●●
●●●●●● ● ●●●
● ●●●●● ● ●●
● ●●
●●●●●●
●●
●●● ●● ●●
● ● ●● ●
● ●●
● ●
● ●●●●●●
●
● ●
●
●●●● ●● ●●●●●●●● ●●
●●
●●
●● ●●●●●●●
●●●● ●
●
●●●
● ●
●●
●
●
●●●
● ● ●●● ●●●
●●
●
●●●●
●●●
● ●●
●●●
●●●●●
● ● ●
●●●
●● ●
●●●●
●● ●
●
●
●● ●●●
●●
●●●●●●
●●●●
●●●●
●● ●
●●
●●
●●
●●●
●●
●●
●●
●●
●
●●
●●●
●●
●
●●
●●●●
●●●
●●●
● ●
●●●
●
●●
●●●
●
●
●●
●●
●
●
●
●●●
●●●
●●
●
●●
●
●●
● ●
●
●●●
●●
●●
●●
●● ●
●●●
●●
●
● ●
●● ● ●
●
●●
● ● ● ●
●●● ●●● ●●●
●●●●
●
● ●
● ●● ● ●
●●
●
●● ●●●● ● ●●●
● ● ●● ●●● ●
●● ● ● ● ●●●● ●
●●● ●
● ● ●● ● ●● ●● ● ●●
● ● ●●
● ●●●
●●
●● ●●
●●●●● ●● ●●●●●
●●●●● ●●
●●● ●●
●●●
●●
●●● ●●
●● ●●● ●●●
●●●
●● ●●
●●●●
● ● ●
●●● ●
●●●
●●
●
●
● ●●
●●●●
●●●
●●
●●
●●●
●
●●
● ●
●●●●
●
●
●●
●●●
● ●
●
●●
●●
●●
●●
●●
●
●
●●
●
●
●●
●●
●●
●●
●●
●●●●
●
●●
●
●
●●
●●●
●
●
●
●●
●
●●
●
●
●●
●
●●
●●
●
●
●
●●● ●
●
●
●
●
●●
●●
●
●
●●
●
●●●●
●●
●●●●
●
●●●
●
●
●
●
●
●
●●●
●
●●
●
●●
●●
●
●
●
●
●●
●●
●●
●
●●
●
●
●●
●
●
●
●●●
●●
●●
●
●●
●●●
●●
●
●●
●
●
●●
●●●
●●●
●●●
●
●●
●
●●
●
●●
●
●●
●●
●
●●●
●●
●
●●
●●●
● ●●
●
● ●●
●● ●●●●● ●●
●●
●● ●
● ●●
●●●● ● ● ● ● ●● ●
●●
● ●● ● ● ● ● ●● ● ●● ● ● ●● ● ● ● ● ● ●
● ● ● ● ● ●● ● ●● ● ● ● ● ●● ●● ●
●●●
● ●●●● ●
●●●● ●●● ●● ● ●●● ● ●●● ● ●
● ● ● ●
● ● ●● ●● ●
●● ● ●●●●●● ● ●
●● ● ●●●
● ●● ●
●●● ●●
●● ●● ●
● ●
● ●
●●
●● ●●●●●●● ●
●● ●●●
●
●●●● ●●●●
● ●●
●●●● ●
● ●●
●●●●●
●
●●
●●
●
●●●
●
●● ●
●●
●
●●
●● ●
●●
●●
●●
●●● ●●
●●
●
●●●●
●●●
●●
●●
● ●
●●
●●●
● ●
●
●●
●●●●
●
●●●
●
●
●●
●●●
●●
● ●
●●●●
●●●
●●
●●●●
●
● ●
●●
●●
●●
●
●
●
●●
●
●●
●●
●
●●
●●
●
●●●
●
●●
●
●
●●
● ● ●
● ●
●
●●
●●
●●●●
●●●● ●●●●●●
● ●●
●●
●● ●
●●
● ●●●
●●
●●● ●●● ● ●
●● ●● ●
●
●●
● ● ●
● ●● ● ● ●●●
●●● ●●
●●●● ●● ●●●●
●● ●● ●●●
●●●●●●
● ●●
● ●●● ●●●● ●
●
●
●●●
●●● ●●
●●
●● ●●●
●●●●●
●
●●
●●
●
●●
●
●●●
●●●●
●●
●●
●
●●●●●
●
●●
●
●
●●●●●●
●●
●●●
●
●
●●
●
●
●●●
●●
●
●
●●● ●
●
●●●
●
●●
●●
●●
●
●●
●
●
●●
●●●
●●
●●
●
●
●●●●
●
●●
●
●●
●
●
●●
●
●
●
●
●●●
●●●
●●
●
●●●
●
●●
●●
●
●●
●●
●●●●●
●●
●
●
●●
●
●●●
●
●●
●●
●●●●
●●●
●●
●●
●
●
●
●●●
●
●
●
●●
●●●●●
●●
●
●●●
●●●
● ●●
●●
● ●
●●●●●
●
● ● ●
● ●●
●●● ● ●●●● ● ●●● ● ●● ● ●
● ●●● ● ● ●
● ●● ● ● ●● ●
● ●●●● ●● ●●●● ●
●●●●
● ●●● ●● ●
●●●●●● ●●●●
●
● ●
●● ●●●●●
●●
●●●●
●● ●●
●●
●
●●● ●
●●
●
● ● ●●●
● ●●●
●●●●●● ●●●● ●
●●
●●
●●● ●
●●●●
●
● ●
●●
●●
●
●●●●● ●
●
●
●●●
●
●●●●●
●
● ●
●● ●
● ●●●
●●●
●●●
●●●●
●
● ● ●
●●
●
●●
●
●●●●
●●●
●●●
●●
●●●●
● ●●●●
●●
● ●●●● ● ●
●●
●● ●● ●
●●●●●
●●●●●●● ● ●
●●
● ● ●
● ● ● ●● ●●● ●
●●●● ● ● ● ●●●●● ●●
● ●
●
●●
●●●●●
●● ●
●● ●●●
● ●● ●● ●●●●●● ●
●
●●●●
● ●
●●
● ● ●
● ●●
●●●●●
● ●●
●
●●●● ●
●
●
●●
●●
● ●
●●● ●
●
●●
●●
●
●●
●●
●
●●●
●●
●●●●●
●
●●●●●●
●●●
●●●●
●●●●
●
● ●
●
●●
●●● ●
●●
●
●●●●●●
●●
●
● ●●●●● ●●●●●●●●●
●
●●
●● ●● ●●
● ●●●●● ●●●
● ● ●
● ● ●● ● ●● ●● ●●● ●
●●● ● ●● ●●●● ●●
●●
●●●●
● ●●
● ●●●●● ●●● ●●
●
● ● ●
●
●●
●
● ●●●● ●● ● ●●
●● ●●●●
●
●● ●●
●●●●●
●
● ●
●●●●
●●
● ●●●●●
●
●
● ●
●
●●
●● ●●
● ●
●●●● ●
●●●●●●
●●
●
●● ●
●
●
●●● ●● ●●●●●●● ●●● ●●
●● ●● ● ● ● ● ● ●●
●● ● ●● ● ●
● ● ● ●●● ● ●●● ●●
●●● ●
●●
● ●● ●●
●●● ● ●●● ●●
●
●
●●● ●● ●
●●
●●●
●●●●
●●●●
●
●
●
●●●
●●
●●●●●
●●
●●
●
●●
●
●
●●●●●
●
●
●● ●
●●
●
●●
●●●●
●●●
●
●
●●
●●●●
● ●●●●●
●●
●●●●●●●
●●●
● ●●
●●
●●●
● ●●
●●
●
●●●●● ● ● ●●
●
●● ●●● ●●
●●●●
● ● ● ●
●●
●
●●●
●●●● ●●● ● ●
●
● ● ● ● ● ●●● ●● ● ●●● ●●●●
●
●●
● ●
● ●●● ●● ●●●●
● ●●
●●●●●●
●●●● ●
●●
●
●
●
●●
●●
●●
●
● ●●●●●●●
● ●●
●
● ●
●●●● ●●●●
●●●● ●
●
●
●●
●
●●
●
●
●
●
●
●
● ●●●●●
●● ●●●● ●●●●●
●● ● ● ●●●●●●●● ● ● ●●●● ● ●● ●
● ● ● ● ● ● ● ●
● ● ● ● ●
●● ● ● ●●● ● ●● ●
● ●●
● ● ● ● ●●
●
●●
●● ● ● ● ● ● ● ●● ●● ● ●● ● ●
● ●●● ● ●●●● ● ● ●●●
●●●● ● ●● ● ●
● ●●
●● ●● ● ●●●●● ●●
● ●● ●●
●●●
●● ● ●● ● ●● ●● ●●●●
● ●● ●● ●●●● ●● ●● ● ●
● ●● ●● ●● ●
●●● ● ●●
●● ● ●● ● ● ●● ● ●●
●● ●● ●●●●● ●●● ●●● ●● ●●● ● ● ● ● ●●●● ●●●
● ●●●
● ●● ● ● ●● ●
●●● ●● ● ●
●● ● ●● ●●● ●● ● ● ●● ● ● ● ● ● ●● ●
● ●●●● ● ● ● ●
●
● ●● ●● ● ● ● ● ●● ●● ● ● ● ● ●●● ●● ●
● ●● ●● ● ● ●●● ●● ● ●● ●● ● ●● ●
● ●● ● ● ●● ●●
●
● ●
●●●● ● ●● ● ●●● ● ● ● ● ●● ●● ●
●
● ● ●●● ● ●● ●
● ● ● ●● ●● ● ●
● ●
●
−2
−2
●●
● ● ●
●● ● ● ●
● ● ● ● ●
●
● ●
−4
−4
−6
−6
1e−01 1e+01 1e+03 1e+05 1e−01 1e+01 1e+03 1e+05
Figure 2.13: Two exemplary MA plots of preliminary expression data of a bat cell line
(Myotis daubentonii) infected with a Rift Valley fever virus (RVFV) clone. The non-infected
samples (mock) were compared against the virus-infected samples 6 h and 24 h post infection
(p.i.), shown in subfigures (A) and (B), respectively. The plots show the normalized mean read
count (x-axis) vs. the log2 fold change (y-axis) for each analyzed gene (each dot corresponds to
one gene). Red dots indicate significantly (p-value < 0.1) differentially expressed genes as
calculated by DESeq [81]. At 6 h p.i. only a few genes are differentially expressed, whereas at
24 h p.i. we observe a dramatic reaction of the bat cells to the infection with the RVFV clone.
Preliminary data was obtained from [15].
genes with the same length). Nevertheless, the normalization for different library
sizes is crucial to estimate good RNA abundances to call DEGs.
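The effect of library-size normalization on the two axes of an MA plot can be illustrated with a minimal sketch. It assumes plain total-count scaling and a small pseudocount, both chosen here for illustration only; DESeq itself estimates size factors with a median-of-ratios approach, and the function name and arguments are hypothetical.

```python
import math


def ma_coordinates(counts_a, counts_b, pseudocount=0.5):
    """Return {gene: (mean normalized count, log2 fold change)}.

    counts_a and counts_b map gene names to raw read counts of two samples.
    Assumption: counts are scaled to the mean library size of both samples
    (simple total-count scaling, not the DESeq size-factor estimate).
    """
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    target = (total_a + total_b) / 2  # common library size after scaling

    coordinates = {}
    for gene, raw_a in counts_a.items():
        norm_a = raw_a * target / total_a
        norm_b = counts_b.get(gene, 0) * target / total_b
        mean_count = (norm_a + norm_b) / 2  # x-axis of the MA plot
        # The pseudocount avoids log2(0) for genes without reads.
        log2_fc = math.log2((norm_b + pseudocount) / (norm_a + pseudocount))
        coordinates[gene] = (mean_count, log2_fc)  # y-axis: log2 fold change
    return coordinates
```

If one sample is simply sequenced twice as deeply as the other, all raw counts double, but the normalized log2 fold changes stay at zero, which is the behavior the normalization is meant to guarantee.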
Chapter 2. Welcome to the Black Box
eukaryotic cell line and comprising for example two different conditions (untreated,
infected), three time points and four biological replicates already results in the se-
quencing of 24 samples. The current Ensembl annotation of the human genome
(version 85) consists of 58,051 genes from which 19,961 are protein-coding. In a
differential gene expression analysis we can now compare all expressed genes be-
tween different conditions and time points, resulting in an overwhelming amount of
data. Genes can be further analyzed for differentially expressed isoforms and clustered
according to their function. With a de novo gene prediction, one of the major advantages
of RNA-Seq data over microarrays, an incomplete annotation can be extended, so that even
more genes may be involved. The use of different library protocols (like Ribo-Zero and
smallRNA) increases the complexity of such an RNA-Seq study even further. Researchers are
easily overwhelmed, especially if they are not familiar with all the bioinformatical methods
and the statistical background at work inside the Black Box (Fig. 2.2). Important
observations may be lost in the huge variety of results. Therefore, the presentation
and visualization of all results obtained not only from an NGS experiment, but also
from other data-rich projects, is crucial to allow other researchers to understand,
interpret, and investigate the obtained data successfully.
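How quickly the number of samples grows with the experimental design can be made explicit with a small sketch. The factor levels follow the example in the text (two conditions, three time points, four biological replicates); the time-point labels and the naming scheme are placeholders introduced here.

```python
from itertools import product


def design_samples(conditions, time_points, replicates):
    """Enumerate one sample name per combination of the design factors."""
    return [
        f"{condition}_{time_point}_rep{replicate}"
        for condition, time_point, replicate in product(
            conditions, time_points, replicates
        )
    ]


# Example design from the text; "t1".."t3" are hypothetical labels.
samples = design_samples(
    ["untreated", "infected"], ["t1", "t2", "t3"], [1, 2, 3, 4]
)
print(len(samples))  # 2 * 3 * 4 = 24 samples
```

Every additional factor, for example a second library protocol, multiplies this number again.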
[Figure: examples of result visualization, arranged from "General Overview" to "Specific Observations" (panels A to G). Recoverable labels: gene expression heatmaps with the sample groups control, atRA and vitamin D; control, A. fumigatus, C. albicans and E. coli; EBOV and MARV at 3 h, 7 h and 23 h; a codon-level sequence alignment; sense and antisense tracks for IL6; and plots of mean expression and log fold change for IL12A and IL12A-AS1.]
2.6. Visualization
Chapter 3
Genome Assembly
protein- and non-coding genes for all assemblies included in this study and compared
the results with the available annotations from NCBI. By combining the output of
different assembly strategies, we were able to improve the initial assembly; however,
we could not close all gaps. This section is accompanied by a comprehensive
Electronic Supplement¹.
3.1. Assembly of the whole-genome of Chlamydia gallinacea
[Figure 3.2, panels A and B: dot plots of the scaffolds NODE_1_length_643147, NODE_2_length_228815, NODE_3_length_185839 and NODE_4_length_7610 (y-axis) against the reference Chlamydia avium (x-axis).]
Figure 3.2: Comparison of dot plots before (A) and after (B) rearrangement with CAR [96]. (A)
Dot plot between de novo assembly of C. gallinacea (y-axis) and the C. avium genome as reference
(x-axis). (B) Dot plot between the rearranged de novo assembly of C. gallinacea (y-axis) and
the C. avium genome as reference (x-axis). With the help of the reference the de novo assembled
scaffolds could be ordered and used for primer design to close remaining gaps. Red dots represent
sequence homology between the query and the reference in the same orientation, blue in antisense
orientation. The order corresponds with the de novo predicted scaffold order, see Fig. 3.1.
calculated by the assembly tool, but at the polymorphic locations, the assembler was
not able to decide on a unique path in the de Bruijn graph (Sec. 2.2). Finally, with
additional Sanger sequencing, we were able to close the remaining gaps.
Here, we present the whole-genome sequence of the C. gallinacea type strain
08-1274/3 that consists of the 1,059,583 bp chromosome with 914 CDS coding for
proteins and plasmid p1274 sized 7,619 bp with 9 protein-encoding sequences.
to determine the putative genomic order of the resulting scaffolds (Fig. 3.1). The
genomic order was further confirmed with CAR [96], using the genome of C. avium
(GCF_000583875) as a reference (Fig. 3.2).
Based on both visualization methods, flanking scaffolds were determined and
corresponding primer sites were selected to close the gaps. The primers were used
in PCR to generate DNA fragments of 600–800 bp (Gap 1) and 1,300–1,500 bp (Gap
2), which were sent to Eurofins Genomics (Paris, France) for Sanger sequencing.
Alignment of Sanger sequences to Scaffolds 1–3 using BLAST and Mafft [97] fi-
nally enabled closure of the gaps. The complete chromosomal sequence was sized
1,059,583 bp. Provisional annotations using Prokka [98] revealed 914 protein-
encoding genes and 46 non-coding RNAs, including amongst others 39 tRNAs, 3
rRNAs and 1 tmRNA. The size of plasmid p1274 has been determined as 7,619 bp
with 9 proteins encoded. The average G+C content of the genome is 37.9 mol-%.
This is the first report of a completely assembled genome sequence of C. gallinacea.
It can serve as a reference genome for future studies.
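The reported G+C content is a simple base-composition statistic. A minimal sketch of how such a value can be computed from a sequence string is given below; ignoring ambiguous bases such as N is an assumption made here, not a statement about how the reported 37.9 mol-% was derived.

```python
def gc_content(sequence):
    """Return the G+C content in percent, counting only A, C, G and T.

    Assumption: ambiguous symbols (e.g. N) are ignored in both the
    numerator and the denominator.
    """
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return 100.0 * gc / (gc + at)
```

Applied to the concatenated chromosome sequence of a FASTA file, this gives the genome-wide value; applied in sliding windows, it gives a G+C profile along the genome.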
The updated sequence data of C. gallinacea type strain 08-1274/3 WGS and
its plasmid p1274 have been deposited in NCBI GenBank under accession numbers
CP015840 and CP015841, respectively.
3.2. Comprehensive insights in the MAP genome
MAP Type I strains predominantly infect ovine hosts; MAP Type II strains
principally infect cattle but also deer, goat, sheep, and other ruminants [109–111].
MAP Type III strains (intermediate) are closely related to Type I strains and have
been isolated up to now from sheep, goat, cattle and camels [112–116].
In another study, we evaluated the potential association between the genotype of
individual field strains belonging to the MAP-C group (Type II) and the presence
of macroscopic intestinal lesions characteristic of paratuberculosis in the infected
animals [8]. Overall, 88 MAP-C isolates were sampled from clinically healthy cows
at slaughter. Cows were grouped as A (n=46) with, and B (n=42) without macroscopic
intestinal lesions. Fisher's exact test was applied to determine whether specific
genotypes were more strongly associated with one of the two groups of isolates (A
and B). Groups were considered significantly different if the probability of error
was lower than 5 %. MAP isolates from groups A and B exhibited similar strain
diversity: 20 and 18 combined genotypes, respectively, and 32 genotypes altogether.
Six of these genotypes were detected in both groups. Although no association was
found between individual combined genotypes and the presence of macroscopic intestinal
lesions, IS900-RFLP-(BstEII)-Type-C1 (the most common type worldwide) was found more
often in group A (p<0.01). Further studies will have to elucidate which genomic
structures and variations, regulatory elements, and phenotypic or functional
characteristics of an MAP isolate are fundamental for its virulence.
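The test used for the group comparison can be sketched for a single 2x2 table (genotype present or absent versus group A or B). The sketch below implements the common two-sided definition of Fisher's exact test with the standard library only; the per-genotype counts of the study are not given in the text, so any counts passed to the function are the reader's own.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows could be the groups A and B, columns the presence or absence of
    a genotype. The p-value is the summed probability of all tables with
    the same margins that are at most as probable as the observed one.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def probability(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    observed = probability(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # The small relative tolerance guards against floating-point ties.
    return sum(
        probability(x)
        for x in range(lowest, highest + 1)
        if probability(x) <= observed * (1 + 1e-9)
    )
```

For the classic table [[3, 1], [1, 3]] the function returns 34/70, about 0.486. With the 5 % threshold used above, a genotype would be called group-associated only if the returned value is below 0.05.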
Heavy shedding of MAP into the immediate environment of diseased animals and the
high tenacity of MAP within the environment [117] increase the risk of exposure to
MAP for ruminants and also for other mammals. For instance, MAP was detected in tap
water, rivers, dams [118, 119], and also in raw milk [120]. Furthermore, MAP was
isolated from a clinically diseased donkey [121]. MAP has also been isolated from
humans; its etiologic role in Crohn's disease is under discussion [122, 123].
About ten years ago, the first sequence data of MAP strains were published. Isolate
MAP K-10 from the U.S., belonging to the MAP-C group, was fully sequenced [124].
Later, it was re-sequenced and its annotation improved by Wynne et al. [125]. Recently,
the ovine-derived isolates CLIJ361 (MAP-S, Type I) from Australia [126] and MAP S397
(MAP-S, Type III) from the U.S. [127], as well as the first human-derived isolate MAP4
(MAP-C) from the U.S. [128], have been sequenced.
The availability of these MAP-S genome sequences, although not fully assembled,
improved the informational value of genome comparisons, which are no longer based
only on MAP K-10 and M. avium strain 104. In the meantime, MAP-S and MAP-C
specific loci, genome deletions and insertions have been identified and evolutionary
relationships proposed [104, 127, 129–132].
Besides the comparative whole genome sequence analysis, in the past decade
non-protein-coding fractions of the transcriptome were studied in bacteria [133–135].
Regarding Mycobacteria, non-coding RNAs (ncRNAs) were identified in M. tuber-
culosis and their role in the regulation of the pathogen metabolism was studied [136,
137]. Furthermore, RNA sequences were analyzed in M. avium including MAH and
MAA [138]. Until now, no data have been published on the full set of ncRNAs in
MAP.
The objective of this study was to sequence a further MAP-S strain: the ovine
derived strain JIII-386 from Germany (Europe), and to compare sequence data with
seven assemblies of related genomes from other continents to examine previously
defined genomic differences between MAP-S and MAP-C strains (Type I/III and II,
respectively). Complete genome sequences of a bovine derived MAP-C strain (also
from Germany) and the M. avium strain 104 were included. A genome-wide annota-
tion of protein coding sequences (CDS) was performed by using two data resources,
NCBI and BacProt. For the first time, a comprehensive annotation of regulatory
RNAs in MAP was performed. Based on the current data analysis, we aimed to uncover
new aspects of the proposed ancestral relationships of M. avium complex strains and
indications of an evolution or conservation of regulatory RNAs.
Isolation, identification and characterization of the MAP-S, Type III isolate JIII-386
were described by Möbius et al. [114]. The strain was isolated in 2003 and belongs
to the strain collection of the Friedrich-Loeffler-Institut in Jena (Germany). Briefly,
JIII-386 was isolated from ileal mucosa of a sheep from a migrating herd in the
north-west of Germany. The animal showed no clinical symptoms. Paratuberculosis
was suspected based on positive serological results, detection of MAP in feces by
culture, and pathomorphological and histological results after necropsy. JIII-386
had been cultured using modified Middlebrook 7H11 solid medium (Difco) con-
taining 10 % OADC, Amphotericin B, and Mycobactin J (Allied Monitor, Fayette,
USA). Subcultivation was done on modified Loewenstein-Jensen solid medium, also
supplemented with Mycobactin J.
Additionally, MAP strain JII-1961 (MAP-C) isolated from cattle in 2003 and
sequenced and assembled on the chromosomal level at the Helmholtz Centre for
Infection Research (Braunschweig, Germany) was included and annotated in this
study [10]. This isolate originated from the ileocaecal lymph node of a clinically
diseased dairy cow from a paratuberculosis positive herd in eastern Germany. JII-
1961 was isolated and subcultivated using Herrold’s Egg Yolk Medium (HEYM)
supplemented with Mycobactin J.
JIII-386 had been grown for up to seven months, strain JII-1961 for up to six
weeks. Both isolates were characterized by positive acid-fast staining and their
growth characteristics and were confirmed as MAP by culture-based confirmation of
mycobactin dependency and PCR detection of the IS900 insertion sequence [139].
The genotypes of isolates were determined [114] by multi-target genotyping based
on IS900 -RFLP (4 digestion enzymes)-, MIRU-VNTR (9 loci)-, and SSR (4 loci)-
analysis [140–142]. Isolates were expanded for sequencing on HEYM supplemented
with Mycobactin J. Genomic DNA was prepared by the cetyltrimethylammonium
bromide method described by Van et al. [143], and identity of the strain was con-
firmed by MIRU-VNTR-genotyping.
Sequencing
Whole-genome shotgun sequencing was performed. Illumina paired-end (fragment
size ∼300 bp) and mate-pair (fragment size ∼2.2 kb) libraries were generated from
fragmented genomic DNA of MAP strain JIII-386. Libraries were sequenced on Illumina
GAIIx (paired-end library) and HiSeq2000 (mate-pair library), yielding 28.6 million
101 bp paired-ends (∼1,100-fold genome coverage) and 10.9 million 100 bp mate-pairs
(∼440-fold genome coverage) (see Tab. S3, supplementary material online²).
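The fold coverage follows from the number of sequenced bases divided by the genome size. The sketch below assumes that every read pair contributes two reads of the stated length and uses 4,850,274 bp as genome size, the value listed in Table 3.1 for the six-scaffold assembly; both choices are assumptions made for this illustration.

```python
def fold_coverage(read_pairs, read_length, genome_size):
    """Naive coverage estimate: sequenced bases divided by genome size.

    Assumption: each pair contributes two reads of read_length bases and
    no reads are trimmed or discarded.
    """
    return 2 * read_pairs * read_length / genome_size


GENOME_SIZE = 4_850_274  # bp, assumed from Table 3.1 (six-scaffold assembly)

paired_end = fold_coverage(28_600_000, 101, GENOME_SIZE)
mate_pair = fold_coverage(10_900_000, 100, GENOME_SIZE)
print(round(paired_end), round(mate_pair))
```

This naive estimate gives roughly 1,190-fold and 450-fold coverage, which is in the same range as the values reported above. The remaining difference may stem from read trimming or filtering, which this sketch does not model.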
Assembly improvement
To improve the assembly (II), the following steps were applied to handle low-coverage
regions, low-quality reads and misassemblies, to replace gap regions, and to connect
scaffolds. Four additional de novo genome assembly tools were run separately on
both libraries: Velvet (v1.2.10, k=55) [51], ABySS (v1.3.4, k=45) [52], SPAdes
(v2.5.1, k=43,55,65; mainly implemented for single-cell data) [54], all de Bruijn
graph based [48], and the JR-Assembler (v1.0.3, default parameters) [145], which is
based on a seed-and-extend approach. The resulting contigs (>1,000 bp) were merged and
clustered by sequence similarity using CD-HIT-EST (v4.6, -c 0.95) [146] to reduce
redundancy. Statistical information and each assembly can be retrieved from the
supplementary material online (Tab. S4).
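The length filter applied before merging the contigs can be sketched in a few lines. The minimal FASTA reader below is written for this illustration only, and the clustering step itself (CD-HIT-EST) is not reproduced.

```python
def read_fasta(lines):
    """Yield (name, sequence) tuples from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def filter_contigs(records, min_length=1000):
    """Keep only contigs strictly longer than min_length bases."""
    return [(name, seq) for name, seq in records if len(seq) > min_length]
```

With an open file handle, `filter_contigs(read_fasta(handle))` returns the contigs that would enter the merging step; the contigs of all assemblers can then be concatenated into one list before clustering.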
Related genomes
Related genomes served as reference genomes in the current study to assist in as-
sembly, open reading frame (ORF) predictions and annotation. Furthermore, they
were used for comparison of different MAP types including strains originating from
different geographic regions of the world. The selection comprises the genomic se-
quences of the three MAP-C (Type II) strains: K-10/K-10’ [124, 125], MAP4 [128]
and JII-1961 (Möbius et al., unpublished), two sheep derived MAP-S strains, one of
Type I: CLIJ361 [126] and one of Type III: S397 [127], as well as one MAH strain:
M. a. strain 104 designated as MAH 104 [147]. Strains K-10, MAP4, and S397 originated
from the U.S., strain CLIJ361 from Australia and strain JII-1961 from Germany.
² http://www.rna.uni-jena.de/supplements/mycobacterium/
Genotypes of investigated MAP isolates within this study are shown in Tab. S2. Currently,
these three MAP-C strains are available. The two ovine isolates are available at contig
level: S397 which comprises 176 contigs and CLIJ361 based on 1,147 contigs (draft genomes).
[Figure: taxonomic overview of the investigated strains. Genus: Mycobacterium, with the MAC (Mycobacterium avium complex) and the MTBC (Mycobacterium tuberculosis complex). Species: Mycobacterium avium (M.a.) and Mycobacterium tuberculosis. Subspecies: M.a. subsp. paratuberculosis (MAP) and M.a. subsp. hominissuis (MAH). Type: MAP-C (Type II) and MAP-S (Type I/III), the latter split into Type I and Type III. Strains with assembly level: K-10, K-10', MAP4 and JII-1961 (full), CLIJ361 (1,147 contigs), S397 (176 contigs), JIII-386 (6 scaffolds) and 104 (full); a further row refers to available annotation files.]
3.2.3 Annotation
Annotation of protein-coding sequences (CDSs)
Annotations for MAP strains K-10, K-10’, MAP4, S397 and strain MAH 104 were
downloaded from NCBI (see Tab. S1). For reference based annotation of CDSs,
BacProt (unpublished data) based on Proteinortho [150, 151] was used to
complement present annotations. Furthermore, the novel open reading frame (ORF)
prediction of BacProt, which incorporates Shine-Dalgarno and Pribnow box motif
information, was applied. For each M. avium strain, re-annotated and previously annotated
ORFs as well as statistics such as codon usage and occurrence of Shine-Dalgarno
sequence motifs were calculated. For the ovine-derived strain JIII-386, the annotation
was complemented with data from Bannantine et al. [127] by using BLAST [67]
(v2.2.27+, E-value ≤ 10⁻⁴) with at least 90 % identity and an alignment length of
90 %. ORFs with sequence homology to genes with an assigned function in the NCBI
annotation were identified and designated as protein-coding sequences (CDS).
For each isolate, annotations provided by NCBI were merged with the BacProt
annotations. ORFs present or absent between two strains were identified by using
BLAST (E-value ≤ 10⁻⁴). All ORFs of strain A that could not achieve a sequence
overlap of at least 50 % in length and identity against the genome of strain B were
marked as present in A but absent in B. ORFs without an assigned function were
excluded from Tab. S15a. These data provide an overview of the different ORFs
present/absent between the investigated M. avium strains. Detailed analyses of
single genes and gene clusters as well as large sequence polymorphisms (LSPs) and
phylogenetic relationships were performed with more restrictive parameters (E-value
≤ 10⁻²⁰, alignment length ≥ 95 % of query, sequence similarity ≥ 90 %; depending
on the kind of analysis) and manual investigation of all BLAST results, alignments
and sequences.
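The presence/absence rule for ORFs can be sketched as a filter over tabular BLAST output. The sketch assumes the default 12-column tabular format (`-outfmt 6`: query, subject, percent identity, alignment length, ...) and reads the criterion as "alignment covers at least 50 % of the ORF with at least 50 % identity", which is one possible interpretation of the text. The function name and the dictionary of ORF lengths are hypothetical.

```python
def absent_orfs(blast_rows, orf_lengths, min_fraction=0.5, min_identity=50.0):
    """Return ORFs of strain A without a sufficient hit in strain B.

    blast_rows:  iterable of tab-separated BLAST hits (ORFs of A as queries,
                 genome of B as subject), default tabular column order.
    orf_lengths: {ORF name: length in bp} for all ORFs of strain A.
    """
    present = set()
    for row in blast_rows:
        fields = row.rstrip("\n").split("\t")
        query = fields[0]
        identity = float(fields[2])      # percent identity
        alignment_length = int(fields[3])
        if query not in orf_lengths:
            continue
        covered = alignment_length / orf_lengths[query]
        if covered >= min_fraction and identity >= min_identity:
            present.add(query)
    # Everything without a qualifying hit counts as present in A, absent in B.
    return sorted(set(orf_lengths) - present)
```

The stricter settings for the detailed analyses correspond to calling the function with `min_fraction=0.95` and `min_identity=90.0`; the E-value cut-off would be applied by BLAST itself.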
Single nucleotide variants (SNVs) were searched by pairwise comparison of protein-coding
sequences of the eight investigated genomes. First, BLAST (E-value ≤ 10⁻⁴)
was used to assign homologous sequences between two strains, which were aligned
in a second step using MAFFT (v.7.017b, method: L-INS-i) [152]. The resulting
alignments were searched for SNVs with custom Ruby scripts.
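The last step, scanning a pairwise alignment for SNVs, can be sketched as follows. This is an illustrative Python version under the assumption that gap columns are not counted as SNVs; it is not the script used in the study.

```python
def find_snvs(aligned_a, aligned_b, gap="-"):
    """Return (column, base in A, base in B) for every mismatch column.

    Both arguments are rows of the same pairwise alignment. Columns are
    1-based alignment positions; columns containing a gap are skipped.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    snvs = []
    pairs = zip(aligned_a.upper(), aligned_b.upper())
    for column, (base_a, base_b) in enumerate(pairs, start=1):
        if base_a == gap or base_b == gap:
            continue  # insertions/deletions are not single nucleotide variants
        if base_a != base_b:
            snvs.append((column, base_a, base_b))
    return snvs
```

For the two alignment rows `ATG-CA` and `ACGTCA` the function reports a single SNV at column 2 (T versus C), while the gap in column 4 is ignored.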
The presence or absence of 35 large sequence polymorphisms (LSPs), each con-
taining several ORFs and previously reported by Alexander et al. [132] and Bannan-
tine et al. [127], were examined by using BLASTn+ across the investigated strains.
Annotation of ncRNAs
NcRNAs were annotated by homology search of Rfam (v.11.0) [153] families using
the GORAP pipeline [69], which currently comprises Infernal (v1.1) [68], Bcheck
(v0.6) [158], RNAmmer (v1.2) [159] and tRNAscan-SE (v1.3.1) [160] for the detection
of different ncRNA classes. Within the pipeline, family-specific parameters and
several filter steps based on taxonomy, secondary structure and primary sequence
comparison were used. To compare the number of ncRNAs, GORAP was additionally used
to annotate ncRNAs in two well-known bacteria: E. coli and S. entericus. All
resulting Stockholm alignments were hand-curated with the help of Emacs RALEE
mode [161].
Table 3.1: General genome features of the different Mycobacteria strains. The number of ORFs
with a homologous sequence in NCBI (homologous ORFs) and additionally hypothetical ORFs,
both predicted by BacProt, are provided. NcRNAs and riboswitches were annotated by homol-
ogy search of Rfam (v.11.0) [153] families using the GORAP pipeline [69], see Sec. 3.2.3. For further
information (FASTA, GFF, STK files) see supplementary tables S11, S13, S19 and S21, supple-
mentary material online. chr – chromosome; N50 – length of the shortest contig/scaffold, so that
at least 50 % of all bp in the assembly are represented by this and all longer contigs; ORF – open
reading frame; ? – candidate, further analysis needed.
General Features
Genome (bp) 4,829,781 4,832,589 4,829,424 4,829,628 4,850,274 4,813,711 4,612,386 5,475,491
Assembly 1 Chr 1 Chr 1 Chr 1 Chr 6 Scaff 176 Con 1,147 Con 1 Chr
N50 n.a. n.a. n.a. n.a. 1,245,802 56,150 7,088 n.a.
Max contig 4,829,781 4,832,589 4,829,424 4,829,628 1,505,968 137,410 49,981 5,475,491
G+C (%) 69.3 69.3 69.3 69.3 69.16 69.31 68.96 68.99
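The N50 definition given in the caption of Table 3.1 translates directly into code. The following sketch takes a list of contig or scaffold lengths; the function name is chosen here for illustration.

```python
def n50(lengths):
    """Length of the shortest contig such that this contig and all longer
    ones together cover at least 50 % of all bases in the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contig lengths given")
```

For contigs of 40, 30, 20 and 10 bp (100 bp in total), the two longest contigs already cover 70 bp, so the N50 is 30. For a genome assembled into a single chromosome the statistic is not informative, which is why Table 3.1 lists n.a. in those columns.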
Figure 3.4: Genome comparison of K-10’ (top), JIII-386 (middle) and S397 (bottom) calculated
with Mauve. Colored blocks connected by lines indicate homologous regions which are internally
free from genomic rearrangements. White areas within blocks indicate sequence regions of lower
similarity. Blocks below the center line are aligned reverse complementary. A detailed Fig. S10a
is available in the supplementary material online.
Table 3.2: Annotations obtained from NCBI and those additionally calculated using BacProt
lead to an extended annotation for each investigated M. avium (last column). In the second
lines (bold): only predicted ORFs with homology to genes with an assigned function in the NCBI
annotation are shown (CDSs). Corresponding – ORFs identified by BacProt and NCBI originating
from same positions in the genome; Start/End shifted – ORFs identified by BacProt and NCBI
but with differences in length (only 5’ or 3’); NCBI/BacProt only – ORFs identified only by
NCBI/BacProt; Extended – total number of ORFs (combination of NCBI + BacProt only). All
*.gff files are provided in the supplementary material online, Tab. S1, S11 and S12.
subsp.  host  strain  NCBI  BacProt  Corresponding  Start shifted  End shifted  NCBI only  BacProt only  Extended
MAP-C         K-10    4350  4048     2332           411            458          1149       847           5197
                      1146  3096     998            60             77           11         1961          3107
              K-10'   4394  4048     2374           432            433          1155       875           5269
(see Fig. S10b). Further analysis showed that these two regions comprise 16 and 15
CDSs, respectively, which are indeed absent in the investigated MAP-C genomes but
present in MAP-S and also in MAH 104 (see Tab. S17).
Table 3.3: Distribution of 10 large sequence polymorphisms (LSPs) in M. avium strains, previously
described to be present in MAP-S but absent in MAP-C. Labels and locations according to Ban-
nantine et al. [127]. LSPS 8 was only partially detected with an alignment length of 692 bp in all
MAP-C strains. Homologous FASTA sequences for LSPS 1–10 of MAP JIII-386 as well as further
details and additional information about the distribution of 25 other LSPs [132, 164] can be found
in the supplementary material, Tab. S14a and b. full – full-length hit; part – partial hit; * – all
ORFs comprised by the LSPS are present but split on different contigs or genomic locations.
For all strains except MAP S397, BacProt identified fewer ORFs than are currently
provided in the NCBI annotations; however, additional ORFs were found (Tab. 3.2).
Both annotations were merged to generate an extended prediction of ORFs (Tab. 3.2,
last column). Both annotations include a large fraction of ORFs without an assigned
function; therefore, a second line was added to Tab. 3.2 for each strain, which
lists only the ORFs with an assigned function (CDSs).
Approximately 49 % (first line) and ∼62 % (second line) of all genes (extended
panel) were annotated at the same positions by BacProt and NCBI (corresponding
genes), whereas ∼20 % (first line) and ∼9 % (second line) of all genes were found at unique
positions with either BacProt or NCBI only. The intersection of corresponding
ORFs seems to be more reliable, although the ORFs additionally detected by
BacProt were of special interest for further analyses.
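The categories used in Tab. 3.2 can be illustrated with a minimal sketch. It assumes that ORFs are compared purely by coordinates and strand; the names `Orf` and `compare_annotations` and the toy coordinates are hypothetical and are not taken from BacProt or the NCBI pipeline.

```python
from collections import Counter
from typing import NamedTuple


class Orf(NamedTuple):
    left: int    # smaller genomic coordinate
    right: int   # larger genomic coordinate
    strand: str  # "+" or "-"


def five_prime(orf):
    # The 5' end (start codon side) depends on the strand.
    return (orf.strand, orf.left if orf.strand == "+" else orf.right)


def three_prime(orf):
    # The 3' end (stop codon side) depends on the strand.
    return (orf.strand, orf.right if orf.strand == "+" else orf.left)


def compare_annotations(ncbi, bacprot):
    """Count ORFs per category (simplified: one-to-one matches assumed)."""
    counts = Counter()
    ncbi_set = set(ncbi)
    by_three_prime = {three_prime(o): o for o in ncbi}
    by_five_prime = {five_prime(o): o for o in ncbi}
    matched = set()
    for orf in bacprot:
        if orf in ncbi_set:
            counts["corresponding"] += 1
            matched.add(orf)
        elif three_prime(orf) in by_three_prime:
            counts["start shifted"] += 1   # same stop, different start
            matched.add(by_three_prime[three_prime(orf)])
        elif five_prime(orf) in by_five_prime:
            counts["end shifted"] += 1     # same start, different stop
            matched.add(by_five_prime[five_prime(orf)])
        else:
            counts["BacProt only"] += 1
    counts["NCBI only"] = len(ncbi_set - matched)
    counts["extended"] = len(ncbi_set) + counts["BacProt only"]
    return counts


# Toy data (hypothetical coordinates)
ncbi = [Orf(100, 400, "+"), Orf(500, 800, "+"), Orf(900, 1200, "-"), Orf(2000, 2300, "+")]
bacprot = [Orf(100, 400, "+"), Orf(530, 800, "+"), Orf(870, 1200, "-"), Orf(3000, 3300, "-")]
counts = compare_annotations(ncbi, bacprot)
print(dict(counts))
```

The same bookkeeping can be checked against the first K-10 line of Tab. 3.2: 2332 + 411 + 458 + 1149 = 4350 NCBI ORFs, and 4350 + 847 = 5197 ORFs in the extended annotation.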
For all ORFs annotated by BacProt, the Shine-Dalgarno sequence [...] and involved in the recognition of [...]
[Figure: sequence logos of the Shine-Dalgarno motif (y-axis: bits; x-axis: positions 0–5), with panels labeled S397, JII-1961 and MAH 104; the recoverable motif text reads AGCTGG.]
Table 3.4: New large sequence polymorphism (LSP) regions, extending and revising previously described regions, present in MAP-S but absent in MAP-C.
The number of novel ORFs additionally predicted by BacProt, with no overlap with previously annotated MAPs ORFs, is listed. For further
information about genomic positions of homologous ORFs (CDSs) in MAP JIII-386 and gene annotation see Tab. S17. # ORFs – number of ORFs including
homologous as well as hypothetical ORFs.
LSP                     LSPS included* (genomic region)            New LSP    Island size (bp)  Including MAPs (MAP-C negative)  # ORFs  # ORFs (BacProt)  In MAP-S  In MAP-C  In MAH 104
                                                                   LSPS Ia+b  34,377            MAPs_15870–16180                 31      +5                yes       not       partly
new (this study)        2 ORFs of LSPS 1*                          LSPS Ia    10,227            MAPs_15870–15950                 9       +2                yes       not       not
extended (this study)   23 ORFs of LSPA 4-II**; 8 ORFs of LSPS 1   LSPS Ib    24,150            MAPs_15961–16180                 22      +3                yes       not       yes
extended (this study)   LSPS 2 + 4* / LSPA 18**                    LSPS II    16,392            MAPs_46170–46350                 18      +2                yes       not       yes
    new (BacProt): MAPs_46241–46242†, 1
extended (previously*)  LSPS 5 + 7* / GPL**                        LSPS III   16,015            MAPs_17580–17700                 12      +5                yes       not       partly
    new (BacProt): MAPs_17690†, 1; yes, not, not
                        LSPS 5 + 7* / GPL**                        LSPS IIIa  12,142            MAPs_17580–17680                 11      +4                yes       not       yes
                                                                   LSPS IIIb  3,873             MAPs_17690–17700                 2       +1                yes       not       not
Using this newly identified Shine-Dalgarno sequence and Pribnow box, ORFs with
and without known function were predicted by BacProt and are listed in Tab. S11 (see
GFF files).
In addition to the annotation of CDSs, an overview of codon usage for each
investigated mycobacterial strain is given in Tab. S13. As expected, the strains
are similar in their ratio of G+C-rich codons. The codon preference for
G+C likely reflects the high (almost 70 %) G+C content (see Tab. 3.1) of
M. avium genomes.
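As a hedged illustration of how overall G+C content and the G+C preference at the third codon position relate, the following sketch computes both for a short coding sequence. The function names and the example sequence are hypothetical; this is not the procedure used to produce Tab. S13.

```python
def gc_fraction(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T) of seq."""
    seq = seq.upper()
    total = sum(seq.count(base) for base in "ACGT")
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return (seq.count("G") + seq.count("C")) / total


def gc3_fraction(cds):
    """G+C fraction at the third position of each codon of an in-frame CDS."""
    third_positions = cds.upper()[2::3]
    return gc_fraction(third_positions)


# Hypothetical in-frame example: ATG GCC GAC GTG CGC TGA
example_cds = "ATGGCCGACGTGCGCTGA"
overall_gc = gc_fraction(example_cds)
third_gc = gc3_fraction(example_cds)
print(f"G+C: {overall_gc:.3f}, GC3: {third_gc:.3f}")
```

In G+C-rich genomes, the third codon position typically carries an even higher G+C fraction than the sequence as a whole, because synonymous changes at this position are least constrained by the encoded protein.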
Several regions that are present multiple times in the genome of JIII-386 were
discovered. Among these are insertion sequences (IS), which have previously been described
to act as transposable elements, contribute to the genomic diversity of Mycobacteria
[166–168] and are used as molecular epidemiological markers. Seventeen copies of
MAP-specific IS900 [169, 170] were verified in JIII-386, as in strains K-10 and S397
[127]. Additionally, six copies of ISMap02 were present in JIII-386, as described
before for K-10 and S397 by Bannantine et al. [127]. Two copies of ISMpa1 (not
three copies as detected by Olsen et al. [171] for MAP strains) were found in the
genome of JIII-386. Only five copies of IS1311 are present in JIII-386, instead of
seven copies as reported previously for K-10 and S397 [127]. Furthermore, we found
eight copies of IS1311 in the genomes of K-10’ and MAP4.
Annotation of ncRNAs
In the last decade, non-coding RNAs (ncRNAs) have gained importance as possible
regulators of cellular processes and virulence control [137, 172]. They were
characterized for M. tuberculosis in [136, 173]. For M. avium (including MAH and MAA),
two riboswitches as well as several antisense and intergenic transcripts have been
identified [138].
Hits for all known ncRNAs provided by the Rfam database, based on a screening
of the seven MAP genomes and MAH 104, are presented in Tab. 3.1. All corresponding
files are available in STK, GFF and FASTA format in the supplementary material
online, Tab. S19.
In general, ncRNAs among the investigated M. avium lineages are extremely
conserved (e.g. tRNAs and riboswitches; Tab. 3.1); there are only a few exceptions,
such as ASpks and ykkC-III, which differ in the number of detected copies between MAP-C,
MAP-S and MAH 104 and are discussed in detail below. The high conservation of
ncRNAs between the different M. avium strains is remarkable: even other bacteria,
such as the closely related strains of the obligate intracellular family Chlamydiaceae [90],
show more differences in their small ncRNA repertoire than the M. avium
strains presented here.
tmRNA, SRP RNA) were identified exactly once per genome (see Tab. 3.1).
Riboswitches. One third of the known riboswitches are present in MAP JIII-386:
Two copies of TPP and Cobalamin and one copy of SAM-IV, SAH and Glycine ri-
boswitches, respectively, were found in all investigated M. avium genomes (Tab. 3.1).
Additionally, two copies of SAH in MAH 104 were identified. The genome of CLIJ361
is lacking the TPP upstream of thiE.
Additionally, riboswitch features were found in several 5’-UTRs; however, a function
for these has not yet been confirmed: pan (synthesis of the vitamin pantothenate),
pfl (absent in MAH 104), ydaO-yuaA, ykoK and ykkC-III. The latter has
lost its second copy in MAP-C strains. Three of the riboswitches (SAM-IV, Cobalamin,
ykoK) have been reported previously in MAH, MAA [138] and M. tuberculosis
[136] and were confirmed in this study.
The pan RNA motif represents a conserved RNA structure previously identified
in only three bacterial phyla: Chloroflexi, Firmicutes and Proteobacteria [156].
Its secondary structure consists of one or two stem-loops containing two bulged
adenosines and is located in 5’-UTRs of genes involved in the synthesis of the vitamin
pantothenate. If the observed RNA motif is truly a pan-like sequence, it would
be the first discovery of this RNA family in Actinobacteria.
Other ncRNAs. Using GORAP and manual alignment correction, one PyrR binding
site was identified in each isolate, located upstream of a variety of genes
involved in pyrimidine biosynthesis.
With the exception of MAP strain CLIJ361 (likely due to the limited assembly
quality) one copy of 6C RNA was found within each investigated M. avium [155].
The Actino-pnp RNA motif was previously described as a conserved structure in
Actinobacteria, apparently located in the 5’-UTR of genes encoding exoribonucleases
[156]. For each investigated M. avium strain one copy of Actino-pnp was confirmed
in this study (Tab. 3.1).
The mraW RNA motif is a highly conserved RNA structure consisting of one
hairpin with a highly conserved terminal loop sequence 5’-CUUCCCC-3’. Previ-
ously, it was predicted in many Actinobacteria and particularly within Mycobac-
teria. MraW was detected twice in investigated genomes, one copy being located
consistently in the 5’-UTR of mraW genes and another copy, with similar secondary
structure features, located in a region with multiple types of mur genes which likely
form operons with mraW.
A study by Arnvig et al. [136] discovered at least nine putative small RNA families
in the genome of Mycobacterium tuberculosis by RACE analysis and Northern
blot experiments, resulting in four cis- and five trans-encoded ncRNAs. With GORAP
and a manual correction of Stockholm alignments, three of these ncRNAs were identified
in all of the studied Mycobacteria samples: ASdes, ASpks and F6 (Tab. 3.1).
Additionally, two homologous ncRNA classes were discovered: the trans-encoded
ncRNA G2, which has been lost in MAP-C strains, and the AS1890 alignment,
which achieved a very good bit score but lacked the antisense protein homolog
Rv1890c. These domains were described to act as cis-encoded and trans-encoded
ncRNAs [136].
ASpks and ASdes are involved in lipid metabolism by regulating the polyketide
synthase-12 (pks12) and fatty acid desaturase (desA1) genes, respectively. The pks12
gene contains two identical copies of ASpks, acting as antisense regulators of pks12
mRNA. In the current study, two clusters of potential ASpks ncRNAs were identified:
one cluster including two identical copies of the region encoding ASpks, as described
for M. tuberculosis [136], and a novel cluster comprising one copy (in MAP-S) or
two copies (in MAP-C and MAH 104). Within K-10’, ASpks homologs of the second
cluster were detected in two copies, localized in different but adjacent PKS genes:
pks7 and pks8. In addition to one copy of ASdes, located antisense to the desA1 gene,
we were able to find further copies in desA2.
6S RNA is a highly abundant ncRNA, which was initially identified in E. coli
[174] and was amongst the first small RNAs to be sequenced [175]; it is further
believed to be necessary in at least one copy for each bacterium. By binding to
the σ70-containing housekeeping RNAP holoenzyme, it inhibits a large number of
σ70-dependent genes and thus enables a better adaptation to stationary phase and
environmental stress [176–179]. Although 6S RNA is known for all bacterial branches
(except Deinococcus/Thermus, Chlamydiae, most Actinobacteria) [180], until now
no 6S RNA is known for Mycobacteria. Results of the current study confirm these
data: based on the analysis of the eight investigated genome sequences no 6S RNA
could be identified.
Using GORAP, about 80 ncRNAs were found in all investigated MAP strains and MAH 104
(see Tab. 3.1), whereas about 155 ncRNAs could be detected for E. coli and about 200
for S. entericus (see Tab. S20). Some ncRNAs not known for the latter
two bacteria are listed in Tab. S19. Possibly, MAP strains and MAH strain 104
contain more ncRNAs; however, they have not been studied intensively so far,
and transcriptome profiles for discovering novel, specific ncRNAs are lacking and
should be investigated in more detail in the future.
In 2013, Ignatov et al. [138] described the non-coding transcriptome of Mycobac-
terium avium resulting in 87 antisense and 10 intergenic small RNAs, which can
roughly also be expected for MAP strains.
Altogether, in the current study different copy numbers of ASpks, G2 and ykkC-III
were detected among MAP Type-S and -C.
Based on the multiple alignment of 70 ncRNAs, the phylogenetic reconstruction
(Fig. 3.6C) divides all MAP strains into MAP-S and MAP-C clusters, with a low
bootstrap support within the highly similar MAP-C strains.
LSPs (consisting of at least four ORFs as gene cluster) but also the gain or loss of
single genes were explored using comparative sequence analyses.
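The working definition of an LSP as a cluster of at least four ORFs can be sketched as a scan for runs of consecutive ORFs that are present in one group and absent in the other. This is only an illustration under simplifying assumptions: presence and absence are taken as given sets, ORFs are listed in genomic order, and all identifiers (`orf01` etc.) and the function name are hypothetical.

```python
def find_lsp_candidates(orf_order, present_in_s, present_in_c, min_orfs=4):
    """Return runs of >= min_orfs consecutive ORFs found in MAP-S but not MAP-C."""
    runs = []
    current = []
    for orf in orf_order:
        if orf in present_in_s and orf not in present_in_c:
            current.append(orf)
        else:
            if len(current) >= min_orfs:
                runs.append(current)
            current = []
    # Do not forget a run that extends to the last ORF.
    if len(current) >= min_orfs:
        runs.append(current)
    return runs


# Hypothetical ORFs in genomic order
orf_order = [f"orf{i:02d}" for i in range(1, 11)]
present_in_s = {"orf02", "orf03", "orf04", "orf05", "orf06", "orf08", "orf09"}
present_in_c = {"orf01", "orf02", "orf07", "orf10"}

candidates = find_lsp_candidates(orf_order, present_in_s, present_in_c)
print(candidates)
```

Shorter runs, such as the two-ORF stretch in the toy data, fall below the threshold and would be treated as the gain or loss of single genes instead.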
First, the presence or absence of 25 genome regions characteristically distributed
between isolates of M. a. subspecies [132] was confirmed in the examined genomes
(see Tab. S14b). LSPA 11 was missing in MAP-S strain JIII-386 from Germany, as
previously only reported for porcine MAP strain LN20 of sheep type originating
from Canada [132] – both belonging to MAP Type III.
Insertions. Ten regions of specific LSPs (LSPS 1 to LSPS 10) present in MAP-S
but absent in K-10 [127] were confirmed in S397 in this study and detected as homol-
ogous regions also in JIII-386 and CLIJ361 (Tab. 3.3 and supplementary material,
Tab. S14a and b). In contrast to Bannantine et al. [127], four out of these ten LSPs
(LSPS 3, 6, 9 and 10) were identified also in all MAP-C strains. Furthermore, only
LSPS 6 and LSPS 10 are absent in MAH 104. The distribution of the ten LSPSs is
MAP-type associated; it shows no differences between individual strains of MAP-S
or MAP-C regarding their geographical origin.
However, our analysis showed that LSPS 1 (9 kb) and LSPS 2 (6.6 kb) [127] are
subsets of previously described larger elements LSPA 4-II (28.9 kb) and LSPA 18
(16.4 kb) identified in MAH 104 and MAP-S, but absent from MAP-C [131]. LSPA 4-II
and LSPA 18 are related to the PIG-RDA20 and PIG-RDA10 regions detected
by Dohmann et al. [129]. Based on the newly assembled JIII-386, homologous
sequences of S397 and the merged annotation, 23 ORFs (MAPs_15961–16180) homologous
to LSPA 4-II sequences and comprising 8 ORFs of LSPS 1 were identified in
the examined MAP-S strains. Two ORFs of LSPS 1 (MAPs_15940 and 15950) were
absent in the genome of MAH 104 and no homologs were found in LSPA 4-II. This
region could be extended by six adjacent ORFs (MAPs_15870–15930) and additionally
by ORFs annotated by BacProt. A new LSP was defined: LSPS Ia (see Tab. 3.4),
comprising 11 ORFs, absent in MAP-C and absent in MAH 104. This LSP could
indeed represent an insertion, with genes encoding proteins involved in CoA
energy metabolism and tetracycline-controlled transcriptional activation. Furthermore,
LSPS 2 (MAPs_46190–MAPs_46270) matched LSPA 18; the nine ORFs of
LSPS 2 are homologous to ORFs MAV5227–5235 in MAH 104. LSPS 2 was combined
with LSPS 4, extended by 10 adjacent ORFs and newly designated as LSPS II (see
Tab. 3.4).
ORFs belonging to LSPS 5 and LSPS 7 (see Tab. 3.3) and additionally six adjacent
ORFs (MAPs_17621, MAPs_17622, MAPs_17680–MAPs_17710) were described
as a novel region in MAP-S and MAH 104 genomes by Bannantine et al. [127],
comprising also the GPL region (missing MAPs_17680–17710) published by Alexander
et al. [132]. This genome region was predicted to encode proteins involved in the
biosynthesis of glycopeptidolipids (GPL) [182]. GPLs are discussed to contribute to
the virulence of members of the Mycobacterium avium complex (MAC). Different
genes involved in the synthesis of GPLs would be expected to indirectly alter the
interaction of the bacterium with its host. We found that MAPs_17650, 17670,
and 17690 are homologous to the GPL genes mtfC, dhgA, and hlpA belonging to
the GPL biosynthesis cluster that is known to be diversely organized among individual
strains and subspecies of M. avium [182, 183]. In the present study the 14
ORFs were detected to be present in MAP-S, and absent in the examined MAP-C
strains. It was possible to assign a function for MAPs_17620, 17621, and 17690 (see
Tab. S17). Otherwise, MAPs_17690–17710 are absent in MAH 104, but genes dhgA
and mtfC are still present in the annotation of MAH 104. Altogether, this region
included in addition five BacProt annotated ORFs and was newly designated as
LSPS III (see Tab. 3.4).
A further region (21.3 kb) was identified and newly defined as LSPS IV, com-
prising 22 ORFs (MAPs_20550-20770). This LSP is present in JIII-386, S397,
CLIJ361 and MAH 104 (with 243 mismatches), but absent in MAP-C strains (see
Tab. 3.4). Additionally, in JIII-386 two ORFs were predicted on the opposite strand
by BacProt. Sequences of 15 ORFs (MAPs_20620-20770) are homologous to se-
quences of previously described LSP MAV-14 [132]; 7 adjacent ORFs of LSPS IV
(MAPs_20550-20610) are absent from MAV-14.
Deletions. Several deletions in MAP-S strains, which have already been described
earlier, were verified in the current study, but also differences were found. Three gene
clusters (LSPs) comprising 32 genes, annotated in MAP K-10 were previously char-
acterized to be absent in MAP-S isolates: MAP1432–MAP1438c (deletion s∆-1),
MAP1484c–MAP1491 (deletion #1) and MAP1728c–MAP1744 (deletion #2) [104,
127, 130, 131, 184]. Genes included in deletion s∆-1, deletion #1 and deletion #2
were shown to be absent in sheep strains from the United States [104, 127]; those of deletion
#1 and #2 were shown to be absent in Australian sheep strains [130]. In the current
study, genes of deletions #1 and #2 were also absent in JIII-386 from Germany and
CLIJ361 from Australia. But in contrast to sheep strain S397 from the U.S., the seven K-10
genes belonging to deletion s∆-1 were identified as being present in sheep strains
JIII-386 from Germany and CLIJ361 from Australia (genes MAP1433c–MAP1438c
in full length; MAP1432 with mismatches; see Tab. 3.5 and Tab. S16). Differences
regarding the presence or absence of deletion s∆-1 could reflect diversities among
MAP-S strains originating from different geographic regions of the world. Marsh et
al. [130] identified ORF MAP2325 in cattle strains but its loss (designated as dele-
tion #3) in Australian sheep isolate Telford 9.2 (MAP-S, Type I) using microarray
and confirmed this deletion #3 in 16 sheep strains by PCR. In contrast, MAP2325
was found to be present in MAP-S (Type III) isolates from the U.S. [104, 127]. This
discrepancy suggested a difference between MAP isolates recovered from sheep in
Australia and the United States. Furthermore, within the current study, MAP2325
could be found with 100 % sequence identity also in MAP-S strains JIII-386 (Type
III) from Germany as well as in CLIJ361 (Type I) from Australia, and confirmed
in all MAP-C isolates. Again, results reflect diversities within MAP-S group, but
could also partially indicate discrepancies between results of different methods (se-
quencing, microarray, PCR).
80 homologous CDSs were identified which are absent from K-10/K-10’ and 82
homologous CDSs which are absent from MAP4 and JII-1961. Tab. S15a presents in
detail gain and loss of genes detected in the current study comparing MAP genomes
and MAH 104. Forty genes with assigned functions (homologous genes) as well as 30
hypothetical genes, all previously described by Bannantine et al. [127] to be present in three sheep
isolates of MAP-S, Type III (from the U.S.) but absent in the K-10 strain (MAP-C), were
part of this analysis. However, 36 out of these 70 genes belonged to the ten MAP-S
specific LSPS regions also published by Bannantine et al. [127] including all ORFs
of LSPS 1, 2, 4, 5 and 7, and two out of four ORFs of LSPS 8 (see Tab. 3.3). Four
genes (hypothetical genes) were still present in MAP-C strains (Tab. S15b). For
nine out of the 30 above mentioned hypothetical genes it was possible to assign a
function based on homology. 34 additional ORFs were found in all MAP-S, but
absent in MAP-C, among them five ORFs which were annotated only by BacProt
(Tab. S17).
Altogether 80 CDSs (ORFs with an assigned function), present in MAP-S but
absent in MAP-C strains, were annotated in this study and are listed in Tab. S17. Nine
CDSs were also absent in MAH 104. Eight of these genes belong to the newly designated
LSPS Ia, possibly indicating a specific insertion region in MAP-S strains.
MAP-S (Type III) strains JIII-386 and S397 differed in the presence and/or
absence of altogether 33 CDSs (see Tab. S15a). In detail: 25 CDSs of S397 were
present in MAP-C and partially in CLIJ361 but absent in JIII-386 including four
ORFs (MAPs_23210–MAPs_23240), and six ORFs (MAPs_39450–MAPs_39500),
possibly representing specific deletions in JIII-386. The last gene cluster encodes
for three mammalian cell entry (mce) family proteins and virulence factor mce.
Mce genes were originally identified and studied in M. tuberculosis and have been
associated with survival within macrophages and increased virulence in this species
(see review of Behr et al. [185]). Eight CDSs of JIII-386 were present in MAP-C,
CLIJ361 and MAH 104 but absent in S397 and included 4 complete ORFs with an
assigned function (CDSs) of deletion s∆-1.
In contrast, MAP-C type strains showed high similarities regarding their gene
repertoire. Only two genes are absent from MAP4 (coding for an ATP/GTP-binding
integral membrane protein and a CsbD-like protein) and two other genes are absent
from JII-1961 (coding for inosine 5’-monophosphate dehydrogenase and a PE-PGRS
family protein, see Tab. S15a). As expected, with loss and gain of about 700 CDSs,
MAH strain 104 emerged as the most divergent strain among the investigated Mycobacteria
(see Tab. S15a). The large number of genes (up to 208 compared to
K-10’) absent in the MAP-S, Type I strain CLIJ361 includes a high number of false-negative
hits, most likely caused by the lower assembly quality. Probably, some of
the genes are present in the genome of CLIJ361 but could not be identified in nearly
full length and were therefore counted as absent from this strain. Better
assembled MAP Type I strains could therefore enable more reliable comparisons among
MAP-S Type I and III strains.
Table 3.5: Gene cluster comprising seven K-10 ORFs absent in S397 but present in JIII-386 and CLIJ361. Table based
on Bannantine et al. [127]. Homologous sequences of all ORFs were found on scaffold S02 in MAP JIII-386. For
additional information and BLAST results see Tab. S16.
ORF       Size (bp)  Description                     Note
MAP1432   1490       REP-family protein              *
MAP1433c  1745       3-oxosteroid 1-dehydrogenase    e
MAP1434   1118       putative phthalate oxygenase    e
MAP1435   713        short chain dehydrogenase
MAP1436c  782        putative oxidoreductase         e
MAP1437c  986        hypothetical protein            h
MAP1438c  983        probable lipase                 d, h
e – involved in energy metabolism
d – involved in degradation of macromolecules
* – partial hit (alignment length 1 484 bp) with mismatches
h – treated as hypothetical ORFs during analyses
PE/PPE/PGRS genes and mmpL5. The PE and PPE gene families are restricted to
mycobacteria, encode acidic, glycine-rich proteins and several of them are proposed to be
involved in antigenic variation and in the pathogenesis of infection [186–188]. They comprise
anywhere from 1 % of the genome (MAP) to nearly 10 % (M. tuberculosis) [189]. In this
study individual strains show six, seven or eight PE genes as well as 32 (JIII-386, S397) or 33
and 35 (MAP-C strains) PPE genes, annotated by BacProt
– there is no clear differentiation between MAP-S and -C strains possible. Fur-
thermore, PE_PGRS family protein genes – the largest sub-family of PE family
genes, also suggested to play an important role in the persistence of mycobacteria
and to be involved in antigenic variation and immune evasion [190] – were searched.
It was previously assumed that M. avium, including MAH and MAP, lacks these
PE_PGRS family protein genes [188, 191–194]. However, in this study at least one
(S397, CLIJ361, MAH 104) or two (JIII-386, K-10’, JII-1961, MAP4) homologues of the
PE_PGRS gene family could be annotated by BacProt, confirming results of Tian
et al. [190] for M. avium. Marri et al. [195] suggested that the paucity of PE/PPE
virulence genes in MAP in comparison to M. tuberculosis was compensated by the
acquisition of other virulence factors as a result of lateral gene transfer.
Furthermore, many mycobacterial membrane protein large (mmpL) genes are associated
with clusters involved in the biosynthesis of cell wall-associated glycolipids
[196]. The mmpL5 gene encodes a protein involved in lipid transport [195]. The current
study confirms the absence of the mmpL5 gene in MAP-S strains and its presence in
MAP-C strains (and MAH 104), as previously described by Marsh et al. [184], possibly
indicating that some of these mmpL gene products could also help in host association.
As shown before for the M. tuberculosis complex (MTBC) [197], two thirds of
SNVs among MAP strains are also nonsynonymous (see Tab. S18), unlike in most
other organisms, in which synonymous SNVs predominate. This has been proposed
to be the consequence of the relatively short evolutionary age of MTBC [198], which
also applies to MAP. Furthermore, this could indicate an adaptive evolution of MAP
to different hosts with positive selective pressure [105].
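The distinction between synonymous and nonsynonymous SNVs can be made concrete with a small sketch. It assumes a single-base substitution inside an in-frame coding sequence and uses the standard codon assignments; the function name and the example sequence are hypothetical and do not reproduce the pipeline behind Tab. S18.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_snv(cds, position, alt_base):
    """Classify a substitution at a 0-based CDS position as (non)synonymous."""
    cds = cds.upper()
    alt_base = alt_base.upper()
    if cds[position] == alt_base:
        raise ValueError("alternative base equals the reference base")
    codon_start = position - position % 3
    ref_codon = cds[codon_start:codon_start + 3]
    if len(ref_codon) < 3:
        raise ValueError("position lies in an incomplete codon")
    offset = position % 3
    alt_codon = ref_codon[:offset] + alt_base + ref_codon[offset + 1:]
    same_residue = CODON_TABLE[ref_codon] == CODON_TABLE[alt_codon]
    return "synonymous" if same_residue else "nonsynonymous"


# Hypothetical CDS fragment: ATG GCC GAC (Met Ala Asp)
fragment = "ATGGCCGAC"
third_position_change = classify_snv(fragment, 5, "G")  # GCC -> GCG, Ala -> Ala
first_position_change = classify_snv(fragment, 6, "A")  # GAC -> AAC, Asp -> Asn
print(third_position_change, first_position_change)
```

Counting both classes over all SNVs between two strains gives the ratio discussed above; a predominance of nonsynonymous changes is unusual because most third-position substitutions leave the encoded amino acid unchanged.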
[Figure 3.6: three phylogenetic trees, panels (A), (B) and (C), annotated with branch lengths and bootstrap values. Taxa shown: MAH 104, MAP K-10, MAP K-10’, MAP JII-1961, MAP4, MAP CLIJ361, MAP JIII-386, MAP S397 and MTB H37Rv in all panels; MI MOTT-02 and MI MOTT-64 in panels (A) and (B). Scale bars: 0.1 (A), 1.0 (B), 0.1 (C).]
Figure 3.6: Phylogenetic reconstructions for all investigated M. avium strains based on sequence
comparison of 790 corresponding CDSs on nucleotide (A) and amino acid level (B) and 70 corresponding
ncRNAs (C). M. tuberculosis strain H37Rv was used as an outgroup. Mycobacterium
intracellulare (MI) strains were included as members of the Mycobacterium avium complex. Floating-point
numbers correspond to substitutions per site and integer numbers represent RAxML bootstrap values.
Long branches are shortened. Detailed figures, all multiple sequence alignments and tree
representations in Newick format can be found in the supplementary material online, Fig. S22–S26.
Shown are strains of MAP-S, Type I: CLIJ361 and Type III: JIII-386, S397 (red); strains of MAP-C,
Type II: K-10, K-10’, MAP4, JII-1961 (green); MAH strain 104 (blue); MI MOTT-64 and MI
MOTT-02 (orange); and M. tuberculosis strain H37Rv (brown, used as outgroup).
unique to MAP-S strains and MAH 104, but also large deletions in the MAP-S
strains [130].
To decipher the complexity of the evolutionary processes leading to MAP-S and
MAP-C strains, future genome comparisons should investigate additional target
regions or genes, such as those for metabolic pathways, and especially use a higher number
of WGS of MAP-S, Type I and III as well as of related strains among the species
M. avium (MAA, MAS, and MAH).
3.2.6 Conclusions
With the newly sequenced JIII-386 genome, the best-assembled MAP-S sequence
to date was presented here. We could show that the combination of different de novo
genome assembly tools and parameter settings can improve an initial assembly; however,
we were not able to obtain the completely closed genome. Using merged results
from NCBI and BacProt, a comprehensive annotation of CDSs was obtained, including
a large fraction of CDSs identified by both approaches, and also additional
ones identified exclusively with one of the approaches. This puts into perspective the absolute
numbers of annotated genes in studies using only one annotation program. Newly
annotated CDSs complement the previously detected differences between MAP-S and
MAP-C strains. Within this study BacProt re-annotations of CDSs for each of
the seven M. avium strains are provided. A new Shine-Dalgarno sequence motif
was extracted; further studies should reveal whether this motif is conserved among
Mycobacteria.
For the first time about 80 ncRNAs and riboswitches of MAP were presented,
differing in numbers in three cases from MAH 104 but also between MAP-S and -C.
Furthermore, a pan-like sequence was observed, which would be the first discovery of this
RNA family in Actinobacteria. The performed genome comparison is the most comprehensive
to date since it comprises three MAP-S and three MAP-C isolates from
three and two different continents, respectively. Using the extended annotation, previously
reported genome differences between S and C strains were partially revised and new
MAP Type-S specific regions were identified.
The concordant presence and absence of specific LSPs and distribution of ncR-
NAs among the examined MAP-S, Type I and III strains show that these strains
are very closely related subgroups of MAP-S.
In conclusion, our data will improve the understanding of the Mycobacterium
avium subsp. paratuberculosis genome, help to decipher the genetic basis for different
phenotypic characteristics of MAP-S and -C (Type I/III and II, respectively) strains
and the evolution of MAP types, also in relation to Mycobacterium avium subsp.
hominissuis.
Chapter 4
Transcriptome Assembly
This chapter is based on “The Dark Art of de novo transcriptome assembly: a comprehensive
across-species comparison of short read RNA-Seq assemblers” [16]* and
“GOAssembler: a method pipeline for the construction, evaluation and clustering
of de novo transcriptome assemblies” [17]*. We present our idea of a merged or
clustered transcriptome assembly, utilizing different tools and parameter settings,
to construct a more complete and comprehensive assembly out of a given RNA-Seq
data set. The approaches and findings drawn from the conceptual comparisons
and calculations presented in this chapter have already been incorporated in the
bacterial genome assembly processes discussed in Chapter 3 [1, 2], in the de novo
transcriptome assembly process of the fruit bat Rousettus aegyptiacus presented
in Section 5.2 [4] and in the transcriptome construction of the Jamaican fruit bat
Artibeus jamaicensis, used for the annotation of an Mx1 homolog in Section 6.2 [7].
In this chapter, we will mainly focus on the topics Preprocessing, Assembly, and
Visualization (see Fig. 2.2; B, E, and M).
In Sec. 4.1, we comprehensively compare ten publicly available tools for de novo
transcriptome assembly across different RNA-Seq data sets comprising various species
and sequencing parameters. We define different metrics to evaluate the performance
of each assembly tool on the various data sets. This section is accompanied by a
comprehensive Electronic Supplement1 . In the second part (Sec. 4.2), we present
our idea of a clustered transcriptome assembly and discuss the advantages and disadvantages
of this approach by incorporating results from Sec. 4.1. We show as
proof of concept how the transcripts obtained from different de novo assembly runs
may be combined into a final and more comprehensive transcriptome, ready for annotation,
quantification and differential gene expression analysis.
Chapter 4. Transcriptome Assembly
4.1. The Dark Art of de novo transcriptome assembly
Therefore, reads derived from one exon can be part of multiple paths in the as-
sembly graph. Furthermore, some transcript variants with a low expression level
might be considered as sequencing errors by various tools and removed from the
assembly process [203]. As in genome assembly, repetitive regions are also a huge
problem for the construction of transcripts. One of the main challenges in de novo
genome assembly of DNA-Seq data is to deal with repeats that are longer than
the reads. In de novo transcriptome assembly, we have fewer and shorter repeated
sequences. However, they could create ambiguities and confuse assemblers if not
addressed properly [204]. Another point involves the coverage vs. cost relation, an
important question during the design of each NGS experiment. In direct comparison
to DNA-Seq, a much higher coverage is necessary for RNA-Seq to detect rare and
lowly expressed transcript variants [205]. Additionally, it is less straightforward to calculate
the sufficient coverage for a transcriptome assembly than for a genome
assembly, because the true number of expressed transcripts and their isoforms is
usually not known or can only be estimated [32]. Furthermore, the transcriptome
varies between different cell types, environmental conditions and time points. A
successful transcriptome assembler should account for all of these points and be
capable of recovering full-length transcripts of different expression levels.
For most analyses, a transcriptome assembly is only useful if a functional
annotation is available (see Sec. 2.3). Such annotations are often based on a homology
search to annotate protein-coding genes (as done for the Mycobacterium genome in
Sec. 3.2.3), but can get much more complicated if non-coding genes are also to be
targeted [12, 13].
The de novo transcriptome assembly of non-model organisms has recently been
on the rise, in line with the number of de novo transcriptome assembly tools.
There is a knowledge gap regarding which assembly software and parameter settings
should be used for the construction of a good de novo assembly. Additionally, there is a
lack of consensus on which evaluation metrics should be used to assess the quality
of de novo transcriptome assemblies to select good ones.
Several tools for de novo transcriptome assembly were developed in the last
decade [202]. Some of them are built on top of already existing genome assembly
tools, others were especially designed for transcriptome assembly. The question
that comes to mind: which tool should be used for which kind of data? Some tools
may fit the needs of eukaryotic transcripts, where alternative splicing has to be
considered to construct different isoforms, whereas other tools can handle simpler
prokaryotic transcripts. To complicate matters further, there are different RNA-Seq
library preparation protocols, resulting in reads of many kinds: single-end or paired-end,
strand-specific or not strand-specific, and with different insert sizes. To tackle
these and other questions, we performed a comprehensive comparison of ten de novo
assembly software programs across Illumina RNA-Seq data sets of different species,
sequencing parameters, and library preparation protocols.
The evaluation of de novo transcriptome assembly tools has already been done in
the past; however, those studies often rely on limited data sets (e.g., a single species,
a single sequencing protocol) or focus on classic assembly tools.
In 2010, Kumar and Blaxter [206] compared five assemblers based on Roche
454 pyrosequencing data; however, the most frequently used NGS platform today
is provided by Solexa Illumina [202] and is the focus of this study. In 2011,
Chen et al. [66] evaluated the impact of different k-mer sizes on the de novo
transcriptome assembly results. They found that using a single k-mer value for
assembly is not enough to generate good assembly results. Instead, combining
contigs constructed with different k-mer sizes could yield much
longer transcripts and greatly improve the final assembly. Larger k-mers
can improve the assembly quality of common transcripts and transcripts with
repetitive regions, but the assembly of rare transcripts may suffer. In that study,
only a single genome assembler (Velvet [51]) was used to build the different k-mer
assemblies. Zhao et al. [205] also showed that de novo transcriptome
assemblies from short-read RNA-Seq data improve when multiple assemblies of
different k-mer values are combined. Here, four single k-mer assemblers (Oases [60]
(built on top of Velvet), ABySS [56], Trinity [62], SOAPdenovo [207]) and three
multiple k-mer (MK) methods (SOAPdenovo-MK, Oases-MK, Trans-ABySS [61]) were
tested. They showed that small and large k-mer values performed better in the
reconstruction of lowly and highly expressed transcripts, respectively, and suggest
that multiple k-mer approaches should generally be considered to achieve better
assemblies. A newer study from 2013 [208] compared different de novo assembly
and genome-guided assembly strategies for transcriptome reconstruction. Overall,
five assemblers were used in that study, of which three (Oases, Trans-ABySS,
Trinity) can be applied de novo to the RNA-Seq data. Here, the merged
assemblies of all five tools achieved the best overall assembly. Of course, how the
merging of the assembled transcript sequences is performed has a high impact on
the quality of the final assembly. One common approach is to build a merged
(combined, clustered) assembly out of the concatenation of multiple FASTA files. Tools
like CD-HIT-EST [146] can be used for this task to cluster the sequences by
similarity. The process of merging different assemblies is not trivial, as highly similar
isoforms might be clustered into one sequence if the sequence similarity cutoff is too
low. Conversely, high redundancy can be introduced in an assembly that combines
the output of many single assemblies of different tools and parameter settings. Also
in 2013, Clarke et al. [209] evaluated five de novo assemblers, ABySS, Mira [64],
Trinity, Velvet and Oases, on simulated and real RNA-Seq data. All of those
assembly tools are based on de Bruijn graphs (see Sec. 2.2), except Mira, which uses an
overlap graph algorithm. Clarke et al. [209] suggest that there is an urgent need for
novel assembly tools for assembling transcriptome data generated by current NGS
techniques. In a recent study from 2016 by Wang and Gribskov [210], eight assembly
tools (Oases [60], SOAPdenovo-Trans [63], Trans-ABySS [61], Trinity [62],
BinPacker [211], Bridger [212], IDBA-Tran [213], SSP [214]) were compared
based on two RNA-Seq data sets from Arabidopsis thaliana with a series of k-mer
values. In that study, SOAPdenovo-Trans and Trans-ABySS performed best
regarding base coverage and the number of recovered full-length transcripts,
respectively. While this study is of general interest, especially because novel assembly
tools are included, the results are only based on data sets from one species (plant),
and the novel assembly tools cannot really show their strength on the
simpler single-end and non-strand-specific data.
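To make the role of the k-mer size in the studies above more concrete, the following minimal sketch builds the edges of a de Bruijn graph from a handful of reads. It is an illustration only and not the implementation of any of the cited assemblers; the function names and the toy reads are hypothetical, and real tools additionally handle reverse complements, sequencing errors and coverage.

```python
from collections import defaultdict


def kmers(read, k):
    """Return all overlapping substrings of length k from a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def de_bruijn_edges(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes following it."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph


# Toy reads (hypothetical); a small and a larger k give different graphs.
reads = ["ACGTAC", "CGTACG"]
small_k_graph = de_bruijn_edges(reads, k=3)
large_k_graph = de_bruijn_edges(reads, k=5)
```

Reads shorter than k contribute no k-mers at all, which is one reason why large k-mer values can lose transcripts that are covered by only a few reads.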
All of these studies agree on one point: currently, there is no single assembly
tool that is optimal for all RNA-Seq data sets. Different species, sequencing protocols
and parameter settings need different approaches and adjustments of the underlying
algorithms to obtain the best possible results from the RNA-Seq data. Merging
the contigs of different assembly tools and parameter settings to overcome the
disadvantages of certain assemblers and to combine their advantages seems
to be the best way to obtain a comprehensive de novo transcriptome assembly (see
Sec. 4.2). Nevertheless, knowing the advantages and disadvantages of each tool is
an important step in the direction of an automated merging algorithm for multiple
de novo transcriptome assemblies.
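The simplest form of such a merging step can be sketched as follows: contigs from several assemblies are concatenated, each header is tagged with the tool it came from, and exact duplicate sequences are dropped. This is a deliberately naive illustration under stated assumptions (the function names and the tool labels are hypothetical); it does not cluster similar sequences or reverse complements, which is what similarity-based tools such as CD-HIT-EST are used for.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def merge_assemblies(assemblies):
    """Concatenate assemblies, tag contigs by tool, drop exact duplicates.

    `assemblies` maps a tool label to the FASTA text of its contigs.
    """
    seen = set()
    merged = []
    for tool, fasta_text in assemblies.items():
        for header, seq in parse_fasta(fasta_text):
            key = seq.upper()
            if key in seen:
                continue  # identical contig already contributed by another run
            seen.add(key)
            merged.append((f"{tool}|{header}", seq))
    return merged


# Hypothetical toy input: two assemblies that share one identical contig.
example = {
    "toolA": ">c1\nACGT\nAC\n>c2\nGGGG\n",
    "toolB": ">x\nacgtac\n>y\nTTTT\n",
}
merged = merge_assemblies(example)
```

Tagging each header with its origin keeps contig names unique after concatenation and makes it possible to trace later which assembler contributed which transcript.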
Here, we present a comprehensive evaluation of ten de novo assembly tools
(long-standing and novel ones) across RNA-Seq data sets of different species and based
on different Illumina sequencing parameters and protocols. In contrast to recent
studies, we do not focus on RNA-Seq data of only one species or kingdom. Instead,
we use real data sets from bacteria, fungi, plants, and higher eukaryotes. We also
include data sets from virus-infected cells. We further tested promising metrics
of various evaluation tools to assess and compare the performance of each assembler.
In a next step, such metrics could be used for an automated selection of good
assemblies or contigs to build a more comprehensive and better cluster-assembly. Our
results give insights into the performance and usability of the different assemblers and
how they perform on the different data sets. To the best of our knowledge, this is
the most complete comparison of short-read de novo transcriptome assembly tools
currently available.
Table 4.1: Overview of the different de novo assembly tools evaluated in this study. We obtained the
most recent versions in December 2016. We further rated our experiences regarding the installation
and usability of each tool ( – excellent, – good, – unsatisfactory). These experiences might
be subjective; nevertheless, we want to share them to give inexperienced users an idea of how
difficult it is to get each tool installed (Setup) and executed (Usage), see Sec. 4.1.3 for details. MK
– Whether or not the tool has a built-in multiple k-mer approach and is able to automatically merge
the output of different k-mer runs. a) Oases was used on top of the de novo genome assembler
Velvet (v1.2.10) [217]. b) SPAdes, originally designed as a de novo genome assembler for
single-cell data, was used in RNA-Seq mode (--rna) and single-cell mode (--sc), respectively. c) When
running SPAdes in RNA-Seq mode, only a single k-mer value is allowed.
For our comparisons, we adopted the most recent versions in December 2016.
Finding the best parameter setting for each tool and each data set is
beyond the scope of this evaluation. Therefore, we used the default settings of each
tool and adjusted only a few key parameters (like k-mer values, strand-specificity)
whenever possible. Execution details can be found in the Electronic Supplement,
Tab. S3. For the tools with a built-in function to automatically merge the output
of different k-mer values (Oases, Trans-ABySS, IDBA-Tran, SPAdes-sc; see
Tab. 4.1), we applied a set of selected k-mers (for details see Tab. S3). If
strand-specific data was used for the assembly, we applied the corresponding option in each
tool, if possible. In practice, one should try several different parameter settings
and compare the resulting assemblies to optimize the whole assembly process. In
particular, different k-mers should be tested and evaluated against each other. Here,
we carefully chose k-mer values to obtain a reasonably fair comparison between the
assemblers, although some parameters may not be optimal.
Whenever a tool was difficult to install (e.g., due to missing dependencies) or
could not be run on a specific data set, we attempted to debug the source code and
in a few cases also contacted the authors to solve the problem. Because we observed
that usability differs widely between the tools, we also decided to share our
experiences regarding the installation procedure and execution of each tool (see
Tab. 4.1).
[Figure 4.1, schematic workflow. Data sets:
H. sapiens: 96 million reads (pe), 100 bp, strand-specific
Human + EBOV 3h: 17 million reads (pe), 100 bp, unstranded
Human + EBOV 7h: 24 million reads (pe), 100 bp, unstranded
Human Chr1 simulated: 60 million reads (pe), 100 bp, unstranded
M. musculus: 43 million reads (pe), 76 bp, strand-specific
A. thaliana: 16 million reads (se), 100 bp, unstranded
C. albicans: 36 million reads (pe), 34 bp, strand-specific
E. coli: 6.6 million reads (se), 94 bp, strand-specific
Steps: FastQC and Prinseq, then de novo Assembly (Trinity, Oases MK, Trans-ABySS MK,
IDBA-Tran MK, SOAPdenovo-Trans), then Evaluation.]
Figure 4.1: Overview of the used RNA-Seq data sets (orange – eukaryote, light orange – simulated
human chromosome 1, green – plant, pink – fungi, yellow – bacterium) and evaluated assembly
tools. Each data set was quality controlled with FastQC and preprocessed with Prinseq prior
to assembly. Overall, more than 200 single k-mer assemblies were calculated. For details about
the used data sets and assembly tools, see Electronic Supplement Tab. S1 and S2, respectively.
We further used several tools and statistics for the evaluation of each assembly. The CPU/RAM
consumption and the usability of each assembler were not included in the selected evaluation
metrics (see Sec. 4.1.2). se/pe – single-end/paired-end; MK – the assembler's built-in multiple-k-mer
approach was applied.
representatives for bacteria (Escherichia coli ; ECO), fungi (Candida albicans; CAL),
plant (Arabidopsis thaliana; ATH), and higher eukaryotes (Mus musculus; MMU
and Homo sapiens; HSA). We further included three data sets of a human HuH7
cell line, infected with the single-stranded RNA virus Ebola at three different time
points (HSA-EBOV-3h, HSA-EBOV-7h, HSA-EBOV-23h) [4]. With the help of
these data sets, we evaluated the assemblers' capability to reconstruct the viral RNA
genome directly from the mixed host and viral reads.
We further simulated one artificial data set based on protein-coding and non-
coding transcripts of human chromosome 1 (labeled HSA-FLUX).
With our selection of the different data sets, we further aim to represent different
experimental setups for RNA-Seq data: 1) single-end vs. paired-end data, 2) strand
specificity vs. unstranded protocols, 3) polyA enriched vs. rRNA depleted library
preparations, 4) different read lengths and 5) different sequencing depths, see Fig. 4.1
and Electronic Supplement Tab. S1.
Escherichia coli. Raw read RNA-Seq data of E. coli str. K-12 substr. MG1655
was downloaded from the NCBI Short Read Archive (SRA), study PRJNA238884,
run SRR1173967 [218]. The run comprises roughly 8 million single-end reads
with a length of 94 bp each. A protocol retaining the strand-specificity was used
for sequencing. The reference genome, annotation data and coding sequences were
obtained from the Ensembl [219] bacteria database, release 34 (see footnote 4).
Candida albicans. Candida albicans is one of the major invasive fungal pathogens
of humans [220]. Here, we obtained 11.5 million 51 bp paired-end reads (not strand
specific, sequenced on an Illumina HiSeq 2500) from the SRA (study: PRJNA213618,
selected run: SRR1654847), previously used in a comprehensive study about the
stress response of this fungal pathogen to weak organic acids [221]. Genome and
annotation data for C. albicans SC5314 were obtained from www.candidagenome.org
(Ca22, 11.12.2016).
Mus musculus. For M. musculus, we used an RNA-Seq data set that was previously
generated for the evaluation of the Trinity assembler [223]. The 52.6 million
strand-specific paired-end reads with a length of 76 bp were downloaded from
the SRA, study PRJNA140057 (run SRR203276). The mouse reference genome
(GRCm38), annotation data and coding sequences (CDS) were downloaded from
Ensembl, release 87.
Homo sapiens. The human data set was derived from the widely studied cell line
GM12878. A detailed description of the data set can be found in the ENCODE data
center (https://www.encodeproject.org/experiments/ENCSR000AED/)
with accession ENCSR000AED. Overall, we obtained 97.5 million strand-specific
paired-end reads with a length of 101 bp, sequenced by a polyA mRNA protocol.
The human reference genome (GRCh38) and annotation were obtained from En-
sembl, release 87.
Homo sapiens with EBOV infection. Here, we utilized three samples from our
study of an Ebola virus (EBOV) infected HuH7 cell line 3, 7 and 23 h post infection
(poi) [4], comprising ∼17–26 million paired-end reads with a length of 100 bp (not
strand specific). These data, including details about the experimental design, RNA
extraction and sequencing, are presented and discussed in Sec. 5.2. The Ebola virus
(filovirus) consists of a single-stranded RNA genome with a negative orientation
that is approximately 19 kb in size and encodes seven structural proteins [224].
By performing RNA-Seq without a polyA selection step, we sequenced the EBOV
genome together with the host transcripts. With these data sets, we aim to test the
performance of de novo assembly tools on a viral RNA genome. We assembled the
three different time points individually, to investigate how the different assemblers
perform on varying amounts of viral reads in the data (3 h ∼ 0.1 % viral reads, 7 h
∼ 2 %, 23 h ∼ 20 %). Details about the amount of viral reads in each data set can
be found in Sec. 5.2 and Appendix A.2. For the evaluation, we again used human
genome and annotation data from Ensembl (release 87) and concatenated the data
with the EBOV genome of strain Zaire, Mayinga (GenBank: NC_002549).
4 ftp://ftp.ensemblgenomes.org/pub/bacteria/release-34/gtf/bacteria_0_collection/escherichia_coli_str_k_12_substr_mg1655/
5 ftp://ftp.ensemblgenomes.org/pub/release-34/plants/fasta/arabidopsis_thaliana/dna/
Homo sapiens flux simulated data. In addition to the real RNA-Seq data
sets, we quasi-simulated RNA-Seq data based on a selection of protein- and long
non-coding transcripts of human chromosome 1.
We downloaded the human annotation GTF file and cDNA sequences (excluding
ab initio predictions) from Ensembl (GRCh38, release 87) and selected all protein-
coding genes from chromosome 1 (2,044 genes), comprising 352 genes with one iso-
form, 196 with two isoforms and 1,496 with more than two isoforms. We extended
this set of protein-coding genes by 1,075 non-coding genes from chromosome 1. The
combined set of protein- and non-coding genes was used to create a set of transcripts
including all known isoforms with a length >200 nt and without ambiguous N bases
from which paired-end reads should be simulated. Our final set of transcripts com-
prised 12,793 protein-coding transcripts as well as 1,006 lincRNAs, 839 antisense
RNAs and 7 snoRNAs of human chromosome 1.
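The selection criteria and the reported counts can be written down as a short sketch. The filter below applies the two stated conditions (length >200 nt, no ambiguous N bases); the function name is hypothetical and the snippet does not reproduce the actual Ensembl parsing, it only checks that the numbers given above are consistent with each other.

```python
def keep_transcript(sequence, min_length=200):
    """Keep a transcript if it is longer than min_length and contains no N."""
    seq = sequence.upper()
    return len(seq) > min_length and "N" not in seq


# Gene counts as stated in the text: one, two, and more than two isoforms.
protein_coding_genes = 352 + 196 + 1496

# Transcript counts as stated in the text:
# protein-coding, lincRNA, antisense RNA, snoRNA.
transcripts_total = 12793 + 1006 + 839 + 7
```

Both sums agree with the totals reported in this section (2,044 protein-coding genes and 14,645 transcript sequences).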
These 14,645 transcript sequences were further used as input for flux
simulator [225] for RNA-Seq raw read simulation, yielding 60 million paired-end
100 bp reads (Tab. S1). We used flux simulator as suggested for Illumina
data, utilizing the default 76-bp error model. With these simulated sequences, we
attempt to mimic a state-of-the-art RNA-Seq data set based on Illumina’s
Ribo-Zero protocol for library preparation and rRNA depletion, further multiplexed three
times and sequenced on one HiSeq 2500 lane. As such a protocol also allows for the
detection of untranslated transcripts as part of the RNA-Seq data, we also included
the sequences of non-coding transcripts in the simulation.
In total, we calculated more than 200 single k-mer assemblies. Each assembler was
run on each data set (see Fig. 4.1). If possible, multiple k-mers were used (see
Tab. 4.1). Trans-ABySS, Oases and IDBA-Tran provide built-in functionality
for multiple k-mers. SPAdes-sc can automatically choose multiple k-mers for the
assembly process and was therefore executed with this default option. All assemblers
were run with default parameters unless stated otherwise. Details about the
execution of each tool on each data set can be found in the Electronic Supplement,
Tab. S3. For the E. coli, A. thaliana, H. sapiens and the artificial data sets k-mers
25, 35, 45, 55 and 65 were used with Trans-ABySS, Oases and IDBA-Tran. The
short-read C. albicans data was run with k-mers 21, 27, 33 and 39. M. musculus
data was assembled with the k-mers: 25, 35, 45 and 55, because the read length
is shorter in comparison to the bacterial and plant data sets. The EBOV infected
HuH7 samples were run with k-mers 25, 29, 33, 37 and 41. The k-mer values were
selected based on previous results for these data sets and in relation to the different
read lengths and sequencing setups.
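The k-mer sets listed above can be collected in a small configuration together with a sanity check. The sketch assumes the read lengths given in the data set descriptions of this section (for C. albicans the 51 bp stated in the text) and two common constraints for de Bruijn graph assemblers: k has to be smaller than the read length, and odd values are used so that a k-mer cannot be its own reverse complement. The names are hypothetical, and the check is not part of any of the evaluated tools.

```python
# Read lengths (bp) as described for each data set; assumption for illustration.
READ_LENGTH = {
    "E. coli": 94,
    "A. thaliana": 100,
    "H. sapiens": 101,
    "HSA-FLUX": 100,
    "C. albicans": 51,
    "M. musculus": 76,
    "HSA-EBOV": 100,
}

# k-mer values used with Trans-ABySS, Oases and IDBA-Tran, as listed in the text.
KMERS = {
    "E. coli": [25, 35, 45, 55, 65],
    "A. thaliana": [25, 35, 45, 55, 65],
    "H. sapiens": [25, 35, 45, 55, 65],
    "HSA-FLUX": [25, 35, 45, 55, 65],
    "C. albicans": [21, 27, 33, 39],
    "M. musculus": [25, 35, 45, 55],
    "HSA-EBOV": [25, 29, 33, 37, 41],
}


def usable_kmers(kmers, read_length):
    """Return the k-mer values that are odd and shorter than the read length."""
    return [k for k in kmers if k % 2 == 1 and k < read_length]


# Every listed k-mer value passes the check for its data set.
all_usable = all(
    usable_kmers(KMERS[name], READ_LENGTH[name]) == KMERS[name] for name in KMERS
)
```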
As IDBA-Tran assumes paired-end reads to be in order (->, <-; forward–
reverse), we manually converted reads if necessary before running IDBA-Tran (see
Tab. S3).
We tried to run Bridger in RF (reverse–forward) mode for strand-specific data;
however, this did not work. Therefore, we used the non-strand-specific mode for
the M. musculus and H. sapiens assemblies.
We benchmarked the different assembly tools using several evaluation tools and met-
rics, summarized in Fig. 4.1. Some of the metrics are based on reference sequences
and annotations, whereas others are only based on the final assembly itself (the
contigs) or the reads that were used to construct the assembly.
Evaluation metrics are very important to assess the quality of a genome or
transcriptome assembly. However, there is a lack of consensus on which evaluation metrics
work best for de novo transcriptome assembly. For example, Rana et al. [226]
compared different assemblers and k-mer strategies using killifish RNA-Seq data and
based their comparisons on eleven selected metrics, such as contig number, N50
value (see footnote 6), contigs >1 kb, re-mapping rate, number of full-length transcripts,
number of open reading frames, Detonate's RSEM-EVAL score and percentage of alignments
to closely related fish. In another study, Chopra et al. [227] performed comparisons
on peanut RNA-Seq data and evaluated the assemblies on metrics like N50, average
contig length, number of contigs and the number of full-length transcripts. Moreton,
Dunham, and Emes [228] also used the N50 length, the number of transcripts, the
number of transcripts ≥1 kb and RMBT and CEGMA percentages when evaluating
different assemblies of duck. Surely, more information on which metrics best
predict the quality of a de novo transcriptome assembly would help to establish “best
6 The Nx value describes the length of the shortest contig in the assembly, so that the
accumulated bases of all contigs of this length or longer cover x % of all the bases in the assembly.
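The Nx definition in footnote 6 translates directly into a few lines of code. The following is a minimal sketch with a hypothetical function name; it is not taken from any of the evaluation tools used here.

```python
def nx(contig_lengths, x=50):
    """Return the Nx value for a list of contig lengths.

    Contigs are visited from longest to shortest; the result is the length of
    the contig at which the accumulated bases first reach x % of all bases.
    """
    if not contig_lengths:
        raise ValueError("at least one contig is required")
    threshold = sum(contig_lengths) * x / 100
    accumulated = 0
    for length in sorted(contig_lengths, reverse=True):
        accumulated += length
        if accumulated >= threshold:
            return length


# Toy example (hypothetical contig lengths, 20 bases in total).
toy_lengths = [2, 3, 4, 5, 6]
toy_n50 = nx(toy_lengths)        # contigs 6 and 5 together cover 11 of 20 bases
toy_n90 = nx(toy_lengths, x=90)  # contigs 6, 5, 4 and 3 cover 18 of 20 bases
```

Because the value depends only on the length distribution, a high N50 alone says nothing about whether the long contigs are correct, which is why it is combined with other metrics.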
Mapping rate. We used Hisat2 [74], a fast splice-aware aligner with low memory
consumption, to map the quality-controlled reads back to each assembly. The
mapping rate indicates what fraction of the reads was incorporated into the
final transcripts during the assembly process (see Electronic Supplement Fig. S4).
However, reads that are not part of the true transcriptome
but are still included in the RNA-Seq data (e.g., due to contamination)
can induce chimeric contigs and higher mapping rates. Furthermore, contigs that
were simply misassembled can also increase the mapping rate. Therefore, the
re-mapping rate can give first insights into the quality of a transcriptome assembly,
but further metrics are needed to obtain a more complete picture of an assembler's
performance.
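The overall mapping rate can be collected from the alignment summary that the aligner reports for each run. The sketch below assumes a summary that ends with a line of the form "NN.NN% overall alignment rate"; the function name and the example summary text are hypothetical and only serve to illustrate the extraction step.

```python
import re


def overall_alignment_rate(summary_text):
    """Extract the overall alignment rate (in percent) from a summary text."""
    match = re.search(r"([\d.]+)% overall alignment rate", summary_text)
    if match is None:
        raise ValueError("no overall alignment rate found in summary")
    return float(match.group(1))


# Hypothetical example summary, not taken from one of the evaluated assemblies.
example_summary = """1000 reads; of these:
  1000 (100.00%) were unpaired; of these:
    83 (8.30%) aligned 0 times
91.70% overall alignment rate
"""
rate = overall_alignment_rate(example_summary)
```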
Detonate. We further used the Detonate workflow: a pipeline for the DE novo
TranscriptOme rNa-seq Assembly with or without the Truth Evaluation [230]. The
pipeline consists of two component packages, RSEM-EVAL and REF-EVAL. Both
packages are mainly intended to be used to evaluate de novo transcriptome
assemblies, although REF-EVAL can be used to compare sets of any kind of genomic
sequences. Here, we mainly focus on Detonate's RSEM-EVAL score as a novel
reference-free evaluation method to assess the quality of transcriptomes. The tool
calculates a statistically based evaluation score using multiple factors, such as the
compactness of the assembly and its support from the RNA-Seq reads used to create
it [230]. Therefore, the RSEM-EVAL score can be used to evaluate assemblies even
when the ground truth is unknown. Assemblies with higher RSEM-EVAL
scores are considered better.
We further calculated nucleotide F1, contig F1 and KC scores with Detonate.
The F1 score is a measure of a test’s accuracy that combines precision and recall.
An F1 score of 1 would mean that
all nucleotides/contigs in the estimated true assembly were recovered with at least
90 % identity. The k-mer compression score (KC score) reflects the similarity of each
assembly to Detonate's estimated “true” assembly and combines two measures:
weighted k-mer recall and inverse compression rate [230].
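For orientation, the F1 score in its general form is the harmonic mean of precision and recall. The sketch below shows only this general formula with a hypothetical function name; how Detonate derives precision and recall at the nucleotide and contig level is described in [230] and is not reproduced here.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical values: every true element recovered, half of the output wrong.
example_f1 = f1_score(precision=0.5, recall=1.0)
```

The harmonic mean penalizes an imbalance between the two values, so an assembly cannot reach a high F1 score by maximizing recall with many redundant or wrong contigs.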
Detonate was run for all assemblies as recommended in the online vignette8 .
The main metrics calculated by Detonate can be found in Electronic Supplement
Tab. S8.
Computational resources
Each assembly was executed on 48 threads. All calculations were run on two sym-
metric multiprocessing servers with 14 TB storage (raid-5) and 48 CPU cores, com-
prising four AMD Opteron 6238 CPUs and 512 GB RAM running on a Debian 64 bit
system.
Usability
We further aimed to install and run all tools without root rights on our systems
(Debian GNU/Linux 8 (jessie) 64-bit). Of course, how easily a tool can be installed
and executed heavily depends on the used machine, the server setup and how familiar
the user is with the programming language the tool is based on. Nevertheless, it should
be the goal of each publicly available piece of software to be as user-friendly as possible.
Therefore, we collected our experiences during the installation and execution of each
assembler to share our observations (Tab. 4.1).
strategies that performed the best for the H. sapiens data set were SPAdes-rna
(12/20; the assembler performed within the top three scores of 12 out of the 20 met-
rics) followed by SOAPdenovo-Trans (9/20), the Trans-ABySS assembly (9/20)
and Trinity (8/20) (Fig. 4.2 and Tab. 4.2).
In the following sections, we will present the performance of each assembler over
all data sets based on the selected evaluation metrics (Sec. 4.1.2). For the H. sapiens data
set (∼96 million strand-specific paired-end reads with a maximum length of 101 bp),
all 20 selected metrics and the scores for each of the ten assembly tools are shown in
Tab. 4.2. The tables for all other data sets can be found in Appendix B.1–B.8 and in
detail in the Electronic Supplement (Tab. S9). Detailed plots and further statistics
for all data sets and assembly tools can be found in the Electronic Supplement.
To get a general overview of the performance of each assembler, we summed
up the metric scores achieved for each data set to calculate an overall metric score
(OMS) for each assembler. Because of the similarity of the three human RNA-Seq
data sets treated with the Ebola virus 3, 7, and 23 h post infection (same read
length, paired-end, not strand-specific, roughly the same amount of reads), we used
the mean of all three scores when accumulating the scores of all data sets. For
example, Trans-ABySS performed very well on all three Ebola-infected data sets
(10/20), whereas IDBA-Tran did not (4/20, 5/20, 4/20) (Fig. 4.2).
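The calculation of the overall metric score can be sketched as follows, using three rows of metric scores as given in Fig. 4.2. The position of the three Ebola-infected data sets among the nine columns is an assumption made for this illustration (it is inferred from the scores 10/10/10 and 4/5/4 quoted above), and the function name is hypothetical.

```python
from statistics import mean

# Metric scores per data set, copied from the corresponding rows of Fig. 4.2.
SCORES = {
    "Trans-ABySS": [5, 9, 8, 9, 9, 10, 10, 10, 10],
    "SPAdes-rna": [5, 7, 5, 7, 12, 9, 9, 6, 6],
    "IDBA-Tran": [7, 5, 5, 7, 4, 4, 5, 4, 8],
}

# Assumption: zero-based columns 5, 6 and 7 hold HSA-EBOV-3h, -7h and -23h.
EBOV_COLUMNS = (5, 6, 7)


def overall_metric_score(scores, ebov_columns=EBOV_COLUMNS):
    """Sum the scores of all data sets, counting the EBOV sets once via their mean."""
    ebov = [scores[i] for i in ebov_columns]
    rest = [s for i, s in enumerate(scores) if i not in ebov_columns]
    return sum(rest) + mean(ebov)


oms = {tool: overall_metric_score(row) for tool, row in SCORES.items()}
```

Under this assumption, the three example rows give 60, 50 and about 40.3, which matches the values shown next to the assembler names in Fig. 4.2.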
Trans-ABySS
[Figure 4.2, heat map. Rows: assembler (OMS); values: metric score for each of the nine data sets]
Trans-ABySS (60)          5  9  8  9  9 10 10 10 10
SPAdes-sc (53.3)          8 12  6  7  5  7  6  6  9
Trinity (50.3)            5  5  7 10  8  2  8  9  9
SOAPdenovo-Trans (45.6)   6  7  6  8  9  7  6  7  3
SPAdes-rna (50)           5  7  5  7 12  9  9  6  6
IDBA-Tran (40.3)          7  5  5  7  4  4  5  4  8
Oases (29.6)              5  4  6  1  2  6  2  6  7
BinPacker (28)            4  4  4  4  5  5  3  4  3
Bridger (22)              6  3  1  6  2  3  2  4  1
Shannon (22.6)            0  4  3  1  4  7  9  4  4
Color scale (metric score): 12 10 8 6 4 2 0
Figure 4.2: Heat map showing for each data set (column) and each assembler (row) the summed
up metric score based on the 20 metrics presented in Sec. 4.1.2. For each metric, an assembler got
a point if the resulting assembly ranked within the top three results. The hierarchical clustering
of the metric scores divides the assembly tools into two groups, performing generally better (upper
half) and generally worse (lower half) on the tested data sets. The maximum achievable
metric score for the E. coli and A. thaliana data sets is 17 and not 20, because the optimal score,
the percentage of good mappings and the percentage of uncovered bases are only calculated by
TransRate in the case of paired-end data. Please note that for the HSA-EBOV-7h data set no
rnaQUAST statistics were calculated for the Oases assembly, because rnaQUAST was not able to
finish the calculations if the Oases assembly for this data set was included. Numbers in brackets next to
the assembler names present the summed up metric scores (overall metric score, OMS) for all nine
data sets. For the three similar human data sets infected with the Ebola virus, we added the mean
value to the OMS. BinPacker and Bridger, built on the same principles, performed similarly
and cluster together. However, BinPacker achieved more consistent scores. SPAdes in
single-cell mode worked best for the bacterial data set. SOAPdenovo-Trans generally worked well on
all real data sets, but was outperformed by other tools for the artificial data set. Interestingly,
Trinity was outperformed by other tools for the HSA-EBOV-3h data set, but worked well on the
later time points. Surprisingly, the RNA mode of SPAdes performed best on the human data set,
whereas SPAdes in single-cell mode achieved a much lower score. Details about the used metrics
can be found in the Electronic Supplement, Tab. S9 and Appendix Tab. B.1–B.8.
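The point scheme described in this caption (one point per metric for each assembly among the top three) can be sketched as follows. The tool labels and metric values are hypothetical, and the treatment of ties (tied values share a rank) as well as the per-metric direction flag are assumptions of this sketch, not a description of how the scores in Fig. 4.2 were computed.

```python
def top_three_points(metric_values, higher_is_better=True):
    """Give one point to every tool whose value is among the three best values."""
    ranked = sorted(set(metric_values.values()), reverse=higher_is_better)
    best = set(ranked[:3])
    return {tool: int(value in best) for tool, value in metric_values.items()}


def metric_score(metrics):
    """Sum the points over several metrics.

    `metrics` is a list of (values_by_tool, higher_is_better) pairs.
    """
    totals = {}
    for values, higher_is_better in metrics:
        points = top_three_points(values, higher_is_better)
        for tool, point in points.items():
            totals[tool] = totals.get(tool, 0) + point
    return totals


# Hypothetical values for four tools and two metrics.
coverage = {"A": 0.9, "B": 0.8, "C": 0.7, "D": 0.6}  # higher is better
errors = {"A": 5, "B": 1, "C": 3, "D": 2}            # lower is better
totals = metric_score([(coverage, True), (errors, False)])
```

A rank-based scheme of this kind makes metrics with very different scales comparable, at the cost of ignoring how large the differences between the assemblers actually are.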
Table 4.2: Selected metrics based on the output of rnaQUAST, Hisat2, Detonate, TransRate and BUSCO for the transcripts assembled by all ten
assembly tools on the Homo sapiens RNA-Seq strand-specific paired-end library with read length 101 bp (accession number ENCSR000AED). Details and
many more statistics, complementing this evaluation, can be found in the Electronic Supplement, content S4–S8. In each row the top three values are
indicated with bold italic. The RSEM-EVAL score is multiplied by 10^9. The number of ambiguous bases is given in thousands. N50 – the length of the shortest
contig in the assembly, so that the accumulated bases of all contigs of this length or longer cover 50 % of all the bases in the assembly. F1 score – a measure
of a test’s accuracy. An F1 score of 1 would mean that all nucleotides/contigs in the estimated true assembly were recovered with at least 90 % identity.
KC score – k-mer compression score reflecting the similarity of each assembly to Detonate's estimated “true” assembly.
Metric | Trans-ABySS | Oases | SOAP-Trans | Trinity | IDBA-Tran | Shannon | Bridger | BinPacker | SPAdes-sc | SPAdes-rna
k-mer size | 25,35,45,55,65 | 25,35,45,55,65 | default | default | 25,35,45,55,65 | default | default | default | default | default
Hisat2
Overall mapping rate | 91.66 | 88.04 | 98.36 | 89.93 | 86.83 | 72.6 | 64.61 | 84.27 | 90.26 | 90.76
rnaQUAST
Transcripts >1000 bp | 72685 | 207474 | 68662 | 27529 | 43201 | 22611 | 23516 | 31328 | 26245 | 15945
Database coverage | 0.23 | 0.08 | 0.29 | 0.1 | 0.07 | 0.06 | 0.09 | 0.01 | 0.09 | 0.1
Misassemblies | 2739 | 216128 | 2878 | 279 | 7329 | 5603 | 302 | 2837 | 1566 | 570
Mismatches per transcript | 1.04 | 1.25 | 0.61 | 0.27 | 1.44 | 4.63 | 0.67 | 1.26 | 0.82 | 0.41
Average alignment length | 781.94 | 343.48 | 258.13 | 218 | 654.41 | 2335.73 | 487.11 | 711.83 | 429.36 | 207.92
Mean isoform coverage | 0.52 | 0.33 | 0.49 | 0.27 | 0.33 | 0.7 | 0.35 | 0.28 | 0.34 | 0.28
TransRate
N50 | 1613 | 1230 | 1913 | 3391 | 1386 | 3511 | 566 | 1446 | 641 | 16469
Reference coverage | 0.23 | 0.09 | 0.27 | 0.09 | 0.09 | 0.07 | 0.08 | 0 | 0.08 | 0.09
Mean ORF percentage | 51.47 | 42.09 | 51.09 | 48.02 | 45.1 | 42.57 | 52.46 | 55.7 | 48.28 | 55.04
Optimal score | 0.08 | 0.02 | 0.08 | 0.27 | 0.14 | 0.07 | 0.25 | 0.07 | 0.32 | 0.35
Percentage good mappings | 0.22 | 0.06 | 0.17 | 0.59 | 0.32 | 0.26 | 0.49 | 0.22 | 0.63 | 0.64
Percentage bases uncovered | 0.66 | 0.94 | 0.66 | 0.33 | 0.42 | 0.84 | 0.02 | 0.5 | 0.04 | 0.31
Number of ambiguous bases | 306314 | 843235 | 460747 | 241236 | 206635 | 72918 | 138699 | 117068 | 159111 | 186834
DETONATE
Nucleotide F1 | 0.4 | 0.18 | 0.49 | 0.57 | 0.48 | 0.15 | 0.55 | 0.35 | 0.58 | 0.56
Contig F1 | 0.02 | 0.02 | 0.2 | 0.21 | 0.01 | 0 | 0.02 | 0.02 | 0.02 | 0.09
KC score | 0.49 | 0.24 | 0.56 | 0.37 | 0.4 | 0.37 | 0.29 | 0.42 | 0.36 | 0.33
RSEM EVAL | -6.63 | -1.18 | -6.22 | -9.03 | -7.71 | -1 | -1.63 | -8.95 | -1.38 | -1.34
BUSCO
Complete single-copy | 1401 | 1321 | 2079 | 2151 | 2360 | 1010 | 1677 | 2302 | 2347 | 2551
Missing BUSCOs | 1810 | 1922 | 1772 | 2164 | 1812 | 4078 | 2615 | 2133 | 2457 | 2392
Figure 4.3, panel data (x-axis of each panel: %BUSCOs, 0 to 100)
Legend: Missing (M), Fragmented (F), Complete (C) and duplicated (D), Complete (C) and single-copy (S)
A
Trans-ABySS  C:301 [S:234, D:67], F:314, M:166, n:781
Oases        C:299 [S:136, D:163], F:310, M:172, n:781
SOAP-Trans   C:316 [S:316, D:0], F:287, M:178, n:781
Trinity      C:281 [S:258, D:23], F:311, M:189, n:781
IDBA-Tran    C:296 [S:296, D:0], F:289, M:196, n:781
Shannon      C:280 [S:261, D:19], F:303, M:198, n:781
Bridger      C:285 [S:281, D:4], F:306, M:190, n:781
BinPacker    C:50 [S:48, D:2], F:20, M:711, n:781
SPAdes-sc    C:332 [S:332, D:0], F:277, M:172, n:781
SPAdes-rna   C:96 [S:96, D:0], F:315, M:370, n:781
B
Trans-ABySS  C:1458 [S:348, D:1110], F:144, M:109, n:1711
Oases        C:1382 [S:611, D:771], F:189, M:140, n:1711
SOAP-Trans   C:1042 [S:1039, D:3], F:421, M:248, n:1711
Trinity      C:1369 [S:1279, D:90], F:180, M:162, n:1711
IDBA-Tran    C:1070 [S:1069, D:1], F:401, M:240, n:1711
Shannon      C:1088 [S:460, D:628], F:264, M:359, n:1711
Bridger      C:1416 [S:1149, D:267], F:162, M:133, n:1711
BinPacker    C:1416 [S:1146, D:270], F:160, M:135, n:1711
SPAdes-sc    C:1515 [S:1510, D:5], F:112, M:84, n:1711
SPAdes-rna   C:1464 [S:1458, D:6], F:154, M:93, n:1711
C
Trans-ABySS  C:1119 [S:732, D:387], F:97, M:224, n:1440
Oases        C:1108 [S:546, D:562], F:84, M:248, n:1440
SOAP-Trans   C:1058 [S:1042, D:16], F:134, M:248, n:1440
Trinity      C:1094 [S:858, D:236], F:124, M:222, n:1440
IDBA-Tran    C:930 [S:908, D:22], F:241, M:269, n:1440
Shannon      C:1049 [S:804, D:245], F:95, M:296, n:1440
Bridger      C:1103 [S:978, D:125], F:108, M:229, n:1440
BinPacker    C:262 [S:203, D:59], F:16, M:1162, n:1440
SPAdes-sc    C:1077 [S:1053, D:24], F:139, M:224, n:1440
SPAdes-rna   C:878 [S:859, D:19], F:205, M:357, n:1440
D
Trans-ABySS  C:4104 [S:2079, D:2025], F:316, M:1772, n:6192
Oases        C:3588 [S:1321, D:2267], F:682, M:1922, n:6192
SOAP-Trans   C:2625 [S:2151, D:474], F:1403, M:2164, n:6192
Trinity      C:3925 [S:1401, D:2524], F:457, M:1810, n:6192
IDBA-Tran    C:1682 [S:1677, D:5], F:1895, M:2615, n:6192
Shannon      C:3385 [S:2302, D:1083], F:674, M:2133, n:6192
Bridger      C:3909 [S:2360, D:1549], F:471, M:1812, n:6192
BinPacker    C:2009 [S:1010, D:999], F:105, M:4078, n:6192
SPAdes-sc    C:2357 [S:2347, D:10], F:1378, M:2457, n:6192
SPAdes-rna   C:2564 [S:2551, D:13], F:1236, M:2392, n:6192
E
Trans-ABySS  C:4135 [S:1938, D:2197], F:289, M:1768, n:6192
Oases        C:4055 [S:704, D:3351], F:320, M:1817, n:6192
SOAP-Trans   C:3667 [S:3362, D:305], F:670, M:1855, n:6192
Trinity      C:3718 [S:1873, D:1845], F:686, M:1788, n:6192
IDBA-Tran    C:2134 [S:2128, D:6], F:1841, M:2217, n:6192
Shannon      C:3777 [S:1108, D:2669], F:460, M:1955, n:6192
Bridger      C:3979 [S:2471, D:1508], F:428, M:1785, n:6192
BinPacker    C:3590 [S:1909, D:1681], F:210, M:2392, n:6192
SPAdes-sc    C:3493 [S:3481, D:12], F:866, M:1833, n:6192
SPAdes-rna   C:3617 [S:3606, D:11], F:702, M:1873, n:6192
F
Trans-ABySS  C:563 [S:290, D:273], F:90, M:18, n:671
Oases        C:613 [S:86, D:527], F:36, M:22, n:671
SOAP-Trans   C:289 [S:191, D:98], F:273, M:109, n:671
Trinity      C:597 [S:203, D:394], F:52, M:22, n:671
IDBA-Tran    C:226 [S:226, D:0], F:303, M:142, n:671
Shannon      C:242 [S:143, D:99], F:65, M:364, n:671
Bridger      C:525 [S:316, D:209], F:118, M:28, n:671
BinPacker    C:527 [S:256, D:271], F:115, M:29, n:671
SPAdes-sc    C:393 [S:393, D:0], F:218, M:60, n:671
SPAdes-rna   C:363 [S:363, D:0], F:233, M:75, n:671
Figure 4.3: Selected BUSCO assessment results for E. coli (A), C. albicans (B), A. thaliana (C),
H. sapiens (D), HuH7 cells infected with EBOV 7 h post infection (E) and flux simulated reads
of human chromosome 1 (F). The numbers indicate the absolute number of complete (C) and
single-copy (S), complete and duplicated (D), fragmented (F), and missing (M) BUSCOs. BUSCO
results for all other data sets can be found in the Electronic Supplement, Fig. S7.
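The BUSCO summary strings shown for Fig. 4.3 can be turned into percentages with a few lines of code. The following is a minimal sketch, not part of the evaluation pipeline used in this study; the function names parse_busco_summary and as_percentages are hypothetical and only serve the illustration.

```python
import re

# Pattern for the BUSCO summary notation used in the panels of Fig. 4.3,
# e.g. "C:50 [S:48, D:2], F:20, M:711, n:781".
BUSCO_SUMMARY = re.compile(
    r"C:(\d+)\s*\[S:(\d+),\s*D:(\d+)\],\s*F:(\d+),\s*M:(\d+),\s*n:(\d+)"
)


def parse_busco_summary(summary):
    """Return the BUSCO counts of one summary string as a dictionary."""
    match = BUSCO_SUMMARY.search(summary)
    if match is None:
        raise ValueError(f"not a BUSCO summary: {summary!r}")
    complete, single, duplicated, fragmented, missing, total = map(int, match.groups())
    # Sanity checks: C = S + D and C + F + M = n.
    if single + duplicated != complete or complete + fragmented + missing != total:
        raise ValueError(f"inconsistent BUSCO counts: {summary!r}")
    return {
        "complete": complete,
        "single": single,
        "duplicated": duplicated,
        "fragmented": fragmented,
        "missing": missing,
        "total": total,
    }


def as_percentages(counts):
    """Express every category as a percentage of the database size n."""
    total = counts["total"]
    return {
        key: round(100 * value / total, 1)
        for key, value in counts.items()
        if key != "total"
    }


if __name__ == "__main__":
    # BinPacker on the E. coli data set (panel A).
    binpacker_ecoli = parse_busco_summary("C:50 [S:48, D:2], F:20, M:711, n:781")
    print(as_percentages(binpacker_ecoli))
```

Percentages make panels with different database sizes (n between 671 and 6192) directly comparable, which is what the %BUSCOs axis of the figure expresses.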
Oases
Oases was also run in multiple k-mer (MK) mode and, if suitable, with the
-strand_specific option (Tab. S3).
The re-mapping rate was good (>85 %) for most data sets; however, for the simulated
human data (73.26 %), the HSA-EBOV-23h data (70.05 %) and the E. coli data
(49.16 %) it dropped well below this value. For the E. coli data set, similarly low
re-mapping rates were observed for SOAPdenovo-Trans (56.62 %) and IDBA-Tran
(34.31 %).
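To make the >85 % criterion concrete, the following sketch flags assemblies whose re-mapping rate falls below it. It assumes that the re-mapping rate is the percentage of reads that align back to the assembly; the helper names are hypothetical, and the values are the E. coli rates quoted in the running text of this section.

```python
GOOD_REMAPPING_RATE = 85.0  # percent; the value above which a rate is called good here

# E. coli re-mapping rates (percent) as quoted in the text for five assemblers.
ECOLI_REMAPPING = {
    "Oases": 49.16,
    "SOAPdenovo-Trans": 56.62,
    "IDBA-Tran": 34.31,
    "Trinity": 85.56,
    "Bridger": 87.35,
}


def remapping_rate(mapped_reads, total_reads):
    """Percentage of reads that map back to the assembly (assumed definition)."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return 100.0 * mapped_reads / total_reads


def below_threshold(rates, threshold=GOOD_REMAPPING_RATE):
    """Return the tools with a rate below the threshold, lowest rate first."""
    flagged = [tool for tool, rate in rates.items() if rate < threshold]
    return sorted(flagged, key=lambda tool: rates[tool])


if __name__ == "__main__":
    print(below_threshold(ECOLI_REMAPPING))
```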
Oases introduced the highest number of ambiguous bases into the assemblies in
comparison to the other tools and ranks among the last places regarding the
TransRate statistics. Based on the optimal score calculated by TransRate,
Oases occupies the last place for six out of the seven evaluated data sets.
Oases ranks in the last third of the RSEM-EVAL scores calculated by Detonate.
Based on the Oases assemblies, a comparably high number of complete BUSCOs
could be detected; however, many duplicate hits are included that might be
a result of the MK approach (Fig. 4.3). Oases assembled the highest number of
complete BUSCOs for the simulated data set (∼90 %), but also had the highest
number of duplicate BUSCOs within these hits (∼80 %).
Regarding the selected metrics, Oases performed best for the human simulated
data (7/20), the EBOV-infected samples (6/20) and the plant data (6/20). The
calculated metric score for the HSA-EBOV-7h data set may be comparatively low
because we were not able to calculate rnaQUAST statistics for this assembly. Oases
achieved an OMS of only 29.6 (Fig. 4.2).
SOAPdenovo-Trans
SOAPdenovo-Trans was run with a single k-mer because the tool has no built-in
function to merge assemblies from multiple k-mers. Strand-specific assembly is
not supported. According to the authors, this is planned for a future release to
further improve the algorithm [63]. The re-mapping rate was generally good (>85 %),
except for the E. coli data set.
SOAPdenovo-Trans performed quite well based on the TransRate statistics.
Almost all of the assemblies achieved very good scores for the percentage of good
mappings, the percentage of uncovered bases, the number of ambiguous bases and
the optimal score calculated by TransRate. In most cases, only the SPAdes
assemblies outperformed SOAPdenovo-Trans regarding the TransRate metrics.
The RSEM-EVAL scores vary depending on the assembled RNA-Seq data set.
For the HSA-EBOV-23h and M. musculus samples SOAPdenovo-Trans performed
well, whereas for the bacterial, the fungal, the plant and the simulated RNA-Seq
data the tool is among the last three assemblers regarding the RSEM-EVAL metric.
SOAPdenovo-Trans ranks in the midfield regarding the number of assembled
complete BUSCOs. The number of complete and duplicated (CD) BUSCOs is very low
(Fig. 4.3), which correlates with the tool's ability to detect different isoforms (see
mean isoform coverage calculated with rnaQUAST, Tab. S5). However, this might
also be a result of the single k-mer approach.
Trinity
Trinity was run with a single k-mer and, if suitable, in strand-specific mode on each
data set (Tab. S3). The re-mapping rate was generally good, ranging between 85.56 %
(E. coli) and 97.29 % (C. albicans).
Trinity assemblies rank in the midfield regarding the TransRate metrics;
in some cases (C. albicans, HSA-EBOV-23h) the assemblies are even among
the top four optimal TransRate scores.
Trinity performed very well on almost all data sets (except HSA-EBOV-3h),
scoring among the top three RSEM-EVAL values.
Trinity performed well regarding the detection of complete BUSCOs for most
of the data sets (Fig. 4.3). For the eukaryotic data sets, approximately half
of the detected complete BUSCOs are included multiple times in the assembly,
which could be a result of the sub-graphs Trinity relies on to detect different
isoforms of one transcript.
The accumulated metric scores for the Trinity assemblies resulted in one of
the top three scores (OMS=50.3, Fig. 4.2). Trinity achieved the best score for
the M. musculus data set (10/20), which might not be surprising because this
data set was, among others, used for the evaluation of the tool in the Trinity
paper [223]. Trinity also achieved good scores for the artificial data set (9/20)
and the HSA-EBOV-23h data (9/20). Interestingly, Trinity performed generally
well on the virus-infected data sets, except for the 3 h sample (2/20).
IDBA-Tran
IDBA-Tran was run with multiple k-mers and has no option for strand-specific
assembly.
For the E. coli (34.31 %), C. albicans (86.34 %), H. sapiens (64.61 %), and
H. sapiens EBOV 7 h (76.39 %) data sets, the tool showed the lowest re-mapping rates
in comparison to all other assemblers. The best mapping rate was achieved for
A. thaliana with 89.04 %. All other mapping rates are between 48.37 % (human
EBOV 23 h) and 85.34 % (human simulated).
IDBA-Tran shows the lowest percentage of uncovered bases in the assemblies,
suggesting that the contigs constructed by the tool are highly accurate. Accordingly,
the number of ambiguous bases is very low. Furthermore, some of the IDBA-Tran
assemblies rank within the top three assemblies regarding the optimal score
calculated by TransRate. The optimal scores of the IDBA-Tran assemblies are
comparable with the SOAPdenovo-Trans scores. Overall, the TransRate metrics
of the IDBA-Tran assemblies are generally good.
IDBA-Tran performed poorly regarding the Detonate RSEM-EVAL calculations.
For the E. coli, C. albicans, M. musculus, H. sapiens and HSA-EBOV-7h
data sets IDBA-Tran is placed last regarding this metric, and it never reaches the
top five (Tab. S8).
Furthermore, IDBA-Tran is one of the tools with the lowest number of complete
BUSCOs and the highest number of missing BUSCOs (Fig. 4.3 and Fig. S7).
Among the few complete BUSCOs, the assembler included almost no duplicate
contigs. Therefore, it seems that IDBA-Tran (although an MK approach)
does not perform well in constructing full-length transcripts and different isoforms.
In addition, the number of fragmented BUSCOs in the IDBA-Tran assemblies is
comparably high.
IDBA-Tran is placed in the midfield of all metric scores (OMS=40.3, Fig. 4.2)
and showed the best performance for the artificial data set (8/20), the E. coli data
(7/17), and the M. musculus data (7/20).
Shannon
The Shannon assembler was used with the single default k-mer value and, if suitable,
in strand-specific mode (--ss).
Shannon showed the most variable re-mapping rates, ranging between 30.77 % for
the human simulated data set and 96.51 % for A. thaliana. Interestingly, Shannon
had a low mapping rate on the simulated data, whereas all the other tools (except
Oases, 73.26 %) showed a mapping rate >85 %.
The Shannon assemblies do not result in good TransRate optimal scores.
For most of the data sets, the Shannon assemblies rank in the lower third of
optimal scores. However, the percentage of uncovered bases lies within the midfield
of all scores, and Shannon does not introduce many ambiguous bases into the
assembled transcriptome.
The RSEM-EVAL scores of Shannon vary among the assembled data sets. For
some assemblies, the tool performed very well (H. sapiens, HSA-EBOV-3h and
-7h), whereas for others (A. thaliana, HSA-EBOV-23h, simulated data) it completely
failed regarding this metric.
Shannon ranks in the midfield regarding the number of assembled complete
BUSCOs; however, the tool showed a relatively high number of duplicated
hits (Fig. 4.3). So far, this behavior was mainly observed for the MK approaches
like Trans-ABySS and Oases. Interestingly, for the simulated data, Shannon
showed the highest number of missing BUSCOs in comparison to the other
assemblers (Fig. S7).
Shannon achieved one of the lowest accumulated metric scores (OMS=22.6,
Fig. 4.2). The best metric scores were obtained for the assemblies of the
HSA-EBOV-3h and -7h data sets (7/20 and 9/20). All other metric scores are below 5.
Bridger
Bridger can only handle single k-mer values between 19 and 32, with a default of
25. While this range might be acceptable for most assemblies of short-read RNA-Seq
data, longer k-mers can be advantageous, especially for longer reads (such as those
produced by an Illumina MiSeq, >150 nt). Here, we used the default
k-mer size in the assemblies performed by Bridger. If possible, we also used
the strand-specific option of the tool (--SS_lib_type); however, for some of the
strand-specific RNA-Seq data sets Bridger failed (M. musculus and H. sapiens),
and so we executed the tool in the default unstranded mode. There seems to be a
problem with handling strand-specific paired-end data in this version of the tool. The
strand-specific assembly of the single-end E. coli data (--SS_lib_type F) ran
without problems.
Bridger showed quite good re-mapping rates between 87.35 % (E. coli) and
96.72 % (C. albicans).
Across almost all TransRate metrics, the Bridger assemblies rank in the
midfield and are far from the top scores produced by the
SPAdes, SOAPdenovo-Trans and IDBA-Tran assemblies.
Bridger assemblies are among the top four RSEM-EVAL scores over all data
sets; the tool therefore performs generally well according to this metric.
Furthermore, Bridger performed well in the detection of complete BUSCOs
with a moderate number of duplicated hits. The number of missing BUSCOs is low
(Fig. 4.3).
Bridger performed best for the E. coli data set (6/20) and the M. musculus
data set (6/20). However, the general performance of the tool is comparatively
modest, underlined by the lowest overall metric score of all assemblers (OMS=22,
Fig. 4.2).
BinPacker
BinPacker, built on the principles of Bridger, was also executed with a single
k-mer value and, if suitable, in strand-specific (-m F|RF) mode.
The re-mapping rate of BinPacker was quite low for some data sets.
For HSA-EBOV-3h (36.6 %) and M. musculus (54.31 %) BinPacker showed the
lowest mapping rates in comparison to the other tools. For all other data sets, the
mapping rate varies between 67.15 % (A. thaliana) and 96.66 % (C. albicans).
The BinPacker assemblies behave similarly to the Bridger assemblies regarding
the TransRate metrics, but are slightly worse in direct comparison, placing
BinPacker among the worst three tools according to the TransRate statistics.
On the other hand, BinPacker introduces only a small number of ambiguous bases
into the assemblies.
BinPacker ranks in the midfield or on the last places regarding the RSEM-EVAL
score, except on the human simulated data, where the tool achieves a score similar
to Bridger and reaches the third place (behind Trinity and Trans-ABySS).
Regarding the number of complete BUSCO detections, BinPacker performed
well on the C. albicans, HSA-EBOV-7h and human simulated data sets, but
completely failed for the others (Fig. 4.3). For these data sets, a large share of the
BUSCOs included in each database could not be identified in the BinPacker
assemblies (over 90 % for E. coli). Regarding the ortholog detection, BinPacker
had the worst performance.
Regarding the selected metrics, the performance of BinPacker is similar to the
performance of Bridger (OMS=28, Fig. 4.2). This observation is not surprising,
because BinPacker is built on the principles of Bridger. BinPacker showed
a more consistent behaviour than Bridger regarding the metric scores, but
only reached scores between 3 and 5 (Fig. 4.2).
Usability
We further rated our experiences regarding the installation and usability of each
tool (Tab. 4.1). These experiences might be subjective; nevertheless, we want to
share them here to give inexperienced users an idea of how difficult it is to get
each tool installed and executed. Some of the tools rely on many dependencies
and/or were difficult to compile, at least on our system without administrative
permissions (Shannon, SOAPdenovo-Trans, Trans-ABySS), while others could be
installed straight out of the box (SPAdes). Furthermore, some assemblers need
additional parameter files for execution (SOAPdenovo-Trans), are cumbersome to run
(Trans-ABySS, Oases, SOAPdenovo-Trans), needed additional preprocessing
steps of the reads for some of the data sets (IDBA-Tran assumes paired-end
reads to be in forward–reverse order), or did not terminate for all data
sets (Bridger), while others caused no problems and could be executed in a
straightforward manner (Trinity, SPAdes, BinPacker, IDBA-Tran).
Bridger failed in the path search step for some of the generated sub files.
Therefore, we combined the transcript output manually, because this is the
last step of the tool anyway. Furthermore, we had to start Bridger twice for each
data set, because the tool crashed each time after the first start but continued with
the assembly when started a second time on the same output folder.
In the past, Oases and Trans-ABySS were always cumbersome to run, because
the corresponding genome assemblers Velvet and ABySS needed to be executed
first with multiple k-mers. These difficulties have been mitigated by new
wrapper scripts provided by the developers that automatically execute the underlying
genome assemblers.
Computational efficiency
Since de novo transcriptome assembly can involve the analysis of large sequencing
data sets, computational efficiency is an important benchmark, especially for deep
sequencing projects and large sample sizes. Furthermore, it is highly recommended
to run multiple assemblies with different tools and parameter settings (for example
different k-mers), so computation time is an important property of each tool. Electronic
Supplement Fig. S10 summarizes the computation time and the maximum memory
peak for all data sets and assemblers.
Footnote 10: http://spades.bioinf.spbau.ru/release3.10.1/rnaspades_manual.html
proteins [224]. As we assembled the three human samples infected with EBOV at
three different time points individually, we were able to investigate how the different
assemblers perform on varying amounts of viral reads in the data (3 h ∼0.1 % viral
reads, 7 h ∼2 %, 23 h ∼20 %). Details about the amount of viral reads in each data
set can be found in Sec. 5.2 and Appendix A.2.
Interestingly, with a higher amount of viral reads, the performance of most of
the assembly tools dropped. For example, Trans-ABySS was able to construct
the full EBOV genome out of the 3 h (18,926 nt, 99.984 % sequence similarity) and
7 h (18,903 nt, 99.974 %) data sets, but failed on the 23 h data set with a viral read
contamination of roughly 20 % (many small contigs, longest hit: 8,500 nt).
In general, Trans-ABySS, SOAPdenovo-Trans, Shannon, Bridger, BinPacker
and SPAdes (-sc and -rna mode) performed well and constructed the full
EBOV genome out of the 3 h data set. In the BinPacker assembly we found
only one homologous sequence with a length of 18,896 nt and 99.984 % similarity.
Exactly the same contig was found in the assembly of Bridger, the precursor tool
of BinPacker. Oases produced many small contigs and a 16,149 nt hit with
similarity to the EBOV genome.
On the 7 h data set, with ∼2 % viral reads, Trans-ABySS, SOAPdenovo-Trans
and Shannon performed best. However, the longest hit found in the
Shannon assembly comprises only 17,107 nt and is accompanied by many other
fragments of different sizes with similarity to parts of the EBOV genome. Bridger
and BinPacker were only able to construct the same 10 kbp partial EBOV genome.
SPAdes-sc and SPAdes-rna assembled viral contigs up to a length of 12 kbp and
14 kbp, respectively.
Out of the 23 h data set (∼20 % viral reads), only SOAPdenovo-Trans was
able to construct the full EBOV genome (18,901 nt, 99.53 %), although the assembly
also includes many small contigs with similarity to the viral genome. Bridger and
BinPacker constructed contigs with a length of 14.8 kbp and 12 kbp, respectively.
All other assembly tools were not able to construct any longer contigs with similarity
to the EBOV genome out of this data set.
Interestingly, Trinity was the only tool that was not able to construct any
full-length EBOV genome out of the three data sets.
In summary, SOAPdenovo-Trans performed very well on all three data sets by
constructing accurate full-length contigs with high similarity to the EBOV genome.
Therefore, it could be interesting to evaluate the performance of SOAPdenovo-Trans
for the construction of RNA viral genomes out of meta-transcriptomic RNA-Seq data
in the future. If the amount of viral reads is low (∼0.1 %), all assembly tools
except Trinity, Oases and IDBA-Tran produced accurate viral contigs with high
similarity to the EBOV genome and a length >18 kbp.
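The contig lengths quoted above can be collected in a small lookup to see at a glance which tools reached the >18 kbp criterion at each time point. This is only a sketch: it contains just those tools for which a length is stated in the text, the kbp values are approximate as given there, and the names LONGEST_EBOV_HIT_NT and full_length_tools are hypothetical.

```python
FULL_LENGTH_MIN_NT = 18_000  # the ">18 kbp" criterion used in the text

# Longest EBOV-like contig (nt) per tool and time point, as quoted in the text.
# Values given in kbp in the text (10, 12, 14, 14.8 kbp) are approximate.
LONGEST_EBOV_HIT_NT = {
    "3h": {
        "Trans-ABySS": 18_926,
        "BinPacker": 18_896,
        "Bridger": 18_896,
        "Oases": 16_149,
    },
    "7h": {
        "Trans-ABySS": 18_903,
        "Shannon": 17_107,
        "Bridger": 10_000,
        "BinPacker": 10_000,
        "SPAdes-sc": 12_000,
        "SPAdes-rna": 14_000,
    },
    "23h": {
        "SOAPdenovo-Trans": 18_901,
        "Bridger": 14_800,
        "BinPacker": 12_000,
        "Trans-ABySS": 8_500,
    },
}


def full_length_tools(time_point, minimum_nt=FULL_LENGTH_MIN_NT):
    """Tools whose longest quoted hit exceeds the minimum length, sorted by name."""
    hits = LONGEST_EBOV_HIT_NT[time_point]
    return sorted(tool for tool, length in hits.items() if length > minimum_nt)


if __name__ == "__main__":
    for time_point in LONGEST_EBOV_HIT_NT:
        print(time_point, full_length_tools(time_point))
```

Tools that the text describes as successful without giving a contig length (for example SOAPdenovo-Trans on the 3 h data set) are deliberately not listed, so an empty entry here does not mean that a tool failed.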
constructing different isoforms, most likely a result of the missing multiple k-mer
approach.
Oases also performed generally well, for example when taking a look at the
BUSCO results (Fig. 4.3). However, the tool produced the highest number of
complete and duplicated hits, which might indicate that highly similar isoforms derived
from the multiple k-mer approach are not efficiently merged. For all data sets,
Oases also produced the highest number of contigs, but did not achieve the
best database coverage in all test cases. For example, the Oases assembly of the
H. sapiens data set comprises ∼207,000 transcripts with a length >1000 bp, covering
8 % of the reference transcripts (Electronic Supplement Tab. S9, Tab. 4.2). In
comparison, the Trans-ABySS assembly needs only ∼68,000 contigs to achieve a
database coverage of 29 %. Therefore, Oases can create good assembly results, but
also produces big assemblies with many contigs that might complicate and confuse
downstream analyses.
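The trade-off between assembly size and database coverage can be made explicit with a short calculation. The sketch below uses the H. sapiens values from Tab. 4.2 (transcripts >1000 bp and database coverage); the normalisation to 10,000 transcripts is our own illustrative choice and not one of the 20 metrics used in this study.

```python
# H. sapiens values from Tab. 4.2: (transcripts > 1000 bp, database coverage).
HUMAN_ASSEMBLIES = {
    "Oases": (207_474, 0.08),
    "Trans-ABySS": (68_662, 0.29),
}


def coverage_per_10k_transcripts(n_transcripts, database_coverage):
    """Database coverage contributed per 10,000 long transcripts."""
    return database_coverage / n_transcripts * 10_000


if __name__ == "__main__":
    efficiency = {
        tool: coverage_per_10k_transcripts(*values)
        for tool, values in HUMAN_ASSEMBLIES.items()
    }
    for tool, value in efficiency.items():
        print(f"{tool}: {value:.4f} coverage per 10,000 transcripts")
    ratio = efficiency["Trans-ABySS"] / efficiency["Oases"]
    print(f"Trans-ABySS / Oases: {ratio:.1f}x")
```

By this measure the Trans-ABySS assembly is roughly an order of magnitude more economical than the Oases assembly on this data set.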
The fastest tool on all data sets was SOAPdenovo-Trans. The
tool outperformed all other assemblers regarding the runtime (Electronic
Supplement Fig. S10). Combined with the moderate memory consumption, this makes
SOAPdenovo-Trans the most resource-efficient tool evaluated in this study.
However, it might be interesting to run multiple k-mer assemblies with
SOAPdenovo-Trans and use another assembly merge strategy (e.g., adopted from Oases or
Trans-ABySS) to merge the final transcripts resulting from each run. In general,
multiple k-mer approaches performed better than single k-mer approaches.
Here, we summarize some key conclusions from the comparative study:
(I) No tool performed dominantly best for all data sets. However, Trans-ABySS,
Trinity, SOAPdenovo-Trans and SPAdes performed consistently among
the best assembly tools (Fig. 4.2).
(II) SOAPdenovo-Trans performed best for the construction of the viral RNA
genome at all three time points tested.
(IV) Based on our results, we recommend applying different tools and parameter
settings for de novo transcriptome assembly, followed by the evaluation of the
output transcripts and the selection of the best-performing results. This general idea
needs to be investigated in more detail in future studies, because the selection
of the best assemblies based on appropriate metrics and also the subsequent
clustering procedure (without the loss of isoforms, avoidance of building chimeric
transcripts, redundancy) are still challenging and open tasks. In Sec. 4.2, a
proof-of-concept is presented, comparing the performance of different clustering
approaches on the single assemblies of the M. musculus data set presented
here.
The complementary performance of the top-performing tools motivated the
development of an ensemble method that combines the best-performing methods to achieve
an overall better assembly. Therefore, we developed the idea of a pipeline that
automatically selects the top-performing assemblies (or only the best transcripts
from each assembly) based on various metrics and clusters them based on sequence
similarity to achieve a more comprehensive assembly.
A common problem of many comparative studies is that they can only provide
limited suggestions based on the tools and data sets that were available at
the time they were carried out. If new versions of an assembly tool or
completely new tools are released, there is almost no way to estimate the advantages
(or disadvantages) without carrying out a new comparison study. Therefore, next
to the main focus of this study (comparing de novo short-read assembly tools on
various RNA-Seq data sets), we developed a pipeline that allows an easy integration
of updated versions and novel assembly tools into the comparison process. All
metrics, figures and tables are created by scripts and finally combined with the help of
an in-house Ruby script to build the electronic supplement. Evaluation metrics
can be changed, and additional metrics (next to the 20 selected by us) can be easily
added. If a new assembly tool needs to be included in the comparison, we just execute
it on each of the selected data sets and use the resulting multiple FASTA files as
input for the evaluation pipeline. The electronic supplement is then automatically
extended by the evaluation results for the new assembly tool.
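The extensibility described above can be pictured as a registry of metric functions that is applied to every assembly. The following Python sketch only illustrates this idea; it is not the in-house pipeline (which is script-based and combined by a Ruby script), and all names (METRICS, read_fasta_lengths, evaluate) are hypothetical. Only simple length-based metrics are included, with N50 computed as the length of the contig at which the cumulative length of the size-sorted contigs first reaches half of the total assembly length.

```python
def read_fasta_lengths(fasta_text):
    """Return the sequence lengths of all records in a (multi-)FASTA string."""
    lengths = []
    current = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length of the contig at which half of the total assembly length is reached."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


# Registry of metrics: adding a metric means adding one entry here.
METRICS = {
    "transcripts": len,
    "transcripts_gt_1000bp": lambda lengths: sum(1 for n in lengths if n > 1000),
    "N50": n50,
}


def evaluate(assemblies):
    """Apply every registered metric to every assembly (tool name -> FASTA text)."""
    results = {}
    for tool, fasta_text in assemblies.items():
        lengths = read_fasta_lengths(fasta_text)
        results[tool] = {name: metric(lengths) for name, metric in METRICS.items()}
    return results


if __name__ == "__main__":
    toy = {"new-assembler": ">t1\nACGTAC\n>t2\nACG\n"}
    print(evaluate(toy))
```

A new assembly tool is added by passing one more entry to evaluate, and a new metric by registering one more function, which mirrors the two extension points named in the text.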
Therefore, we can easily extend the comparison presented here and update all
tables and figures in the electronic supplement to investigate the performance of
upcoming tools for de novo assembly of short-read RNA-Seq data.
In the following section, we combine all assembly results, or only the
best-performing ones (based on our selected metrics), of the M. musculus data set to
serve as a proof-of-concept. Different parameter settings for clustering the single
assemblies are evaluated. The information gathered from this
proof-of-concept study may further contribute to the development of new cluster methods
for the construction of transcriptome assemblies.
For the large bioinformatics community working in the area of RNA-Seq, the
development of a high-performing (accurate and fast) de novo transcriptome cluster
workflow to automatically select and combine the output of top-performing assembly
tools remains an important and challenging task.
4.2. Cluster de novo transcriptome assemblies: a proof-of-concept
Recent studies already used such merging approaches to combine multiple as-
semblies for annotation, comparative studies, quantification and differential gene
expression [226, 232–235]. In most cases, the multiple FASTA files resulting from all
assembly runs are concatenated into one big FASTA file and clustered by sequence
similarity with CD-HIT-EST [146]. However, the quality of assemblies resulting
from such clustering approaches has never been systematically investigated. Another
problem is the high redundancy that can be introduced by clustering the output of
multiple assembly runs.
Transfuse
Transfuse is currently under development (footnote 14) and based on the output of
TransRate [229], a transcriptome evaluation pipeline already used for the calculation of
several metrics in Sec. 4.1 and briefly explained in Sec. 4.1.2.
Transfuse merges multiple de novo transcriptome assemblies. The input consists of
multiple assemblies calculated with different de novo assemblers, or with different
parameter settings in the same assembler. The output is a single high-quality assembly
representing the transcriptome.
To cluster the multiple assemblies, Transfuse does not only rely on the contigs
(like CD-HIT-EST), but also takes the reads used to perform the transcriptome
assemblies as an input. The idea is to first calculate reference-free assembly scores
for each single assembly and transcript with TransRate, followed by the clustering
of the transcripts that pass the filter. A multiple sequence alignment is calculated
for each cluster, and then a splice graph is resolved for each alignment to generate
merged contigs. According to the authors, the tool is still in development, but it
appears to be performing well and the manuscript is in preparation.
Currently, Transfuse, like TransRate, can only work with paired-end RNA-Seq
data.
In Fig. 4.4 we visualize the workflow presented in Sec. 4.1 and how the different
clustering setups were conducted here. (1) – At first, we clustered the
transcripts of all ten assemblies with CD-HIT-EST (Fig. 4.4). (2) – In a second
approach, we selected only the best six assemblies according to our results
in Sec. 4.1 before clustering again with CD-HIT-EST. The top six assemblies were
Trinity, Trans-ABySS, SOAPdenovo-Trans, IDBA-Tran, SPAdes-sc, and
SPAdes-rna and are additionally marked in Fig. 4.4. (3) – In a third approach, we
selected from all ten assemblies only those contigs for clustering with CD-HIT-EST
that were previously defined as good based on the TransRate scoring. (4) – In a last
setup involving the clustering with CD-HIT-EST, we selected only the TransRate
good contigs from the best assemblies (a combination of (2) and (3)) for clustering.
For each approach, the CD-HIT-EST clustering was performed with three similarity
thresholds (-c parameter): 1.0, 0.99 and 0.95. (5) – We further used the unpublished
Transfuse pipeline (utilizing the TransRate scoring scheme) to merge all
ten assemblies. In summary, we built 13 merged assemblies (Fig. 4.4).
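The 13 merged assemblies follow from four CD-HIT-EST setups times three identity thresholds plus one Transfuse run. The sketch below enumerates them and builds the corresponding cd-hit-est command lines without executing anything. It rests on several assumptions: only the -i, -o and -c options of cd-hit-est are used, the input file names are hypothetical, the labels CD-HIT and CD-HIT-TransRate for setups (1) and (3) are formed by analogy to the labels used in this section, and the Transfuse call is left open because its command line is not described here.

```python
from itertools import product

# Setup label -> content of the concatenated input FASTA file.
# "CD-HIT" and "CD-HIT-TransRate" are assumed labels for setups (1) and (3).
SETUPS = {
    "CD-HIT": "all contigs of all ten assemblies",                              # (1)
    "CD-HIT-Metrics": "all contigs of the six best assemblies",                # (2)
    "CD-HIT-TransRate": "TransRate good contigs of all ten assemblies",        # (3)
    "CD-HIT-TransRate-Metrics": "TransRate good contigs of the six best assemblies",  # (4)
}

IDENTITY_THRESHOLDS = (1.0, 0.99, 0.95)  # values of the cd-hit-est -c parameter


def cd_hit_est_command(input_fasta, output_fasta, identity):
    """Build (but do not run) a cd-hit-est call for one identity threshold."""
    return ["cd-hit-est", "-i", input_fasta, "-o", output_fasta, "-c", str(identity)]


def merged_assembly_plan():
    """Map the label of each merged assembly to the command that would create it."""
    plan = {}
    for setup, identity in product(SETUPS, IDENTITY_THRESHOLDS):
        label = f"{setup}-{round(identity * 100)}p"
        # Hypothetical file names: one concatenated multi-FASTA file per setup.
        plan[label] = cd_hit_est_command(f"{setup}.concat.fasta", f"{label}.fasta", identity)
    plan["Transfuse"] = None  # (5); command line intentionally not specified
    return plan


if __name__ == "__main__":
    for label, command in merged_assembly_plan().items():
        print(label, " ".join(command) if command else "(Transfuse run)")
```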
For the evaluation and comparison of the merged assemblies with the single assembly
tools, we used the same metrics and evaluation software as previously described in
Sec. 4.1.
Detonate
We investigated the RSEM-EVAL scores calculated by Detonate for each of the
13 clustered assemblies (Tab. 4.3). The scores arrange between -2.10 (CD-HIT-
Metrics-99p and Transfuse) and -2.60 (CD-HIT-TransRate-Metrics-100p). The best
RSEM-EVAL scores for single assemblies were achieved by Trans-ABySS (-2.14),
Trinity (-2.26) and Bridger (-2.37) (Electronic Supplement Tab. S8 and Ap-
87
Chapter 4. Transcriptome Assembly
Sec. 4.1
RNA-Seq data set
Mouse
2x 43 Mio reads
76 bp, strand-specific
Preprocessing
FastQC Prinseq
Bridger BinPacker
Trinity Oases
SPAdes-rna SPAdes-sc
Shannon IDBA-Tran
Metrics + Selected
TransRate
TransRate metrics
Select 'good' (3) Select 'good' (4) (2)
Select 'best'
contigs from contigs from
assemblies
all assemblies 'best' assemblies
Transfuse CD-HIT-EST
(1)
(5)
100% 99% 95%
Clustering
Merged assemblies
CD-HIT-TransRate-Metrics-100p
CD-HIT-TransRate-Metrics-99p Transfuse
CD-HIT-TransRate-Metrics-95p
Figure 4.4: Proof-of-concept transcriptome pipeline. We chose the Mus musculus data set pre-
sented in Sec. 4.1 and the corresponding assemblies to evaluate the performance of different clus-
tering approaches and parameter settings to build a merged assembly. (1) – we clustered the tran-
scripts of all ten assemblies with CD-HIT-EST [146]. (2) – before clustering with CD-HIT-EST,
we selected only the best six assemblies according to the 20 metrics defined in Sec. 4.1. The top
six assemblies are marked with a star. (3) – From all ten assemblies we selected only those contigs
for clustering that were previously defined as good based on the TransRate scoring. (4) – We
select only the TransRate good contigs from the best assemblies (combination of (2) and (3)) for
clustering. The clustering is performed on three similarity thresholds (−c parameter): 1.0, 0.99
and 0.95. (5) – We further used the unpublished Transfuse pipeline (utilizing the TransRate
scoring scheme) to merge all ten assemblies. In summary, we build 13 merged assemblies.
88
Table 4.3: Selected metrics based on the output of rnaQUAST, Hisat2, Detonate, TransRate and BUSCO for the transcripts clustered with 13 different
approaches (Fig. 4.4) based on the single assemblies created previously for the Mus musculus RNA-Seq strand-specific paired-end library with read length
76 bp (Sec. 4.1). In each row the top three values are indicated with bold italic. We rated those approaches as top scoring that ended up with the lowest
amount of transcripts, because the goal of merging multiple assemblies should be to reduce the size of the overall assembly by still keeping the correct
representatives for each transcript. The RSEM-EVAL score is multiplied by 109 . C+SC – complete and single-copy BUSCOs.
89
                                            CD-HIT                CD-HIT-Metrics              CD-HIT-TransRate      CD-HIT-TransRate-Metrics Transfuse
Metric                    100p       99p       95p      100p       99p       95p      100p       99p       95p      100p       99p       95p
>1000 bp               123,651    89,973    73,251    57,236    33,897    24,969    36,419    24,328    21,603    26,054    16,995    15,263    41,935
Database coverage         0.23     0.191     0.162     0.238     0.208     0.185     0.147     0.117     0.109      0.15     0.125     0.119     0.181
Misassemblies           59,484    57,891    53,545     1,582     1,475     1,289     5,210     5,186     5,072       321       308       290     7,051
Isoform coverage         0.556     0.522     0.484     0.588     0.571     0.545     0.579     0.558     0.552     0.588     0.574     0.571     0.649
TransRate
Optimal score            0.003     0.007     0.011     0.010     0.067     0.127    0.0539     0.166     0.221     0.094     0.268     0.341     0.014
Detonate
RSEM EVAL                -2.49     -2.35     -2.29     -2.19     -2.10     -2.11     -2.33     -2.29     -2.30     -2.60     -2.58     -2.59     -2.10
BUSCO
C+SC BUSCOs                493     1,178     1,665       761     2,064     2,745     1,309     2,873     3,347     1,667     3,172     3,507     1,167
Missing BUSCOs           1,866     1,854     1,849     1,892     1,893     1,894     1,920     1,921     1,922     1,962     1,964     1,965     1,872
4.2. Cluster de novo transcriptome assemblies: a proof-of-concept
Chapter 4. Transcriptome Assembly
pendix Tab. B.4), whereas BinPacker (-4.89) and IDBA-Tran (-5.03) performed
worst. All RSEM-EVAL scores calculated for the merged assemblies fall within
the range of the top three scores of the single assemblies. RSEM-EVAL scores better than the
score achieved with the top-performing tool regarding this metric (Trans-ABySS)
were obtained with the CD-HIT-Metrics-99p clustering (-2.10), the Transfuse clustering
(-2.10), and the CD-HIT-Metrics-95p clustering (-2.11). In general, merging only the
transcripts of the preselected best-performing assembly tools (Fig. 4.4) according
to the 20 metrics defined in Sec. 4.1 worked well and outperformed the other clustering
approaches according to the RSEM-EVAL score. Interestingly, the merged
assembly calculated by Transfuse achieved the same top score (-2.10) as the
CD-HIT-Metrics-99p approach. Using only the good contigs (defined by TransRate) of
the best-performing assemblers seems to be too restrictive to achieve better
RSEM-EVAL scores (-2.58, -2.59, -2.60). The RSEM-EVAL scores were multiplied by
10^9 for clarity and easier comparison. Furthermore, within each clustering approach
(all assemblies, best assemblies, good contigs, good contigs of best assemblies)
the CD-HIT-EST clustering achieved the best RSEM-EVAL score with an identity
threshold of 0.99.
In summary, the Transfuse merging and using a combination of different met-
rics to preselect the best performing assemblies and all contigs of these assemblies
for clustering achieved the best Detonate evaluation results (Tab. 4.3).
TransRate
For the single assemblies, the best TransRate scores were achieved for SPAdes-rna
(0.43), SOAPdenovo-Trans (0.40), and SPAdes-sc (0.37) (Electronic Supple-
ment Tab. S9 and Appendix Tab. B.4). BinPacker (0.13), Trans-ABySS (0.09)
and Oases (0.02) performed worst. For the merged assemblies, there was not a sin-
gle TransRate score calculated that outperforms the results of the top three single
assembly tools regarding this metric (Tab. 4.3). The best score was achieved by the
CD-HIT-TransRate-Metrics-95p clustering (0.34), followed by CD-HIT-TransRate-
Metrics-99p (0.27), and CD-HIT-TransRate-95p (0.22). The clustering approach
involving the transcripts of all ten single assemblies did not perform well regarding
the TransRate scoring. The CD-HIT-99p and CD-HIT-100p assemblies achieved
a TransRate score below 0.01. The best TransRate scores were achieved for the
merged assemblies with metric preselection and the TransRate good-contig filter.
Admittedly, this also introduces a bias into the evaluation: if we only select the
transcripts for merging that previously achieved a good TransRate score, the
resulting assemblies should perform better when evaluated again with the same
metric. However, using only the TransRate-defined good contigs of the six best
performing assembly tools (Fig. 4.4) improved the TransRate score even further
(Tab. 4.3). For example, the CD-HIT-TransRate-95p clustering achieved a score
of 0.22, whereas the CD-HIT-TransRate-Metrics-95p clustering achieved a score of 0.34. The
same holds for the CD-HIT-EST clustering with an identity threshold of 99 % (0.17
using the good transcripts of all assemblies and 0.27 using only the good contigs
of the six best assemblies) and with a threshold of 100 % (0.05 vs. 0.09). Interestingly,
the Transfuse merging of all ten single assemblies did not perform well
(0.01) although the tool is based on the output of TransRate.
rnaQUAST
First of all, the number of transcripts in the merged assemblies varies widely between
43,108 (CD-HIT-TransRate-Metrics-95p) and 418,712 (CD-HIT-100p) (Tab. 4.3).
This behaviour can be explained by the amount of restriction each clustering
approach applies to the input data. In the most restrictive approach
(CD-HIT-TransRate-Metrics-95p) only few transcripts are used as input and clustered with
a low identity threshold of 95 %. Therefore, only 15,263 transcripts with a length
>1000 bp are left in the final assembly, whereas the CD-HIT-100p approach still
comprises 123,651 transcripts. Those numbers are especially interesting for further
downstream analyses like the annotation of the transcripts or differential gene ex-
pression estimations. The goal of merging the transcripts of multiple assemblies
should be to finally obtain as few transcript sequences as possible without losing
important information, such as different isoforms.
Whereas most clustering approaches introduced only a moderate amount of mis-
assembled transcripts (290 (CD-HIT-TransRate-Metrics-95p) – 7,051 (Transfuse)) in
the final assembly, the simple CD-HIT-EST clustering approaches of all transcripts
of all ten single assemblies resulted in >53,000 misassemblies (Tab. 4.3). When com-
paring with the number of misassemblies of the single assemblies (Tab. B.4), this high
number most likely derives from the Oases assembly (52,665 misassemblies). The
lowest numbers of misassemblies were achieved for SPAdes-rna (30), IDBA-Tran
(41), and SOAPdenovo-Trans (61). Therefore, all clustering approaches seem to
introduce novel misassemblies or accumulate present ones in the final assemblies.
However, the CD-HIT-TransRate-Metrics approaches performed best regarding the
number of misassemblies (290–321, Tab. 4.3).
The mean isoform coverage calculated by rnaQUAST revealed Transfuse (0.649),
CD-HIT-Metrics-100p (0.588), and CD-HIT-TransRate-Metrics-100p (0.588) as the
top performing approaches regarding this metric (Tab. 4.3). In summary, almost all
tools performed with only slight differences in the mean isoform coverage rate. How-
ever, for the single assemblies mean isoform coverage values of 0.81 (BinPacker),
0.66 (Trinity), and 0.62 (Trans-ABySS) were achieved (Electronic Supplement
Tab. S9, Appendix Tab. B.4).
In conclusion, the CD-HIT-EST algorithm does not seem to be optimal for recovering
highly similar isoforms. As Transfuse also takes the paired-end read information
into account, its resolution of different isoforms is higher (0.649) in comparison
to the other twelve clustering approaches. Therefore, CD-HIT-EST should be used
carefully and with a high identity threshold in order to keep similar isoforms in the
merged assembly.
BUSCO
The BUSCO assessment results are shown in Fig. 4.5. When only comparing the
single assembly tools (Fig. 4.5A), Trans-ABySS (3,989), Trinity (3,921) and
Bridger (3,914) produced the assemblies with most complete orthologs to the
BUSCO Euarchontoglires data set. Trans-ABySS, Trinity, Oases and Bridger
also assembled many complete and duplicated transcripts, whereas IDBA-Tran,
Shannon, SPAdes-rna and Oases included many fragmented hits.
By combining the transcripts of the single assemblies, we were able to increase
the maximum number of complete BUSCOs from 3,989 (Trans-ABySS) to 4,111
(CD-HIT-95p, Fig. 4.5B). However, higher redundancy is introduced in the clustered
assemblies, as many more complete and duplicated transcripts are detected
by BUSCO (1,969 for Trans-ABySS and 3,598 for CD-HIT-100p). By lowering
the similarity threshold of CD-HIT-EST, this number can be reduced (for example
2,446 complete and duplicated BUSCOs for CD-HIT-95p, Fig. 4.5).
By using the metrics defined in Sec. 4.1 as a filter and additionally only the
contigs defined as good by the TransRate evaluation, we could heavily reduce the
overall number of contigs in the final assembly without losing too much sensitivity
and specificity of the assembly. Whereas the CD-HIT-100p assembly still comprises
418,712 sequences, the CD-HIT-TransRate-Metrics-95p assembly consists of only 43,108
contigs. The number of ambiguous N bases also decreases (in this case from 426
million to 53 million).
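The BUSCO numbers quoted above come from the compact summary notation that BUSCO prints and that labels the bars in Fig. 4.5, for example C:3989 [S:2020, D:1969], F:274, M:1929, n:6192 for Trans-ABySS. As a minimal sketch, the following lines parse such a string and check its internal consistency (C = S + D and C + F + M = n). The function names are hypothetical and not part of BUSCO.

```python
import re


def parse_busco(summary):
    """Parse a BUSCO short summary such as
    'C:3989 [S:2020, D:1969], F:274, M:1929, n:6192' into a dict of integers."""
    return {key: int(value) for key, value in re.findall(r"([CSDFMn]):(\d+)", summary)}


def busco_percentages(counts):
    """Express each category as a percentage of all searched BUSCOs (n)."""
    return {key: round(100 * counts[key] / counts["n"], 1) for key in "CSDFM"}


def is_consistent(counts):
    """Complete = single-copy + duplicated; complete + fragmented + missing = n."""
    return (counts["C"] == counts["S"] + counts["D"]
            and counts["C"] + counts["F"] + counts["M"] == counts["n"])


if __name__ == "__main__":
    trans_abyss = parse_busco("C:3989 [S:2020, D:1969], F:274, M:1929, n:6192")
    cd_hit_100p = parse_busco("C:4091 [S:493, D:3598], F:235, M:1866, n:6192")
    for label, counts in (("Trans-ABySS", trans_abyss), ("CD-HIT-100p", cd_hit_100p)):
        print(label, is_consistent(counts), busco_percentages(counts))
```

Both summaries shown in Fig. 4.5 pass the consistency check, and the duplicated fraction makes the added redundancy of the merged assembly directly visible.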
[Figure 4.5, bar plots. Panel A (Single assemblies), e.g. Trans-ABySS C:3989 [S:2020, D:1969], F:274, M:1929, n:6192. Panel B (Merged assemblies), e.g. B.1 CD-HIT-100p C:4091 [S:493, D:3598], F:235, M:1866, n:6192. x-axis: %BUSCOs (0 to 100). Legend: Complete (C) and duplicated (D); Complete (C) and single-copy (S).]
Figure 4.5: BUSCOs for M. musculus single and clustered assemblies. (A) shows the BUSCO
results for each single assembly tool (compare Fig. 4.3). The six assemblies that performed best
according to the metrics presented in Sec. 4.1 are marked with a star. In (B) the BUSCO hits for
the different merging approaches are shown. B.1 – All ten assemblies produced by the single tools
shown in (A) are clustered with CD-HIT-EST. B.2 – Only the six best assemblies (according to the
metrics defined in Sec. 4.1, marked in (A)) are clustered by CD-HIT-EST. B.3 – Only those contigs
defined as good by TransRate are clustered. B.4 – Good TransRate contigs were only selected
from the six best assemblies and clustered. For each of the four merging approaches (B.1–B.4)
CD-HIT-EST was applied with three different sequence identity thresholds (−c parameter: 1.0,
0.99, 0.95). This sequence identity threshold is calculated as the number of identical nucleotides in
the alignment divided by the full length of the shorter sequence. In B.5 the results for the merging
by Transfuse are given. For a description of the BUSCO assessment results see Sec. 4.1.2.
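The identity definition given in the caption (identical nucleotides in the alignment divided by the full length of the shorter sequence) can be illustrated with a short worked example. This is a minimal sketch of the formula only, with hypothetical function names and made-up sequence lengths; it does not reproduce the alignment or the greedy clustering performed by CD-HIT-EST.

```python
def sequence_identity(identical_nucleotides, len_a, len_b):
    """Identity as described in the caption: identical nucleotides in the
    alignment divided by the full length of the shorter sequence."""
    shorter = min(len_a, len_b)
    if shorter <= 0:
        raise ValueError("sequence lengths must be positive")
    return identical_nucleotides / shorter


def passes_threshold(identity, threshold):
    """A contig is merged into a cluster if its identity reaches the -c threshold."""
    return identity >= threshold


if __name__ == "__main__":
    # Hypothetical pair: a 1,000 nt contig aligned to a 1,500 nt contig
    # with 990 identical positions.
    identity = sequence_identity(990, 1000, 1500)
    for c in (1.0, 0.99, 0.95):
        print(c, passes_threshold(identity, c))
```

In this example the shorter contig would be merged at thresholds 0.99 and 0.95 but kept as a separate sequence at 1.0, which is why lower thresholds yield fewer but potentially isoform-collapsing clusters.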
has its limitations, for example regarding the efficient merging of isoforms.
The Transfuse software (currently under development) already goes in this
direction; however, the pipeline can only work with paired-end data at the moment.
Many RNA-Seq projects are still based on single-end data or sequencing
techniques different from Illumina's paired-end protocol. Therefore, the usage of
Transfuse is currently restricted to a limited set of NGS projects.
This proof-of-concept and the scripts already developed here will be used as
a starting point for the implementation of an automated pipeline for the efficient
calculation, evaluation and clustering of de novo transcriptome assemblies [17]. To
achieve this, we will 1) define meaningful reference-free metrics, 2) automate the
detection of the best assemblies and/or contigs, and 3) finally merge them into a
comprehensive de novo transcriptome assembly based on short-read RNA-Seq data.
Chapter 5
Chapter 5. Differential Gene Expression
The second part of this chapter (Sec. 5.2) was performed in cooperation with
the virology group of Prof. Dr. Stephan Becker at the Philipps University Marburg.
The project presented there confronted us with much more complicated problems.
First, at the start of this project no genome of the fruit bat Rousettus
aegyptiacus was available, so we decided to construct a comprehensive de novo transcriptome
assembly with various tools (see Chapter 4) to find differentially expressed
genes of this bat. Furthermore, the lack of biological replicates and the high replication
rate of the Ebola virus made this project one of the most challenging ones
during my PhD. Nevertheless, this section presents an integrated analysis that
combines genome reference data with transcriptome assembly data
for human and bat, respectively. In this project, we also showed by example that
differentially expressed genes identified with a genome and a transcriptome reference
approach are actually comparable. This project is accompanied by a comprehensive
Electronic Supplement as well as an interactive gene observer, both available
online (http://www.rna.uni-jena.de/supplements/filovirus_human_bat/).
5.1. Differential effects of vitamins on human monocytes after infections
causes of systemic mycoses [220]. During these systemic infections, monocytes play
a central role in the host defense contributing not only to pathogen recognition,
but also as phagocytes and effector cells [266]. Hence, in this exhaustive study, we
analyzed the noteworthy immunomodulatory role of vitamins on human monocytes.
Monocyte isolation
Human monocytes were isolated from 500 ml fresh whole blood (drawn within 1 h
before use) of healthy male donors. Blood was layered onto an equal volume of 1-Step
Polymorphs (Accurate Chemical & Scientific Corporation, USA) and centrifuged at
650 × g for 35 min. After centrifugation, the peripheral blood mononuclear cells
(PBMCs) were collected, and normal osmolarity was restored by adding an equal
volume of 0.45 % cold NaCl. After erythrocyte lysis using a hypotonic buffer, cells
were washed twice in cold PBS and counted using a Neubauer chamber. Cell viability
of >95% was assessed by trypan blue staining. Monocytes were isolated from the
PBMCs using the monocyte isolation kit II and quadro-MACS (Miltenyi Biotec,
UK), following manufacturer’s instructions.
Ethics statement
The blood of healthy male donors was drawn after written informed consent, in
accordance with the Declaration of Helsinki. All protocols were approved by the
Ethics Committee of the University Hospital Jena (permit number: 3639-12/12).
Stimulation assays
Monocytes were resuspended at 5 × 10^6 cells/ml in RPMI 1640 GlutaMAX medium
(Gibco, UK) supplemented with 10 % FBS (Biochrom, Germany) and 1 % Penicillin/Streptomycin
(Thermo Fisher Scientific, USA). They were seeded on 6-well
plates (VWR International, Germany) and allowed to equilibrate at 37 °C and
5 % CO2 for 2 h. Cells were then pre-incubated with 1 µM atRA or 1α,25(OH)2D3
for 30 min. Then, the heat-killed pathogens were added at a pathogen:host ratio of
1:1 for C. albicans yeast and A. fumigatus germ tubes, and 10:1 in case of E. coli
stimulation. After 6 h of incubation at 37 °C and 5 % CO2, cell viability >90 % was
assessed by trypan blue staining, and the monocytes were harvested for RNA isolation.
The whole experimental workflow is depicted in Fig. 5.1.
In total, we had four different immune-stimulatory settings (w/o infection, A. fu-
migatus infection, C. albicans infection and E. coli infection), in each of which we
aimed to address the effect of vitamin A (atRA) or vitamin D supplementation.
Figure 5.1: Experimental workflow. Human monocytes were isolated from fresh whole blood and
purity of the cells was analyzed by flow cytometry. Upper scatterplot: Forward scatter (FSC)
and side scatter (SSC) measurement. Lower scatterplot: Fluorescence intensities of cells stained
with FITC-conjugated CD14 antibody and APC-conjugated CD16 antibody. Monocytes were
then pre-incubated with vitamin A (atRA) or vitamin D, followed by stimulation with heat-killed
A. fumigatus, C. albicans or E. coli for 6 h. Poly-(A) RNA was isolated from the monocytes and
subjected to RNA sequencing.
RNA sequencing
RNA was isolated from 5 × 10^6 monocytes using the RNeasy Mini Kit (Qiagen, Germany).
An additional step was included to remove the residual genomic DNA using
DNaseI (Qiagen, Germany). Total RNA was quantified using a Nanodrop ND-
1000 spectrophotometer (Thermo Fisher Scientific, USA). The quality of the RNA
samples (RNA Integrity Number (RIN) values ≥ 7.0) was measured using a Tape
Station 2200 (Agilent Technologies, USA). Poly-(A) RNA was purified from 2 µg of
total RNA using the Dynabeads mRNA DIRECT Micro Purification Kit (Thermo
Fisher Scientific, USA), according to manufacturer’s instructions. Quality control
for the depletion of rRNA was carried out using High Sensitivity RNA Screen Tapes
(Agilent Technologies, USA).
Strand-specific whole transcriptome libraries were prepared using the Ion Total
RNA-Seq Kit v2.0 (Thermo Fisher Scientific, USA). RNase III was employed to
fragment the purified RNA. Ion adapters were ligated to the resulting fragments, and
reverse transcription was performed using the SuperScript III Enzyme Mix (Thermo
Fisher Scientific, USA). Barcoded primers were used to amplify the libraries with the
Platinum PCR High Fidelity polymerase (Thermo Fisher Scientific, USA). Size distribution
analysis and quantification of the final barcoded libraries were performed on
D1000 Screen Tapes on the Tape Station 2200 (Agilent Technologies, USA). Library
templates were clonally amplified on Ion Sphere particles using the Ion PI Hi-Q Chef
Kit and Ion Chef instrument (Thermo Fisher Scientific, USA), loaded onto Ion PI
Chips and sequenced on an Ion Proton Sequencer (Thermo Fisher Scientific, USA).
For sequencing, in total 48 samples were multiplexed on 12 chips. The raw sequence
data in FASTQ format are stored in the Sequence Read Archive (SRA) at the National
Center for Biotechnology Information (NCBI) and can be accessed via the NCBI homepage
(https://www.ncbi.nlm.nih.gov/; accession number: SRP076532).
Mapping
The quality-controlled and rRNA-cleaned reads were aligned to reference genomes
using Segemehl [75] (v0.2.0). The mapping was performed against the human
genome version GRCh38, downloaded from Ensembl (release 80). Indices were
built together with the corresponding pathogen genomes, depending on the samples
to be mapped. For E. coli, the complete genome of strain K-12 substr. MG1655
(NC_000913.3) was downloaded from NCBI. Genomes of A. fumigatus and C. albicans
were obtained from aspergillusgenome.org (21.05.2015) and candidagenome.org
(Ca21, 28.05.2015). All mappings were performed with default parameters
and the -splits option of Segemehl to allow for multiple spliced read
alignments.
3 http://www.rna.uni-jena.de/supplements/fungi_infection/
Gene filtering
In order to filter out low-expressed mRNAs, we calculated for each gene the transcripts
per kilobase per million (TPM) value to eliminate potential biases due to the
transcript length in normalized read counts [79]:

\[
\mathrm{TPM}_i = \frac{c_i}{l_i} \cdot \left( \frac{1}{\sum_{j \in N} \frac{c_j}{l_j}} \right) \cdot 10^6
\]

where c_i is the raw read count of gene i, l_i is the length of gene i and N is the
set of all genes in the given annotation.
For each gene, we calculated four mean TPM values (TPM_M), based on the
12 samples corresponding to control or one of the three different infection types.
Subsequently, for each stimulatory setting we used TPM = 5 as a minimum limit for
detectability [31] of transcripts.
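The TPM normalization and the detectability filter described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code used in the study: the function names, the gene identifiers and the example counts are hypothetical, and the filter is assumed to keep genes whose mean TPM is at least 5.

```python
def tpm(counts, lengths):
    """TPM_i = (c_i / l_i) / sum_j (c_j / l_j) * 10^6 for all genes of one sample."""
    rates = {gene: counts[gene] / lengths[gene] for gene in counts}
    total = sum(rates.values())
    return {gene: rate / total * 1e6 for gene, rate in rates.items()}


def mean_tpm(per_sample_tpm):
    """Mean TPM per gene over the samples of one stimulatory setting."""
    genes = per_sample_tpm[0].keys()
    n = len(per_sample_tpm)
    return {gene: sum(sample[gene] for sample in per_sample_tpm) / n for gene in genes}


def detectable(mean_values, minimum=5.0):
    """Genes whose mean TPM reaches the detectability limit (assumed: >= 5)."""
    return {gene for gene, value in mean_values.items() if value >= minimum}


if __name__ == "__main__":
    lengths = {"geneA": 1000, "geneB": 3000, "geneC": 500}   # hypothetical lengths
    sample1 = tpm({"geneA": 100, "geneB": 300, "geneC": 0}, lengths)
    sample2 = tpm({"geneA": 120, "geneB": 240, "geneC": 0}, lengths)
    means = mean_tpm([sample1, sample2])
    print(means)
    print(sorted(detectable(means)))
```

By construction the TPM values of one sample sum to 10^6, so a fixed cut-off such as 5 has the same meaning in every sample regardless of sequencing depth.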
To compare the effect of atRA and vitamin D during different infections, log2 fold
changes (FC) as computed by DESeq2 were visualized using scatter plots in R. The
scatter plots were overlaid with contour plots of a two-dimensional kernel density estimate
(kde2d; MASS package) using the default parameters. Outliers were labeled with the
respective gene names. Box plots of selected gene expression patterns were visualized
with the help of [85].
Heat maps
Pairwise comparisons were carried out to address the effect of each pathogen stimula-
tion (unstimulated samples versus pathogen-stimulated samples) and also the effect
of the vitamin-mediated regulation in each stimulatory setting (pathogen-stimulated
samples versus pathogen-/vitamin-stimulated samples).
K-means clustering was performed on variance-stabilized read counts to build
a heatmap (selected gene set with adjusted p-value ≤ 0.05) in R. For this, the
pheatmap function was applied with the kmeans option and Euclidean clustering
distance of the rows. Beforehand, the model-based optimal number of clusters was
determined using Mclust of the mclust package in R [271, 272]. The genes assigned
to the resulting clusters were annotated by gene ontology analysis using the
PANTHER (Protein ANalysis THrough Evolutionary Relationships) classification
system (http://pantherdb.org/; http://geneontology.org/; [273, 274]) and the
Partek Genomics Suite 6.6 (Partek, USA). Pathway analysis
was performed using the Partek Pathways software tool (Partek, USA), which
employs the KEGG pathway database. Furthermore, for all genes that are affected by
either atRA or vitamin D treatment during any infection setting, STRING networks
were generated using the STRING database (http://string-db.org [275]). In order to
identify key pathways involved in either atRA- or vitD-mediated immunomodulation across
the three infections, we applied the KeyPathwayMiner tool [276, 277]. For each vitamin,
we used the log2 fold changes derived from the comparisons of pathogen-stimulated
samples versus pathogen-/vitamin-stimulated samples. All three input tables for
each pathogen (and each vitamin, respectively) were logically connected with AND,
and the parameters K and L were kept as default. All nodes of the resulting networks
represent genes that were either significantly down-regulated by the vitamins or
inferred by KeyPathwayMiner to connect subnetworks.
Stimulation assays were repeated for an earlier time point. After three hours of stimulation,
RNA was isolated as previously described. Complementary DNA (cDNA)
was synthesized from 1.5 µg of RNA using the High Capacity cDNA Reverse Transcription
Kit (Applied Biosystems, UK) following manufacturer’s instructions. For
PCR analysis, specific primers for each target gene were designed using the online
Primer-BLAST tool of the National Center for Biotechnology Information
(NCBI, http://www.ncbi.nlm.nih.gov/tools/primer-blast/). In
order to improve the PCR efficiency, possible secondary structures of the amplicons
were taken into account by characterizing their nucleotide sequence using the Mfold
algorithm [278].
To quantify the relative expression of each gene, a Corbett Rotor-Gene 6000
(Qiagen, Germany) was used as real-time qPCR instrument. Each sample was analyzed
in a total reaction volume of 20 µl containing 10 µl of 2× SensiMix SYBR
Master Mix (Bioline, UK) and 0.2 µM of each primer. All qPCRs were set up using
a CAS-1200 pipetting robot (Qiagen, Germany). The cycling conditions were 95 °C
for 10 min followed by 40 cycles of 95 °C for 15 s, 60 °C for 20 s and 72 °C for 20 s. For
each experiment, an RT-negative sample was included as control. The specificity of
the qPCRs was assessed by melting curve analysis. The relative expression of the
target genes was analyzed using a modified Pfaffl method [279, 280]. To determine
significant differences in the mRNA expression between different experimental conditions,
the relative quantity (RQ) for each sample was calculated using the formula
1/E^Ct, where E is the efficiency and Ct the threshold cycle. The RQ was then normalized
to the housekeeping gene peptidylprolyl isomerase B (PPIB). The stability
of the housekeeping gene was assessed using the BestKeeper algorithm [281]. The
normalized RQ (NRQ) values were log2-transformed for further statistical analysis
with GraphPad PRISM v5.0. Statistical analysis was performed using repeated
measures ANOVA and Bonferroni correction.
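The relative quantification described above (RQ = 1/E^Ct, normalization to PPIB, log2 transformation) can be followed with a small worked example. The sketch below assumes that normalization means dividing the RQ of the target gene by the RQ of the housekeeping gene and that E is given as the amplification factor per cycle (2 for perfect doubling); both are assumptions, and the function names and Ct values are hypothetical.

```python
import math


def relative_quantity(efficiency, ct):
    """RQ = 1 / E^Ct, with E the amplification efficiency and Ct the threshold cycle."""
    return 1.0 / efficiency ** ct


def log2_nrq(e_target, ct_target, e_reference, ct_reference):
    """log2 of the normalized relative quantity.

    Assumption: normalization divides the RQ of the target gene by the RQ of
    the housekeeping gene (PPIB in the text).
    """
    nrq = relative_quantity(e_target, ct_target) / relative_quantity(e_reference, ct_reference)
    return math.log2(nrq)


if __name__ == "__main__":
    # Hypothetical measurement: target gene crosses the threshold at cycle 25,
    # the housekeeping gene at cycle 20, both with an efficiency of 2.
    print(log2_nrq(2.0, 25, 2.0, 20))
```

With equal efficiencies of 2 the log2 NRQ reduces to the negative Ct difference between target and housekeeping gene, which is a quick plausibility check for real measurements.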
Figure 5.2: Bird’s eye view of transcriptome changes upon stimulation with vitamin during infec-
tion. Our results demonstrate a huge impact of the pathogens and vitamins on the transcriptional
landscape of human monocytes. Furthermore, the transcriptional regulation by the vitamins is de-
pendent on the pathogenic stimulus. (A) 3-dimensional Principal Component Analysis (3D-PCA)
of the top 300 most variant genes was plotted with the scatterplot3d package in R [282]. The
first three principal components (PC1-PC3) account for ∼78 % of the total variance of the data.
(B) Representation of the total number of genes as bars (left y-axis) and the ratio of up-/down-
regulated genes as diamonds and triangles (right y-axis) in response to atRA and vitD during all
stimulatory settings. (C) Venn diagram showing the overlap of the atRA-regulated genes (FC > 2,
p < 0.05) during A. fumigatus stimulation (A.f., blue), C. albicans stimulation (C.a., green), E. coli
stimulation (E.c., magenta) or in absence of pathogen stimulation (w/o inf., orange). (D) Venn
diagram showing the overlap of the vitD-regulated genes (FC > 2, p < 0.05) during A. fumigatus
stimulation (A.f., blue), C. albicans stimulation (C.a., green), E. coli stimulation (E.c., magenta)
or in absence of pathogen stimulation (w/o inf., orange).
Figure 5.3: Heatmap of K-means clustering of DEGs and subsequent GO enrichment analysis.
K-means clustering was performed on variance-stabilized read counts to build a heatmap for the
6,076 differentially expressed protein-coding genes (adjusted p-value <0.05). A priori, the model-
based optimal number of K = 18 was determined. The clustering of the rows is based on euclidean
distance. The colors in the map represent row-scaled expression levels: blue indicates the lowest
expression, white indicates intermediate expression, and red indicates the highest expression. Se-
lected clusters were analyzed with regard to their biological function by GO enrichment analysis.
Most enriched GO categories are shown for representative groups of clusters displaying their fold
enrichment.
vealed the Immune System Process (GO:0002376) as the most enriched GO category,
followed by Response to Stimulus (GO:0050896) (Fig. 5.5A). A similar enrichment
was also obtained for the 624 vitD-regulated genes (Fig. 5.5B), demonstrating the
important role of both vitamins in immune processes. Moreover, Immune System
Process was also the most prominent category among the genes regulated by atRA
during C. albicans and E. coli infections when settings were analyzed separately
(Fig. 5.5C). For vitamin D, Immune System Process was the top-enriched GO category
in all settings. In addition, most of the immune-relevant genes regulated by
vitamins A and D are highly expressed in monocytes (Fig. 5.4). KEGG pathway analysis
of vitamin-regulated genes showed the Cytokine-Cytokine Receptor Interaction
as the pathway with the highest enrichment score for both vitamins. Other significantly
enriched pathways included Chemokine signaling, TNF signaling and Hematopoietic
cell lineage, among several other immune-relevant processes. Thus, pathway
analysis underpins the remarkable impact of the vitamins on immune functions.
Figure 5.4: Expression plots (MA plots), showing the vitamin-dependent transcriptional profiles.
The scatter plots display the mean expression (x-axis) and log2 fold changes (y-axis) of differentially
expressed genes in response to atRA and vitD under each of the stimulatory settings (w/o infection,
A. fumigatus infection, C. albicans infection, E. coli infection). Red dots represent significantly
(p < 0.05) regulated genes. Blue dots represent DEGs belonging to the Gene Ontology (GO)
category GO:0002376 (Immune System Process).
Figure 5.5: Gene ontology analysis of the vitamin-induced transcriptional changes during infection.
Analysis revealed the Immune System Process as the most affected GO category in response to
vitamin treatment. (A) GO enrichment analysis of all atRA-regulated genes (1,573 genes, FC > 2,
p < 0.05) during any of the three analyzed infections. Percentage of the total enrichment scores
are shown for the top five GO categories (biological process). (B) GO enrichment analysis of all
vitD-regulated genes (624 genes, FC > 2, p < 0.05) during any of the three analyzed infections.
(C) Top atRA-regulated GO categories during each of the infections. (D) Top vitD-regulated GO
categories during each of the infections.
Counteracting the transcriptional response against pathogens
The question remained as to the direction and extent of the vitamin-mediated regulation
in each stimulatory setting. In order to address these questions, we subsequently
analyzed the differential expression of all those immune-relevant genes that
were regulated by both the vitamins and the pathogens. Interestingly, there was a
huge overlap of genes regulated by both stimuli. Thus, of the 235 immune-relevant
genes (GO:0002376) that were regulated by atRA during E. coli infection, up to
195 genes (83 %) were also regulated by the pathogen itself. Similar overlaps were
observed during fungal infections with up to 70.5 % and 72.7 % for A. fumigatus
and C. albicans stimulation, respectively. For vitamin D, these overlaps with the
infections were 73.6 %, 64.4 % and 74.2 % for A. fumigatus, C. albicans and E. coli
challenge, respectively. By plotting fold changes relative to their unstimulated controls,
we could discriminate between counteractive and synergistic effects between
the vitamins and the pathogenic stimulus in each case (Fig. 5.6).
In all settings, the vast majority of the immune-relevant genes were up-regulated
after pathogen challenge (Fig. 5.4), as expected, and this effect was reversed by the
vitamins. Especially atRA showed an important counteractive effect against the
5.1. Differential
www.nature.com/scientificreports/ effects of vitamins on human monocytes after infections
Figure 5. Vitamins A and D strongly counteract the transcriptional response of human monocytes to
Figure 5.6: pathogens.
VitaminsGraphicA and D strongly
representation counteract
of the the transcriptional
expression dynamic of immune-relevantresponse of human mono-
genes (GO:0002376)
differentially regulated by both the pathogens and the vitamins. Patterns are divided by the type of correlationgenes
cytes to pathogens. Graphic representation of the expression dynamic of immune-relevant
(GO:0002376)observed between the regulated
differentially effects of pathogen
by bothand vitamin stimulations: counteractive
the pathogens effect (up-regulation
and the vitamins. Patterns by are di-
and down-regulation by vitamin, or vice versa) and synergistic effect (same direction observed in the
vided by thepathogen
type of correlation observed between the effects of pathogen and vitamin stimulations:
differential expression induced by pathogen and vitamin stimulations). Pie charts show the proportion of genes
counteractive effectcounteractive
depicting (up-regulation effects by pathogen
(red) andeffects
and synergistic down-regulation
(green). by vitamin, or vice versa) and
synergistic effect (same direction observed in the differential expression induced by pathogen and
vitamin stimulations). Pie charts show the proportion of genes depicting counteractive effects (red)
counteractive
and synergistic effects effect suggests an important immunomodulatory potential for both vitamins during bacterial and
(green).
fungal infections.
109
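The distinction between counteractive and synergistic effects described for Figure 5.6 can be illustrated with a minimal sketch. It assumes that each gene comes with two log2 fold changes, one for the direction of regulation by the pathogen and one for the direction of regulation by the vitamin. The function names, the handling of zero values and the example numbers are hypothetical and are not taken from the analysis itself.

```python
def classify_effect(pathogen_lfc: float, vitamin_lfc: float) -> str:
    """Label one gene by the direction of its two log2 fold changes.

    pathogen_lfc: log2 fold change attributed to the pathogen stimulus
    vitamin_lfc:  log2 fold change attributed to the vitamin stimulus
    (which reference each value is computed against is an assumption here)
    """
    if pathogen_lfc == 0 or vitamin_lfc == 0:
        return "undefined"  # no direction to compare
    same_direction = (pathogen_lfc > 0) == (vitamin_lfc > 0)
    return "synergistic" if same_direction else "counteractive"


def counteractive_fraction(genes: dict[str, tuple[float, float]]) -> float:
    """Fraction of classifiable genes showing a counteractive pattern."""
    labels = [classify_effect(p, v) for p, v in genes.values()]
    classified = [label for label in labels if label != "undefined"]
    if not classified:
        raise ValueError("no gene with a defined direction")
    return sum(label == "counteractive" for label in classified) / len(classified)


if __name__ == "__main__":
    # Illustrative values only, not measured data.
    example = {"geneA": (2.1, -1.3), "geneB": (1.4, 0.6), "geneC": (-0.9, 1.1)}
    print(counteractive_fraction(example))
```

A fraction computed this way corresponds to the red share of a pie chart as described in the figure caption, under the stated assumptions.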
Chapter 5. Differential Gene Expression
[Figure 5.7, heat maps, panels A, B and C. Rows: gene names. Columns: control, pathogen, pathogen + atRA, pathogen + vitD for A. fumigatus, C. albicans and E. coli. Color scale: log2 fold change from 1 to −1.]
Figure 5.7: Hierarchical clustering of differentially expressed genes of GO:0006955 (Immune
Response). Heat map of all genes differentially regulated (DEGs) by both the pathogens and any of
the vitamins in each infection model. (A) during infection with A. fumigatus and treatment with
either vitamin A or D; (B) during infection with C. albicans and treatment with either vitamin A
or D; (C) during infection with E. coli and treatment with either vitamin A or D.
5.1. Differential effects of vitamins on human monocytes after infections
Figure 5.8: Immunomodulatory footprint of vitamin A during infection. AtRA shows a pathogen-
specific regulatory role on immune-relevant genes leading to an overall down-regulation of cy-
tokines, chemokines and matrix metalloproteases and an up-regulation of complement-related
genes. (A) Venn diagram showing the overlap and amount of atRA-regulated immune-relevant
genes (GO:0002376) in each of the infections analyzed: A. fumigatus (blue), C. albicans (green)
and E. coli (magenta). (B) Network based on experimental and database-derived knowledge
(edges) generated with the STRING database. # atRA regulated these genes in at least two of
the pathogenic infection settings; P atRA-regulated genes during C. albicans infection; atRA-
regulated genes during E. coli infection; D atRA-regulated genes during A. fumigatus infection;
red: down-regulation, green: up-regulation.
Figure 5.9: Immunomodulatory footprint of vitamin D during infection. VitD shows a pathogen-
specific regulatory role on immune-relevant genes leading to an overall down-regulation of cytokines,
chemokines and matrix metalloproteases. (A) Venn diagram showing the overlap and amount of
vitD-regulated immune-relevant genes (GO:0002376) in each of the infections analyzed: A. fumi-
gatus (blue), C. albicans (green) and E. coli (magenta). (B) Network based on experimental and
database-derived knowledge (edges) generated with the STRING database. # vitD regulated these
genes in at least two of the pathogenic infection settings; P vitD-regulated genes during C. al-
bicans infection; vitD-regulated genes during E. coli infection; D vitD-regulated genes during
A. fumigatus infection; red: down-regulation, green: up-regulation.
Taken together, both vitamins play an important role as modulators of the
transcriptional response against fungi and gram-negative bacteria, with a strong
impact on the expression of specific cytokines and chemokines, depending on the
stimulatory setting. Moreover, for both vitamins, we could identify consensus inhibitory
networks across the three different pathogenic stimulations using the
KeyPathwayMiner [276, 277] tool. These networks confirmed the important regulatory
role of both vitamins on TNF signaling, metalloprotease production, and IFN
pathways.
5.1.5 Conclusions
We used a high-throughput RNA-Seq-based approach to characterize the full
immunomodulatory potential of the vitamins A and D during infections of bacterial
and fungal origin. Using human monocytes as a host-cell model, we analyzed the
differential role of both vitamins under four different stimulatory settings: upon
A. fumigatus infection, upon C. albicans infection, upon E. coli infection or in
the absence of any inflammatory stimulus. Gene ontology and pathway analyses of the
differential expression patterns were carried out to define the regulatory role of these
vitamins upon each infection type, and to identify their underlying mechanisms.
We observed an important and specific impact of the inflammatory stimulus on
the vitamin-mediated regulation of transcription. This was especially true for vitamin
D, where infection drastically reduced the number of vitD-regulated genes when
compared to its regulation in the absence of inflammatory stimulus (Fig 5.2B). Also
the relation of up- vs. down-regulated genes was shifted upon infection, with more
[Figure 5.10 legend. Infections: control, C. albicans, A. fumigatus, E. coli. Vitamins: control, atRA, vitD.]
Figure 5.10: Analysis of the expression profiles of immunomodulatory genes in response to atRA
and vitD after three hours of stimulation. Relative mRNA expression levels of selected genes were
measured by qPCR. Data were obtained from five independent experiments, each performed with
cells from different donors. Statistical analysis was carried out by using repeated measures ANOVA
and Bonferroni correction. Results are presented as mean ± SEM of the fold change relative to the
control (unstimulated cells). For RNA-Seq based expression patterns of these genes see Fig. 5.11.
*** p ≤ 0.001, ** p ≤ 0.01, * p ≤ 0.05.
[Figure 5.11, box plots. Panels: BTK, STAT1, PPP3CA, CD300A, LILRB1, PTPN7, DUSP1, DUSP7. Legend: infections (control, C. albicans, A. fumigatus, E. coli); vitamins (control, atRA, vitD).]
Figure 5.11: Analysis of the RNA-Seq expression profiles of immunomodulatory genes in response
to atRA and vitD after six hours of stimulation. The box plots show the normalized expression (x-
axis) and the log2 fold change (y-axis) for the same genes as selected in Fig. 5.10 (central regulators
of the main signaling cascades identified in this study). The RNA-Seq abundances estimated here
six hours after infection correspond well with the qPCR-measured patterns already observed
after three hours and presented in Fig. 5.10.
Figure 5.12: Comparison of the immunomodulatory potentials of vitamin A and vitamin D. During
infection, an important overlap among the gene sets regulated by both vitamins could be revealed.
Nevertheless, differential modulatory effects by each of the vitamins could be addressed among sev-
eral immune-relevant genes. (A) Venn diagram showing the overlap of atRA- and vitD-regulated
protein-coding genes during any of the infections. (B) Venn diagram showing the overlap of atRA-
and vitD-regulated genes belonging to GO category GO:0002376 (Immune System Process). (C)
Scatter plots display the differential effects of each vitamin on the expression of immune-relevant
genes. Each axis depicts the vitamin-induced log2 fold change as compared to the corresponding
pathogen stimulation alone. Genes written in red indicate significant differences in their regulation
induced by each vitamin (real FC > 3 between the fold changes induced by atRA and vitD during
infections).
served in gene clusters with no or minimal impact of pathogen stimulation, while
the immunomodulatory role of vitamins was highlighted across pathogen-triggered
clusters (Fig 5.3). Overall, during inflammation, the GO category most strongly
enriched by vitamin stimulation was Immune System Process (GO:0002376). For
vitamin D this enrichment could be confirmed for each of the infections. For atRA,
Immune System Process ranked first during C. albicans and E. coli infection, but
only 4th upon A. fumigatus infection (Fig 5.5). This slight shift might be explained
by a generally weaker transcriptional response to A. fumigatus, leaving fewer
immune-relevant genes susceptible to atRA-mediated regulation. Nevertheless,
the huge importance of both vitamins as immunomodulators becomes even more
evident considering the number of immune-relevant genes (GO:0002376) regulated
by atRA and vitD during infection: 346 and 176 genes, respectively. In addition,
pathway analysis underpinned the notion that immune response is the biological
function most strongly regulated by both vitamins. Although the immunomodulatory
effect of both vitamins has already been described for single genes in different cell
models [259, 260, 262, 263, 284–286], the dimension of this regulatory function has not
been previously reported.
For all stimulations, we could describe a pronounced and previously unreported
counteractive effect between the pathogen- and the vitamin-driven regulation of
immune-relevant genes. Both vitamins, especially vitamin A, counteracted the
transcriptional regulation in response to the pathogens (Fig 5.6). We also observed this
behavior for antisense transcripts and lncRNAs [5]. The most prevalent expression
dynamic was defined by genes that were up-regulated by the pathogens, with this
effect being reversed by vitamin A. The functional classification of the atRA-regulated
genes allowed us to identify the cytokines as the best representatives of this
dynamic. AtRA down-regulated almost all cytokines, and a similar tendency was
also observed for the chemokines. On the other hand, complement-activity genes were
rather up-regulated by atRA. These findings might suggest a scenario in which atRA
could lead to an attenuation of the immune response, in terms of pro-inflammatory
cytokine release and immune cell recruitment, but sustain effective phagocytosis.
This type of immunomodulation might have large-scale clinical potential, especially
during systemic infections with hyper-inflammatory response, including C. albicans
and E. coli infections. During the last decade, much effort has been devoted to
developing new therapeutic strategies for severe sepsis treatment, highlighting the need
for new immunomodulatory agents [287], especially since most of the proposed
immunomodulators have failed to show any clinical benefit or have shown limited
clinical efficacy [288, 289].
We observed a high degree of specificity in the vitamin-mediated regulatory
patterns between the different infection models. On the one hand, E. coli
infection increased the capability of the vitamins to regulate gene expression, as
compared to the other two infections. This might be attributed to the fact that E. coli
stimulation led to the strongest transcriptional response, thereby presenting more
targets susceptible to regulation by atRA or vitD. On the other hand, we could
identify genes regulated by the vitamins in a pathogen-specific manner. Type-I
interferons (IFN), for instance, were down-regulated by both vitamins exclusively
upon C. albicans infection. The type-I IFN response was identified in our dataset
5.2. Differential expression in EBOV/MARV infected human and bat cells
[Figure 5.14, panels A–D. (A) Experimental scheme: HuH7 or R06E-J cells; Mock, MARV or EBOV; samples at 3 h, 7 h and 23 h p.i.; rate of infection (IFA), viral propagation (PCR), quality/quantity check, transcriptome analyses. (B) Bar chart of Ct values for MARV and EBOV in human and fruit bat cells at 3 h, 7 h and 23 h. (C), (D) Immunofluorescence images, labelled ~90 % and ~70 % infected cells.]
Figure 5.14: Monitoring sample preparation. (A) HuH7 and R06E-J cells were infected with
MARV or EBOV (MOI = 3) or left uninfected (Mock). Samples were collected after 3, 7 and 23 h
post infection (p.i.). (B) RNA of infected and uninfected cells was isolated at 3, 7 and 23 h p.i.,
checked for quality and quantity, and filovirus-specific real time PCR was performed according to
Panning et al. [314]. (C) and (D) To determine the number of infected cells, immunofluorescence
analyses were performed with infected cells grown on one coverslip within each well used for RNA
preparation. Infected cells were visualized (red) using mouse monoclonal antibodies against EBOV
(C) or MARV (D) nucleoproteins and fluorescently tagged secondary antibodies. DAPI staining
was used to visualize cell nuclei (blue). Ct – cycle threshold.
We infected 4 × 10^5 HuH7 cells [315] (a human hepatoma cell line) and 4 × 10^5 R06E-J
cells [316] (an embryonic cell line from R. aegyptiacus) with EBOV (Ebola virus
strain Zaire, Mayinga, GenBank: NC_002549) or MARV (Lake Victoria Marburg
virus, Leiden, GenBank: JN408064.1 [317]) at a multiplicity of infection (MOI) of
three (Fig. 5.14A). We used HuH7 cells because EBOV infections in humans in-
duce the majority of their histopathological features in the liver, and these cells are
highly susceptible to filovirus infections [318]. We used immortalized cells because
primary cells from bats (macrophages and dendritic cells) are not available in the
large quantities we required for our RNA-Seq analyses (9 samples from each cell
type). Our analyses consisted of computational and extensive manual investiga-
tions as shown in Fig. 5.15. At 3, 7 and 23 h p.i., cells were harvested, and total
RNA was isolated using an RNeasy Mini Kit (QIAGEN) according to the manu-
facturer’s instructions. These time points correspond to the different stages of the
viral replication cycle (Fig. 5.16). Replication and transcription take place after 3 h,
proteins are produced at 7 h, which may regulate further transcription, and a
complete replication cycle occurs after 23 h. DNaseI digestion was performed. At each
time point, RNA was also isolated from non-infected (Mock) control cells. Qual-
ity controls were performed to ensure proper infection rates and viral propagation.
Real-time PCR was used to detect filovirus RNA (polymerase genes) [314] and to
demonstrate the amplification of viral RNA over the time course of the infections
(Fig. 5.14B). Ct-values are inversely related to the amount of RNA detected, and
these values for the MARV-infected HuH7 cells at 3 and 7 h p.i. were lower than the
values for the MARV-infected R06E-J cells. The Ct-values for the EBOV-infected
cells showed no clear difference. An immunofluorescence analysis (IFA) of the cells
was performed using mouse monoclonal antibodies directed against nucleoproteins
of EBOV (B6C5, 1:20) and MARV (59-9-10, 1:100). An anti-mouse secondary anti-
body coupled with Alexa 594 (1:500) was used to detect these viral nucleoproteins,
and DAPI (4’,6’-diamidino-2-phenylindole) staining was used to visualize cell nuclei
(1 mg/ml, 1:2000). IFAs of the MARV and EBOV nucleoproteins revealed that an
MOI of 3 was sufficient to initially infect a high percentage of cells (90 % of human
and bat cells infected with EBOV, 70 % of bat cells and 99 % of human cells infected
with MARV, Fig. 5.14C/D). The quantity and quality of the RNA was assessed
using a NanoDrop® spectrophotometer and an Agilent Bioanalyzer. Nine samples
at different time points were generated from each of the human (HuH7) and bat
(R06E-J) cell lines. The total RNA of the 18 samples was shipped to LGC Genomics for the construc-
tion of cDNA libraries. Ribo-Zero was used for rRNA depletion, and the Illumina
TruSeq kit was used for library construction. Illumina sequencing was performed in
a 2 × 100 nt paired-end mode on a HiSeq 2000 system. R06E-J cells were stimulated
with interferons, PolyIC or thapsigargin to mimic the induction of the interferon
system or a stress response by the endoplasmic reticulum (ER) of the cells. Prior to
stimulation, R06E-J cells were examined for interferon competence via a vesicular
stomatitis virus (VSV) bioassay. The cells secreted cytokines after PolyIC transfec-
tion, and those cytokines partially protected R06E-J cells from VSV infection (data
not shown). RNA was isolated from these cells, pooled with the 9 previously men-
tioned R. aegyptiacus cell samples and shipped to GATC Biotech for normalization
and sequencing on an Illumina MiSeq system (2 × 300 nt mode). This library of
longer paired-end reads was used to improve the de novo transcriptome assembly of
R. aegyptiacus. All reads were preprocessed based on their Phred quality score. At
the 3’ end, bases with a quality score <20, a 5’-bias and poly-A tails were removed
with PRINSEQ [44] (v0.20.3). Quality was assessed and controlled before and after
processing with FastQC [43] (v0.10.1).
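The 3' quality trimming step described above can be illustrated with a minimal sketch. It is not PRINSEQ's implementation: it assumes Phred+33 encoded quality strings and simply removes bases from the 3' end while their quality score is below the threshold of 20. The function name and the example read are hypothetical.

```python
def trim_3prime(seq: str, qual: str, min_q: int = 20, offset: int = 33) -> tuple[str, str]:
    """Remove trailing bases whose Phred score is below min_q.

    seq:    nucleotide sequence of one read
    qual:   quality string of the same length (Phred+33 assumed)
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality string differ in length")
    end = len(seq)
    # Walk back from the 3' end until a base with sufficient quality is found.
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


if __name__ == "__main__":
    # 'I' encodes Phred 40 and '#' encodes Phred 2 under Phred+33.
    print(trim_3prime("ACGTAC", "IIII##"))
```

Removal of the 5'-bias and of poly-A tails, which the text also mentions, is not covered by this sketch.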
[Figure 5.15, pipeline diagram. (1) Data acquisition: HuH7/R06E-J cells, 3x EBOV, 3x MARV, 3x Mock (3 h/7 h/23 h p.i.), isolation of total RNA, sample preparation (Ribo-Zero), sequencing (HiSeq 2x 100 bp, MiSeq 2x 300 bp), quality/trimming/FastQC. (2) De novo transcriptome assembly: Mira, Velvet/Oases, ABySS/Trans-ABySS, SOAPdenovo-Trans, Trinity, CD-HIT-EST. (3) Mapping: Segemehl/TopHat, Cufflinks. (4) Differential gene expression analysis: read counting, DESeq, clustering. (5) Homology search: Hg19 genes, Ensembl, literature, enriched pathway analysis, PVA CDS. (6) Manual inspection and annotation (IGV & UCSC): gene name synonyms, conservation, nucleotide modification analysis, intronic transcripts, isoforms.]
Figure 5.15: Methods pipeline. (1) Data acquisition: Total RNA from HuH7 and R06E-J cell lines 3, 7 and 23 h p.i. was depleted of ribosomal RNA
and sequenced. We controlled the quality and trimmed the data with PRINSEQ and FastQC. (2) For bat RNA, we assembled a de novo transcriptome
by adding pooled MiSeq to HiSeq data using various assembly tools and parameter settings. (3) Mapping was performed for Mock-, EBOV-, and
MARV-treated cells onto human/bat genomes and the bat transcriptome with Segemehl and TopHat. (4) Differential gene expression analysis was
performed by counting uniquely mapped reads and applying a DESeq analysis. The results were used for clustering and scatter/group plot analyses. (5)
Homology searches in bats were performed for all significantly differentially expressed genes from (4) and for the genes that were presumed to be involved
in the response to infection based on the literature and an enriched pathway analysis. The R. aegyptiacus genome and coding sequences from P. vampyrus
were used to validate and detect homologous sequences in the bat transcriptome. Detected homologs were used for the differential gene-expression analysis.
We also investigated the quality of the transcriptome assembly by comparing the human and R. aegyptiacus genomes with the corresponding assembly. (6)
During the manual inspection, we identified the synonyms of gene names and noted their existence in the relevant pathways. Each candidate gene was
manually investigated in the IGV and UCSC browsers for the human and bat samples from all time points. We report the conservation of genes according to
the 100 Species Vertebrate Multiz Alignment to chimp, mouse, dog, elephant and chicken sequences. We searched for nucleotide modifications (differential
SNPs, posttranscriptional modifications), intronic transcripts and regulators, alternative splicing and isoforms, and upstream and downstream transcript
characteristics.
We denoted these filtered subsets as expressed and blasted (E-value < 10^-10) them
against the de novo transcriptome assemblies of the human or bat cells. We defined
a transcript (derived from the genomic sequence) as valid, and therefore correctly
assembled, if we obtained a minimum of one blast hit with an alignment length
>90 % of the query. For the human transcriptome assembly, we found between
93.0 % and 98.1 % of the expressed transcripts, and for R. aegyptiacus 81.3–94.0 %.
Therefore, the transcriptome assemblies were of sufficient quality. The results for
different transcript subsets are shown in Tab. A.3. Most of the missing transcripts
can be explained by a low read coverage in comparison to the length of the transcript
or a non-uniform distribution of reads along the transcript. These transcripts may be
assembled as partial contigs (alignment length ≤ 90 %). The higher number of valid
transcripts derived from the human genome can be explained by its better annotation
and assembly status compared to that of the relatively new R. aegyptiacus genome
at the scaffold level.
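The validity criterion described above (at least one BLAST hit covering more than 90 % of the query, under the E-value threshold) can be sketched as follows. The `Hit` record, the function names and the example identifiers are hypothetical; the sketch assumes that the hits have already been parsed from the BLAST output into query length, alignment length and E-value.

```python
from typing import Iterable, NamedTuple


class Hit(NamedTuple):
    query_id: str    # transcript derived from the genomic annotation
    query_len: int   # length of that transcript
    aln_len: int     # alignment length of the hit
    evalue: float


def valid_transcripts(hits: Iterable[Hit], min_cov: float = 0.90,
                      max_evalue: float = 1e-10) -> set[str]:
    """Return queries with at least one hit covering > min_cov of the query."""
    valid = set()
    for hit in hits:
        if hit.evalue < max_evalue and hit.aln_len > min_cov * hit.query_len:
            valid.add(hit.query_id)
    return valid


def recovered_fraction(expressed_ids: Iterable[str], hits: Iterable[Hit]) -> float:
    """Fraction of expressed transcripts that count as correctly assembled."""
    expressed = set(expressed_ids)
    return len(valid_transcripts(hits) & expressed) / len(expressed)


if __name__ == "__main__":
    # Illustrative hits only, not results from the study.
    hits = [Hit("t1", 1000, 950, 1e-50), Hit("t2", 1000, 800, 1e-80)]
    print(recovered_fraction(["t1", "t2"], hits))
```

Transcripts that only reach an alignment length of 90 % or less, such as `t2` in the example, correspond to the partial contigs mentioned in the text.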
passed to DESeq [81] in the R/Bioconductor package (v2.14). Due to the lack of
replicates, we performed pairwise comparisons between all of the different infection
conditions and time points with a false discovery rate of 0.1. To compare the charac-
teristics between HuH7 and R06E-J cells at 3, 7 and 23 h p.i., we used the EBOV and
MARV samples at the identical time points as replicates in the DESeq analysis (padj
≤ 0.1). In addition to using gene annotations from the NCBI, we processed the de
novo gene loci obtained from Cufflinks in an identical manner. Furthermore, we
used Cuffquant and Cuffnorm to calculate FPKM values for each locus from the
Cuffmerge -G results. In addition to the normalized read-count-based method,
we calculated the maximum read peak for each gene and sample using bedtools
genomecov (-d -split). We used the notation “=” when the difference between
the two samples was <15 %, “↑/↓” when there was up to a two-fold difference, and
“n↑” when the difference was up to n-fold (greater than 2-fold).
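The notation for comparing maximum read peaks can be written down as a small function. This is a sketch under assumptions: the difference is interpreted as the ratio of the larger to the smaller peak (so "<15 %" becomes a ratio below 1.15), n is taken as the ratio rounded up, and the direction refers to the second sample relative to the first. None of these details are specified in the text, and the function name is hypothetical.

```python
import math


def peak_notation(reference: float, sample: float) -> str:
    """Summarize the difference between two maximum read peaks."""
    if reference <= 0 or sample <= 0:
        raise ValueError("peaks must be positive")
    ratio = max(reference, sample) / min(reference, sample)
    arrow = "↑" if sample > reference else "↓"
    if ratio < 1.15:   # difference below 15 %
        return "="
    if ratio <= 2:     # up to a two-fold difference
        return arrow
    return f"{math.ceil(ratio)}{arrow}"  # up to n-fold, n > 2


if __name__ == "__main__":
    for ref, smp in [(100, 110), (100, 180), (100, 40), (100, 520)]:
        print(ref, smp, peak_notation(ref, smp))
```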
To compare the differential gene expression for the Mock/EBOV and Mock/MARV
treatments in HuH7 and R06E-J cells, log2-fold changes, as computed by DESeq,
were visualized using scatter plots in R (x-value: FC of Mock/EBOV; y-value:
FC of Mock/MARV). The scatterplots were overlaid with contour plots for a two-
dimensional kernel estimate (kde2d; MASS package) using the default parameters.
Outliers are labeled with their respective gene names in the Electronic Supplement
(Fig. ES4B).
To compare human and bat genes, we defined homologous loci between human genes
and the transcripts in the R. aegyptiacus assembly. A direct comparison between
human genes and R. aegyptiacus transcripts led to several unidentified orthologous
pairs. To improve ortholog detection, we used the annotated genes of P. vampyrus
from Ensembl to act as an intermediate between the two closely related species
(Fig. 6.5). Sequence comparisons were performed using BLASTn+ (v.2.2.27+), and
hits under the restrictive E-value threshold of < 10^-50 were considered orthologs. We
defined homologous genes between the human and recently published R. aegyptiacus
genomes.
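The use of P. vampyrus genes as an intermediate for ortholog detection can be sketched as a two-step lookup. The sketch assumes that BLAST hits are available as (query, subject, E-value) tuples and, as an additional assumption not stated in the text, keeps only the hit with the lowest E-value per query. All identifiers and function names are hypothetical.

```python
def best_hits(hits, max_evalue=1e-50):
    """Map each query to its lowest-E-value subject below the threshold.

    hits: iterable of (query, subject, evalue) tuples
    """
    best = {}
    for query, subject, evalue in hits:
        if evalue < max_evalue and (query not in best or evalue < best[query][1]):
            best[query] = (subject, evalue)
    return {query: subject for query, (subject, _) in best.items()}


def chain_orthologs(human_to_pva, pva_to_rae, max_evalue=1e-50):
    """Link human genes to R. aegyptiacus transcripts via P. vampyrus genes."""
    first = best_hits(human_to_pva, max_evalue)
    second = best_hits(pva_to_rae, max_evalue)
    return {human: second[pva] for human, pva in first.items() if pva in second}


if __name__ == "__main__":
    # Placeholder identifiers, not real accessions.
    human_to_pva = [("H1", "P1", 1e-80), ("H1", "P2", 1e-60), ("H2", "P3", 1e-20)]
    pva_to_rae = [("P1", "R1", 1e-100), ("P2", "R2", 1e-70)]
    print(chain_orthologs(human_to_pva, pva_to_rae))
```

In the example, `H2` is dropped because its only hit does not pass the restrictive threshold.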
Pathway enrichment
We examined which of the differentially expressed genes were over-represented in
KEGG database [326] pathways. Gene set enrichment analyses were performed us-
ing a hypergeometric test. FDR-corrected p-values were significant at 0.05. We set
the threshold to a value of p < 0.1 using a hypergeometric test and FDR correc-
tions [327]. The evaluation was performed with the R-package GAGE [86] to obtain
KEGG pathway information and with pathview [87] to allow for visualization.
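The combination of a hypergeometric test with FDR correction can be illustrated with a minimal, standard-library sketch. It is not the GAGE or pathview implementation: it computes the upper-tail hypergeometric p-value for one pathway and applies the Benjamini-Hochberg procedure to a list of p-values. Function names, parameter names and the example numbers are hypothetical.

```python
from math import comb


def hypergeom_pvalue(k: int, K: int, n: int, N: int) -> float:
    """P(X >= k) for a hypergeometric draw.

    k: differentially expressed genes in the pathway
    K: all pathway genes in the background
    n: all differentially expressed genes
    N: all background genes
    """
    upper = min(K, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


def benjamini_hochberg(pvalues: list[float]) -> list[float]:
    """Return FDR-adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest p-value to enforce monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    # Illustrative numbers only: 3 of 3 pathway genes among 3 DE genes out of 10.
    print(hypergeom_pvalue(3, 3, 3, 10))
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.5]))
```

A pathway would then be reported when its adjusted p-value falls below the chosen threshold.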
qRT-PCR
We repeated the infection of HuH7 and R06E-J cells and isolated RNA for qRT-PCR
analyses. An IFA of these cells (Fig. ES7E) revealed that the infection rates were
lower than in the previous experiment used for RNA-Seq. Only ∼40 % (∼70 %)
of the R06E-J cells were infected with EBOV (MARV) in comparison to ∼90 %
(∼70 %) in the first experiment (Fig. 5.14C). The infection rates were only slightly
lower for HuH7 cells (Fig. ES7E). RNA was isolated and reverse transcribed with
random hexamers. For qRT-PCR analyses of NPC1 and TLR3 gene expression,
degenerate primers that amplified both the human and the bat mRNA sequences
were used. HAVCR1 gene expression was measured using human- and bat-specific
primers. Expression values were normalized to 18S rRNA levels. All primer se-
quences are listed in File ES7A. qPCRs were performed on an Applied Biosystems
7500 Real-Time PCR System using SYBR Green chemistry (iTaq mix, BioRad)
according to the manufacturer’s instructions. Mean values from triplicate analy-
ses were calculated and quantified with the system’s built-in software according to
standard curves constructed for each amplicon from serial cDNA dilutions. Relative
quantification values, melting curves and agarose gel pictures for the three genes are
presented in the Electronic Supplement, Sec. ES7.
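Quantification against a standard curve and normalization to 18S rRNA can be sketched as follows. This is not the instrument's built-in software: it assumes a linear relationship between Ct and the log10 of the relative template quantity in the dilution series, fits it by least squares, and divides the target quantity by the reference quantity. All function names and numbers are hypothetical.

```python
from statistics import mean


def fit_standard_curve(log10_quantities, ct_values):
    """Least-squares fit of Ct = slope * log10(quantity) + intercept."""
    mx, my = mean(log10_quantities), mean(ct_values)
    sxx = sum((x - mx) ** 2 for x in log10_quantities)
    sxy = sum((x - mx) * (y - my) for x, y in zip(log10_quantities, ct_values))
    slope = sxy / sxx
    return slope, my - slope * mx


def quantity_from_ct(ct, slope, intercept):
    """Invert the standard curve to obtain a relative quantity."""
    return 10 ** ((ct - intercept) / slope)


def normalized_expression(target_cts, ref_cts, target_curve, ref_curve):
    """Mean of replicate Ct values, quantified and normalized to the reference."""
    target_q = quantity_from_ct(mean(target_cts), *target_curve)
    ref_q = quantity_from_ct(mean(ref_cts), *ref_curve)
    return target_q / ref_q


if __name__ == "__main__":
    # Illustrative dilution series (1, 1:10, 1:100, 1:1000), not measured data.
    curve = fit_standard_curve([0, -1, -2, -3], [20.0, 23.0, 26.0, 29.0])
    print(curve)
    print(normalized_expression([23.0, 23.1, 22.9], [17.0, 17.0, 17.0], curve, curve))
```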
synthesis slowed down in [...] and transcript levels were nearly identical at 23 h p.i.
(Tab. A.2). The EBOV and MARV viral proteins were more abundant in HuH7 than in
R06E-J cells at 7 h p.i. and were present at similar levels at 23 h p.i. (Fig. 5.14C and D)
as determined using IFAs. Early increases in the levels of viral RNA and proteins in
HuH7 cells may be attributed to a faster rate of early replication of filoviruses in HuH7
cells compared to the rate in R06E-J cells. The differ-
[Figure 5.16: bar charts of the fold change in viral RNA for the intervals 3-7 h and 7-23 h;
bars labelled E and M.]
Figure 5.16: Human HuH7 cells support an earlier onset of filoviral RNA synthesis than bat
R06E-J cells. The first viral replication cycle is finished after 15–18 h, when virions are
released from the host cells. Between 3 and 7 h p.i. of EBOV-infected R06E-J cells, we
observed an ∼4.8 X increase in the number of reads that mapped uniquely to the EBOV
genome. This indicates that EBOV genes are rapidly replicated and transcribed in the bat
cells in the first 4 h p.i. (see Tab. A.2 for normalized read counts). We observed a further
41 X increase in reads between 7 and 23 h p.i. in R06E-J cells. So, this RNA synthesis rate
slows down within this next 16 h (compared to 4.8^4 ≈ 530 X). In comparison, unique reads
mapping to the EBOV genome in HuH7 cells increased 15.6 X between 3 and 7 h p.i. and a
further 15.5 X in the following 16 h. This result indicates a significant increase in the RNA
synthesis rate of viral RNAs in the first few hours and a marked decrease in the following hours.
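The extrapolation quoted in the caption of Figure 5.16 (a 4.8 X increase per 4 h would compound to roughly 530 X over 16 h) can be reproduced with a few lines of Python. This is only a worked-arithmetic sketch: the fold changes are the values quoted in the caption, while the function names and the assumption of a constant per-interval rate are ours.

```python
def extrapolated_fold(fold_per_interval, interval_h, span_h):
    """Fold increase expected over span_h if the per-interval fold change stayed constant."""
    return fold_per_interval ** (span_h / interval_h)


def equivalent_interval_fold(observed_fold, interval_h, span_h):
    """Constant per-interval fold change that would produce observed_fold over span_h."""
    return observed_fold ** (interval_h / span_h)


# Fold changes of EBOV-mapped reads quoted in the caption (3-7 h and 7-23 h p.i.).
observations = {
    "R06E-J": {"early_fold": 4.8, "late_fold": 41.0},
    "HuH7": {"early_fold": 15.6, "late_fold": 15.5},
}

for cell_line, obs in observations.items():
    expected = extrapolated_fold(obs["early_fold"], 4, 16)
    per_4h = equivalent_interval_fold(obs["late_fold"], 4, 16)
    print(f"{cell_line}: constant rate would give {expected:.0f}x, "
          f"observed {obs['late_fold']}x (about {per_4h:.2f}x per 4 h)")
```

Comparing the observed late fold change with the compounded early rate makes the slowdown in both cell lines explicit.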
Chapter 5. Differential Gene Expression
[Figure 5.17, panels A and B. (A) Bar charts for Ebola infection and Marburg infection showing
the numbers of down-regulated and up-regulated genes at 3 h, 7 h and 23 h (axis range 0 to 800).
Extracted bar labels (assignment to individual bars not recoverable): 0, 0, 0, 2, 67, 801 and
1, 0, 877, 0, 0, 7. (B) Heat map; color scale: Z-score of log2(FC), from -3 to 3; column labels
3h, 7h, 23h for EBOV and MARV.]
C
Gene      Samples   FC      Supplement        Type
ATF3      EV 3-23h   5.89   A6, C7, J8        Transcription factor
FOS       EV 7-23h   6.09   A4, E10, I7       Transcription factor
FOSB      EV 7-23h   6.89   A2, E3, I3        Transcription factor
PPP1R15A  EV 3-23h   5.37   B2, E9, I10, J1   Protein phosphatase
DUSP1     EV 7-23h   4.57   B7, F7            Protein phosphatase
DUSP8     EV 7-23h   5.01   B9, F4, J7        Protein phosphatase
NFKB2     EV 3-23h   4.08   B10               Transcription factor
CHAC1     MV 3-23h   3.70   C2                Cation transport
TRIB3     MV 3-23h   4.76   C1                Protein kinase
SQSTM1    EV 7-23h   2.92   C6                Ubiquitination
CDH6      EV 3-23h  -2.97   C5                Cadherin
[Figure 5.17, panel D: chart of molecular functions with the category labels Binding, Protein
binding, Chromatin binding, Nucleic acid binding, Transporter activity, Catalytic activity,
Structural molecule activity, Receptor activity and Signal transducer activity. Extracted
counts (assignment to categories not recoverable): 29, 18, 5, 9, 12, 2, 2, 3, 3.]
Figure 5.17: Significantly regulated genes in cells infected with EBOV or MARV. (A) Number of
strongly regulated human genes after infection with EBOV or MARV. There were only a few genes
that were significantly regulated (padj <0.1) at 3 and 7 h p.i. in both EBOV- and MARV-infected
cells compared with their expression in Mock-treated cells. At 23 h p.i., the number of regu-
lated genes was higher (1,678) in EBOV-infected cells than in cells infected with MARV. Adding
these ∼1,600 strongly regulated genes to the findings from analyses comparing the different time
points or viruses resulted in approximately 2,500 genes being identified as significantly differentially
transcribed. (B) Heat map of row-scaled log2-fold changes in expression in infected HuH7 and
R06E-J-samples against the corresponding Mock samples (e.g., column three shows the fold change
between HuH7 Mock-treated cells and HuH7 EBOV-treated cells at 23 h p.i.). The input matrix is
scaled within the rows to visualize changes in expression at the gene level. Fold changes are based
on unique genome read counts of H. sapiens and R. aegyptiacus. Genes without a clear homologous
sequence in the R. aegyptiacus genome or transcriptome assembly are marked with a star. We
identified homologous locations (LOC107508087, LOC107515336, LOC107498547 ) for three genes
that were not directly annotated in the R. aegyptiacus genome (EP300, RPS17L, MX1 ). These
locations were identified using our de novo transcriptome assembly (red boxes). We indicated the
molecular function of each gene based on the color scheme presented in (D). (C) Highly regulated
genes in EBOV- and MARV-infected HuH7 and R06E-J cells. FC – log2 -fold change based on
DESeq normalized read counts. See Appendix (Tab. A.5 and A.6) and corresponding entries for
detailed information. (D) The PANTHER database (v11.0) [329] was used to assign molecular
functions to each of the 64 genes in (B). We further subdivided the dominant group of genes
that we identified to have a general binding function. During filovirus infections, the most promi-
nent regulatory effects were observed for genes encoding transcription factors, those regulating the
NFκB and MAPK pathways, their DUSP inhibitors and growth factors (Fig. ES2A and Tab. A.5
and A.6, full tables in the Electronic Supplement). In addition, changes were also observed for
genes that regulate protein translation (RPS17, PPP1R15A), ubiquitination (TRAF6, SQSTM1 ),
autophagocytosis (SQSTM1 ) and cation transport (CHAC1, ATP2B4 ). We also observed the
strong up-regulation of genes that are involved in energy transfer (e.g., RASGEF1B ). Details can
be found in Tab. A.5–A.8.
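The row scaling mentioned for panel (B) can be sketched as follows. This is a minimal Python illustration, not the plotting code used for the figure; the example values are hypothetical, and the use of the sample standard deviation is an assumption.

```python
from statistics import mean, stdev


def row_zscores(matrix):
    """Scale each row (gene) to mean 0 and unit sample standard deviation."""
    scaled = []
    for row in matrix:
        mu, sigma = mean(row), stdev(row)
        # A flat row has no variation to display; map it to zeros.
        scaled.append([0.0 if sigma == 0 else (v - mu) / sigma for v in row])
    return scaled


# Hypothetical log2 fold changes: one row per gene, one column per sample comparison.
log2_fc = [
    [0.1, 0.4, 5.9],    # strongly induced late
    [-0.2, -0.1, -3.0], # repressed late
    [1.0, 1.0, 1.0],    # constant across comparisons
]

for row in row_zscores(log2_fc):
    print([round(v, 2) for v in row])
```

Scaling within rows highlights when a gene changes relative to its own range, so genes with small absolute fold changes remain visible next to strongly regulated ones.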
regulation (Fig. 5.18). Interestingly, we found that the FOS /JUN -motif has sig-
nificantly increased activity after both EBOV and MARV infections. This result
indicates that genes having FOS /JUN -motif binding sites in their promoter region
are primarily up-regulated (Fig. 5.18). Consistent with this, transcription factors
that are associated with this motif (e.g., FOSB ) were up-regulated in infected cells
(Fig. 5.17B). The transcription factor AP1, a homo- or heterodimer of differentially
expressed FOS and JUN, plays important roles in different viral infections [330, 331].
Other motifs, such as the KLF12- and the NRF1-associated motifs (Fig. 5.18), are
more specific to EBOV and MARV, respectively, reflecting the differences in the
impacts of these two viruses on the transcriptional landscape of infected cells. In
summary, we found various motifs, including the antiviral signaling-associated mo-
tif for NFκB, to have significant changes in activity (Fig. 5.18). For each motif, we
provide associated regulators and target genes (Sec. ES5).
[Figure 5.18: table of top motifs with the scores extracted alongside them: NFKB1_REL_RELA.p2
(4.36), E2F1..5.p2 (4.53), KLF12.p2 (4.20), YY1.p2 (3.55), GATA1..3.p2 (3.13),
ELK1,4_GABP{A,B1}.p3 (2.64), SRF.p3 (3.12), HNF4A_NR2F1,2.p2 (2.61). The sequence logos and
activity plots (y-axis: activity change; x-axis: time p.i. [hours] at 3, 7 and 23) are not
recoverable from the extracted text.]
Figure 5.18: Motif activity response analysis. The table shows the top significant motifs after the
infection of HuH7 cells with EBOV (red) or MARV (blue) compared with the response in Mock
controls. Regulated motifs are predicted to target (1) the cell cycle (E2F1..5.p2) by down-regulating
CDC6, PCNA and MCM6 ; (2) NFκB -signaling (NFKB1_REL_RELA.p2) by targeting CXCL
isoforms, ELF3, NFκB isoforms, FOSL2 and JUN ; (3) EGR1 expression in EBOV-infected cells
(KLF12.p2, YY1.p2 and others); or (4) chromatin organization in MARV-infected cells (NRF1.p2,
YY1.p2 and others). For selected motifs, the inferred activity changes (points +/- 1 SD) after
EBOV or MARV infection relative to the corresponding Mock controls are shown for the different
time points (3, 7 or 23 h p.i.) adjacent to and below the table. Selected regulatory motifs, the
associated genes and their important targets (including their fold change between two time points)
can be viewed in Sec. ES5 and are summarized in File ES5D.
HuH7. We determined that the majority of the host genes reacted in a similar
manner in response to EBOV and MARV infections, which may explain the common
symptoms caused by these viruses in humans. CYR61 was among the most up-
regulated genes in HuH7 cells and was usually highly expressed at 3 and 23 h p.i.,
which correspond to the periods of inflammation and wound repair, respectively [341]
(Fig. ES4C). The cytokine genes IL8 and IL32 responded in both MARV- and
EBOV-infected HuH7 cells, showing a significant up-regulation (Fig. 5.17B). IL32
expression can be induced by IL8 and is involved in the apoptosis of T cells in EBOV-
infected patients [342, 343]. It is also up-regulated in response to influenza A virus
infections. The up-regulation of IL8 results in the activation of pro-inflammatory
pathways [344, 345]. We identified NRAV up-regulation 7 h p.i. This long non-
coding RNA was recently reported as a key regulator of antiviral innate immunity
that acts via the suppression of interferon-stimulated gene transcription [346]. We
propose that a cellular component exists, in addition to the filoviral inhibition of
the innate immune system by VP24 and VP35 [347–351], that results in the same
inhibition of innate immunity. At 23 h p.i. we identified several highly up- and
down-regulated genes, which we were unable to categorize (see Sec. ES4). We also
determined that ANXA3 was markedly up-regulated. Annexin A3 is an inhibitor of
phospholipase A2 and possesses anti-coagulant properties [352].
Our data indicate an initial inflammatory response (3 h p.i.) [332], followed by a
repression of antiviral defenses (7 h p.i.) with the majority of up- and down-regulated
gene expression occurring at 23 h p.i. in the HuH7 cells (Fig. 5.19).
R06E-J. R06E-J cells responded differently to filovirus infections than HuH7 cells.
At 3 h p.i., we identified down-regulations of nuclear receptors involved in cell prolif-
eration and differentiation (e.g., NR4A3 [353]), and the cell-cycle-regulating ubiqui-
tin ligase ANAPC10, which controls the progression through mitosis [354]. BLCAP,
which controls cell proliferation, apoptosis and the cell cycle [355], was also down-
regulated at 3 h p.i. HAVCR1 (previously known as TIM-1 ), which was down-
regulated at 7 h p.i., is a receptor for many viruses, including filoviruses [356] and
Dengue virus [357]. We observed that various histone genes (e.g., HIST1H2B6,
HIST1H1C ) were down-regulated at 23 h p.i. This may be an epigenetic signal that
could induce cell death. NPR3 was also down-regulated in R06E-J cells 23 h after
filovirus infection. This gene is involved in the regulation of blood volume and pres-
sure, cardiac function and some metabolic and growth processes [358]. Additional
information about the significant co-regulation of genes occurring during filovirus
infection in HuH7 and R06E-J cells can be found in Sec. ES4.
These findings may describe the major differences between human and bat cells
that occur during filovirus infection.
[Figure 5.19: grid of scatter plots with the panel titles 3h, 7h and 23h; all axes range from
-2 to 2; recoverable axis label: Mock vs. MARV expression (log2).]
Figure 5.19: Common gene regulation patterns after filovirus infection. The scatter plots demonstrate the fold changes in expression as determined by
DESeq of coding and non-coding RNAs in MARV- and EBOV-infected cells compared with expression in Mock controls 3, 7 and 23 h after EBOV and
MARV infections. We observed similar expression patterns in HuH7 cells at 3 h p.i. and in the bat cell line 7 h p.i., suggesting that the progress of filovirus
infection is slower in R06E-J cells. The scatter plot derived from the differential expression analysis of HuH7 cells at 23 h p.i. shows the large number of
differentially expressed genes. A detailed view of the figures (including genes outside of the plotted range of fold changes) can be found in the Electronic
Supplement, Fig. ES4B. Genes demonstrating a similar expression after infection with EBOV and MARV and with an abs(log2(FC)) > 1 are marked in
The JAK/STAT pathway. In EBOV-infected HuH7 cells, all genes coding for
members of the JAK/STAT pathway were slightly induced between 3 and 7 h p.i.
(Fig. 5.20A). However, the JAK/STAT system in R06E-J cells demonstrated only
a minimal response to EBOV and MARV infections. The EBOV protein VP24 has
a negative impact on STAT1 signaling [359–361], and STAT1 /2 were found to be
down-regulated at 23 h compared with their expression at 7 h p.i. in the HuH7 cell
line. Downstream genes in this pathway, such as EP300 and PIM1, were highly
up-regulated at 23 h at the mRNA level in EBOV-infected HuH7 cells (Fig. 5.17B).
mRNA of the signaling receptor IFNGR2 was up-regulated, and mRNA of the
interacting JAK2 was down-regulated at 23 h p.i. in EBOV-infected HuH7 cells.
This regulation may be attributed to the activation of feedback from PIM1 via
CISH to the receptor IFNGR2 [362, 363] at the protein level.
MARV infections trigger a different reaction in HuH7 cells than do EBOV infec-
tions: the PIM1 mRNA and the receptor IFNGR2 are down-regulated (Fig. 5.20A).
Most strikingly, while we currently do not know whether VP24 also interacts with
STAT1 in bats, the complete JAK/STAT pathway is mostly unaffected at the mRNA
level during EBOV infections in R06E-J cells (Fig. ES6.22).
The DUSP pathway. The most striking difference between EBOV-infected HuH7
and R06E-J cells was observed among the mRNAs encoding the various DUSPs,
which represents a possible key to the contrasting innate immune responses of hu-
man and bat cells [368, 369] (Fig. 5.20C). An up-regulated expression of DUSP1
has been observed for vaccinia virus-infected cells [370] and EBOV-infected human
macrophages [311]. DUSPs are critical regulators of several cellular pathways be-
cause they inhibit central immune activator genes such as MAPK8, MAPK14 and
MAPK1 /3 (also known as ERK2 /1 ) [369]. While the expression levels of these three
immune activator genes did not change significantly p.i., the drastic up-regulation
at 23 h p.i. of DUSP genes in HuH7 (up to 25 X) but not R06E-J (up to 3 X) cells
is worth noting.
We hypothesize that the following sequence of events occurs during EBOV in-
fections. Upon EBOV invasion of the host cells, an antiviral response is induced, in-
cluding the activation of NFκB and MAPK. Genes of other innate immune response
pathways (e.g., JAK/STAT, DDX58 ) are then suppressed. After EBOV infection,
DUSPs are highly up-regulated in HuH7 cells, correlating with the down-regulation
of MAPK8, MAPK1 /3 and MAPK14 mRNA levels. When translated, these genes
are responsible for the innate immune response [372]. PPP1R15A plays a central role
by binding to the receptor TGFBR1 and inhibiting additional components of the in-
nate immune system. Compared with HuH7 cells, R06E-J cells demonstrate almost
no or only a very slight up-regulation of PPP1R15A and DUSPs, very likely leading
to a stable antiviral response. Furthermore, the mRNA levels of all JAK/STAT
pathway genes in R06E-J cells remain constant during EBOV infection compared
with the levels in HuH7 cells. The viral protein VP24 inhibits STAT1 activity in hu-
mans, blocking signaling into the nucleus. We suggest that a possible feedback loop
exists that increases the number of IFNGR2 receptors in HuH7 cells. The robust
expression of interferon-stimulated genes could orchestrate the antiviral response of
the infected cell [373].
Differences in baseline expression levels between human and bat cell lines
To validate the RNA-Seq-derived read counts and observed differences in the base-
line expression levels of certain genes between the HuH7 and R06E-J cell lines,
we performed qRT-PCR analyses for the putative EBOV receptors NPC1 [374] and
HAVCR1 [375] and the toll-like receptor TLR3 on Mock and EBOV samples 3 and
23 h p.i. (Sec. ES7). We compared the 18S-normalized mRNA levels of these genes
(File ES7B) with the RNA-Seq-derived read counts from human and bat cells and
found a strong overall correlation. Based on our data, all three genes are expressed
in both cell lines. We observed that NPC1 is clearly more abundant in HuH7 cells
than in the R. aegyptiacus cell line. TLR3 is expressed at a greater level in R06E-J
cells than in the human cell line, but was also not differentially expressed. These
results support our RNA-Seq data. Interestingly, we observed differences in the
melting curves for TLR3 and HAVCR1 (Fig. ES7B), which could be due to dif-
ferences in the amplified sequences of human and bat RNA as identical degenerate
primers were used for the amplification of TLR3 from HuH7 and R06E-J cells. For
HAVCR1 expression, we observed a slight down-regulation between 3 and 23 h p.i.
in the R06E-J cells. However, further studies with different cell lines and primary
cells are required.
We observed clear differences in the baseline expression of some genes between
the HuH7 and R06E-J cell lines. However, we avoided the complications this issue
may cause when comparing homologous human and bat genes by focusing on the
calculated log2-fold changes instead of directly comparing read counts.
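Why fold changes sidestep baseline differences can be seen in a small Python sketch. The counts below are hypothetical, and the pseudocount is an assumption added to avoid taking the logarithm of zero; it is not described in the text.

```python
from math import log2


def log2_fold_change(infected, mock, pseudocount=1.0):
    """log2 ratio of infected to Mock normalized counts."""
    return log2((infected + pseudocount) / (mock + pseudocount))


# Hypothetical normalized read counts for one homologous gene pair.
# The baseline differs about eightfold between the two cell lines.
counts = {
    "HuH7": {"mock": 799.0, "infected": 1599.0},
    "R06E-J": {"mock": 99.0, "infected": 199.0},
}

for cell_line, c in counts.items():
    print(cell_line, round(log2_fold_change(c["infected"], c["mock"]), 2))
```

Because each cell line is compared with its own Mock control, the differing baselines cancel and only the relative response to infection remains.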
[Figure 5.20: pathway diagrams in three panels (A, B, C) with a legend. Compartments shown:
plasma membrane, cytoplasm, nucleus. Node labels include IFNGR2, JAK2, STAT1, STAT2, EP300,
PIM1, ST13, 4EBP1, mTOR, TGFBR1, TGFB3, PPP1R15A, ZFYVE9, DUSP1, DUSP4, DUSP5, DUSP6, DUSP8,
DUSP10, DUSP16, MAPK1, MAPK3, MAPK8 and MAPK14. Legend: columns for EBOV human, MARV human and
EBOV bat, each with the intervals 3-7h and 7-23h; change in expression: down/up regulation
>15 %, 2-fold up regulation (indicated if change >100 %), expression change <15 %, NA;
difference observed in: human vs human, EBOV vs MARV, human vs bat.]
Figure 5.20: Effects of filovirus infections on JAK/STAT, PPP1R15A, and DUSP pathways. (A) The JAK/STAT pathway. The JAK/STAT pathway
shows a common trend in expression levels: STAT1, STAT2 and JAK2 were up-regulated (↑) between 3 and 7 h p.i. and then down-regulated (↓) between
7 and 23 h p.i. in EBOV-infected HuH7 cells. The cytokine receptor IFNGR2 is not regulated between 3 and 7 h (=) and shows a 2 X up-regulation between
7 and 23 h (2 ↑) (Fig. ES6.22). (B) The PPP1R15A pathway. Growth arrest and DNA damage 34 (GADD34, officially known as PPP1R15A) can be
rapidly induced by several types of cellular stress. In R06E-J cells, PPP1R15A was slightly up-regulated (2 X) due to EBOV infection after 23 h; in HuH7
cells, we observed a strong up-regulation (45 X) in EBOV-infected cells and no up-regulation in MARV-infected cells. (C) The DUSP pathway. DUSP1,
8 and 10 demonstrate the highest specificity for MAPKs (MAPK14 and MAPK8 ). DUSP1 is localized in the nucleus, whereas DUSP8 and DUSP10 are
also available in the cytosol. The nuclear DUSPs are thought to be inducible phosphatases [369], and the implications of DUSPs during viral infections
have been demonstrated for DUSP1, which is up-regulated during Epstein-Barr virus [371] and vaccinia virus infections [370]. In response to the vaccinia
virus, DUSP1 is actively involved in antiviral countermeasures of the host cell via the regulation of MAPK phosphorylation. Legend: Boxes indicate
up/down-regulation from 3 to 7 h and from 7 to 23 h p.i. in EBOV-infected HuH7 cells (red); MARV-infected HuH7 cells (green); and EBOV-infected R06E-J
cells (blue). For cases where the expression level changed by more than 15 %, an arrow indicates the direction of regulation (↑/↓). When the expression
level changed by more than 100 % (2-fold change in transcription), the number beside the arrow indicates the fold change. “=” indicates expression changes
of <15 %. Squares around gene names indicate differential expression within the HuH7 cell line (red), between EBOV- and MARV-infected cells (green)
and between HuH7 and R06E-J cells (blue).
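The labelling rules of the legend can be expressed as a small Python function. This is a sketch of one possible reading, not the code used to draw the figure: the 15 % threshold is applied to the relative change of the later value against the earlier one, the fold change is taken in the direction of regulation, and exactly 2-fold is treated as labelled. These three choices are assumptions.

```python
def annotate_change(before, after):
    """Return '=', an arrow, or '<fold><arrow>' for a change between two positive levels."""
    ratio = after / before
    up = ratio >= 1
    fold = ratio if up else 1 / ratio
    arrow = "↑" if up else "↓"
    if abs(ratio - 1) < 0.15:
        return "="          # change below 15 %
    if fold >= 2:
        return f"{round(fold)}{arrow}"  # at least a doubling or halving
    return arrow            # change above 15 % but below 2-fold


# Hypothetical expression levels at two consecutive time points.
for before, after in [(100, 110), (100, 130), (100, 4500), (100, 50), (100, 80)]:
    print(before, after, annotate_change(before, after))
```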
5.2.5 Conclusions
The Ebola and Marburg filoviruses cause severe and often fatal infection in humans,
whereas bats, shown to be carriers, do not develop disease symptoms after infection.
As a first step towards identifying the cellular response that allows bats to survive
a filovirus infection, we provide a systematic overview of the genes that are differen-
tially expressed between human and bat cells during EBOV and MARV infections
at three time points p.i. Our investigations are based on 18 full transcriptomic
datasets.
In addition to the state-of-the-art RNA-Seq data analysis, comprising read count-
ing, normalization and calculations of fold changes, we investigated 1,500 genes
(∼7 % of human genes) in detail, overlapping them with the 2,500 genes poten-
tially affected during filovirus infections. For each gene, we investigated the follow-
ing aspects: (1) gene synonyms, (2) functional information, (3) different isoforms,
(4) characteristics in 5’/3’-UTR, (5) intronic transcripts, (6) single-nucleotide ex-
changes, (7) ncRNA detection, (8) description of novel genes, (9) genomic context,
(10) conservation, (11) expression profile changes, and (12) homologous gene detec-
tion in R. aegyptiacus. The result of this multidimensional bioinformatics analysis
is a comprehensive Electronic Supplement17 that provides quick insights into how
individual genes of interest are regulated during EBOV and MARV infections via
transcriptional changes and the generation of alternatively spliced forms. We were
only able to investigate expression patterns in two immortalized cell lines of differ-
ent tissue origin (humans, liver; bat, embryonic). The data collected and presented
here serve as valuable sources of information for generating and testing hypotheses
concerning the regulatory circuits that are active in filovirus-infected cells and may
17 www.rna.uni-jena.de/supplements/filovirus_human_bat/igo.php
[Pathway diagram; the figure caption is not preserved in the extracted text. Node labels include
NFKB1/RELA, NFKB2/RELB, IKBKG, TRAF2, TRAF3, TRAF6, IRAK2, IRAK4, SQSTM1, TRIB3, RIPK3, SYK,
EGF, EGFR, PDGF, VEGFA, FGF, RAS, RAF1, RAC1, PIK3CA, PDK1, Akt, mTor, RPTOR, 4EBP1, MAP3K1,
MAP2K1-4, MAP2K6, MAP2K7, MAPK1, MAPK3, MAPK8, MAPK14, DUSP1, DUSP4, DUSP5, DUSP6, NR4A1 and SMAD.]
5.3. A short NGS design detour: RVFV infection in bat cells
ples (IFN ), and samples infected with the RVFV Clone 13 (RVFV ) at 6 h and 24 h
post infection. Briefly, the RNA-Seq data was quality controlled and mapped to the
Myotis lucifugus (a closely related species with a genome available) reference genome
with STAR [39]. Read quantification was done with FeatureCounts [78] and the
DESeq2 [81] pipeline was used to call DEGs. Preliminary data of the rRNA− and
smallRNA libraries [15] is shown and briefly compared in Fig. 5.22.
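The mapping and counting steps named above could be wired together roughly as in the following Python sketch, which only builds the command lines and prints them. The file names, thread counts and the single-end layout are hypothetical, and the parameters actually used for this project are not given in the text. The differential expression step with DESeq2 runs in R and is not sketched here.

```python
import shlex


def star_command(genome_dir, fastq, prefix, threads=4):
    """Argument list for aligning one single-end library with STAR."""
    return [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,
        "--readFilesIn", fastq,
        "--outSAMtype", "BAM", "SortedByCoordinate",
        "--outFileNamePrefix", prefix,
    ]


def featurecounts_command(annotation_gtf, bam_files, out_table, threads=4):
    """Argument list for counting reads per gene with featureCounts."""
    return ["featureCounts", "-T", str(threads), "-a", annotation_gtf,
            "-o", out_table, *bam_files]


# Hypothetical sample names and paths.
samples = ["mock_6h_rep1", "rvfv_6h_rep1"]
bams = []
for sample in samples:
    prefix = f"mapped/{sample}_"
    print(shlex.join(star_command("index/myotis_lucifugus", f"reads/{sample}.fastq.gz", prefix)))
    # STAR appends this suffix when coordinate-sorted BAM output is requested.
    bams.append(prefix + "Aligned.sortedByCoord.out.bam")

print(shlex.join(featurecounts_command("annotation/myotis_lucifugus.gtf", bams, "counts.txt")))
```

Printing the commands instead of executing them keeps the sketch runnable without the tools installed; `subprocess.run(cmd, check=True)` would execute each list.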
With this short detour, we want to point out again the importance of a well-
thought-out study design, which can greatly improve the downstream bioinformatical
analyses. In some cases, factors such as the budget or safety requirements in the
laboratory (e.g., when working with deadly viruses) set limits to the NGS study design
(as for the project presented in Sec. 5.2). In such cases, a state-of-the-art analysis of
the RNA-Seq data is theoretically possible, but in most cases it does not yield
a comprehensive story from the data. Much more manual work and specialized
methods are needed to obtain meaningful results (as shown in Sec. 5.2). Therefore,
factors such as the sequencing depth, the number of biological replicates, the chosen
protocols for molecule enrichment, and the chosen read length are always important
parameters to consider and to optimize within the given constraints of a
new NGS project.
[Figure 5.22, panels A and B. (A) MA plots in two columns (smallRNA, rRNA-) and two rows
(Mock vs RVFV 6h, Mock vs RVFV 24h). (B) PCA plots for smallRNA and rRNA-; legend: condition
(IFN, Mock, RVFV) and timepoint (24h, 6h); recoverable axis labels: PC1: 45.1% variance,
PC1: 67.8% variance, PC2: 11.7% variance.]
Figure 5.22: Shown is some preliminary data comparing the smallRNA (left) and rRNA− (right)
protocols used for library preparation. (A) MA plots visualizing the mean expression per gene (x-
axis, unique normalized counts) against the log2 fold change (y-axis). In the first row, control and
RVFV-infected samples 6 h post infection (p.i.) are compared. In this early state of infection, only
a few genes are significantly (red) dysregulated. The second row shows the comparison between
control and RVFV 24 h p.i., revealing a large number of differentially expressed genes, especially in
the rRNA-depleted (rRNA−) samples. The expression patterns differ between the smallRNA and
rRNA− protocols: smallRNAs are less highly expressed but also show high
fold changes. (B) PCA plots of all 18 samples based on the smallRNA (left) and rRNA− (right)
samples. A nice separation between the IFN-stimulated samples and the Mock and RVFV samples
can be observed. Also, the Mock and RVFV samples 6 h p.i. cluster together, showing the slow
response of the cells at early steps of infection when viral replication just started. At 24 h p.i. the
RVFV samples separate from the rest. With the help of the PCAs, one outlier within the RVFV
24 h triplicates could be observed. Preliminary data was obtained from [15].
Chapter 6
Chapter 6. Single Nucleotide Investigations
In Sec. 6.1 we present our pipeline called PoSeiDon, that allows for the detection
of putative recombination events and positively selected sites in an alignment of
protein-coding homologous sequences. The pipeline was developed during my work
on two other projects [7, 18] and is now publicly available1 as an easy-to-use web
server [9].
In the last section of this chapter (Sec. 6.2), we present a comprehensive study of
the immunorelevant gene Mx1 in 13 bat species. Here, we focused again on bats (see
Sec. 5.2), because those flying mammals seem to be an important model organism
to study host-virus ‘arms-races’ as they show no symptoms when becoming infected
by viruses such as EBOV or MARV [4]. In this project, the PoSeiDon pipeline
(Sec. 6.1) was applied to detect positively selected sites in the Mx1 gene of bats.
The project was conducted together with Jonas Fuchs and Prof. Dr. Georg Kochs,
who performed the wet lab work and experiments at the Institute of Virology in
Freiburg.
Besides the single nucleotides investigated within the projects presented here,
we also worked on other, more restricted and specialized topics, which are only
mentioned briefly here.
In one project, conducted together with Dr. Daniel Steinbach from the Univer-
sitätsklinikum Jena, we are working on a different type of NGS data than presented
so far, derived from so-called whole exome sequencing. Here, we aimed to identify
somatic single nucleotide variants and insertions/deletions between the exomes of
bladder cancer patients of different tumor states [14]. The exome comprises the part
of the genome that is formed by exons, i.e., the sequences that remain in a mature
RNA after transcription and splicing. It consists of all the DNA that is transcribed
into mature RNA in any cell at any time point. Therefore, the exome is different
from the transcriptome (examined via RNA-Seq), where only RNA that has been
transcribed in a specific cell population at a specific time point is visible. The project is
again accompanied by a comprehensive Electronic Supplement2, providing interac-
tive online tables to check for specific variant positions in the human genome.
In another project, we focused on a single ncRNA class, tRNAs (transfer
RNAs), instead of looking at many (e.g., all annotated) genes in parallel (as in
Chapter 5). More specifically, we focused only on the genome of one cell organelle, the
mitochondrial genome, instead of looking at the full genome level. On this limited
data set, we tried to detect so-called remolding events in alignments of tRNAs by
utilizing maximum likelihood functions. My main contribution to this project was
the calculation of alignments of tRNAs and the implementation of a novel maximum
likelihood-based algorithm called MLRD (maximum likelihood remolding detection)
for identifying the position of a remolding event by utilizing a previously calculated
phylogenetic tree. Furthermore, I was mainly responsible for the visualization of
the alignments, trees and detected remolding events. More details about the evo-
lutionary process of tRNA remolding and mitochondrial genomes in general can be
found in our publication [3] and in the thesis of my former colleague Dr. Abdullah
H. Sahyoun [20].
1 The PoSeiDon web server: http://www.rna.uni-jena.de/poseidon
2 Available at http://www.rna.uni-jena.de/supplements/urology_all/
6.1. PoSeiDon: Positive Selection Detection
[Figure 6.2: workflow diagram of PoSeiDon with the stages Alignment, Recombination, Tree,
Positive Selection and Output. Labels recoverable from the figure: User Input (FASTA);
TranslatorX and Muscle producing a nucleotide alignment (NT ALN) and an amino acid alignment
(AA ALN); KH test and plot over breakpoint location (0 to 2000) leading to the resulting
fragments; RAxML and Newick Utilities; codon frequencies F61, F1x4 and F3x4; LRT with
Chi-squared test for M1a vs M2a, M7 vs M8 and M8a vs M8; dN/dS ratios; BEB. The example
alignment rows and posterior values shown in the output panel are not reproduced here.]
Figure 6.2: Workflow of the PoSeiDon pipeline and example output. The PoSeiDon pipeline comprises in-frame alignment of homologous protein-coding
sequences, detection of putative recombination events and evolutionary breakpoints, phylogenetic reconstructions and detection of positively selected sites
in the full alignment and all possible fragments. Finally, all results are combined and visualized in a user-friendly and clear HTML web page. The resulting
alignment fragments are indicated with colored bars in the HTML output.
6.1. PoSeiDon: Positive Selection Detection
Chapter 6. Single Nucleotide Investigations
[Figure 6.3, graphic: (A) species tree with bootstrap values and the families Phyllostomidae and Vespertilionidae labeled, including Artibeus jamaicensis, Sturnira lilium, Myotis davidii, Myotis daubentonii, Myotis lucifugus, Myotis brandtii, Eptesicus fuscus, Pipistrellus spec., Pteropus alecto and Eidolon helvum; scale in substitutions/site (0 to 0.3). (B) Support (0 to 0.9) plotted against breakpoint location (0 to 2000) for the alignment 1-1947, with breakpoints at 270 and 549. (C) Trees for alignment positions 1-270 (90 aa), 271-549 (92 aa) and 550-1947 (466 aa).]
Figure 6.3: (A) Species tree of 13 bat species. (B) Putative breakpoints in the nucleotide align-
ment identified with GARD. (C) Schematic view of phylogenetic trees derived from the alignment
fragments based on the identified breakpoints. The largest fragment (550-1947) of the alignment
follows the topology of the species tree, whereas the other parts show different evolutionary settings.
Different topologies can have an impact on the positive selection detection. Therefore, in
PoSeiDon, the full alignment as well as each recombinant fragment are analyzed independently.
The example is based on data from Fuchs et al. [7].
[Figure 6.4, graphic: domain overview of Mx1 (N terminus, BSE (B), G domain, Stalk, loop L4, C terminus) and the codon alignment of the 13 bat sequences for positions 544-580 in loop L4, with BEB posterior probabilities per site under the codon frequency models F3x4, F1x4 and F61 and the amino acid color legend. In-figure note: Exemplarily shown is a part of the hypervariable loop L4 region of bat Mx1. Significant (posterior probability > 0.95) positively selected sites are marked.]
Figure 6.4: Exemplary view of the PoSeiDon visualization of the Mx1 gene, zoomed in on the
hypervariable loop region L4 in the Stalk domain (Sec. 6.2). The loop is already known to interact
with certain (RNA) viruses and to block early steps of the viral replication cycle. Using the
evolution-guided pipeline described here, we were able to identify the loop L4 of Mx1 as a hot spot for
positive selection in bats [7], as previously also shown for primates [396]. For details see Sec. 6.2.
By splitting the alignment at possible recombination events identified by GARD, we also found
strong evidence of positive selection in the N-terminal region of bat Mx1 (see, for example, Tab. 6.1,
fragment1).
of the positive selection model based on a likelihood ratio test. Then, a Bayes
empirical Bayes (BEB) approach [395] is applied to calculate posterior probabilities
(PP) that a codon comes from the site class with ω > 1. Positively selected sites
with an assigned PP > 0.95 are depicted as significant.
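The significance test and the site filter described above can be sketched in a few lines. This is a minimal illustration and not the code used in PoSeiDon: it assumes that the comparison of M7 against M8 has two degrees of freedom (the usual convention for this model pair), for which the chi-squared survival function has the closed form exp(-x/2). The log-likelihoods, the per-site posterior probabilities and all function names are hypothetical.

```python
import math


def lrt_statistic(lnl_null, lnl_alt):
    """Twice the log-likelihood difference between alternative and null model."""
    return 2.0 * (lnl_alt - lnl_null)


def chi2_sf_df2(x):
    """Survival function of the chi-squared distribution with 2 degrees of freedom."""
    return math.exp(-x / 2.0) if x > 0 else 1.0


def significant_sites(beb, threshold=0.95):
    """Return codon positions whose BEB posterior probability exceeds the threshold."""
    return sorted(pos for pos, pp in beb.items() if pp > threshold)


# Hypothetical log-likelihoods for M7 (null) and M8 (allows omega > 1).
lnl_m7, lnl_m8 = -10250.0, -10200.0
stat = lrt_statistic(lnl_m7, lnl_m8)
p_value = chi2_sf_df2(stat)

# Hypothetical BEB posterior probabilities keyed by codon position.
beb_pp = {548: 0.984, 553: 0.867, 554: 0.950, 556: 0.961, 557: 0.686}

print(stat, p_value, significant_sites(beb_pp))
```

Sites that fall just below the threshold (here position 554) are not reported as significant by such a filter, which is why the graphical summary of all sites can still be informative.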
We graphically summarize all positively selected sites under varying codon frequency
models in the output (Fig. 6.2 and 6.4). This gives the user the opportunity
to investigate sites that would be dismissed from the output when applying a PP
threshold. For example, such sites could be located in regulatory domains of the final
protein and yield a lower PP value due to insufficient species sampling [397]. The
final output of PoSeiDon is based on a heavily modified version of the TranslatorX
HTML output (Fig. 6.4). The amino acid color code is adapted from TranslatorX.
All commands executed within the pipeline are summarized in the final output,
allowing advanced users to adjust the predefined parameters.
6.1.3 Conclusions
Here we present PoSeiDon, an easy-to-use, web-based pipeline for the accurate de-
tection of site-specific positive selection and recombination events in protein-coding
sequences. The input is a multiple FASTA file of homologous coding sequences that
is automatically transferred into a codon-based alignment. Since recombination can
have a profound impact on the evolutionary history of sequences, we initially check
[Figure 6.5, graphic: classification of bats under the schemes labeled MM and YY, with the groups Megachiroptera, Microchiroptera, Pteropodidae, Craseonycteridae, Hipposideridae, Rhinolophoidae, Rhinolophidae, Vespertilionoidea, Emballonuridae (Taphozous), Vespertilionidae, Mystacinidae, Thyropteridae, Phyllostomidae and (Other), and the species Pteropus vampyrus, Eidolon helvum, Rousettus aegyptiacus, Rhinolophus ferrumequinum, Myotis brandtii, Myotis lucifugus and Myotis davidii.]
vertebrates [402]. Except for polar re-
nents [403]. Although bats represent
20 % of the mammals [404], only 12 of
cluding echolocation and the ability to
Ebola virus [402]). Until recently, their
406].
6.2. Evolution and antiviral specificity of bat Mx proteins
cating [408] and fruit-eating bats; and Microchiroptera (microbats), containing 19
families of echolocating bats. Additional support comes from a molecular analysis of
mitochondrial cytochrome b of 648 bats [398] and a large-scale analysis of 27 genes
and morphological features [409].
More recently, the order Chiroptera has been classified into Yinpterochiroptera
(Pteropodidae and Rhinolophoidae) and Yangochiroptera [403] based on overwhelming
molecular evidence (see Fig. 6.5, ’YY’-Classification). This phylogenetic arrangement of
Chiroptera implies that the echolocating bats are paraphyletic and is supported by 13.7 kb of
17 nuclear gene fragments [403], 2 320 CDSs [399], KCNQ4 of 15 species [410], and
18 mitochondrial genomes [411].
Bats serve as a reservoir for various, often zoonotic viruses, including significant
human pathogens such as Ebola and influenza viruses. However, for unknown
reasons, viral infections rarely cause clinical symptoms in bats. A tight control
of viral replication by the host innate immune defense might contribute to this
phenomenon. Transcriptomic studies revealed the presence of the interferon-induced
antiviral myxovirus resistance (Mx) proteins in bats, but detailed functional aspects
have not been assessed. To provide evidence that bat Mx proteins might act as key
factors to control viral replication, we cloned Mx1 cDNAs from three bat families:
Pteropodidae, Phyllostomidae and Vespertilionidae. Phylogenetically, these bat Mx1
genes cluster closely with their human ortholog MxA. Using transfected cell cultures,
minireplicon systems, virus-like particles and virus infections, we determined the
antiviral potential of the bat Mx1 proteins. Bat Mx1 significantly reduced the
polymerase activity of viruses circulating in bats, including Ebola and influenza
A-like viruses. The related Thogoto virus, however, which is not known to infect bats,
was not inhibited by bat Mx1. Further, we provide evidence for positive selection
in bat Mx1 genes that might explain species-specific antiviral activities of these
proteins. Together, our data suggest a role for Mx1 in controlling these viruses in
their bat hosts.
The nucleotide sequences showed relatively high identities to other members of the
same bat family in a phylogenetic tree analysis (Fig. 6.7A). Moreover, the deduced
Mx1 amino acid sequences showed about 70% identity to Mx1 of other bat families
(Fig. 6.7C). An alignment of the amino acid sequences of all available bat Mx1
sequences (Fig. 6.8) reveals high sequence similarity across the whole molecule except
for the highly variable N-terminal part before the first BSE and the loop L4. The
phylogenetic analysis revealed that the bat Mx1 sequences form a separate branch within the
mammalian Mx1 tree (Fig. 6.7B). The close similarity to other mammalian Mx1
proteins is reflected by the sequence identities of the bat Mx1 sequences to the
human MxA sequence (Fig. 6.7C). Interestingly, the cDNA clones of all bat Mx1
genes showed allelic variations that were mostly silent, but in some species led to
amino acid changes (Fig. 6.7C). Based on our sequence alignment analyses, we
defined the cDNA clones with the most common nucleotide or amino acid variant
at particular positions as allele 1, which was used for further characterization.
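Percent identity values such as those reported above depend on how alignment gaps are counted. The following is a minimal sketch under the assumption that identity is computed over alignment columns in which neither sequence has a gap; other conventions exist, and the convention used for Fig. 6.7C is not stated here. The two aligned peptide strings and the function name are made up for illustration.

```python
def percent_identity(seq_a, seq_b, gap="-"):
    """Percent identical residues over columns where neither sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for res_a, res_b in zip(seq_a, seq_b):
        if res_a == gap or res_b == gap:
            continue  # skip columns that contain a gap in either sequence
        compared += 1
        if res_a == res_b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Hypothetical aligned fragments (16 columns, one gap column).
aligned_1 = "DQEYRTWLQK-IREKE"
aligned_2 = "DQAYRAALQKKIREKE"

print(percent_identity(aligned_1, aligned_2))
```

Counting gap columns as mismatches instead would lower the value, so identities from different studies are only comparable when the same convention is used.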
(A)
Region               M7 vs M8 (Χ²)   M7 vs M8 p-value   % sites with ω > 1   avg(ω)
full (aa 1-649)      101.39          < 0.001            6.26                 3.45
frag3 (aa 184-649)   112.69          < 0.001            6.59                 3.83

M8 BEB (PP > 0.95 / > 0.99):
full (aa 1-649): R191; A195; S347; F425; R429; T480; A535; E548; S553; L554; Q556; T557; S558; S559; D562; T565
frag3 (aa 184-649): R191; A195; S347; D422; F425; R429; T480; A535; E548; S553; L554; Q556; T557; S558; S559; D562; T565

(B) [Plot: model averaged support (0 to 0.9) against breakpoint location in nucleotides (0 to 2000) for the alignment 1-1947, with breakpoints at 270 and 549, corresponding to aa 90 and 183 of 649.]

(C) [Schematic of the Mx1 primary structure: N terminus, BSE (B), G domain, Stalk, loop L4, C terminus, with breakpoints at aa 90 and 183 and asterisks marking rapidly evolving sites.]
Figure 6.6: (A) Results of the evolutionary analysis for positively selected sites in full-length bat
Mx1 (aa 1-649) of 13 bat Mx1 coding sequences as presented in Fig. 6.7. In addition, positive
selection was analyzed for the fragments of the GARD analysis, frag1 (aa 1-90), frag2 (91-183),
frag3 (184-649), with respect to the sequence of M. daubentonii. P-values were obtained by performing
chi-squared tests on twice the difference of the computed log-likelihood values of the models
disallowing (M7) or allowing (M8) dN/dS (ω) > 1. The BEB column lists rapidly evolving sites
with a dN/dS > 1 and a posterior probability > 0.95, determined by the Bayes Empirical Bayes
approach implemented in CODEML. Indels were removed from the alignment prior to evolutionary analyses.
Amino acid positions correspond to the full-length Mx1 sequence of M. daubentonii. (B) Evidence
of evolutionary breakpoints in the bat Mx1 alignment of the 13 bat Mx1 sequences (as shown
in Fig. 6.7) identified with GARD. The best-fitting nucleotide substitution model (HKY85) was
applied. The GARD fragments and the supported breakpoints are indicated. (C) Illustration of
the primary structure of bat Mx1 adapted to the crystal structure of MxA [437]: the unstructured
regions (gray), the bundle signaling elements (B, red), the G domain (orange), the stalk
(green, blue) and loop 4 (L4). The rapidly evolving sites are indicated as arrowheads (analysis
of full-length bat Mx1 sequences, Fig. 6.7A) or asterisks (for the GARD fragments, Fig. 6.6B).
Breakpoints of recombination identified by GARD are indicated by vertical dotted lines.
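The fragment boundaries in amino acid and nucleotide coordinates are related by the codon length of three nucleotides. As a small worked sketch (the function name is illustrative and not part of PoSeiDon), the amino acid ranges of the three GARD fragments map back to the nucleotide ranges delimited by the breakpoints at positions 270 and 549 of the 1947 nt alignment:

```python
def aa_range_to_nt(aa_start, aa_end):
    """Convert a 1-based, inclusive amino acid range to nucleotide coordinates."""
    nt_start = 3 * (aa_start - 1) + 1  # first nucleotide of the first codon
    nt_end = 3 * aa_end                # last nucleotide of the last codon
    return nt_start, nt_end


# Amino acid ranges of the GARD fragments as given in the caption of Fig. 6.6.
fragments = {"frag1": (1, 90), "frag2": (91, 183), "frag3": (184, 649)}
nt_ranges = {name: aa_range_to_nt(*aa) for name, aa in fragments.items()}

for name, (start, end) in nt_ranges.items():
    print(name, start, end)
```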
[Figure 6.7, graphic: (A) and (B) phylogenetic trees with bootstrap support values, grouping the sequences into Pteropodidae, Phyllostomidae and Vespertilionidae.]
(C)
Family: Phyllostomidae, Vespertilionidae, Pteropodidae
Allelic variants: S367G, I415M, L559S, V562L, I593V
Identity to E. helvum Mx1: 69.1 %, 67.6 %, 73.1 %, 71.9 %, -, 79.4 %, 80.6 %
Identity to human MxA: 70.2 %, 69.0 %, 76.0 %, 77.3 %, 73.6 %, 72.1 %, 71.5 %
Figure 6.7: (A) Phylogenetic tree of bat Mx1 using a nucleotide sequence alignment with human
MxA as an outgroup. (B) Phylogenetic tree of mammalian Mx1 nucleotide sequences. (C) Allelic
variants of bat Mx1 and their sequence identity to E. helvum Mx1 and human MxA as determined
via multiple protein alignment.
[Figure 6.8, graphic: amino acid alignment in eight blocks, each listing the 13 species Carollia perspicillata, Sturnira lilium, Artibeus jamaicensis, Pipistrellus pipistrellus, Eptesicus fuscus, Myotis brandtii, Myotis lucifugus, Myotis davidii, Myotis daubentonii, Hypsignathus monstrosus, Rousettus aegyptiacus, Eidolon helvum and Pteropus alecto. Domain lines below the blocks, in order: BSE, G domain, BSE, Stalk, L4, Stalk, BSE.]
Figure 6.8: Amino acid sequence alignment of bat Mx1, based on the nucleotide-level
alignment used for the positive selection analyses. In-frame multiple sequence alignments of all 13 bat
Mx1 cDNA sequences were computed using TranslatorX and the aligner Muscle. Lines below
indicate structural domains of Mx1 according to the crystal structure of human MxA [437] (BSE:
bundle signaling element; L4: Loop 4). For the complete output of the positive selection analyses
performed with PoSeiDon, visit http://www.rna.uni-jena.de/supplements/mx1_bats/full_aln [9].
Within this thesis, we present only one antiviral assay from this study, which shows as an
example the effect of bat Mx1 against Ebola viruses (EBOV). Details about the
antiviral activity of bat Mx1 against other viruses like Vesicular stomatitis virus,
orthomyxoviruses, and bunyaviruses can be found in Fuchs et al. [7]. A comprehensive
analysis of the whole transcriptome response of human and bat cells to EBOV
(and also Marburg virus) infections is given in Sec. 5.2.
with viral helper plasmids encoding L-polymerase, VP35, VP30, NP, and the
cellular TIM-1 adhesion factor. The latter was included to enhance susceptibility
of the cells to infection with EBOV-VLPs at 24 h post transfection.
Using this approach, a robust Renilla luciferase expression was detected in
the cell lysates (Fig. 6.9). Omission of the L helper plasmid abolished expression
of the luciferase encoded by the minigenome, indicating the dependency
of reporter gene expression on viral polymerase activity. Co-expression
of human MxA or the three bat Mx1 proteins reduced luciferase activity to about
50% compared to the activity measured in the presence of the respective inactive
mutants, indicating that bat Mx1 proteins are able to control EBOV polymerase activity.

Figure 6.9: EBOV VLP infection. 293T cells (12-well format) were co-transfected with 50 ng NP,
375 ng L, 50 ng VP35, 30 ng VP30, 90 ng TIM-1 and 300 ng of the indicated Mx1 expression
plasmids 24 h prior to infection with EBOV VLPs. As a control, cells were treated with comparable
amounts of VLP preparation produced in the absence of the L construct (-L). At 24 h post
infection, the activity of the Renilla luciferase encoded by the viral genome was determined. The empty
vector control (without Mx expression) was set to 100%. Mx1 expression was controlled by Western
blot analysis. Significance was calculated with a one-sided Student’s t-test (n=3, *p ≤ 0.05 and
**p ≤ 0.01).
6.2.4 Conclusions
Bats are recognized as an important reservoir of potentially zoonotic viruses [448].
However, it is unclear how bats deal with the infection, replication and possible
persistence of these viruses, which often induce severe to fatal infections in humans
upon zoonotic transmission. Here we evaluated the role of the IFN-induced bat
Mx1 proteins in the control of diverse RNA viruses and found a broad antiviral
spectrum similar to the human MxA ortholog against orthomyxo-, rhabdo-, filo-,
and bunyaviruses (details about each virus can be found in Fuchs et al. [7]). Our
phylogenetic analysis of Mx1 proteins grouped the bat Mx1 sequences according
to their affiliation to the three different bat families [449]. Within the individual
families, bat Mx1 showed around 80% sequence identity. Between the families, the
bat Mx1 sequences displayed a reduced (around 70%) identity, which is comparable
with the identity to other mammalian Mx1 proteins, e.g. the human MxA. This
may reflect the long independent evolutionary history (about 40 to 50 million years)
and diversification of Mx1 genes in the individual Chiroptera families [407].
A detailed examination of the mammalian Mx1 sequences revealed a high pro-
portion of invariant amino acid residues under purifying selection, supporting the
view of Mx proteins as dynamin-like molecular machines with a sophisticated, highly
conserved structure [437] that allows evolution-driven variations only in a few flex-
ible regions [396, 450]. Accordingly, when analyzing the bat Mx1 sequences, we
identified residues under positive selection in two variable and surface exposed re-
gions. Most residues under positive selection were identified in the N terminus ahead
of the first BSE helix and in the C-terminal loop L4 (Fig. 6.6). The accumulation
of residues under positive selection in these two surface exposed, variable regions of
Mx1 [437] indicates the structural flexibility of these two regions compared to the
overall high structural conservation of the majority of the bat Mx1 molecules. In-
terestingly, positively selected positions in the N terminus were identified only when
individual gene segments were analyzed. Using GARD analysis, we detected two
breakpoints resulting in three gene segments, suggesting exchanges between ancient
bat Mx paralogs by recombination events (Fig. 6.6B and C). A comparable analysis
using the orthologous primate MxA genes identified clusters of positively selected
residues in the L4 loop, which has been previously identified as a determinant of
Mx antiviral specificity [450]. Positively selected positions in the N-terminal region
were identified in the paralogous primate MxB genes [396]. The various positively
selected residues in these two regions of bat Mx1 molecules argue for a long-standing
conflict of bats with diverse viral agents [451], which circulated or are still present
in bat populations.
The overall efficient inhibition of various viral pathogens by bat Mx1 indicates
that this ISG plays a central role in the control of viral replication in bats. Of note,
we cannot speculate about the identity of the pathogens that drove the ancient arms
race with bat Mx1, resulting in positively selected positions in today’s bat species.
Since the bat Mx1 cDNAs showed rather comparable antiviral activities against the
limited spectrum of RNA viruses employed in the present study, our results cannot
fully explain the evolution of bat Mx1 proteins. Follow-up studies should extend the
spectrum of viruses tested for bat Mx1 sensitivity and refine the connection between
Mx1 antiviral capacity and the genetic evolution of these important components of
the innate antiviral defense.
Chapter 7
Chapter 7. Conclusions and future perspectives
representation without any gaps for this bacterium. We conducted an extensive
annotation of protein- and non-coding genes for this new assembly and seven other
Mycobacteria genomes. We showed that the combination of different annotation
tools can improve the overall annotation. Further, we comprehensively compared
all eight Mycobacteria genomes and provided deep insights into the gene composition
and phylogeny of this pathogen.
In the future, the assembled genomes and corresponding annotations of protein-
and non-coding genes of both bacteria will be invaluable for the Chlamydia and
Mycobacteria community. For the Mycobacteria genome assembly, we showed that
the usage of different assembly tools and parameter settings can help to improve
the overall assembly. Currently, the approach used for the Chlamydia
assembly is being adapted to the assembly of more Chlamydia strains, which will
form the basis of a new comparative genome study.
tools can be easily integrated. To further improve the selection of the best assembly
results, we will adjust and extend our set of evaluation metrics. For example,
metrics like the total number of transcripts in an assembly or the N50 value
can distort the metric score: the N50 value can easily be inflated by adding
nucleotide stretches of low complexity to the contigs of an assembly.
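The sensitivity of the N50 value to such padding can be shown with a small sketch. The contig lengths are invented, and the function follows the common definition of N50 (the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly); it is not taken from our evaluation pipeline.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:  # at least half of the assembly is covered
            return length
    raise ValueError("no contigs given")


# Hypothetical assembly with five contigs.
contigs = [500, 400, 300, 200, 100]

# Same assembly after appending 1000 low-complexity nucleotides to the longest contig.
padded = [contigs[0] + 1000] + contigs[1:]

print(n50(contigs), n50(padded))
```

The padded assembly carries no additional biological information, yet its N50 is several times larger, which is why the N50 should not be used as a quality measure on its own.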
Furthermore, we can easily extend our evaluation pipeline and add novel de novo
assembly tools to further improve and complement our comparisons. For example,
a new de novo transcriptome assembly tool called IsoTree [452] was presented in
May 2017 and will be included in the evaluation. If the tool performs well, it will be
integrated into our assembly pipeline.
resource that can be used by other researchers to identify cellular responses that
might allow bats to survive filovirus infections.
The comprehensive study presented here was only manageable in such a short
time with the help of 30 experienced scientists who came together in 2014 to “Fight
against Ebola” and manually investigated a tremendous number of human and bat
genes. During this “hackathon” organized by our group, 1,500 genes (7.5 % of human
protein-coding genes) were investigated in great detail and form the backbone of
this outstanding study.
In comparison to the first DEG study presented in this chapter, the filovirus
project confronted us with much more challenging tasks. At the start of this project,
no genome of the fruit bat Rousettus aegyptiacus was available, so we decided to
construct a comprehensive de novo transcriptome assembly with various tools (as
described in Chapter 4). We annotated this assembly and used it to find significant
DEGs for this bat species. Furthermore, the lack of biological replicates
and the high replication rate of the Ebola virus made this project one of the most
challenging ones presented in this thesis. On the other hand, we had the great
opportunity to contribute to the development of antiviral drugs during the 2014
West African Ebola outbreak. In this project, we closely combined the analysis of
genome reference data with transcriptome
assembly data for human and bat. We showed by example that DEGs identified with
a genomic and a transcriptomic approach are actually comparable. Currently, the
best candidate genes and pathways identified by our study are being investigated further
in the wet lab of Prof. Dr. Stephan Becker in Marburg.
In the second part of this chapter, we extensively discussed the evolution and
antiviral specificity of a single bat gene coding for the interferon-induced Mx
protein and its response to infection with Ebola, influenza, and other
RNA viruses [7]. As already presented in Sec. 5.2, bats are a natural reservoir
for various viruses, including zoonotic pathogens like Ebola or rabies virus, that
rarely cause clinical symptoms in bats. It has been speculated that the interferon
system might play a key role in controlling viral replication in bats. In this project,
we showed that the interferon-induced Mx proteins are indeed key antiviral factors
in bats and have co-evolved with bat-borne viruses. For the first time, we evaluated
a large set of bat Mx1 proteins spanning three major bat families for their antiviral
potential. We described their phylogenetic relationship and revealed patterns of
positive selection. Our pipeline PoSeiDon was used to detect recombination
events and positively selected sites in the bat Mx1 gene [7], as well as in the Mx1
gene of rodents [18].
We have already collected multiple ideas for improving the PoSeiDon web
server. First of all, we plan to implement branch-site models into the pipeline. At
the moment, we can only test whether a whole gene is under positive selection and,
if so, statistically determine which single positions in the alignment have a
high impact on the positive selection. With the integration of branch-site models, we
would be able to detect whether specific branches in a phylogenetic tree are under positive
selection. Also, we will provide a version of PoSeiDon for local installation and
execution. We plan to distribute a Docker container for Linux/Windows/macOS,
which can be easily downloaded and executed without the need to locally install
and compile all required dependencies.
Furthermore, one objective of all projects presented throughout this thesis was
to find appropriate visualizations for the data and results. Whenever reasonable,
the obtained results are accompanied by comprehensive and interactive electronic
supplements to encourage other researchers to investigate the results in a productive
and transparent way. The general idea is to provide functions that allow a quick
examination of large amounts of data, to expose trends and to find patterns and
correlations within the data. Effective data visualization is an important part of the
decision-making process and helps to gain further insights and to picture possible
answers to certain life-science questions.
Besides the short-read NGS data this thesis is mainly based on, sequencing technologies
have emerged in the last few years that are able to produce reads of tremendous
length [29]. Such long-read sequencing technologies, as developed by
PacBio and Nanopore, overcome the length limitation of other NGS approaches like
Illumina, and are able to produce reads of a length of >40,000 bases. As those techniques
remain considerably more expensive and have lower throughput and higher
error rates than other platforms, the universal adoption of these technologies is still
limited. However, costs and error rates are continuously decreasing, and with the
emergence of such new technologies, existing problems will be exacerbated and new
problems will arise, which need to be computationally tackled in the future.
In this thesis, we have presented a broad variety of bioinformatical approaches,
not exclusively related to NGS data. Viewed from a distance, all the main results presented
here (Chapters 3–6) deal with different species, data sets, computational topics and
biological questions. However, we showed how to connect the different approaches
and methods in order to obtain a more comprehensive picture to answer certain
biological questions. Besides the large amount of results presented in this
thesis, one of our main aims was to encourage the reader to really look into the
data and not to trust only the reported significance values and fold changes obtained
from an NGS study. Combining different approaches from complementary fields, such
as genomics, transcriptomics and single nucleotide investigations, has the greatest
potential to produce comprehensive and helpful results, and to shed some light
on the Dark Art of Next-Generation Sequencing.
Bibliography
[1] Petra Möbius, Martin Hölzer, Marius Felder, Gabriele Nordsiek, Marco Groth,
Heike Köhler, Kathrin Reichwald, Matthias Platzer, and Manja Marz. “Compre-
hensive insights in the Mycobacterium avium subsp. paratuberculosis genome using
new WGS data of sheep strain JIII-386 from Germany”. In: Genome biology and
evolution 7.9 (2015), pp. 2585–2601.
[2] Martin Hölzer, Karine Laroucau, Heather Huot Creasy, Sandra Ott, Fabien Vo-
rimore, Patrik M Bavoil, Manja Marz, and Konrad Sachse. “Whole-genome se-
quence of Chlamydia gallinacea type strain 08-1274/3”. In: Genome Announcements
4.4 (2016), e00708–16.
[3] Abdullah H Sahyoun, Martin Hölzer, Frank Jühling, Christian Höner zu Siederdis-
sen, Marwa Al-Arab, Kifah Tout, Manja Marz, Martin Middendorf, Peter F Stadler,
and Matthias Bernt. “Towards a comprehensive picture of alloacceptor tRNA re-
molding in metazoan mitochondrial genomes”. In: Nucleic acids research 43.16
(2015), pp. 8044–8056.
[4] Martin Hölzer, Verena Krähling, et al. “Differential transcriptional responses to
Ebola and Marburg virus infection in bat and human cells”. In: Scientific Reports
6 (2016), p. 34589.
[5] Konstantin Riege, Martin Hölzer, Tilman E. Klassert, Emanuel Barth, Julia
Bräuer, Maximilian Collatz, Franziska Hufsky, Nelly B. Mostajo, Magdalena Stock,
Bertram Vogel, Hortense Slevogt, and Manja Marz. “Massive Effect on LncRNAs
in Human Monocytes During Fungal and Bacterial Infections and in Response to
Vitamins A and D”. In: Scientific Reports 7 (2017), p. 40598.
[6] Tilman E. Klassert, Julia Bräuer, Martin Hölzer, Magdalena Stock, Konstantin
Riege, Christina Zubiría-Barrera, Mario M. Müller, Silke Rummler, Christine Skerka,
Manja Marz, and Hortense Slevogt. “Differential Effects of vitamins A and D on
the Transcriptional Landscape of Human Monocytes during Infection”. In: Scientific
Reports 7 (2017), p. 40599.
[7] Jonas Fuchs, Martin Hölzer, Mirjam Schilling, Corinna Patzina, Andreas Schoen,
Thomas Hoenen, Gert Zimmer, Manja Marz, Friedemann Weber, Marcel A. Müller,
and Georg Kochs. “Evolution and antiviral specificity of interferon-induced Mx
proteins of bats against Ebola-, Influenza-, and other RNA viruses.” In: Journal of
Virology (2017), JVI–00361.
[8] Petra Möbius, Elisabeth Liebler-Tenorio, Martin Hölzer, and Heike Köhler. “Eval-
uation of associations between genotypes of Mycobacterium avium subsp. paratu-
berculosis and presence of intestinal lesions characteristic of paratuberculosis”. In:
Veterinary Microbiology 201 (2017), pp. 188–194.
[9] Martin Hölzer and Manja Marz. “PoSeiDon: a web server for the detection of
evolutionary recombination events and positive selection”. In: Bioinformatics (sub-
mitted).
[10] Petra Möbius, Gabriele Nordsiek, Martin Hölzer, Michael Jarek, Manja Marz,
and Heike Köhler. “Complete genome sequence of JII-1961 – a bovine Mycobac-
terium avium subsp. paratuberculosis field isolate from Germany”. In: Genome An-
nouncements (2017), submitted.
[11] The RNA tools and software consortium. “A community-driven catalog of RNA
bioinformatics tools and their ontologies”. In: preparation (2017).
[12] Nelly B Mostajo, Martin Hölzer, Abdullah H Sahyoun, Verena Krähling, Stephan
Becker, and Manja Marz. “A comprehensive annotation of non-coding RNAs in
bats”. In: preparation (2017).
[13] Sebastian Bartschat, Clara Bermudez-Santana, Anke Busch, Alexander Donath,
Jan Engelhardt, Andreas R Gruber, Jana Hertel, Michael Hiller, Martin Hölzer,
Franziska Hufsky, Emanuel Barth, Frank Jühling, et al. “Comparative Analysis of
Non-Coding RNAs in Nematodes”. In: preparation (2017).
[14] Martin Hölzer, Manja Marz, and Daniel Steinbach. “Elucidation of the molecular
mechanisms of progression of the non-muscle invasive urothelial carcinoma of the
urinary bladder (NMIBC) and identification of possible prognostic markers and
therapeutic targets by exome and 3’/5’ UTR mutation analyses”. In: preparation
(2017).
[15] Martin Hölzer, Friedemann Weber, and Manja Marz. “Description of the tran-
scriptomic landscape of the microbat Myotis daubentonii in response to interferon
stimulation and an infection with the Rift Valley fever virus”. In: Journal of Virology
(in preparation).
[16] Martin Hölzer and Manja Marz. “The Dark Art of de novo Transcriptome As-
sembly: A Comprehensive Across-species Comparison of Short Read RNA-Seq As-
semblers”. In: preparation (2017).
[17] Martin Hölzer and Manja Marz. “GOAssembler: A Method Pipeline for the Con-
struction, Evaluation and Clustering of de novo Transcriptome Assemblies”. In:
preparation (2018).
[18] Barbara Müther, Martin Hölzer, Manja Marz, and Georg Kochs. “Evolution and
antiviral specificity of interferon-induced Mx proteins in rodents”. In: Journal of
Virology (in preparation).
[19] Martin Hölzer, Ruman Gerst, and Manja Marz. “PCAGO: An interactive web
service to analyze RNA-Seq data with principal component analysis”. In: prepara-
tion (2017).
[20] Abdullah H Sahyoun. “Computational investigations into the evolution of mito-
chondrial genomes”. MA thesis. Leipzig: University Leipzig, 2015.
[21] Jon Ison, Matúš Kalaš, Inge Jonassen, Dan Bolser, Mahmut Uludag, Hamish Mc-
William, James Malone, Rodrigo Lopez, Steve Pettifer, and Peter Rice. “EDAM:
an ontology of bioinformatics operations, types of data and identifiers, topics and
formats”. In: Bioinformatics 29.10 (2013), pp. 1325–1332.
[22] Christine Durinx, Jo McEntyre, Ron Appel, Rolf Apweiler, Mary Barlow, Niklas
Blomberg, Chuck Cook, Elisabeth Gasteiger, Jee-Hyub Kim, Rodrigo Lopez, et al.
“Identifying ELIXIR Core Data Resources”. In: F1000Research 5 (2016).
[23] Mark D Adams, Jenny M Kelley, et al. “Complementary DNA sequencing: ex-
pressed sequence tags and human genome project”. In: Science 252.5013 (1991),
p. 1651.
[24] Francis S Collins, Michael Morgan, and Aristides Patrinos. “The Human Genome
Project: lessons from large-scale biology”. In: Science 300.5617 (2003), pp. 286–290.
[25] Jeremy Schmutz, Jeremy Wheeler, Jane Grimwood, Mark Dickson, Joan Yang,
Chenier Caoile, Eva Bajorek, Stacey Black, Yee Man Chan, Mirian Denys, et al.
“Quality assessment of the human genome sequence”. In: Nature 429.6990 (2004),
pp. 365–368.
[26] Jay Shendure and Hanlee Ji. “Next-generation DNA sequencing”. In: Nature biotech-
nology 26.10 (2008), pp. 1135–1145.
[27] Michael L Metzker. “Sequencing technologies – the next generation”. In: Nature
reviews genetics 11.1 (2010), pp. 31–46.
[28] HPJ Buermans and JT Den Dunnen. “Next generation sequencing technology: ad-
vances and applications”. In: Biochimica et Biophysica Acta (BBA)-Molecular Basis
of Disease 1842.10 (2014), pp. 1932–1941.
[29] Sara Goodwin, John D McPherson, and W Richard McCombie. “Coming of age:
ten years of next-generation sequencing technologies”. In: Nature Reviews Genetics
17.6 (2016), pp. 333–351.
[30] Anthony Rhoads and Kin Fai Au. “PacBio sequencing and its applications”. In:
Genomics, proteomics & bioinformatics 13.5 (2015), pp. 278–289.
[31] Ali Mortazavi, Brian A Williams, Kenneth McCue, Lorian Schaeffer, and Barbara
Wold. “Mapping and quantifying mammalian transcriptomes by RNA-Seq”. In:
Nature methods 5.7 (2008), pp. 621–628.
[32] Zhong Wang, Mark Gerstein, and Michael Snyder. “RNA-Seq: a revolutionary tool
for transcriptomics”. In: Nature Reviews Genetics 10.1 (2009), pp. 57–63.
[33] David C Corney. “RNA-seq using next generation sequencing”. In: Mater Methods
3 (2013), p. 203.
[34] Nicholas J Croucher, Maria C Fookes, Timothy T Perkins, Daniel J Turner, Samuel
B Marguerat, Thomas Keane, Michael A Quail, Miao He, Sammey Assefa, Jürg
Bähler, et al. “A simple method for directional transcriptome sequencing using
Illumina technology”. In: Nucleic acids research 37.22 (2009), e148–e148.
[35] Peter JA Cock, Christopher J Fields, Naohisa Goto, Michael L Heuer, and Peter M
Rice. “The Sanger FASTQ file format for sequences with quality scores, and the
Solexa/Illumina FASTQ variants”. In: Nucleic acids research 38.6 (2010), pp. 1767–
1771.
[36] John C Marioni, Christopher E Mason, Shrikant M Mane, Matthew Stephens, and
Yoav Gilad. “RNA-seq: an assessment of technical reproducibility and comparison
with gene expression arrays”. In: Genome research 18.9 (2008), pp. 1509–1517.
[37] Paul L Auer and RW Doerge. “Statistical design and analysis of RNA sequencing
data”. In: Genetics 185.2 (2010), pp. 405–416.
[38] Kimberly Robasky, Nathan E Lewis, and George M Church. “The role of replicates
for error mitigation in next-generation sequencing”. In: Nature Reviews Genetics
15.1 (2014), pp. 56–62.
[39] Alexander Dobin, Carrie A Davis, Felix Schlesinger, Jorg Drenkow, Chris Zaleski,
Sonali Jha, Philippe Batut, Mark Chaisson, and Thomas R Gingeras. “STAR: ul-
trafast universal RNA-seq aligner”. In: Bioinformatics 29.1 (2013), pp. 15–21.
[40] Evguenia Kopylova, Laurent Noé, and Hélène Touzet. “SortMeRNA: fast and ac-
curate filtering of ribosomal RNAs in metatranscriptomic data”. In: Bioinformatics
28.24 (2012), pp. 3211–3217.
[41] Jack A Gilbert and Margaret Hughes. “Gene expression profiling: metatranscrip-
tomics”. In: High-Throughput Next Generation Sequencing: Methods and Applica-
tions (2011), pp. 195–205.
[42] Matthew Kanke, Jeanette Baran-Gale, Jonathan Villanueva, and Praveen Sethupa-
thy. “miRquant 2.0: an Expanded Tool for Accurate Annotation and Quantification
of MicroRNAs and their isomiRs from Small RNA-Sequencing Data”. In: Journal
of Integrative Bioinformatics 13.5 (2016), p. 307.
[43] S Andrews et al. “FastQC: A quality control tool for high throughput sequence
data”. In: Reference Source (2010).
[44] Robert Schmieder and Robert Edwards. “Quality control and preprocessing of
metagenomic datasets”. In: Bioinformatics 27.6 (Mar. 2011), pp. 863–864.
doi: 10.1093/bioinformatics/btr026.
[45] Marcel Martin. “Cutadapt removes adapter sequences from high-throughput
sequencing reads”. In: EMBnet.journal 17.1 (2011), pp. 10–12.
[46] Cole Trapnell, Brian A. Williams, Geo Pertea, Ali Mortazavi, Gordon Kwan, Mar-
ijke J. van Baren, Steven L. Salzberg, Barbara J. Wold, and Lior Pachter. “Tran-
script assembly and quantification by RNA-Seq reveals unannotated transcripts
and isoform switching during cell differentiation”. In: Nat Biotechnol 28.5
(May 2010), pp. 511–515. doi: 10.1038/nbt.1621.
[47] Mitchell Guttman, Manuel Garber, Joshua Z Levin, Julie Donaghey, James Robin-
son, Xian Adiconis, Lin Fan, Magdalena J Koziol, Andreas Gnirke, Chad Nusbaum,
et al. “Ab initio reconstruction of cell type-specific transcriptomes in mouse reveals
the conserved multi-exonic structure of lincRNAs”. In: Nature biotechnology 28.5
(2010), pp. 503–510.
[48] Zhenyu Li, Yanxiang Chen, Desheng Mu, Jianying Yuan, Yujian Shi, Hao Zhang,
Jun Gan, Nan Li, Xuesong Hu, Binghang Liu, et al. “Comparison of the two major
classes of assembly algorithms: overlap–layout–consensus and de–bruijn–graph”. In:
Briefings in functional genomics 11.1 (2012), pp. 25–37.
[49] Phillip EC Compeau, Pavel A Pevzner, and Glenn Tesler. “How to apply de Bruijn
graphs to genome assembly”. In: Nature biotechnology 29.11 (2011), pp. 987–991.
[50] Pavel A Pevzner, Haixu Tang, and Michael S Waterman. “An Eulerian path ap-
proach to DNA fragment assembly”. In: Proceedings of the National Academy of
Sciences 98.17 (2001), pp. 9748–9753.
[51] Daniel R Zerbino and Ewan Birney. “Velvet: algorithms for de novo short read
assembly using de Bruijn graphs”. In: Genome Research 18.5 (2008), pp. 821–829.
[52] Jared T Simpson, Kim Wong, Shaun D Jackman, Jacqueline E Schein, Steven JM
Jones, and İnanç Birol. “ABySS: a parallel assembler for short read sequence data”.
In: Genome Research 19.6 (2009), pp. 1117–1123.
[53] Ruibang Luo et al. “SOAPdenovo2: an empirically improved memory-efficient short-
read de novo assembler”. In: GigaScience 1.1 (2012), p. 18.
[54] Anton Bankevich, Sergey Nurk, Dmitry Antipov, Alexey A Gurevich, Mikhail
Dvorkin, Alexander S Kulikov, Valery M Lesin, Sergey I Nikolenko, Son Pham,
Andrey D Prjibelski, et al. “SPAdes: A new genome assembly algorithm and its
applications to single-cell sequencing”. In: Journal of Computational Biology 19.5
(2012), pp. 455–477.
[55] Sergey Nurk, Anton Bankevich, Dmitry Antipov, Alexey Gurevich, Anton Ko-
robeynikov, Alla Lapidus, Andrey Prjibelsky, Alexey Pyshkin, Alexander Sirotkin,
Yakov Sirotkin, et al. “Assembling Genomes and mini-metagenomes from highly
chimeric reads”. In: Research in Computational Molecular Biology. Springer. 2013,
pp. 158–170.
[56] Inanç Birol, Shaun D Jackman, Cydney B Nielsen, Jenny Q Qian, Richard Varhol,
Greg Stazyk, Ryan D Morin, Yongjun Zhao, Martin Hirst, Jacqueline E Schein,
et al. “De novo transcriptome assembly with ABySS”. In: Bioinformatics 25.21
(2009), pp. 2872–2877.
[57] David R Bentley, Shankar Balasubramanian, Harold P Swerdlow, Geoffrey P Smith,
John Milton, Clive G Brown, Kevin P Hall, Dirk J Evers, Colin L Barnes, Helen R
Bignell, et al. “Accurate whole human genome sequencing using reversible termi-
nator chemistry”. In: Nature 456.7218 (2008), pp. 53–59.
[58] Jeffrey Martin, Vincent M Bruno, Zhide Fang, Xiandong Meng, Matthew Blow,
Tao Zhang, Gavin Sherlock, Michael Snyder, and Zhong Wang. “Rnnotator: an au-
tomated de novo transcriptome assembly pipeline from stranded RNA-Seq reads”.
In: BMC genomics 11.1 (2010), p. 663.
[59] Steven L Salzberg and James A Yorke. “Beware of mis-assembled genomes”. In:
Bioinformatics 21.24 (2005), pp. 4320–4321.
[60] Marcel H Schulz, Daniel R Zerbino, Martin Vingron, and Ewan Birney. “Oases:
robust de novo RNA-seq assembly across the dynamic range of expression levels”.
In: Bioinformatics 28.8 (2012), pp. 1086–1092.
[61] Gordon Robertson, Jacqueline Schein, Readman Chiu, Richard Corbett, Matthew
Field, Shaun D Jackman, Karen Mungall, Sam Lee, Hisanaga Mark Okada, Jenny Q
Qian, et al. “De novo assembly and analysis of RNA-seq data”. In: Nature Methods
7.11 (Nov. 2010), pp. 909–912.
[62] M G Grabherr et al. “Full-length transcriptome assembly from RNA-seq data with-
out a reference genome”. In: Nature Biotechnology 29.7 (May 2011), pp. 644–652.
[63] Yinlong Xie et al. “SOAPdenovo-Trans: de novo transcriptome assembly with short
RNA-Seq reads”. In: Bioinformatics 30.12 (June 2014), pp. 1660–1666. doi:
10.1093/bioinformatics/btu077.
[64] Bastien Chevreux, Thomas Pfisterer, Bernd Drescher, Albert J Driesel, Werner
EG Müller, Thomas Wetter, and Sándor Suhai. “Using the miraEST assembler for
reliable and automated mRNA transcript assembly and SNP detection in sequenced
ESTs”. In: Genome research 14.6 (2004), pp. 1147–1159.
[81] Simon Anders and Wolfgang Huber. “Differential expression analysis for sequence
count data”. In: Genome Biol 11.10 (2010), R106. doi: 10.1186/gb-2010-11-10-r106.
[82] Mark D Robinson, Davis J McCarthy, and Gordon K Smyth. “edgeR: a Biocon-
ductor package for differential expression analysis of digital gene expression data”.
In: Bioinformatics 26.1 (2010), pp. 139–140.
[83] Matthew E Ritchie, Belinda Phipson, Di Wu, Yifang Hu, Charity W Law, Wei Shi,
and Gordon K Smyth. “Limma powers differential expression analyses for RNA-
sequencing and microarray studies”. In: Nucleic acids research (2015), gkv007.
[84] Robert C Gentleman, Vincent J Carey, Douglas M Bates, Ben Bolstad, Marcel
Dettling, Sandrine Dudoit, Byron Ellis, Laurent Gautier, Yongchao Ge, Jeff Gentry,
et al. “Bioconductor: open software development for computational biology and
bioinformatics”. In: Genome biology 5.10 (2004), R80.
[85] Melanie A Huntley, Jessica L Larson, Christina Chaivorapol, Gabriel Becker, Michael
Lawrence, Jason A Hackney, and Joshua S Kaminker. “ReportingTools: an au-
tomated result processing and presentation toolkit for high-throughput genomic
analyses”. In: Bioinformatics 29.24 (2013), pp. 3220–3221.
[86] Weijun Luo, Michael S Friedman, Kerby Shedden, Kurt D Hankenson, and Peter J
Woolf. “GAGE: generally applicable gene set enrichment for pathway analysis”.
In: BMC Bioinformatics 10 (2009), p. 161. doi: 10.1186/1471-2105-10-161.
[87] Weijun Luo and Cory Brouwer. “Pathview: an R/Bioconductor package for pathway-
based data integration and visualization”. In: Bioinformatics 29.14 (2013), pp. 1830–
1831. doi: 10.1093/bioinformatics/btt285.
[88] Ryan R Wick, Mark B Schultz, Justin Zobel, and Kathryn E Holt. “Bandage:
interactive visualization of de novo genome assemblies”. In: Bioinformatics (2015),
btv383.
[89] Konrad Sachse, Patrik M Bavoil, Bernhard Kaltenboeck, Richard S Stephens, Cho-
Chou Kuo, Ramon Rosselló-Móra, and Matthias Horn. “Emendation of the family
Chlamydiaceae: proposal of a single genus, Chlamydia, to include all currently rec-
ognized species”. In: Systematic and applied microbiology 38.2 (2015), pp. 99–103.
[90] Konrad Sachse, Karine Laroucau, Konstantin Riege, Stefanie Wehner, Meik Dilcher,
Heather Huot Creasy, Manfred Weidmann, Garry Myers, Fabien Vorimore, Nadia
Vicari, et al. “Evidence for the existence of two new members of the family Chlamy-
diaceae and proposal of Chlamydia avium sp. nov. and Chlamydia gallinacea sp.
nov.” In: Systematic and applied microbiology 37.2 (2014), pp. 79–88.
[91] Karine Laroucau, Fabien Vorimore, Rachid Aaziz, Angela Berndt, Evelyn Schubert,
and Konrad Sachse. “Isolation of a new chlamydial agent from infected domestic
poultry coincided with cases of atypical pneumonia among slaughterhouse workers
in France”. In: Infection, Genetics and Evolution 9.6 (2009), pp. 1240–1247.
[92] Virginie Hulin, Sabrina Oger, Fabien Vorimore, Rachid Aaziz, Bertille de Bar-
beyrac, Jacques Berruchon, Konrad Sachse, and Karine Laroucau. “Host prefer-
ence and zoonotic potential of Chlamydia psittaci and C. gallinacea in poultry”.
In: Pathogens and disease 73.1 (2015), pp. 1–11.
[93] Konrad Sachse, K Laroucau, and D Vanrompay. “Avian Chlamydiosis”. In: Curr
Clin Microbiol Reports 2 (2015), pp. 10–21.
[94] Weina Guo, Jing Li, Bernhard Kaltenboeck, Jiansen Gong, Weixing Fan, and
Chengming Wang. “Chlamydia gallinacea, not C. psittaci, is the endemic chlamydial
species in chicken (Gallus gallus)”. In: Scientific reports 6 (2016), p. 19638.
[95] K Laroucau, R Aaziz, L Meurice, V Servas, I Chossat, H Royer, B de Barbeyrac, V
Vaillant, JL Moyen, F Meziani, et al. “Outbreak of psittacosis in a group of women
exposed to Chlamydia psittaci–infected chickens”. In: Euro Surveill (2014).
[96] Chin Lung Lu, Kun-Tze Chen, Shih-Yuan Huang, and Hsien-Tai Chiu. “CAR: con-
tig assembly of prokaryotic draft genomes using rearrangements”. In: BMC bioin-
formatics 15.1 (2014), p. 381.
[97] Kazutaka Katoh and Daron M Standley. “MAFFT multiple sequence alignment
software version 7: improvements in performance and usability”. In: Molecular bi-
ology and evolution 30.4 (2013), pp. 772–780.
[98] Torsten Seemann. “Prokka: rapid prokaryotic genome annotation”. In: Bioinformat-
ics (2014), btu153.
[99] CJ Clarke and D Little. “The pathology of ovine paratuberculosis: gross and his-
tological changes in the intestine and other tissues”. In: Journal of comparative
pathology 114.4 (1996), pp. 419–437.
[100] Marie-Françoise Thorel, Micah Krichevsky, and Véronique Vincent Lévy-Frébault.
“Numerical Taxonomy of Mycobactin-Dependent Mycobacteria, Emended Descrip-
tion of Mycobacterium avium, and Description of Mycobacterium avium subsp.
avium subsp. nov., Mycobacterium avium subsp. paratuberculosis subsp. nov., and
Mycobacterium avium subsp. silvaticum subsp. nov.” In: International Journal of
Systematic Bacteriology 40.3 (1990), pp. 254–260.
[101] Wouter Mijs, Petra de Haas, Rudi Rossau, Tridia Van der Laan, Leen Rigouts,
Françoise Portaels, and Dick van Soolingen. “Molecular evidence to support a
proposal to reserve the designation Mycobacterium avium subsp. avium for bird-
type isolates and ’M. avium subsp. hominissuis’ for the human/porcine type of
M. avium.” In: International Journal of Systematic and Evolutionary Microbiology
52.5 (2002), pp. 1505–1518.
[102] Christine Y Turenne, Desmond M Collins, David C Alexander, and Marcel A Behr.
“Mycobacterium avium subsp. paratuberculosis and M. avium subsp. avium are
independently evolved pathogenic clones of a much broader group of M. avium
organisms”. In: Journal of bacteriology 190.7 (2008), pp. 2479–2487.
[103] Chia-wei Wu, Jeremy Glasner, Michael Collins, Saleh Naser, and Adel M Talaat.
“Whole-genome plasticity among Mycobacterium avium subspecies: insights from
comparative genomic hybridizations”. In: Journal of Bacteriology 188.2 (2006),
pp. 711–723.
[104] Michael Paustian, Xiaochun Zhu, Srinand Sreevatsan, Suelee Robbe-Austerman,
Vivek Kapur, and John Bannantine. “Comparative genomic analysis of Mycobac-
terium avium subspecies obtained from multiple host species”. In: BMC Genomics
9.1 (2008), p. 135.
[105] Chung-Yi Hsu, Chia-Wei Wu, and Adel M Talaat. “Genome-wide sequence vari-
ation among Mycobacterium avium subspecies paratuberculosis isolates: a better
understanding of Johne’s disease transmission dynamics”. In: Frontiers in
microbiology 2 (2011), p. 236.
[116] Pallab Ghosh, Chungyi Hsu, Essam J Alyamani, Maher M Shehata, Musaad A Al-
Dubaib, Abdulmohsen Al-Naeem, Mahmoud Hashad, Osama M Mahmoud, Khalid
BJ Alharbi, Khalid Al-Busadah, et al. “Genome-wide Analysis of the Emerging
Infection with Mycobacterium avium subspecies paratuberculosis in the Arabian
Camels (Camelus dromedarius)”. In: PloS one 7.2 (2012), e31947.
[117] Richard J Whittington, D Jeff Marshall, Paul J Nicholls, Ian B Marsh, and Leslie A
Reddacliff. “Survival and dormancy of Mycobacterium avium subsp. paratuberculo-
sis in the environment”. In: Applied and Environmental Microbiology 70.5 (2004),
pp. 2989–3004.
[118] RW Pickup, G Rhodes, TJ Bull, S Arnott, K Sidi-Boumedine, M Hurley, and J
Hermon-Taylor. “Mycobacterium avium subsp. paratuberculosis in lake catchments,
in river water abstracted for domestic use, and in effluent from domestic sewage
treatment works: diverse opportunities for environmental cycling and human expo-
sure”. In: Applied and environmental microbiology 72.6 (2006), pp. 4067–4077.
[119] Glenn Rhodes, Hollian Richardson, John Hermon-Taylor, Andrew Weightman, An-
drew Higham, and Roger Pickup. “Mycobacterium avium subspecies paratuberculo-
sis: human exposure through environmental and domestic aerosols”. In: Pathogens
3.3 (2014), pp. 577–595.
[120] H Shankar, SV Singh, PK Singh, AV Singh, JS Sohal, and RJ Greenstein. “Presence,
characterization, and genotype profiles of Mycobacterium avium subspecies paratu-
berculosis from unpasteurized individual and pooled milk, commercial pasteurized
milk, and milk products in India by culture, PCR, and PCR-REA methods”. In:
International Journal of Infectious Diseases 14.2 (2010), e121–e126.
[121] Birgit Stief, Petra Möbius, Heidemarie Türk, Uwe Hörügel, Carina Arnold, and
Dietrich Pöhle. “Paratuberculosis in a miniature donkey (Equus asinus f. asinus)”.
In: Berl. Münch. Tierärztl. Wschr. 7.1–2 (2012), pp. 38–44.
[122] Ken Over, Philip G Crandall, Corliss A O’Bryan, and Steven C Ricke. “Current
perspectives on Mycobacterium avium subsp. paratuberculosis, Johne’s disease, and
Crohn’s disease: a review”. In: Critical reviews in microbiology 37.2 (2011), pp. 141–
156.
[123] Raja Atreya, Michael Bülte, Gerald-F Gerlach, Ralph Goethe, Mathias W Hornef,
Heike Köhler, Jochen Meens, Petra Möbius, Elke Roeb, Siegfried Weiss, et al.
“Facts, myths and hypotheses on the zoonotic nature of Mycobacterium avium sub-
species paratuberculosis”. In: International Journal of Medical Microbiology 304.7
(2014), pp. 858–867.
[124] Lingling Li, John P Bannantine, Qing Zhang, Alongkorn Amonsin, Barbara J May,
David Alt, Nilanjana Banerji, Sagarika Kanjilal, and Vivek Kapur. “The complete
genome sequence of Mycobacterium avium subspecies paratuberculosis”. In: Proceed-
ings of the National Academy of Sciences of the United States of America 102.35
(2005), pp. 12344–12349.
[125] James W Wynne, Torsten Seemann, Dieter M Bulach, Scott A Coutts, Adel M
Talaat, and Wojtek P Michalski. “Resequencing the Mycobacterium avium subsp.
paratuberculosis K10 genome: improved annotation and revised genome sequence”.
In: Journal of bacteriology 192.23 (2010), pp. 6319–6320.
[126] James W Wynne, Tim J Bull, Torsten Seemann, Dieter M Bulach, Josef Wagner,
Carl D Kirkwood, and Wojtek P Michalski. “Exploring the zoonotic potential of
Mycobacterium avium subspecies paratuberculosis through comparative genomics”.
In: PloS one 6.7 (2011), e22171.
[127] John P Bannantine, Chia-wei Wu, Chungyi Hsu, Shiguo Zhou, David C Schwartz,
Darrell O Bayles, Michael L Paustian, David P Alt, Srinand Sreevatsan, Vivek
Kapur, et al. “Genome sequencing of ovine isolates of Mycobacterium avium sub-
species paratuberculosis offers insights into host association”. In: BMC genomics
13.1 (2012), p. 89.
[128] John P Bannantine, Lingling Li, Michael Mwangi, Rebecca Cote, JA Raygoza
Garay, and Vivek Kapur. “Complete Genome Sequence of Mycobacterium avium
subsp. paratuberculosis, Isolated from Human Breast Milk”. In: Genome Announc.
2(1) (2014), e01252–13. doi: 10.1128/genomeA.01252-13.
[129] Karen Dohmann, Birgit Strommenger, Karen Stevenson, Lucia de Juan, Janin
Stratmann, Vivek Kapur, Tim J Bull, and Gerald-Friedrich Gerlach. “Character-
ization of genetic differences between Mycobacterium avium subsp. paratuberculo-
sis type I and type II isolates”. In: Journal of clinical microbiology 41.11 (2003),
pp. 5215–5223.
[130] Ian B Marsh, John P Bannantine, Michael L Paustian, Mark L Tizard, Vivek
Kapur, and Richard J Whittington. “Genomic comparison of Mycobacterium avium
subsp. paratuberculosis sheep and cattle strains by microarray hybridization”. In:
Journal of bacteriology 188.6 (2006), pp. 2290–2293.
[131] Makeda Semret, Christine Y Turenne, Petra de Haas, Desmond M Collins, and
Marcel A Behr. “Differentiating host-associated variants of Mycobacterium avium
by PCR for detection of large sequence polymorphisms”. In: Journal of clinical
microbiology 44.3 (2006), pp. 881–887.
[132] David C Alexander, Christine Y Turenne, and Marcel A Behr. “Insertion and dele-
tion events that define the pathogen Mycobacterium avium subsp. paratuberculosis”.
In: Journal of bacteriology 191.3 (2009), pp. 1018–1025.
[133] C M Sharma, S Hoffmann, F Darfeuille, J Reignier, S Findeiss, A Sittka, S Chabas,
K Reiche, J Hackermüller, R Reinhardt, P F Stadler, and J Vogel. “The pri-
mary transcriptome of the major human pathogen Helicobacter pylori”. In: Nature
464.7286 (2010), pp. 250–255. doi: 10.1038/nature08756.
[134] Marcus Lechner, Astrid Nickel, Stefanie Wehner, Konstantin Riege, Wieseke, Be-
nedikt M Beckmann, Roland K Hartmann, and Manja Marz. “Genomewide com-
parison and novel ncRNAs in Aquificales”. In: BMC genomics 15.1 (2014), p. 522.
[135] Stefanie Wehner, Gopala K. Mannala, Xiaoxing Qing, Madhugiri Ramakanth, Tri-
nad Chakraborty, Mobarak Abu Mraheil, Torsten Hain, and Manja Marz. “Detec-
tion of very long antisense transcripts by whole transcriptome RNA-Seq analysis of
Listeria monocytogenes by semiconductor sequencing technology”. In: PLoS ONE
9.10 (2014), e108639. doi: 10.1371/journal.pone.0108639.
[136] Kristine B Arnvig and Douglas B Young. “Identification of small RNAs in My-
cobacterium tuberculosis”. In: Molecular Microbiology 73.3 (2009), pp. 397–408.
[137] Kristine B Arnvig and Douglas B Young. “Regulation of pathogen metabolism by
small RNA”. In: Drug Discovery Today: Disease Mechanisms 7.1 (2010), e19–e24.
[138] Dmitriy Ignatov, Sofia Malakho, Konstantin Majorov, Timofey Skvortsov, Alexan-
der Apt, and Tatyana Azhikina. “RNA-Seq Analysis of Mycobacterium avium Non-
Coding Transcriptome”. In: PloS one 8.9 (2013), e74209.
[139] S Englund, G Bölske, A Ballagi-Pordany, and K-E Johansson. “Detection of My-
cobacterium avium subsp. paratuberculosis in tissue samples by single, fluorescent
and nested PCR based on the IS900 gene”. In: Veterinary microbiology 81.3 (2001),
pp. 257–271.
[140] Petra Möbius, Gabriele Luyven, Helmut Hotzel, and Heike Köhler. “High genetic
diversity among Mycobacterium avium subsp. paratuberculosis strains from Ger-
man cattle herds shown by combination of IS900 restriction fragment length poly-
morphism analysis and mycobacterial interspersed repetitive unit-variable-number
tandem-repeat typing”. In: Journal of clinical microbiology 46.3 (2008), pp. 972–
981.
[141] Virginie C Thibault, Maggy Grayon, Maria Laura Boschiroli, Christine Hubbans,
Pieter Overduin, Karen Stevenson, Maria Cristina Gutierrez, Philip Supply, and
Franck Biet. “New variable-number tandem-repeat markers for typing Mycobac-
terium avium subsp. paratuberculosis and M. avium strains: comparison with IS900
and IS1245 restriction fragment length polymorphism typing”. In: Journal of clin-
ical microbiology 45.8 (2007), pp. 2404–2410.
[142] Alongkorn Amonsin, Ling Ling Li, Qing Zhang, John P Bannantine, Alifiya S
Motiwala, Srinand Sreevatsan, and Vivek Kapur. “Multilocus short sequence re-
peat sequencing approach for differentiating among Mycobacterium avium subsp.
paratuberculosis strains”. In: Journal of clinical microbiology 42.4 (2004), pp. 1694–
1702.
Dick van Soolingen, PW Hermans, PE De Haas, DR Soll, and JD Van Embden.
“Occurrence and stability of insertion sequences in Mycobacterium tuberculosis com-
plex strains: evaluation of an insertion sequence-dependent DNA polymorphism as
a tool in the epidemiology of tuberculosis.” In: Journal of clinical microbiology 29.11
(1991), pp. 2578–2586.
[144] Marten Boetzer, Christiaan V Henkel, Hans J Jansen, Derek Butler, and Walter
Pirovano. “Scaffolding pre-assembled contigs using SSPACE”. In: Bioinformatics
27.4 (2011), pp. 578–579.
[145] Te-Chin Chu, Chen-Hua Lu, Tsunglin Liu, Greg C Lee, Wen-Hsiung Li, and Arthur
Chun-Chieh Shih. “Assembler for de novo assembly of large genomes”. In: Proc Natl
Acad Sci 110.36 (2013), E3417–E3424.
[146] Weizhong Li and Adam Godzik. “Cd-hit: a fast program for clustering and compar-
ing large sets of protein or nucleotide sequences”. In: Bioinformatics 22.13 (2006),
pp. 1658–1659.
[147] Mitchell A Yakrus and Robert C Good. “Geographic distribution, frequency, and
specimen source of Mycobacterium avium complex serotypes isolated from patients
with acquired immunodeficiency syndrome.” In: Journal of clinical microbiology
28.5 (1990), pp. 926–929.
[148] Aaron CE Darling, Bob Mau, Frederick R Blattner, and Nicole T Perna. “Mauve:
multiple alignment of conserved genomic sequence with rearrangements”. In: Genome
research 14.7 (2004), pp. 1394–1403.
[149] Aaron E Darling, Bob Mau, and Nicole T Perna. “progressiveMauve: multiple
genome alignment with gene gain, loss and rearrangement”. In: PloS one 5.6 (2010),
e11147.
[150] M Lechner. “Detection of orthologs in large-scale analysis”. Masters thesis. Univer-
sity of Leipzig, 2009.
[151] Marcus Lechner, Sven Findeiß, Lydia Steiner, Manja Marz, Peter F Stadler, and
Sonja J Prohaska. “Proteinortho: Detection of (Co-) orthologs in large-scale anal-
ysis”. In: BMC Bioinformatics 12.1 (2011), p. 124.
[152] Kazutaka Katoh, Kazuharu Misawa, Kei-ichi Kuma, and Takashi Miyata. “MAFFT:
a novel method for rapid multiple sequence alignment based on fast Fourier trans-
form”. In: Nucleic acids research 30.14 (2002), pp. 3059–3066.
[153] Paul P Gardner, Jennifer Daub, John G Tate, Eric P Nawrocki, Diana L Kolbe,
Stinus Lindgreen, Adam C Wilkinson, Robert D Finn, Sam Griffiths-Jones, Sean
R Eddy, et al. “Rfam: updates to the RNA families database”. In: Nucleic acids
research 37.suppl 1 (2009), pp. D136–D140.
[154] Wade Winkler, Ali Nahvi, and Ronald R Breaker. “Thiamine derivatives bind mes-
senger RNAs directly to regulate bacterial gene expression”. In: Nature 419.6910
(2002), pp. 952–956.
[155] Zasha Weinberg, Jeffrey E Barrick, Zizhen Yao, Adam Roth, Jane N Kim, Jeremy
Gore, Joy Xin Wang, Elaine R Lee, Kirsten F Block, Narasimhan Sudarsan, et al.
“Identification of 22 candidate structured RNAs in bacteria using the CMfinder
comparative genomics pipeline”. In: Nucleic acids research 35.14 (2007), pp. 4809–
4819.
[156] Zasha Weinberg, Joy X Wang, Jarrod Bogue, Jingying Yang, Keith Corbino, Ryan
H Moy, Ronald R Breaker, et al. “Comparative genomics reveals 104 candidate
structured RNAs from bacteria, archaea, and their metagenomes”. In: Genome Biol
11.3 (2010), R31.
[157] Jeffrey E Barrick, Keith A Corbino, Wade C Winkler, Ali Nahvi, Maumita Mandal,
Jennifer Collins, Mark Lee, Adam Roth, Narasimhan Sudarsan, Inbal Jona, et al.
“New RNA motifs suggest an expanded scope for riboswitches in bacterial genetic
control”. In: Proceedings of the National Academy of Sciences of the United States
of America 101.17 (2004), pp. 6421–6426.
[158] Dilmurat Yusuf, Manja Marz, Peter Stadler, and Ivo Hofacker. “Bcheck: a wrapper
tool for detecting RNase P RNA genes”. In: BMC genomics 11.1 (2010), p. 432.
[159] Karin Lagesen, Peter Hallin, Einar Andreas Rødland, Hans-Henrik Stærfeldt, Tor-
bjørn Rognes, and David W Ussery. “RNAmmer: consistent and rapid annotation
of ribosomal RNA genes”. In: Nucleic acids research 35.9 (2007), pp. 3100–3108.
[160] Todd M Lowe and Sean R Eddy. “tRNAscan-SE: a program for improved detection
of transfer RNA genes in genomic sequence”. In: Nucleic acids research 25.5 (1997),
pp. 955–964.
[161] Sam Griffiths-Jones. “RALEE—RNA ALignment editor in Emacs”. In: Bioinfor-
matics 21.2 (2005), pp. 257–259.
[162] Alexandros Stamatakis. “RAxML version 8: a tool for phylogenetic analysis and
post-analysis of large phylogenies.” In: Bioinformatics 30.9 (May 2014), pp. 1312–1313.
doi: 10.1093/bioinformatics/btu033.
[163] Thomas Junier and Evgeny M Zdobnov. “The Newick utilities: high-throughput
phylogenetic tree processing in the UNIX shell.” In: Bioinformatics 26.13 (July
2010), pp. 1669–1670. doi: 10.1093/bioinformatics/btq243.
[164] Makeda Semret, David C Alexander, Christine Y Turenne, Petra de Haas, Pieter
Overduin, Dick van Soolingen, Debby Cousins, and Marcel A Behr. “Genomic poly-
morphisms for Mycobacterium avium subsp. paratuberculosis diagnostics”. In: Jour-
nal of clinical microbiology 43.8 (2005), pp. 3704–3712.
[165] J Shine and L Dalgarno. “The 3’-terminal sequence of Escherichia coli 16S ribo-
somal RNA: complementarity to nonsense triplets and ribosome binding sites”. In:
Proceedings of the National Academy of Sciences 71.4 (1974), pp. 1342–1346.
[166] JW Dale et al. “Mobile genetic elements in Mycobacteria”. In: The European respi-
ratory journal. Supplement 20 (1995), 633s–648s.
[167] Srinand Sreevatsan, XI Pan, Kathryn E Stockbauer, Nancy D Connell, Barry N
Kreiswirth, Thomas S Whittam, and James M Musser. “Restricted structural gene
polymorphism in the Mycobacterium tuberculosis complex indicates evolutionarily
recent global dissemination”. In: Proceedings of the National Academy of Sciences
94.18 (1997), pp. 9869–9874.
[168] Laura Rindi and Carlo Garzelli. “Genetic diversity and phylogeny of Mycobacterium
avium”. In: Infection, Genetics and Evolution 21 (2014), pp. 375–383.
[169] EP Green, MLV Tizard, MT Moss, J Thompson, DJ Winterbourne, JJ McFadden,
and J Hermon-Taylor. “Sequence and characteristics of IS900, an insertion element
identified in a human Crohn’s disease isolate of Mycobacterium paratuberculosis”.
In: Nucleic Acids Research 17.22 (1989), pp. 9063–9073.
[170] Makeda Semret, Christine Y Turenne, and Marcel A Behr. “Insertion sequence
IS900 revisited”. In: Journal of clinical microbiology 44.3 (2006), pp. 1081–1083.
[171] Ingrid Olsen, Tone Bjordal Johansen, Helen Billman-Jacobe, Sigrun Fredsvold
Nilsen, and Berit Djønne. “A novel IS element, IS Mpa1, in Mycobacterium avium
subsp. paratuberculosis”. In: Veterinary microbiology 98.3 (2004), pp. 297–306.
[172] Kai Papenfort and Jörg Vogel. “Regulatory RNA in bacterial pathogens”. In: Cell
host & microbe 8.1 (2010), pp. 116–127.
[173] Paolo Miotto, Francesca Forti, Alessandro Ambrosi, Danilo Pellin, Diogo F Veiga,
Gabor Balazsi, Maria L Gennaro, Clelia Di Serio, Daniela Ghisotti, and Daniela M
Cirillo. “Genome-wide discovery of small RNAs in Mycobacterium tuberculosis”. In:
PloS one 7.12 (2012), e51950.
[174] J Hindley. “Fractionation of 32P-labelled ribonucleic acids on polyacrylamide gels
and their characterization by fingerprinting”. In: J Mol Biol 30.1 (1967), pp. 125–
136.
[175] G G Brownlee. “Sequence of 6S RNA of E. coli”. In: Nat New Biol 229.5 (1971),
pp. 147–149.
[176] A T Cavanagh, A D Klocko, X Liu, and K M Wassarman. “Promoter specificity for
6S RNA regulation of transcription is determined by core promoter sequences and
competition for region 4.2 of sigma70”. In: Mol Microbiol 67.6 (2008), pp. 1242–
1256. doi: 10.1111/j.1365-2958.2008.06117.x.
[177] N Gildehaus, T Neusser, R Wurm, and R Wagner. “Studies on the function of the
riboregulator 6S RNA from E. coli: RNA polymerase binding, inhibition of in vitro
transcription and synthesis of RNA-directed de novo transcripts”. In: Nucleic Acids
Res 35.6 (2007), pp. 1885–1896. doi: 10.1093/nar/gkm085.
[178] A E Trotochaud and K M Wassarman. “6S RNA function enhances long-term cell
survival”. In: J Bacteriol 186.15 (2004), pp. 4978–4985.
doi: 10.1128/JB.186.15.4978-4985.2004.
[179] A E Trotochaud and K M Wassarman. “6S RNA regulation of pspF transcription
leads to altered cell survival at high pH”. In: J Bacteriol 188.11 (2006), pp. 3936–
3943. doi: 10.1128/JB.00079-06.
[180] Stefanie Wehner, Katrin Damm, Roland K Hartmann, and Manja Marz. “Dissemi-
nation of 6S RNA among Bacteria”. In: RNA biology 11.11 (2014), pp. 1467–1478.
[181] David Alland, David W Lacher, Manzour Hernando Hazbón, Alifiya S Motiwala,
Weihong Qi, Robert D Fleischmann, and Thomas S Whittam. “Role of large se-
quence polymorphisms (LSPs) in generating genomic diversity among clinical iso-
lates of Mycobacterium tuberculosis and the utility of LSPs in phylogenetic analy-
sis”. In: Journal of clinical microbiology 45.1 (2007), pp. 39–46.
[182] Torsten M Eckstein, John T Belisle, and Julia M Inamine. “Proposed pathway
for the biosynthesis of serovar-specific glycopeptidolipids in Mycobacterium avium
serovar 2”. In: Microbiology 149.10 (2003), pp. 2797–2807.
[183] Elzbieta Krzywinska and Jeffrey S Schorey. “Characterization of genetic differences
between Mycobacterium avium subsp. avium strains of diverse virulence with a
focus on the glycopeptidolipid biosynthesis cluster”. In: Veterinary microbiology
91.2 (2003), pp. 249–264.
[184] IB Marsh and RJ Whittington. “Deletion of an mmpL gene and multiple associated
genes from the genome of the S strain of Mycobacterium avium subsp. paratuber-
culosis identified by representational difference analysis and in silico analysis”. In:
Molecular and cellular probes 19.6 (2005), pp. 371–384.
[185] Michael L Paustian, John P Bannantine, and V Kapur. “Paratuberculosis: Organ-
ism, Disease, Control”. In: CAB International, 2010. Chap. 8. Mycobacterium avium
subsp. paratuberculosis Genome. isbn: 9781845936136.
[186] Eugenie Dubnau, Patricia Fontán, Riccardo Manganelli, Sonia Soares-Appel, and
Issar Smith. “Mycobacterium tuberculosis genes induced during infection of human
macrophages”. In: Infection and immunity 70.6 (2002), pp. 2787–2795.
[187] Lalita Ramakrishnan, Nancy A Federspiel, and Stanley Falkow. “Granuloma-specific
expression of Mycobacterium virulence proteins from the glycine-rich PE-PGRS
family”. In: Science 288.5470 (2000), pp. 1436–1439.
[188] Yongjun Li, Elizabeth Miltner, Martin Wu, Mary Petrofsky, and Luiz E Bermudez.
“A Mycobacterium avium PPE gene is associated with the ability of the bacterium
to grow in macrophages and virulence in mice”. In: Cellular microbiology 7.4 (2005),
pp. 539–548.
[189] Michael L Paustian, John P Bannantine, Vivek Kapur, MA Behr, DM Collins, et
al. “Mycobacterium avium subsp. paratuberculosis genome”. In: Paratuberculosis:
organism, disease and control. CAB, Oxfordshire, UK (2010), pp. 73–81.
[190] Chen Tian and Xie Jian-ping. “Roles of PE_PGRS family in Mycobacterium tuber-
culosis pathogenesis and novel measures against tuberculosis”. In: Microbial patho-
genesis 49.6 (2010), pp. 311–314.
[191] Stewart T Cole. “Comparative and functional genomics of the Mycobacterium tu-
berculosis complex”. In: Microbiology 148.10 (2002), pp. 2919–2928.
[192] Michael J Brennan and Giovanni Delogu. “The PE multigene family: a ’molecular
mantra’ for mycobacteria”. In: Trends in microbiology 10.5 (2002), pp. 246–249.
[193] Giovanni Delogu, Maurizio Sanguinetti, Cinzia Pusceddu, Alessandra Bua, Michael
J Brennan, Stefania Zanetti, and Giovanni Fadda. “PE_PGRS proteins are differ-
entially expressed by Mycobacterium tuberculosis in host tissues”. In: Microbes and
infection 8.8 (2006), pp. 2061–2067.
[194] Nicolaas C Gey van Pittius, Samantha L Sampson, Hyeyoung Lee, Yeun Kim, Paul
D Van Helden, and Robin M Warren. “Evolution and expansion of the Mycobac-
terium tuberculosis PE and PPE multigene families and their association with the
duplication of the ESAT-6 (esx) gene cluster regions”. In: BMC evolutionary biology
6.1 (2006), p. 95.
[195] Pradeep Reddy Marri, John P Bannantine, Michael L Paustian, and G Brian Gold-
ing. “Lateral gene transfer in Mycobacterium avium subspecies paratuberculosis”.
In: Canadian journal of microbiology 52.6 (2006), pp. 560–569.
[196] ST Cole, R Brosch, J Parkhill, T Garnier, C Churcher, D Harris, SV Gordon, K
Eiglmeier, S Gas, CE Barry 3rd, et al. “Deciphering the biology of Mycobacterium
tuberculosis from the complete genome sequence”. In: Nature 393.6685 (1998),
pp. 537–544.
[197] Ruth Hershberg, Mikhail Lipatov, Peter M Small, Hadar Sheffer, Stefan Niemann,
Susanne Homolka, Jared C Roach, Kristin Kremer, Dmitri A Petrov, Marcus W
Feldman, et al. “High functional diversity in Mycobacterium tuberculosis driven by
genetic drift and human demography”. In: PLoS biology 6.12 (2008), e311.
[198] David Stucki and Sebastien Gagneux. “Single nucleotide polymorphisms in My-
cobacterium tuberculosis and the need for a curated database”. In: Tuberculosis
93.1 (2013), pp. 30–39.
[199] JS Sohal, SV Singh, PK Singh, and AV Singh. “On the evolution of ’Indian Bison
type’ strains of Mycobacterium avium subspecies paratuberculosis”. In: Microbio-
logical research 165.2 (2010), pp. 163–171.
[200] Christine Y Turenne, Makeda Semret, Debby V Cousins, Desmond M Collins, and
Marcel A Behr. “Sequencing of hsp65 distinguishes among subsets of the Mycobac-
terium avium complex”. In: Journal of clinical microbiology 44.2 (2006), pp. 433–
440.
[201] IB Marsh and RJ Whittington. “Genomic diversity in Mycobacterium avium: single
nucleotide polymorphisms between the S and C strains of M. avium subsp. paratu-
berculosis and with M. a. avium”. In: Molecular and cellular probes 21.1 (2007),
pp. 66–75.
[202] Jeffrey A Martin and Zhong Wang. “Next-generation transcriptome assembly”. In:
Nature Reviews Genetics 12.10 (2011), pp. 671–682.
[203] Brian J Haas, Michael C Zody, et al. “Advancing RNA-seq analysis”. In: Nature
biotechnology 28.5 (2010), p. 421.
[204] Leandro Lima, Blerina Sinaimeri, Gustavo Sacomoto, Helene Lopez-Maestre, Camille
Marchet, Vincent Miele, Marie-France Sagot, and Vincent Lacroix. “Playing hide
and seek with repeats in local and global de novo transcriptome assembly of short
RNA-Seq reads”. In: Algorithms for Molecular Biology 12.1 (2017), p. 2.
[205] Qiong-Yi Zhao, Yi Wang, Yi-Meng Kong, Da Luo, Xuan Li, and Pei Hao. “Optimiz-
ing de novo transcriptome assembly from short-read RNA-Seq data: a comparative
study”. In: BMC bioinformatics 12.14 (2011), S2.
[206] Sujai Kumar and Mark L Blaxter. “Comparing de novo assemblers for 454 tran-
scriptome data”. In: BMC genomics 11.1 (2010), p. 571.
[207] Ruiqiang Li, Chang Yu, Yingrui Li, Tak-Wah Lam, Siu-Ming Yiu, Karsten Kris-
tiansen, and Jun Wang. “SOAP2: an improved ultrafast tool for short read align-
ment”. In: Bioinformatics 25.15 (2009), pp. 1966–1967.
[208] BingXin Lu, ZhenBing Zeng, and Tieliu Shi. “Comparative study of de novo assem-
bly and genome-guided assembly strategies for transcriptome reconstruction based
on RNA-Seq”. In: Science China. Life sciences 56.2 (2013), p. 143.
[209] Kaitlin Clarke, Yi Yang, Ronald Marsh, LingLin Xie, and Ke K Zhang. “Compar-
ative analysis of de novo transcriptome assembly”. In: Science China. Life sciences
56.2 (2013), p. 156.
[210] Sufang Wang and Michael Gribskov. “Comprehensive evaluation of de novo tran-
scriptome assembly programs and their effects on differential gene expression anal-
ysis”. In: Bioinformatics (2016), btw625.
[211] Juntao Liu, Guojun Li, Zheng Chang, Ting Yu, Bingqiang Liu, Rick McMullen,
Pengyin Chen, and Xiuzhen Huang. “BinPacker: Packing-Based De Novo Tran-
scriptome Assembly from RNA-seq Data.” In: PLoS Comput Biol 12 (2 2016),
e1004772. issn: 1553-7358. doi: 10.1371/journal.pcbi.1004772.
[212] Zheng Chang, Guojun Li, Juntao Liu, Yu Zhang, Cody Ashby, Deli Liu, Carole L
Cramer, and Xiuzhen Huang. “Bridger: a new framework for de novo transcriptome
assembly using RNA-seq data.” In: Genome Biol 16 (2015), p. 30. issn: 1474-760X.
doi: 10.1186/s13059-015-0596-2.
[213] Yu Peng, Henry CM Leung, Siu-Ming Yiu, Ming-Ju Lv, Xin-Guang Zhu, and Fran-
cis YL Chin. “IDBA-tran: a more robust de novo de Bruijn graph assembler for
transcriptomes with uneven expression levels”. In: Bioinformatics 29.13 (2013),
pp. i326–i334.
[214] Zhaleh Safikhani, Mehdi Sadeghi, Hamid Pezeshk, and Changiz Eslahchi. “SSP:
An interval integer linear programming for de novo transcriptome assembly and
isoform discovery of RNA-seq reads”. In: Genomics 102.5 (2013), pp. 507–514.
[215] Sreeram Kannan, Joseph Hui, Kayvon Mazooji, Lior Pachter, and David Tse.
“Shannon: An Information-Optimal de novo RNA-Seq Assembler”. In: bioRxiv
(2016), p. 039230.
[216] Elena Bushmanova, Dmitry Antipov, Alla Lapidus, Vladimir Suvorov, and Andrey
D Prjibelski. “rnaQUAST: a quality assessment tool for de novo transcriptome
assemblies.” In: Bioinformatics 32 (14 2016), pp. 2210–2212. issn: 1367-4811. doi:
10.1093/bioinformatics/btw218.
[217] Daniel R. Zerbino and Ewan Birney. “Velvet: algorithms for de novo short read
assembly using de Bruijn graphs.” In: Genome Res 18.5 (May 2008), pp. 821–829.
doi: 10.1101/gr.074492.107.
[218] Maureen K Thomason, Thorsten Bischler, Sara K Eisenbart, Konrad U Förstner,
Aixia Zhang, Alexander Herbig, Kay Nieselt, Cynthia M Sharma, and Gisela Storz.
“Global transcriptional start site mapping using differential RNA sequencing reveals
novel antisense RNAs in Escherichia coli”. In: Journal of bacteriology 197.1 (2015),
pp. 18–28.
[219] Paul Flicek, M Ridwan Amode, Daniel Barrell, Kathryn Beal, Simon Brent, Denise
Carvalho-Silva, Peter Clapham, Guy Coates, Susan Fairley, Stephen Fitzgerald,
et al. “Ensembl 2012”. In: Nucleic acids research (2011), gkr991.
[220] MA Pfaller and DJ Diekema. “Rare and emerging opportunistic fungal pathogens:
concern for resistance beyond Candida albicans and Aspergillus fumigatus”. In:
Journal of clinical microbiology 42.10 (2004), pp. 4419–4431.
[221] Fabien Cottier, Alrina Shin Min Tan, Jinmiao Chen, Josephine Lum, Francesca
Zolezzi, Michael Poidinger, and Norman Pavelka. “The transcriptional stress
response of Candida albicans to weak organic acids”. In: G3: Genes|Genomes|Genetics
5.4 (2015), pp. 497–505.
[222] Zhibing Lai, Craig M Schluttenhofer, Ketaki Bhide, Jacob Shreve, Jyothi Thimma-
puram, Sang Yeol Lee, Dae-Jin Yun, and Tesfaye Mengiste. “MED18 interaction
with distinct transcription factors regulates multiple plant functions”. In: Nature
communications 5 (2014).
[223] Manfred G. Grabherr et al. “Full-length transcriptome assembly from RNA-Seq
data without a reference genome.” In: Nat Biotechnol 29.7 (July 2011), pp. 644–652.
doi: 10.1038/nbt.1883.
[224] H. Feldmann, H. D. Klenk, and A. Sanchez. “Molecular biology and evolution of
filoviruses”. In: Arch. Virol. Suppl. 7 (1993), pp. 81–100.
[225] Thasso Griebel, Benedikt Zacher, Paolo Ribeca, Emanuele Raineri, Vincent Lacroix,
Roderic Guigó, and Michael Sammeth. “Modelling and simulating generic RNA-
Seq experiments with the flux simulator.” In: Nucleic Acids Res 40 (20 2012),
pp. 10073–10083. issn: 1362-4962. doi: 10.1093/nar/gks666.
[226] Satshil B Rana, Frank J Zadlock IV, Ziping Zhang, Wyatt R Murphy, and Car-
olyn S Bentivegna. “Comparison of De Novo Transcriptome Assemblers and k-mer
Strategies Using the Killifish, Fundulus heteroclitus”. In: PloS one 11.4 (2016),
e0153104.
[227] Ratan Chopra, Gloria Burow, Andrew Farmer, Joann Mudge, Charles E Simpson,
and Mark D Burow. “Comparisons of de novo transcriptome assemblers in diploid
and polyploid species using peanut (Arachis spp.) RNA-seq data”. In: PloS one
9.12 (2014), e115055.
[228] Joanna Moreton, Stephen P Dunham, and Richard D Emes. “A consensus approach
to vertebrate de novo transcriptome assembly from RNA-seq data: assembly of
the duck (Anas platyrhynchos) transcriptome”. In: Frontiers in genetics 5 (2014),
p. 190.
[229] Richard Smith-Unna, Chris Boursnell, Rob Patro, Julian M Hibberd, and Steven
Kelly. “TransRate: reference-free quality assessment of de novo transcriptome as-
semblies”. In: Genome research 26.8 (2016), pp. 1134–1144.
[230] Bo Li, Nathanael Fillmore, Yongsheng Bai, Mike Collins, James A Thomson, Ron
Stewart, and Colin N Dewey. “Evaluation of de novo transcriptome assemblies from
RNA-Seq data”. In: Genome biology 15.12 (2014), p. 553.
[231] Felipe A Simão, Robert M Waterhouse, Panagiotis Ioannidis, Evgenia V Krivent-
seva, and Evgeny M Zdobnov. “BUSCO: assessing genome assembly and annotation
completeness with single-copy orthologs”. In: Bioinformatics (2015), btv351.
[232] Shu Chen, J Scott McElroy, Fenny Dane, and Leslie R Goertzen. “Transcriptome
Assembly and Comparison of an Allotetraploid Weed Species, Annual Bluegrass,
with its Two Diploid Progenitor Species, Poa supina Schrad and Poa infirma Kunth”.
In: The Plant Genome 9.1 (2016).
[233] Sunetra Das, Natalie L Pitts, Megan R Mudron, David S Durica, and Donald L
Mykles. “Transcriptome analysis of the molting gland (Y-organ) from the blackback
land crab, Gecarcinus lateralis”. In: Comparative Biochemistry and Physiology Part
D: Genomics and Proteomics 17 (2016), pp. 26–40.
[234] Oliver Rupp, Jennifer Becker, Karina Brinkrolf, Christina Timmermann, Nicole
Borth, Alfred Pühler, Thomas Noll, and Alexander Goesmann. “Construction of
a public CHO cell line transcript database using versatile bioinformatics analysis
pipelines”. In: PloS one 9.1 (2014), e85568.
[235] Laura S Robertson and Robert S Cornman. “Transcriptome resources for the frogs
Lithobates clamitans and Pseudacris regilla, emphasizing antimicrobial peptides
and conserved loci for phylogenetics”. In: Molecular ecology resources 14.1 (2014),
pp. 178–183.
[236] Limin Fu, Beifang Niu, Zhengwei Zhu, Sitao Wu, and Weizhong Li. “CD-HIT:
accelerated for clustering the next-generation sequencing data”. In: Bioinformatics
28.23 (2012), pp. 3150–3152.
[237] Weizhong Li, Lukasz Jaroszewski, and Adam Godzik. “Clustering of highly homol-
ogous sequences to reduce the size of large protein databases”. In: Bioinformatics
17.3 (2001), pp. 282–283.
[238] Weizhong Li, Lukasz Jaroszewski, and Adam Godzik. “Tolerating some redundancy
significantly speeds up clustering of large protein databases”. In: Bioinformatics
18.1 (2002), pp. 77–82.
[239] Julia Bräuer. “Differential effects of vitamins A and D in human monocytes dur-
ing infection – gene expression profiling using RNA sequencing”. MA thesis. Jena:
Friedrich Schiller University Jena, 2016.
[240] Z Al Tanoury, A Piskunov, and C Rochette-Egly. “Vitamin A and retinoid sig-
naling: genomic and nongenomic effects thematic review series: Fat-soluble vita-
mins: Vitamin A.” In: Journal of lipid research 54 (2013), pp. 1761–1775. doi:
10.1194/jlr.r030833.
[241] S Christakos, P Dhawan, A Verstuyf, L Verlinden, and G Carmeliet. “Vitamin D:
Metabolism, molecular mechanism of action, and pleiotropic effects.” In: Physiol
Rev 96 (2016), pp. 365–408. doi: 10.1152/physrev.00014.2015.
[242] CM Bunce, G Brown, and M Hewison. “Vitamin D and hematopoiesis.” In: Trends
in Endocrinology & Metabolism 8 (1997), pp. 245–251.
doi: 10.1016/s1043-2760(97)00066-0.
[243] M Hewison. “Vitamin D and the immune system: new perspectives on an old
theme.” In: Endocrinology and metabolism clinics of North America 39 (2010),
pp. 365–379. doi: 10.1016/j.ecl.2010.02.010.
[244] S Christakos, P Dhawan, Y Liu, X Peng, and A Porta. “New insights into the
mechanisms of vitamin D action.” In: Journal of cellular biochemistry 88 (2003),
pp. 695–705. doi: 10.1002/jcb.10423.
[245] M Clagett-Dame and D Knutson. “Vitamin A in reproduction and development.”
In: Nutrients 3 (2011), pp. 385–428. doi: 10.3390/nu3040385.
[246] M Mark, NB Ghyselinck, and P Chambon. “Function of retinoic acid receptors
during embryonic development.” In: Nucl Recept Signal 7 (2009), e002.
doi: 10.1621/nrs.07002.
[247] A Sommer and KS Vyas. “A global clinical view on vitamin A and carotenoids.”
In: The American journal of clinical nutrition 96 (2012), 1204S–1206S.
doi: 10.3945/ajcn.112.034868.
[248] R Bouillon and T Suda. “Vitamin D: calcium and bone homeostasis during evolu-
tion.” In: BoneKEy reports 3 (2014). doi: 10.1038/bonekey.2013.214.
[249] JA Hall, JR Grainger, SP Spencer, and Y Belkaid. “The role of retinoic acid in
tolerance and immunity.” In: Immunity 35 (2011), pp. 13–22.
doi: 10.1016/j.immuni.2011.07.002.
[250] B Prietl, G Treiber, TR Pieber, and K Amrein. “Vitamin D and immune function.”
In: Nutrients 5 (2013), pp. 2502–2521. doi: 10.3390/nu5072502.
[251] P Glasziou and D Mackerras. “Vitamin A supplementation in infectious diseases:
a meta-analysis.” In: BMJ 306 (1993), pp. 366–370.
doi: 10.1136/bmj.306.6874.366.
[252] WB Grant. “Variations in vitamin D production could possibly explain the
seasonality of childhood respiratory infections in Hawaii.” In: The Pediatric infectious
disease journal 27 (2008), p. 853.
[253] PA Danai, S Sinha, M Moss, MJ Haber, and GS Martin. “Seasonal variation in
the epidemiology of sepsis.” In: Critical care medicine 35 (2007), pp. 410–415. doi:
10.1097/01.ccm.0000253405.17038.43.
[254] E Villamor and WW Fawzi. “Vitamin A supplementation: implications for mor-
bidity and mortality in children.” In: Journal of Infectious Diseases 182 (2000),
S122–S133. doi: 10.1086/315921.
[255] R Semba. “Vitamin A and immunity to viral, bacterial and protozoan infections.”
In: Proceedings of the Nutrition Society 58 (1999), pp. 719–727.
doi: 10.1017/s0029665199000944.
[256] W Waters, M Palmer, B Nonnecke, D Whipple, and R Horst. “Mycobacterium bovis
infection of vitamin D-deficient nos2-/- mice.” In: Microbial pathogenesis 36 (2004),
pp. 11–17. doi: 10.1016/j.micpath.2003.08.008.
[257] Joan Hui Juan Lim, Sharada Ravikumar, Yan-Ming Wang, Thomas Paulraj Tham-
boo, Lizhen Ong, Jinmiao Chen, Jessamine Geraldine Goh, Sen Hee Tay, Lufei
Chengchen, Mar Soe Win, et al. “Bimodal Influence of Vitamin D in Host Re-
sponse to Systemic Candida Infection—Vitamin D Dose Matters.” In: Journal of
Infectious Diseases (2015), jiv033.
[258] Alexandra Yamshchikov, Nirali Desai, Henry Blumberg, Thomas Ziegler, and Vin
Tangpricha. “Vitamin D for treatment and prevention of infectious diseases: a sys-
tematic review of randomized controlled trials.” In: Endocrine Practice 15.5 (2009),
pp. 438–449.
[259] Kacper A Wojtal, Lutz Wolfram, Isabelle Frey-Wagner, Silvia Lang, Michael Scharl,
Stephan R Vavricka, and Gerhard Rogler. “The effects of vitamin A on cells of
innate immunity in vitro”. In: Toxicology in Vitro 27.5 (2013), pp. 1525–1532.
[260] Tilman E Klassert, Anja Hanisch, Julia Bräuer, Esther Klaile, Kerstin A Heyl,
Michael M Mansour, Jenny M Tam, Jatin M Vyas, and Hortense Slevogt. “Modula-
tory role of vitamin A on the Candida albicans-induced immune response in human
monocytes”. In: Medical microbiology and immunology 203.6 (2014), pp. 415–424.
[261] Adrian F Gombart. “The vitamin D–antimicrobial peptide pathway and its role in
protection against infection”. In: Future microbiology 4.9 (2009), pp. 1151–1165.
[262] Yong Zhang, Donald YM Leung, Brittany N Richers, Yusen Liu, Linda K Remigio,
David W Riches, and Elena Goleva. “Vitamin D inhibits monocyte/macrophage
proinflammatory cytokine production by targeting MAPK phosphatase-1”. In: The
Journal of Immunology 188.5 (2012), pp. 2127–2135.
[263] Ai-Leng Khoo, Louis YA Chai, Hans JPM Koenen, Bart-Jan Kullberg, Irma Joosten,
André JAM van der Ven, and Mihai G Netea. “1, 25-dihydroxyvitamin D3 modu-
lates cytokine production induced by Candida albicans: impact of seasonal variation
of immune responses”. In: Journal of Infectious Diseases 203.1 (2011), pp. 122–130.
[264] Paul Oeth, Jin Yao, Sao-Tah Fan, and Nigel Mackman. “Retinoic acid selectively
inhibits lipopolysaccharide induction of tissue factor gene expression in human
monocytes”. In: Blood 91.8 (1998), pp. 2857–2865.
[265] Florian B Mayr, Sachin Yende, and Derek C Angus. “Epidemiology of severe sepsis”.
In: Virulence 5.1 (2014), pp. 4–11.
[266] Natalya V Serbina, Ting Jia, Tobias M Hohl, and Eric G Pamer. “Monocyte-
mediated defense against microbial pathogens”. In: Annu. Rev. Immunol. 26 (2008),
pp. 421–452.
[267] S Andrews, F Krueger, A Segonds-Pichon, F Biggins, and S Wingett. “FastQC:
a quality control tool for high throughput sequence data.” In: Cambridge, UK:
Babraham Institute (2014). doi: 10.1093/bioinformatics/btw627.
[268] Robert Schmieder and Robert Edwards. “Quality control and preprocessing of
metagenomic datasets”. In: Bioinformatics 27.6 (2011), pp. 863–864.
[269] Michael I Love, Wolfgang Huber, and Simon Anders. “Moderated estimation of fold
change and dispersion for RNA-seq data with DESeq2”. In: Genome biology 15.12
(2014), p. 550.
[270] Simon Anders, Davis J McCarthy, Yunshun Chen, Michal Okoniewski, Gordon
K Smyth, Wolfgang Huber, and Mark D Robinson. “Count-based differential ex-
pression analysis of RNA sequencing data using R and Bioconductor”. In: Nature
protocols 8.9 (2013), pp. 1765–1786.
[271] Chris Fraley and Adrian E Raftery. “Model-based clustering, discriminant analysis,
and density estimation”. In: Journal of the American statistical Association 97.458
(2002), pp. 611–631.
[272] Chris Fraley, AE Raftery, and L Scrucca. “Normal mixture modeling for model-
based clustering, classification, and density estimation”. In: Department of Statis-
tics, University of Washington 23 (2012), p. 2012.
[273] Paul D Thomas, Michael J Campbell, Anish Kejariwal, Huaiyu Mi, Brian Karlak,
Robin Daverman, Karen Diemer, Anushya Muruganujan, and Apurva Narechania.
“PANTHER: a library of protein families and subfamilies indexed by function”. In:
Genome research 13.9 (2003), pp. 2129–2141.
[274] Paul D Thomas, Anish Kejariwal, Michael J Campbell, Huaiyu Mi, Karen Diemer,
Nan Guo, Istvan Ladunga, Betty Ulitsky-Lazareva, Anushya Muruganujan, Steven
Rabkin, et al. “PANTHER: a browsable database of gene products organized by
biological function, using curated protein family and subfamily classification”. In:
Nucleic acids research 31.1 (2003), pp. 334–341.
[275] Damian Szklarczyk, Andrea Franceschini, Stefan Wyder, Kristoffer Forslund, Da-
vide Heller, Jaime Huerta-Cepas, Milan Simonovic, Alexander Roth, Alberto San-
tos, Kalliopi P Tsafou, et al. “STRING v10: protein–protein interaction networks,
integrated over the tree of life”. In: Nucleic acids research (2014), gku1003.
[276] Nicolas Alcaraz, Josch Pauling, Richa Batra, Eudes Barbosa, Alexander Junge,
Anne GL Christensen, Vasco Azevedo, Henrik J Ditzel, and Jan Baumbach. “Key-
PathwayMiner 4.0: condition-specific pathway analysis by combining multiple omics
studies and networks with Cytoscape”. In: BMC systems biology 8.1 (2014), p. 99.
[277] Nicolas Alcaraz, Markus List, Martin Dissing-Hansen, Marc Rehmsmeier, Qihua
Tan, Jan Mollenhauer, Henrik J Ditzel, and Jan Baumbach. “Robust de novo path-
way enrichment with KeyPathwayMiner 5”. In: F1000Research 5 (2016).
[278] Michael Zuker. “Mfold web server for nucleic acid folding and hybridization predic-
tion”. In: Nucleic acids research 31.13 (2003), pp. 3406–3415.
[279] Michael W Pfaffl. “A new mathematical model for relative quantification in real-
time RT–PCR”. In: Nucleic acids research 29.9 (2001), e45–e45.
[280] Ivo Rieu and Stephen J Powers. “Real-time quantitative RT-PCR: design, calcula-
tions, and statistics”. In: The Plant Cell 21.4 (2009), pp. 1031–1033.
[281] Michael W Pfaffl, Ales Tichopad, Christian Prgomet, and Tanja P Neuvians. “De-
termination of stable housekeeping genes, differentially regulated target genes and
sample integrity: BestKeeper–Excel-based tool using pair-wise correlations”. In:
Biotechnology letters 26.6 (2004), pp. 509–515.
[282] U Ligges and M Mächler. “Scatterplot3d – an R package for visualizing multivariate
data”. In: Journal of Statistical Software 8 (2003), pp. 1–20.
doi: 10.18637/jss.v008.i11.
[283] Christopher K Glass and Kaoru Saijo. “Nuclear receptor transrepression pathways
that regulate inflammation in macrophages and T cells”. In: Nature Reviews Im-
munology 10.5 (2010), pp. 365–376.
[284] H Israel, C Odziemiec, and M Ballow. “The effects of retinoic acid on immunoglob-
ulin synthesis by human cord blood mononuclear cells”. In: Clinical immunology
and immunopathology 59.3 (1991), pp. 417–425.
[285] Harry D Dawson, Gary Collins, Robert Pyle, Michael Key, Ashani Weeraratna,
Vishwa Deep-Dixit, Celeste N Nadal, and Dennis D Taub. “Direct and indirect
effects of retinoic acid on human Th2 cytokine and chemokine expression by human
T lymphocytes”. In: BMC immunology 7.1 (2006), p. 27.
[286] Yu-Chien Tsai, Hui-Wen Chang, Tai-Tsung Chang, Min-Sheng Lee, Yu-Te Chu,
and Chih-Hsing Hung. “Effects of all-trans retinoic acid on Th1-and Th2-related
chemokines production in monocytes”. In: Inflammation 31.6 (2008), pp. 428–433.
[287] Wibke Schulte, Jürgen Bernhagen, and Richard Bucala. “Cytokines in sepsis: po-
tent immunoregulators and potential therapeutic targets–an updated view”. In:
Mediators of inflammation 2013 (2013).
[288] EJ Giamarellos-Bourboulis. “Clarithromycin: A Promising Immunomodulator in
Sepsis”. In: (2009), pp. 111–118.
[289] Anastasia Antonopoulou and Evangelos J Giamarellos-Bourboulis. “Immunomo-
dulation in sepsis: state of the art and future perspective”. In: Immunotherapy 3.1
(2011), pp. 117–128.
[290] Sanne P Smeekens, Aylwin Ng, Vinod Kumar, Melissa D Johnson, Theo S Plantinga,
Cleo Van Diemen, Peer Arts, Eugène TP Verwiel, Mark S Gresnigt, Karin Fransen,
et al. “Functional genomics identifies type I interferon pathway as central for host
defense against Candida albicans”. In: Nature communications 4 (2013), p. 1342.
[291] Olivia Majer, Christelle Bourgeois, Florian Zwolanek, Caroline Lassnig, Dontscho
Kerjaschki, Matthias Mack, Mathias Müller, and Karl Kuchler. “Type I interfer-
ons promote fatal immunopathology by regulating inflammatory monocytes and
neutrophils during Candida infections”. In: PLoS Pathog 8.7 (2012), e1002811.
[292] Juan José Muñoz, Céline Tárrega, Carmen Blanco-Aparicio, and Rafael Pulido.
“Differential interaction of the tyrosine phosphatases PTP-SL, STEP and HePTP
with the mitogen-activated protein kinases ERK1/2 and p38alpha is determined by
a kinase specificity sequence and influenced by reducing agents”. In: Biochemical
Journal 372.1 (2003), pp. 193–201.
[293] Kate L Jeffrey, Montserrat Camps, Christian Rommel, and Charles R Mackay.
“Targeting dual-specificity phosphatases: manipulating MAP kinase signalling and
immune responses”. In: Nature reviews Drug discovery 6.5 (2007), pp. 391–403.
[294] Katie J Anderson and Rachel L Allen. “Regulation of T-cell immunity by leuco-
cyte immunoglobulin-like receptors: innate immune receptors for self on antigen-
presenting cells”. In: Immunology 127.1 (2009), pp. 8–17.
[295] Francisco Borrego. “The CD300 molecules: an emerging family of regulators of the
immune system”. In: Blood 121.11 (2013), pp. 1951–1960.
[296] Elisabeth Esteban, Ricard Ferrer, Laia Alsina, and Antonio Artigas. “Immunomod-
ulation in sepsis: the role of endotoxin removal by polymyxin B-immobilized car-
tridge”. In: Mediators of inflammation 2013 (2013).
[297] C Ribeiro Nogueira, A Ramalho, E Lameu, CADS Franca, C David, and E Acciolly.
“Serum concentrations of vitamin A and oxidative stress in critically ill patients
with sepsis”. In: Nutr Hosp 24.3 (2009), pp. 312–7.
[324] Aaron R. Quinlan and Ira M. Hall. “BEDTools: a flexible suite of utilities for
comparing genomic features.” eng. In: Bioinformatics 26.6 (Mar. 2010), pp. 841–
842. doi: 10.1093/bioinformatics/btq033.
[325] Helga Thorvaldsdóttir, James T. Robinson, and Jill P. Mesirov. “Integrative
Genomics Viewer (IGV): high-performance genomics data visualization and
exploration.” eng. In: Brief Bioinform 14.2 (Mar. 2013), pp. 178–192.
doi: 10.1093/bib/bbs017.
[326] M. Kanehisa and S. Goto. “KEGG: kyoto encyclopedia of genes and genomes.” eng.
In: Nucleic Acids Res 28.1 (Jan. 2000), pp. 27–30.
[327] Yoav Benjamini and Yosef Hochberg. “Controlling the False Discovery Rate: A
Practical and Powerful Approach to Multiple Testing”. In: Journal of the Royal Sta-
tistical Society. Series B (Methodological) 57.1 (1995), pp. 289–300. issn: 00359246.
doi: 10.2307/2346101.
[328] Piotr J Balwierz, Mikhail Pachkov, Phil Arnold, Andreas J Gruber, Mihaela Za-
volan, and Erik van Nimwegen. “ISMARA: automated modeling of genomic sig-
nals as a democracy of regulatory motifs.” In: Genome research 24.5 (Mar. 2014),
pp. 869–84. issn: 1549-5469. doi: 10.1101/gr.169508.113.
[329] Huaiyu Mi, Sagar Poudel, Anushya Muruganujan, John T Casagrande, and Paul
D Thomas. “PANTHER version 10: expanded protein families and functions, and
analysis tools”. In: Nucleic acids research 44.D1 (2016), pp. D336–D342.
[330] J. de Wilde, J. De-Castro Arce, P. J. Snijders, C. J. Meijer, F. Rosl, and R. D.
Steenbergen. “Alterations in AP-1 and AP-1 regulatory genes during HPV-induced
carcinogenesis”. In: Cell. Oncol. 30.1 (2008), pp. 77–87.
[331] B. Varshney and S. K. Lal. “SARS-CoV accessory protein 3b induces AP-1 tran-
scriptional activity through activation of JNK and ERK pathways”. In: Biochem-
istry 50.24 (June 2011), pp. 5419–5425.
[332] T. Kuri, X. Zhang, M. Habjan, L. Martinez-Sobrido, A. Garcia-Sastre, Z. Yuan,
and F. Weber. “Interferon priming enables cells to partially overturn the SARS
coronavirus-induced block in innate immune activation”. In: J. Gen. Virol. 90.Pt
11 (Nov. 2009), pp. 2686–2694.
[333] W. B. Cardenas, Y. M. Loo, M. Gale, A. L. Hartman, C. R. Kimberlin, L. Martinez-
Sobrido, E. O. Saphire, and C. F. Basler. “Ebola virus VP35 protein binds double-
stranded RNA and inhibits alpha/beta interferon production induced by RIG-I
signaling”. In: J. Virol. 80.11 (June 2006), pp. 5168–5178.
[334] P. Ramanan, R. S. Shabman, C. S. Brown, G. K. Amarasinghe, C. F. Basler, and
D. W. Leung. “Filoviral immune evasion mechanisms”. In: Viruses 3.9 (Sept. 2011),
pp. 1634–1649.
[335] P. Luthra, P. Ramanan, C. E. Mire, C. Weisend, Y. Tsuda, B. Yen, G. Liu, D. W.
Leung, T. W. Geisbert, H. Ebihara, G. K. Amarasinghe, and C. F. Basler. “Mutual
antagonism between the Ebola virus VP35 protein and the RIG-I activator PACT
determines infection outcome”. In: Cell Host Microbe 14.1 (July 2013), pp. 74–84.
[336] M. L. Schmitz, M. Kracht, and V. V. Saul. “The intricate interplay between RNA
viruses and NF-κB”. In: Biochim. Biophys. Acta 1843.11 (Nov. 2014), pp. 2754–
2764.
[337] S. Reikine, J. B. Nguyen, and Y. Modis. “Pattern Recognition and Signaling Mech-
anisms of RIG-I and MDA5”. In: Front Immunol 5 (2014), p. 342.
[338] M. U. Gack, Y. C. Shin, C. H. Joo, T. Urano, C. Liang, L. Sun, O. Takeuchi,
S. Akira, Z. Chen, S. Inoue, and J. U. Jung. “TRIM25 RING-finger E3 ubiquitin
ligase is essential for RIG-I-mediated antiviral activity”. In: Nature 446.7138 (Apr.
2007), pp. 916–920.
[339] K. Ozato, D. M. Shin, T. H. Chang, and H. C. Morse. “TRIM family proteins and
their emerging roles in innate immunity”. In: Nat. Rev. Immunol. 8.11 (Nov. 2008),
pp. 849–860.
[340] Judith Olejnik, Jesus Alonso, Kristina M. Schmidt, Zhen Yan, Wei Wang, Andrea
Marzi, Hideki Ebihara, Jinghua Yang, Jean L. Patterson, Elena Ryabchikova, and
Elke Mühlberger. “Ebola virus does not block apoptotic signaling pathways.” eng.
In: J Virol 87.10 (May 2013), pp. 5384–5396. doi: 10.1128/JVI.01461-12.
[341] J. I. Jun and L. F. Lau. “Taking aim at the extracellular matrix: CCN proteins as
emerging therapeutic targets”. In: Nat Rev Drug Discov 10.12 (Dec. 2011), pp. 945–
963.
[342] Chiho Goda, Taisuke Kanaji, Sachiko Kanaji, Go Tanaka, Kazuhiko Arima, Shigeaki
Ohno, and Kenji Izuhara. “Involvement of IL-32 in activation-induced cell death in
T cells”. In: International immunology 18.2 (2006), pp. 233–240.
[343] N. Wauquier, P. Becquart, C. Padilla, S. Baize, and E. M. Leroy. “Human fatal
Zaire Ebola virus infection is associated with an aberrant innate immunity and with
massive lymphocyte apoptosis”. In: PLoS Negl Trop Dis 4.10 (2010).
[344] Wei Li, Yan Liu, Muhammad Mahmood Mukhtar, Rui Gong, Ying Pan, Sahibzada
T Rasool, Yecheng Gao, Lei Kang, Qian Hao, Guiqing Peng, et al. “Activation of
interleukin-32 pro-inflammatory pathway in response to influenza A virus infection”.
In: PLoS One 3.4 (2008), e1985.
[345] W. Li, W. Sun, L. Liu, F. Yang, Y. Li, Y. Chen, J. Fang, W. Zhang, J. Wu, and Y.
Zhu. “IL-32: a host proinflammatory factor against influenza viral replication is up-
regulated by aberrant epigenetic modifications during influenza A virus infection”.
In: J. Immunol. 185.9 (Nov. 2010), pp. 5056–5065.
[346] J. Ouyang, X. Zhu, Y. Chen, H. Wei, Q. Chen, X. Chi, B. Qi, L. Zhang, Y. Zhao,
G. F. Gao, G. Wang, and J. L. Chen. “NRAV, a long noncoding RNA, modu-
lates antiviral responses through suppression of interferon-stimulated gene tran-
scription”. In: Cell Host Microbe 16.5 (Nov. 2014), pp. 616–626.
[347] Christopher F Basler, Xiuyan Wang, Elke Mühlberger, Victor Volchkov, Jason
Paragas, Hans-Dieter Klenk, Adolfo Garcia-Sastre, and Peter Palese. “The Ebola
virus VP35 protein functions as a type I IFN antagonist”. In: Proceedings of the
National Academy of Sciences 97.22 (2000), pp. 12289–12294.
[348] S. P. Reid, L. W. Leung, A. L. Hartman, O. Martinez, M. L. Shaw, C. Carbonnelle,
V. E. Volchkov, S. T. Nichol, and C. F. Basler. “Ebola virus VP24 binds karyopherin
alpha1 and blocks STAT1 nuclear accumulation”. In: J. Virol. 80.11 (June 2006),
pp. 5156–5167.
[349] Christopher F. Basler and Gaya K. Amarasinghe. “Evasion of interferon responses
by Ebola and Marburg viruses.” eng. In: J Interferon Cytokine Res 29.9 (Sept.
2009), pp. 511–520. doi: 10.1089/jir.2009.0076.
[350] Michael Schümann, Thorsten Gantke, and Elke Mühlberger. “Ebola virus VP35
antagonizes PKR activity through its C-terminal interferon inhibitory domain.”
eng. In: J Virol 83.17 (Sept. 2009), pp. 8993–8997. doi: 10.1128/JVI.00523-09.
[351] M. Mateo, S. P. Reid, L. W. Leung, C. F. Basler, and V. E. Volchkov. “Ebolavirus
VP24 binding to karyopherins is required for inhibition of interferon signaling”. In:
J. Virol. 84.2 (Jan. 2010), pp. 1169–1175.
[352] R. Kubisch, L. Meissner, S. Krebs, H. Blum, M. Gunther, A. Roidl, and E. Wag-
ner. “A Comprehensive Gene Expression Analysis of Resistance Formation upon
Metronomic Cyclophosphamide Therapy”. In: Transl Oncol 6.1 (Feb. 2013), pp. 1–
9.
[353] T. Nomiyama, T. Nakamachi, F. Gizard, E. B. Heywood, K. L. Jones, N. Ohkura,
R. Kawamori, O. M. Conneely, and D. Bruemmer. “The NR4A orphan nuclear
receptor NOR1 is induced by platelet-derived growth factor and mediates vascular
smooth muscle cell proliferation”. In: J. Biol. Chem. 281.44 (Nov. 2006), pp. 33467–
33476.
[354] L. Jin, A. Williamson, S. Banerjee, I. Philipp, and M. Rape. “Mechanism of ubiquitin-
chain formation by the human anaphase-promoting complex”. In: Cell 133.4 (May
2008), pp. 653–665.
[355] J. Yao, L. Duan, M. Fan, J. Yuan, and X. Wu. “Overexpression of BLCAP induces
S phase arrest and apoptosis independent of p53 and NF-kappaB in human tongue
carcinoma: BLCAP overexpression induces S phase arrest and apoptosis”. In: Mol.
Cell. Biochem. 297.1-2 (Mar. 2007), pp. 81–92.
[356] A. S. Kondratowicz et al. “T-cell immunoglobulin and mucin domain 1 (TIM-1) is
a receptor for Zaire Ebolavirus and Lake Victoria Marburgvirus”. In: Proc. Natl.
Acad. Sci. U.S.A. 108.20 (May 2011), pp. 8426–8431.
[357] Laurent Meertens, Xavier Carnec, Manuel Perera Lecoin, Rasika Ramdasi, Florence
Guivel-Benhassine, Erin Lew, Greg Lemke, Olivier Schwartz, and Ali Amara. “The
TIM and TAM families of phosphatidylserine receptors mediate dengue virus entry”.
In: Cell host & microbe 12.4 (2012), pp. 544–557.
[358] Naveen L Pereira, Dong Lin, Linda Pelleymounter, Irene Moon, Gail Stilling, Bruce
W Eckloff, Eric D Wieben, Margaret M Redfield, John C Burnett, Vivien C Yee,
et al. “Natriuretic Peptide Receptor-3 Gene (NPR3) Nonsynonymous Polymor-
phism Results in Significant Reduction in Protein Expression Because of Acceler-
ated Degradation”. In: Circulation: Cardiovascular Genetics 6.2 (2013), pp. 201–
210.
[359] W. Xu et al. “Ebola virus VP24 targets a unique NLS binding site on karyopherin
alpha 5 to selectively compete with nuclear import of phosphorylated STAT1”. In:
Cell Host Microbe 16.2 (Aug. 2014), pp. 187–200.
[360] A. P. Zhang, Z. A. Bornholdt, T. Liu, D. M. Abelson, D. E. Lee, S. Li, V. L.
Woods, and E. O. Saphire. “The ebola virus interferon antagonist VP24 directly
binds STAT1 and has a novel, pyramidal fold”. In: PLoS Pathog. 8.2 (Feb. 2012),
e1002550.
[361] R. S. Shabman, E. E. Gulcicek, K. L. Stone, and C. F. Basler. “The Ebola virus
VP24 protein prevents hnRNP C1/C2 binding to karyopherin α1 and partially
alters its nuclear import”. In: J. Infect. Dis. 204 Suppl 3 (Nov. 2011), S904–910.
[376] Brian H Bird, Thomas G Ksiazek, Stuart T Nichol, and N James MacLachlan. “Rift
Valley fever virus”. In: Journal of the American Veterinary Medical Association
234.7 (2009), pp. 883–893.
[377] Rolf Muller, Jean-Francois Saluzzo, Nora Lopez, Thomas Dreier, Michael Turell,
Jonathan Smith, and Michele Bouloy. “Characterization of clone 13, a naturally
attenuated avirulent isolate of Rift Valley fever virus, which is altered in the small
segment”. In: The American journal of tropical medicine and hygiene 53.4 (1995),
pp. 405–411.
[378] GH Gerdes. “Rift Valley fever”. In: Revue scientifique et technique (International
Office of Epizootics) 23.2 (2004), pp. 613–623.
[379] Joseph J Vitti, Sharon R Grossman, and Pardis C Sabeti. “Detecting natural se-
lection in genomic data.” eng. In: Annu Rev Genet 47 (2013), pp. 97–120. doi:
10.1146/annurev-genet-111212-133526.
[380] Matteo Fumagalli, Manuela Sironi, Uberto Pozzoli, Anna Ferrer-Admetlla,
Linda Pattini, and Rasmus Nielsen. “Signatures of environmental genetic
adaptation pinpoint pathogens as the main selective pressure through human
evolution.” eng. In: PLoS Genet 7.11 (Nov. 2011), e1002355.
doi: 10.1371/journal.pgen.1002355.
[381] Daniel Shriner, David C Nickle, Mark A Jensen, and James I Mullins. “Poten-
tial impact of recombination on sitewise approaches for detecting positive natural
selection.” eng. In: Genet Res 81.2 (Apr. 2003), pp. 115–121.
[382] Wayne Delport, Art F Y Poon, Simon D W Frost, and Sergei L Kosakovsky Pond.
“Datamonkey 2010: a suite of phylogenetic analysis tools for evolutionary
biology.” eng. In: Bioinformatics 26.19 (Oct. 2010), pp. 2455–2457.
doi: 10.1093/bioinformatics/btq429.
[383] Adi Doron-Faigenboim, Adi Stern, Itay Mayrose, Eran Bacharach, and Tal Pupko.
“Selecton: a server for detecting evolutionary forces at a single amino-acid site.”
eng. In: Bioinformatics 21.9 (May 2005), pp. 2101–2103.
[384] Adi Stern, Adi Doron-Faigenboim, Elana Erez, Eric Martz, Eran Bacharach, and
Tal Pupko. “Selecton 2007: advanced models for detecting positive and purifying
selection using a Bayesian inference approach.” eng. In: Nucleic Acids Res 35.Web
Server issue (July 2007), W506–W511. doi: 10.1093/nar/gkm382.
[385] Fei Su, Hong-Yu Ou, Fei Tao, Hongzhi Tang, and Ping Xu. “PSP: rapid identifi-
cation of orthologous coding genes under positive selection across multiple closely
related prokaryotic genomes.” eng. In: BMC Genomics 14 (Dec. 2013), p. 924. doi:
10.1186/1471-2164-14-924.
[386] Federico Abascal, Rafael Zardoya, and Maximilian J Telford. “TranslatorX: multi-
ple alignment of nucleotide sequences guided by amino acid translations.” eng. In:
Nucleic Acids Res 38.Web Server issue (July 2010), W7–13.
doi: 10.1093/nar/gkq291.
[387] Robert C Edgar. “MUSCLE: multiple sequence alignment with high accuracy and
high throughput.” eng. In: Nucleic Acids Res 32.5 (2004), pp. 1792–1797. doi:
10.1093/nar/gkh340.
[388] D Posada and KA Crandall. “MODELTEST: testing the model of DNA substitu-
tion.” eng. In: Bioinformatics 14.9 (1998), pp. 817–818.
[389] Sergei L Kosakovsky Pond, Simon D W Frost, and Spencer V Muse. “HyPhy:
hypothesis testing using phylogenies.” eng. In: Bioinformatics 21.5 (Mar. 2005),
pp. 676–679. doi: 10.1093/bioinformatics/bti079.
[390] Sergei L Kosakovsky Pond, David Posada, Michael B Gravenor, Christopher H
Woelk, and Simon D W Frost. “GARD: a genetic algorithm for recombination
detection.” eng. In: Bioinformatics 22.24 (Dec. 2006), pp. 3096–3098.
doi: 10.1093/bioinformatics/btl474.
[391] Sergei L Kosakovsky Pond, David Posada, Michael B Gravenor, Christopher H
Woelk, and Simon D W Frost. “Automated phylogenetic detection of recombination
using a genetic algorithm.” eng. In: Mol Biol Evol 23.10 (Oct. 2006), pp. 1891–1901.
doi: 10.1093/molbev/msl051.
[392] H Kishino and M Hasegawa. “Evaluation of the maximum likelihood estimate of
the evolutionary tree topologies from DNA sequence data, and the branching order
in Hominoidea.” eng. In: J Mol Evol 29.2 (Aug. 1989), pp. 170–179.
[393] Ziheng Yang. “PAML 4: phylogenetic analysis by maximum likelihood.” eng. In:
Mol Biol Evol 24.8 (Aug. 2007), pp. 1586–1591. doi: 10.1093/molbev/msm088.
[394] Willie J Swanson, Rasmus Nielsen, and Qiaofeng Yang. “Pervasive adaptive evolu-
tion in mammalian fertilization proteins”. In: Molecular biology and evolution 20.1
(2003), pp. 18–20.
[395] Ziheng Yang, Wendy S W Wong, and Rasmus Nielsen. “Bayes Empirical Bayes
Inference of Amino Acid Sites Under Positive Selection.” eng. In: Mol Biol Evol
22.4 (Apr. 2005), pp. 1107–1118. doi: 10.1093/molbev/msi097.
[396] Patrick S Mitchell, Janet M Young, Michael Emerman, and Harmit S Malik. “Evo-
lutionary analyses suggest a function of MxB immunity proteins beyond lentivirus
restriction”. In: PLoS Pathog 11.12 (2015), e1005304.
[397] Ross M McBee, Shea A Rozmiarek, Nicholas R Meyerson, Paul A Rowley, and
Sara L Sawyer. “The Effect of Species Representation on the Detection of Positive
Selection in Primate Gene Data Sets.” eng. In: Mol Biol Evol 32.4 (Apr. 2015),
pp. 1091–1096. doi: 10.1093/molbev/msu399.
[398] Ingi Agnarsson, Carlos M Zambrana-Torrelio, Nadia Paola Flores-Saldana, and
Laura J May-Collado. “A time-calibrated species-level phylogeny of bats (Chi-
roptera, Mammalia)”. In: PLoS currents 3 (2011).
[399] G. Tsagkogeorga, J. Parker, E. Stupka, J. A. Cotton, and S. J. Rossiter. “Phy-
logenomic analyses elucidate the evolutionary relationships of bats”. In: Curr. Biol.
23.22 (Nov. 2013), pp. 2262–2267.
[400] F. C. Almeida, N. P. Giannini, N. B. Simmons, and K. M. Helgen. “Each flying fox
on its own branch: a phylogenetic tree for Pteropus and related genera (Chiroptera:
Pteropodidae)”. In: Mol. Phylogenet. Evol. 77 (Aug. 2014), pp. 83–95.
[401] M. Ruedi, B. Stadelmann, Y. Gager, E. J. Douzery, C. M. Francis, L. K. Lin, A.
Guillen-Servent, and A. Cibois. “Molecular phylogenetic reconstructions identify
East Asia as the cradle for the evolution of the cosmopolitan genus Myotis (Mam-
malia, Chiroptera)”. In: Mol. Phylogenet. Evol. 69.3 (Dec. 2013), pp. 437–449.
[402] Charles H Calisher, James E Childs, Hume E Field, Kathryn V Holmes, and Tony
Schountz. “Bats: important reservoir hosts of emerging viruses”. In: Clinical micro-
biology reviews 19.3 (2006), pp. 531–545.
[403] Emma C Teeling, Mark S Springer, Ole Madsen, Paul Bates, Stephen J O’Brien,
and William J Murphy. “A molecular phylogeny for bats illuminates biogeography
and the fossil record”. In: Science 307.5709 (2005), pp. 580–584.
[404] Don E Wilson and DeeAnn M Reeder. Mammal species of the world: a taxonomic
and geographic reference. Vol. 12. Johns Hopkins University Press, 2005.
[405] John E McCormack, Brant C Faircloth, Nicholas G Crawford, Patricia Adair Gowaty,
Robb T Brumfield, and Travis C Glenn. “Ultraconserved elements are novel phy-
logenomic markers that resolve placental mammal phylogeny when combined with
species-tree analysis”. In: Genome research 22.4 (2012), pp. 746–754.
[406] Mariana F Nery, Dimar J Gonzalez, Federico G Hoffmann, and Juan C Opazo.
“Resolution of the laurasiatherian phylogeny: evidence from genomic data”. In:
Molecular phylogenetics and evolution 64.3 (2012), pp. 685–689.
[407] Nancy B Simmons. “An Eocene big bang for bats”. In: Science 307.5709 (2005),
pp. 527–528.
[408] Nancy B Simmons and J H Geisler. “Phylogenetic relationships of Icaronycteris,
Archaeonycteris, Hassianycteris and Palaeochiropteryx to extant bat lineages, with
comments on the evolution of echolocation and foraging strategies in Microchi-
roptera”. In: Bulletin of the American Museum of Natural History 235 (1998), pp. 1–
182.
[409] Maureen A O’Leary, Jonathan I Bloch, John J Flynn, Timothy J Gaudin, Andres
Giallombardo, Norberto P Giannini, Suzann L Goldberg, Brian P Kraatz, Zhe-Xi
Luo, Jin Meng, et al. “The placental mammal ancestor and the post–K-Pg radiation
of placentals”. In: Science 339.6120 (2013), pp. 662–667.
[410] Zhen Liu, Shude Li, Wei Wang, Dongming Xu, Robert W Murphy, and Peng Shi.
“Parallel evolution of KCNQ4 in echolocating bats”. In: PLoS One 6.10 (2011),
e26618.
[411] Michal Szczesniak, Misako Yoneda, Hiroki Sato, Izabela Makalowska, Shigeru Kyuwa,
Sumio Sugano, Yutaka Suzuki, Wojciech Makalowski, and Chieko Kai. “Character-
ization of the mitochondrial genome of Rousettus leschenaulti”. In: Mitochondrial
DNA (2013), pp. 1–2.
[412] Lin-Fa Wang, Peter J Walker, and Leo LM Poon. “Mass extinctions, biodiversity
and mitochondrial function: are bats ’special’ as reservoirs for emerging viruses?”
In: Current opinion in virology 1.6 (2011), pp. 649–657.
[413] James W Wynne and Lin-Fa Wang. “Bats and viruses: friend or foe?” In: PLoS
Pathog 9.10 (2013), e1003651.
[414] Tabea Binger, Augustina Annan, Jan Felix Drexler, Marcel Alexander Müller, René
Kallies, Ernest Adankwah, Robert Wollny, Anne Kopp, Hanna Heidemann, Dickson
Dei, et al. “A novel rhabdovirus isolated from the straw-colored fruit bat Eidolon
helvum, with signs of antibodies in swine and humans”. In: Journal of virology 89.8
(2015), pp. 4588–4597.
[415] Paul M Arguin, Kristy Murray-Lillibridge, ME Miranda, Jean S Smith, Alan B
Calaor, and Charles E Rupprecht. “Serologic evidence of Lyssavirus infections
among bats, the Philippines”. In: Emerging Infectious Diseases 8.3 (2002), pp. 258–
262.
[416] Hume Field, Brad McCall, and Janine Barrett. “Australian bat lyssavirus infection
in a captive juvenile black flying fox”. In: Emerging infectious diseases 5.3 (1999),
p. 438.
[417] Eric M Leroy, Brice Kumulungui, Xavier Pourrut, Pierre Rouquet, Alexandre Has-
sanin, Philippe Yaba, André Délicat, Janusz T Paweska, Jean-Paul Gonzalez, and
Robert Swanepoel. “Fruit bats as reservoirs of Ebola virus”. In: Nature 438.7068
(2005), pp. 575–576.
[418] Xavier Pourrut, Marc Souris, Jonathan S Towner, Pierre E Rollin, Stuart T Nichol,
Jean-Paul Gonzalez, and Eric Leroy. “Large serological survey showing cocircula-
tion of Ebola and Marburg viruses in Gabonese bat populations, and a high sero-
prevalence of both viruses in Rousettus aegyptiacus”. In: BMC infectious diseases
9.1 (2009), p. 159.
[419] Suxiang Tong, Yan Li, Pierre Rivailler, Christina Conrardy, Danilo A Alvarez
Castillo, Li-Mei Chen, Sergio Recuenco, James A Ellison, Charles T Davis, Ian
A York, et al. “A distinct lineage of influenza A virus from bats”. In: Proceedings
of the National Academy of Sciences 109.11 (2012), pp. 4269–4274.
[420] Suxiang Tong, Xueyong Zhu, Yan Li, Mang Shi, Jing Zhang, Melissa Bourgeois,
Hua Yang, Xianfeng Chen, Sergio Recuenco, Jorge Gomez, et al. “New world bats
harbor diverse influenza A viruses”. In: PLoS Pathog 9.10 (2013), e1003657.
[421] Sabrina Weiss, Peter T Witkowski, Brita Auste, Kathrin Nowak, Natalie Weber,
Jakob Fahr, Jean-Vivien Mombouli, Nathan D Wolfe, Jan Felix Drexler, Christian
Drosten, et al. “Hantavirus in bat, Sierra Leone”. In: (2012).
[422] Wen-Ping Guo, Xian-Dan Lin, Wen Wang, Jun-Hua Tian, Mei-Li Cong, Hai-Lin
Zhang, Miao-Ruo Wang, Run-Hong Zhou, Jian-Bo Wang, Ming-Hui Li, et al. “Phy-
logeny and origins of hantaviruses harbored by bats, insectivores, and rodents”. In:
PLoS Pathog 9.2 (2013), e1003159.
[423] Marcel A Müller, Stéphanie Devignot, Erik Lattwein, Victor Max Corman, Gaël
D Maganga, Florian Gloza-Rausch, Tabea Binger, Peter Vallo, Petra Emmerich,
Veronika M Cottontail, et al. “Evidence for widespread infection of African bats
with Crimean-Congo hemorrhagic fever-like viruses”. In: Scientific reports 6 (2016).
[424] MF Almeida, LFA Martorelli, CC Aires, PC Sallum, EL Durigon, and E Massad.
“Experimental rabies infection in haematophagous bats Desmodus rotundus”. In:
Epidemiology and infection 133.03 (2005), pp. 523–527.
[425] Judith N Mandl, Rafi Ahmed, Luis B Barreiro, Peter Daszak, Jonathan H Epstein,
Herbert W Virgin, and Mark B Feinberg. “Reservoir host immune responses to
emerging zoonotic viruses”. In: Cell 160.1 (2015), pp. 20–35.
[426] ML Baker, Tony Schountz, and L-F Wang. “Antiviral immune responses of bats: a
review”. In: Zoonoses and public health 60.1 (2013), pp. 104–116.
[427] Susanne E Biesold, Daniel Ritz, Florian Gloza-Rausch, Robert Wollny, Jan Fe-
lix Drexler, Victor M Corman, Elisabeth KV Kalko, Samuel Oppong, Christian
Drosten, and Marcel A Müller. “Type I interferon reaction to viral infection in
interferon-competent, immortalized cell lines from the African fruit bat Eidolon
helvum”. In: PloS one 6.11 (2011), e28131.
[428] Christopher Cowled, Michelle Baker, Mary Tachedjian, Peng Zhou, Dieter Bulach,
and Lin-Fa Wang. “Molecular characterisation of Toll-like receptors in the black
flying fox Pteropus alecto”. In: Developmental & Comparative Immunology 35.1
(2011), pp. 7–18.
[429] Christopher Cowled, Michelle L Baker, Peng Zhou, Mary Tachedjian, and Lin-
Fa Wang. “Molecular characterisation of RIG-I-like helicases in the black flying
fox, Pteropus alecto”. In: Developmental & Comparative Immunology 36.4 (2012),
pp. 657–664.
[430] Anthony T Papenfuss, Michelle L Baker, Zhi-Ping Feng, Mary Tachedjian, Gary
Crameri, Chris Cowled, Justin Ng, Vijaya Janardhana, Hume E Field, and Lin-Fa
Wang. “The immune gene repertoire of an important viral reservoir, the Australian
black flying fox”. In: BMC genomics 13.1 (2012), p. 261.
[431] Peng Zhou, Christopher Cowled, Lin-Fa Wang, and Michelle L Baker. “Bat Mx1
and Oas1, but not Pkr are highly induced by bat interferon and viral infection”.
In: Developmental & Comparative Immunology 40.3 (2013), pp. 240–247.
[432] Jinju Li, Guangxu Zhang, Dalong Cheng, Hua Ren, Min Qian, and Bing Du.
“Molecular characterization of RIG-I, STAT-1 and IFN-beta in the horseshoe bat”.
In: Gene 561.1 (2015), pp. 115–123.
[433] Peng Zhou, Mary Tachedjian, James W Wynne, Victoria Boyd, Jie Cui, Ina Smith,
Christopher Cowled, Justin HJ Ng, Lawrence Mok, Wojtek P Michalski, et al.
“Contraction of the type I IFN locus and unusual constitutive expression of IFN-α
in bats”. In: Proceedings of the National Academy of Sciences (2016), p. 201518240.
[434] Dirk Holzinger, Carl Jorns, Silke Stertz, Stéphanie Boisson-Dupuis, Robert Thimme,
Manfred Weidmann, Jean-Laurent Casanova, Otto Haller, and Georg Kochs. “In-
duction of MxA gene expression by influenza A virus requires type I or type III
interferon signaling”. In: Journal of virology 81.14 (2007), pp. 7776–7785.
[435] Markus Mordstein, Georg Kochs, Laure Dumoutier, Jean-Christophe Renauld, Søren
R Paludan, Kevin Klucher, and Peter Staeheli. “Interferon-λ contributes to innate
immunity of mice against influenza A virus but not against hepatotropic viruses”.
In: PLoS Pathog 4.9 (2008), e1000151.
[436] Otto Haller, Peter Staeheli, Martin Schwemmle, and Georg Kochs. “Mx GTPases:
dynamin-like antiviral machines of innate immunity”. In: Trends in microbiology
23.3 (2015), pp. 154–163.
[437] Song Gao, Alexander von der Malsburg, Alexej Dick, Katja Faelber, Gunnar F
Schröder, Otto Haller, Georg Kochs, and Oliver Daumke. “Structure of Myxovirus
resistance protein A reveals intra- and intermolecular domain interactions required
for the antiviral function”. In: Immunity 35.4 (2011), pp. 514–525.
[438] Judith Verhelst, Eef Parthoens, Bert Schepens, Walter Fiers, and Xavier Saelens.
“Interferon-inducible protein Mx1 inhibits influenza virus by interfering with func-
tional viral ribonucleoprotein complex assembly”. In: Journal of virology 86.24
(2012), pp. 13445–13455.
[439] Georg Kochs and Otto Haller. “Interferon-induced human MxA GTPase blocks
nuclear import of Thogoto virus nucleocapsids”. In: Proceedings of the National
Academy of Sciences 96.5 (1999), pp. 2082–2086.
[440] Georg Kochs, Christian Janzen, Heinz Hohenberg, and Otto Haller. “Antivirally
active MxA protein sequesters La Crosse virus nucleocapsid protein into perinu-
clear complexes”. In: Proceedings of the National Academy of Sciences 99.5 (2002),
pp. 3153–3158.
[441] Martin Schwemmle, Kirsten C Weining, Marc F Richter, Beat Schumacher, and
Peter Staeheli. “Vesicular stomatitis virus transcription inhibited by purified MxA
protein”. In: Virology 206.1 (1995), pp. 545–554.
[442] Thomas Fricke, Tommy E White, Bianca Schulte, Daniel A de Souza Aranha Vieira,
Adarsh Dharan, Edward M Campbell, Alberto Brandariz-Nuñez, and Felipe Diaz-
Griffero. “MxB binds to the HIV-1 core and prevents the uncoating process of
HIV-1”. In: Retrovirology 11.1 (2014), p. 68.
[443] Benjamin Mänz, Dominik Dornfeld, Veronika Götz, Roland Zell, Petra Zimmer-
mann, Otto Haller, Georg Kochs, and Martin Schwemmle. “Pandemic influenza A
viruses escape from restriction by human MxA through adaptive mutations in the
nucleoprotein”. In: PLoS Pathog 9.3 (2013), e1003279.
[444] Idoia Busnadiego, Melissa Kane, Suzannah J Rihn, Hannah F Preugschas, Joseph
Hughes, Daniel Blanco-Melo, Victoria P Strouvelle, Trinity M Zang, Brian J Wil-
lett, Chris Boutell, et al. “Host and viral determinants of Mx2 antiretroviral activ-
ity”. In: Journal of virology 88.14 (2014), pp. 7738–7752.
[445] Timothy I Shaw, Anuj Srivastava, Wen-Chi Chou, Liang Liu, Ann Hawkinson,
Travis C Glenn, Rick Adams, and Tony Schountz. “Transcriptome sequencing and
annotation for the Jamaican fruit bat (Artibeus jamaicensis)”. In: PloS one 7.11
(2012), e48472.
[446] Darren P Martin, Ben Murrell, Michael Golden, Arjun Khoosal, and Brejnev Muhire.
“RDP4: Detection and analysis of recombination patterns in virus genomes”. In:
Virus Evolution 1.1 (2015), vev003.
[447] Ari Watt, Felicien Moukambi, Logan Banadyga, Allison Groseth, Julie Callison,
Astrid Herwig, Hideki Ebihara, Heinz Feldmann, and Thomas Hoenen. “A novel
life cycle modeling system for Ebola virus shows a genome length-dependent role
of VP24 in virus infectivity”. In: Journal of virology 88.18 (2014), pp. 10511–10524.
[448] Angela D Luis, David TS Hayman, Thomas J O’Shea, Paul M Cryan, Amy T
Gilbert, Juliet RC Pulliam, James N Mills, Mary E Timonin, Craig KR Willis,
Andrew A Cunningham, et al. “A comparison of bats and rodents as reservoirs of
zoonotic viruses: are bats special?” In: Proc. R. Soc. B. Vol. 280. 1756. The Royal
Society. 2013, p. 20122753.
[449] Kate E Jones, Andy Purvis, Ann MacLarnon, Olaf RP Bininda-Emonds, and
Nancy B Simmons. “A phylogenetic supertree of the bats (Mammalia: Chiroptera)”.
In: Biological Reviews 77.2 (2002), pp. 223–259.
[450] Patrick S Mitchell, Corinna Patzina, Michael Emerman, Otto Haller, Harmit S Ma-
lik, and Georg Kochs. “Evolution-guided identification of antiviral specificity de-
terminants in the broadly acting interferon-induced innate immunity factor MxA”.
In: Cell host & microbe 12.4 (2012), pp. 598–604.
[451] Patrick S Mitchell, Michael Emerman, and Harmit S Malik. “An evolutionary per-
spective on the broad antiviral specificity of MxA”. In: Current opinion in micro-
biology 16.4 (2013), pp. 493–499.
[452] Jin Zhao, Haodi Feng, Daming Zhu, Chi Zhang, and Ying Xu. “IsoTree: De Novo
Transcriptome Assembly from RNA-Seq Reads”. In: International Symposium on
Bioinformatics Research and Applications. Springer. 2017, pp. 71–83.
Appendix A. Fight against Ebola
Table A.1: Read statistics. Read counts and assembly/mapping statistics for all 18 HiSeq samples and the additional MiSeq library for R. aegyptiacus.
We mapped all corresponding samples to the H. sapiens and R. aegyptiacus genomes with TopHat and segemehl. Additionally, we built de novo
transcriptome assemblies for both species. For R. aegyptiacus, a de novo transcriptome assembly was computed based on the HiSeq and pooled MiSeq
reads. MiSeq data was assembled with Mira only. For each assembly tool, the number of contigs (>= 0 bp, >= 1000 bp) and the N50 value are listed. For
TopHat and segemehl, overall read mapping statistics are provided. A large number of reads in the EBOV 23 h sample mapped to the EBOV genome.
Detailed statistics can be found in the electronic supplement: www.rna.uni-jena.de/supplements/filovirus_human_bat.
HuH7 (Homo sapiens)
Sample      Mock                 EBOV                 MARV
            3 h    7 h    23 h   3 h    7 h    23 h   3 h    7 h    23 h
Read data (million reads)
raw         40.5   38.0   39.0   34.4   49.9   53.0   44.2   48.4   36.3
proc.       38.4   36.0   36.9   32.8   46.6   50.1   41.8   45.7   34.3
Mapping on human genome (overall read mapping rate in %)
TopHat      89.4   90.8   91.3   89.9   88.9   55.7   90.6   89.3   88.9
segemehl    95.3   95.6   95.1   95.1   93.1   58.3   95.4   94.5   92.8

R06E-J (Rousettus aegyptiacus)
Sample      Mock                 EBOV                 MARV
            3 h    7 h    23 h   3 h    7 h    23 h   3 h    7 h    23 h
Read data (million reads)
raw         50.4   44.0   48.5   41.5   38.4   39.3   37.5   48.7   45.6
proc.       47.8   41.4   45.5   39.4   36.0   37.4   35.5   45.9   43.3
Mapping on bat genome (overall read mapping rate in %)
TopHat      90.6   90.7   92.2   90.2   91.0   72.4   91.3   91.1   89.1
segemehl    97.5   92.0   97.2   97.2   96.7   76.9   97.3   96.9   95.4
Cell line: R06E-J (Rousettus aegyptiacus)
Samples     Mock                 EBOV                 MARV                 pooled
            3 h    7 h    23 h   3 h    7 h    23 h   3 h    7 h    23 h   MiSeq
Read data (million reads)
raw         50.4   44.0   48.5   41.5   38.4   39.3   37.5   48.7   45.6   38.2
processed   47.8   41.4   45.5   39.4   36.0   37.4   35.5   45.9   43.3   38.0
De novo transcriptome assembly (number of contigs and N50)
            >= 0 bp     >= 1000 bp    N50
Oases       370 200     180 458       3 875
TransABySS  790 204     169 324       1 788
SOAP-Trans  699 418     147 144       3 261
Trinity     484 826     188 534       5 071
Mira        162 861     21 987        774
Combined    977 787     277 595       3 923
Mapping on bat transcriptome (overall read mapping rate in %)
TopHat      94.6   94.7   95.2   94.6   95.0   95.8   95.5   95.2   94.8   –
segemehl    98.5   97.0   97.6   98.5   96.6   98.4   98.2   97.6   97.4   –
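The N50 values listed for each assembly summarize contiguity. As a minimal sketch, assuming the common definition (the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length), N50 could be computed as shown here. The function name `n50` and the toy contig lengths are illustrative only and are not taken from the tools or data used in this work.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (common definition)."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest contig downwards until half of the
    # total assembly length is covered.
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("contig_lengths must not be empty")


# Toy example (hypothetical lengths in bp, not real assembly data).
toy_lengths = [2, 2, 2, 3, 3, 4, 8, 8]
toy_n50 = n50(toy_lengths)
```

Individual tools may apply a minimum contig length before computing N50, so values from different programs are only comparable under the same cutoff.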
Table A.2: Number of reads mapping to the viral genomes. For R06E-J samples, we used Blastn+ to find contigs within the R. aegyptiacus
transcriptome assembly which represent the full EBOV (contig610) and MARV (contig5818) genome, respectively. Read counts were normalized by library
size. Read maximum peaks were calculated for each sample. Interestingly, EBOV seems to replicate much faster in human cells compared to bat cells
between 3 and 7 h (15.6X). However, EBOV decreases its transcription speed again in the following 16 h (15.5X) (see Fig. 5.14B and Fig. 5.16 in the main
text). Similarly, MARV replicates faster between 3 and 7 h in human cells (7.6X) than bat cells (4.3X). The RNA profiles mapping to the viral genomes
are astonishingly similar, showing no mutations and only a minor fraction of reads mapping to the 5’ and 3’ UTR of the genome, which reflects the difference
between the genomic and the transcriptomic level. Read counts are based on unique TopHat mappings.
# reads peak norm. # reads peak norm. # reads peak norm. # reads peak norm.
Mock 3 h 3 689 182 96.07 134 9 3.49 3 956 124 82.76 158 11 3.31
Mock 7 h 1 897 102 52.69 104 8 2.89 4 722 151 119.85 155 13 3.74
Mock 23 h 3 469 148 94.01 128 6 3.47 4 868 164 106.99 289 10 6.35
EBOV 3 h 28 009 1 653 853.93 39 274 1 156 948.65
EBOV 7 h 619 370 43 222 13 291.20 162 618 7 260 4 517.17
EBOV 23 h 10 334 085 429 012 206 269.16 6 853 608 228 449 183 251.55
MARV 3 h 37 504 1 794 897.22 3 896 126 109.75
MARV 7 h 313 238 13 683 6 854.22 21 654 782 471.76
MARV 23 h 701 757 24 435 20 459.39 848 647 22 119 19 599.24
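The normalization and the fold changes quoted in the caption of Table A.2 can be retraced with a few lines of Python. This is a minimal sketch, not the original analysis code: it assumes that "normalized by library size" means reads per million processed reads, and that the fourth to sixth values of the "proc." row in Table A.1 are the human EBOV samples at 3 h, 7 h and 23 h. The names `normalize`, `ebov_reads` and `library_size_millions` are hypothetical.

```python
# Read counts of the human EBOV samples (first "# reads" column of Table A.2).
ebov_reads = {"3 h": 28_009, "7 h": 619_370, "23 h": 10_334_085}

# Processed library sizes in million reads (assumed columns of Table A.1).
library_size_millions = {"3 h": 32.8, "7 h": 46.6, "23 h": 50.1}


def normalize(reads, library_millions):
    """Return reads per million processed reads (assumed normalization)."""
    return reads / library_millions


# Normalized counts per time point.
norm = {
    time: normalize(ebov_reads[time], library_size_millions[time])
    for time in ebov_reads
}

# Fold change of the normalized counts between consecutive time points.
fold_3_to_7 = norm["7 h"] / norm["3 h"]
fold_7_to_23 = norm["23 h"] / norm["7 h"]

for time, value in norm.items():
    print(f"EBOV {time}: {value:.2f}")
print(f"3 h -> 7 h: {fold_3_to_7:.1f}x, 7 h -> 23 h: {fold_7_to_23:.1f}x")
```

Under these assumptions the sketch reproduces the "norm." values of the human EBOV rows and the two fold changes given for human cells in the caption.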
Table A.3: Comparison of genome and de novo transcriptome assemblies. From the
genomic sequences of H. sapiens and R. aegyptiacus we selected different sets of expressed genes
using various filter thresholds: 1) we selected transcripts from the genome with at least N ∈
{100, 1000, 5000} uniquely mapped reads in one sample (= ∃) or 2) we accumulated all uniquely
mapped reads over all samples (= Σ∀). The selected transcript sets were then blasted against the
corresponding de novo transcriptome assembly of human and bat, respectively. We defined a
transcript (derived from the genomic sequence) as true positive, and therefore correctly assembled,
if we got at least one blast hit with an alignment length > 90%.
∃ sample Σ∀ samples
read count ≥ 100 ≥ 1000 ≥ 5000 ≥ 100 ≥ 1000 ≥ 5000
H. sapiens 96.54% 97.39% 98.17% 93.0% 97.18% 98.08%
R. aegyptiacus 88.26% 92.8% 94.02% 81.25% 90.20% 92.19%
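The percentages in Table A.3 are fractions of selected transcripts that were recovered in the assembly. The following minimal sketch illustrates that count on toy data. It assumes that "alignment length > 90%" is measured relative to the length of the genome-derived transcript; the function `recovered_fraction` and all identifiers and numbers are hypothetical and not taken from the analysis.

```python
def recovered_fraction(transcript_lengths, blast_hits, min_fraction=0.9):
    """Fraction of transcripts with at least one sufficiently long blast hit.

    transcript_lengths: dict mapping transcript id -> transcript length (nt)
    blast_hits: iterable of (transcript id, alignment length) pairs
    min_fraction: required alignment length relative to the transcript length
    """
    recovered = {
        transcript_id
        for transcript_id, alignment_length in blast_hits
        if transcript_id in transcript_lengths
        and alignment_length > min_fraction * transcript_lengths[transcript_id]
    }
    return len(recovered) / len(transcript_lengths)


# Toy data: three expressed transcripts and four blast hits.
lengths = {"t1": 1000, "t2": 2000, "t3": 500}
hits = [("t1", 950), ("t2", 1000), ("t2", 1700), ("t3", 480)]

fraction = recovered_fraction(lengths, hits)
print(f"{100 * fraction:.2f}% of the transcripts count as correctly assembled")
```

Here t1 and t3 pass the threshold, whereas neither hit of t2 covers more than 90 % of its length.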
Table A.5: Top 10 key players of human and bat infection. Comparison between all conditions
and time points within one species. The read_max values are based on multi-mapped
reads, and the candidates listed here are filtered for a read_max of at least 100 reads in one
sample. Fold changes for human samples are based on uniquely mapped reads. Interestingly, genes
coding for histones are up-regulated between 7 h and 23 h in all samples including Mock.
HIST2H4B – highly regulated in Mock and EBOV, probably cell induced and independent of infection;
CENPE – the other samples are fairly constant at around 500 read_max; superscript-sized
numbers – among the top 10 of the following list (sorted by read_max), number of rank; FC – log2 fold
change based on DESeq normalized read counts; norm_reads – DESeq normalized read counts;
change_max – divided read_max values; read_max – maximum number of reads mapping to one
nucleotide position of this gene; Mo – Mock; EV – EBOV; MV – MARV. Genes specified by a
number refer to the corresponding LOC, for example LOC338651. Further details about differential
expression can be obtained from the various tables and pathway figures in the electronic
supplement.
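The read_max measure defined in the caption, the maximum number of reads mapping to one nucleotide position of a gene, can be illustrated with a small coverage computation. This is a sketch on toy intervals only: the function name `read_max`, the 1-based inclusive coordinates and the example alignments are assumptions made for illustration.

```python
def read_max(alignments, gene_start, gene_end):
    """Maximum per-nucleotide read coverage within a gene.

    alignments: iterable of (start, end) read positions, 1-based and inclusive
    gene_start, gene_end: gene boundaries, 1-based and inclusive
    """
    coverage = [0] * (gene_end - gene_start + 1)
    for start, end in alignments:
        # Clip each read to the gene boundaries before counting.
        first = max(start, gene_start)
        last = min(end, gene_end)
        for position in range(first, last + 1):
            coverage[position - gene_start] += 1
    return max(coverage, default=0)


# Toy example: three reads overlap at positions 8 to 10.
reads = [(1, 10), (5, 15), (8, 12), (20, 25)]
peak = read_max(reads, gene_start=1, gene_end=30)
print(peak)
```

The change_max value of the caption would then be the ratio of two such peaks from different samples.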
Table A.7: Top 15 differences between human and bat cells. To investigate genes that were
differentially expressed between human and Rousettus aegyptiacus tissues, we compared R. aegyptiacus
transcripts with the corresponding human genes. R. aegyptiacus transcripts were identified
by homology to annotated Pteropus vampyrus genes. Most of the top 15 genes that differ between human
and bat cells after infection with EBOV and MARV are shut down completely in either human
or bat cells. No gene except RELN is part of Tab. A.5 or Tab. A.6, indicating that these genes
are not differentially expressed during infection but rather point out general differences between the cell
lines HuH7 and R06E-J. The genes are associated with calcium-regulated pathways (ATP2B4),
acyl-CoA pathways (ACADSB), transcription factors (HNF4A), adenylate kinase (AK4), possibly for
nucleotide synthesis, cell cycle (CCND2), keratins, i.e. fibrous proteins forming the structural framework
(KRT5, KRT75), or are involved in actin pathways (ACTA2). FC – log2 fold change based
on DESeq normalized read counts; norm_reads – DESeq normalized read counts; EV – EBOV;
MV – MARV; Mo – Mock. For the complete table, see the electronic supplement.
Table A.8: Comparison of human and bat cells (EBOV and MARV as replicates) infected with filoviruses (3 h, 23 h). Although we observed
various differences in gene expression profiles between EBOV- and MARV-infected cells, both infections share the same disease symptoms. To find genes
that are differentially expressed between human and bat during filovirus infection, we treated EBOV and MARV samples (from the same time point)
as replicates for DESeq analysis (padj ≤ 0.1). Genes are sorted by the maximum fold change of 3 h and 23 h p.i. More than half of the top 30 genes are
related to actin, connective tissues and cell-cell interaction. Since we also observed these massive differences between human Mock cells and bat Mock
cells, they might originate from the differences between the cell lines HuH7 and R06E-J. To overcome this cell line artifact, we removed genes that were
differentially expressed between the Mock samples (human vs. bat cells, padj < 0.1) and list 30 manually selected genes in Tab. A.9. Moreover, we used EBOV and MARV samples
at the same time points as replicates to analyze the impact of filovirus infection compared to Mock in the human cell line (Tab. A.10). FC – log2 fold change
based on DESeq normalized read counts; norm_reads – DESeq normalized read counts; read_max – maximum number of reads mapping to one nucleotide
position of this gene; EV – EBOV; MV – MARV. For the complete table, see the electronic supplement. Genes related to actin, connective tissues and
cell-cell interaction are marked.
norm_reads read_max
human bat human bat
Gene Sample FCmax EV+MV EV+MV EV MV EV MV Function
COL5A1 23 h 16.39 0.42 36585.11 0 2 904 1288 connective tissues
ATP1A3 23 h 16.25 0.48 37903.82 1 0 2524 2370 cation Na+ /K+ transport
ACTA1 23 h 15.82 1.45 84020.59 2 0 15143 18700 actin, alpha skeletal muscle
COL6A3 23 h 15.26 0.42 16721.46 1 1 307 369 connective tissues
EEF1A2 3h 15.15 6.1 221259.05 2 3 31523 30185 Elongation factor 1-alpha 2
CCND2 23 h 14.96 0.42 13558.56 0 1 1790 2633 cell cycle
MYO10 3h 14.45 0.34 7571.73 24 37 285 178 actin-based, filopodia
RELN 23 h -14.19 15418.01 0.82 502 234 1 0 cell-cell interaction
ACADL 3h 13.91 0.34 5231.1 0 1 567 492 Acyl CoA
PTK7 23 h 13.7 1.33 17812.49 1 1 694 930 tyrosine protein kinase
COL4A2 3h 13.7 0.34 4529.52 0 1 156 108 connective tissues
GPM6A 23 h 13.4 0.42 4589.59 1 2 877 942 membrane glycoprotein
KRT75 23 h 13.39 0.91 9771.54 12 9 904 1621 extracellular matrix
MAP3K13 23 h -13.32 11719.28 1.15 2743 2328 1 2 serine/threonine kinase
ACTA2 3h 13.22 2.64 25179.38 10 12 2882 3400 actin, alpha smooth muscle
RASA3 3h 13.22 0.44 4196.36 2 0 303 202 GTPase activating
ACTG2 23 h 13.13 1.27 11391.44 0 2 1112 1901 actin, cytoskeleton
KIT 23 h 13.12 0.48 4306.14 1 0 214 221 cytokine receptor
PXDN 23 h 13.11 0.97 8596.75 5 2 320 625 peroxidasin homolog
ADAM12 3h 13.11 0.88 7802.54 3 7 395 396 cell-cell interaction
SPG20 23 h 13.06 0.42 3621.9 1 1 230 265 microtubulin, GTP
CACNA2D1 3h 13.0 0.44 3615.05 9 6 191 184 Ca2+ channel complex
LOXL1 3h 12.9 0.88 6733.85 2 0 529 464 connective tissues
HTR1D 3h 12.89 0.44 3351.02 1 0 322 496 serotonin receptor
PTPN13 3h 12.88 2.0 15025.71 10 12 418 312 cytoskeleton, GTPase
SLC26A5 3h -12.77 4089.95 0.59 1085 1480 1 1 prestin, motor protein
IQGAP2 3h -12.72 7910.54 1.17 440 500 2 1 Ras-GTPase
COL1A1 3h 12.72 14.31 96443.9 4 4 3277 1784 connective tissues
TMEM47 3h 12.72 0.34 2285.25 0 1 377 529 transmembrane protein
KCNA4 3h 12.69 1.02 6731.18 1 2 333 251 hexokinase
Table A.9: Comparison of filovirus-infected human and bat samples (EBOV and MARV as replicates; 3 h, 23 h). To find
genes that are differentially expressed between human and bat during filovirus infection, we treated EBOV and MARV samples (from the same time point)
as replicates for DESeq analysis (padj ≤ 0.1). We reduced the influence of the different cell types by removing all genes from the initial list (Tab. A.8)
that were also detected as significantly differentially expressed between the Mock samples (Mock 3 h, 7 h and 23 h used as replicates for human and bat samples, respectively). The examples in this list
are manually selected from both lists. Genes are sorted by the maximum fold change of 3 h and 23 h p.i.
Rk – Rank/position in the corresponding sample list. Abbreviations as in Tab. A.8. For the complete table, see the electronic supplement.
norm_reads read_max
human bat human bat
Gene Sample Rk FCmax EV+MV EV+MV EV MV EV MV Function
ALPK3 23 h 1 -5.94 3121.47 50.99 177 58 11 5 kinase, adenovirus related
ARHGAP20 23 h 2 5.82 0.85 48.12 16 12 15 8 GTPase activated protein
SCN4A 23 h 3 5.17 1.7 61.25 2 2 10 14 sodium channel
TCTEX1D4 3h 1 4.94 0.44 13.51 10 2 4 5 connecting phosphatase
OSGIN1 3h 2 -4.47 513.5 23.1 58 176 12 44 oxidative stress, inhibits growth
SLC12A3 23 h 4 4.3 14.62 287.7 71 145 20 40 sodium chloride carrier
SLC16A11 23 h 5 4.29 1.39 27.19 3 2 6 6 carrier monocarboxylate
CCDC78 23 h 6 -4.23 30.8 1.65 6 3 1 2 unknown function
IGSF6 23 h 7 -4.22 30.75 1.65 5 5 3 2 immunoglobulin, inflammatory
UNC13A 23 h 8 4.17 0.97 17.41 7 9 11 8 vesicle, exocytosis
NEIL1 23 h 9 -3.9 42.93 2.87 6 5 5 4 endonuclease, modulated by virus
METRN 3h 4 -3.85 31.13 2.16 11 11 4 14 cell differentiation
ELN 23 h 10 3.79 0.97 13.37 3 2 4 6 elastin, cell-cell
SLC40A1 23 h 11 -3.78 6409.48 465.85 755 1770 59 82 carrier, iron
C11orf52 23 h 12 -3.73 22.87 1.72 10 10 1 3 together with HSP transcribed
SLC10A1 23 h 13 -3.73 26.07 1.97 5 3 5 4 carrier, Na+, entry point HBV/HDV
IGSF6 3h 5 -3.72 19.03 1.44 5 5 3 2 immunoglobulin
MAP6 23 h 15 3.56 0.97 11.48 13 3 6 6 microtubule associated protein
TMEM27 23 h 16 -3.55 23.1 1.97 29 14 4 3 transmembrane
TMOD4 23 h 17 3.53 10.37 119.62 4 6 17 29 tropomodulin, related muscle actin
GRIN2D 23 h 19 -3.25 63.38 6.66 10 7 5 5 glutamate receptor
CLEC4A 23 h 20 -3.24 24.01 2.54 6 9 7 4 cell-cell, immune system
UBC 23 h 24 -3.14 31161.4 3539.27 12040 4245 774 554 ubiquitin
CEP72 23 h 25 -3.11 1003.72 116.45 115 52 21 23 microtubules, centromere
MAST4 23 h 26 3.11 602.04 5186.53 31 44 168 84 microtubules
ELF3 23 h 27 -3.1 1012.45 118.16 218 66 19 30 TF, effector of ERBB2 pathway
GLDN 23 h 28 -3.06 28.62 3.44 7 7 21 5 Ranvier nodes along myelinated axons
TRAF4 3h 15 -2.32 1743.78 348.73 731 201 97 136 activation of NFκB + MAPKs
PLIN2 23 h 54 -2.24 4646.53 983.85 1199 473 152 209 lipid storage
TRIB1 3h 17 -2.18 909.52 201.26 296 147 118 70 Ser/Thr protein kinase
Table A.10: Comparison of filovirus infection to Mock samples (EBOV and MARV as replicates). Comparison of filovirus (EBOV and MARV
treated as replicates) infected samples at 23 h p.i. against Mock samples (3 h, 7 h and 23 h treated as replicates) of human cell samples (padj < 0.1), to
find genes differentially expressed in both filovirus-infected cells compared to Mock. Genes sorted by the maximum fold change and filtered manually for
interesting hits. Abbreviations as in Tab. A.9. For the complete table, see the electronic supplement.
norm_reads read_max
Gene Rank FCmax MO3,7,23 EV23+MV23 MOread_max EVread_max MVread_max Function
SBK3 1 4.68 2.02 51.73 2 11 6 kinase
SULT1E1 2 -4.56 25.56 1.09 6 7 4 sulfotransferase
PLAU 4 -3.91 175.46 11.7 33 29 23 urokinase, degradation of extracellular matrix
FMNL1 8 3.67 35.67 455.05 5 47 25 cytokinesis
ANXA3 20 2.79 226.65 1562.27 66 296 173 cell growth
MYCNOS 21 -2.74 296.44 44.27 73 57 75 viral related oncogene
Table A.12: The most regulated TRIM genes. TRIM proteins were recently reviewed by
Ozato et al. [339]. They represent a superfamily of tripartite motif-containing proteins with more
than 60 members, several of which are known to be required for the restriction of lentivirus
infections. Based on their emerging role in innate immunity, we investigated their features. We
identified at least 11 TRIM genes (TRIM2, 6, 8, 15, 16L, 25, 32, 34, 38, 45, 47, 54, 67, 71) to be
differentially regulated. TRIM14, 21 and 22 were not reported to be differentially expressed, but
show interesting features at a low transcript level (see electronic supplement). Classical fold
change values are reported in the electronic supplement. EV – EBOV; hum – human; read_max
– maximum number of reads mapping to one nucleotide position of this gene.
read_max
TRIM Sample 3h 7h 23 h Remarks
TRIM2 hum-EV 143 184 164 TRIM2 localizes to cytoplasmic filaments
bat-EV 106 99 110
TRIM6 hum-EV 85 104 57 Down-regulation for EBOV 23 h; a read-through transcript
from this gene into the downstream TRIM34 gene has been
observed, which is not the case here
bat-EV NA NA NA
TRIM8 hum-EV 120 116 383 TRIM8 localizes to nuclear bodies; strong up-regulation for
EBOV 23 h
bat-EV 437 648 523
TRIM14 hum-EV 107 160 223
bat-EV 15 23 24
TRIM15 hum-EV 10 12 33 TRIM15 localizes to the cytoplasm
bat-EV NA NA NA
TRIM16L hum-EV 32 26 15
bat-EV 109 119 200 putative homolog
TRIM21 hum-EV <10 <10 <10
bat-EV 45 48 62
TRIM22 hum-EV 84 160 83
bat-EV 60 68 73
TRIM25 hum-EV 80 103 26 TRIM25 localizes to the cytoplasm; interacts with DDX58;
similar pattern after MARV infection; contains mir-3614 in
3’UTR
bat-EV 319 255 299 a much higher and constant level of transcription than human
cells
TRIM32 hum-EV 65 63 34 TRIM32 localizes to cytoplasmic bodies; Mock 23 h
& EBOV 23 h down-regulated, MARV 23 h up-regulated
(read_max:142)
bat-EV 128 111 120
TRIM34 hum-EV 9 13 11 here no read-through transcript from the upstream TRIM6
gene
bat-EV NA NA NA
TRIM38 hum-EV 14 15 14 almost no expression
bat-EV NA NA NA
TRIM45 hum-EV 11 19 15 TRIM45 may function as a transcriptional repressor of the
mitogen-activated protein kinase pathway; almost no
expression
bat-EV 46 30 54 putative homolog
TRIM47 hum-EV 12 18 17 almost no expression
bat-EV 26 23 25 putative homolog
TRIM54 hum-EV 0 0 0 may be important for the regulation of titin kinase and
microtubule-dependent signal pathways in striated muscles; no
expression
bat-EV NA NA NA
TRIM67 hum-EV 17 39 41 up-regulated in EBOV 7 h
bat-EV NA NA NA
TRIM69 hum-EV 10 14 18 Only the first two exons are transcribed, possibly a splice
variant
bat-EV NA NA NA No homolog in Pva and Rae
TRIM71 hum-EV 506 860 283 E3 ubiquitin protein ligase; MARV-infected cells stay at about
read_max=750
bat-EV 0 0 0
Appendix B
Table B.1: Selected metrics based on the output of rnaQUAST, Hisat2, DETONATE, TransRate and BUSCO for the transcripts assembled by all ten tools
on the Escherichia coli RNA-Seq data set. Details and many more statistics complementing this evaluation can be found in the electronic supplement,
content S4–S8. In each row, the top three values are indicated in bold italic. The RSEM-EVAL score is multiplied by 10^9. The number of ambiguous bases
is given in thousands. N50 – the length of the shortest contig in the assembly such that the accumulated bases of all contigs of this length or longer cover
50 % of all bases in the assembly. F1 score – a measure of a test’s accuracy. An F1 score of 1 would mean that all nucleotides/contigs in the estimated
true assembly were recovered with at least 90 % identity. KC score – k-mer compression score reflecting the similarity of each assembly to DETONATE’s
estimated “true” assembly.
Appendix B. The Dart Art of de novo transcriptome assembly
0.13 0.17 0.07
Average alignment length 469.99 457.65 308.92 364.63 472.64 1100.77 540.83 482.56 552.48 194.25
Mean isoform coverage 0.65 0.58 0.22 0.45 0.64 0.73 0.51 0.64 0.51 0.39
TransRate
N50 560 1051 1039 958 582 1809 680 598 755 7795
Reference coverage 0.32 0.31 0.03 0.19 0.31 0.01 0.16 0.3 0.17 0.18
Mean ORF percentage 75.13 65.72 72.56 71.99 74.59 45.13 71.99 73.34 69.35 82.09
Optimal score NA NA NA NA NA NA NA NA NA NA
Percentage good mappings NA NA NA NA NA NA NA NA NA NA
Percentage bases uncovered NA NA NA NA NA NA NA NA NA NA
Number of ambiguous bases 2255 3602 3153 2579 2117 191 2069 2145 2378 2238
DETONATE
Nucleotide F1 0.63 0.49 0.6 0.74 0.64 0.06 0.66 0.62 0.71 0.57
Contig F1 0.03 0.03 0.03 0.06 0.03 0 0.03 0.03 0.06 0.03
KC score 0.82 0.79 0.82 0.86 0.81 0.16 0.83 0.81 0.88 0.65
RSEM EVAL -1.75 -2.45 -1.48 -3.4 -1.74 -2.62 -4.37 -1.97 -2 -1.76
BUSCO
Complete single-copy 258 136 234 316 281 48 296 261 332 96
Missing BUSCOs 189 172 166 178 190 711 196 198 172 370
Table B.2: Selected metrics based on the output of rnaQUAST, Hisat2, DETONATE, TransRate and BUSCO for the transcripts assembled by all ten tools
on the Candida albicans RNA-Seq data set. Details and many more statistics complementing this evaluation can be found in the electronic supplement,
content S4–S8. In each row, the top three values are indicated in bold italic. The RSEM-EVAL score is multiplied by 10^9. The number of ambiguous bases
is given in thousands. N50 – the length of the shortest contig in the assembly such that the accumulated bases of all contigs of this length or longer cover
50 % of all bases in the assembly. F1 score – a measure of a test’s accuracy. An F1 score of 1 would mean that all nucleotides/contigs in the estimated
true assembly were recovered with at least 90 % identity. KC score – k-mer compression score reflecting the similarity of each assembly to DETONATE’s
estimated “true” assembly.
Trans-ABySS Oases SOAP-Trans Trinity IDBA-Tran Shannon Bridger BinPacker SPAdes-sc SPAdes-rna
k-mer size 25,35,45,55,65 25,35,45,55,65 default default 25,35,45,55,65 default default default default default
Hisat2
Overall mapping rate 97.29 93.61 98.56 95.12 96.72 96.66 86.34 96.51 97.5 96.98
rnaQUAST
Transcripts > 1000 bp 3504 9813 8384 2769 4038 4267 2949 4077 3496 3482
Database coverage 0.52 0.6 0.66 0.56 0.48 0.47 0.53 0.54 0.49 0.52
Misassemblies 34 613 667 9 236 259 8 362 64 39
Mismatches per transcript 1.55 2.12 1.34 0.57 2 2.28 0.81 1 1.82 1.24
Average alignment length 957.38 941.6 798.11 567.17 1131.33 1189.55 922.1 678.95 1279.36 843.75
Mean isoform coverage 0.82 0.82 0.84 0.76 0.81 0.82 0.77 0.77 0.85 0.81
TransRate
N50 1573 1629 1666 1349 1824 1846 1315 1093 2105 1911
Reference coverage 0.2 0.24 0.39 0.2 0.17 0.17 0.2 0.3 0.19 0.2
Mean ORF percentage 82.35 79.54 81.25 84.15 79.4 78.65 83.37 87.69 76.75 78.96
Optimal score 0.45 0.06 0.02 0.48 0.36 0.37 0.41 0.05 0.54 0.53
Percentage good mappings 0.75 0.11 0.05 0.83 0.67 0.66 0.72 0.16 0.88 0.87
Percentage bases uncovered 0.2 0.87 0.83 0.02 0.31 0.36 0 0.48 0 0.01
Number of ambiguous bases 9753 26010 23615 8947 10651 11072 8652 12793 9179 9538
DETONATE
Nucleotide F1 0.72 0.51 0.58 0.73 0.74 0.73 0.73 0.64 0.74 0.74
Contig F1 0.08 0.08 0.08 0.06 0.06 0.07 0.05 0.07 0.06 0.07
KC score 0.68 0.62 0.75 0.6 0.68 0.68 0.55 0.72 0.7 0.66
RSEM EVAL -3.39 -4.19 -3.37 -4.48 -3.54 -3.55 -6.11 -3.56 -3.78 -4
BUSCO
Complete single-copy 1279 611 348 1039 1149 1146 1069 460 1510 1458
Missing BUSCOs 162 140 109 248 133 135 240 359 84 93
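The N50 definition used in the captions of Tables B.1–B.8 translates directly into a short computation. The following minimal sketch works on a plain list of contig lengths; the function name `n50` and the toy lengths are hypothetical, and the evaluation tools named in the captions may handle details such as minimum contig length differently.

```python
def n50(contig_lengths):
    """Length of the shortest contig such that all contigs of this length
    or longer together cover at least 50 % of the assembled bases."""
    total = sum(contig_lengths)
    accumulated = 0
    for length in sorted(contig_lengths, reverse=True):
        accumulated += length
        # Stop as soon as half of all bases are covered.
        if 2 * accumulated >= total:
            return length
    return 0  # empty assembly


# Toy assembly with 54 bases in total: 10 + 9 + 8 = 27 covers exactly 50 %.
example_n50 = n50([2, 3, 4, 5, 6, 7, 8, 9, 10])
print(example_n50)
```

Because the value depends only on the length distribution, a few very long contigs can raise the N50 without the assembly being more accurate, which is one reason to read it together with the other metrics in these tables.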
Table B.3: Selected metrics based on the output of rnaQUAST, Hisat2, DETONATE, TransRate and BUSCO for the transcripts assembled by all ten
tools on the Arabidopsis thaliana RNA-Seq data set. Details and many more statistics complementing this evaluation can be found in the electronic
supplement, content S4–S8. In each row, the top three values are indicated in bold italic. The RSEM-EVAL score is multiplied by 10^9. The number of
ambiguous bases is given in thousands. N50 – the length of the shortest contig in the assembly such that the accumulated bases of all contigs of this length
or longer cover 50 % of all bases in the assembly. F1 score – a measure of a test’s accuracy. An F1 score of 1 would mean that all nucleotides/contigs
in the estimated true assembly were recovered with at least 90 % identity. KC score – k-mer compression score reflecting the similarity of each assembly to
DETONATE’s estimated “true” assembly.
0.1 0.14 0.15
Average alignment length 979.22 1232.69 740.81 561.75 990.35 1312.17 848.36 1044.9 936.65 566.67
Mean isoform coverage 0.68 0.73 0.64 0.59 0.67 0.75 0.65 0.7 0.68 0.63
TransRate
N50 1517 1832 1654 1318 1628 1879 1282 1607 1633 1389
Reference coverage 0.18 0.19 0.21 0.15 0.13 0.03 0.15 0.15 0.15 0.14
Mean ORF percentage 74.59 72.07 72.04 80.78 74.66 66.84 80.71 76.08 74.57 80.49
Optimal score NA NA NA NA NA NA NA NA NA NA
Percentage good mappings NA NA NA NA NA NA NA NA NA NA
Percentage bases uncovered NA NA NA NA NA NA NA NA NA NA
Number of ambiguous bases 40711 78762 54460 28632 33008 9081 27666 30178 29654 28079
DETONATE
Nucleotide F1 0.68 0.42 0.58 0.77 0.72 0.26 0.76 0.63 0.79 0.74
Contig F1 0.05 0.03 0.04 0.04 0.04 0 0.03 0.04 0.06 0.04
KC score 0.71 0.66 0.75 0.62 0.66 0.48 0.66 0.37 0.72 0.7
RSEM EVAL -5.34 -6.01 -4.5 -7.33 -5.4 -9.63 -6.71 -1.34 -5.78 -5.38
BUSCO
Complete single-copy 858 546 732 1042 978 203 908 804 1053 859
Missing BUSCOs 222 248 224 248 229 1162 269 296 224 357
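The captions describe the F1 score only as "a measure of a test's accuracy". In its usual definition it is the harmonic mean of precision and recall, which the following minimal sketch computes. It assumes that the nucleotide-level and contig-level F1 values in the tables follow this standard definition; the function name `f1_score` and the example values are hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0  # nothing recovered and nothing correct
    return 2 * precision * recall / (precision + recall)


# Perfect precision and recall give the maximum score of 1.
perfect = f1_score(1.0, 1.0)

# An assembly that recovers everything (recall 1.0) but of which only
# half is correct (precision 0.5) is penalized by the harmonic mean.
unbalanced = f1_score(0.5, 1.0)

print(perfect, round(unbalanced, 3))
```

The harmonic mean stays close to the smaller of the two values, so an assembler cannot reach a high F1 score by producing very many contigs (high recall, low precision) or very few (high precision, low recall).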
Table B.4: Selected metrics based on the output of rnaQUAST, Hisat2, DETONATE, TransRate and BUSCO for the transcripts assembled by all ten tools
on the Mus musculus RNA-Seq data set. Details and many more statistics complementing this evaluation can be found in the electronic supplement,
content S4–S8. In each row, the top three values are indicated in bold italic. The RSEM-EVAL score is multiplied by 10^9. The number of ambiguous bases
is given in thousands. N50 – the length of the shortest contig in the assembly such that the accumulated bases of all contigs of this length or longer cover
50 % of all bases in the assembly. F1 score – a measure of a test’s accuracy. An F1 score of 1 would mean that all nucleotides/contigs in the estimated
true assembly were recovered with at least 90 % identity. KC score – k-mer compression score reflecting the similarity of each assembly to DETONATE’s
estimated “true” assembly.
Trans-ABySS Oases SOAP-Trans Trinity IDBA-Tran Shannon Bridger BinPacker SPAdes-sc SPAdes-rna
k-mer size 25,35,45,55,65 25,35,45,55,65 default default 25,35,45,55,65 default default default default default
Hisat2
Overall mapping rate 92.83 89.29 94 90.98 91.66 54.31 70.64 86.6 94.8 91.98
rnaQUAST
Transcripts > 1000 bp 26155 51832 32136 12603 18266 2037 12294 16915 2630 8692
Database coverage 0.22 0.02 0.25 0.1 0.08 0.01 0.1 0 0.01 0.09
Misassemblies 453 52668 690 61 2643 628 41 1927 248 30
Mismatches per transcript 0.7 0.63 0.32 0.16 0.91 4.61 0.19 0.8 33.2 0.13
Average alignment length 1296.66 406.08 789.42 519.96 1073.4 1972.81 811.99 877.71 636.58 356.91
Mean isoform coverage 0.66 0.25 0.62 0.38 0.47 0.81 0.43 0.22 0.31 0.35
TransRate
N50 2794 1622 2571 2678 2771 3467 1501 1879 2093 2356
Reference coverage 0.23 0.02 0.25 0.1 0.1 0.02 0.09 0 0.09 0.09
Mean ORF percentage 51.94 45.36 53.12 51.45 49.56 38.86 54.28 55.04 50.33 60.82
Optimal score 0.15 0.02 0.09 0.4 0.18 0.13 0.29 0.18 0.37 0.43
Percentage good mappings 0.35 0.06 0.21 0.74 0.42 0.3 0.51 0.38 0.72 0.73
Percentage bases uncovered 0.62 0.91 0.66 0.12 0.41 0.6 0.02 0.43 0.03 0.09
Number of ambiguous bases 91174 188947 111356 53611 66954 6514 45526 52948 51123 40869
DETONATE
Nucleotide F1 0.4 0.19 0.39 0.51 0.45 0.07 0.5 0.38 0.52 0.42
Contig F1 0.01 0 0.02 0.02 0.01 0 0.01 0.01 0.01 0.01
KC score 0.69 0.47 0.69 0.55 0.6 0.28 0.53 0.56 0.68 0.59
RSEM EVAL -2.26 -3.33 -2.14 -2.63 -2.37 -4.88 -5.02 -2.93 -3.43 -3.25
BUSCO
Complete single-copy 2323 1529 2020 3454 2836 217 2661 2333 3592 2582
Missing BUSCOs 1945 2476 1929 1987 1926 5871 2188 2534 1992 2822
Table B.5: Selected metrics based on the output of rnaQUAST, Hisat2, DETONATE, TransRate and BUSCO for the transcripts assembled by all ten tools
on the Homo sapiens + EBOV 3h RNA-Seq data set. Details and many more statistics complementing this evaluation can be found in the electronic
supplement, content S4–S8. In each row, the top three values are indicated in bold italic. The RSEM-EVAL score is multiplied by 10^9. The number of
ambiguous bases is given in thousands. N50 – the length of the shortest contig in the assembly such that the accumulated bases of all contigs of this length
or longer cover 50 % of all bases in the assembly. F1 score – a measure of a test’s accuracy. An F1 score of 1 would mean that all nucleotides/contigs
in the estimated true assembly were recovered with at least 90 % identity. KC score – k-mer compression score reflecting the similarity of each assembly to
DETONATE’s estimated “true” assembly.
0.84 0.47 0.86
Average alignment length 754.83 1263.54 505.05 464.16 880.23 2871.75 720.92 885.59 610.84 517.24
Mean isoform coverage 0.47 0.55 0.44 0.38 0.44 0.74 0.45 0.52 0.43 0.42
TransRate
N50 1358 2794 2172 2486 2540 5067 1092 2193 1463 1589
Reference coverage 0.09 0.12 0.12 0.07 0.07 0 0.07 0.11 0.07 0.07
Mean ORF percentage 57.54 48.7 55.56 52.67 51.85 44.28 54.47 58.17 51.23 55.05
Optimal score 0.22 0.02 0.11 0.36 0.2 0.05 0.43 0.05 0.58 0.59
Percentage good mappings 0.43 0.05 0.25 0.74 0.39 0.14 0.71 0.13 0.86 0.87
Percentage bases uncovered 0.48 0.94 0.59 0.11 0.48 0.83 0.01 0.69 0.02 0.01
Number of ambiguous bases 92866 318936 130492 62753 83187 6752 54494 104126 67514 62896
DETONATE
Nucleotide F1 0.49 0.21 0.45 0.56 0.49 0.05 0.55 0.39 0.6 0.58
Contig F1 0.02 0.01 0.05 0.04 0.01 0 0.01 0.03 0.01 0.02
KC score 0.44 0.4 0.56 0.45 0.51 0.22 0.44 0.52 0.51 0.55
RSEM EVAL -1.45 -1.72 -1.19 -1.34 -1.24 -2.95 -1.84 -1.26 -1.32 -1.24
BUSCO
Complete single-copy 1501 972 2309 3316 2477 106 2151 1145 3621 3629
Missing BUSCOs 2103 1892 1839 1934 1862 5976 2216 2096 1856 1975
Table B.6: Selected metrics based on the output of rnaQUAST, Hisat2, DETONATE, TransRate and BUSCO for the transcripts assembled by all ten tools
on the Homo sapiens + EBOV 7h RNA-Seq data set. Details and many more statistics complementing this evaluation can be found in the electronic
supplement, content S4–S8. In each row, the top three values are indicated in bold italic. The RSEM-EVAL score is multiplied by 10^9. The number of
ambiguous bases is given in thousands. N50 – the length of the shortest contig in the assembly such that the accumulated bases of all contigs of this length
or longer cover 50 % of all bases in the assembly. F1 score – a measure of a test’s accuracy. An F1 score of 1 would mean that all nucleotides/contigs
in the estimated true assembly were recovered with at least 90 % identity. KC score – k-mer compression score reflecting the similarity of each assembly to
DETONATE’s estimated “true” assembly. The rnaQUAST statistics are missing for the Oases assembly because the tool crashed every time we executed
it on this data set.
Trans-ABySS Oases SOAP-Trans Trinity IDBA-Tran Shannon Bridger BinPacker SPAdes-sc SPAdes-rna
k-mer size 25,35,45,55,65 25,35,45,55,65 default default 25,35,45,55,65 default default default default default
Hisat2
Overall mapping rate 91.75 85.78 94.3 90.52 89.14 84.63 76.39 90.99 92.97 92.94
rnaQUAST
Transcripts > 1000 bp 34490 NA 38048 15251 23487 22928 16181 35785 16051 13995
Database coverage 0.18 NA 0.21 0.16 0.12 0.09 0.15 0.16 0.14 0.15
Misassemblies 2043 NA 7193 198 5132 5522 310 5070 2222 819
Mismatches per transcript 1.42 NA 0.88 0.49 1.5 3.77 0.91 1.49 1.19 0.88
Average alignment length 951.5 NA 518.79 454.55 807.15 2445.81 697.57 879.65 577.97 497.39
Mean isoform coverage 0.49 NA 0.46 0.38 0.42 0.67 0.45 0.53 0.42 0.42
TransRate
N50 2287 3352 2205 2422 2550 4065 1020 2287 1257 1359
Reference coverage 0.1 0.12 0.13 0.08 0.08 0.06 0.08 0.12 0.08 0.08
Mean ORF percentage 52.7 43.82 51.88 49.38 47.64 47.66 50.41 54.69 47.14 51.18
Optimal score 0.17 0.02 0.08 0.33 0.18 0.11 0.39 0.04 0.53 0.55
Percentage good mappings 0.34 0.04 0.19 0.71 0.37 0.3 0.67 0.1 0.82 0.84
Percentage bases uncovered 0.59 0.95 0.62 0.13 0.47 0.76 0.01 0.71 0.02 0.02
Number of ambiguous bases 135010 413418 161502 77171 101788 82239 67564 134108 84058 78158
DETONATE
Nucleotide F1 0.45 0.2 0.45 0.55 0.49 0.3 0.56 0.37 0.6 0.58
Contig F1 0.01 0.01 0.05 0.04 0.01 0 0.01 0.03 0.01 0.02
KC score 0.53 0.45 0.58 0.49 0.51 0.5 0.43 0.55 0.52 0.51
RSEM EVAL -1.73 -2.35 -1.63 -1.85 -1.8 -1.94 -2.79 -1.73 -2.02 -1.91
BUSCO
Complete single-copy 1873 704 1938 3362 2471 1909 2128 1108 3481 3606
Missing BUSCOs 1788 1817 1768 1855 1785 2392 2217 1955 1833 1873
Table B.7: Selected metrics based on the output of rnaQUAST, Hisat2, DETONATE, TransRate and BUSCO for the transcripts assembled by all ten tools
on the Homo sapiens + EBOV 23h RNA-Seq data set. Details and many more statistics complementing this evaluation can be found in the electronic
supplement, content S4–S8. In each row, the top three values are indicated in bold italic. The RSEM-EVAL score is multiplied by 10^9. The number of
ambiguous bases is given in thousands. N50 – the length of the shortest contig in the assembly such that the accumulated bases of all contigs of this length
or longer cover 50 % of all bases in the assembly. F1 score – a measure of a test’s accuracy. An F1 score of 1 would mean that all nucleotides/contigs
in the estimated true assembly were recovered with at least 90 % identity. KC score – k-mer compression score reflecting the similarity of each assembly to
DETONATE’s estimated “true” assembly.
0.89 0.51 0.63
Average alignment length 1040.23 1445.77 563.12 480.92 874.3 2974.03 724.8 699.82 623.06 527.92
Mean isoform coverage 0.49 0.56 0.46 0.38 0.43 0.72 0.45 0.47 0.43 0.42
TransRate
N50 2555 3512 2278 2483 2733 4709 1103 1419 1460 1528
Reference coverage 0.09 0.11 0.12 0.07 0.07 0.03 0.07 0.08 0.07 0.07
Mean ORF percentage 53.94 47.19 53.62 51.42 48.88 46.97 52.33 61.58 48.36 52.72
Optimal score 0.32 0.03 0.05 0.4 0.2 0.07 0.24 0.05 0.41 0.43
Percentage good mappings 0.51 0.11 0.13 0.81 0.41 0.33 0.42 0.12 0.63 0.66
Percentage bases uncovered 0.62 0.94 0.62 0.13 0.48 0.82 0.01 0.47 0.03 0.02
Number of ambiguous bases 119663 301780 133425 64547 87621 43913 57271 60817 71110 65429
DETONATE
Nucleotide F1 0.44 0.22 0.45 0.55 0.48 0.2 0.55 0.38 0.59 0.58
Contig F1 0.01 0.01 0.04 0.04 0.01 0 0.01 0.03 0.01 0.02
KC score 0.72 0.21 0.77 0.69 0.74 0.71 0.16 0.1 0.32 0.43
RSEM EVAL -1.6 -3.86 -1.44 -1.59 -1.54 -2.01 -4.31 -4.94 -3.13 -2.82
BUSCO
Complete single-copy 1971 939 2078 3238 2581 883 2034 1426 3354 3391
Missing BUSCOs 1976 2014 1952 2042 1978 4392 2419 2764 2012 2103
Table B.8: Selected metrics based on the output of rnaQUAST, Hisat2, DETONATE, TransRate and BUSCO for the transcripts assembled by all ten tools
on the Homo sapiens flux simulated RNA-Seq data set. Details and many more statistics complementing this evaluation can be found in the electronic
supplement, content S4–S8. In each row, the top three values are indicated in bold italic. The RSEM-EVAL score is multiplied by 10^9. The number of
ambiguous bases is given in thousands. N50 – the length of the shortest contig in the assembly such that the accumulated bases of all contigs of this length
or longer cover 50 % of all bases in the assembly. F1 score – a measure of a test’s accuracy. An F1 score of 1 would mean that all nucleotides/contigs
in the estimated true assembly were recovered with at least 90 % identity. KC score – k-mer compression score reflecting the similarity of each assembly to
DETONATE’s estimated “true” assembly.
Trans-ABySS Oases SOAP-Trans Trinity IDBA-Tran Shannon Bridger BinPacker SPAdes-sc SPAdes-rna
k-mer size 25,35,45,55,65 25,35,45,55,65 default default 25,35,45,55,65 default default default default default
Hisat2
Overall mapping rate 95.79 73.26 99.55 91.68 94.02 93.03 85.34 30.77 97.25 96.34
rnaQUAST
Transcripts > 1000 bp 9860 28143 7734 4263 5167 7424 2740 2341 2630 2657
Database coverage 0.48 0.59 0.56 0.41 0.35 0.37 0.38 0.19 0.37 0.38
Misassemblies 185 4094 118 49 533 785 8 66 56 124
Mismatches per transcript 1 2.11 0.41 0.31 1.65 1.85 0.1 0.39 0.2 0.4
Average alignment length 2340.04 2090.55 1090.25 1009.08 1581.3 1755.16 859.97 1061.01 992.74 738.02
Mean isoform coverage 0.82 0.74 0.7 0.61 0.66 0.66 0.59 0.66 0.58 0.54
TransRate
N50 3821 4266 2653 3036 3136 3364 1444 2137 2996 2735
Reference coverage 0.3 0.38 0.47 0.18 0.19 0.2 0.18 0.13 0.16 0.17
Mean ORF percentage 42.66 37.43 44.83 36.28 42.22 43.57 47.62 47.37 40.16 42.78
Optimal score 0.1 0.01 0.17 0.21 0.14 0.13 0.35 0.06 0.46 0.45
Percentage good mappings 0.22 0.05 0.41 0.47 0.33 0.28 0.6 0.14 0.81 0.78
Percentage bases uncovered 0.55 0.83 0.34 0.17 0.22 0.43 0.01 0.17 0.01 0.03
Number of ambiguous bases 35252 110723 26848 15259 18025 25851 10489 7801 10744 11203
DETONATE
Nucleotide F1 0.53 0.22 0.65 0.73 0.71 0.6 0.78 0.45 0.79 0.79
Contig F1 0.06 0.05 0.1 0.08 0.05 0.05 0.05 0.07 0.05 0.07
KC score 0.89 0.74 0.94 0.6 0.82 0.82 0.6 0.26 0.78 0.73
RSEM EVAL -2.78 -4.61 -2.25 -5.27 -3.66 -3.64 -7.52 -1.15 -4.55 -5.18
BUSCO
Complete single-copy 203 86 290 191 316 256 226 143 393 363
Missing BUSCOs 22 22 18 109 28 29 142 364 60 75
Curriculum vitae
Education
since 07/2013 Ph.D. Student
Friedrich Schiller University Jena
Prof. Dr. Manja Marz
RNA Bioinformatics and High Throughput Analysis
2013 Diploma in Bioinformatics (Dipl.-Bioinf.)
Diploma thesis:
“Datenmanagement von Massenspektren und Fragmentierungsbäumen mit BExIS”
(English: "Data management of mass spectra and fragmentation trees with BExIS"; Prof. Dr. Sebastian Böcker)
2007 - 2013 Study of Bioinformatics
Friedrich Schiller University Jena
2006 Diploma qualifying for university admission
Friedrich-Fröbel-Gymnasium Bad Blankenburg
Conferences and Workshops
Declaration of Honour
I hereby declare
• that I prepared this dissertation myself, that I have not adopted any text passages or
results from third parties or from my own examination papers without marking them as such,
and that I have stated in my thesis all aids, personal communications,
and sources that I used,
• that I have not made use of the services of a doctoral consultant
and that third parties have received neither direct nor indirect monetary benefits from
me for work connected with the content of the
submitted dissertation,
• that I have not yet submitted the dissertation as an examination paper for a state or
other academic examination.
The following persons supported me in selecting and evaluating the material and in preparing
the manuscript:
Manja Marz
I have not submitted the same, a similar, or a different thesis as a dissertation at
another university.
Martin Hölzer