
The Dark Art of Next-Generation Sequencing

Fundamental approaches for genomics, transcriptomics, and
differential gene expression

DISSERTATION

submitted for the academic degree of

doctor rerum naturalium
(Dr. rer. nat.)

FRIEDRICH-SCHILLER-UNIVERSITÄT JENA
Faculty of Mathematics and Computer Science

submitted by Dipl. Bioinf. Martin Hölzer

born on 16 March 1988 in Rudolstadt

Jena, 30 June 2017


Reviewers:

1. Prof. Dr. Manja Marz, Friedrich Schiller University Jena, Germany
2. Prof. Dr. Kay Nieselt, Eberhard Karls University Tuebingen, Germany
3. Prof. Dr. Paul Gardner, University of Canterbury, New Zealand
Leave this world a little better than you found it.
— Robert Baden-Powell, Chief Scout of the World
Acknowledgements
First and foremost, I would like to thank my supervisor and the head of our research
group, Manja. Having you as a supervisor is truly never boring. There are enough
interesting questions and projects to fill several theses. With great pleasure I
remember the times sitting together until late in the night working on a paper,
finishing a talk or cross-checking a proposal (of course just in time before the
submission deadline). If the attentive reader asks what is so nice about late-night
work and deadline pressure, I would simply answer: "Mycobacterium avium ssp.
paratuberculosis" – "Bowmore", "NcRNA annotation in bats" – "Ardbeg",
"Differentially expressed genes between human and bat cells" – "Talisker", "Writing
the acknowledgements section of this thesis" – "Aberlour".
I am deeply grateful for being a part of this great research group. Folks, you
are awesome, and thank you for the great time we spent together in and outside the
Jentower. It has been a pleasure working with all of you and I am looking forward
to our next journeys!
Furthermore, I would like to acknowledge all of my co-authors and cooperation
partners for accompanying my PhD time. I am very grateful for the opportunity to
meet so many experienced and great scientists.
I would like to thank our local faculty staff, especially Erik, who helped tirelessly
to keep our computer systems running. Without his help with installing certain
dependencies and finding the correct LaTeX fonts for deprecated journal templates,¹
this thesis would not have been finished in time.
Special thanks go to Kathrin for helping me with bureaucracy and administrative
matters. Thanks for making our team run smoothly by pulling the strings
behind the scenes!
Many thanks and my gratitude go to my family, my mom and dad, my sister
and my grandparents for their infinite support throughout everything. I would
especially like to thank my mom and dad: even though it was not always clear
to you what I was actually doing, you constantly supported me on my way.
Without your help I would not be where I am now.
Cheers to the “Kassa” for providing me with a place to recover from the daily
scientific grind. In the same spirit, thanks to Club Mate and Maya Mate for
helping me refocus on science after recovery.
I warmly appreciate the help of Franzi, Emanuel, Markus, Trice, Max and Basti
in proofreading parts of this thesis and giving helpful advice to improve legibility.
Last but not least, I would like to mention that all of this would not have been
possible without the support of my lasting friends. Thank you for distracting me
from work and for pushing me back!

Thanks. Danke für alles (thank you for everything).

¹ “Yes, they [Genome Biology and Evolution] want the fonts of the company Y&Y from the
last millennium. If I interpret the homepage given in the old files
(http://www.yandy.com/) correctly, this company has either given up or at least changed
its line of business.” – Erik, 28 February 2014

Abstract
In the last decades, our knowledge of the molecular basis of life, the building blocks
known as DNA and RNA, has increased tremendously. Technical breakthroughs
from physics, chemistry, biology, and computer science facilitated the development of
new technologies. Thus, the systematic analysis of massive amounts of data has become
increasingly important and challenges computer scientists on a daily basis. In particular,
Next-Generation Sequencing (NGS) has dramatically increased the accessibility of
genetic information, generating massive amounts of genomic and transcriptomic data
that are rapidly changing the landscape of many life science disciplines.
Nowadays, large amounts of NGS data can be produced in a rapid and cost-effective
way. However, the novel data need to be processed in a comprehensive,
well-documented and transparent way. Unfortunately, the computational analysis
often remains an opaque process, comparable to a Black Box. Therefore, the
automated analysis of NGS data and the incorporation of different methods and
workflows, combined with a clear presentation of the obtained results, is becoming
increasingly important.
The applications of NGS are manifold and range from the analysis of genomes
themselves to how proteins interact with nucleic acids. Various techniques and
protocols exist to generate sequencing data from DNA and RNA to answer a wide
variety of biological questions. As each bioinformatical analysis can be only as good
as the underlying data, the experimental design of an NGS project is of utmost
importance. Current NGS techniques like Illumina produce enormous amounts of
sequencing reads, short snippets of nucleotides derived from fragmented DNA or
RNA molecules. Therefore, the bioinformatical challenge in many applications is to
solve the NGS puzzle and to find the correct connection between the short reads to
reconstruct a full genomic or transcriptomic representation.
Complete and well-annotated reference sequences form an important basis to
successfully tackle various biological problems. The quality of the reference genome
of a certain species is of great importance for the success of the conducted
computational analysis. With the help of NGS data, existing reference sequences can
be improved or newly constructed from scratch. This process is called assembly,
and involves the adequate connection of short sequencing reads to build up the full
genomic sequence. However, the assembly process is not straightforward and can
be hampered by DNA contamination in the data or repetitive regions in the sequence.
Often, different assembly tools and parameter settings need to be tested and
evaluated to construct an appropriate genome assembly. While the assembly process
can already be challenging for small bacterial genomes, it is exceedingly difficult
for larger eukaryotic genomes. Furthermore, a novel genome assembly needs to be
well-annotated to be useful in many applications.
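
To make the idea of connecting short reads concrete, the following minimal Python
sketch merges reads greedily by their longest suffix-prefix overlap. It is a toy
illustration under strong simplifying assumptions (error-free reads from a single
strand, no repeats, no contamination) and not the method of any assembler used in
this thesis; the names overlap, greedy_assemble and min_len are hypothetical.

```python
from itertools import permutations


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Overlaps shorter than `min_len` are ignored and reported as 0.
    """
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the longest overlap.

    Returns the contigs that remain once no overlap >= `min_len` is left.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_len, best_a, best_b = 0, None, None
        for a, b in permutations(contigs, 2):
            size = overlap(a, b, min_len)
            if size > best_len:
                best_len, best_a, best_b = size, a, b
        if best_len == 0:
            break  # nothing left to merge
        contigs.remove(best_a)
        contigs.remove(best_b)
        contigs.append(best_a + best_b[best_len:])
    return contigs


if __name__ == "__main__":
    # Three overlapping toy reads covering one short sequence.
    print(greedy_assemble(["TACGTTAGC", "ATGGCGTA", "GCGTACGT"]))
```

Repetitive regions break this greedy strategy, because a repeat produces several
equally good overlaps; this is one reason why real assemblers rely on more elaborate
graph-based approaches and why their results need to be evaluated.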
However, NGS does not only allow for the sequencing of DNA; it can also
be modified to sequence RNA transcripts (e.g. mRNAs, ncRNAs) that are present
in a biological sample at a given moment in time. RNA sequencing (RNA-Seq)
emerged as a powerful method for the discovery, profiling, and quantification of
RNA transcripts. Nevertheless, with currently available short-read NGS techniques
like Illumina it is not possible to directly sequence RNA molecules. The RNA has
to be reverse transcribed to complementary DNA (cDNA) for sequencing.
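
On the sequence level, reverse transcription can be sketched in a few lines of
Python: the first-strand cDNA is the reverse complement of the RNA template, written
in the DNA alphabet. This is only an illustration of the base-pairing logic and
ignores all laboratory steps; the function name first_strand_cdna is hypothetical.

```python
# Complementary DNA base for each RNA base (A pairs with T, U with A, and so on).
RNA_TO_CDNA = str.maketrans("ACGU", "TGCA")


def first_strand_cdna(rna: str) -> str:
    """Return the first-strand cDNA (5'->3') complementary to an RNA (5'->3')."""
    rna = rna.upper()
    if set(rna) - set("ACGU"):
        raise ValueError("RNA sequence may only contain A, C, G and U")
    # Complement every base, then reverse to read the new strand 5'->3'.
    return rna.translate(RNA_TO_CDNA)[::-1]


if __name__ == "__main__":
    print(first_strand_cdna("AUGGCC"))
```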
The RNA-Seq reads can be used to reconstruct transcripts. If a reference genome
is available, the reads can be aligned to this sequence to derive the transcript
sequences. This process is also known as mapping. However, in many applications no
reference genome is available and its construction is complicated, time-consuming
and costly. Instead, RNA-Seq data can be directly used to assemble the transcripts
de novo. In the last decade, various tools emerged to solve the de novo transcriptome
assembly problem; however, it remains unclear which tool and parameter
settings perform best for a certain data set. A sensible selection of good
assemblies produced by different tools, followed by an appropriate combination of the
assembled sequences, is one way to overcome the limitations of current assemblers.
By mapping RNA-Seq reads to an annotated genome or transcriptome, the
abundances of transcripts that were present in a certain biological sample at a
certain point in time can be measured. If samples from different conditions were
sequenced, the measurements can be compared to identify differentially expressed
genes (DEGs). With the help of NGS, significant changes in gene expression can
be identified in a fast and reliable way, providing researchers with a comprehensive
overview of the genes and pathways that are regulated, for example during a viral
infection. Furthermore, RNA-Seq allows for the genome-wide analysis of transcripts
at single-nucleotide resolution and therefore includes the identification of single
nucleotide variants, gene fusions, allele-specific expression and alternative splicing
events. However, in many NGS studies a gap exists between how the data is
computationally processed and a meaningful interpretation of the final results.
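
The comparison of abundances between two conditions can be illustrated with a small
Python sketch that scales raw read counts to counts per million (CPM) and computes
a log2 fold change per gene. It shows only the basic arithmetic for one sample per
condition; the function names, the pseudocount and the toy counts are assumptions
made for illustration, and no statistical test for significance is included.

```python
import math


def counts_per_million(counts: dict[str, int]) -> dict[str, float]:
    """Scale raw read counts so that they sum to one million per sample."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample contains no reads")
    return {gene: count * 1_000_000 / total for gene, count in counts.items()}


def log2_fold_changes(
    control: dict[str, int],
    treated: dict[str, int],
    pseudocount: float = 1.0,
) -> dict[str, float]:
    """log2(treated / control) per gene on CPM values.

    The pseudocount avoids division by zero for genes without reads.
    """
    if control.keys() != treated.keys():
        raise ValueError("both samples must be quantified on the same genes")
    cpm_control = counts_per_million(control)
    cpm_treated = counts_per_million(treated)
    return {
        gene: math.log2(
            (cpm_treated[gene] + pseudocount) / (cpm_control[gene] + pseudocount)
        )
        for gene in cpm_control
    }


if __name__ == "__main__":
    # Hypothetical toy counts for an uninfected and an infected sample.
    mock = {"geneA": 100, "geneB": 300, "geneC": 600}
    infected = {"geneA": 400, "geneB": 300, "geneC": 300}
    for gene, lfc in log2_fold_changes(mock, infected).items():
        print(f"{gene}\t{lfc:+.2f}")
```

A positive value indicates higher abundance in the treated sample. Whether such a
change is significant can only be judged with replicates and an appropriate
statistical model.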
While broad NGS studies can provide a comprehensive overview of significantly
regulated genes and pathways, it is of utmost importance to take a closer look
at single genes and single nucleotide positions that were previously identified as
significant with NGS. One scenario involves the detection of recombination events
and positively selected sites in an alignment of homologous protein-coding genes.
Such a gene might have been previously identified in a differential expression study
as a key player during a viral infection. If positive selection can be detected, the
gene might be involved in an evolutionary 'arms race' between the virus and its host,
and the determination of amino acid sites that are under positive selection can help
researchers to develop countermeasures against the pathogen.
Combining different approaches from complementary fields, such as genomics,
transcriptomics and single nucleotide investigations, has the greatest potential of
producing comprehensive and useful results. Furthermore, the visualization of
information is a crucial part that provides researchers with a way to quickly examine
large amounts of data, to expose trends and to find patterns and correlations. With
this work, we aim to shed some light on the darkness of Next-Generation Sequencing
by combining, presenting and discussing fundamental approaches for genomics,
transcriptomics, differential gene expression, and beyond.

Zusammenfassung (Summary)
In recent decades, our knowledge of the molecular basis of life, the building blocks
known as DNA and RNA, has grown rapidly. Groundbreaking achievements in fields
such as physics, chemistry, biology, and computer science have motivated and driven
the development of new technologies. In this context, the systematic analysis of huge
amounts of data has become increasingly important and confronts bioinformaticians
with new challenges every day. With the development of new sequencing technologies,
also known as “Next-Generation Sequencing” (NGS), enormous amounts of genomic
and transcriptomic data can be generated in a very short time and at ever lower cost.
However, the enormous throughput and comparatively low cost of NGS also have a
downside: the steadily growing flood of data must be processed and analyzed so that
new findings can be made available in a comprehensive and transparent manner. Far
too often, the analysis of NGS data appears to outsiders as an opaque process,
comparable to a Black Box. Thus, the automated analysis of NGS data, incorporating
different methods and combined with a clear presentation of the results, plays an
increasingly important role.
The applications of NGS data are manifold. Depending on the biological question,
different sequencing techniques and protocols can be applied to sequence both DNA
and RNA molecules. However, since any bioinformatical analysis can only be as good
as the underlying data allow, the experimental design of an NGS project plays a
decisive role. The Illumina sequencing technology, developed by Solexa, is frequently
used. With Illumina, sequencing data in the gigabase range can be generated in a short
time. However, due to technical limitations, only short DNA segments (reads) are read
in each individual sequencing reaction. One task of bioinformatics is to reassemble
these short segments into a complete genome sequence.
For many bioinformatical applications, a good reference sequence combined with a
comprehensive annotation of the genes it contains is indispensable. The quality of the
assembled sequence has a major influence on the success of a computational analysis.
The construction of the genomic sequence from NGS reads is called assembly. The short
read fragments must be put back together correctly to produce the complete genome
sequence. Problems such as DNA contamination and repetitive sequence regions can
cause difficulties during assembly. Often, different assembly programs and parameters
have to be tested and evaluated to achieve the best possible assembly result. Even the
assembly of a small bacterial genome can be a challenge. The assembly and annotation
of larger eukaryotic genomes, however, is many times more difficult.

However, NGS technologies can be used not only for sequencing DNA but also, with
slight modifications, for deciphering the nucleotide sequence of RNA molecules (e.g.
mRNAs, ncRNAs). The sequencing of RNA (RNA-Seq) has become established as a
powerful tool for examining and quantifying RNA transcripts.
RNA-Seq reads can be used to reconstruct the underlying transcripts. If a reference
genome is available, the reads can be aligned to the genome to derive the nucleotide
sequence of the transcripts. This process is also called mapping. In many use cases,
however, no reference sequence is available because the organism in question has
simply not yet been sequenced and assembled. In such a case, the RNA-Seq reads can
be used directly to assemble the transcriptome de novo. In recent years, various
programs have been developed to address this assembly problem. However, it remains
unclear which program achieves the best results and is best suited for assembling
particular RNA-Seq data. The idea of a combined approach involves the use of several
assembly programs and parameters. By selecting and combining the best results, the
disadvantages of the individual programs can be offset to produce more complete
transcriptome assemblies.
Furthermore, RNA-Seq data can be used to determine the relative abundances of
transcripts that were expressed in a biological sample at a particular point in time.
The reads can be aligned to a reference genome or transcriptome and quantified with
the help of an annotation. By comparing the transcript abundances determined in this
way between samples from different conditions (e.g. a healthy cell and a
virus-infected cell), differentially expressed genes (DEGs) can be identified. NGS
methods allow a fast and efficient determination of significant changes in the
transcriptome and thus provide comprehensive insight into the regulation of genes.
In addition, RNA-Seq enables a genome-wide analysis at the level of single
nucleotides. Thus, RNA-Seq data can be used to detect single nucleotide variants,
gene fusions, allele-specific expression patterns, and alternative splicing events.
In summary, NGS studies provide comprehensive insight into significantly regulated
genes and metabolic pathways. Nevertheless, such analyses should be understood as a
useful tool for identifying interesting candidates that must subsequently be examined
further. For example, NGS can help to identify genes in which a change at even a
single nucleotide position can have strong biological effects. Accordingly, homologous
protein-coding genes can be examined for recombination events and positive selection.
A gene expression study might thus already have identified a gene that plays an
important role during a viral infection. The additional detection of positively
selected regions in this gene could indicate an evolutionary arms race between the
virus and its host. Based on the nucleotide sequence of this gene, positively selected
amino acids can be identified, which in turn can serve as a starting point for
developing antiviral countermeasures.
Combining different approaches from complementary fields, such as genomics,
transcriptomics, and the analysis of single nucleotides, has great potential to
produce comprehensive and useful results. Furthermore, the visualization of the data
always plays a decisive role. A good and transparent presentation of the results
enables other scientists to draw their own conclusions and gain insights in order to
address further questions. In this thesis, fundamental applications relating to
genomics, transcriptomics, differential gene expression, and beyond are presented,
combined, and discussed comprehensively, in order to shed a little light on the
darkness of Next-Generation Sequencing analyses.

Preface
This thesis covers large parts of my research in the field of Next-Generation
Sequencing (NGS) over the last four years. Besides my main projects, dealing mostly
with the analysis and interpretation of various NGS data, I was involved in several
side projects, some of which are also included in this thesis. During this time, I
worked in the RNA Bioinformatics and High Throughput Analysis group of Professor
Manja Marz at the Friedrich Schiller University Jena.
During my PhD I came into contact with various kinds of NGS data, confronting
me with different biological questions and computational problems. By the time of
writing this thesis, I had been involved in the experimental design of roughly
50 NGS projects, all aiming to answer manifold biological questions and involving
different species like human, mouse, bats, fungi, algae, bacteria and also viruses. For
most of the projects presented here, I was more or less involved from the very start
(experimental setup) through the sequencing design (technology, parameters, costs) to
the bioinformatical analyses of the obtained data and the final interpretation of the
results. With this thesis, I want to encourage other researchers involved in similar
NGS projects to get in touch with cooperation partners as soon as possible to discuss
the experimental design and to get the most out of their NGS run.
Most of the results presented in this work have been published and were achieved
in cooperation with my supervisor Manja Marz, my great colleagues and many amazing
collaborators (details are given at the beginning of each chapter). During my PhD
I was primarily responsible for, or at least involved as a co-author in, a total of
19 publications (see pages xiv and xv), of which eight have already been published
[1–8], two are currently submitted [9, 10], and nine are in preparation [11–19] at
the time of submitting this thesis. The already published work comprises one first
authorship, three joint first authorships and two second authorships.
Since it is not possible to present all topics I was involved in over the last four
years in this thesis, I will focus on [1, 2, 4–7, 9, 16, 17] (see page xiv) complemented
with data and results from [3, 8, 10–15, 18, 19] (page xv).
It was a well-considered decision to also include unpublished work in this thesis
where additional benefit for the presented topics could be gained. The side projects
I was involved in deal with different biological and computational problems; therefore
some of them are briefly touched on in this thesis, while others are just mentioned here.
In one of those projects, I had the great opportunity to support my former
colleague Abdullah in his research on tRNA remolding events in metazoan
mitochondrial genomes. Here, my main contribution was the calculation of alignments
of tRNAs and the implementation of a novel maximum likelihood based algorithm
called MLRD (Maximum Likelihood Remolding Detection) to identify the position
of a remolding event by utilizing a previously calculated phylogenetic tree.
Furthermore, I was mainly responsible for the visualization of the alignments, trees
and detected remolding events. I really appreciate this joint work, which in the end
found its way into a great publication [3] and Abdullah's thesis [20].
I was further involved in two broad annotation studies, dealing with the detection
of non-coding RNAs (ncRNAs) in bats [12] and the lift-over annotation of ncRNAs
of various nematode species [13]. In particular, the extended annotation of bat
ncRNAs made it possible to improve my further research involving different bat
species [4, 7, 15].
In September 2015 I attended a workshop in Copenhagen that aimed to
collect and describe bioinformatical tools related to the RNA world. The main goal
was to generate a community-driven catalog of RNA bioinformatic resources and
their relationships [11].
Together with Dr. Daniel Steinbach from the Universitätsklinikum Jena I am
working on whole-exome sequencing data (a special kind of NGS data) to identify
somatic single nucleotide variants between the exomes of bladder cancer patients at
different tumor states [14].
Another RNA-Seq project, in which I have been involved from the very beginning of the
sequencing design through to the ongoing bioinformatical analysis of the obtained
data, investigates the transcriptional response of Myotis daubentonii cells to interferon
treatment and a Rift Valley Fever virus infection [15]. The project is conducted together
with Prof. Friedemann Weber from the Justus-Liebig University Gießen and is only
included in this thesis as a short detour to compare the different NGS setups
between Sec. 5.2 and 5.3.
Another ongoing project, which I am currently supervising, deals with the
implementation of a web server to perform interactive principal component analyses
(PCA) on RNA-Seq quantification data [19]. The project is implemented by one of our
master's students, Ruman Gerst. The idea of such an interactive tool, allowing for
the visualization of 2- and 3-dimensional PCAs with flexible variance cutoffs, was
born during my work on the transcriptional response of human monocytes to various
infections under vitamin treatment [5, 6].
This thesis consists of seven chapters. The main results are presented in
Chapters 3–6.
In Chapters 3 and 4, fundamental approaches for genome and transcriptome
assembly, two applications of NGS, are presented. Chapter 3 deals with the
simpler genome assemblies of two bacterial species [1, 2]. In Chapter 4 a
comprehensive across-species comparison of different de novo transcriptome assembly
tools is presented [16]. Our idea of a merged assembly of various tools and parameter
settings is presented and evaluated as a proof of concept [17].
Chapter 5 deals with another application of NGS data: the identification of
differentially expressed genes between certain conditions. I selected two exemplary
RNA-Seq studies for this thesis. The first one draws a comprehensive picture of
the transcriptional response of human monocytes to fungal and bacterial infections
under vitamin treatment [5, 6]. In contrast, the second, computationally
much more complicated project deals with the transcriptional response of human
and bat cells to Ebola and Marburg virus infections at different time points [4].
After the comprehensive discussion of different high-throughput NGS projects,
special use cases for single genes are presented in Chapter 6. The presented
approaches deal with the detection of positive selection and recombination events in
homologous protein-coding sequences. Although the analyses are not directly based
on NGS data, NGS can help to identify target genes. While working on differential
gene expression during certain types of infections [4–7, 15], I found positive
selection to often be a result of an evolutionary host-virus ‘arms race’. To
comprehensively detect and visualize positively selected sites, I developed a web
server called PoSeiDon [9], which has already been applied to the Mx1 gene of 13 bat
species [7].
For almost all of the projects I conducted during my PhD time, I generated
extensive electronic supplement pages, allowing other researchers as well to easily
obtain, interpret and re-use the results in an effective and transparent manner.
Furthermore, I was heavily involved in the organization and execution of two
great and very intensive ‘hackathons’ in Jena. The first one, dealing with one of my
first NGS projects (presented in Sec. 5.2), took place in the second year of my PhD.
After the start of the 2014 Ebola outbreak in West Africa, we decided to speed up
our analyses and invited specialized scientists in the field of RNA-Seq to join us for
one week in Jena to “Fight against Ebola”. The second one, in whose organization I
was involved, took place in April 2017. Under the topic “Stay young or Die trying” we
met again with scientists from all over Europe to work on the JenAge data set and
tackled different questions related to the field of aging. Although the organization,
execution and wrap-up of such workshops are very stressful and time-consuming, the
outcome of such intensive weeks, bringing together many great scientists of different
expertise, is of invaluable benefit for all participants.

This thesis is first and foremost based on the
following publications:

[1] Petra MöbiusΨ, Martin HölzerΨ, Marius Felder, Gabriele Nordsiek, Marco Groth,
Heike Köhler, Kathrin Reichwald, Matthias Platzer, and Manja Marz. “Comprehensive
insights in the Mycobacterium avium subsp. paratuberculosis genome using new WGS
data of sheep strain JIII-386 from Germany”. In: Genome Biology and Evolution 7.9
(2015), pp. 2585–2601.

[2] Martin HölzerΨ, Verena KrählingΨ, et al. “Differential transcriptional responses to
Ebola and Marburg virus infection in bat and human cells”. In: Scientific Reports 6
(2016), p. 34589.

[3] Martin Hölzer, Karine Laroucau, Heather Huot Creasy, Sandra Ott, Fabien Vorimore,
Patrik M Bavoil, Manja Marz, and Konrad Sachse. “Whole-genome sequence of Chlamydia
gallinacea type strain 08-1274/3”. In: Genome Announcements 4.4 (2016), e00708–16.

[4] Konstantin RiegeΨ, Martin HölzerΨ, Tilman E. Klassert, Emanuel Barth, Julia
Bräuer, Maximilian Collatz, Franziska Hufsky, Nelly B. Mostajo, Magdalena Stock,
Bertram Vogel, Hortense Slevogt, and Manja Marz. “Massive Effect on LncRNAs
in Human Monocytes During Fungal and Bacterial Infections and in Response to
Vitamins A and D”. In: Scientific Reports 7 (2017), p. 40598.

[5] Tilman E. Klassert, Julia Bräuer, Martin Hölzer, Magdalena Stock, Konstantin
Riege, Christina Zubiría-Barrera, Mario M. Müller, Silke Rummler, Christine Skerka,
Manja Marz, and Hortense Slevogt. “Differential Effects of vitamins A and D on
the Transcriptional Landscape of Human Monocytes during Infection”. In: Scientific
Reports 7 (2017), p. 40599.

[6] Jonas Fuchs, Martin Hölzer, Mirjam Schilling, Corinna Patzina, Andreas Schoen,
Thomas Hoenen, Gert Zimmer, Manja Marz, Friedemann Weber, Marcel A. Müller,
and Georg Kochs. “Evolution and antiviral specificity of interferon-induced Mx
proteins of bats against Ebola-, Influenza-, and other RNA viruses”. In: Journal of
Virology (2017), JVI–00361.

[7] Martin Hölzer and Manja Marz. “PoSeiDon: a web server for the detection of
evolutionary recombination events and positive selection”. In: Bioinformatics (2017),
submitted.

[8] Martin Hölzer and Manja Marz. “The Dark Art of de novo Transcriptome Assembly:
A Comprehensive Across-species Comparison of short-read RNA-Seq assemblers”.
In preparation (2017).

[9] Martin Hölzer and Manja Marz. “GOAssembler: A Method Pipeline for the
Construction, Evaluation and Clustering of de novo Transcriptome Assemblies”.
In preparation (2018).

Ψ These authors contributed equally to this work.

xiv
... and partially based on and complemented by:

[10] Abdullah H Sahyoun, Martin Hölzer, Frank Jühling, Christian Höner zu
Siederdissen, Marwa Al-Arab, Kifah Tout, Manja Marz, Martin Middendorf, Peter F
Stadler, and Matthias Bernt. “Towards a comprehensive picture of alloacceptor tRNA
remolding in metazoan mitochondrial genomes”. In: Nucleic Acids Research 43.16
(2015), pp. 8044–8056.
[11] Petra Möbius, Elisabeth Liebler-Tenorio, Martin Hölzer, and Heike Köhler.
“Evaluation of associations between genotypes of Mycobacterium avium subsp.
paratuberculosis and presence of intestinal lesions characteristic of
paratuberculosis”. In: Veterinary Microbiology 201 (2017), pp. 188–194.
[12] Petra Möbius, Gabriele Nordsiek, Martin Hölzer, Michael Jarek, Manja Marz, and
Heike Köhler. “Complete genome sequence of JII-1961 – a bovine Mycobacterium
avium subsp. paratuberculosis field isolate from Germany”. In: Genome
Announcements (2017), submitted.
[13] The RNA tools and software consortium. “A community-driven catalog of RNA
bioinformatics tools and their ontologies”. In preparation (2017).
[14] Nelly B Mostajo, Martin Hölzer, Abdullah H Sahyoun, Verena Krähling, Stephan
Becker, and Manja Marz. “A comprehensive annotation of non-coding RNAs in bats”.
In preparation (2017).
[15] Sebastian Bartschat, Clara Bermudez-Santana, Anke Busch, Alexander Donath, Jan
Engelhardt, Andreas R Gruber, Jana Hertel, Michael Hiller, Martin Hölzer,
Franziska Hufsky, Emanuel Barth, Frank Jühling, et al. “Comparative Analysis of
Non-Coding RNAs in Nematodes”. In preparation (2017).
[16] Martin Hölzer, Manja Marz, Marc-Oliver Grimm and Daniel Steinbach.
“Elucidation of the molecular mechanisms of progression of the non-muscle invasive
urothelial carcinoma of the urinary bladder (NMIBC) and identification of possible
prognostic markers and therapeutic targets by exom and 3’/5’ UTR mutation
analyzes”. In preparation (2017).
[17] Martin Hölzer, Friedemann Weber, and Manja Marz. “Description of the
transcriptomic landscape of the microbat Myotis daubentonii in response to interferon
stimulation and an infection with the Rift Valley fever virus”. In preparation (2017).
[18] Barbara Müther, Martin Hölzer, Manja Marz, and Georg Kochs. “Evolution and
antiviral specificity of Mx proteins in rodents”. In preparation (2017).
[19] Martin Hölzer, Ruman Gerst, and Manja Marz. “PCAGO: An interactive web
service to analyze RNA-Seq data with principal component analysis”. In preparation
(2017).

[1] Supplement at www.rna.uni-jena.de/supplements/mycobacterium
[2] Supplement at www.rna.uni-jena.de/supplements/filovirus_human_bat
[4,5] Supplement at www.rna.uni-jena.de/supplements/fungi_infection
[6] Supplement at www.rna.uni-jena.de/supplements/mx1_bats/full_aln
[8] Supplement at www.rna.uni-jena.de/supplements/the_dark_art
[14] Supplement at www.rna.uni-jena.de/supplements/bats_ncrna
[16] Supplement at www.rna.uni-jena.de/supplements/urology_all

xv
Contents

1 Introduction 1
1.1 The Dark Art of Next-Generation Sequencing . . . . . . . . . . . . . 1
1.2 Contribution and scope of this thesis . . . . . . . . . . . . . . . . . . 2
1.3 Comprehensive supplemental materials . . . . . . . . . . . . . . . . . 4

2 Welcome to the Black Box 7


2.1 Next-Generation Sequencing . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.1 The building blocks of life . . . . . . . . . . . . . . . . . . . . 10
2.1.2 Illumina . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.1.3 NGS design . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.1.4 Preprocessing of sequencing reads . . . . . . . . . . . . . . . . 17
2.2 Assembly . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2.1 Building genomes . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2.2 Building transcriptomes . . . . . . . . . . . . . . . . . . . . . 19
2.3 Annotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.4 Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.5 Gene expression analyses . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.5.1 Read quantification and normalization . . . . . . . . . . . . . 23
2.5.2 Fold changes and statistics . . . . . . . . . . . . . . . . . . . . 25
2.6 Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

3 Genome Assembly 29
3.1 Assembly of the whole-genome of Chlamydia gallinacea . . . . . . . . 30
3.1.1 The genus Chlamydia . . . . . . . . . . . . . . . . . . . . . . . 30
3.1.2 Sequencing and assembly . . . . . . . . . . . . . . . . . . . . . 31
3.2 Comprehensive insights in the MAP genome . . . . . . . . . . . . . . 33
3.2.1 The genus Mycobacterium . . . . . . . . . . . . . . . . . . . . 33
3.2.2 Sequencing and assembly . . . . . . . . . . . . . . . . . . . . . 35
3.2.3 Annotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.2.4 Phylogenetic reconstruction . . . . . . . . . . . . . . . . . . . 38
3.2.5 Results and discussion . . . . . . . . . . . . . . . . . . . . . . 40
3.2.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

4 Transcriptome Assembly 55
4.1 The Dark Art of de novo transcriptome assembly . . . . . . . . . . . 56
4.1.1 RNA-Seq: a revolution in transcriptomics . . . . . . . . . . . 56
4.1.2 Material and methods . . . . . . . . . . . . . . . . . . . . . . 59

4.1.3 Results and discussion . . . . . . . . . . . . . . . . . . . . . . 68
4.1.4 Conclusions and future perspectives . . . . . . . . . . . . . . . 80
4.2 Cluster de novo transcriptome assemblies: a proof-of-concept . . . . . 84
4.2.1 How to improve assemblies? . . . . . . . . . . . . . . . . . . . 84
4.2.2 Cluster approaches . . . . . . . . . . . . . . . . . . . . . . . . 86
4.2.3 Evaluation of merged assemblies . . . . . . . . . . . . . . . . . 87
4.2.4 A possible cluster-assembly pipeline and future work . . . . . 92

5 Differential Gene Expression 95


5.1 Differential effects of vitamins on human monocytes after infections . 96
5.1.1 Biological effects of vitamin A and D . . . . . . . . . . . . . . 97
5.1.2 Study design . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.1.3 Bioinformatic analysis of RNA-Seq data . . . . . . . . . . . . 100
5.1.4 Results and discussion . . . . . . . . . . . . . . . . . . . . . . 103
5.1.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.2 Differential expression in EBOV/MARV infected human and bat cells 120
5.2.1 The 2014 Ebola outbreak in West Africa . . . . . . . . . . . . 121
5.2.2 Sequencing and assembly . . . . . . . . . . . . . . . . . . . . . 122
5.2.3 Differential gene expression . . . . . . . . . . . . . . . . . . . 126
5.2.4 Results and discussion . . . . . . . . . . . . . . . . . . . . . . 129
5.2.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
5.3 A short NGS design detour: RVFV infection in bat cells . . . . . . . 142

6 Single Nucleotide Investigations 145


6.1 PoSeiDon: Positive Selection Detection . . . . . . . . . . . . . . . . . 147
6.1.1 Positive selection and recombination . . . . . . . . . . . . . . 147
6.1.2 Pipeline and implementation . . . . . . . . . . . . . . . . . . . 148
6.1.3 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
6.2 Evolution and antiviral specificity of bat Mx proteins . . . . . . . . . 154
6.2.1 Bats and the Mx1 gene . . . . . . . . . . . . . . . . . . . . . . 155
6.2.2 Material and methods . . . . . . . . . . . . . . . . . . . . . . 156
6.2.3 Results and discussion . . . . . . . . . . . . . . . . . . . . . . 158
6.2.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164

7 Conclusions and future perspectives 167

Bibliography 173

Appendix 208

A Fight against Ebola 209

B The Dark Art of de novo transcriptome assembly 221


Chapter 1

Introduction

1.1 The Dark Art of Next-Generation Sequencing


In recent decades, our knowledge of the molecular basis of life, the building blocks
known as DNA and RNA, has increased tremendously through technical breakthroughs
in physics, chemistry, biology, and computer science. As a result, the systematic
analysis of massive amounts of data has become more and more important and
challenges computer scientists every day. In particular, Next-Generation Sequencing
(NGS) has dramatically increased the accessibility of genetic information, generating
massive amounts of genome and transcriptome data that are rapidly changing
the landscape of many life science disciplines. Decisions to adopt or not adopt NGS
are often driven by a variety of factors.
Whereas large amounts of sequencing data can nowadays be produced rapidly and
cost-effectively, one of the most frequently raised questions is how to handle the
sheer amount of data generated by current NGS systems.
Of course, NGS techniques are very powerful, but it is also challenging to select
appropriate protocols and the best-fitting parameters for a particular sequencing
run (e.g. throughput and length of the reads) in order to maximize the chance of
answering the scientific questions initially raised.
Is it the goal of the NGS project to assemble the genome of a new species? Is
it a rather small bacterium or a higher eukaryote? How complex is the genome?
Are you interested in the differential expression pattern of a certain gene? Or all
genes? Do you also aim to consider genes that are differentially alternatively spliced?
What about currently unknown genes that are simply not annotated yet? Are you also
interested in the expression of non-coding genes and small RNAs like miRNAs? Do
you also aim to sequence viral sequences possibly included in your samples? Is there
a reference genome available? If not, is a closely related species available that could
function as a reference (is it well annotated?) or do you need to build a de novo
genome/transcriptome assembly instead?
Based on the answers to these questions, an NGS project can be set up by considering
the need for and extent of replication, different protocols for molecule selection
and library preparation, the required throughput and read length, and further
specific parameters such as strand-specificity and the insert size between paired-end
reads.
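As a rough illustration of how throughput and read length enter such design
decisions, the expected mean sequencing depth can be approximated as the number of
sequenced bases divided by the genome size. The following Python sketch assumes this
simple uniform-coverage model; the function name, its parameters and the example
numbers are hypothetical and are not taken from any project in this thesis.

```python
def expected_coverage(num_fragments: int, read_length: int,
                      genome_size: int, paired_end: bool = False) -> float:
    """Estimate the mean sequencing depth under a uniform-coverage assumption.

    num_fragments counts sequenced fragments; with paired_end=True each
    fragment contributes two reads of length read_length.
    """
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    reads_per_fragment = 2 if paired_end else 1
    total_bases = num_fragments * reads_per_fragment * read_length
    return total_bases / genome_size


# Hypothetical example: 2 million single-end reads of 100 bp
# on a 5 Mbp bacterial genome.
print(expected_coverage(2_000_000, 100, 5_000_000))  # 40.0
```

Such an estimate only gives a first orientation; real coverage is rarely uniform and
depends on the library preparation and the composition of the genome.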


[Figure 1.1: illustration of the Black Box; labels: HiSeq 2500, Data, Algorithms, Analyses]
Figure 1.1: The comprehensive processing and preparation of huge amounts of data, e.g. obtained
from an NGS experiment, involves many tools and methods based on different algorithms, which
together perform the various analyses needed to tackle the specific computational and biological
problems raised. For many researchers, the way sequencing data (or other kinds of biological
data) are bioinformatically transformed into final result tables and figures often remains an
opaque and obscure process, like a Black Box. This often leads to problems in the correct
interpretation of the data, for example if scientists from other fields are not able to understand
what happened inside the box.

If the biological questions that should be answered with the help of an NGS run
are clear, an appropriately chosen experimental design can give great insights into
the biological context and, furthermore, a lot of additional information that can also
be used by other scientists to answer different questions.
However, the huge amount of data obtained needs to be handled and processed in
a comprehensive, well-structured and transparent way. The ongoing development
of new sequencing technologies, offering for example longer reads but also higher
error rates, still challenges current bioinformatic approaches for quality control,
mapping, assembly and the detection of differential gene expression. In the rapidly
evolving field of NGS technologies, new algorithms and bioinformatical tools have to
be evaluated and developed continuously in order to meet the requirements of novel
NGS data. Understandably, it is often hard for other scientists to follow and completely
understand the workflow of large bioinformatical pipelines producing the output that
they are finally supposed to interpret. Often, this remains an opaque process,
comparable to a Black Box (Fig. 1.1).

1.2 Contribution and scope of this thesis


This thesis focuses mainly on the examination of NGS data. During my PhD I handled
a large number of projects dealing with different kinds of such data, DNA-Seq as well
as RNA-Seq, from a wide variety of species and biological topics. In most of the
projects I was involved from the very start through to the bioinformatical analyses
and the final interpretation of the results. Consequently, I gained a deep understanding
of different NGS workflows and their advantages and disadvantages. In particular, a
well-considered selection of an appropriate sequencing technology and corresponding
protocols for library preparation can already decide the success or failure of the
bioinformatical analysis.
Besides the different NGS projects, which give a more general and broader
overview of certain biological topics, I also worked on more specialized problems
by taking a closer look at specific genes and single nucleotide positions, e.g. by
searching for positively selected sites in protein-coding sequences (Sec. 6.1 and 6.2).


In the end, obtaining an overall picture by NGS and combining it with more
detailed investigations of single genes and nucleotide positions has the greatest
potential to generate high-quality and sophisticated insights into manifold biological
topics.
In the following chapters of this thesis, I will present a selection of different
projects dealing with whole genomes, comprehensive transcriptomes and single nu-
cleotides as well. Furthermore, a broad variety of different species, from bacteria to
fungi and eukaryotes, even viruses, will be in the focus of different chapters.
In Chapter 2 (Welcome to the Black Box) I will present some basic mechanics
and methods that are referenced throughout this thesis. However, it is not
possible to give a full and detailed overview of all the biological and computational
background on which the content of this thesis is based. Nevertheless, the basic ideas
presented in Chapter 2 should be comprehensive enough to understand the following
chapters.
In Chapter 3 (Genome Assembly) two bacterial genome assembly studies are
discussed. The first one, dealing with the obligate intracellular bacterium
Chlamydia gallinacea, presents a basic genomic study, whereas the second one,
focusing on the pathogen Mycobacterium avium subsp. paratuberculosis, presents not
only the assembly of this bacterium but also a comprehensive genome-wide comparison
and annotation study.
In Chapter 4 (Transcriptome Assembly) I switch from the topic of DNA sequenc-
ing and genome assembly to the sequencing of RNA and the reconstruction of whole
transcriptomes instead of genomes. Different problems and challenges need to be
taken into account when constructing a transcriptome instead of a genome. A com-
prehensive across-species comparison of different de novo transcriptome assembly
tools is presented, complemented by a proof of concept study discussing the ad-
vantages and disadvantages of merging assemblies of different tools and parameter
settings.
Chapter 5 (Differential Gene Expression) combines the genomic and transcriptomic
concepts presented in Chapters 3 and 4 and mainly deals with the detection of
differentially expressed genes from RNA-Seq data. In the first part, the
transcriptional landscape of human cells infected with either the bacterium
Escherichia coli or one of the two fungi, Candida albicans and Aspergillus fumigatus,
is presented. The second (much more complicated) study deals with the differential
transcriptional responses to Ebola and Marburg virus infections in cells from bats
and humans. The transcriptome assembly process presented in Chapter 4 is picked
up again here. At the end of this chapter, a short detour compares two different
NGS designs and sequencing setups.
Finally, after the comprehensive discussion of different high-throughput NGS
projects, in Chapter 6 (Single Nucleotide Investigations) I will go deeper
inside the Black Box and investigate more specialized use cases and narrower topics.
The two sections of this chapter are not directly related to NGS data and
deal with the detection of positive selection and recombination events in homologous
protein-coding genes. Due to my work on differentially expressed genes during certain
viral infections [4, 7, 15], I also came into contact with positive selection, often a
result of a host-virus ’arms race’ during evolution. To comprehensively detect and
visualize positively selected sites, I developed a web server called PoSeiDon, which is
presented and applied to the Mx1 gene of 13 bat species in this chapter.
In Chapter 7 the results of this thesis are summarized, conclusions are drawn and
an outlook on future work is given.
With this thesis, I give deep insights into different bioinformatical techniques and
design principles, looking from further away (Chapters 3–5) and also more closely
(Chapter 6) at different kinds of data (not exclusively NGS data), aiming to answer
various biological questions.

1.3 Comprehensive supplemental materials


For most of the projects presented here, comprehensive supplemental materials are
available. Some of the material is also appended to this thesis, but high-throughput
experiments and the corresponding data analyses generate huge amounts of output
data. Thus, in most cases it is simply not possible to provide the data as a text or
PDF file, let alone to print it.
Therefore, most of the supplements I generated during my PhD are electronic
and publicly accessible via a web browser. Depending on the project, they com-
prise extensive sortable gene and expression data tables with connections to other
databases like Ensembl and NCBI, multiple pathway/alignment/tree figures in cer-
tain formats, and detailed descriptions of the performed methods to reproduce the
whole bioinformatic analyses.
The idea behind these extensive electronic supplements is to present the data as
transparently and reproducibly as possible and to give researchers easy access to the
processed data, allowing them to answer their own biological questions. As an example,
in the supplemental material accompanying our publication “Differential
transcriptional responses to Ebola and Marburg virus infection in bat and human cells” [4],
we provide an interactive gene observer (called IGO, see Sec. 5.2), giving a
comprehensive overview of all human and bat genes investigated in this study (>27 000),
including their expression profiles and splicing graphs. Thus, IGO allows researchers
to search for their genes of interest and to see how they behave during an infection
with Ebola or Marburg viruses at different time points in a human and a bat cell line.

• Sec. 3.2 http://www.rna.uni-jena.de/supplements/mycobacterium/

• Sec. 4.1 http://www.rna.uni-jena.de/supplements/the_dark_art/

• Sec. 5.1 http://www.rna.uni-jena.de/supplements/fungi_infection/

• Sec. 5.2 http://www.rna.uni-jena.de/supplements/filovirus_human_bat/

• Sec. 6.2 http://www.rna.uni-jena.de/supplements/mx1_bats/full_aln


Final remarks before entering the Black Box


This thesis is based on a wide variety of biological and bioinformatical
topics, among others Next-Generation Sequencing, genome and transcriptome
assembly, the annotation of protein-coding and non-coding genes, differential
expression assays, phylogenetics, the identification of single nucleotide variants,
and the search for positively selected sites in protein-coding sequences. Due to the
interdisciplinary nature of these fields, it is almost impossible to provide background
information on all topics touched upon. I tried to include sufficient background to
make the topics of this thesis as understandable as possible, hopefully also for
someone not directly involved in the bioinformatics or, more precisely, the NGS
community. However, I assume that the reader has basic knowledge of molecular
biology as well as computer science. As usual in scientific presentations, I use the
scientific ’we’ instead of the personal pronoun ’I’ in this thesis, because most of the
work presented here would not have been possible without the help of my colleagues
and cooperation partners.

And now, welcome to the Black Box...

Chapter 2

Welcome to the Black Box


In recent decades, an overwhelming number of bioinformatical tools has
accumulated to tackle the broad variety of problems raised by the life-science
community. Processing raw data with these tools to finally produce a table or plot
that can be interpreted by researchers often remains an opaque process, like a
Black Box. Selecting an appropriate tool to analyze a given data set with respect to
a specific biological question can already be a challenging task. In many cases,
different tools are applied to the same data and their output is compared and merged
to compensate for tool-specific disadvantages. Often, an intersection of the most
significant results is reported. For example, multiple assembly tools (Sec. 2.2) can
be used on the same data and their output can be merged to achieve a more complete
view of the genome or transcriptome of a particular species than the outcome of a
single tool would provide.
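The idea of reporting only results supported by several tools can be sketched in a
few lines of Python. The example below is a minimal illustration under stated
assumptions: the function name, the tool names and the gene lists are hypothetical,
and each tool is assumed to have already produced its own list of significant
identifiers.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def consensus_hits(results: Dict[str, Iterable[str]], min_tools: int) -> Set[str]:
    """Return the identifiers reported by at least `min_tools` of the tools."""
    counts: Counter = Counter()
    for hits in results.values():
        counts.update(set(hits))  # count each identifier once per tool
    return {name for name, n in counts.items() if n >= min_tools}


# Hypothetical lists of significant genes from three unnamed tools.
results = {
    "tool_a": ["IL6", "FOS", "JUN"],
    "tool_b": ["IL6", "FOS", "EGR1"],
    "tool_c": ["IL6", "JUN", "EGR1"],
}
print(sorted(consensus_hits(results, min_tools=3)))  # ['IL6']
print(sorted(consensus_hits(results, min_tools=2)))  # ['EGR1', 'FOS', 'IL6', 'JUN']
```

Setting `min_tools` to the number of tools yields the strict intersection, whereas
lower values trade specificity for sensitivity.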
In September 2015, a workshop took place in Copenhagen to collect and describe
bioinformatical tools related to the RNA world [11]. The main goal was to generate
a community driven catalog of RNA bioinformatic resources and their relationships.
An ontology was invented to group the various tools in relation to their different
topics, functions, data and formats. This extended ontology will be used to improve
the already available EDAM ontology [21] and the tools with their meta data will
be provided to the life-science community via the ELIXIR node [22], with the goal
to build a stable and sustainable infrastructure to spread biological information
across Europe. Such a community driven database represents an important resource
to help also non-bioinformaticians to obtain better insights into the Black Box of
bioinformatical methods and tools.

[Figure 2.1: bar chart of topic terms (x-axis: Occurrence, 0 to 75): RNA secondary
structure, RNA 3D structure, RNA−RNA interactions, Alternative splicing, RNA
secondary structure alignment, Read quantification, MicroRNA, Data visualisation,
Differential expression, ncRNA]
Figure 2.1: In February 2016 there were already 389 tools registered by the
ELIXIR RNA community [11]. The ten most frequently used topic terms are shown.
Topics mainly (red) and casually (yellow) tackled in this thesis are marked.
Before we enter the Black Box, some general algorithms, methodologies and
workflows that are widely used throughout the chapters of this thesis are
introduced (Fig. 2.2). More specific methods are described separately
in the corresponding chapters.

[Figure 2.2: overview graphic of the Black Box. Panels: (A) Wet Lab Preparation
(experimental design) and Next-Generation Sequencing (e.g. Illumina); (B) Preprocessing
(quality check and trimming); (C) Mapping & Quantification (genome and/or transcriptome
reference); (D) Differential Gene Expression; (E) Assembly (de novo, reference-based);
(F) Gene Annotation (proteins, ncRNAs, database); (G) Pathways; (H) Alignment
(nucleotide, amino acid); (I) Phylogeny; (J) Remolding (tRNAs); (K) Positive Selection
(and recombination); (L) Genetic Variation (SNVs, INDELs, isoforms); (M) Visualization;
(N) Databases. Tools named in the panels include FastQC, Bowtie, TopHat2, Segemehl,
SAMTools, HTSeq-count, BEDTools, FeatureCounts, DESeq, R Bioconductor, Cufflinks,
SPAdes, Trinity, cd-hit-est, Cluster-Assembly, Ensembl, Blast, GORAP, Bacprot, KEGG,
GO, Pathview, ClustalW, TranslatorX, Mafft, cmalign, RAxML, MrBayes, Newick Utilities,
LoFreq, PAML4, CODEML, GARD, IGV, Sashimi, NCBI, MLRD, IGO and PoSeiDon.]
Figure 2.2: What is inside the Black Box? This thesis presents a broad range of topics that combine a wide variety of bioinformatical methods and
tools. The figure gives a comprehensive overview of the different topics and the related chapters (CX) in which they are discussed.
Some of the tools used throughout the corresponding chapters are mentioned as examples (written in italics). Tools and pipelines that I developed during my
PhD as parts of different projects are additionally marked in blue (e.g. MLRD [3], IGO [4], PoSeiDon [9]).
At the start of almost all projects there is an idea about the specific questions that
should be answered. (A) Depending on those questions, an experiment can be designed (Sec. 2.1.3)
to obtain sufficient data that is needed to answer those questions. This could be a Next-Generation
Sequencing (NGS) experiment, where samples are collected, RNA/DNA is extracted (e.g. in three
biological replicates) and specific molecules of interest (like miRNAs) are enriched, libraries are
generated and the sequencing itself is conducted to obtain the raw read data (Sec. 2.1). However,
the data can be also obtained from publicly available databases like Ensembl or NCBI (N). (B) The
raw data enters the Black Box and is preprocessed (Sec. 2.1.4). The quality-checked and adjusted
read data can be passed to mapping (C) (Sec. 2.4) and/or assembly (E) (Sec. 2.2), depending on
the questions raised. The mapped data can be further quantified and normalized (Sec. 2.5.1) to
estimate RNA abundances (C). If different samples and conditions were sequenced, these can be
compared to detect significantly differentially expressed genes (DEGs) (D) (Sec. 2.5). (G) Significant
genes can be used for pathway analyses and GO-term enrichment. (F) A quantification of read
data or the comparison of core genes between different species is only possible if an annotation
is available (Sec. 2.3). Whereas the methods shown in C–G aim to give a general view of
many genes simultaneously and to reveal connections between them, taking a closer
look at single genes or even single nucleotide positions is an important task to finally obtain
a comprehensive picture of a particular biological topic. (H) Therefore, building an alignment
of sequences is a basic and crucial task in many bioinformatical applications. (I) The
differences and similarities in an alignment can be utilized to calculate phylogenetic trees, providing
insights into the evolution of species and their genes. (J) It is also possible to focus on a
single gene family, like tRNAs, where one can search for a particular biological phenomenon like
remolding by combining data of H and I. This joint work is not presented in this thesis; however, the
interested reader can find details in the thesis of my former colleague Dr. Abdullah Sahyoun [20]
and our publication [3]. (K) Another possibility involves the detection of positively selected sites
and putative recombination events in an alignment of multiple protein-coding sequences. (L)
Another topic involves the identification of genetic variations such as single nucleotide
variants and insertions/deletions, helping to understand what happens in a biological system at the
nucleotide level. (M) Finally, visualization is an important part of all topics (Sec. 2.6). A
comprehensive and clear visualization of results, for example obtained from huge NGS projects, is
crucial and helps researchers to understand and interpret the data correctly. Shown as examples are
figures visualizing DEGs from far away (MA plots, 2D/3D PCA, heat map) and on a closer level
(box plots of individual genes, scatter plot comparing fold changes) as well as a Mauve alignment
of three bacterial genomes [1]. Importantly, the results can and should be validated again in the
wet lab and can be further used to develop new ideas and hypotheses for upcoming experiments.
The presented figure does not claim to be a complete representation of all possible analysis steps
of different bioinformatical approaches. However, it serves as a comprehensive overview of the
different topics interlocked in this thesis. Figures are partially adapted from our publications [1,
3–7, 9, 14].

Chapter 2. Welcome to the Black Box

2.1 Next-Generation Sequencing


In 1990, the Human Genome Project [23] was launched with the goal of identifying the
sequence of all nucleotide base pairs that make up human DNA. In the end,
the $3-billion project lasted 13 years and involved a huge international consortium
of life scientists [24]. In April 2003, the completion of the genome was announced and the
project concluded with a high-quality human reference genome [25].
Although the assembly of such a huge genome is still a very challenging task,
nowadays the sequencing can be done in just a few days and for only a few thousand
dollars [26], utilizing the still emerging Next-Generation Sequencing (NGS)
technologies.
NGS is a catch-all term used to describe different sequencing technologies like
Illumina/Solexa, Roche 454 or Ion Torrent. All of these modern technologies allow for
the rapid determination of the nucleotide sequences of DNA and RNA (more precisely,
cDNA) molecules in a much more efficient way than the older Sanger sequencing
technique. With NGS, we can sequence whole genomes or transcriptomes in
parallel in only a few hours at comparatively low costs. Therefore, NGS methods
are also known as high-throughput sequencing technologies [27–29].

2.1.1 The building blocks of life


DNA
In recent years, DNA sequencing (DNA-Seq) based on NGS technologies has become
the most sophisticated method for sequencing complete DNA sequences or
genomes of various species, including the human genome and the genomes
of many other animals, plants, bacteria and also viruses [29].
In general, DNA-Seq describes the process of determining the exact sequence of
nucleotides within a DNA molecule. It comprises any method that determines the
order of the four bases – adenine, guanine, cytosine, and thymine – in a strand of
DNA.
A general DNA-Seq workflow (Fig. 2.3A) starts with the (chemical or physical)
fragmentation of the DNA molecules, because they would otherwise be too long
for library preparation and standard NGS. After amplification of the shorter
fragments, thousands or millions of short subsequences, so-called reads, are produced. In
general, methods like Illumina and Ion Torrent produce reads with a length between
50 and 500 bp, depending on the setup and machine used [29].
Next to those short-read NGS technologies, more and more long-read
NGS approaches are emerging. Very popular is the single-molecule real-time
sequencing introduced by Pacific Biosciences [30] (PacBio). As the name
implies, PacBio allows for the sequencing of single DNA molecules without an
amplification step, producing reads with an average length of 15,000 bp and a maximum
of >40,000 bp. Unfortunately, the throughput is much lower compared to
traditional Illumina sequencing. PacBio produces ∼50,000 reads per SMRT cell, whereas
Illumina yields ∼180 million reads on one HiSeq 2500 lane [29]. The Illumina
technique is described in more detail in Sec. 2.1.2. Nevertheless, it is clearly important
to produce longer reads to improve the results of various analyses like the de novo
assembly of highly repetitive and huge genomes (Sec. 2.2). As we deal mostly with
short-read data in the following chapters, the concept of long-read data is
only mentioned and not discussed in more detail in this thesis. However, producing
longer reads and sequencing single molecules without an amplification step is
surely the direction in which NGS is evolving.

RNA
NGS does not only allow for the sequencing of DNA molecules; it can also be
modified for the sequencing of RNA transcripts that are present in a biological
sample at a given moment in time. RNA sequencing (RNA-Seq) is a powerful method
for discovering, profiling, and quantifying RNA transcripts [31, 32]. The sequencing
of RNA transcripts with NGS techniques is also known as whole transcriptome
shotgun sequencing. Nevertheless, with the currently available short-read NGS
techniques like Illumina it is not possible to directly sequence RNA molecules; first
the RNA has to be reverse transcribed to complementary DNA (cDNA) for
sequencing (Fig. 2.3B).

[Figure 2.3 graphic: (A) DNA molecule, fragmentation and library preparation, DNA fragments, NGS, paired-end reads r1/r2 with insert size; (B) DNA molecule with Exon 1 and Exon 2, mRNA, fragmentation, RNA fragments, library preparation, cDNA fragments, NGS, paired-end reads.]
Figure 2.3: Schematic overview of the generation of short reads from DNA-Seq (A) and RNA-Seq
(B) with Illumina. (A) DNA-Seq. The DNA molecule needs to be fragmented first. From the
ends of each fragment, short reads are produced. Shown as an example is the generation of
paired-end reads, explained in more detail in Sec. 2.1.2. Typical fragment sizes are ∼500 nt. If
paired-end reads with a read length of 50 bp from both ends of a fragment are sequenced, the
insert size would be ∼300 nt. Paired-end data can play an important role in assembly (Sec. 2.2).
(B) RNA-Seq. Shown as an example is a genomic DNA with two exons. The resulting mRNA is
fragmented and the fragments are reverse transcribed to cDNA. Then, the cDNA fragments are
sequenced as already shown in (A). The paired-end sequencing of cDNA fragment 1 shows a
special case that can occur in both DNA- and RNA-Seq: if fragments are shorter than two times
the sequenced read length, the insert size can be negative and paired reads can overlap. Details
about library preparation like adapter ligation and amplification are omitted here, see Sec. 2.1.2.
r1/r2: paired-end read 1 (forward) and 2 (reverse).

The general workflow of an RNA-Seq experiment involves 1) the extraction of
total RNA from a biological sample of interest, 2) the purification of the sample to
enrich a certain type of RNA like mRNAs or microRNAs, and 3) the preparation of
a sequencing library (Fig. 2.4). The generation of the library (Fig. 2.3B) may involve
steps like the fragmentation of longer RNA molecules, followed by the reverse
transcription of the RNA to cDNA, ligation of adapters to the 5’ and/or 3’ ends of
the cDNA fragments and PCR amplification to enrich the library for correctly
ligated cDNA fragments [33]. The resulting reads from an RNA-Seq experiment
can be used to estimate the abundances of certain transcripts within each
sequenced sample (Fig. 2.4). If different conditions are sequenced, the obtained
transcript abundances can be further used to identify differentially expressed genes
(Sec. 2.5).

11
Chapter 2. Welcome to the Black Box
[Figure 2.4 graphic. General workflow: desired measurement, reduce to sequencing, sequence, solve inverse problem, analyze. RNA-Seq workflow: RNA abundance, cDNA library preparation, sequence (HiSeq 2500), estimate abundances, differential analysis.]

Figure 2.4: Shown are the typical main steps of an RNA-Seq experiment, with the final goal of
identifying differentially expressed genes between different biological conditions. As we want to measure
transcript abundances in our biological samples, we have to reverse transcribe the extracted RNA
molecules of interest into a cDNA library for sequencing. This procedure involves steps like the
fragmentation of the RNA and adapter ligation. After sequencing, we have to solve the inverse
problem: due to the fragmentation of the transcripts, we need to find the true or most likely location
from which each short read originated. By counting the reads and estimating RNA abundances,
we can then compare different conditions and search for significantly differentially expressed genes, see
Sec. 2.5 for details.

Before RNA-Seq emerged, gene expression studies were performed with
hybridization-based microarrays. In contrast to microarray technology, RNA-Seq
also allows for the identification of novel transcripts and does not necessarily need
a sequenced reference genome. Furthermore, RNA-Seq allows for the genome-wide
analysis of transcripts at single-nucleotide resolution and therefore enables the
identification of single nucleotide variants, gene fusions, allele-specific expression and
alternative splicing events [33].
Most frequently, the NGS platforms Illumina HiSeq/MiSeq, Ion Torrent and 454
Pyrosequencing are used for RNA-Seq. As most of the projects presented in this
thesis are based on the Illumina HiSeq system, this method is described in
more detail in the following section.

2.1.2 Illumina
Illumina emerged as one of the most widely used NGS methods for both DNA-
and RNA-Seq [29, 34]. The high accuracy and throughput together with the still
decreasing costs made the Illumina platform well suited to address many biological
questions, including gene expression studies (Sec. 2.5) and the draft assembly of
(rather small) genomes and transcriptomes (Sec. 2.2).
After the DNA is fragmented and adapters are ligated, the fragments need to be
immobilized on a plate (Fig. 2.5). The plate, also called flow cell, consists of multiple
channels with a dense lawn of primers fixed to the surface, enabling the fragments to
bind. Bridge amplification is used to generate clusters on the plate. In this process,
the adapter at the free end of an already bound fragment interacts with the
complementary primer fixed on the plate. Then, a double-stranded bridge is generated and,
after denaturation, two single-stranded templates are produced, which are able to
participate in the next amplification step. Following this workflow, clusters of
identical sequences are produced, which are needed for the final sequencing step. The sequencing

12
2.1. Next-Generation Sequencing

A B C
F C A
F
F
G
F
F T F
F

F G F G G
F A F T F A F T A HO T
HO HO

Figure 2.5: Workflow of the standard Illumina sequencing. (A) Fluorescently labeled and terminally
blocked nucleotides are added to the flow cell. Previously, fragments were immobilized on the
surface and amplified into identical clusters using bridge amplification. Each cluster on the flow
cell can now incorporate a different base. (B) Each cluster emits a nucleotide-specific color,
recognized by a sensor. (C) The fluorophores are cleaved and washed away from the flow cell. The
3’-OH group is regenerated and a new cycle begins.

itself involves four fluorescently labeled and terminally blocked nucleotides that are
flooded simultaneously over the flow cell in each cycle (Fig. 2.5A). If a nucleotide
is incorporated complementary to the template strand, a specific fluorescence color is emitted
and the newly added base can be identified (Fig. 2.5B). Before the next cycle, the
fluorophores are cleaved and washed from the flow cell and the 3’-OH group of each
nucleotide is regenerated (Fig. 2.5C). In each sequencing cycle, the reads are
elongated by one nucleotide, because the terminal block prevents further incorporation,
and are finally saved as simple text strings in a FASTQ
file [35] for further processing.
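
To make the FASTQ format more tangible, the following minimal Python sketch reads FASTQ records and decodes the quality string into Phred scores. It rests on two assumptions that are not stated in the text above: each record spans exactly four lines, and qualities use the Phred+33 encoding. The function name `read_fastq` and the example record are hypothetical and serve only as illustration.

```python
from io import StringIO


def read_fastq(handle, offset=33):
    """Yield (name, sequence, phred_scores) for each four-line FASTQ record.

    Assumes unwrapped records and a Phred+33 quality encoding (assumption).
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file (or trailing blank line)
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError("malformed FASTQ record: " + header)
        yield header[1:], seq, [ord(char) - offset for char in qual]


# Hypothetical example record with four bases
example = StringIO("@read1\nACGT\n+\nII5#\n")
for name, seq, scores in read_fastq(example):
    print(name, seq, scores)  # read1 ACGT [40, 40, 20, 2]
```

A Phred score Q corresponds to an estimated error probability of 10^(-Q/10), so the last base in the example (Q = 2) is far less reliable than the first two (Q = 40).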
In addition to the typical single-end reads (a fragment is only sequenced from the 5’
or 3’ end) produced by a standard Illumina run, the technology also allows for the
production of so-called paired-end reads. In a paired-end design, each fragment
is sequenced from the 5’ and 3’ end, so two corresponding reads (e.g. r1 and r2)
are produced from each fragment (Fig. 2.3). As the mean fragment size in the
library is known, the insert size between the two related reads
can be calculated. This additional information can be used to significantly improve the
assembly process (Sec. 2.2) and the mapping (Sec. 2.4) of the reads.
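
The relation between fragment size, read length and insert size can be sketched in a few lines of Python. The sketch assumes the simplest possible definition, in which the insert size is the unsequenced gap between the two mates and the fragment length contains no adapter sequence; the function names and the numbers are hypothetical illustrations, not values from a particular library.

```python
def insert_size(fragment_length, read_length):
    """Unsequenced distance between the two mates of a read pair.

    Simplifying assumption: the fragment length excludes adapters.
    A negative value means that the two mates overlap.
    """
    return fragment_length - 2 * read_length


def mates_overlap(fragment_length, read_length):
    """True if the fragment is shorter than two read lengths."""
    return insert_size(fragment_length, read_length) < 0


print(insert_size(300, 100))   # 100
print(insert_size(80, 50))     # -20
print(mates_overlap(80, 50))   # True
```

The second call illustrates the special case of short fragments, where the insert size becomes negative and the paired reads overlap.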
Besides the sequencing of single- and paired-end reads, specialized protocols for
strand-specific NGS exist, like the TruSeq Stranded Total RNA Library Prep Kit from
Illumina. With such kits it is possible to produce strand-specific reads, whereas
in a default Illumina run the strand information is lost. Knowing from which strand
of the DNA an RNA-Seq read is derived is important information that can greatly
improve de novo transcriptome assemblies and read quantification.

13
Chapter 2. Welcome to the Black Box

2.1.3 How to design an NGS experiment


Each bioinformatical analysis can only be as good as the underlying data.
Therefore, the experimental design of an NGS project is of utmost importance to produce
sufficient data that can be bioinformatically exploited to answer the biological
questions of interest. As an example, it would not be reasonable to search for differentially
expressed microRNAs in an RNA-Seq data set produced with an Illumina library
kit that specifically selects for mRNAs.
At the time of writing this thesis, I had been involved in the experimental design of
roughly 40 NGS projects, addressing different biological questions and involving different
species and tissues. When planning an NGS experiment, important questions need
to be asked and clarified before individuals are selected, cells are
harvested or total RNA is extracted. Some of the most important aspects that need
to be considered when designing a successful NGS experiment are 1) general design,
2) replication, 3) multiplexing, and 4) the protocols used for library preparation.

Design
First of all, it is of great importance to clarify from the start which biological
questions should be tackled and answered with the NGS run. In most cases, many
different scientists with expertise in different fields are involved, and it needs to be
clarified what the goals are and whether they really can be achieved with the designed
NGS run. It is usually necessary to collect all the issues that might affect the
results and to highlight what can and cannot be done to resolve them.
How many reads are needed? Which read length is sufficient? What molecules need
to be addressed in the library preparation step?

Replication
Replication is a fundamental step in almost all experiments. Fortunately,
technical replicates are not really needed for most well-established NGS technologies
like Illumina [36]. Much more important are biological replicates, for example when
planning an RNA-Seq experiment to identify differentially expressed genes between a
healthy and a diseased (e.g. virus-infected) condition. Three biological replicates should
be the minimum to consider for almost all experiments. If one sample fails to
generate meaningful results, the outcome of a whole experiment can be useless.
Four replicates already greatly improve the ability to detect differences
between groups, five add even more power, but beyond six replicates the additional
power starts to tail off in many experiments [37, 38]. In most cases, the budget is also an
important factor, unfortunately limiting the number of replicates in far too many
cases.

Multiplexing
Multiplexing describes the process of sequencing more than one sample in one pool
(e.g. Illumina lane). This can be achieved by adding short oligonucleotide sequences
(barcodes) to the fragments. After sequencing, the resulting reads can be
demultiplexed according to the known barcodes. The multiplexing factor depends on the
number of reads that are needed in the experiment. For example, the assembly
of a bacterial-sized genome requires far fewer reads to achieve a sufficient
genome coverage than the assembly of a eukaryotic genome. The estimated coverage of a
genome/transcriptome can be calculated as

$$\text{Coverage} = \frac{r \cdot l}{G}$$
where r is the number of reads, l is the read length and G is the expected genome or
transcriptome size. For a standard RNA-Seq sample, we can expect a transcriptome
size of ∼75,000,000 base pairs. Therefore, a possible sequencing setup might involve
the multiplexing of three samples in one Illumina HiSeq 2500 lane, resulting in
approximately 60 million reads with a read length of 50 bp per sample. Roughly, this
would result in a 40-fold coverage of a eukaryotic transcriptome of the given size.
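
The coverage estimate above can be reproduced with a short Python sketch. The lane yield of 180 million reads, the multiplexing factor of three, the read length of 50 bp and the transcriptome size of 75 Mb are taken from the text; the function names are hypothetical, and the bacterial example at the end (5 Mb genome, 100 bp reads, 30-fold coverage) uses assumed values purely for illustration.

```python
import math


def expected_coverage(num_reads, read_length, target_size):
    """Coverage = r * l / G."""
    return num_reads * read_length / target_size


def reads_needed(coverage, read_length, target_size):
    """Smallest number of reads that reaches the desired coverage."""
    return math.ceil(coverage * target_size / read_length)


reads_per_lane = 180_000_000      # approx. yield of one HiSeq 2500 lane
samples_per_lane = 3              # multiplexing factor
reads_per_sample = reads_per_lane // samples_per_lane

print(reads_per_sample)                                      # 60000000
print(expected_coverage(reads_per_sample, 50, 75_000_000))   # 40.0

# Hypothetical bacterial genome: 5 Mb, 100 bp reads, 30-fold coverage
print(reads_needed(30, 100, 5_000_000))                      # 1500000
```

Turning the formula around in this way helps to choose the multiplexing factor: the fewer reads a sample needs, the more samples can share one lane.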

Protocols

[Figure 2.6 graphic: genome browser view of a 5,254 bp window (approx. 1,035,000 bp to 1,039,000 bp) with two read tracks (Ribo-Zero, smallRNA) and the annotated genes UABP2 and SNORD121A.]

Figure 2.6: Shown is a part of contig GL430013 of the Myotis lucifugus genome assembly. Total
RNA was extracted from a cell line of a closely related species (M. daubentonii) and sequenced on
an Illumina HiSeq 2500 with the Ribo-Zero (top, blue reads) and smallRNA (bottom, red reads)
protocol. Shown are mapped reads of one replicate of a non-infected sample, taken as an example
from a currently ongoing study [15]. By combining reads from both protocols, we are able to
observe the expression of the UABP2 gene and the SNORD121A ncRNA, which would otherwise be
lost when only performing the rRNA depletion protocol. Furthermore, the strand-specificity of the
reads from both protocols and the advantage of a splice-aware mapping tool (in this case STAR [39],
see Sec. 2.4) for RNA-Seq data are shown.

As we can only analyze those molecules that were previously prepared for the se-
quencing run in the library preparation step, it is clearly important to select an
appropriate protocol for each NGS setup. Currently, a wide variety of different li-
brary preparation protocols exist, targeting different types of nucleotide molecules
(we will focus here mostly on Illumina kits). Whereas for DNA-Seq the selection
of an appropriate protocol might not be that difficult, one of the biggest problems
for RNA-Seq is the high amount of ribosomal RNAs (rRNA) present in total RNA


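[Figure 2.7 graphic: (A) FastQC read length distributions for the Standard-Gel and BluePippin selections; (B) Venn diagram of detected miRNAs for Standard-Gel and BluePippin with the values 132, 225 and 2.]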

Figure 2.7: Comparison of smallRNA selection with a gel and the BluePippin protocol. (A) FastQC
length distribution plots after adapter clipping. In the BluePippin selection, the ∼22 nt
peak, most likely originating from microRNAs, is lost. The ∼33 nt peak is most likely produced
by small tRNA- and yRNA-derived RNAs [42]. (B) Overlap between miRNAs covered
by at least 10 uniquely mapped reads. Almost all expressed miRNAs are part of the standard gel
size selection, but 132 are completely lost with the BluePippin protocol. The example results shown
here are based on smallRNA-Seq data obtained from the FLI in Jena. The BluePippin
protocol is currently being established, so only preliminary results of the comparison of both protocols
are shown.

samples. The rRNA can comprise up to 90 % of the total RNA [40]. Various
prior-to-sequencing procedures, such as mRNA amplification kits, can help to enrich the
yield of mRNA [41]. This is mostly done by a specific selection of RNA molecules
with a poly-A tail. Of course, this procedure loses the many RNAs that lack a poly-A tail,
like certain long non-coding RNAs and many other non-coding RNAs.
Therefore, another kit is increasingly used
for total RNA library preparation: the Ribo-Zero rRNA removal kit. This
kit uses magnetic beads to specifically target certain rRNA types in order to
remove them physically. Unfortunately, with this protocol, small RNA molecules
like miRNAs are lost (Fig. 2.6). Specialized protocols can be used to select small
molecules from total RNA, like the Illumina TruSeq Small RNA kits. Nevertheless,
huge differences can occur between specialized protocols for the selection of smallRNAs,
especially if the protocols are not yet well established (Fig. 2.7).
Therefore, the careful selection of a suitable library preparation protocol is a crucial
step in each NGS design.
For a standard RNA-Seq workflow with the goal of identifying differentially expressed
genes in organisms with a reference genome available, we obtained good results
by combining two different Illumina protocols for RNA-Seq: rRNA− (Ribo-Zero)
and smallRNA, to cover most of the transcriptome present in a biological sample
(Fig. 2.6). We also observed that for model organisms a read length of 50 bp is
sufficient for mapping and read quantification and to finally call differentially
expressed genes (Sec. 2.5). It is important to use strand-specific data for the correct
quantification of smallRNAs and antisense transcripts. Of course, if the goal of an
NGS project is the assembly of a new genome, longer paired-end reads should be
preferred.


2.1.4 Preprocessing of sequencing reads


After sequencing, the data is provided as so-called FASTQ files [35], including
the nucleotide sequence of each read as well as quality values estimated during
the sequencing run. Before we can use the data to answer any biological
questions, bases of low quality and possible poly-A/T tails as well as any adapter
sequences need to be removed from each read. To evaluate the data set before
and after any trimming/adapter clipping, FastQC [43] is widely used to
calculate certain statistics and a distribution of quality scores per nucleotide.
Based on the quality values, bases with a comparatively low quality can be
removed, for example with PRINSEQ [44].
As the DNA/RNA was fragmented into smaller pieces before sequencing
and adapters were ligated, it can happen that the adapter is also partially
sequenced. This especially occurs when the sequenced fragment
is shorter than the applied read length, which is often the case for smallRNA
sequencing data (Fig. 2.8). Adapters can be removed with CUTADAPT [45].

[Figure 2.8 graphic: FastQC adapter content plot; x-axis: position in read (bp), y-axis: % adapter; legend: Illumina Universal Adapter, Illumina Small RNA Adapter, Nextera Transposase Sequence, SOLID Small RNA Adapter. Example reads shown in the figure:
AAGCTGCCAGTTGAAGAACTGTTGGAATTCTCGGGTGCCAAGGAACTCCAG
TGTCTGAGCGTCGCTTGGAATTCTCGGGTGCCAAGGAACTCCAGTCACTGA
TATTGCACTTGTCCCGGCCTGTTGGAATTCTCGGGTGCCAAGGAACTCCAG
TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACTGACCAATCTCGTATGCC
CAACGGAATCCCAAAAGCAGCTGTGGAATTCTCGGGTGCCAAGGAACTCCA
TAGCAGCACGTAAATATTGGCGTGGAATTCTCGGGTGCCAAGGAACTCCAG
...]
Figure 2.8: Shown as an example is the FastQC adapter content report for one smallRNA-sequenced
sample of a human HuH7 cell line. The sequencing was done on an Illumina HiSeq 2500,
producing strand-specific single-end reads with a length of 51 bp. Clearly, an Illumina small RNA
adapter was sequenced in some cases, because many smallRNAs (like microRNAs) are shorter
than 50 bp. Those adapters need to be removed prior to further analyses of the data. Adapter
sequences are marked. In one case, the full adapter was sequenced. This data was partially used by
Mostajo et al. [12].
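
The two preprocessing steps described above can be illustrated with a deliberately simplified Python sketch. It is not the algorithm of PRINSEQ or CUTADAPT: it only handles exact (error-free) adapter matches and trims low-quality bases from the 3’ end only. The adapter constant is an assumption, namely the sequence shared by the example reads of Fig. 2.8; the function names, the quality threshold of 20, the minimum overlap of 5 and the Phred+33 encoding are likewise assumptions made for illustration.

```python
# Assumed adapter start: the sequence shared by the example reads in Fig. 2.8
ADAPTER = "TGGAATTCTCGGGTGCCAAGG"


def trim_low_quality_3prime(seq, qual, min_q=20, offset=33):
    """Remove trailing bases whose Phred score is below min_q."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


def trim_3prime_adapter(read, adapter=ADAPTER, min_overlap=5):
    """Clip an exact adapter match, or an adapter prefix at the read end."""
    idx = read.find(adapter)
    if idx != -1:
        return read[:idx]  # full adapter found inside the read
    # Otherwise look for the longest adapter prefix that ends the read
    longest = min(len(read), len(adapter) - 1)
    for k in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    return read  # no adapter detected


# Last example read of Fig. 2.8: a 22 nt insert followed by adapter sequence
read = "TAGCAGCACGTAAATATTGGCGTGGAATTCTCGGGTGCCAAGGAACTCCAG"
print(trim_3prime_adapter(read))                   # TAGCAGCACGTAAATATTGGCG
print(trim_3prime_adapter("ACGTACGTTGGAAT"))       # ACGTACGT
print(trim_low_quality_3prime("ACGTAC", "IIII##"))  # ('ACGT', 'IIII')
```

Real tools additionally tolerate sequencing errors within the adapter and offer further filters (for example minimum read length), which is why they should be preferred for actual data.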

2.2 Assembly
Recent advances in NGS technologies make it possible to generate incredible amounts of
sequence data. Quite recently, Illumina announced two new high-throughput
sequencing systems, the HiSeq X Ten and the NextSeq 500, stating that the latter would
be able to produce a sequencing throughput of 120 Gb (gigabases) per run,
including 400 million paired-end reads with a length of up to 150 bp each (see footnote 1). In the context of this
continuing evolution of high-throughput sequencing technologies, we are able to
produce large amounts of read data in a cost-effective and time-saving way. In particular,
massively parallel cDNA sequencing (RNA-Seq) has become established as a major tool for
transcriptome quantification and analysis [32] (Sec. 2.1.1).
Nevertheless, as a typical eukaryotic genome is millions to billions of nucleotides in size, no
current sequencing technology can decode such long sequences in one shot. Of course,
recent advances in NGS [29] have led to an increase in read lengths, but short-read
methods like Illumina (Sec. 2.1.2) are still widely used. They are well established
and have a comparatively low error rate combined with still decreasing costs per
sequenced base. As one major step of NGS involves the fragmentation of the DNA
(or RNA), all small pieces (reads) subsequently need to be computationally merged
to rebuild the genome (or transcriptome) sequence.

Footnote 1: https://www.illumina.com/content/dam/illumina-marketing/documents/products/datasheets/datasheet-nextseq-500.pdf

This can be done either according to an already known reference sequence
(reference-based assembly) or without the use of prior knowledge (de novo assembly). If a
reference genome (or transcriptome) is available, the reads can be mapped (Sec. 2.4)
to the reference and subsequently assembled with tools like Cufflinks [46] or
Scripture [47] by clustering overlapping reads that align to nearby positions
in the genome. If no reference is available, or if the reference might not be complete,
the direct de novo assembly of RNA-Seq reads is an alternative strategy.

2.2.1 Building genomes


De novo genome assemblers for short read data can be divided into two major types
based on their graph representation of short reads: overlap-graph and de Bruijn
graph [48] assemblers. Traditionally, so-called overlap-consensus methods were used
for the assembly of Sanger reads by defining overlaps between the various long reads.
The first approach arranges all short reads as nodes in a graph. If there is an
overlap between the sequences of two nodes, they can be connected with an edge and
are called adjacent. The graph construction involves pairwise alignments to find
overlaps between short reads. By traversing the constructed overlap-graph, adjacent
short reads can be merged into longer contigs (contiguous sequences of DNA). The
de novo genome assembly process becomes equivalent to the Hamiltonian path
problem [49] of finding a path in a graph that visits every vertex exactly once.
Unfortunately, this problem is NP-complete and therefore no efficient algorithm is
known that solves it exactly. In addition, given the increasing amount of short read
data of up to billions of reads per sample, the overlap-graph representation leads
to memory- and time-consuming algorithms.

[Figure 2.9 (schematic): the read ATGCGAGGGTGCAATCGA is split into the 7-mers
ATGCGAG, TGCGAGG, GCGAGGG, CGAGGGT, GAGGGTG, AGGGTGC, GGGTGCA, GGTGCAA, GTGCAAT,
TGCAATC, GCAATCG and CAATCGA, which form the nodes of the de Bruijn graph.]
Figure 2.9: A very simple de Bruijn graph. All substrings of length k = 7 are
generated from each read, followed by the construction of a directed graph by
connecting k-mers with overlaps between the first k − 1 and the last k − 1
nucleotides. For non-strand-specific RNA-Seq data, the reverse complement of each
k-mer would also be part of the graph. Chains of adjacent nodes in the graph can be
connected to form the genome/transcriptome. Repetitive sequence elements,
single-nucleotide polymorphisms, sequencing errors, insertions and deletions as well
as different isoforms (RNA-Seq) make the de Bruijn graph much more complex and in
many cases not clearly solvable.

Instead, most of the current de novo assembly tools are based on de Bruijn
graphs [49] (Fig. 2.9). A de Bruijn graph is similar to an overlap-graph in that it
also relies on overlaps between sequences. Instead of using the whole read sequences
as nodes in the graph, the de Bruijn method uses unique subsequences of the reads to
represent nodes. These subsequences are called k-mers and represent substrings of
the reads of length k. An edge in the de Bruijn graph describes an overlap of length
k − 1 between two k-mers. Constructing the de Bruijn graph can be done in linear
time because no pairwise alignments are needed as in the overlap-graph approach.
Thus, the assembly process becomes the problem of finding a path in a graph that
visits every edge exactly once, also known as an Eulerian path, which can be solved
efficiently [50]. However, the entire graph has to be held in memory for assembly,
so de Bruijn approaches can become very memory-intensive for large datasets.
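The graph construction described above can be illustrated with a minimal Python sketch. It is not taken from any of the assemblers discussed here; the function names (kmers, build_graph, walk_chain) are hypothetical, k-mers are used as nodes as in Fig. 2.9, and reverse complements, sequencing errors and branching are deliberately ignored.

```python
from collections import defaultdict


def kmers(read, k):
    """Return all substrings of length k in read order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_graph(reads, k):
    """Connect each k-mer to the k-mers that overlap it by k - 1 nucleotides."""
    nodes = set()
    for read in reads:
        nodes.update(kmers(read, k))
    # Index the nodes by their (k-1)-prefix, so no pairwise alignment is needed.
    by_prefix = defaultdict(list)
    for node in nodes:
        by_prefix[node[:-1]].append(node)
    # An edge u -> v exists if the (k-1)-suffix of u equals the (k-1)-prefix of v.
    return {node: sorted(by_prefix.get(node[1:], [])) for node in nodes}


def walk_chain(graph):
    """Spell the contig of an unbranched chain, starting at its unique start node."""
    has_incoming = {succ for succs in graph.values() for succ in succs}
    starts = [node for node in graph if node not in has_incoming]
    if len(starts) != 1:
        raise ValueError("graph has no unique start node")
    node = starts[0]
    contig = node
    seen = {node}
    while len(graph[node]) == 1 and graph[node][0] not in seen:
        node = graph[node][0]
        seen.add(node)
        contig += node[-1]  # every step adds exactly one new nucleotide
    return contig


read = "ATGCGAGGGTGCAATCGA"  # the read shown in Fig. 2.9
graph = build_graph([read], k=7)
contig = walk_chain(graph)
```

For this single error-free read the graph is one unbranched chain of twelve 7-mers, so the walk simply spells the read again. With repeats, sequencing errors or isoforms, nodes gain several successors and such a naive walk stops at the first branch.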
Some prominent and still widely used de novo genome assemblers based on
de Bruijn graphs are Velvet [51], ABySS [52] and SOAPdenovo2 [53]. Mira
(http://www.chevreux.org/projects_mira.html), which is based on overlap-graphs,
was already developed in the early 2000s. Another, more recently published
de novo genome assembler is SPAdes [54, 55]. SPAdes was originally developed
for smaller bacterial-size genomes and single-cell data, but can also be applied to
standard isolates and other organisms, although it was not fully tested for larger
genomes.
In principle, all of the above-mentioned de novo genome assemblers can be applied
to transcriptomic short read data [56]. However, the assembly of transcriptomic data
has some special requirements in contrast to genome assembly. Whereas genomic short
reads should be more or less uniformly distributed over the whole genome, the
coverage of transcriptomic reads can differ by several orders of magnitude, because
transcripts are expressed at both low and high levels. Furthermore, de novo
genome assemblers were developed to reconstruct as few contiguous DNA sequences as
possible while maximizing their length. In a transcriptome assembly, however, we
expect thousands of sequences (transcripts) and, on top of that, different isoforms
originating from alternative splicing. These challenges need to be taken into
account during the assembly of transcripts.

2.2.2 Building transcriptomes


The transcriptome comprises all transcribed molecules within a single cell expressed
at a certain time and under specific environmental conditions. Its reconstruction
spans the identification of transcripts and their isoforms, which can originate from
alternatively spliced genes. To reproduce the transcriptome of a specific sample, we
have to assemble millions of short sequence reads produced by high-throughput
sequencing technologies like Illumina [57]. The short reads are assembled into a
collection of transcripts where each one should represent a full-length transcript
as well as possible [58]. The information gained from the transcriptome allows
researchers to identify all expressed genes within a sample, detect alternatively
spliced isoforms and capture differentially expressed transcripts between different
conditions, providing a deep understanding of regulatory mechanisms within a cell.
For various organisms whose genomes are well known and mostly annotated,
RNA-Seq can be used to generate the transcriptome based on millions of short reads
mapped back to the respective reference (see Sec. 2.4). This approach is also known
as mapping-first or reference-based assembly. A big advantage of the reference-based
transcriptome assembly is that mapping the reads to the genome first reduces the
number of short reads and the possible connections between them, which transforms
one large assembly problem into smaller ones. On the other hand, the success
of a reference-based strategy depends heavily on the quality of the used reference
genome. Most genome assemblies contain many errors like misassemblies or
insertions and deletions [59] and can therefore lead to biased or partially
assembled transcripts. Of course, a reference-based assembly is not possible if no
reference is available.
Thus, next to the reference-based approach, a second one, called de novo
transcriptome assembly, becomes essential to provide a meaningful alternative
for reference-free transcriptome analysis. For organisms lacking an appropriate
reference genome, de novo transcriptome assembly comes into play to produce
transcripts from scratch (just out of the read data gained from RNA-Seq
experiments). In theory, with a de novo assembly of RNA-Seq reads it should be
possible to reconstruct all of the full-length transcripts and their isoforms
represented in a sample. In reality, the de novo assembly of short sequence reads
is much more difficult and technically challenging (Fig. 2.10).

[Figure 2.10 (schematic): a gene with the exons a, b and c, Isoform 1, Isoform 2
and the reads mapped to them.]
Figure 2.10: Shown is a gene with three exons and two different isoforms. Isoform 2
is missing the second exon (b). Whereas the goal of a genome assembler would be to
construct the longest possible sequence (and so connect all three exons), in a
transcriptome assembly we want to find both isoforms. This might be possible if we
can find split reads (green), supporting a direct connection between exon a and c.
Of course, in the case of a de novo assembly, we would not have this prior
knowledge about the structure of the two different isoforms.
As RNA-Seq became more and more important for analyzing the transcriptome and
differential expression of non-model organisms, the need for specialized
de novo transcriptome assembly tools emerged. Transcriptome assemblers like
Oases [60] and Trans-ABySS [61] were built on top of existing genome
assemblers, Velvet and ABySS, respectively. At first, the underlying genome
assembly tool needs to be executed for a range of different k-mer values. The second
step involves the merging of the different contig sets into transcripts, where
alternative splicing events are also taken into account. However, merging the
contigs assembled from different k-mers introduces redundancies into the assembly.
Therefore, de novo transcriptome assembly remains a big algorithmic challenge
in bioinformatics.
Another tool particularly developed for de novo transcriptome assembly of massive
amounts of short cDNA reads is Trinity [62]. Trinity is also based on de Bruijn
graphs and uses three major algorithms for the reconstruction of full-length
transcripts, which address problems like alternatively spliced and paralogous
fragments. Based on the principles of SOAPdenovo2, Xie et al. presented
SOAPdenovo-Trans [63], a de Bruijn transcriptome assembler which combines
novel insights from Oases and Trinity.


Finally, there are not only de Bruijn based transcriptome assemblers:
Mira, which is based on overlap-graphs as stated above, can also be applied to
RNA-Seq short reads using a special EST mode [64].
One approach to overcome the problem of unevenly distributed short reads
in RNA-Seq data is the use of multiple k-mer values instead of one to build the
de Bruijn graph [65]. The value of k has a big influence even on de novo genome
assembly, but especially for transcriptomes it can be used to handle both lowly and
highly expressed transcripts more efficiently. Using different k-mers to represent
the short reads of an RNA-Seq experiment provides a better representation of
transcript isoforms resulting from alternative splicing events. The assembled
contigs based on a range of different k values can be merged together to generate
longer transcripts and a better overall assembly [66]. Nevertheless, the efficient
merging of assemblies resulting from different k-mers and/or tools is still a
challenging task. On the one hand, very similar isoforms should not be merged into
one transcript, e.g. a smaller isoform that is simply a subsequence of a larger one.
On the other hand, merging the contigs of different de novo assemblies can introduce
high redundancy in the final assembly if similar transcripts are not efficiently
merged, thereby slowing down and complicating further analyses.
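The trade-off in merging can be made concrete with a small Python sketch. It is only an illustration under simplifying assumptions and not the strategy of any tool mentioned above: contigs are plain strings, only exact duplicates (including reverse complements) are collapsed, and contigs that are contained in a longer one are reported but kept, because they might be genuine shorter isoforms. The function names and the toy contigs are hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def merge_contig_sets(contig_sets):
    """Pool contigs from several assemblies and drop exact duplicates.

    A contig and its reverse complement count as the same sequence.
    Contigs contained in a longer contig are kept but reported separately.
    """
    unique = {}
    for contigs in contig_sets:
        for seq in contigs:
            key = min(seq, reverse_complement(seq))  # strand-independent key
            unique.setdefault(key, seq)
    merged = sorted(unique.values(), key=len, reverse=True)
    contained = [
        short for short in merged
        if any(short != long_ and
               (short in long_ or reverse_complement(short) in long_)
               for long_ in merged)
    ]
    return merged, contained


# Hypothetical contigs from two assemblies with different k values.
k21 = ["ATGGCGTACGTTAGC", "GGCGTACG"]
k31 = ["GCTAACGTACGCCAT", "TTTTGGGGCCCC"]
merged, contained = merge_contig_sets([k21, k31])
```

Here the first contig of the second set is the reverse complement of the first contig of the first set and is therefore dropped, whereas the short contig GGCGTACG is only flagged as contained. Whether such a contig is redundant or a real isoform cannot be decided from the sequence alone.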

2.3 Annotation

A good annotation of a genome or transcriptome assembly is essential for further
investigations, like the identification of differentially expressed genes (Sec. 2.5).
The annotation tells us the boundaries of genes, which we can use to quantify
RNA-Seq reads that were previously mapped back to a reference genome or
transcriptome (Sec. 2.4).
A widely used approach for the annotation of protein-coding genes is the search
for homologous sequences with BLAST [67]. By using BLAST, sequences of unknown
function can be classified and annotated by comparison with a database of known
sequences (like GenBank or Ensembl). For example, de novo assembled contigs can
be compared to sequence databases in order to annotate them based on their
similarity to known genes or proteins.
The annotation of non-coding genes can be much more difficult, because in general
ncRNAs tend to be more conserved in their secondary structure than at the sequence
level. Therefore, specialized tools that also take the secondary structure into
account, like Infernal [68], should be used to annotate non-coding genes. GORAP [69]
is a pipeline that combines different tools for the annotation of different ncRNA
classes; it was developed and is frequently used in our group.
A further advantage of RNA-Seq data is the improvement of already existing
annotations. Often, gene boundaries including UTRs are not correctly annotated.
By using RNA-Seq data, those boundaries can be adjusted and current
annotations further improved.


2.4 Mapping
Mapping describes the process of generating alignments of DNA- or RNA-Seq reads
to a reference genome or transcriptome. The general goal when mapping sequenced
reads is to determine for each read (or read pair) the true location (origin) with
respect to the reference. Therefore, mapping tools must be able to align millions
of short sequences to a huge reference sequence in a reasonable amount of time.
Generally, the process of mapping involves three major steps: 1) the creation of an
index of the reference sequence (just once), 2) the independent alignment of the
reads of each sample to the reference, and 3) the conversion of the resulting
mapping file for further downstream analyses like read quantification (Sec. 2.5.1)
and differential gene expression analysis. Many programs have been developed to map
reads to a reference sequence, varying in their algorithms and speed.
Some commonly used tools [70] for DNA-Seq data are BWA [71] and Bowtie [72]
and for RNA-Seq data TopHat [73], HISAT [74], STAR [39], and Segemehl [75].
The distinction between a DNA- and an RNA-Seq mapper is important, because
reads originating from transcripts of higher organisms need to be mapped in relation
to possible exon-intron junctions in the reference sequence (Fig. 2.10). Therefore,
so-called splice-aware mappers like TopHat and Segemehl were developed. Whereas
most splice-aware mapping tools only allow for one split per read, Segemehl is
able to split a read multiple times. This behavior is especially advantageous if
longer reads are mapped to the reference and/or short exons are involved.
Furthermore, it is very likely for some reads to be mapped with the same quality
to multiple locations in the reference. This might be due to repetitive regions
present in a genome or similar transcript isoforms sharing the same exons. Such
ambiguously mapped reads need to be considered in the downstream analysis
(Sec. 2.5.1).
The SAMtools software suite [76] is widely used to convert, sort and index
alignment files (SAM/BAM) produced by current mapping tools.
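The three mapping steps and the origin of ambiguously mapped reads can be sketched with a toy example in Python. This is a strongly simplified illustration and not the algorithm of any of the mappers named above: the index is a plain dictionary of k-mer positions, reads must match the reference exactly, and neither mismatches nor split reads are handled. All names and sequences are hypothetical.

```python
from collections import defaultdict


def build_index(reference, k):
    """Step 1: index every k-mer of the reference by its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def map_read(read, reference, index, k):
    """Step 2: use the first k-mer of the read as seed, then verify the full read."""
    hits = []
    for pos in index.get(read[:k], []):
        if reference[pos:pos + len(read)] == read:
            hits.append(pos)
    return hits


def classify(hits):
    """Step 3: summarise a read as unmapped, unique or multiple mapped."""
    if not hits:
        return "unmapped"
    return "unique" if len(hits) == 1 else "multiple"


# Toy reference in which the segment ACGTTGCA occurs twice (a repeat).
reference = "ACGTTGCAGGATCCACGTTGCATTT"
index = build_index(reference, k=4)
reads = {"r1": "GGATCC", "r2": "ACGTTGCA", "r3": "CCCCCC"}
result = {name: map_read(seq, reference, index, 4) for name, seq in reads.items()}
status = {name: classify(hits) for name, hits in result.items()}
```

The index is built once and reused for every read. Read r2 falls into the repeated segment and is therefore reported at two positions with identical quality, which is exactly the kind of ambiguity that has to be resolved during read quantification.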

2.5 Gene expression analyses


The ultimate objective of many RNA-Seq experiments is to identify features (e.g.
genes) of the transcriptome that are differentially expressed between or among
individuals or tissues that differ with respect to certain biological conditions. As
reads derived from RNA-Seq data correspond well with real RNA abundances [31]
(Fig. 2.4), these abundances can be compared in order to identify significant
differences in the transcriptome. A typical RNA-Seq pipeline to identify
differentially expressed genes (DEGs) is shown as an example in Fig. 2.11.
To do this, we need to extract count data, i.e. the number of reads that map to
a certain annotated feature like a gene or a transcript. Beforehand, alignment files
need to be generated as described in Sec. 2.4. If no reference sequence exists, a
de novo transcriptome assembly (Sec. 2.2) can be constructed and used for mapping.
The obtained count data will then serve as a proxy for the magnitude of a gene's
expression, because transcripts that are more abundant in the cell will have more
reads generated from libraries prepared for RNA-Seq. The gene expression can vary
among samples, for example between a human monocyte cell infected with a fungus
and additionally treated with the vitamins A and D [5, 6] (Sec. 5.1) or between a
control HuH7 cell and a HuH7 cell infected with the Ebola virus [4] (Sec. 5.2).
By performing such experiments, a broad variety of biological questions can be
addressed, like: How many genes are significantly differentially expressed between
treatments? Can DEGs be clustered in functional groups? Are they involved in the
same pathways?

2.5.1 Read quantification and normalization


One of the important steps after mapping is the quantification of the reads that
align to certain features. A feature can be everything that is described in a
corresponding annotation file, like a contig, gene, transcript or exon. Commonly
used tools for read quantification are HTSeq-count [77] and FeatureCounts [78].
Depending on the parameters used, the tools count only reads that map exactly
within the boundaries defined in the annotation file, or only those reads that map
to one place across the entire reference. The use of only uniquely mapped reads
instead of all (multiple) mapped reads can heavily affect further downstream
analyses. There are numerous reasons why a read may map to multiple locations in a
reference, such as high sequence similarities due to gene duplications or
homologous regions across genes, or errors introduced in the reference sequence
during the assembly process. Especially when a transcriptome assembly is used as
a reference, different isoforms of one gene can lead to many multiple mapped reads.
The most conservative and often used approach to deal with such ambiguously mapped
reads is to only consider those that map uniquely to the reference (Fig. 2.12).
Unfortunately, this approach can also result in losing interesting differentially
expressed candidates, for example in a small RNA sequencing run with short reads
mapping to many similar microRNAs. Therefore, one always has to keep in mind the
advantages and disadvantages of including or removing ambiguously mapped reads
in a differential expression study.

[Figure 2.11 (flow chart): raw reads (NGS), quality check (FastQC), trimming
(PRINSEQ), mapping (TopHat), postprocessing (SAMtools), read quantification
(HTSeq-count), normalization, statistics and differential gene expression (DESeq2);
label: RStudio.]
Figure 2.11: A typical RNA-Seq pipeline to detect differentially expressed genes
(DEGs). Raw reads obtained from NGS (Sec. 2.1) are quality controlled and, if
necessary, trimmed and adapter clipped. Processed reads of each sample are then
separately mapped to a reference sequence (Sec. 2.4). The prepared mapping files
can be used to estimate transcript abundances by counting reads mapping to
annotated features. The resulting read count tables need to be normalized (e.g.
with regard to different sequencing library depths) and can finally be compared to
identify DEGs. Read counting and differential expression analysis are often done in
R, a widely used programming framework for statistical evaluations.
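The effect of the two counting strategies can be illustrated with a few lines of Python. The sketch does not reproduce HTSeq-count or FeatureCounts; it assumes that the mapping step has already been summarised as a dictionary from read name to the list of genes the read maps to, and all read and gene names are hypothetical.

```python
from collections import Counter


def count_reads(alignments, unique_only):
    """Count reads per gene from a mapping {read name: [genes it maps to]}."""
    counts = Counter()
    for genes in alignments.values():
        if unique_only and len(genes) != 1:
            continue  # discard ambiguously mapped reads
        for gene in genes:
            counts[gene] += 1  # a multiple mapped read is counted for every gene
    return counts


# Hypothetical alignments: r3 and r4 fall into a region shared by both genes.
alignments = {
    "r1": ["geneA"],
    "r2": ["geneA"],
    "r3": ["geneA", "geneB"],
    "r4": ["geneA", "geneB"],
    "r5": ["geneB"],
}
unique_counts = count_reads(alignments, unique_only=True)
all_counts = count_reads(alignments, unique_only=False)
```

With unique reads only, geneA and geneB receive 2 and 1 reads; if all alignments are counted, they receive 4 and 3. Neither table is necessarily closer to the biological truth, because the true origin of r3 and r4 is unknown.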
[Figure 2.12 (schematic): Gene A and Gene B with the reads mapped in the control
condition and in the treatment condition.]
Figure 2.12: Shown are two genes with a homologous sequence part. Therefore, some
of the reads in the control (top) and treatment (bottom) condition map uniquely to
gene A and B, but some reads are also multiple mapped (light blue). Let the
biological truth be that all ambiguous reads are originally derived from gene A. If
we now counted all reads (including the multiple mapped ones) and compared the read
counts between control and treatment, gene B might appear differentially expressed,
because the multiple mapped reads raise the read count in the treatment condition.
We would possibly detect gene B as a false positive DEG. On the other hand, if we
remove the ambiguous reads completely (because we do not know for sure where they
originated from), the expression of gene A in the treatment condition is much lower
and so is the calculated fold change (Sec. 2.5.2).

Another crucial step after read quantification involves the normalization of the
counts. As the total read numbers sequenced for each sample vary, raw read counts
need to be normalized according to the library size. Otherwise, a gene could appear
2-fold higher expressed in one sample simply because twice as many reads were
sequenced. Additionally, the gene length also has an impact on the overall number
of sequenced reads. Statistically, more fragments are derived from a longer
transcript during the library preparation step and so, just by chance, more reads
are sequenced. One approach to normalize for the library size and feature length
simultaneously is the calculation of transcripts per million (TPM [79]) values:
\[
\mathrm{TPM}_i = \frac{c_i}{l_i} \cdot \left( \frac{1}{\sum_{j \in N} \frac{c_j}{l_j}} \right) \cdot 10^6
\]

where c_i is the raw read count of gene i, l_i is the length of gene i and N is the
set of all genes in the given annotation. TPM values can be used to filter out
lowly expressed genes with respect to their length.
Another widely used normalization unit is RPKM or, in the case of paired-end
data, FPKM [46]. Both units express the same quantity: reads (fragments) per
kilobase per million mapped reads (fragments), i.e. the number of reads aligning
to a feature (like a gene), normalized by the total number of reads mapped (in
millions) and the length of the feature (in kilobases). In contrast, the TPM value
is the number of reads from a particular feature normalized first by the feature
length, and then by the sum of all length-normalized counts in the sample (scaled
to one million). Recently, it was shown that the widely used R/FPKM measures are
inconsistent among samples [80]. This inconsistency essentially arises from
dividing by the total read count of the library rather than by the sum of the
length-normalized counts. Therefore, the TPM value is a more consistent measure
than R/FPKM.
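The difference between both units can be checked numerically. The following Python sketch implements the TPM formula given above and the usual RPKM definition for two hypothetical samples; the gene names, lengths and counts are invented for illustration only.

```python
def tpm(counts, lengths):
    """TPM_i = (c_i / l_i) / sum_j(c_j / l_j) * 10^6."""
    rates = {gene: counts[gene] / lengths[gene] for gene in counts}
    total = sum(rates.values())
    return {gene: rate / total * 1e6 for gene, rate in rates.items()}


def rpkm(counts, lengths):
    """RPKM_i = c_i / (l_i in kilobases) / (total mapped reads in millions)."""
    total_reads = sum(counts.values())
    return {
        gene: counts[gene] / (lengths[gene] / 1e3) / (total_reads / 1e6)
        for gene in counts
    }


# Hypothetical gene lengths (nt) and raw read counts of two samples.
lengths = {"g1": 1000, "g2": 2000, "g3": 4000}
sample_a = {"g1": 100, "g2": 200, "g3": 400}
sample_b = {"g1": 100, "g2": 200, "g3": 1600}

tpm_a, tpm_b = tpm(sample_a, lengths), tpm(sample_b, lengths)
rpkm_a, rpkm_b = rpkm(sample_a, lengths), rpkm(sample_b, lengths)
```

By construction, the TPM values of every sample sum to one million, so a TPM value always denotes the same share of the sample. The sum of the RPKM values, in contrast, depends on how the reads are distributed over genes of different length and differs between the two samples in this example.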
However, when searching for differentially expressed genes between different
samples, a normalization based on the gene length is not necessary, because we will
only compare the expression of the same genes in different samples (and therefore only

24
2.6. Visualization

A Mock vs RVFV 6 h p.i. B Mock vs RVFV 24 h p.i.

6
● ●

● ● ●
● ● ●

● ●


● ●
● ● ● ● ●



● ●
● ●

● ●

● ● ●

4

4
● ● ● ● ●
● ● ●
● ●
● ●● ● ● ● ● ●
● ● ● ● ● ●

● ● ●
● ● ● ●● ●
● ● ● ●
● ● ●
● ●
● ● ● ● ●
● ● ● ● ● ●
log2 fold change

log2 fold change


● ●● ● ●
● ● ● ● ● ●●
● ● ●● ● ●● ● ●● ●● ●
● ● ● ● ● ●
● ● ●
● ● ● ●● ● ● ●● ●
● ● ● ● ●●
● ● ●● ● ●
● ● ● ● ●● ●●● ●● ●● ●
● ●
● ● ● ● ● ● ● ●●● ●● ● ● ●
● ● ●● ● ● ● ● ● ● ●
● ●●● ●● ● ● ●●●● ●●● ● ● ●
●●
● ● ●
● ● ● ● ● ●● ● ● ● ● ● ● ●
● ● ● ● ●● ● ● ●● ●●● ● ●
2

2

●● ●● ●● ● ● ●●● ● ●● ●● ● ●
● ● ●
●●
● ● ● ●● ●● ● ● ● ● ●● ● ● ●● ●
● ●
● ● ●
● ● ● ● ●● ● ● ●● ● ● ● ● ● ●
●●●

●● ● ●● ●● ● ●●● ● ● ● ● ● ●
● ● ● ●● ●● ● ●●●●●● ● ●●● ●●● ●●● ●●●● ● ● ●● ● ● ● ●
● ● ●● ● ●● ● ●●
● ● ● ● ● ●●● ●● ● ● ●● ●●● ●● ● ●
● ● ● ●●●●● ● ● ●● ●●● ●● ● ●●●●
● ● ● ● ●●
● ● ● ●● ●● ● ●● ● ●●

● ● ●● ● ● ● ● ●● ● ●● ● ●● ● ● ● ● ●● ● ●●●● ● ●
●● ●● ● ● ● ● ●● ● ● ●● ● ●● ●
● ●●● ●● ● ●●● ● ●


● ● ●
● ●● ● ● ● ● ● ●● ● ●●●● ● ●
● ● ●
●●
●● ● ●● ● ● ●●
● ● ● ● ●●●● ●●●● ● ● ● ●●● ●●

●●● ●
●●●●● ●●●● ●
●● ●
●●● ● ●●●● ●●● ●●●
●●

● ● ●
● ● ●● ●● ● ● ● ●●●● ●● ● ● ●
● ●●● ● ●
●●●
●● ●●● ●● ● ● ●● ●●●● ● ●
● ● ● ●● ● ● ● ● ● ●●● ●●●● ● ●●●● ● ●● ●●● ● ●●●● ●●●●●●●●
●●


●●●● ●●●
● ●●● ● ●● ● ●●●
● ● ● ● ●● ●
● ● ● ●●● ● ● ● ● ●●● ● ●● ●●●● ● ●●●● ●●●● ●● ● ●●● ●●● ●●●
●● ●● ●
●●●
● ●●● ●
●●●
●●●●●●●
● ●●
●●




●●●
● ●● ●●

●● ● ●●●● ●●● ● ●●● ●● ●
●● ●
● ●● ● ● ●● ●
● ●● ●● ● ● ● ● ● ● ● ● ● ●● ● ●
●● ● ●●●●●● ● ● ●
● ● ●●
● ● ●
●●● ● ●
●●●
● ●● ●
● ●● ●● ● ● ●●
●●●●● ● ● ●
●● ● ● ● ●
● ● ● ● ●● ● ● ● ● ● ●●● ● ●●
●●
●●●● ●●● ●●● ●●
●●● ●●● ● ●● ●● ●● ●● ●● ●● ●●
● ● ● ● ● ● ● ●●
● ●
● ●● ● ● ●● ●● ●● ● ● ●●●● ●●●● ●● ● ●● ●● ●●● ●
● ●●
●● ●●●●●●
●●
●●●
●●●●●●●●● ●
● ●
●●
● ● ●●●
● ●
●●
●●●
●● ●●




● ●●●●●●
●●●●●
● ●●
●● ● ●
● ●●
●●●●●
● ●● ● ● ●●● ●● ● ●●●
● ●●●●
●● ●●● ●● ●● ●● ●
● ●● ●

●●
●● ●●
●●

●●●
● ● ●
●●●●●
●● ●
●●●●●
● ●● ●
●●●●●
● ●●●●
●●

●● ●
● ●●

●●




●●


●●
●●
●●●● ●●
●●●●●●● ●●●●●● ●●●●● ● ●●

● ● ● ●●●
● ● ●● ● ● ● ●● ● ● ● ● ●● ● ● ●● ●

● ● ●● ● ● ●●
●● ● ●●● ●●●●●●●● ● ● ●●
●● ●●●●●



● ●● ●


●●● ●●
● ●


●●●


● ●●●
●● ●●●●
●●●

●●●
●●


●●●
● ● ●


●●
●●



●●
● ●●●●
● ●●
● ● ●
●●
●●●●






● ●
●●●●●●
● ●● ●●●● ●● ●●●● ● ● ● ●
●● ●● ● ● ● ● ●

● ● ●●●●● ●●
●●
● ● ●●● ● ● ●●

●●● ●●●
● ● ● ●● ●●
●●●
●●

●● ●●●

●●

●● ●●
●●●●●

●●●●●●●
●●●

●● ●●●● ● ●

● ●

●●●

●●
●●●

●●
●●●
●●

●●●

●●●
● ●

●●


●●
●●

●●●
●●


●●

●●●●●
●●
●●●●
● ●●
●●
●●
● ●
●●

●● ●
●●●●●

●●

●●
●●●
●●
●●

●●●●

● ●●●
●●● ●
● ●● ●
● ● ● ● ● ● ● ● ●
● ● ● ● ●● ● ● ● ● ● ●● ● ● ●●
● ●●●● ●● ●
●●
●●●●●
● ● ●
●●●
●●
●●●
●●

●●


●● ●●

●●
● ●●
● ●● ● ●●●●●●
●●●●●● ●●●

●●●●
●●●●
●● ●
● ●
●● ●

●●●● ●
●●
●●●●
● ●●

●●●●

●●
● ●
●●
●●
● ● ●
●●●●









●●●


●●

●●




●●
●●●●

●●


●●













●●●●●

●●●
●●●●●●●
● ●●
●●




●●







●●
● ●





●● ●
●●
● ●●●●
●● ●● ●●●
● ●●● ●● ●

● ●● ● ●●●
● ●● ●● ●● ● ● ● ● ●● ● ● ●● ●● ●● ● ● ● ●●● ● ●
● ● ●
●●●
● ●●●●
●●

● ●
●● ●● ●●
● ●
●●
●●

●●●●●● ●
●●●● ●● ●●● ●
●●
●●● ●● ●●●●●

● ●●

● ●
●●●● ●
●●

●●●

●● ●

●● ●
●●

●●

●●
●●

●●●
●●● ●●

●●

●●●●●
● ●●
●●

●●

●●●●

●●●

● ●●
●●

●●●

●●
●●
●●
●●
●●
●●●

● ●
● ●


●●

●●
●●
●●
●●

●●●


●● ●
●●
●●




●●

●●●●
● ●

●●●
●●


●●
●●
●● ●




●●

●●
● ●
●●
● ●●●
● ●●●
● ●●
● ●●

●● ●● ● ●● ●
●● ● ● ● ● ●●
● ● ●● ● ● ●● ●●●● ●●●●● ●●●●●● ● ● ● ● ● ●●●● ●●●● ●●●●●●● ●●

●●●●

●● ● ●●●● ●●●● ●● ●●
●●●●●● ● ●● ● ● ●●● ●●
●● ●●●●
●●● ●● ●● ●
●● ● ●●


●●

●●
●● ●●

● ●
●●
●● ●●● ●●
●●
●● ●


●●●●●

●●
●●
● ●

● ●●●
●●
●●
●●


●●
●●

● ●
●●●


●●●
●●


●●

●●
●●

● ●
●●

●●●●●

●●
●●

●●● ●

●●
● ●
● ●
●● ●
● ●●
●●●
● ●●

●●●
●●●●
●●● ●●
●●

●●●
●●
●●●●●
●●●●●
●● ● ●
● ● ●
●●●●
●●
●●● ● ●●● ● ● ●●●● ● ●● ● ●● ● ●● ●● ●
●● ● ● ● ●●● ●●●●●● ● ● ● ●●
●● ●
●●●●●● ●●●● ● ●
●● ●●●●●●● ●
●●
●●●●●●
●●

●●
●● ●
●●
●●● ●
● ●●
●●
● ●
●●●●●● ●●●● ● ●●●●


● ●●

● ●

●●
● ● ●●
●●
●●●●●
●●●●●
●●
●●●
●●

●●●
●●
●●

●●
●●


●●
● ●
●●

●●

●●


●●
●●
●●●
●●

●●
●●
●●

●●●
●●


●●●●●
●●
●●

●●


●●
●●●

●●
● ●

●●
●●●●●●

●● ● ●● ●
●●●
●●
●● ●●●
●● ● ●●● ●
●● ●
● ●●● ● ●● ●●● ●● ● ● ● ●
● ●●● ●
●● ● ●● ● ● ●● ● ●● ● ● ● ●● ● ●
●● ●●● ● ●● ● ●● ● ●●● ● ● ● ●●● ● ●

●●
● ●
● ●●●● ●● ●
● ●
●●● ● ● ●●●●●● ●● ●● ● ● ●● ●●●● ●● ●●
●●●●● ●
●●
●●
● ●●●● ●●
● ●● ●●● ●●
●● ● ●● ●
●● ●● ●● ●●●●
●●●●● ● ●●● ● ● ● ● ●●● ● ●● ● ● ● ● ● ● ● ●● ●
● ●● ●● ● ● ●●●●
● ●
●●●
●● ● ● ●●
●●

● ●● ●
●●
● ●●
●●

●●
●●

● ●●●●●●
●●● ●●●●●
●●●
● ●●
● ●● ●
● ●
●● ●
● ●

●●● ●●

● ●
●●
● ●●●

●●●●
● ●
●●● ●
●●
●●
●● ●
● ●●●

●●●
● ●●●
●●
●●●●●

●●

●●
●●
●●●●
●●

●●●
●●
● ●

●●● ●

●●●
●●●●
● ● ●
●●

●●
●●
●●
●●
●●

●●

● ●

●●

●●
●●
●●


●●
●●

●●●

●●
●●



●●


●●●



●●●
●●●
●●●

●●

●●
●●
●●
●●


●●●●

●●

●●

●●
●●
● ●●
●●●● ●● ●
●●●
● ●●●● ●●
● ●
●●●
● ●● ● ● ● ● ●● ●● ●● ● ● ●● ● ● ●● ● ● ● ● ●●● ● ● ●
●●
● ●● ● ●●● ● ●● ● ● ● ● ●● ● ●● ● ● ● ● ● ●● ●●● ●● ● ●
● ●●●● ●●

●●●●● ●●● ●●●●● ●●●● ●●
●●

●●●●
●●

●●●●●


●●●●
●●

●●●●●


●●
●●●●●



●●●●

●● ●

●●●
● ●
●●

● ● ●

●●


●●
●●●● ●
●●


●●
● ●
●●
● ●●●●
● ●
●●●●●
●●
● ●●●
● ●
●●

●●
●●●●●●●●● ● ● ● ● ●●
● ●●
●●● ● ●● ●●
●●●
●● ●
● ●●● ●● ●● ●● ● ● ● ●
● ●●●●● ●● ●
● ●● ●
●●●● ●
●●●● ●
●●●●
●●


● ●
●●● ●●●
●●●●●● ●●●















●●


●●●

● ●
●●




●●



●●●●
●●●● ●●
●●
●●●●
●● ●●




● ●
●●

●●


●●

●●
●●
●●


●●●●●


●●



●●




●●
●●






●●

● ●

●●●●
●●●
●●●
●●
●●
●● ●


● ●

●●●
●●●

●●


●●
●●

●●●●
●●





●●●
●●●




●●●




●●


●●






●●


●●








●●
●●








●●



●●




●●
●●








●●







●●






●●●








●●


●●



●●








●●





●●






●●


●●











●●
●●

●●●


●●●

●●

●● ●

●●

●●●●
●●● ●



● ●

●●●
●●●●●
●●●●●
●●●
●●
●●●●
●● ●
● ●
●●
●● ●●●●● ●●● ●
● ●●●
●●●●●● ●●
● ● ●
●●
●●●●●
●●●
●●●●●
●● ●
●● ●●●● ●
●●

●●
●●
●●
●●

●●

●● ●●


●●●
● ●
● ●
●●
●●●
● ●

● ●
●●●●
●●●●

●●
●●
●●

●●
●●
●●
●●
●●

●●●●
●●
●●
● ●
●●
●●●●●


●●●●●
●●● ●
● ● ●
●●
●●●
●● ●●●●
●●●● ●●
● ●●●● ●●●●● ● ● ● ● ● ● ● ●●●●●
●● ●● ●● ●●
●●●●●●

●● ● ●●
●● ● ● ●●

●● ●●●
● ●●
● ●● ●
●●●
●● ●
●●
●●●
●●●
●●●●●●●
●●●
●●
● ●
●● ●
●●● ●●
●●
●●●●
●●
●● ● ● ●
● ●

●●●
● ●

●● ●

●●
● ●●
●●●
●● ●●●●
●● ●
●●

●●
●●●●

●●
●●●
●●●●
●●
●●


●●

●●
●●
●●
●●●


●●
●●
● ●
●●●●
●●●

●●
●●
●●
●●
●●●

●●


●●
●●

●●

●●
●●
●●

●●

●●●
●●


●●
●●
●●●
●●
●●

●●●
●●

●●
●●
●●

●●
●●
● ●

●●●

●●
●●●
●●
●●
●●

●●●


●●

●●

●● ●●
●●●
● ●
●●
● ●
●●●●●●●
● ● ●● ●
● ●●●● ●● ●
●● ●
●●●●● ● ●●● ● ● ● ● ● ● ●● ●● ● ●
●●● ● ●● ●● ● ● ● ● ● ●● ●●● ●● ●●
●● ●
●● ●● ● ●● ● ● ● ●●●● ●● ● ●● ● ● ● ●● ●● ● ●●● ● ●
● ● ●
●●
● ● ●
● ● ● ●● ●
● ●
● ● ● ● ● ●
● ● ● ●●● ●● ● ●●
● ●
● ●● ●●●
●● ●●● ●●

● ● ● ●● ●●●
●● ●● ●●●
●● ●●●
● ●
● ●●●
●●● ●●●●
● ●
●● ●●●

●●●
● ● ●
●● ●●●
●●
●●

●●●


●●● ●●●●
● ●●
●●● ●
●●●●

●●

● ●




● ●
●●
●●

●●
●●
●●●
●●
●●
●●
●●


●●
●●

●●
●●

●●

●●
●●
●●


●●
●●


●●
●●
●●

●●
●●
●●
●●●
●●

● ●
●●●


●●


●●●






●●●
●●

●●

●●
●●


●●




●●

●●

●●
●●








●●

●●


●●●



●●●






●●

●●

●●

●●


●●

●●
●●


●●


●●

●●
●●


●●



●●


●●

●●


●●
●●

●●



●●

●●




●●●●

●●
●●



●●

●●



●●
●●

●●
●●
●●
●●


●●
●●
●●



●●
●●

●●●



●●


● ●



●●
●● ●

●●
●●


●●●
●●
● ●
●●●●●●●
● ●●
● ●●

●●●●●
● ● ●● ●●● ●● ● ● ●●● ●
●● ●
●●●
●●●●
●●

●●●● ●●●
●● ●
●●●●●

●●
●●
●●
●●

●● ● ●● ●
●●

● ●●● ●
●●●●
● ●
●●●●
● ●


●● ●●●●●
● ●●●
●●●
●●●

●●●
●●
●●
●●●● ●
●●●●
●●
●●●
●●●
●●
●●●●●
●●
●●
●●
● ●●
● ●
●●●●

●●●● ●●●


●●●


●●
● ●●

● ●●

●●

●●●●

●●
●●●
●●

●●●
●●●●


●●
●●

●●●

●●

●●
●●

●●

●●
● ●

●●

●●
●●●
●●●

●●


●●

● ●
●●
●●●


●●


●●

●●
●●
●●



●●●

●●







●●



●●

●●

●●
●●

●●
●●



●●
●●



●●
●●●

●●●
●●


●●●
●●




●●●

●●
●●





●●
●●●



●●


●●

●●●●●
●●●

● ●
●●●

●●
●●●●

●●●
●●
●●
●●●

●●
●●
●● ●
●●●●
●●●● ● ●●● ● ●●●●●●● ● ●
●●● ●●●
●●●
● ●●●● ●●●●●●

● ●
●●●

● ●●
● ●●
● ●


●●●●●
●●

●●●● ●
● ●●●●●



●●
●●
●●
● ●●
● ●
●●●
● ●

●●
●●●
●●
●●●
●●


●●

●●●




●●



●●
●●

●●

●●●


●●

●●


●●

●●


●●
●●






●●

●●

●●

●●
●●

●●

●●


●●

●●

●●

●●●


●●










●●



●●






●●





●●●


●●

●●
●●

●●


●●
●●


●●



●●














●●

●●

●●




●●
●●


●●


●●





●●



●●




●●






●●














●●



●●



●●

●●




●●

●●


●●



●●



●●

●●




●●


●●


●●



●●


●●
●●





●●



●●



●●



●●




●●




●●



●●




●●








●●




●●
●●●




●●





●●








●●



●●


●●




●●




●●

●●



●●











●●



●●




●●



●●



●●



●●








●●





●●
●●



●●








●●


●●

●●

●●










●●


●●



●●



●●



●●

●●
●●
●●
●●


●●
●●


●●

●●

●●

●●

●●



●●
●●
●●
●●●●
● ●

●● ●●
● ●●
● ●
●●
●●
●●● ●●● ●● ●● ● ● ●●
●● ●
●● ●
●●
●●●

●●●●●
●●● ●

●●
●●
● ●

●●● ●● ●●●
●●
●●
●●●


●●

● ●●
●● ●●●●
●●●●●● ● ● ● ● ●
● ●
●●●
●●●
● ●●

●●
●●●●
●●●
● ●●
● ●●

●●●●
● ●●
●●
●●●
● ●
●●●●●●●●

● ●
●●
●●●● ●● ●
●●●
●●● ●

●●

●●




●●●●●
●●●●
●● ●
●●
● ●●

●●
●●

●●●● ●
●●
●●
● ●

●●
●●

●●
●●

●●●
●●




●●●●
●●

●●
●●

●●●
● ●

●● ●●●





●●●
● ●●

●●
●●










●●●





● ●●


●●
●●


●●



●●

●●

●●●
●●

●●






●●

●●



●●


●●
●●

●●

●●


●●●



●●
●●
●●

●●●

●●


●●


●●

●●●


●●


●●
●●

●●

●●

●●
●●

●●
●●

●●●

●●●●
●●

●●
●●

●●
● ●
●●



● ●


●● ●●




●●● ●● ● ●
●●● ● ● ●
0

● ● ● ● ● ● ● ●
●● ● ● ●

0
● ●●●
●●●
●●
●●


●●

●●
●●

● ●

●●
●●

●●●
●●

● ●

●●●●
●●
●●●●● ●●●●●●

●●●

●● ●
●●●
●●

● ●● ● ●●
● ●
● ● ●
●●● ●●●
●● ●
●●
●●
● ● ●
●●


●●●
●●●●
●●
●●

●●
●●●
●●
●●

●●●


●●
●●●●
●●

●●
●●
●●
● ●

●●
●●

●●
●●●
●●

●●●
●●●
●●
●●

●●

●●

●●



●●
●●

●●

●●


●●
●●

●●
●●
●●

●●


●●

●●


●●

●●●


●●

●●
●●
●●


●●●
●●

●●



●●


●●


●●
●●


●●

●●


●●


●●




●●


●●



●●

●●

●●



●●


●●
●●




●●


●●


●●




●●
●●●




●●




●●

●●●



●●

●●




●●



●●




●●





●●



●●


●●




●●


●●

●●



●●

●●
●●



●●




●●

●●




●●



●●

●●


●●





●●


●●




●●




●●



●●



●●



●●




●●




●●


●●

●●



●●


●●



●●

●●




●●


●●

●●

●●●

●●




●●


●●




●●


●●


●●


●●
●●

●●


●●

●●

●●





●●


●●


●●


●●


●●





●●


●●

●●

●●

●●

●●
●●

●●

●●
●●



●●

●●

●●

●●●
●●
●●
●●●

●●
●●
● ●●

● ●●● ●●● ●● ●● ●
●●
●●●
●●


●●

●●

●●
●●

● ●●

●●
●●●●
●●
●●
●●

●●●

● ●● ●
●●


● ●● ●●●●● ●

●● ●
●●

●●


● ●● ● ●●●
● ● ● ●●
●●●●
●●●●●●● ●
●●
●●●●●●●● ● ● ●
● ●
●●●●
● ●

● ●●
●●


● ●
●●
●● ● ●
●●●●●●
●●
● ●●●●

●●●●
●●●
● ●●
● ●●●
● ●●
●●● ●
●●●
●●
●● ● ●●
●●●●●●
●● ●●
●●●●
●●●
●● ●

●●●●
●●

●●●
●●
● ● ● ●

●●●

●●


●●
●●

●●●●

●●●●●


●●

● ●

●●

●●

●●●
●●

● ●
●●
●●

[Figure 2.13: two MA plots, panels (A) and (B); x-axis: mean of normalized counts (logarithmic scale, 1e−01 to 1e+05); y-axis: log2 fold change]

Figure 2.13: Two exemplary MA plots of preliminary expression data from a bat cell line
(Myotis daubentonii) infected with a Rift Valley fever virus (RVFV) clone. The non-infected
samples (mock) were compared against the virus-infected samples at 6 h and 24 h post infection
(p.i.), shown in subfigures (A) and (B), respectively. The plots show the mean of normalized read
counts (x-axis) vs. the log2 fold change (y-axis) for each analyzed gene (each dot corresponds to
one gene). Red dots indicate significantly (p-value < 0.1) differentially expressed genes as
calculated by DESeq [81]. At 6 h p.i. only few genes are differentially expressed, whereas at
24 h p.i. we observe a dramatic reaction of the bat cells to the infection with the RVFV clone.
Preliminary data were obtained from [15].

genes with the same length). Nevertheless, normalizing for different library sizes is
crucial to obtain reliable RNA abundance estimates before calling DEGs.

2.5.2 Fold changes and statistics


After read quantification, we can analyze the gene expression data to identify
differences among treatments, e.g. using DESeq [81]. This R package can be used
directly to normalize counts in relation to the library sizes and to test for differences
in gene expression. However, there are many other programs that perform similar tasks,
such as edgeR [82] or Limma [83]. All of these tools calculate a fold change for each gene,
which describes how strongly the expression of that gene differs between two conditions.
Furthermore, p-values are calculated to statistically support the differential expression
of each gene. Often, the log2 fold change is used, because on a logarithmic scale the sign
directly tells whether a gene is up-regulated (+) or down-regulated (−) (Fig. 2.13).
Within the R scripting language and with the help of various Bioconductor
tools [84] like DESeq, ReportingTools [85], Pheatmap, GAGE [86] and pathview [87],
it is further possible to directly visualize differentially expressed genes and to
go deeper into pathway analyses and functional enrichments of the DEGs.
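
The two steps named in this section, library-size normalization and the log2 fold change, can be illustrated with a minimal Python sketch. It assumes a median-of-ratios size factor in the spirit of DESeq and a plain ratio of group means; it is not the DESeq implementation, and the function names, the toy counts and the pseudocount are assumptions made only for illustration.

```python
import math
from statistics import mean, median


def size_factors(counts):
    """Median-of-ratios size factors (illustrative sketch).

    counts: list of samples; each sample is a list of raw read counts,
    one entry per gene, with the same gene order in every sample.
    """
    n_genes = len(counts[0])
    # Only genes observed in every sample contribute to the estimate.
    usable = [g for g in range(n_genes) if all(s[g] > 0 for s in counts)]
    # Log of the geometric mean of each usable gene across all samples.
    log_geo_mean = {
        g: sum(math.log(s[g]) for s in counts) / len(counts) for g in usable
    }
    # Per sample: median ratio of its counts to the geometric means.
    return [
        math.exp(median(math.log(s[g]) - log_geo_mean[g] for g in usable))
        for s in counts
    ]


def normalize(counts):
    """Divide every count by the size factor of its sample."""
    factors = size_factors(counts)
    return [[c / f for c in s] for s, f in zip(counts, factors)]


def log2_fold_change(control, treated, pseudocount=1.0):
    """log2 of mean(treated) / mean(control) for a single gene.

    The pseudocount is an assumption of this sketch; it avoids a
    division by zero for genes without reads in the control group.
    """
    return math.log2((mean(treated) + pseudocount) / (mean(control) + pseudocount))


if __name__ == "__main__":
    raw = [[10, 20, 30], [20, 40, 60]]  # toy data: second library is twice as deep
    print(size_factors(raw))
    print(normalize(raw))
    print(log2_fold_change([10, 10], [40, 40], pseudocount=0.0))
```

A positive value returned by `log2_fold_change` corresponds to up-regulation in the treated group and a negative value to down-regulation, which is the sign convention described above.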

2.6 The importance of visualization


With the previously presented NGS workflows and methods we are able to produce
an overwhelming amount of results. A typical RNA-Seq experiment involving a
eukaryotic cell line and comprising, for example, two different conditions (untreated,
infected), three time points and four biological replicates already results in the
sequencing of 24 samples. The current Ensembl annotation of the human genome
(version 85) consists of 58,051 genes, of which 19,961 are protein-coding. In a
differential gene expression analysis we can now compare all expressed genes between
different conditions and time points, which multiplies the amount of data further.
Genes can be further analyzed for differentially expressed isoforms and clustered
according to their function. With a de novo gene prediction, one of the huge
advantages of RNA-Seq data in comparison to microarrays, an incomplete annotation
can be extended, so that even more genes may be involved. The use of
different library protocols (like Ribo-Zero and smallRNA) extends the complexity
of such an RNA-Seq study even further. Researchers can easily be overwhelmed,
especially if they are not familiar with all the bioinformatics methods and the
statistics at work inside the Black Box (Fig. 2.2). Important observations may be
lost in the huge variety of results. Therefore, the presentation
and visualization of all results obtained not only from an NGS experiment, but also
from other data-rich projects, is crucial to allow other researchers to understand,
interpret, and investigate the obtained data successfully.
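
The arithmetic behind the example design can be spelled out in a few lines of Python. The time-point labels are placeholders, and the count of pairwise group comparisons is only one assumed way of comparing the groups; the source states the 24 samples and the gene numbers, nothing more.

```python
from itertools import combinations, product

conditions = ["untreated", "infected"]
time_points = ["t1", "t2", "t3"]  # placeholder labels, not from the study
replicates = [1, 2, 3, 4]

# Every combination of condition, time point and replicate is one sample.
samples = list(product(conditions, time_points, replicates))

# Replicates are pooled into groups; groups can be compared pairwise.
groups = list(product(conditions, time_points))
comparisons = list(combinations(groups, 2))

annotated_genes = 58051  # Ensembl human annotation, version 85

if __name__ == "__main__":
    print(len(samples), "samples")
    print(len(groups), "groups,", len(comparisons), "pairwise comparisons")
    # Upper bound if every annotated gene were tested in every comparison.
    print(len(comparisons) * annotated_genes, "gene-level tests at most")
```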

[Figure 2.14: composite figure with panels (A) to (G), arranged from "General Overview" to "Specific Observations"]


Figure 2.14: Exemplary hierarchy of possible visualizations for a hypothetical
RNA-Seq experiment. (A) Principal component analysis (PCA) can give first insights into the
quality of the samples (do biological replicates cluster?) and can give hints which conditions might
be interesting to compare. (B) MA plot showing the number of significant DEGs between two
conditions. (C) Significant DEGs can be clustered and visualized in a heat map. (D) Venn
diagram showing the overlap of DEGs between different comparisons. (E) A single gene (e.g. IL6)
is selected based on the more general overview and its differential expression is visualized in a box
plot. (F) Homologous sequences of a single gene, previously identified as interesting, can be used
to find possible recombination break points and positively selected sites in an alignment. (G) The
RNA-Seq reads can be further used to identify single nucleotide variations that might explain the
different behavior and expression pattern. Figures are adapted from our publications [4–7, 14] and
are combined here as an example to show the idea behind high-dimensional visualizations (like PCA)
and visualizations on a closer level (like single nucleotide variants).

In all scientific fields, the visualization of information is an important
step in telling a meaningful and informative story from the data. Therefore, we
use appropriate visualizations to communicate information. A well-designed and
elaborate figure can present high-dimensional information, such as gene expression
data of a multi-dimensional study design, in a much smaller space than, for example,
a table.
The general idea is to provide functions that allow a quick examination of
large amounts of data, expose trends, and reveal patterns and correlations within
the data. Effective data visualization is an important part of the decision-making
process and helps to find further pieces of the puzzle that together picture a possible
answer to a given life-science question. In general, it is a good idea to organize plotting
functions with a hierarchical structure in mind: starting with a general overview
(like PCA, MA plots) and closing with defined subsets of selected genes and terms
(heat maps, single-gene box plots, single-nucleotide investigations), see Fig. 2.14.
One of the objectives of all projects presented throughout this thesis was to find
appropriate visualizations of the data presented in each section, comprising
heat maps (e.g. in Sec. 5.1 and 5.2), alignment visualizations (e.g. in Sec. 6.1) and
phylogenetic trees (e.g. in Sec. 3.2). Besides these state-of-the-art visualization methods,
we present further improved representations in this thesis, such as an Interactive
Gene Observer (IGO, Sec. 5.2), animated PCA plots [19] (Sec. 5.1) and improved
online tables that link results between different databases and respond to
user input (Sec. 3.2, 5.1 and 5.2).

Chapter 3

Genome Assembly

This chapter is based on our publications “Assembly of the whole-genome sequence of
Chlamydia gallinacea type strain 08-1274/3” [2] and “Comprehensive insights in the
Mycobacterium avium subsp. paratuberculosis genome using new WGS data of sheep
strain JIII-386 from Germany” [1], complemented by the “Evaluation of associations
between genotypes of Mycobacterium avium subsp. paratuberculosis and presence of
intestinal lesions characteristic of paratuberculosis” [8] and the “Complete genome
sequence of JII-1961 – a bovine Mycobacterium avium subsp. paratuberculosis field
isolate from Germany” [10]*. We will mainly focus on the topics Preprocessing &
Quality Control, Mapping, Assembly, Gene Annotation, Phylogeny and Visualization
(see Fig. 2.2; B, C, E, F, I, and M).
The generation of high-quality reference genomes and corresponding annotations is
an important task, since subsequent analyses, such as the mapping of RNA-Seq
reads and the identification of differentially expressed genes (see Chapter 5), heavily
rely on those references. In this chapter, we focus on the assembly of the rather small
genomes of two bacteria. The genome assembly of higher organisms is much more
complicated, time-consuming and costly. Therefore, when no reference
sequences are available, NGS studies on higher organisms often rely on de novo
transcriptome assemblies (see Chapter 4) instead of building a full genome assembly.
In the first part of this chapter (Sec. 3.1), the de novo genome assembly of a
small bacterium (∼1 Mbp genome) of the family Chlamydiaceae is presented.
Although the generation of a draft assembly is more or less straightforward for such
a bacterial-sized genome, obtaining the complete genomic sequence without any
gaps is still a challenging task. In this case, the assembly process was impaired by
DNA contaminations (along the way we assembled the full mitochondrial genome of
chicken) and highly polymorphic regions present in the genome. However, by testing
different tools and parameter settings and by finally complementing the NGS data
with additional Sanger sequencing we generated the full genomic sequence for this
bacterium.
In the second part (Sec. 3.2) a new genome assembly of a Mycobacterium avium subsp.
paratuberculosis strain from Germany is constructed and comprehensively compared with
other available genomes of this genus, including the comparison of annotated protein- and
non-coding genes, possible virulence factors and phylogenies. We further re-annotated
protein- and non-coding genes for all assemblies included in this study and compared
the results with the available annotations from NCBI. By combining the output of
different assembly strategies, we were able to improve the initial assembly; however,
we could not close all gaps. This section is accompanied by a comprehensive
Electronic Supplement¹.

* unpublished work, publication in progress


3.1 Assembly of the whole-genome sequence of Chlamydia gallinacea type strain 08-1274/3

3.1.1 The genus Chlamydia
The recently introduced bacterial species Chlamydia gallinacea is known to occur in
domestic poultry. Its potential as an avian pathogen and zoonotic agent is currently
under investigation. Chlamydia (C.) gallinacea is an obligate intracellular bacterium
classified under the single genus Chlamydia [89] within the family Chlamydiaceae.
Comparative genetic analysis recently revealed its status as a separate species
with close relatedness to C. avium and C. psittaci [90]. Epidemiological observations
suggest that the main host of C. gallinacea is the domestic chicken [91–94], alongside
turkey, duck and guinea fowl. Its etiological importance is still subject to investigation.
Experimentally infected chickens remained symptomless, but showed reduced body weight
gains [94]. Some observations suggest a zoonotic potential [91, 95]. Co-infections with
C. psittaci seem to be common [92]. Comparison of Chinese and European isolates suggested
high intra-species diversity by revealing 13 different ompA genotypes [94]. Studies
on whole-genome sequences (WGS) will facilitate elucidation of unresolved issues. In
a previous paper, a partially assembled WGS of C. gallinacea type strain 08-1274/3
was reported [90] (NZ_AWUS00000000.1).

Figure 3.1: Visualization of the dissolved assembly graph produced by SPAdes, done with
Bandage [88]. The graph clearly shows repetitive regions in the genome, some of which could
be resolved automatically by SPAdes. The green nodes were connected to Scaffold 1 (643,147 nt),
blue nodes to Scaffold 2 (228,815 nt) and red nodes to Scaffold 3 (185,839 nt) in the final
SPAdes assembly. The purple circularized node represents Scaffold 4 (7,619 nt), the plasmid
p1274. Some small nodes are shared between different scaffolds, indicated with an asterisk.
Only contigs longer than 1,000 bp are shown.

In general, generating a closed genome without gaps for a Chlamydia species seems to
be a challenging task, mainly because of highly polymorphic regions present in the
genome. After host DNA was removed from the input data, three large scaffolds were
calculated by the assembly tool, but at the polymorphic locations the assembler was
not able to decide on a unique path in the de Bruijn graph (Sec. 2.2). Finally, with
additional Sanger sequencing, we were able to close the remaining gaps.
Here, we present the whole-genome sequence of the C. gallinacea type strain
08-1274/3, which consists of the 1,059,583 bp chromosome with 914 protein-coding
sequences (CDS) and the plasmid p1274 of 7,619 bp with 9 protein-coding sequences.

¹ available at http://www.rna.uni-jena.de/supplements/mycobacterium

[Figure 3.2: two dot plots, panels (A) and (B); x-axis: reference Chlamydia avium; y-axis: de novo assembled scaffolds (NODE_1 to NODE_4)]

Figure 3.2: Comparison of dot plots before (A) and after (B) rearrangement with CAR [96]. (A)
Dot plot between the de novo assembly of C. gallinacea (y-axis) and the C. avium genome as
reference (x-axis). (B) Dot plot between the rearranged de novo assembly of C. gallinacea (y-axis)
and the C. avium genome as reference (x-axis). With the help of the reference, the de novo assembled
scaffolds could be ordered and used for primer design to close the remaining gaps. Red dots represent
sequence homology between the query and the reference in the same orientation, blue dots in antisense
orientation. The order corresponds with the de novo predicted scaffold order, see Fig. 3.1.
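
The principle of such a dot plot can be sketched in a few lines of Python. The sketch below marks exact k-mer matches in the same orientation (the red dots) and in antisense orientation (the blue dots). It is a simplified illustration, not the procedure used by CAR or by the plotting tool behind Fig. 3.2; the function names, the toy sequence and the choice of k are assumptions.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T, N)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq))


def dot_plot(query, reference, k=4):
    """Return (forward, reverse) lists of (reference_pos, query_pos) matches.

    forward: the query k-mer occurs in the reference as it is.
    reverse: the reverse complement of the query k-mer occurs in the reference.
    """
    # Index every k-mer of the reference by its start positions.
    index = {}
    for i in range(len(reference) - k + 1):
        index.setdefault(reference[i:i + k], []).append(i)

    forward, reverse = [], []
    for j in range(len(query) - k + 1):
        kmer = query[j:j + k]
        forward.extend((i, j) for i in index.get(kmer, []))
        reverse.extend((i, j) for i in index.get(reverse_complement(kmer), []))
    return forward, reverse


if __name__ == "__main__":
    ref = "ACGGTCATTG"  # toy sequence
    print(dot_plot(ref, ref, k=5))                      # diagonal of forward dots
    print(dot_plot(reverse_complement(ref), ref, k=5))  # anti-diagonal of reverse dots
```

A scaffold in the same orientation as the reference produces a diagonal of forward matches, whereas an inverted scaffold produces an anti-diagonal of reverse matches.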

3.1.2 Sequencing and assembly


Whole-genome sequencing was conducted at the Institute for Genome Sciences
(University of Maryland, Baltimore, MD). Briefly, Illumina-sequenced reads with an
average length of 250 nt and a genome coverage of 1,949× were assembled using CLC bio
(v6.0.1), which resulted in four scaffolds of 630,796 bp, 228,666 bp, 185,564 bp
and 7,088 bp (GenBank assembly accession GCA_000471025.1). In the present
work, this dataset was subjected to de novo assembly again. Non-chlamydial
reads originating from host DNA (from culture in embryonated eggs) were identified
by mapping to the Gallus gallus genome using Segemehl [75]. The remaining
reads were assembled using SPAdes version 3.7.0 [54] with multiple k-mer values of
21, 33, 55, 77, 99, and 127, the --careful option and automatic coverage cut-off,
which yielded 22 contigs. Of those, 16 contigs were identified as sequencing or
assembly artifacts. Two other contigs were assigned through BLAST to Enterobacteria
phage phiX174 (positive control in DNA sequencing) and the Gallus gallus mitochondrial
genome, respectively. Thus, the assembly resulted in Scaffold 1 (643,147 nt), Scaffold
2 (228,815 nt) and Scaffold 3 (185,839 nt), all of which represent the chromosome,
and Scaffold 4 (7,619 nt), which represents plasmid p1274. The Bandage tool [88]
was further used to visualize the dissolved de Bruijn graph of the assembly and
to determine the putative genomic order of the resulting scaffolds (Fig. 3.1). The
genomic order was further confirmed with CAR [96], using the genome of C. avium
(GCF_000583875) as a reference (Fig. 3.2).
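
Why polymorphic or repetitive regions prevent the assembler from choosing a unique path can be illustrated with a toy de Bruijn graph. The following Python sketch is a strong simplification: it ignores reverse complements, coverage and error correction, and it is not how SPAdes builds or resolves its graph. The reads, the value of k and the function names are invented for illustration.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, and every k-mer
    found in a read adds an edge from its prefix to its suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def branching_nodes(graph):
    """Nodes with more than one successor, i.e. positions at which
    a walk through the graph has no unique continuation."""
    return sorted(node for node, successors in graph.items() if len(successors) > 1)


if __name__ == "__main__":
    # Two toy reads that agree on ACGT and then differ (a polymorphic site).
    reads = ["ACGTAC", "ACGTTG"]
    graph = de_bruijn_graph(reads, k=4)
    print(dict(graph))
    print(branching_nodes(graph))
```

In this toy case the node CGT has two successors, so the graph splits into two alternative paths. In a real assembly, such branches end contigs or scaffolds unless additional information, here Sanger sequencing across the gaps, resolves them.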
Based on both visualization methods, flanking scaffolds were determined and
corresponding primer sites were selected to close the gaps. The primers were used
in PCR to generate DNA fragments of 600–800 bp (Gap 1) and 1,300–1,500 bp (Gap
2), which were sent to Eurofins Genomics (Paris, France) for Sanger sequencing.
Alignment of the Sanger sequences to Scaffolds 1–3 using BLAST and Mafft [97]
finally enabled closure of the gaps. The complete chromosomal sequence has a size of
1,059,583 bp. Provisional annotation using Prokka [98] revealed 914 protein-encoding
genes and 46 non-coding RNAs, including, among others, 39 tRNAs, 3
rRNAs and 1 tmRNA. The size of plasmid p1274 was determined as 7,619 bp
with 9 encoded proteins. The average G+C content of the genome is 37.9 mol-%.
This is the first report of a completely assembled genome sequence of C. gallinacea.
It can serve as a reference genome for future studies.
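
A summary statistic such as the average G+C content can be computed directly from the assembled sequences. The following Python sketch shows one straightforward way to do so. Excluding ambiguous bases such as N from the denominator is an assumption of this sketch, and the toy sequences are invented; the source does not state how the reported 37.9 mol-% was calculated.

```python
def gc_content(sequences):
    """G+C content in percent over one or more sequences.

    sequences: iterable of DNA strings (e.g. chromosome and plasmid).
    Only A, C, G and T are counted; ambiguous bases are ignored.
    """
    gc = 0
    total = 0
    for seq in sequences:
        seq = seq.upper()
        g_and_c = seq.count("G") + seq.count("C")
        gc += g_and_c
        total += g_and_c + seq.count("A") + seq.count("T")
    if total == 0:
        raise ValueError("no A, C, G or T found in the input")
    return 100.0 * gc / total


if __name__ == "__main__":
    # Toy sequences standing in for a chromosome and a plasmid.
    print(gc_content(["GGCCAATT"]))
    print(gc_content(["GGCC", "AATTAATT"]))
```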
The updated sequence data of C. gallinacea type strain 08-1274/3 WGS and
its plasmid p1274 have been deposited in NCBI GenBank under accession numbers
CP015840 and CP015841, respectively.


3.2 Comprehensive insights in the Mycobacterium avium subsp. paratuberculosis genome using new WGS data of sheep strain JIII-386 from Germany

In the following section we present the genome assembly of a specific Mycobacterium
strain, JIII-386, from Germany. In contrast to the Chlamydia gallinacea genome
assembly presented in Sec. 3.1, it was not possible to achieve a closed genomic sequence
without any gaps for this bacterium. However, we present a proof of concept that
incorporating multiple assemblies calculated by different tools and parameter
settings can improve the overall assembly by closing at least some of the gaps
remaining after the first assembly round. We used the improved assembly to conduct
an extensive annotation of protein- and non-coding genes of this strain and all other
bacterial genomes used in this study. Based on a comprehensive comparison of
eight Mycobacterium genomes we provide deep insights into the gene composition and
phylogeny of this pathogen.
This study is complemented by comprehensive supplementary material:
supplementary tables S1–S22 and figures S10a, S10b and S22–S26 are available at
http://www.rna.uni-jena.de/supplements/mycobacterium/.

3.2.1 The genus Mycobacterium


Mycobacteria, a genus of Actinobacteria, include pathogens known to cause serious
diseases in humans and other mammals: tuberculosis (Mycobacterium tuberculosis) and
leprosy (Mycobacterium leprae). M. tuberculosis has long been the focus of research
aimed at fighting tuberculosis worldwide. Therefore, most
new scientific findings concerning Mycobacteria are based on this species. Another
Mycobacterium species, which is distributed worldwide, affects domestic and wild
ruminants: M. avium subsp. paratuberculosis (MAP).
MAP is the causative agent of paratuberculosis (Johne’s disease), a chronic
granulomatous enteritis causing malnutrition, therapy-resistant diarrhoea, emaciation,
low milk yield and ultimately death [99]. Paratuberculosis is of considerable
economic significance, especially for the dairy industry.
MAP belongs to the species M. avium together with the bird pathogens M. avium
subsp. avium (MAA) and M. avium subsp. silvaticum (MAS), as well as the related
environmental organism M. avium subsp. hominissuis (MAH), which is associated with
opportunistic infections in humans and pigs [100, 101]. MAH shows the highest genetic
variability within this species [102]. A comparison of strains from different M. avium
subspecies reveals a high diversity [103–105]. In contrast, strains of subspecies
paratuberculosis exhibit a relatively low genetic heterogeneity. Depending
on the study, MAP was differentiated into two or three groups based on phenotype,
genotype and host association: MAP Type I + III and MAP Type II (see Fig. 3.3),
also designated as MAP-S (sheep type) and MAP-C (cattle type), respectively [102,
106–108].


MAP Type I strains predominantly infect ovine hosts; MAP Type II strains
principally infect cattle but also deer, goats, sheep, and other ruminants [109–111].
MAP Type III strains (intermediate) are closely related to Type I strains and have
so far been isolated from sheep, goats, cattle and camels [112–116].
In another study, we evaluated the potential association between the genotype of
individual field strains belonging to the MAP-C group (Type II) and the presence
of macroscopic intestinal lesions characteristic of paratuberculosis in the infected
animals [8]. Overall, 88 MAP-C isolates were sampled from clinically healthy cows
at slaughter. Cows were assigned to group A (n=46) with, or group B (n=42) without,
macroscopic intestinal lesions. Fisher’s exact test was applied to determine whether
specific genotypes were more strongly associated with one of the two groups of
isolates. Groups were considered significantly different if the probability of error
was lower than 5 %. MAP isolates from groups A and B exhibited similar strain
diversity: 20 and 18 combined genotypes, respectively, and 32 genotypes altogether,
six of which were detected in both groups. Although no association was found between
individual combined genotypes and the presence of macroscopic intestinal lesions,
IS900-RFLP-(BstEII)-Type-C1 (the most common type worldwide) was found more often
in group A (p<0.01). Further studies will have to elucidate which genomic structures
and variations, regulatory elements, and phenotypic or functional characteristics
of an MAP isolate are fundamental for its virulence.
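The group comparison described above can be illustrated with a minimal, standard-library sketch of a two-sided Fisher's exact test. Only the group sizes (46 and 42) and the genotype totals come from the text; the split of each group into carriers and non-carriers of a genotype is hypothetical, and the function name is chosen here for illustration.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of all tables that are at most as likely as the observed one
    return sum(p for p in (prob(x) for x in range(lowest, highest + 1))
               if p <= p_observed * (1 + 1e-9))

# Group sizes from the text: A (with lesions) n=46, B (without lesions) n=42.
# The carrier / non-carrier split below is HYPOTHETICAL and only illustrates the test.
carriers_a, others_a = 30, 16
carriers_b, others_b = 14, 28
p_value = fisher_exact_two_sided(carriers_a, others_a, carriers_b, others_b)
print(f"p = {p_value:.4f}; significant at the 5 % level: {p_value < 0.05}")

# Genotype bookkeeping from the text: 20 + 18 genotypes with 6 shared gives 32 distinct
distinct_genotypes = 20 + 18 - 6
print(distinct_genotypes)
```

The exact enumeration is preferred over a chi-square approximation here because some genotype counts in such a comparison are likely to be small.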
Heavy shedding of MAP into the immediate environment of diseased animals and the
high tenacity of MAP within the environment [117] increase the risk of exposure
to MAP for ruminants but also for other mammals. For instance, MAP was detected in
tap water, rivers, dams [118, 119], and also in raw milk [120]. Furthermore, MAP was
isolated from a clinically diseased donkey [121]. MAP has also been isolated from
humans; its etiologic role in Crohn’s disease is under discussion [122, 123].
About ten years ago, the first sequence data of MAP strains were published. Isolate
MAP K-10 from the U.S., belonging to the MAP-C group, was fully sequenced [124].
Later on, it was re-sequenced and better annotated by Wynne et al. [125]. Recently,
the ovine-derived isolates CLIJ361 (MAP-S, Type I) from Australia [126] and MAP S397
(MAP-S, Type III) from the U.S. [127], and the first human-derived isolate MAP4
(MAP-C) from the U.S. [128] have been sequenced.
The availability of these MAP-S genome sequences, although not fully assembled,
improved the informational value of genome comparisons, which are no longer based
only on MAP K-10 and M. avium strain 104. In the meantime, MAP-S and MAP-C
specific loci, genome deletions and insertions have been identified and evolutionary
relationships proposed [104, 127, 129–132].
Besides comparative whole-genome sequence analysis, the non-protein-coding fraction
of the transcriptome has been studied in bacteria over the past decade [133–135].
In Mycobacteria, non-coding RNAs (ncRNAs) were identified in M. tuberculosis
and their role in the regulation of the pathogen’s metabolism was studied [136,
137]. Furthermore, RNA sequences were analyzed in M. avium including MAH and
MAA [138]. Until now, no data have been published on the full set of ncRNAs in
MAP.

3.2. Comprehensive insights in the MAP genome

The objective of this study was to sequence a further MAP-S strain, the ovine-derived
strain JIII-386 from Germany (Europe), and to compare the sequence data with
seven assemblies of related genomes from other continents in order to examine previously
defined genomic differences between MAP-S and MAP-C strains (Type I/III and II,
respectively). Complete genome sequences of a bovine-derived MAP-C strain (also
from Germany) and the M. avium strain 104 were included. A genome-wide annotation
of protein-coding sequences (CDS) was performed using two data resources,
NCBI and BacProt. For the first time, a comprehensive annotation of regulatory
RNAs in MAP was performed. Based on the current data analysis, we aimed to
identify new aspects of the proposed ancestral relationship of M. avium complex
strains and indications for the evolution or conservation of regulatory RNAs.

3.2.2 Sequencing and assembly


MAP ovine isolate JIII-386 and bovine isolate JII-1961

Isolation, identification and characterization of the MAP-S, Type III isolate JIII-386
were described by Möbius et al. [114]. The strain was isolated in 2003 and belongs
to the strain collection of the Friedrich-Loeffler-Institut in Jena (Germany). Briefly,
JIII-386 was isolated from ileal mucosa of a sheep from a migrating herd in the
north-west of Germany. The animal showed no clinical symptoms. Paratuberculosis
was suspected based on positive serological results, detection of MAP in feces by
culture, and pathomorphological and histological results after necropsy. JIII-386
had been cultured using modified Middlebrook 7H11 solid medium (Difco) con-
taining 10 % OADC, Amphotericin B, and Mycobactin J (Allied Monitor, Fayette,
USA). Subcultivation was done on modified Loewenstein-Jensen solid medium, also
supplemented with Mycobactin J.
Additionally, MAP strain JII-1961 (MAP-C) isolated from cattle in 2003 and
sequenced and assembled on the chromosomal level at the Helmholtz Centre for
Infection Research (Braunschweig, Germany) was included and annotated in this
study [10]. This isolate originated from the ileocaecal lymph node of a clinically
diseased dairy cow from a paratuberculosis positive herd in eastern Germany. JII-
1961 was isolated and subcultivated using Herrold’s Egg Yolk Medium (HEYM)
supplemented with Mycobactin J.
JIII-386 had been grown for up to seven months, strain JII-1961 for up to six
weeks. Both isolates were characterized by positive acid-fast staining and their
growth characteristics, and were confirmed as MAP by cultural confirmation of
mycobactin dependency and by PCR detection of the IS900 insertion sequence [139].
The genotypes of the isolates were determined [114] by multi-target genotyping based
on IS900-RFLP (4 digestion enzymes), MIRU-VNTR (9 loci), and SSR (4 loci)
analysis [140–142]. Isolates were expanded for sequencing on HEYM supplemented
with Mycobactin J. Genomic DNA was prepared by the cetyltrimethylammonium
bromide method described by Van et al. [143], and the identity of the strain was
confirmed by MIRU-VNTR genotyping.


Sequencing
Whole-genome shotgun sequencing was performed. Illumina paired-end (fragment
size ∼300 bp) and mate-pair (fragment size ∼2.2 kb) libraries were generated from
fragmented genomic DNA of MAP strain JIII-386. Libraries were sequenced using
Illumina GAIIx (paired-end library) and HiSeq2000 (mate-pair library), resulting
in 28.6 million 101 bp paired-end reads (∼1,100-fold genome coverage) and 10.9 million
100 bp mate-pairs (∼440-fold genome coverage) (see Tab. S3, supplementary material
online, footnote 2).
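The reported coverage values can be approximated with a back-of-the-envelope calculation. The sketch below assumes that the stated read counts refer to read pairs and uses the final JIII-386 assembly length from Tab. 3.1 as genome size; both are assumptions made here, so the result only needs to agree with the reported figures to within roughly 10 %.

```python
def fold_coverage(n_pairs, read_length, genome_size):
    """Expected coverage: total sequenced bases divided by genome length."""
    return n_pairs * 2 * read_length / genome_size

GENOME_SIZE = 4_850_274  # bp, final JIII-386 assembly (Tab. 3.1); an assumption for this estimate

paired_end = fold_coverage(28.6e6, 101, GENOME_SIZE)  # paired-end library
mate_pair = fold_coverage(10.9e6, 100, GENOME_SIZE)   # mate-pair library
print(f"paired-end: ~{paired_end:.0f}-fold, mate-pair: ~{mate_pair:.0f}-fold")
```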

Data preprocessing and de novo assembly


After quality trimming and removal of duplicates, 2 × 27.5 million paired-end reads
and 2 × 10.5 million mate-pairs were de novo assembled using CLC Genomics
Workbench (v5.0, default parameters, http://www.clcbio.com). This initial
assembly (I) resulted in 130 contigs with a total length of 4,792,650 bp. The assembly
was improved with the scaffolding tool SSPACE v2.0 [144] using all mate-pairs. To
close remaining sequence gaps, primers flanking the missing regions were designed, and
the amplicons obtained from genomic DNA were sequenced directly using Sanger
technology. The resulting assembly (II) comprises 14 scaffolds totaling 4,846,897 bp
(see Tab. S4).

Assembly improvement
To improve assembly (II), several steps were applied to handle low-coverage regions,
low-quality reads and misassemblies, to replace gap regions, and to connect scaffolds.
Four additional de novo genome assembly tools were used separately on
both libraries: Velvet (v1.2.10, k=55) [51], ABySS (v1.3.4, k=45) [52] and SPAdes
(v2.5.1, k=43,55,65; mainly implemented for single-cell data) [54], all of which are
de Bruijn graph based [48], and the seed-and-extend based JR-Assembler (v1.0.3,
default parameters) [145]. The resulting contigs (>1,000 bp) were merged and clustered
by sequence similarity using CD-HIT-EST (v4.6, -c 0.95) [146] to reduce
redundancy. Statistical information and each assembly can be retrieved from the
supplementary material online (Tab. S4).
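The pooling step that precedes the redundancy clustering can be sketched as follows. This is a minimal illustration, not the procedure actually used: the function names, the toy sequences and the way contigs are labeled with their assembler are hypothetical, and the clustering itself (done with CD-HIT-EST in the text) is not reproduced.

```python
def read_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def pool_contigs(assemblies, min_length=1000):
    """Merge contigs from several assemblies, keeping those longer than min_length."""
    pooled = []
    for tool, lines in assemblies.items():
        for header, seq in read_fasta(lines):
            if len(seq) > min_length:
                # Prefix the contig name with its assembler to keep names unique
                pooled.append((f"{tool}|{header}", seq))
    return pooled

# Toy input: tool names follow the text, the sequences are invented
toy = {
    "velvet": [">c1", "ACGTACGT", ">c2", "ACG"],
    "spades": [">n1", "GGGGCCCCAA"],
}
pooled = pool_contigs(toy, min_length=5)
print([name for name, _ in pooled])
```

With real data, each dictionary value would be an open FASTA file handle and the default threshold of 1,000 bp from the text would apply.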

Related genomes
Related genomes served as reference genomes in the current study to assist in assembly,
open reading frame (ORF) prediction and annotation. Furthermore, they
were used for the comparison of different MAP types, including strains originating from
different geographic regions of the world. The selection comprises the genomic sequences
of the three MAP-C (Type II) strains K-10/K-10’ [124, 125], MAP4 [128]
and JII-1961 (Möbius et al., unpublished), two sheep-derived MAP-S strains, one of
Type I (CLIJ361 [126]) and one of Type III (S397 [127]), as well as one MAH strain,
M. a. strain 104, designated as MAH 104 [147]. Strains K-10, MAP4, and S397 originated
from the U.S., strain CLIJ361 from Australia and strain JII-1961 from Germany.
2 http://www.rna.uni-jena.de/supplements/mycobacterium/


Genotypes of the MAP isolates investigated in this study are shown in Tab. S2.
Currently, finished genome sequences of these three MAP-C strains are available.
The two ovine isolates are available at contig level (draft genomes): S397, which
comprises 176 contigs, and CLIJ361, which is based on 1,147 contigs.
All reference sequences and available annotation files for the above-mentioned strains
(except JII-1961) were downloaded from NCBI. All genome data used are linked in the
supplementary material online (Tab. S1).
Mauve (v2.3.1) [148, 149] was used for genome alignments of strains K-10’, JIII-386
and S397 and for the comparison of the examined MAP strains.

[Figure 3.3: classification tree. Genus: Mycobacterium, with MAC (Mycobacterium avium complex) and MTBC (Mycobacterium tuberculosis complex). Species: Mycobacterium avium (M.a.), Mycobacterium tuberculosis. Subspecies: M.a. subsp. paratuberculosis (MAP), M.a. subsp. hominissuis (MAH). Type: MAP-C (Type II), MAP-S (Type I/III, split into Type I and Type III). Strain (assembly status): K-10 (full), K-10’ (full), MAP4 (full), JII-1961 (full), CLIJ361 (1,147 con), S397 (176 con), JIII-386 (6 scaff), 104 (full).]

Figure 3.3: Overview of Mycobacterium avium strains compared in this study and their
assembly and annotation status. Strains of MAP-S (Type I/III): JIII-386, S397,
CLIJ361 (red); strains of MAP-C (Type II): K-10, K-10’, MAP4, JII-1961 (green);
MAH strain 104 (blue); Mycobacterium tuberculosis strain H37Rv (brown, used for extended
ncRNA annotation). Underlined – annotations available; scaff – scaffolds; con – contigs;
full – finished genome. Pictograms describe host origin.

3.2.3 Annotation
Annotation of protein-coding sequences (CDSs)
Annotations for MAP strains K-10, K-10’, MAP4, S397 and strain MAH 104 were
downloaded from NCBI (see Tab. S1). For reference-based annotation of CDSs,
BacProt (unpublished data), which is based on Proteinortho [150, 151], was used to
complement the existing annotations. Furthermore, the novel open reading frame (ORF)
prediction of BacProt, which incorporates Shine-Dalgarno and Pribnow box motif
information, was applied. For each M. avium strain, re-annotated and previously annotated
ORFs as well as statistics such as codon usage and the occurrence of Shine-Dalgarno
sequence motifs were calculated. For the ovine-derived strain JIII-386, the annotation
was complemented with data from Bannantine et al. [127] by using BLAST [67]
(v2.2.27+, E-value ≤ 10^−4) with at least 90 % identity and an alignment length of
at least 90 %. ORFs with sequence homology to genes with an assigned function in the NCBI
annotation were identified and designated as protein-coding sequences (CDS).
For each isolate, annotations provided by NCBI were merged with the BacProt
annotations. BLAST (E-value ≤ 10^−4) was then used to find ORFs that are present in
one strain but absent in another. All ORFs of strain A that did not achieve a sequence
overlap of at least 50 % in length and identity against the genome of strain B were
marked as present in A but absent in B. ORFs without an assigned function were
excluded from Tab. S15a. These data provide an overview of the ORFs that are
present or absent between the investigated M. avium strains. Detailed analyses of
single genes and gene clusters as well as large sequence polymorphisms (LSPs) and
phylogenetic relationships were performed with more restrictive parameters (E-value
≤ 10^−20, alignment length ≥ 95 % of query, sequence similarity ≥ 90 %; depending
on the kind of analysis) and manual inspection of all BLAST results, alignments
and sequences.
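The presence/absence rule can be written down as a small filter over BLAST hits. The sketch below is an illustration under assumptions: the `Hit` record and its field names are hypothetical, the E-value cutoff is assumed to have been applied by BLAST beforehand, and alignment coverage is taken as alignment length relative to query length.

```python
from collections import namedtuple

# Hypothetical hit record: query ORF id, percent identity, alignment length, query length
Hit = namedtuple("Hit", "query identity aln_length query_length")

def absent_orfs(orf_ids, hits, min_identity=50.0, min_coverage=50.0):
    """Return ORFs of strain A lacking a sufficient hit in the genome of strain B."""
    found = set()
    for hit in hits:
        coverage = 100.0 * hit.aln_length / hit.query_length
        if hit.identity >= min_identity and coverage >= min_coverage:
            found.add(hit.query)
    # Keep the input order so the result can be matched back to the annotation
    return [orf for orf in orf_ids if orf not in found]

orfs_a = ["orf1", "orf2", "orf3"]
hits_vs_b = [
    Hit("orf1", 98.5, 900, 903),  # nearly full-length hit with high identity
    Hit("orf2", 95.0, 120, 600),  # covers only 20 % of the query
]
print(absent_orfs(orfs_a, hits_vs_b))
```

The stricter settings mentioned for the detailed analyses correspond to calling the same function with `min_identity=90.0` and `min_coverage=95.0`.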
Single nucleotide variants (SNVs) were searched by pairwise comparison of the
protein-coding sequences of the eight investigated genomes. First, BLAST (E-value ≤ 10^−4)
was used to assign homologous sequences between two strains, which were aligned
in a second step using MAFFT (v.7.017b, method: L-INS-i) [152]. The resulting
alignments were searched for SNVs with custom Ruby scripts.
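The scripts used for the SNV search are not shown in the text. As a hedged illustration in Python (not a reproduction of those scripts), the core of such a search on one pairwise alignment can be sketched as below; skipping gap columns is an assumption made here, since indels are not single nucleotide variants.

```python
def find_snvs(aligned_a, aligned_b, gap="-"):
    """Return (column, base_a, base_b) for mismatching, gap-free alignment columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    snvs = []
    pairs = zip(aligned_a.upper(), aligned_b.upper())
    for column, (a, b) in enumerate(pairs, start=1):
        # Report only substitutions; columns containing a gap are skipped
        if a != b and gap not in (a, b):
            snvs.append((column, a, b))
    return snvs

# Invented toy alignment with one gap column and one substitution
print(find_snvs("ATGC-TGA", "ATGCCTAA"))
```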
The presence or absence of 35 large sequence polymorphisms (LSPs), each containing
several ORFs and previously reported by Alexander et al. [132] and Bannantine
et al. [127], was examined across the investigated strains using BLASTn+.

Annotation of ncRNAs

NcRNAs were annotated by homology search of Rfam (v.11.0) [153] families using
the GORAP pipeline [69], which currently comprises Infernal (v1.1) [68], Bcheck
(v0.6) [158], RNAmmer (v1.2) [159] and tRNAscan-SE (v1.3.1) [160] for the detection
of different ncRNA classes. Within the pipeline, family-specific parameters and
several filter steps based on taxonomy, secondary structure and primary sequence
comparison were used. To compare the number of ncRNAs, GORAP was also used to
annotate ncRNAs in two well-known bacteria: E. coli and
S. enterica. All resulting Stockholm alignments were hand-curated with the help
of Emacs RALEE mode [161].

3.2.4 Phylogenetic reconstruction and ancestral relationship

To determine the relationship between all investigated M. avium strains, a phylogenetic
reconstruction based on a selected set of CDSs and ncRNAs shared by all strains was
performed. For CDSs, BLASTn+ (E-value ≤ 10^−10, alignment length >90 % of query)
and the extended annotations by BacProt were used to find coding sequences that
are common to all strains. From this, a set of 790 CDSs was obtained, which
was aligned on nucleotide (∼930,000 nt per species) and amino acid (∼310,000 aa)
level using MAFFT. Furthermore, the ncRNAs annotated by GORAP were used to
align a set of 70 ncRNAs (∼8,200 nt) with MAFFT (L-INS-i). Maximum likelihood
trees were constructed for all three alignments using RAxML (v8.0.25)
[162] with the GTRGAMMA model for nucleotide alignments and PROTGAMMAWAG
for amino acids. All calculations were performed with 1,000 bootstrap replicates and
outgroup rooting (M. tuberculosis H37Rv). The Newick Utilities suite (v1.6)
[163] was used to visualize the calculated trees.


Table 3.1: General genome features of the different Mycobacteria strains. The number of ORFs
with a homologous sequence in NCBI (homologous ORFs) and additionally hypothetical ORFs,
both predicted by BacProt, are provided. NcRNAs and riboswitches were annotated by homol-
ogy search of Rfam (v.11.0) [153] families using the GORAP pipeline [69], see Sec. 3.2.3. For further
information (FASTA, GFF, STK files) see supplementary tables S11, S13, S19 and S21, supple-
mentary material online. chr – chromosome; N50 – length of the shortest contig/scaffold, so that
at least 50 % of all bp in the assembly are represented by this and all longer contigs; ORF – open
reading frame; ? – candidate, further analysis needed.

Type MAP-C (K-10, K-10’, MAP4, JII-1961) MAP-S (JIII-386, S397, CLIJ361) MAH (104)
Strain K-10 K-10’ MAP4 JII-1961 JIII-386 S397 CLIJ361 104
General Features
Genome (bp) 4,829,781 4,832,589 4,829,424 4,829,628 4,850,274 4,813,711 4,612,386 5,475,491
Assembly 1 Chr 1 Chr 1 Chr 1 Chr 6 Scaff 176 Con 1,147 Con 1 Chr
N50 n.a. n.a. n.a. n.a. 1,245,802 56,150 7,088 n.a.
Max contig 4,829,781 4,832,589 4,829,424 4,829,628 1,505,968 137,410 49,981 5,475,491
G+C (%) 69.3 69.3 69.3 69.3 69.16 69.31 68.96 68.99

Protein-coding ORF annotation by BacProt


Homolog. ORFs 3,096 3,081 3,082 3,099 3,067 3,008 2,458 3,553
Hypothet. ORFs 952 967 960 948 991 3,569 5,547 1,054

Housekeeping ncRNA annotation


tRNAs 46 46 46 46 46 44 46 46
5S rRNA 1 1 1 1 1 1 1 1
SSU rRNA 1 1 1 1 1 1 1 1
LSU rRNA 1 1 1 1 1 1 1 1
RNase P 1 1 1 1 1 1 1 1
tmRNA 1 1 1 1 1 1 1 1
small SRP 1 1 1 1 1 1 1 1
other ncRNAs
PyrR 1 1 1 1 1 1 1 1
6C 1 1 1 1 1 1 0 1
Actino-pnp 1 1 1 1 1 1 1 1
mraW 2 2 2 2 2 2 2 2
ASdes 3 3 3 3 3 3 3 3
ASpks 4 4 4 4 3 3 3 4
F6 1 1 1 1 1 1 1 1
G2 0 0 0 0 1? 1? 1? 1?
AS1890 1? 1? 1? 1? 1? 1? 1? 1?
Riboswitches
TPP 2 2 2 2 2 2 1 2
Cobalamin 2 2 2 2 2 2 2 2
Glycine 1 1 1 1 1 1 1 1
SAM-IV 1 1 1 1 1 1 1 1
SAH 1 1 1 1 1 1 1 2
pan 1 1 1 1 1 1 1 1
pfl 1 1 1 1 1 1 1 0
ydaO-yuaA 1 1 1 1 1 1 1 1
ykoK 3 3 3 3 3 3 3 3
ykkC-yxkD 1 1 1 1 1 1 1 1
ykkC-III 1 1 1 1 2 2 2 2
TPP: binds thiamin pyrophosphate (TPP) to regulate thiamin biosynthesis and transport [154]
SAH: recycling of S-adenosylhomocysteine (SAH), produced during SAM-dependent methylation reactions [155]
pan: predicted riboswitch function, located in 5’-UTRs of genes encoding enzymes involved in the synthesis of the vitamin pantothenate [156]
pfl: predicted riboswitch function, consistently present in genomic locations corresponding to 5’-UTRs of protein-coding genes [156]
ydaO-yuaA: genetic “off” switch for ydaO and yuaA genes, possibly triggered during osmotic shock [157]
ykoK: Mg2+-sensing riboswitch, controls expression of magnesium ion transport proteins [157]
ykkC-yxkD: upstream of ykkC and yxkD genes in B. subtilis and related genes in other bacteria, function mostly unclear [156]
ykkC-III: predicted riboswitch function, appears to regulate genes related to preceding motifs like ykkC and yxkD [156]


[Figure 3.4: Mauve genome alignment. Tracks: K-10’ (top), JIII-386 (middle; scaffolds labeled S01–S05 and 6), S397 (bottom); x-axis: genome position, 500 k to 4,500 k bp.]

Figure 3.4: Genome comparison of K-10’ (top), JIII-386 (middle) and S397 (bottom) calculated
with Mauve. Colored blocks connected by lines indicate homologous regions which are internally
free from genomic rearrangements. White areas within blocks indicate sequence regions of lower
similarity. Blocks below the center line are aligned reverse complementary. A more detailed
version (Fig. S10a) is available in the supplementary material online.

3.2.5 Results and discussion


Genome sequencing, assembly and analysis
The general genomic features of MAP JIII-386 are presented in Tab. 3.1 together
with the data for the other investigated strains.
Using the described cluster assembly approach (see Sec. 3.2.2) and several
genome comparisons with related strains, assembly (II) of JIII-386 was improved.
Thirteen sequences of low complexity, comprising nine poly-N, three poly-A and one
poly-T region, were substituted, resulting in an overall replacement of 10,254 bp. Five
5’ and 3’ ends were extended by a minimum of 44 bp and a maximum of 142 bp.
The final assembly (III) of JIII-386 comprises 4,850,274 bp on six scaffolds (see
also Tab. S5). Compared to the assemblies of the other ovine isolates, it includes
fewer gaps and has a much higher N50 value of 1,245,805 bp. JIII-386 shows a slightly
lower G+C content of 69.16 % than the MAP Type II isolates and also MAP
S397 (Tab. 3.1). Based on the scaffolds of JIII-386, possible connections between
contigs within the assemblies of MAP S397 and CLIJ361 could be determined (see
Tab. S8), which might be helpful for further improvement of these assemblies. For
calculations and details see supplementary material online (Tab. S6–S9).
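The N50 statistic referred to above follows the definition given with Tab. 3.1: the length of the shortest contig or scaffold such that this and all longer ones represent at least 50 % of all bp. A minimal sketch, with invented toy lengths because the individual scaffold lengths of JIII-386 are not listed in this section:

```python
def n50(lengths):
    """Shortest length among the longest sequences that together cover >= 50 % of all bp."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # only reached for an empty input

# Toy scaffold lengths (hypothetical, not the JIII-386 scaffolds)
print(n50([2, 3, 4, 5, 6]))
```

A higher N50 at a similar total length indicates that the assembly is concentrated in fewer, longer sequences, which is why the value is used here to compare the three ovine assemblies.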
A graphical visualization of a genome-wide alignment of the K-10’, JIII-386 and
S397 sequences is shown in Fig. 3.4 and in more detail in supplementary Fig. S10a.
The genomic arrangement of strain JIII-386 is similar to that of strain S397 (both MAP
Type III), and large fractions of the genome are still homologous to K-10’. Scaffold
S03 of strain JIII-386 comprises the longest genomic region (650,087 bp, 558 CDSs
annotated with BacProt) that is homologous between the three strains; it is
inversely oriented in JIII-386 and S397 compared to K-10’. Two other large
homologous regions are located on S01 and S04 of JIII-386 and S397, but in a
different order than in the K-10’ genome. Two regions >17,000 bp (5’ in S01 and 3’
in S04) were detected by Mauve alignment in JIII-386 and S397 but not in K-10’
(see Fig. S10b). Further analysis showed that these two regions comprise 16 and 15
CDSs, respectively, which are absent from the investigated MAP-C genomes but
present in MAP-S and also in MAH 104 (see Tab. S17).

Table 3.2: Annotations obtained from NCBI and those additionally calculated using BacProt
lead to an extended annotation for each investigated M. avium strain (last column). The second
line for each strain (CDSs) shows only predicted ORFs with homology to genes with an assigned
function in the NCBI annotation. Corresponding – ORFs identified by BacProt and NCBI originating
from same positions in the genome; Start/End shifted – ORFs identified by BacProt and NCBI
but with differences in length (only 5’ or 3’); NCBI/BacProt only – ORFs identified only by
NCBI/BacProt; Extended – total number of ORFs (combination of NCBI + BacProt only). All
*.gff files are provided in the supplementary material online, Tab. S1, S11 and S12.

Subsp.  Strain      Line      NCBI  BacProt  Corresponding  Start shifted  End shifted  NCBI only  BacProt only  Extended
MAP-C   K-10        all ORFs  4350  4048     2332           411            458          1149       847           5197
                    CDSs      1146  3096     998            60             77           11         1961          3107
        K-10’       all ORFs  4394  4048     2374           432            433          1155       875           5269
                    CDSs      3048  3081     2046           297            319          385        480           3527
        MAP4        all ORFs  4326  4042     2467           342            382          1135       851           5177
                    CDSs      3029  3082     2179           248            293          309        362           3391
        JII-1961*   all ORFs  n.a.  4047     n.a.           n.a.           n.a.         n.a.       n.a.          n.a.
                    CDSs      n.a.  3099     n.a.           n.a.           n.a.         n.a.       n.a.          n.a.
MAP-S   JIII-386**  all ORFs  4598  4058     2259           479            484          1376       882           5480
                    CDSs      3166  3067     1953           363            349          501        428           3594
        S397        all ORFs  4619  6577     2339           386            458          1436       3394          8013
                    CDSs      3179  3008     2033           262            326          558        387           3566
        CLIJ361*    all ORFs  n.a.  8005     n.a.           n.a.           n.a.         n.a.       n.a.          n.a.
                    CDSs      n.a.  2458     n.a.           n.a.           n.a.         n.a.       n.a.          n.a.
MAH     104         all ORFs  5120  4607     2846           357            509          1408       895           6015
                    CDSs      3472  3553     2661           248            373          190        271           3743

* – MAP strains with currently no NCBI annotation available, instead only BacProt results are shown
** – lift-over annotation based on NCBI, MAP S397

Annotation of protein-coding genes


Annotation was performed in two independent steps: (1) BLAST-based lift-over from
CDS annotation of MAP S397 [127] to JIII-386; (2) semi-de novo annotation via
BacProt.
For JIII-386, 4,598 ORFs were predicted using the lift-over annotation based
on 4,619 ORFs of strain S397 from NCBI (for the JIII-386 annotation see supplementary
material online, Tab. S12) and 4,058 ORFs using BacProt (Tab. 3.2). For JIII-386,
the numbers of genes with assigned function (3,067 CDSs) and without
assigned function (991 hypothetical ORFs) correspond to the ORF numbers of all
MAP-C isolates and approximately to the ORF numbers of MAH 104 (Tab. 3.1
and 3.2). For MAP S397 and CLIJ361, a higher number of hypothetical ORFs was
generated by BacProt, many of which were detected only on
short sequences (∼120 bp). This is presumably because the BacProt prediction
for these two strains was based on the published assemblies with a high number
of contigs (176 for S397 and >1,000 for CLIJ361).


Table 3.3: Distribution of 10 large sequence polymorphisms (LSPs) in M. avium strains, previously
described to be present in MAP-S but absent in MAP-C. Labels and locations according to Ban-
nantine et al. [127]. LSPS 8 was only partially detected with an alignment length of 692 bp in all
MAP-C strains. Homologous FASTA sequences for LSPS 1–10 of MAP JIII-386 as well as further
details and additional information about the distribution of 25 other LSPs [132, 164] can be found
in the supplementary material, Tab. S14a and b. full – full-length hit; part – partial hit; * – all
ORFs comprised by the LSPS are present but split on different contigs or genomic locations.

MAP-C MAP-S MAH


Name kb ORFs K-10 K-10’ MAP4 JII-1961 JIII-386 S397 CLIJ361 104
LSPS 1 9.01 15940–16060 – – – – full full full* part
LSPS 2 6.65 46190–46270 – – – – full full full* full*
LSPS 3 3.78 14620–14660 full full full full full full full* full
LSPS 4 3.63 46290–46320 – – – – full full full* full
LSPS 5 3.47 17580–17610 – – – – full full full part
LSPS 6 3.0 40470–40500 full full full full full full full –
LSPS 7 2.89 17640–17670 – – – – full full full full
LSPS 8 2.39 02730–02760 part part part part full full full* full
LSPS 9 1.84 23120–23150 full full full full full full full full
LSPS 10 1.58 42460–42490 full full full full full full full –

For all strains except MAP S397, BacProt identified fewer ORFs than provided
to date in the NCBI annotations; however, additional ORFs were found (Tab. 3.2).
Both annotations were merged to generate an extended prediction of ORFs (Tab. 3.2,
last column). A large fraction of ORFs without an assigned function was included
in both annotations; therefore, a second line was added to Tab. 3.2 for each strain,
showing only the number of ORFs with an assigned function (CDSs).
Approximately 49 % (first line) and ∼62 % (second line) of all genes (extended
panel) were annotated at the same positions by BacProt and NCBI (corresponding
genes), whereas ∼20 % (first line) and ∼9 % (second line) of all genes were found at
unique positions with either BacProt or NCBI only. The intersection of corresponding
ORFs seems to be more reliable, although the ORFs additionally detected by
BacProt were of special interest for further analyses.
For all ORFs annotated by BacProt, the Shine-Dalgarno sequence motif was extracted
(see Fig. 3.5), which represents a part of the ribosomal binding site on prokaryotic
mRNA. The motif is generally located upstream of a start codon and involved in the
recognition of translation start sites during the initial phase of protein synthesis.
Remarkably, the Shine-Dalgarno motif observed in all M. avium strains examined in this
study and additionally in M. tuberculosis strain H37Rv, 5’-AGCTGG-3’ (Fig. 3.5),
was different from the standard 5’-AGGAGG-3’ pattern [165], possibly conserved for
the genus Mycobacterium.

[Figure 3.5: sequence logos for K-10’, JIII-386, MAP4, S397, JII-1961, CLIJ361, H37Rv and MAH 104; x-axis: motif position (0–5), y-axis: bits (0–2).]

Figure 3.5: Shine-Dalgarno sequence motifs of investigated M. avium strains and
M. tuberculosis strain H37Rv. Detailed information and the motif of MAP strain K-10
(similar to K-10’) can be found in Tab. S11.
Table 3.4: New large sequence polymorphism (LSP) regions, together with extended and revised previously described regions, present in MAP-S but absent in MAP-C.
The number of novel ORFs additionally predicted by BacProt, with no overlap with previously annotated MAPs ORFs, is listed. For further
information about genomic positions of homologous ORFs (CDSs) in MAP JIII-386 and gene annotation see Tab. S17. # ORFs – number of ORFs including
homologous as well as hypothetical ORFs.

LSP LSPS included* new LSP Island size (bp) Including MAPs # ORFs # ORFs present in
(genomic region) (MAP-C negative) (BacProt) MAP-S MAP-C MAH 104
LSPS Ia+b 34,377 MAPs_15870–16180 31 +5 yes not partly
new 2 ORFs of LSPS 1* LSPS Ia 10,227 MAPS_15870–15950 9 +2 yes not not
(this study)
extended 23 ORFs of LSPA 4-II** LSPS Ib 24,150 MAPS_15961–16180 22 +3 yes not yes
(this study) 8 ORFs of LSPS 1

extended LSPS 2 + 4* / LSPA 18** LSPS II 16,392 MAPs_46170–46350 18 +2 yes not yes
(this study) new (BacProt) MAPs_46241–46242† 1

extended LSPS 5 + 7* / GPL** LSPS III 16,015 MAPs_17580–17700 12 +5 yes not partly
(previously*) new (BacProt) MAPs_17690† 1 yes not not
LSPS 5 + 7* / GPL** LSPS IIIa 12,142 MAPS_17580–17680 11 +4 yes not yes
LSPS IIIb 3,873 MAPS_17690–17700 2 +1 yes not not

extended MAV-14** LSPS IV ≥21,310 MAPs_20550–20770 22 +2 yes not yes


(this study)

revised LSPS 3 * * yes yes yes


(this study) LSPS 6 * * yes yes not
LSPS 8 * * yes partly yes
LSPS 9 * * yes yes yes
LSPS 10 * * yes yes not
* – see [127]
** – see [132]
† – BacProt assigned function

Using this newly identified Shine-Dalgarno sequence and Pribnow box, ORFs with
and without known function were predicted by BacProt and are listed in Tab. S11 (see
GFF files).
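Sequence logos such as those in Fig. 3.5 display the information content of each motif position in bits, with 2 bits as the maximum for DNA. A minimal sketch of that calculation is given below; the example sites are invented, and the small-sample correction that logo tools usually apply is omitted for brevity.

```python
from collections import Counter
from math import log2

def information_content(sites, alphabet="ACGT"):
    """Per-position information content in bits for equally long DNA sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    bits = []
    for column in zip(*sites):
        counts = Counter(column)
        entropy = 0.0
        for base in alphabet:
            freq = counts[base] / len(sites)
            if freq > 0:
                entropy -= freq * log2(freq)
        # Maximum entropy (2 bits for DNA) minus the observed entropy
        bits.append(log2(len(alphabet)) - entropy)
    return bits

# Hypothetical motif instances: fully conserved except for position 3
toy_sites = ["AGCTGG", "AGCTGG", "AGGTGG", "AGATGG"]
print(information_content(toy_sites))
```

Fully conserved positions reach the maximum of 2 bits, while a variable position contributes less the more evenly its bases are distributed.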
In addition to the annotation of CDSs, an overview of codon usage for each
investigated Mycobacteria strain is given in Tab. S13. As expected, similarities
regarding the ratio of G+C-rich codons were found. The codon preference for
G+C likely reflects the high (almost 70 %) G+C content (see Tab. 3.1) of
M. avium genomes.
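The link between G+C content and codon preference can be made concrete with a short sketch that computes the overall G+C fraction, the codon counts, and the share of codons ending in G or C for a coding sequence. The function names and the toy sequence are illustrative assumptions; the values in Tab. S13 were not computed with this code.

```python
from collections import Counter

def gc_content(seq):
    """Fraction of G and C bases in a nucleotide sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def codon_usage(cds):
    """Count codons of an in-frame coding sequence (a trailing partial codon is ignored)."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))

def gc3_fraction(cds):
    """Fraction of codons whose third position is G or C."""
    usage = codon_usage(cds)
    total = sum(usage.values())
    return sum(n for codon, n in usage.items() if codon[2] in "GC") / total

toy_cds = "ATGGCCGACCTGGCGTGA"  # invented example: ATG GCC GAC CTG GCG TGA
print(round(gc_content(toy_cds), 3), round(gc3_fraction(toy_cds), 3))
```

In G+C-rich genomes the third codon position typically shows the strongest G+C bias, because most changes there are synonymous.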
Several regions that are present multiple times in the genome of JIII-386 were
discovered. Among these are insertion sequences (IS), previously described
to act as transposable elements, which also contribute to the genomic diversity of
Mycobacteria [166–168] and are used as molecular epidemiological markers. Seventeen
copies of the MAP-specific IS900 [169, 170] were verified in JIII-386, as in strains
K-10 and S397 [127]. Additionally, six copies of ISMap02 were present in JIII-386, as
described before for K-10 and S397 by Bannantine et al. [127]. Two copies of ISMpa1 (not
three copies as detected by Olsen et al. [171] for MAP strains) were found in the
genome of JIII-386. Only five copies of IS1311 are present in JIII-386, instead of
seven copies as reported previously for K-10 and S397 [127]. Furthermore, we found
eight copies of IS1311 in the genomes of K-10’ and MAP4.

Annotation of ncRNAs
In the last decade, non-coding RNAs (ncRNAs), possible regulators of cellular
processes and virulence control [137, 172], have gained importance. They have been
characterized for M. tuberculosis [136, 173]. For M. avium (including MAH and MAA),
two riboswitches as well as several antisense and intergenic transcripts have been
identified [138].
Hits for all known ncRNAs provided by the Rfam database, based on a screening
of the seven MAP genomes and MAH 104 are presented in Tab. 3.1. All correspond-
ing files are available in STK, GFF and FASTA-format in the supplementary material
online, Tab. S19.
In general, ncRNAs among the investigated M. avium lineages are extremely
conserved (e.g. tRNAs and riboswitches; Tab. 3.1). There are only a few exceptions,
such as ASpks and ykkC-III, which differ in the number of detected copies between
MAP-C, MAP-S and MAH 104 and are discussed in detail below. The high conservation of
ncRNAs between the different M. avium strains is remarkable: even other bacteria,
like the closely related strains of the obligate intracellular family Chlamydiaceae [90],
show more differences in their small ncRNA repertoire than the M. avium
strains presented here.
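The few ncRNA families that deviate from this conservation can be picked out mechanically from Tab. 3.1. The sketch below transcribes a selection of rows from that table (candidate entries marked with "?" are left out) and reports the families whose copy number is not identical in all eight strains; the function name is chosen here for illustration.

```python
STRAINS = ["K-10", "K-10'", "MAP4", "JII-1961", "JIII-386", "S397", "CLIJ361", "104"]

# Copy numbers transcribed from Tab. 3.1 (selection of families)
COPIES = {
    "tRNAs":     [46, 46, 46, 46, 46, 44, 46, 46],
    "6C":        [1, 1, 1, 1, 1, 1, 0, 1],
    "ASpks":     [4, 4, 4, 4, 3, 3, 3, 4],
    "TPP":       [2, 2, 2, 2, 2, 2, 1, 2],
    "Cobalamin": [2, 2, 2, 2, 2, 2, 2, 2],
    "SAH":       [1, 1, 1, 1, 1, 1, 1, 2],
    "pfl":       [1, 1, 1, 1, 1, 1, 1, 0],
    "ykoK":      [3, 3, 3, 3, 3, 3, 3, 3],
    "ykkC-III":  [1, 1, 1, 1, 2, 2, 2, 2],
}

def variable_families(copies):
    """Return the families whose copy number is not identical in all strains."""
    return {family: counts for family, counts in copies.items()
            if len(set(counts)) > 1}

for family, counts in variable_families(COPIES).items():
    per_strain = ", ".join(f"{s}={c}" for s, c in zip(STRAINS, counts))
    print(f"{family}: {per_strain}")
```

Some of the reported differences, such as the missing 6C copy in CLIJ361, are attributed in the text to assembly quality rather than to true gene loss, so such a listing is a starting point for inspection and not a result in itself.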

Housekeeping ncRNAs. All examined genomes in this study except S397 (44 tRNAs;
Tab. 3.1) contained 46 tRNAs, in contrast to the 45 tRNAs detected by Li et al. [124],
among them three copies of two different methionine tRNA genes and one
selenocysteine tRNA gene (see Tab. S21). In contrast to the study of Bannantine
et al. [127], the genes for Arg-TCT and Lys-TTT could not be identified in the genome
of strain S397 by tRNAscan-SE analysis (see Tab. S21). Each of the other housekeeping
ncRNAs (rRNAs, RNase P RNA, tmRNA, SRP RNA) was identified exactly once per genome
(see Tab. 3.1).

Riboswitches. One third of the known riboswitches are present in MAP JIII-386.
Two copies each of TPP and Cobalamin and one copy each of the SAM-IV, SAH and Glycine
riboswitches were found in all investigated M. avium genomes (Tab. 3.1).
In MAH 104, two copies of SAH were identified. The genome of CLIJ361
lacks the TPP riboswitch upstream of thiE.
Additionally, riboswitch features were found in several 5’-UTRs; however, a function
for these has not yet been confirmed: pan (synthesis of the vitamin pantothenate),
pfl (absent in MAH 104), ydaO-yuaA, ykoK and ykkC-III. The latter has
lost its second copy in MAP-C strains. Three of the riboswitches (SAM-IV, Cobalamin,
ykoK) have been reported previously in MAH, MAA [138] and M. tuberculosis
[136] and were confirmed in this study.
The pan RNA motif represents a conserved RNA structure previously identified
in only three bacterial phyla: Chloroflexi, Firmicutes and Proteobacteria [156].
Its secondary structure consists of one or two stem-loops containing two bulged
adenosines, and it is located in 5’-UTRs of genes involved in the synthesis of the
vitamin pantothenate. If the observed RNA motif is truly a pan-like sequence, it would
be the first discovery of this RNA family in Actinobacteria.

Other ncRNAs. Using GORAP and manual alignment correction one PyrR bind-
ing site was identified in each isolate, which is located upstream of a variety of genes
involved in pyrimidine biosynthesis.
With the exception of MAP strain CLIJ361 (likely due to the limited assembly
quality), one copy of 6C RNA was found within each investigated M. avium genome [155].
The Actino-pnp RNA motif was previously described as a conserved structure in
Actinobacteria, apparently located in the 5’-UTR of genes encoding exoribonucleases
[156]. For each investigated M. avium strain one copy of Actino-pnp was confirmed
in this study (Tab. 3.1).
The mraW RNA motif is a highly conserved RNA structure consisting of one
hairpin with a highly conserved terminal loop sequence 5’-CUUCCCC-3’. Previously,
it was predicted in many Actinobacteria and particularly within Mycobacteria.
The mraW motif was detected twice in the investigated genomes, one copy being located
consistently in the 5’-UTR of mraW genes and another copy, with similar secondary
structure features, located in a region with multiple types of mur genes which likely
form operons with mraW.
A study by Arnvig et al. [136] discovered at least nine putative small RNA families
in the genome of Mycobacterium tuberculosis by RACE analysis and Northern
blot experiments, resulting in four cis- and five trans-encoded ncRNAs. With GORAP
and a manual correction of Stockholm alignments, three of these ncRNAs were identified
in all of the studied Mycobacteria samples: ASdes, ASpks and F6 (Tab. 3.1).
Additionally, two further homologous ncRNA classes were discovered: the trans-encoded
ncRNA G2, which has been lost in MAP-C strains, and the AS1890 alignment,
which achieved a very good bit score but lacked the antisense protein homolog
Rv1890c. These domains were described to act as cis-encoded and trans-encoded
ncRNAs [136].

Chapter 3. Genome Assembly

ASdes and ASpks are involved in lipid metabolism by regulating the fatty acid
desaturase (desA1) and the polyketide synthase-12 (pks12), respectively. The pks12
gene contains two identical copies of ASpks, acting as antisense regulators of pks12
mRNA. In the current study, two clusters of potential ASpks ncRNAs were identified:
one cluster including two identical copies of the region encoding ASpks, as described
for M. tuberculosis [136], and a novel cluster comprising one copy (in MAP-S) or
two copies (in MAP-C and MAH 104). Within K-10’, ASpks homologs of the second
cluster were detected in two copies, localized in different but adjacent PKS genes:
pks7 and pks8. In addition to one copy of ASdes, located antisense of the desA1 gene,
we were able to find further copies in desA2.
6S RNA is a highly abundant ncRNA, which was initially identified in E. coli
[174] and was amongst the first small RNAs to be sequenced [175]; it is further
believed to be necessary in at least one copy for each bacterium. By binding to
the σ70-containing housekeeping RNAP holoenzyme, it inhibits the transcription of a
large number of σ70-dependent genes and thus enables a better adaptation to stationary
phase and environmental stress [176–179]. Although 6S RNA is known for all bacterial
branches (except Deinococcus/Thermus, Chlamydiae, most Actinobacteria) [180], until now
no 6S RNA is known for Mycobacteria. Results of the current study confirm these
data: based on the analysis of the eight investigated genome sequences, no 6S RNA
could be identified.
Using GORAP, about 80 ncRNAs were found in all investigated MAP strains and
MAH 104 (see Tab. 3.1), whereas about 155 ncRNAs could be detected for E. coli
and about 200 for S. entericus (see Tab. S20). Some ncRNAs not known for the latter
two bacteria are listed in Tab. S19. MAP strains and MAH strain 104 possibly
contain more ncRNAs, but these have not been studied intensively so far;
transcriptome profiles for discovering novel, specific ncRNAs are lacking and
should be investigated in more detail in the future.
In 2013, Ignatov et al. [138] described the non-coding transcriptome of Mycobacterium
avium, resulting in 87 antisense and 10 intergenic small RNAs; roughly similar
numbers can be expected for MAP strains.
Altogether, in the current study different copy numbers of ASpks, G2 and ykkC-III
were detected among MAP Type-S and -C strains.
Based on the multiple alignment of 70 ncRNAs, the phylogenetic reconstruction
(Fig. 3.6C) divides all MAP strains into MAP-S and MAP-C clusters, with a low
bootstrap support within the highly similar MAP-C strains.

Loss and gain of gene clusters

Many previous studies presented multiple large sequence polymorphisms (LSPs)
between Mycobacterium avium isolates including MAP-S and MAP-C (Type I/III
and II) strains [104, 127, 129–132, 164]. Deletion or insertion of LSP regions was
related to virulence properties of pathogenic bacteria and used as phylogenetic markers
in Mycobacterium tuberculosis complex (MTC) strains [181]. In M. avium, such
regions also encode metabolic enzymes and antigenic proteins; their specific
distribution is an important source of genetic variability among members of the
M. avium complex [131, 132]. Within the current study, the presence or absence of
LSPs (consisting of at least four ORFs as gene cluster) but also the gain or loss of
single genes were explored using comparative sequence analyses.
First, the presence or absence of 25 genome regions characteristically distributed
between isolates of M. a. subspecies [132] was confirmed in the examined genomes
(see Tab. S14b). LSPA 11 was missing in MAP-S strain JIII-386 from Germany, as
previously only reported for porcine MAP strain LN20 of sheep type originating
from Canada [132] – both belonging to MAP Type III.
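
The LSP criterion stated above (a gene cluster of at least four ORFs) can be illustrated with a minimal sketch. It assumes that presence or absence has already been called per ORF and that the ORFs are given in genome order; the function name, the toy identifiers and the presence calls are hypothetical and not taken from the study.

```python
from itertools import groupby


def absent_clusters(orf_presence, min_orfs=4):
    """Return runs of consecutive absent ORFs with at least min_orfs members.

    orf_presence: list of (orf_id, is_present) tuples in genome order.
    """
    clusters = []
    # groupby collects neighbouring ORFs that share the same presence call
    for is_present, run in groupby(orf_presence, key=lambda item: item[1]):
        ids = [orf_id for orf_id, _ in run]
        if not is_present and len(ids) >= min_orfs:
            clusters.append(ids)
    return clusters


# Hypothetical toy input: identifiers and presence calls are invented.
toy = [
    ("orf01", True), ("orf02", False), ("orf03", False), ("orf04", False),
    ("orf05", False), ("orf06", False), ("orf07", True), ("orf08", False),
    ("orf09", True),
]
print(absent_clusters(toy))
```

In this toy input only the run orf02 to orf06 qualifies as a candidate LSP; the single absent orf08 would be treated as the loss of an individual gene.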

Insertions. Ten regions of specific LSPs (LSPS 1 to LSPS 10) present in MAP-S
but absent in K-10 [127] were confirmed in S397 in this study and detected as homol-
ogous regions also in JIII-386 and CLIJ361 (Tab. 3.3 and supplementary material,
Tab. S14a and b). In contrast to Bannantine et al. [127], four out of these ten LSPs
(LSPS 3, 6, 9 and 10) were identified also in all MAP-C strains. Furthermore, only
LSPS 6 and LSPS 10 are absent in MAH 104. The distribution of the ten LSPS s is
MAP-type associated; it shows no differences between individual strains of MAP-S
or MAP-C regarding their geographical origin.
However, our analysis showed that LSPS 1 (9 kb) and LSPS 2 (6.6 kb) [127] are
subsets of previously described larger elements LSPA 4-II (28.9 kb) and LSPA 18
(16.4 kb) identified in MAH 104 and MAP-S, but absent from MAP-C [131]. LSPA 4-
II and LSPA 18 are related to the PIG-RDA20 and PIG-RDA10 regions detected
by Dohmann et al. [129]. Based on the newly assembled JIII-386, homologous
sequences of S397 and merged annotation, 23 ORFs (MAPs_15961-16180) homol-
ogous to LSPA 4-II sequences and comprising 8 ORFs of LSPS 1 were identified in
the examined MAP-S strains. Two ORFs of LSPS 1 (MAPs_15940 and 15950) were
absent in the genome of MAH 104 and no homologs were found in LSPA 4-II. This
region could be extended by six adjacent ORFs (MAPs_15870-15930) and additionally
by BacProt annotated ORFs. A new LSP was defined: LSPS Ia (see Tab. 3.4)
comprising 11 ORFs, absent in MAP-C and absent in MAH 104. This LSP could
indeed represent an insertion, with genes encoding proteins involved in CoA
energy metabolism and tetracycline-controlled transcriptional activation. Furthermore,
LSPS 2 (MAPs_46190-MAPs_46270) matched LSPA 18; the nine ORFs of
LSPS 2 are homologous to ORFs MAV5227-5235 in MAH 104. LSPS 2 was combined
with LSPS 4, extended by 10 adjacent ORFs and newly designated as LSPS II (see
Tab. 3.4).
ORFs belonging to LSPS 5 and LSPS 7 (see Tab. 3.3) and additionally six adjacent
ORFs (MAPs_17621, MAPs_17622, MAPs_17680–MAPs_17710) were described
as a novel region in MAP-S and MAH 104 genomes by Bannantine et al. [127],
comprising also the GPL region (missing MAPs_17680-17710) published by Alexander
et al. [132]. This genome region was predicted to encode proteins involved in the
biosynthesis of glycopeptidolipids (GPL) [182]. GPLs are discussed to contribute to
the virulence of members of the Mycobacterium avium complex (MAC). Different
genes involved in the synthesis of GPLs would be expected to alter indirectly the
interaction of the bacterium with its host. We found that MAPs_17650, 17670,
and 17690 are homologous to the GPL genes mtfC, dhgA, and hlpA belonging to
the GPL biosynthesis cluster that is known to be diversely organized among individual
strains and subspecies of M. avium [182, 183]. In the present study the 14
ORFs were detected to be present in MAP-S, and absent in the examined MAP-C
strains. It was possible to assign a function for MAPs_17620, 17621, and 17690 (see
Tab. S17). Otherwise, MAPs_17690–17710 are absent in MAH 104, but genes dhgA
and mtfC are still present in the annotation of MAH 104. Altogether, this region
included in addition five BacProt annotated ORFs and was newly designated as
LSPS III (see Tab. 3.4).
A further region (21.3 kb) was identified and newly defined as LSPS IV, com-
prising 22 ORFs (MAPs_20550-20770). This LSP is present in JIII-386, S397,
CLIJ361 and MAH 104 (with 243 mismatches), but absent in MAP-C strains (see
Tab. 3.4). Additionally, in JIII-386 two ORFs were predicted on the opposite strand
by BacProt. Sequences of 15 ORFs (MAPs_20620-20770) are homologous to se-
quences of previously described LSP MAV-14 [132]; 7 adjacent ORFs of LSPS IV
(MAPs_20550-20610) are absent from MAV-14.

Deletions. Several deletions in MAP-S strains, which have already been described
earlier, were verified in the current study, but also differences were found. Three gene
clusters (LSPs) comprising 32 genes, annotated in MAP K-10 were previously char-
acterized to be absent in MAP-S isolates: MAP1432–MAP1438c (deletion s∆-1),
MAP1484c–MAP1491 (deletion #1) and MAP1728c–MAP1744 (deletion #2) [104,
127, 130, 131, 184]. Genes included in deletion s∆-1, deletion #1 and deletion #2
were found to be absent in sheep strains from the United States [104, 127]; those of
deletions #1 and #2 were also absent in Australian sheep strains [130]. In the current
study, genes of deletions #1 and #2 were also absent in JIII-386 from Germany and
CLIJ361 from Australia. However, in contrast to sheep strain S397 from the U.S., the seven
K-10 genes belonging to deletion s∆-1 were identified as being present in sheep strains
JIII-386 from Germany and CLIJ361 from Australia (genes MAP1433c–MAP1438c
in full length; MAP1432 with mismatches; see Tab. 3.5 and Tab. S16). Differences
regarding the presence or absence of deletion s∆-1 could reflect diversities among
MAP-S strains originating from different geographic regions of the world. Marsh et
al. [130] identified ORF MAP2325 in cattle strains but its loss (designated as dele-
tion #3) in Australian sheep isolate Telford 9.2 (MAP-S, Type I) using microarray
and confirmed this deletion #3 in 16 sheep strains by PCR. In contrast, MAP2325
was found to be present in MAP-S (Type III) isolates from the U.S. [104, 127]. This
discrepancy suggested a difference between MAP isolates recovered from sheep in
Australia and the United States. Furthermore, within the current study, MAP2325
could be found with 100 % sequence identity also in MAP-S strains JIII-386 (Type
III) from Germany as well as in CLIJ361 (Type I) from Australia, and confirmed
in all MAP-C isolates. Again, the results reflect diversity within the MAP-S group, but
could also partially indicate discrepancies between the results of different methods
(sequencing, microarray, PCR).

Loss and gain of genes


Based on the merged annotations between NCBI and BacProt, new differences
regarding the loss and gain of single protein coding genes (CDSs) among MAP-
S and MAP-C strains were detected. In JIII-386, S397 and CLIJ361 genomes,
80 homologous CDSs were identified which are absent from K-10/K-10’ and 82
homologous CDSs which are absent from MAP4 and JII-1961. Tab. S15a presents in
detail gain and loss of genes detected in the current study comparing MAP genomes
and MAH 104. 40 genes with assigned functions (homologous genes) as well as 30
hypothetical genes, all previously described by Bannantine et al. [127] to be present in three sheep
isolates of MAP-S, Type III (from U.S.) but absent in K-10 strain (MAP-C), were
part of this analysis. However, 36 out of these 70 genes belonged to the ten MAP-S
specific LSPS regions also published by Bannantine et al. [127] including all ORFs
of LSPS 1, 2, 4, 5 and 7, and two out of four ORFs of LSPS 8 (see Tab. 3.3). Four
genes (hypothetical genes) were still present in MAP-C strains (Tab. S15b). For
nine out of the 30 above mentioned hypothetical genes it was possible to assign a
function based on homology. 34 additional ORFs were found in all MAP-S, but
absent in MAP-C, among them five ORFs which were annotated only by BacProt
(Tab. S17).
Altogether 80 CDSs (ORFs with an assigned function), present in MAP-S but
absent in MAP-C strains, were annotated in this study and listed in Tab. S17. Nine
CDSs were also absent in MAH 104. Eight of these genes belong to the newly designated
LSPS Ia, possibly indicating a specific insertion region into MAP-S strains.
MAP-S (Type III) strains JIII-386 and S397 differed in the presence and/or
absence of altogether 33 CDSs (see Tab. S15a). In detail: 25 CDSs of S397 were
present in MAP-C and partially in CLIJ361 but absent in JIII-386, including four
ORFs (MAPs_23210–MAPs_23240) and six ORFs (MAPs_39450–MAPs_39500),
possibly representing specific deletions in JIII-386. The latter gene cluster encodes
three mammalian cell entry (mce) family proteins and virulence factor mce.
Mce genes were originally identified and studied in M. tuberculosis and have been
associated with survival within macrophages and increased virulence in this species
(see review of Behr et al. [185]). Eight CDSs of JIII-386 were present in MAP-C,
CLIJ361 and MAH 104 but absent in S397 and included 4 complete ORFs with an
assigned function (CDSs) of deletion s∆-1.
In contrast, MAP-C type strains showed high similarities regarding their gene
repertoire. Only two genes are absent from MAP4 (coding for ATP/GTP-binding
integral membrane protein and CsbD-like protein) and two other genes are absent
from JII-1961 (coding for inosine 5’-monophosphate dehydrogenase and a PE-PGRS
family protein, see Tab. S15a). As expected, with loss and gain of about 700 CDSs,
MAH strain 104 emerged as the most different strain among the investigated Mycobacteria
(see Tab. S15a). The large number of genes (up to 208 compared to
K-10’) absent in MAP-S, Type I strain CLIJ361 includes a high number of false-negative
hits, most likely caused by the lower assembly quality. Probably, some of
the genes are present in the genome of CLIJ361 but could not be identified in nearly
full length and were therefore counted as absent from this strain. Better
assembled MAP Type I strains could enable more reliable comparisons among
MAP-S: Type I and III strains.
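
The effect described above, where a gene that is only partially recovered in a fragmented assembly is counted as absent, can be sketched as a simple length-fraction rule. This is a minimal illustration only: the function name and the 0.9 cut-off are hypothetical assumptions, since the exact criterion used for "nearly full length" is not stated in this section.

```python
def call_presence(alignment_length, orf_length, min_fraction=0.9):
    """Call an ORF present if the aligned fraction reaches min_fraction.

    min_fraction is a hypothetical cut-off chosen for illustration only.
    """
    if orf_length <= 0:
        raise ValueError("orf_length must be positive")
    return alignment_length / orf_length >= min_fraction


# Illustrative numbers: a near full-length hit versus a hit truncated
# at a contig end in a fragmented assembly.
print(call_presence(1484, 1490))
print(call_presence(700, 1490))
```

Under such a rule a truncated hit is reported as a loss even though the gene may exist in the genome, which is why a lower assembly quality inflates the number of apparently absent genes.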

Table 3.5: Gene cluster comprising seven K-10 ORFs absent in S397 but present in
JIII-386 and CLIJ361. Table based on Bannantine et al. [127]. Homologous sequences
of all ORFs were found on scaffold S02 in MAP JIII-386. For additional information
and BLAST results see Tab. S16.

ORF          Size (bp)   Description
MAP1432*     1490        REP-family protein
MAP1433c^e   1745        3-oxosteroid 1-dehydrogenase
MAP1434^e    1118        putative phthalate oxygenase
MAP1435      713         short chain dehydrogenase
MAP1436c^e   782         putative oxidoreductase
MAP1437c     986         hypothetical protein^h
MAP1438c^d   983         probable lipase^h

e – involved in energy metabolism
d – involved in degradation of macromolecules
* – partial hit (alignment length 1 484 bp) with mismatches
h – treated as hypothetical ORFs during analyses

PE/PPE/PGRS genes and mmpL5. The PE and PPE gene families are restricted to
mycobacteria, encode acidic, glycine-rich proteins and several of them are proposed
to be involved in antigenic variation and in the pathogenesis of infection [186–188].
They comprise anywhere from 1 % of the genome (MAP) to nearly 10 %
(M. tuberculosis) [189]. In this study individual strains show six, seven or eight
PE genes as well as 32 (JIII-386, S397) or 33 and 35 (MAP-C strains) PPE genes,
annotated by BacProt – there is no clear differentiation between MAP-S and -C
strains possible. Furthermore, PE_PGRS family protein genes – the largest
sub-family of PE family
genes, also suggested to play an important role in the persistence of mycobacteria
and to be involved in antigenic variation and immune evasion [190] – were searched.
It was previously assumed that M. avium, including MAH and MAP, lacks these
PE_PGRS family protein genes [188, 191–194]. However, in this study at least one
(S397, CLIJ361, MAH 104) or two (JIII-386, K-10’, JII-1961, MAP4) homologues of the
PE_PGRS gene family could be annotated by BacProt, confirming results of Tian
et al. [190] for M. avium. Marri et al. [195] suggested that the paucity of PE/PPE
virulence genes in MAP in comparison to M. tuberculosis was compensated by the
acquisition of other virulence factors as a result of lateral gene transfer.
Otherwise, many mycobacterial membrane protein large (mmpL) genes are associated
with clusters involved in the biosynthesis of cell wall-associated glycolipids
[196]. The mmpL5 gene encodes a protein involved in lipid transport [195]. The current
study confirms the absence of the mmpL5 gene in MAP-S strains and its presence in
MAP-C strains (and MAH 104), as previously described by Marsh et al. [184]. This possibly
indicates that some of these mmpL gene products could also help in host association.

Single nucleotide variants


Depending on MAP type and M. avium subspecies, different numbers of SNVs were
detected within corresponding CDSs (see Tab. S18). Among MAP-C strains fewer than
200 SNVs and among MAP-S strains about 1,000 SNVs were identified. As is obvious
from the genome comparison (for JIII-386 and S397 shown in Fig. 3.4, not shown
for MAP-C strains), these results confirm a higher heterogeneity within the MAP-S
group and high similarity between MAP-C strains. This could indicate that MAP-C
has evolved more recently or over a long time within a restricted niche. Among CDSs
of MAP-C and MAP-S, Type III strains more than 2,000 SNVs were found. More
than 26,000 SNVs were detected comparing CDSs of MAP Type I, II and III strains
with MAH 104, revealing the high evolutionary distance between MAP and MAH.


As shown before for the M. tuberculosis complex (MTBC) [197], two thirds of the
SNVs among MAP strains are also nonsynonymous (see Tab. S18), unlike in most
other organisms, in which synonymous SNVs predominate. This has been proposed
to be the consequence of the relatively short evolutionary age of MTBC [198] which
applies also to MAP. Furthermore, this could indicate an adaptive evolution of MAP
to different hosts with positive selective pressure [105].
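
The distinction between synonymous and nonsynonymous SNVs can be made concrete with a short sketch. It assumes that each SNV is given as a pair of in-frame codons (reference and variant) and uses the standard codon assignments; the function names and the example codons are illustrative and do not reproduce the SNV pipeline of this study.

```python
BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...)
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_snv(ref_codon, alt_codon):
    """Classify a single-nucleotide change between two in-frame codons."""
    ref, alt = ref_codon.upper(), alt_codon.upper()
    if len(ref) != 3 or len(alt) != 3:
        raise ValueError("codons must have length 3")
    if sum(x != y for x, y in zip(ref, alt)) != 1:
        raise ValueError("codons must differ at exactly one position")
    same_residue = CODON_TABLE[ref] == CODON_TABLE[alt]
    return "synonymous" if same_residue else "nonsynonymous"


def nonsynonymous_fraction(codon_pairs):
    """Fraction of SNVs in codon_pairs that change the encoded amino acid."""
    calls = [classify_snv(ref, alt) for ref, alt in codon_pairs]
    return calls.count("nonsynonymous") / len(calls)


# Invented example: GCT>GCC keeps alanine, GAA>GTA and ATG>ATA change the residue.
pairs = [("GCT", "GCC"), ("GAA", "GTA"), ("ATG", "ATA")]
print(nonsynonymous_fraction(pairs))
```

A fraction of about two thirds, as in this invented example, corresponds to the proportion reported above for MAP strains.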

Phylogenetic reconstruction and ancestral relationship


Together with the other members of M. avium (MAH, MAA, MAS) and the genetically
related Mycobacterium (M.) intracellulare, MAP belongs to the MAC, whose members
differ in pathogenicity and infect different hosts. How MAP has evolved into a
professional pathogen of ruminants remains largely unknown, as does its division into
two main lineages: MAP-S (MAP Type I/III) and MAP-C (Type II). Previously,
two models for a putative biphasic evolution of MAP to MAP-S and MAP-C strains
were proposed. In model I, the first phase is characterized by the emergence of an
original pathogenic clone of MAP (proto-MAP) from a strain of MAH via acquisition
of novel DNA and polymorphisms shared by all modern strains [132]. The
second phase includes the subsequent differentiation from proto-MAP to sheep and
cattle lineages (MAP-S and MAP-C) via genomic insertion or deletion of different
LSPs [132]. Model II suggests that a different number of independent inversion
events and loss of LSPs caused the evolution from MAH or from M. intracellulare
to proto-MAP and further via sheep type to cattle type [127].
In this study, the higher number of individual CDSs detected in MAP-S than in
MAP-C, and the fact that more sequence regions are present in MAP-S and MAH 104
but absent in MAP-C (see LSPS s) than are deleted in MAP-S, support the model of
evolution from proto-MAP via sheep type to cattle type. Furthermore, the higher
diversity among MAP-S strains (see SNVs) could also indicate an evolutionarily earlier
onset of MAP-S in comparison to MAP-C. Otherwise, the calculated phylogenetic
trees (Fig. 3.6, Fig. S22–S26) - based on comparison of nucleotide or amino
acid sequences within 790 corresponding CDSs and additionally of corresponding
ncRNA sequences - give ambiguous results regarding the proposed evolutionary models
(Fig. 3.6). The trees illustrate the large genetic distance between MAP, MAH, and
M. intracellulare and they show clearly that the subspecies MAP is more closely related
to MAH than to M. intracellulare (Fig. 3.6A and B). Depending on the type
of compared sequences, trees exhibit different results concerning higher or lower
similarity of MAP-S or MAP-C to MAH (Fig. 3.6A versus Fig. 3.6B and C). Consequently,
the results of the current phylogenetic trees do not answer the question whether MAP-S
is the evolutionary intermediate between proto-MAP and MAP-C or whether there was
another way of division into the two main lineages during MAP evolution.
Summarizing several previous studies also provides ambiguous results contradict-
ing both proposed evolutionary models. Sohal et al. [199] described that SNPs in
IS1311 could be indicative of the MAP-S type being an evolutionary intermediate
between M. avium and MAP-C type, but SNPs in the hsp65 gene [200] indicate that
MAP-C is the intermediate. Otherwise, Marsh et al. [201] identified 11 SNPs be-
tween MAP-S and -C strains in eight genes, all present in MAH 104 and distributed
almost evenly among both MAP types. Furthermore, there are polymorphic regions
unique to MAP-S strains and MAH 104, but also large deletions in the MAP-S
strains [130].

Figure 3.6: Phylogenetic reconstructions for all investigated M. avium strains based on sequence
comparison of 790 corresponding CDSs on nucleotide (A) and amino acid level (B) and 70 corresponding
ncRNAs (C). M. tuberculosis strain H37Rv was used as an outgroup. Mycobacterium
intracellulare (MI) strains were included as members of the Mycobacterium avium complex. Float
numbers correspond to substitutions per site and integer numbers represent RAxML bootstrap values.
Long branches are shrunken. Detailed figures, all multiple sequence alignments and tree
representations in newick format can be found in the supplementary material online, Fig. S22–S26.
Shown are strains of MAP-S, Type I: CLIJ361 and Type III: JIII-386, S397 (red); Strains of MAP-C,
Type II: K-10, K-10’, MAP4, JII-1961 (green), MAH strain 104 (blue), MI MOTT-64 and MI
MOTT-02 (orange) and M. tuberculosis strain H37Rv (brown, used as outgroup).

To decipher the complexity of the evolutionary processes leading to MAP-S and
MAP-C strains, future genome comparisons should investigate additional target
regions or genes, such as those for metabolic pathways, and especially use a higher number
of WGS of MAP-S, Type I and III as well as of related strains among the species
M. avium (MAA, MAS, and MAH).

3.2.6 Conclusions
With the newly sequenced JIII-386 genome, the so far best assembled MAP-S sequence
was presented here. We could show that the combination of different de novo
genome assembly tools and parameter settings can improve an initial assembly; however,
we were not able to obtain a completely closed genome. Using merged results
from NCBI and BacProt, a comprehensive annotation of CDSs was obtained, including
a large fraction of CDSs identified by both approaches, and also additional
ones identified exclusively with one of the approaches. This puts into perspective the
absolute numbers of annotated genes in studies using only one annotation program. Newly
annotated CDSs complete the previously detected differences between MAP-S and
MAP-C strains. Within this study, BacProt re-annotations of CDSs for each of
the seven M. avium strains are provided. A new Shine-Dalgarno sequence motif
was extracted; further studies should reveal whether this motif is conserved among
Mycobacteria.
For the first time, about 80 ncRNAs and riboswitches of MAP were presented,
differing in numbers in three cases from MAH 104 but also between MAP-S and -C.
Furthermore, a pan-like sequence was observed, which would be the first discovery of this
RNA family in Actinobacteria. The performed genome comparison is the most comprehensive
to date since it comprises three MAP-S and three MAP-C isolates from
three and two different continents, respectively. Using extended annotation, previously
reported genome differences between S and C strains were partially revised and new
MAP Type-S specific regions were identified.
The concordant presence and absence of specific LSPs and distribution of ncR-
NAs among the examined MAP-S, Type I and III strains show that these strains
are very closely related subgroups of MAP-S.
In conclusion, our data will improve the understanding of the Mycobacterium
avium subsp. paratuberculosis genome, help to decipher the genetic basis for different
phenotypic characteristics of MAP-S and -C (Type I/III and II, respectively) strains
and the evolution of MAP types, also in relation to Mycobacterium avium subsp.
hominissuis.

Chapter 4

Transcriptome Assembly

This chapter is based on “The Dark Art of de novo transcriptome assembly: a com-
prehensive across-species comparison of short read RNA-Seq assemblers” [16]* and
“GOAssembler: a method pipeline for the construction, evaluation and clustering
of de novo transcriptome assemblies” [17]*. We present our idea of a merged or
clustered transcriptome assembly, utilizing different tools and parameter settings,
to construct a more complete and comprehensive assembly out of a given RNA-Seq
data set. The approaches and findings derived from the conceptual comparisons
and calculations presented in this chapter have already been incorporated in the
bacterial genome assembly processes discussed in Chapter 3 [1, 2], in the de novo
transcriptome assembly process of the fruit bat Rousettus aegyptiacus presented
in Section 5.2 [4] and in the transcriptome construction of the Jamaican fruit bat
Artibeus jamaicensis, used for the annotation of an Mx1 homolog in Section 6.2 [7].

In this chapter, we will mainly focus on the topics Preprocessing, Assembly, and
Visualization (see Fig. 2.2; B, E, and M).

In Sec. 4.1, we comprehensively compare ten publicly available tools for de novo
transcriptome assembly across different RNA-Seq data sets comprising various species
and sequencing parameters. We define different metrics to evaluate the performance
of each assembly tool on the various data sets. This section is accompanied by a
comprehensive Electronic Supplement¹. In the second part (Sec. 4.2), we present
our idea of a clustered transcriptome assembly and discuss the advantages and disadvantages
of this approach by incorporating results from Sec. 4.1. As a
proof of concept, we show how the transcripts obtained from different de novo assembly runs
may be combined into a final and more comprehensive transcriptome, ready for annotation,
quantification and differential gene expression analysis.
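
The idea of combining transcripts from several assembly runs can be illustrated with a toy redundancy filter. This is only a sketch under simplifying assumptions (exact matches, no sequencing errors, sequences limited to A, C, G and T); it is not the clustering procedure developed in this chapter, and the function names, tool labels and sequences are invented.

```python
def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def merge_assemblies(assemblies):
    """Merge contigs from several assemblies and drop redundant ones.

    assemblies: dict mapping a (hypothetical) tool name to a list of contigs.
    A contig is dropped if it, or its reverse complement, is an exact
    substring of a longer contig that has already been kept.
    """
    unique = {contig.upper() for contigs in assemblies.values() for contig in contigs}
    # Longest first, so shorter contigs are compared against longer ones
    ordered = sorted(unique, key=lambda s: (-len(s), s))
    kept = []
    for contig in ordered:
        rc = reverse_complement(contig)
        if any(contig in longer or rc in longer for longer in kept):
            continue
        kept.append(contig)
    return kept


# Invented toy contigs from two hypothetical assembly runs.
runs = {
    "tool_a": ["ATGGCGTACGTTAG", "GGGTTTAAA"],
    "tool_b": ["GCGTACG", "TTTAAACCC", "ATGGCGTACGTTAG"],
}
print(merge_assemblies(runs))
```

Real data contain sequencing errors and partially overlapping contigs, so exact substring matching only removes the most obvious redundancy; similarity-based clustering is needed beyond that.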

* unpublished work, publication in progress


¹ available at http://www.rna.uni-jena.de/supplements/the_dark_art/


4.1 The Dark Art of de novo transcriptome assembly: a comprehensive
across-species comparison of short read RNA-Seq assemblers
In Chapter 3, we presented two rather small de novo genome assemblies of a C. gallinacea
and an M. avium subsp. paratuberculosis strain. Whereas the assembly of
bacterial-sized genomes is already challenging, the assembly of eukaryotic genomes
is much harder [202]. Furthermore, the costs for sequencing increase because
more reads are required to achieve a sufficient coverage depth of the genome
that should be assembled. Correspondingly, the costs, computational resources and
amount of time needed for an appropriate genome assembly increase. If no reference
genome is available, one common approach involves the construction of a
transcriptome assembly out of RNA-Seq data. The assembled transcripts can be
further annotated and used for quantification, differential gene expression analyses
and comparative studies, without the need for a genome assembly. De novo strategies
can be applied that leverage the redundancy of short-read sequencing data
to find overlaps between the reads and finally assemble them into transcripts.
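
The relation between genome size, read number and coverage depth mentioned above follows the usual estimate: coverage equals the number of reads times the read length divided by the genome size. The sketch below only illustrates this arithmetic; the function names and all numbers are illustrative assumptions, not values from the data sets in this thesis.

```python
import math


def expected_coverage(num_reads, read_length, genome_size):
    """Average coverage depth: sequenced bases divided by genome size."""
    return num_reads * read_length / genome_size


def reads_needed(target_coverage, read_length, genome_size):
    """Smallest number of reads that reaches the target average coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


# Illustrative numbers: a bacterial-sized genome versus a genome 600 times larger.
bacterial = 5_000_000
large = 3_000_000_000
print(expected_coverage(1_000_000, 100, bacterial))
print(reads_needed(30, 100, bacterial))
print(reads_needed(30, 100, large))
```

At a fixed target coverage the required number of reads grows linearly with genome size, which is why sequencing costs rise for larger genomes.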

4.1.1 RNA-Seq: a revolution in transcriptomics


The sequencing of short cDNA molecules (RNA sequencing, RNA-Seq; see Sec. 2.1.1)
has emerged as a powerful tool to understand various molecular mechanisms and
programs. RNA-Seq is a fast and reliable method to access sets of expressed features
in a qualitative and quantitative manner at low costs. In particular, for non-model
organisms and in the absence of an (appropriate) reference genome, RNA-Seq data
are used to reconstruct and quantify whole transcriptomes at the same time. Even
SNPs, INDELs, and alternative splicing events can be predicted directly from the
data without having a reference genome [32]. RNA-Seq allows the identification of
differentially expressed genes, even if there is currently no reference genome available.
The short sequencing reads derived from an RNA-Seq experiment can be assembled
into contigs. Ideally, each contig corresponds to a certain transcript isoform. The
same reads can be used to quantify each contig. A key challenge is the management
of the resulting datasets, especially if different tools and parameter settings are
used for the construction of a transcriptome assembly. Furthermore, even if
a reference genome is available, it is recommended to complement a comprehensive
differential gene expression study by a de novo transcriptome assembly [202] to
identify transcripts that have been missed by the genome assembly process or are
not appropriately annotated.
Despite its great advantages, RNA-Seq also faces several computational
challenges, including efficient methods to store, retrieve and process large amounts
of data. At first glance, the transcriptome assembly process seems similar to genome
assembly (Sec. 2.2), but the two are fundamentally different and each comprises various
challenges. On the one hand, some transcripts might have a very low expression
level, while others are highly expressed [202]. Especially in eukaryotes, each lo-
cus produces several transcripts (isoforms) due to alternative splicing events [62].

4.1. The Dark Art of de novo transcriptome assembly

Therefore, reads derived from one exon can be part of multiple paths in the as-
sembly graph. Furthermore, some transcript variants with a low expression level
might be considered as sequencing errors by various tools and removed from the
assembly process [203]. As in genome assembly, repetitive regions are also a huge
problem for the construction of transcripts. One of the main challenges in de novo
genome assembly of DNA-Seq data is to deal with repeats that are longer than
the reads. In de novo transcriptome assembly, we have fewer and shorter repeated
sequences. However, they could create ambiguities and confuse assemblers if not
addressed properly [204]. Another point is the trade-off between coverage and cost, an
important question in the design of every NGS experiment. Compared
to DNA-Seq, RNA-Seq requires a much higher coverage to detect rare and
lowly expressed transcript variants [205]. Additionally, it is less straightforward to calculate
the coverage sufficient for a transcriptome assembly than for a genome
assembly, because the true number of expressed transcripts and their isoforms is
usually not known or can only be estimated [32]. Furthermore, the transcriptome
varies between different cell types, environmental conditions and time points. A
successful transcriptome assembler should account for all of these points and be
capable of recovering full-length transcripts across different expression levels.
For most analyses, a transcriptome assembly is only useful if a functional annotation
is available (see Sec. 2.3). Such annotations are often based on a homology
search to annotate protein-coding genes (as done for the Mycobacterium genome in
Sec. 3.2.3), but can get much more complicated if non-coding genes are also
targeted [12, 13].
The de novo transcriptome assembly of non-model organisms has recently been
on the rise, in concert with the number of de novo transcriptome assembly tools.
It remains unclear which assembly software and parameter settings should
be used to construct a good de novo assembly. Additionally, there is a
lack of consensus on which evaluation metrics should be used to assess the quality
of de novo transcriptome assemblies and to select good ones.
Several tools for de novo transcriptome assembly were developed in the last
decade [202]. Some of them are built on top of existing genome assembly
tools, others were specifically designed for transcriptome assembly. This raises the
question: which tool should be used for which kind of data? Some tools
may fit the needs of eukaryotic transcripts, where alternative splicing has to be
considered to construct different isoforms, whereas other tools can handle simpler
prokaryotic transcripts. To complicate matters, there are different RNA-Seq library
preparation protocols, resulting in reads of many kinds: single-end or paired-end,
strand-specific or not strand-specific, and with different insert sizes. To tackle
these and other questions, we performed a comprehensive comparison of ten de novo
assembly software programs across Illumina RNA-Seq data sets of different species,
sequencing parameters, and library preparation protocols.
De novo transcriptome assembly tools have been evaluated
before; however, those studies often rely on limited data sets (e.g., a single species,
a single sequencing protocol) or focus on classic assembly tools.
In 2010, Kumar and Blaxter [206] compared five assemblers based on Roche
454 pyrosequencing data; however, the most frequently used NGS platform today

Chapter 4. Transcriptome Assembly

is provided by Illumina (formerly Solexa) [202] and is the focus of this study. In 2011,
Chen et al. [66] evaluated the impact of different k-mer sizes on de novo transcriptome
assembly results. They found that a single k-mer value
is not enough to generate good assembly results. Instead, combining
contigs constructed with different k-mer sizes could yield much
longer transcripts and greatly improve the final assembly. Larger k-mers
can improve the assembly quality of common transcripts and transcripts with
repetitive regions, but the assembly of rare transcripts may suffer. In that study,
only a single genome assembler (Velvet [51]) was used to build the different k-mer
assemblies. Zhao et al. [205] likewise showed that de novo transcriptome
assemblies from short-read RNA-Seq data can be improved by combining multiple assemblies of
different k-mer values. There, four single k-mer assemblers (Oases [60] (built on top
of Velvet), ABySS [56], Trinity [62], SOAPdenovo [207]) and three multiple
k-mer (MK) methods (SOAPdenovo-MK, Oases-MK, Trans-ABySS [61]) were
tested. They showed that small and large k-mer values performed better in the
reconstruction of lowly and highly expressed transcripts, respectively, and suggested
that multiple k-mer approaches should generally be considered to achieve better
assemblies. A newer study from 2013 [208] compared different de novo assembly
and genome-guided assembly strategies for transcriptome reconstruction. Overall,
five assemblers were used in this study, from which three (Oases, Trans-ABySS,
Trinity) can be applied de novo to RNA-Seq data. In that study, merging the
assemblies of all five tools gave the best overall result. Of course, how the
assembled transcript sequences are merged has a high impact on
the quality of the final assembly. One common approach is to build a merged (combined,
clustered) assembly from the concatenation of multiple FASTA files. Tools
like CD-HIT-EST [146] can be used for this task to cluster the sequences by similarity.
Merging different assemblies is not trivial, as highly similar
isoforms might be clustered into one sequence if the sequence similarity cutoff is too
low. Conversely, if the cutoff is too high, considerable redundancy remains in an assembly that combines
the output of many single assemblies of different tools and parameter settings. Also
in 2013, Clarke et al. [209] evaluated five de novo assemblers, ABySS, Mira [64],
Trinity, Velvet and Oases on simulated and real RNA-Seq data. All of those
assembly tools are based on de Bruijn graphs (see Sec. 2.2), except Mira, which uses an
overlap graph algorithm. Clarke et al. [209] suggest that there is an urgent need for
novel assembly tools for transcriptome data generated by current NGS
techniques. In a recent study from 2016 by Wang and Gribskov [210], eight assembly
tools (Oases [60], SOAPdenovo-Trans [63], Trans-ABySS [61], Trinity [62],
BinPacker [211], Bridger [212], IDBA-Tran [213], SSP [214]) were compared
based on two RNA-Seq data sets from Arabidopsis thaliana with a series of k-mer
values. In this study, SOAPdenovo-Trans and Trans-ABySS performed best
regarding base coverage and the number of recovered full-length transcripts, respectively.
While this study is of general interest, especially because novel assembly
tools are included, the results are based only on data sets from one species (a plant),
and the novel assembly tools cannot really show their strengths on the
simpler single-end and non-strand-specific data.
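The influence of the similarity cutoff on a merged assembly, as discussed above for clustering tools, can be illustrated with a toy example. The following is only a sketch of a greedy, longest-first clustering; it is not the CD-HIT-EST algorithm. The k-mer Jaccard similarity, all function names and the three short "exons" are assumptions made purely for illustration.

```python
def kmer_set(seq, k=5):
    """Return the set of all k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two k-mer sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def greedy_cluster(contigs, cutoff, k=5):
    """Cluster contigs longest-first.

    A contig joins the first cluster whose representative reaches the
    similarity cutoff; otherwise it becomes a new representative.
    """
    clusters = []  # list of (representative, members)
    for seq in sorted(contigs, key=len, reverse=True):
        kmers = kmer_set(seq, k)
        for rep, members in clusters:
            if jaccard(kmers, kmer_set(rep, k)) >= cutoff:
                members.append(seq)
                break
        else:
            clusters.append((seq, [seq]))
    return clusters


# Hypothetical toy "exons"; isoform B skips the middle exon.
exon1, exon2, exon3 = "ATGGCGTACGTTAGC", "GGATCCTTAAGG", "CTGACTGAAATGA"
iso_a = exon1 + exon2 + exon3
iso_b = exon1 + exon3

# iso_a appears twice, as if two assemblers had reconstructed it.
contigs = [iso_a, iso_b, iso_a]

loose = greedy_cluster(contigs, cutoff=0.4)   # isoforms collapse into one cluster
strict = greedy_cluster(contigs, cutoff=0.9)  # isoforms stay separate, duplicate is merged
print(len(loose), len(strict))
```

With the loose cutoff both isoforms end up behind a single representative, so one isoform is lost; with the strict cutoff only the exact duplicate is removed. Real assemblies need a proper alignment-based identity measure, which this sketch deliberately replaces with a cheap k-mer statistic.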


All of these studies agree on one point: currently there is no single assembly
tool that is optimal for all RNA-Seq data sets. Different species, sequencing protocols
and parameter settings need different approaches and adjustments of the underlying
algorithms to obtain the best possible results out of the RNA-Seq data. Merging
the contigs of different assembly tools and parameter settings to overcome the dif-
ferent disadvantages of certain assemblers and to combine their advantages seems
to be the best way to obtain a comprehensive de novo transcriptome assembly (see
Sec. 4.2). Nevertheless, knowing the advantages and disadvantages of each tool is
an important step in the direction of an automated merging algorithm for multiple
de novo transcriptome assemblies.
Here, we present a comprehensive evaluation of ten de novo assembly tools (long-
standing and novel ones) across RNA-Seq data sets of different species and based
on different Illumina sequencing parameters and protocols. In comparison to recent
studies, we do not only focus on RNA-Seq data of one species or kingdom. Instead,
we use real data sets from bacteria, fungi, plants, and higher eukaryotes. We also include
data sets from virus-infected cells. We further tested promising metrics
of various evaluation tools to assess and compare the performance of each assembler.
In a next step, such metrics could be used for an automated selection of good assemblies
or contigs to build a more comprehensive and better cluster-assembly. Our
results give insights into the usability of the different assemblers and
how they perform on the different data sets. To the best of our knowledge, this is
the most complete comparison of short-read de novo transcriptome assembly tools
currently available.

4.1.2 Material and methods


Assemblers
We collected four widely used and six novel assembly tools for Next-Generation
Sequencing short-read data (summarized in Tab. 4.1 and Electronic Supplement²,
Tab. S2):

• Trans-ABySS [61] (v1.5.5; 2010) (built on top of ABySS [52] (v1.5.1)),
• Trinity [62] (v2.3.2; 2011),
• Oases [60] (v0.2.08; 2012) (built on top of Velvet [51] (v1.2.10)),
• SPAdes-sc [54] (v3.9.1; 2012),
• SPAdes-rna [54] (v3.9.1; 2012),
• IDBA-Tran [213] (v1.1.1; 2013),
• SOAPdenovo-Trans [63] (v1.03; 2014),
• Bridger [212] (v2014-12-01; 2015),
• BinPacker [211] (v1.0; 2016), and
• Shannon [215] (v0.0.2; 2016).
² http://www.rna.uni-jena.de/supplements/the_dark_art/


Velvet/Oases and ABySS/Trans-ABySS (hereafter referred to as Oases and
Trans-ABySS) are established de novo genome/transcriptome assemblers based on
de Bruijn graphs. Both support multiple k-mer values by running the underlying
genome assembler multiple times and merging the assembled contigs.
SOAPdenovo-Trans (built on the principles of SOAPdenovo2 [53]) and Trinity
are stand-alone de novo transcriptome assembly tools, also based on de Bruijn
graphs but lacking automated multiple k-mer support. Whereas different k-mer values can be
applied with SOAPdenovo-Trans, Trinity relies on a fixed k-mer value
of 25.
IDBA-Tran is a novel assembly tool that claims to be more robust regarding
uneven expression levels in RNA-Seq data [213].
The Shannon assembler is a so-called information-optimal de novo RNA-Seq
assembler [215]. Shannon aims to reconstruct all information-theoretically recon-
structable transcripts. The assembler is based on the theory that differing abun-
dances among transcripts are the key, rather than the barrier, to construct an effec-
tive assembly. These six transcriptome assemblers are especially designed to work
on RNA-Seq data and are based on de Bruijn graphs.
We further included Bridger, a new framework (published 2015) for de novo
transcriptome assembly that “bridges” between techniques employed in the Cufflinks [46]
pipeline and the Trinity tool in order to overcome the limitations of
Trinity. Bridger constructs a so-called splicing graph, which differs somewhat
from a standard de Bruijn graph, even though there is a kind of correspondence between
the two (see [212]). Based on the principles of Bridger, the authors developed
another assembly tool (presented in 2016) called BinPacker [211]. Like Shannon,
BinPacker utilizes coverage information to efficiently resolve corresponding isoforms.
To achieve this, the tool solves a series of bin-packing problems [211].
BinPacker (like Bridger) constructs splicing graphs as the backbone of the assembly
process, instead of relying on de Bruijn graphs.
SPAdes [54] is a novel de novo genome assembler, also included in this comparison
because we were interested in how well the tool's optimization for single-cell
assembly can be applied to RNA-Seq data and how the tool performs in contrast
to the specialized transcriptome assemblers mentioned above. It was previously
reported that SPAdes performs very well on RNA-Seq data when used in
the single-cell mode [216], possibly due to the uneven coverage optimization implemented
for single-cell data. SPAdes is also based on de Bruijn graphs and multiple
k-mer values. In newer versions of SPAdes, an RNA-Seq mode was implemented.
We evaluated the performance of SPAdes in single-cell (--sc; SPAdes-sc) and
RNA-Seq (--rna; SPAdes-rna) mode. Henceforth, we refer to SPAdes-sc and
SPAdes-rna as two different assemblers, although both are based on the same tool.
When designing this study, we also aimed to include an assembly tool that
is not based on k-mers. Mira [64] (v4.0rc5) uses an overlap-consensus-graph for
assembly and can be used in EST mode for RNA-Seq data. However, due to the
poor performance, runtime and memory consumption of the assembler, we decided to
remove the tool from our comparison³.
³ E.g., 62 h for the EBOV-infected sample, writing >300 GB of temporary files and consuming
∼130 GB RAM. Furthermore, we were not able to detect any BUSCO hits in the Mira assemblies.


Table 4.1: Overview of the different de novo assembly tools evaluated in this study. We obtained the
most recent versions in December 2016. We further rated our experiences regarding the installation
and usability of each tool on a three-level scale (excellent, good, unsatisfactory). These experiences might
be subjective; nevertheless, we want to share them to give inexperienced users an idea of how
difficult it is to get each tool installed (Setup) and executed (Usage), see Sec. 4.1.3 for details. MK
– whether or not the tool has a built-in multiple k-mer approach and is able to automatically merge
the output of different k-mer runs. (a) Oases was used on top of the de novo genome assembler
Velvet (v1.2.10) [217]. (b) SPAdes, originally designed as a de novo genome assembler for single-cell
data, was used in RNA-Seq mode (--rna) and single-cell mode (--sc), respectively. (c) When
running SPAdes in RNA-Seq mode, only a single k-mer value is allowed.

de novo Assembler   Version      Method        MK      Setup  Usage  Reference               Year
Trans-ABySS         1.5.5        de Bruijn     yes                   Robertson et al. [61]   2010
Trinity             2.3.2        de Bruijn     no                    Grabherr et al. [62]    2011
Oases (a)           0.2.08       de Bruijn     yes                   Schulz et al. [60]      2012
SPAdes-sc (b)       3.9.1        de Bruijn     yes                   Bankevich et al. [54]   2012
SPAdes-rna          3.9.1        de Bruijn     no (c)                Bankevich et al. [54]   2012
IDBA-Tran           1.1.1        de Bruijn     yes                   Peng et al. [213]       2013
SOAPdenovo-Trans    1.03         de Bruijn     no                    Xie et al. [63]         2014
Bridger             2014-12-01   splice graph  no                    Chang et al. [212]      2015
BinPacker           1.0          splice graph  no                    Liu et al. [211]        2016
Shannon             0.0.2        de Bruijn     no                    Kannan et al. [215]     2016

For our comparisons, we used the most recent versions available in December 2016.
Finding the best parameter setting for each tool and each data set is
beyond the scope of this evaluation. Therefore, we used the default settings of each
tool and adjusted only a few key parameters (such as k-mer values and strand-specificity)
whenever possible. Execution details can be found in the Electronic Supplement,
Tab. S3. For the tools with built-in function to automatically merge the output
of different k-mer values (Oases, Trans-ABySS, IDBA-Tran, SPAdes-sc; see
Tab. 4.1), we applied a set of selected k-mers (for details see Tab. S3). If strand-specific
data were used for the assembly, we applied the corresponding option in each
tool, if available. In practice, one should try several different parameter settings
and compare the resulting assemblies to optimize the whole assembly process. In
particular, different k-mers should be tested and evaluated against each other. Here,
we carefully chose k-mer values to obtain a reasonably fair comparison between the
assemblers, although some parameters may not be optimal.
Whenever a tool was difficult to install (e.g. due to missing dependencies) or
could not be run on a specific data set, we attempted to debug the source code and,
in a few cases, also contacted the authors to solve the problem. We therefore
decided to share our experiences regarding the installation procedure and execution
of each tool, because we observed that usability differs widely between them (see
Tab. 4.1).

Description of RNA-Seq data sets used for assembly


We used eight RNA-Seq data sets from five different species with available reference
genomes and annotations and also evaluated the assemblers on one synthetic data
set composed of protein-coding and non-coding transcripts of human chromosome 1
(Fig. 4.1). Our real RNA-Seq data sets cover different kingdoms of life, comprising


[Figure 4.1: workflow diagram]
RNA-Seq data sets: H. sapiens (96 million reads, pe, 100 bp, strand-specific); Human + EBOV 3h (17 million reads, pe, 100 bp, unstranded); Human + EBOV 7h (24 million reads, pe, 100 bp, unstranded); Human + EBOV 23h (26 million reads, pe, 100 bp, unstranded); Human Chr1 simulated (60 million reads, pe, 100 bp, unstranded); M. musculus (43 million reads, pe, 76 bp, strand-specific); A. thaliana (16 million reads, se, 100 bp, unstranded); C. albicans (36 million reads, pe, 34 bp, strand-specific); E. coli (6.6 million reads, se, 94 bp, strand-specific).
Preprocessing: FastQC, Prinseq.
de novo Assembly: Trinity, Oases (MK), Trans-ABySS (MK), IDBA-Tran (MK), SOAPdenovo-Trans, Bridger, BinPacker, Shannon, SPAdes-sc (MK), SPAdes-rna.
Evaluation: rnaQUAST, Detonate, BUSCO, TransRate, Hisat2, CPU/RAM, Usability.

Figure 4.1: Overview of the used RNA-Seq data sets (orange – eukaryote, light orange – simulated
human chromosome 1, green – plant, pink – fungi, yellow – bacterium) and evaluated assembly
tools. Each data set was quality controlled with FastQC and preprocessed with Prinseq prior
to assembly. Overall, more than 200 single k-mer assemblies were calculated. For details about
the used data sets and assembly tools, see Electronic Supplement Tab. S1 and S2, respectively.
We further used several tools and statistics for the evaluation of each assembly. The CPU/RAM
consumption and the usability of each assembler were not included in the selected evaluation
metrics (see Sec. 4.1.2). se/pe – single-end/paired-end; MK – the assembler's built-in multiple k-mer
approach was applied.

representatives for bacteria (Escherichia coli; ECO), fungi (Candida albicans; CAL),
plants (Arabidopsis thaliana; ATH), and higher eukaryotes (Mus musculus; MMU
and Homo sapiens; HSA). We further included three data sets of a human HuH7
cell line infected with the single-stranded RNA virus Ebola at three different time
points (HSA-EBOV-3h, HSA-EBOV-7h, HSA-EBOV-23h) [4]. With the help of
these data sets, we evaluated the assemblers' capability to reconstruct the viral RNA
genome directly from the mixed host and viral reads.
We further simulated one artificial data set based on protein-coding and non-
coding transcripts of human chromosome 1 (labeled HSA-FLUX).
With our selection of the different data sets, we further aim to represent different
experimental setups for RNA-Seq data: 1) single-end vs. paired-end data, 2) strand
specificity vs. unstranded protocols, 3) polyA enriched vs. rRNA depleted library
preparations, 4) different read lengths and 5) different sequencing depths, see Fig. 4.1
and Electronic Supplement Tab. S1.


Escherichia coli. Raw read RNA-Seq data of E. coli str. K-12 substr. MG1655
was downloaded from the NCBI Short Read Archive (SRA), study PRJNA238884,
run SRR1173967 [218]. The run comprises roughly 8 million single-end reads
with a length of 94 bp each. A protocol retaining the strand-specificity was used
for sequencing. The reference genome, annotation data and coding sequences were
obtained from the Ensembl [219] bacteria database, release 34⁴.

Candida albicans. Candida albicans is one of the major invasive fungal pathogens
of humans [220]. Here, we obtained 11.5 million 51 bp paired-end reads (not strand
specific, sequenced on an Illumina HiSeq 2500) from the SRA (study: PRJNA213618,
selected run: SRR1654847), previously used in a comprehensive study about the
stress response of this fungal pathogen to weak organic acids [221]. Genome and annotation
data for C. albicans SC5314 were obtained from www.candidagenome.org
(Ca22, 11.12.2016).

Arabidopsis thaliana. To investigate the performance of the assemblers on a
plant transcriptome, we selected RNA-Seq data of the widely used plant model
organism A. thaliana. We downloaded ∼17 million reads (single-end, not strand
specific, 30–101 bp length) of one run (SRR1049376) from the SRA (study
PRJNA231064). This data set was previously used by Wang and Gribskov [210] in a
comparison of eight de novo transcriptome assemblers and was originally obtained
and presented by Lai et al. [222]. Genome and annotation data were obtained from
the Ensembl plant database, release 34⁵.

Mus musculus. For M. musculus, we used an RNA-Seq data set that was previously
generated for the evaluation of the Trinity assembler [223]. The 52.6 million
strand-specific paired-end reads with a length of 76 bp were downloaded from
the SRA, study PRJNA140057 (run SRR203276). The mouse reference genome
(GRCm38), annotation data and coding sequences (CDS) were downloaded from
Ensembl, release 87.

Homo sapiens. The human data set was derived from the widely studied cell line
GM12878. A detailed description of the data set can be found in the ENCODE data
center (https://www.encodeproject.org/experiments/ENCSR000AED/)
under accession ENCSR000AED. Overall, we obtained 97.5 million strand-specific
paired-end reads with a length of 101 bp, sequenced using a polyA mRNA protocol.
The human reference genome (GRCh38) and annotation were obtained from En-
sembl, release 87.

Homo sapiens with EBOV infection. Here, we utilized three samples from our
study of an Ebola virus (EBOV) infected HuH7 cell line 3, 7 and 23 h post infection
(poi) [4], comprising ∼17–26 million paired-end reads with a length of 100 bp (not
⁴ ftp://ftp.ensemblgenomes.org/pub/bacteria/release-34/gtf/bacteria_0_collection/escherichia_coli_str_k_12_substr_mg1655/
⁵ ftp://ftp.ensemblgenomes.org/pub/release-34/plants/fasta/arabidopsis_thaliana/dna/


strand specific). These data, including details about the experimental design, RNA
extraction and sequencing, are presented and discussed in Sec. 5.2. The Ebola virus
(a filovirus) has a single-stranded, negative-sense RNA genome
that is approximately 19 kb in size and encodes seven structural proteins [224].
By performing RNA-Seq without a polyA selection step, we sequenced the EBOV
genome together with the host transcripts. With these data sets, we aim to test the
performance of de novo assembly tools on a viral RNA genome. We assembled the
three different time points individually, to investigate how the different assemblers
perform on varying amounts of viral reads in the data (3 h ∼ 0.1 % viral reads, 7 h
∼ 2 %, 23 h ∼ 20 %). Details about the amount of viral reads in each data set can
be found in Sec. 5.2 and Appendix A.2. For the evaluation, we again used the human
genome and annotation data from Ensembl (release 87) and concatenated them
with the EBOV genome of strain Zaire, Mayinga (GenBank: NC_002549).

Homo sapiens flux simulated data. In addition to the real RNA-Seq data
sets, we quasi-simulated RNA-Seq data based on a selection of protein- and long
non-coding transcripts of human chromosome 1.
We downloaded the human annotation GTF file and cDNA sequences (excluding
ab initio predictions) from Ensembl (GRCh38, release 87) and selected all protein-
coding genes from chromosome 1 (2,044 genes), comprising 352 genes with one iso-
form, 196 with two isoforms and 1,496 with more than two isoforms. We extended
this set of protein-coding genes by 1,075 non-coding genes from chromosome 1. The
combined set of protein- and non-coding genes was used to create a set of transcripts,
including all known isoforms with a length >200 nt and without ambiguous N bases,
from which paired-end reads were to be simulated. Our final set of transcripts comprised
12,793 protein-coding transcripts as well as 1,006 lincRNAs, 839 antisense
RNAs and 7 snoRNAs of human chromosome 1.
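The selection criteria described above (length >200 nt, no ambiguous N bases) can be written down as a small filter. This is a minimal sketch and not the script used for the thesis; the function names and the toy transcript identifiers are hypothetical. The last lines simply re-add the transcript counts reported in the text as a consistency check.

```python
def keep_transcript(seq, min_length=200):
    """Keep a transcript if it is longer than min_length nt and has no N bases."""
    return len(seq) > min_length and "N" not in seq.upper()


def filter_transcripts(transcripts, min_length=200):
    """Filter a {transcript_id: sequence} mapping with keep_transcript."""
    return {
        tid: seq
        for tid, seq in transcripts.items()
        if keep_transcript(seq, min_length)
    }


# Toy input with hypothetical identifiers.
toy = {
    "tx_long": "ACGT" * 60,          # 240 nt, kept
    "tx_short": "ACGT" * 25,         # 100 nt, dropped (too short)
    "tx_with_n": "ACGT" * 60 + "N",  # 241 nt, dropped (contains N)
}
kept = filter_transcripts(toy)

# Transcript counts as reported in the text for human chromosome 1.
counts = {"protein_coding": 12793, "lincRNA": 1006, "antisense": 839, "snoRNA": 7}
total = sum(counts.values())
print(sorted(kept), total)
```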
These 14,645 transcript sequences were then used as input for flux simulator [225]
for RNA-Seq raw read simulation, yielding 60 million paired-end
100 bp reads (Tab. S1). We used flux simulator as suggested for Illumina
data, utilizing the default 76-bp error model. With these simulated sequences, we
attempt to mimic a state-of-the-art RNA-Seq data set based on Illumina’s Ribo-
Zero protocol for library preparation and rRNA depletion, further multiplexed three
times and sequenced on one HiSeq 2500 lane. As such a protocol also allows for the
detection of untranslated transcripts as part of the RNA-Seq data, we also included
the sequences of non-coding transcripts in the simulation.

Quality control of all RNA-Seq data sets


We investigated the quality of each data set with FastQC [43] and used Prinseq [44]
for an initial quality processing of all raw reads. Low-quality regions with an average
quality below 20 were trimmed using a sliding-window approach with a window of five bases. Only
reads that yielded a remaining read length of at least 25 bp were considered for
further analysis. All reads including ambiguous N bases were removed. PolyA/T
tails were trimmed. Details about the trimmed data, finally used for the assemblies,
can be found in Electronic Supplement Tab. S1.
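The quality processing described above can be sketched as follows. This is an illustration of the stated thresholds (window of five bases, mean quality 20, minimum length 25 bp, no N bases) and does not reproduce the internals of Prinseq; trimming both read ends one base at a time is an assumption, polyA/T trimming is omitted, and all function names are hypothetical.

```python
def trim_low_quality(seq, quals, window=5, min_mean=20):
    """Trim both read ends while the mean quality of the terminal window is below min_mean."""
    start, end = 0, len(seq)
    # Trim the 3' end one base at a time.
    while end - start >= window and sum(quals[end - window:end]) / window < min_mean:
        end -= 1
    # Trim the 5' end one base at a time.
    while end - start >= window and sum(quals[start:start + window]) / window < min_mean:
        start += 1
    return seq[start:end], quals[start:end]


def passes_filters(seq, min_length=25):
    """Keep reads of at least min_length bases without ambiguous N bases."""
    return len(seq) >= min_length and "N" not in seq.upper()


# Toy read: 30 high-quality bases followed by 10 low-quality bases.
read = "ACGT" * 10
phred = [35] * 30 + [10] * 10

trimmed_seq, trimmed_qual = trim_low_quality(read, phred)
print(len(trimmed_seq), passes_filters(trimmed_seq))
```

Note that a window-based rule can leave a few low-quality bases at the read end, because the window mean only has to reach the threshold.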


Execution of de novo transcriptome assemblies

In total, we calculated more than 200 single k-mer assemblies. Each assembler was
run on each data set (see Fig. 4.1). If possible, multiple k-mers were used (see
Tab. 4.1). Trans-ABySS, Oases and IDBA-Tran provide built-in functionality
for multiple k-mers. SPAdes-sc can automatically choose multiple k-mers for the
assembly process and was therefore executed with this default option. All assemblers
were run with default parameters, if not otherwise stated. Details about the
execution of each tool on each data set can be found in the Electronic Supplement,
Tab. S3. For the E. coli, A. thaliana, H. sapiens and the artificial data sets k-mers
25, 35, 45, 55 and 65 were used with Trans-ABySS, Oases and IDBA-Tran. The
short-read C. albicans data was run with k-mers 21, 27, 33 and 39. M. musculus
data was assembled with the k-mers: 25, 35, 45 and 55, because the read length
is shorter in comparison to the bacterial and plant data sets. The EBOV infected
HuH7 samples were run with k-mers 25, 29, 33, 37 and 41. The k-mer values were
selected based on previous results for these data sets and in relation to the different
read lengths and sequencing setups.
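The k-mer sets listed above can be kept in a small configuration together with a sanity check. This is only a sketch: the dictionary layout and the function name are hypothetical, the read lengths are the raw values given in the data set descriptions (the maximum for A. thaliana), and the two rules checked here (odd k, k smaller than the read length) are common conventions for de Bruijn graph assemblers rather than requirements stated in the text.

```python
# k-mer values per data set as listed in the text (multiple k-mer assemblers).
KMER_SETS = {
    "ECO": [25, 35, 45, 55, 65],
    "ATH": [25, 35, 45, 55, 65],
    "HSA": [25, 35, 45, 55, 65],
    "HSA-FLUX": [25, 35, 45, 55, 65],
    "CAL": [21, 27, 33, 39],
    "MMU": [25, 35, 45, 55],
    "HSA-EBOV-3h": [25, 29, 33, 37, 41],
    "HSA-EBOV-7h": [25, 29, 33, 37, 41],
    "HSA-EBOV-23h": [25, 29, 33, 37, 41],
}

# Raw read lengths in bp, taken from the data set descriptions.
READ_LENGTH = {
    "ECO": 94, "ATH": 101, "HSA": 101, "HSA-FLUX": 100, "CAL": 51, "MMU": 76,
    "HSA-EBOV-3h": 100, "HSA-EBOV-7h": 100, "HSA-EBOV-23h": 100,
}


def check_kmers(kmers, read_length):
    """Return a list of human-readable problems for a k-mer set (empty if none)."""
    problems = []
    for k in kmers:
        if k % 2 == 0:
            # An even k allows k-mers that equal their own reverse complement.
            problems.append(f"k={k} is even")
        if k >= read_length:
            problems.append(f"k={k} is not smaller than the read length {read_length}")
    return problems


report = {name: check_kmers(ks, READ_LENGTH[name]) for name, ks in KMER_SETS.items()}
print({name: issues for name, issues in report.items() if issues})
```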
As IDBA-Tran assumes paired-end reads to be in order (->, <-; forward–
reverse), we manually converted reads if necessary before running IDBA-Tran (see
Tab. S3).
We tried to run Bridger in RF (reverse–forward) mode for strand-specific data;
however, this did not work. Therefore, we used the non-strand-specific mode for
the M. musculus and H. sapiens assemblies.

Performance benchmarks, evaluation criteria and selected metrics

We benchmarked the different assembly tools using several evaluation tools and met-
rics, summarized in Fig. 4.1. Some of the metrics are based on reference sequences
and annotations, whereas others are only based on the final assembly itself (the
contigs) or the reads that were used to construct the assembly.
Evaluation metrics are very important to assess the quality of a genome or transcriptome
assembly. However, there is a lack of consensus on which evaluation metrics
work best for de novo transcriptome assembly. For example, Rana et al. [226] compared
different assemblers and k-mer strategies using killifish RNA-Seq data and
based their comparisons on eleven selected metrics, such as contig number, N50
value⁶, contigs >1 kb, re-mapping rate, number of full-length transcripts, number of
open reading frames, Detonate's RSEM-EVAL score and percentage of alignments
to closely related fish. In another study, Chopra et al. [227] performed comparisons
on peanut RNA-Seq data and evaluated the assemblies on metrics like N50, average
contig length, number of contigs and the number of full-length transcripts. Moreton,
Dunham, and Emes [228] also used the N50 length, the number of transcripts, the
number of transcripts ≥1 kb and RMBT and CEGMA percentages when evaluating
different assemblies of duck. Surely, more information on which metrics best pre-
dict the quality of a de novo transcriptome assembly would help to establish “best
⁶ The Nx value is the length of the shortest contig in the assembly such that all contigs
of this length or longer together cover at least x % of all bases in the assembly.


practice” protocols that could be further utilized to develop automatic evaluations
to improve assemblies.
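Because the N50 value appears in all of the studies mentioned above, a short sketch of the Nx definition may be helpful. The function name and the toy contig lengths are hypothetical; the evaluation tools used in this chapter compute the value themselves.

```python
def nx(contig_lengths, x=50):
    """Length of the shortest contig such that all contigs of this length
    or longer together cover at least x % of all assembled bases."""
    if not contig_lengths:
        raise ValueError("no contigs given")
    if not 0 < x <= 100:
        raise ValueError("x must be in the interval (0, 100]")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point rounding issues.
        if running * 100 >= total * x:
            return length


# Toy assembly with five contigs (1,500 bases in total).
lengths = [100, 200, 300, 400, 500]
print(nx(lengths, 50), nx(lengths, 90))
```

For the toy assembly, the two longest contigs (500 + 400 = 900 bases) already cover more than half of the 1,500 bases, so the N50 is 400. The example also shows why the N50 alone says little about correctness: concatenating contigs wrongly would increase it.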
We selected 20 powerful metrics to evaluate each assembly strategy, comprising
eight reference-based evaluation metrics (rnaQUAST: database coverage, misassem-
blies, mismatches per transcript, average alignment length, mean isoform coverage;
TransRate: reference coverage; BUSCO: complete and single-copy BUSCOs, miss-
ing BUSCOs) and twelve contig- or read-based evaluation metrics (Hisat2: over-
all assembly mapping rate; rnaQUAST: transcripts ≥1000 bp; TransRate: N50
length, mean ORF percentage, optimal assembly score, percentage of good map-
pings, percentage of bases uncovered by any read, number of ambiguous N bases;
Detonate: nucleotide F1, contig F1, KC score, RSEM-EVAL score). For the data
sets consisting of single-end data (ECO, ATH) only 17 metrics could be evaluated,
because the optimal score and the percentage of good mappings and uncovered bases
are only calculated by TransRate if paired-end data is provided.
We also evaluated the computational efficiency (runtime, memory) to assess the feasibility
of the tools for deeply sequenced data sets and/or large sample sizes.

Mapping rate. We used Hisat2 [74], a fast splice-aware aligner with low memory
consumption, to map the quality-controlled reads back to each assembly. The
mapping rate indicates how many reads were incorporated into the
final transcripts during the assembly process, and how well (see Electronic Supplement Fig. S4).
However, reads that are not part of the true transcriptome
but are still included in the RNA-Seq data (e.g., due to contamination)
can induce chimeric contigs and higher mapping rates. Furthermore, contigs that
were simply constructed incorrectly can also increase the mapping rate. Therefore, the
re-mapping rate can give first insights into the quality of a transcriptome assembly,
but further metrics are needed to obtain a more complete picture of the assembler's
performance.
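The re-mapping rate can be read from the alignment summary that the aligner reports. The sketch below assumes that this summary has been captured as text and contains a line of the form "NN.NN% overall alignment rate"; the function names are hypothetical and the summary shown is an invented toy example, not a result of this study.

```python
import re


def overall_alignment_rate(summary_text):
    """Extract the overall alignment rate (in percent) from an aligner summary."""
    match = re.search(r"([\d.]+)% overall alignment rate", summary_text)
    if match is None:
        raise ValueError("no overall alignment rate found in summary")
    return float(match.group(1))


def mapping_rate(mapped_reads, total_reads):
    """Percentage of reads that map back to the assembly."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return 100.0 * mapped_reads / total_reads


# Invented toy summary for 1,000 single-end reads.
toy_summary = """1000 reads; of these:
  1000 (100.00%) were unpaired; of these:
    100 (10.00%) aligned 0 times
    800 (80.00%) aligned exactly 1 time
    100 (10.00%) aligned >1 times
90.00% overall alignment rate
"""

print(overall_alignment_rate(toy_summary), mapping_rate(900, 1000))
```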

rnaQUAST. We used rnaQUAST [216] (v1.4.0) to calculate various statistics for
the assemblies of each data set. With the help of the rnaQUAST metrics, we can
demonstrate the completeness and correctness levels of the assembled transcripts
in a user-friendly report (see Electronic Supplement Tab. S5). When
a reference transcriptome is provided, rnaQUAST additionally calculates the sensitivity
and specificity of an assembly.
For the sensitivity of an assembly (assembly completeness, true positive rate),
rnaQUAST attempts to select the best-matching database isoforms for every tran-
script. It should be mentioned that a single transcript can of course contribute to
multiple isoforms in the case of, for example, paralogous genes or repeats in the
genome. On the other hand, an isoform can be covered by multiple transcripts in
the case of a fragmented assembly or duplicated transcripts in the assembly. The
assembly specificity (true negative rate) is computed only based on transcripts that
have at least one significant alignment and are not misassembled. Details about the
calculation of both metrics can be found in the rnaQUAST manual7 .
7
http://spades.bioinf.spbau.ru/rnaquast/release1.4.0/manual.html

4.1. The Dark Art of de novo transcriptome assembly

Furthermore, rnaQUAST calculates various bar plots and histograms to visualize
basic statistics such as transcript lengths, mismatch rates and the number of
transcript alignments per isoform. All plots and detailed statistics can be
found in the Electronic Supplement, Tab. S5.

TransRate. TransRate [229] is a novel tool for de novo transcriptome assembly
quality assessment. The tool examines an assembly in great detail and compares it
to experimental evidence such as the reads the assembly was built on.
Three key approaches are used by TransRate to analyze a transcriptome
assembly: 1) inspecting the contig sequences, 2) mapping reads to the contigs and
inspecting the alignments, 3) aligning the contigs against known proteins or
transcripts from a related species and inspecting the alignments. The authors
claim that metrics based on the contigs and reads only (1 and 2) are much better
suited for assembly optimization than a comparison with reference sequences (3).
Such comparative metrics are not ideal for optimizing the assembly process,
because comparison to a reference will always penalize genuine biological
novelty contained in the assembly.
One of our metrics relies on the calculated TransRate scores as a measure for
the quality of the assembly without using a reference. The score is produced for the
whole assembly and for each single contig. The scoring process uses only the reads
that were used to generate the assembly as evidence. However, the scores can
currently only be calculated for paired-end data. The score of an assembly is
calculated as the geometric mean of all contig scores multiplied by the proportion
of input reads that provide positive support for the assembly [229]. Thus, the
score captures how confident one can be in what was assembled, as well as how
complete the assembly is. The minimum possible score is 0 and the maximum is 1.0.
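The assembly score described above can be sketched in a few lines. This is a minimal illustration of the stated formula only (geometric mean of the contig scores times the fraction of supporting reads); the function name and the example numbers are assumptions made for this sketch and do not reproduce TransRate's internal implementation.

```python
import math


def assembly_score(contig_scores, n_supporting_reads, n_input_reads):
    """Geometric mean of contig scores times the fraction of supporting reads.

    contig_scores: per-contig scores, each expected in [0, 1]
    n_supporting_reads: reads that provide positive support for the assembly
    n_input_reads: all reads given to the assembler
    """
    if not contig_scores or n_input_reads <= 0:
        raise ValueError("need at least one contig score and one input read")
    if any(score <= 0 for score in contig_scores):
        # A single contig score of zero pulls the geometric mean to zero.
        return 0.0
    log_sum = sum(math.log(score) for score in contig_scores)
    geometric_mean = math.exp(log_sum / len(contig_scores))
    return geometric_mean * (n_supporting_reads / n_input_reads)


# Hypothetical example: two contigs, 80 of 100 reads support the assembly.
print(assembly_score([0.25, 1.0], 80, 100))
```

Because the geometric mean is used, a few very poorly supported contigs lower the assembly score much more strongly than they would with an arithmetic mean.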

Detonate. We further used the Detonate workflow, a pipeline for DE novo
TranscriptOme rNa-seq Assembly with or without the Truth Evaluation [230]. The
pipeline consists of two component packages, RSEM-EVAL and REF-EVAL. Both
packages are mainly intended to be used to evaluate de novo transcriptome
assemblies, although REF-EVAL can be used to compare sets of any kind of genomic
sequences. Here, we mainly focus on Detonate's RSEM-EVAL score as a novel
reference-free evaluation method to assess the quality of transcriptomes. The tool
calculates a statistically based evaluation score using multiple factors, such as the
compactness of the assembly and its support from the RNA-Seq reads used to create
it [230]. Therefore, the RSEM-EVAL score can be used to evaluate assemblies even
when the ground truth is unknown. Assemblies with higher RSEM-EVAL scores are
considered better.
We further calculated nucleotide F1, contig F1 and KC scores with Detonate.
The F1 score is a measure of a test's accuracy, namely the harmonic mean of
precision and recall. An F1 score of 1 would mean that all nucleotides/contigs in
the estimated true assembly were recovered with at least 90 % identity. The k-mer
compression score (KC score) reflects the similarity of each assembly to
Detonate's estimated “true” assembly and combines two measures: weighted k-mer
recall and inverse compression rate [230].
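For readers less familiar with the F1 score, the following sketch shows the general definition as the harmonic mean of precision and recall. How Detonate derives precision and recall at the nucleotide or contig level is not reproduced here; the input values below are hypothetical.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; defined as 0 if both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical values: a precise but incomplete assembly.
print(round(f1_score(0.9, 0.3), 3))
```

The harmonic mean rewards assemblies that are balanced in both respects: a high precision cannot compensate for a very low recall, and vice versa.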

Chapter 4. Transcriptome Assembly

Detonate was run for all assemblies as recommended in the online vignette8 .
The main metrics calculated by Detonate can be found in Electronic Supplement
Tab. S8.

BUSCO. Here, we used benchmarking universal single-copy orthologs, named
BUSCOs [231] (v2.0), to detect orthologous candidate genes in the assemblies and to
assess the presence and abundance of such single-copy orthologs as an evaluation
criterion. The so-called BUSCOs are selected from OrthoDB orthologous groups at
major species radiations, requiring orthologs to be present as single-copy genes in
the vast majority (>90 %) of available species. The BUSCO tool attempts to provide
a quantitative assessment of the completeness of an assembly in terms of expected
gene content. The results are summarized in the categories 'Complete and
single-copy', 'Complete and duplicated', 'Fragmented', and 'Missing' BUSCOs.
For the evaluation of the simulated human data set, the Euarchontoglires BUSCO
set was reduced to BUSCO orthologs originating only from human chromosome 1
(671 BUSCOs). The full BUSCO output for each data set can be found in the
Electronic Supplement, Fig. S7.

Computational resources
Each assembly was executed on 48 threads. All calculations were run on two sym-
metric multiprocessing servers with 14 TB storage (raid-5) and 48 CPU cores, com-
prising four AMD Opteron 6238 CPUs and 512 GB RAM running on a Debian 64 bit
system.

Usability
We further aimed to install and run all tools without root rights on our systems
(Debian GNU/Linux 8 (jessie) 64-bit). Of course, how easily a tool can be installed
and executed depends heavily on the machine used, the server setup and how familiar
the user is with the programming language the tool is based on. Nevertheless, it
should be the goal of each publicly available piece of software to be as
user-friendly as possible. Therefore, we collected our experiences during the
installation and execution of each assembler to share our observations (Tab. 4.1).

4.1.3 Results and discussion


In this comparative study, we investigated the performance of ten de novo assembly
tools on nine RNA-Seq data sets of five different species (including a viral
infection) and built over 200 single k-mer assemblies. A comprehensive online
supplement9 provides deep insights into the performance of each assembly tool on
each data set for various metrics.
For each selected evaluation metric (see 4.1.2), the three best-performing
assembly tools were scored for each of the nine data sets separately. The assembly
8
http://deweylab.biostat.wisc.edu/detonate/vignette.html
9
http://www.rna.uni-jena.de/supplements/the_dark_art/


strategies that performed the best for the H. sapiens data set were SPAdes-rna
(12/20; the assembler performed within the top three scores of 12 out of the 20 met-
rics) followed by SOAPdenovo-Trans (9/20), the Trans-ABySS assembly (9/20)
and Trinity (8/20) (Fig. 4.2 and Tab. 4.2).
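The top-three scoring used here can be sketched as follows. The assembler names and metric values in the example are hypothetical, the flag for metrics where lower values are better (e.g., misassemblies) is our own illustrative parameter, and ties are resolved simply by input order, which is an assumption of this sketch rather than a documented rule of the evaluation.

```python
def top_three(values, higher_is_better=True):
    """Return the set of assemblers ranked within the top three for one metric.

    values: dict mapping assembler name -> metric value
    Ties are resolved by insertion order (an assumption of this sketch).
    """
    ranked = sorted(values, key=values.get, reverse=higher_is_better)
    return set(ranked[:3])


def metric_scores(metrics):
    """Sum up one point per metric for every assembler in that metric's top three.

    metrics: dict mapping metric name -> (values dict, higher_is_better flag)
    """
    scores = {}
    for values, higher_is_better in metrics.values():
        for assembler in values:
            scores.setdefault(assembler, 0)
        for assembler in top_three(values, higher_is_better):
            scores[assembler] += 1
    return scores


# Hypothetical values for four assemblers and two metrics.
example_metrics = {
    "mapping rate (%)": ({"A": 95.0, "B": 90.0, "C": 85.0, "D": 70.0}, True),
    "misassemblies": ({"A": 500, "B": 20, "C": 300, "D": 100}, False),
}
print(metric_scores(example_metrics))
```

With 20 metrics, the resulting score of an assembler on one data set ranges from 0 to 20, which is the notation used in the text (e.g., 12/20).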
In the following sections, we will present the performance of each assembler over
all data sets based on the selected evaluation metrics (4.1.2). For the H. sapiens data
set (∼96 million strand-specific paired-end reads with a maximum length of 101 bp),
all 20 selected metrics and the scores for each of the ten assembly tools are shown in
Tab. 4.2. The tables for all other data sets can be found in Appendix B.1–B.8 and in
detail in the Electronic Supplement (Tab. S9). Detailed plots and further statistics
for all data sets and assembly tools can be found in the Electronic Supplement.
To get a general overview of the performance of each assembler, we summed
up the metric scores achieved for each data set to calculate an overall metric score
(OMS) for each assembler. Because of the similarity of the three human RNA-Seq
data sets infected with the Ebola virus (3, 7, and 23 h post infection; same read
length, paired-end, not strand-specific, roughly the same number of reads), we used
the mean of the three scores when accumulating the scores of all data sets. For
example, Trans-ABySS performed very well on all three Ebola-infected data sets
(10/20 each), whereas IDBA-Tran did not (4/20, 5/20, 4/20) (Fig. 4.2).
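The calculation of the overall metric score can be sketched as below. The per-data-set scores are the values shown in Fig. 4.2, read under the assumption that the columns follow the order of the figure's axis labels; the function and variable names are our own.

```python
from statistics import mean

DATA_SETS = [
    "E. coli", "C. albicans", "A. thaliana", "M. musculus", "H. sapiens",
    "HSA-EBOV-3h", "HSA-EBOV-7h", "HSA-EBOV-23h", "Simulated",
]
EBOV_SETS = {"HSA-EBOV-3h", "HSA-EBOV-7h", "HSA-EBOV-23h"}


def overall_metric_score(scores):
    """Sum of the metric scores, with the three EBOV data sets entering as their mean."""
    ebov = [value for name, value in scores.items() if name in EBOV_SETS]
    others = [value for name, value in scores.items() if name not in EBOV_SETS]
    return sum(others) + mean(ebov)


# Metric scores per data set as shown in Fig. 4.2 (column order assumed).
trans_abyss = dict(zip(DATA_SETS, [5, 9, 8, 9, 9, 10, 10, 10, 10]))
idba_tran = dict(zip(DATA_SETS, [7, 5, 5, 7, 4, 4, 5, 4, 8]))

print(overall_metric_score(trans_abyss))
print(round(overall_metric_score(idba_tran), 1))
```

Averaging the three EBOV samples prevents the three very similar data sets from dominating the overall score.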

Trans-ABySS

We executed Trans-ABySS with multiple k-mers (MK) and in strand-specific mode
(--SS; if suitable) on each data set (see Tab. S3). Over all data sets, Trans-ABySS
had the highest re-mapping rate compared to the other tools (98.56 % for C. albicans,
99.55 % for the simulated data; Fig. S4).
Trans-ABySS ranks in the midfield or worse regarding the optimal score
calculated by TransRate. The percentage of good mappings and the percentage of
uncovered bases are generally poor for Trans-ABySS.
The assemblies calculated with Trans-ABySS achieved the best RSEM-EVAL
scores over all nine data sets. No other assembler outperformed Trans-ABySS
regarding this metric. Therefore, the transcripts constructed by Trans-ABySS are
well supported by the reads used to build the assembly.
Trans-ABySS performed well in all BUSCO analyses and showed a high number
of complete (C) ortholog detections compared to the other tools (Fig. 4.3). However,
among those complete detections, fewer occur as complete single-copy
(CS) detections, whereas many occur multiple times in the assembly (complete and
duplicated; CD) (for example the C. albicans data set in Fig. S7). This might be
a result of the MK approach, if too many potential isoforms are assembled and not
merged accurately at the end of the assembly process. We observed similar results
for the MK runs of Oases. Regarding the number of fragmented (F) and missing
(M) BUSCOs, Trans-ABySS ranks among the best-performing tools.
Trans-ABySS achieved the highest OMS (overall metric score) of all assembly
tools (60; Fig. 4.2) and performed best for the EBOV-infected human data
sets and the simulated data of human chromosome 1. The lowest metric score was
achieved for the E. coli data set.


[Heat map values: metric score per assembler (rows, in clustering order) and data set (columns); OMS = overall metric score. Colour scale: metric score from 0 to 12.]

Assembler         E. coli  C. albicans  A. thaliana  M. musculus  H. sapiens  EBOV 3h  EBOV 7h  EBOV 23h  Simulated   OMS
Trans-ABySS             5            9            8            9           9       10       10        10         10    60
SPAdes-sc               8           12            6            7           5        7        6         6          9  53.3
Trinity                 5            5            7           10           8        2        8         9          9  50.3
SOAPdenovo-Trans        6            7            6            8           9        7        6         7          3  45.6
SPAdes-rna              5            7            5            7          12        9        9         6          6    50
IDBA-Tran               7            5            5            7           4        4        5         4          8  40.3
Oases                   5            4            6            1           2        6        2         6          7  29.6
BinPacker               4            4            4            4           5        5        3         4          3    28
Bridger                 6            3            1            6           2        3        2         4          1    22
Shannon                 0            4            3            1           4        7        9         4          4  22.6

Figure 4.2: Heat map showing, for each data set (column) and each assembler (row), the
summed-up metric score based on the 20 metrics presented in Sec. 4.1.2. For each metric, an
assembler got a point if the resulting assembly ranked within the top three results. The
hierarchical clustering of the metric scores divides the assembly tools into two groups, performing
generally better (upper half) and generally worse (lower half) on the tested data sets. The maximum
achievable metric score for the E. coli and A. thaliana data sets is 17 and not 20, because the
optimal score, the percentage of good mappings and the percentage of uncovered bases are only
calculated by TransRate in the case of paired-end data. Please note that no rnaQUAST statistics
were calculated for the Oases assembly of the HSA-EBOV-7h data set, because rnaQUAST was not
able to finish the calculations when this assembly was included. Numbers in brackets next to
the assembler names give the summed-up metric scores (overall metric score, OMS) for all nine
data sets. For the three similar human data sets infected with the Ebola virus, we added the mean
value to the OMS. BinPacker and Bridger, built on the same principles, performed similarly
and cluster together. However, BinPacker achieved more consistent scores. SPAdes in single-cell
mode worked best for the bacterial data set. SOAPdenovo-Trans worked generally well on
all real data sets, but was outperformed by other tools for the artificial data set. Interestingly,
Trinity was outperformed by other tools for the HSA-EBOV-3h data set, but worked well on the
later time points. Surprisingly, the RNA mode of SPAdes performed best on the human data set,
whereas SPAdes in single-cell mode achieved a much lower score. Details about the used metrics
can be found in the Electronic Supplement, Tab. S9 and Appendix Tab. B.1–B.8.

Table 4.2: Selected metrics based on the output of rnaQUAST, Hisat2, Detonate, TransRate and BUSCO for the transcripts assembled by all ten
assembly tools on the Homo sapiens RNA-Seq strand-specific paired-end library with read length 101 bp (accession number ENCSR000AED). Details and
many more statistics complementing this evaluation can be found in the Electronic Supplement, content S4–S8. In each row, the top three values are
indicated in bold italics. The RSEM-EVAL score is multiplied by 10^9. The number of ambiguous bases is given in thousands. N50 – the length of the shortest
contig in the assembly such that the accumulated bases of all contigs of this length or longer cover 50 % of all the bases in the assembly. F1 score – a measure
of a test's accuracy. An F1 score of 1 would mean that all nucleotides/contigs in the estimated true assembly were recovered with at least 90 % identity.
KC score – k-mer compression score reflecting the similarity of each assembly to Detonate's estimated “true” assembly.

Trans-ABySS Oases SOAP-Trans Trinity IDBA-Tran Shannon Bridger BinPacker SPAdes-sc SPAdes-rna
k-mer size 25,35,45,55,65 25,35,45,55,65 default default 25,35,45,55,65 default default default default default
Hisat2
Overall mapping rate 91.66 88.04 98.36 89.93 86.83 72.6 64.61 84.27 90.26 90.76
rnaQUAST
Transcripts >1000 bp 72685 207474 68662 27529 43201 22611 23516 31328 26245 15945
Database coverage 0.23 0.08 0.29 0.1 0.07 0.06 0.09 0.01 0.09 0.1
Misassemblies 2739 216128 2878 279 7329 5603 302 2837 1566 570
Mismatches per transcript 1.04 1.25 0.61 0.27 1.44 4.63 0.67 1.26 0.82 0.41

Average alignment length 781.94 343.48 258.13 218 654.41 2335.73 487.11 711.83 429.36 207.92
Mean isoform coverage 0.52 0.33 0.49 0.27 0.33 0.7 0.35 0.28 0.34 0.28
TransRate
N50 1613 1230 1913 3391 1386 3511 566 1446 641 16469
Reference coverage 0.23 0.09 0.27 0.09 0.09 0.07 0.08 0 0.08 0.09
Mean ORF percentage 51.47 42.09 51.09 48.02 45.1 42.57 52.46 55.7 48.28 55.04
Optimal score 0.08 0.02 0.08 0.27 0.14 0.07 0.25 0.07 0.32 0.35
Percentage good mappings 0.22 0.06 0.17 0.59 0.32 0.26 0.49 0.22 0.63 0.64
Percentage bases uncovered 0.66 0.94 0.66 0.33 0.42 0.84 0.02 0.5 0.04 0.31
Number of ambiguous bases 306314 843235 460747 241236 206635 72918 138699 117068 159111 186834
DETONATE
Nucleotide F1 0.4 0.18 0.49 0.57 0.48 0.15 0.55 0.35 0.58 0.56
Contig F1 0.02 0.02 0.2 0.21 0.01 0 0.02 0.02 0.02 0.09
KC score 0.49 0.24 0.56 0.37 0.4 0.37 0.29 0.42 0.36 0.33
RSEM EVAL -6.63 -1.18 -6.22 -9.03 -7.71 -1 -1.63 -8.95 -1.38 -1.34
BUSCO
Complete single-copy 1401 1321 2079 2151 2360 1010 1677 2302 2347 2551
Missing BUSCOs 1810 1922 1772 2164 1812 4078 2615 2133 2457 2392
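
The N50 definition given in the caption of Tab. 4.2 can be illustrated with a short sketch. The contig lengths in the example are hypothetical, and the function is our own illustration of the definition, not the implementation used by TransRate.

```python
def n50(contig_lengths):
    """Length of the shortest contig such that all contigs of this length or
    longer together cover at least 50 % of all bases in the assembly."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("assembly contains no contigs")
    half_of_total = sum(lengths) / 2
    accumulated = 0
    for length in lengths:
        accumulated += length
        if accumulated >= half_of_total:
            return length


# Hypothetical contig lengths (total 54 bases; 10 + 9 + 8 = 27 reaches 50 %).
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

Note that a high N50 alone does not indicate a good transcriptome assembly, since chimeric contigs also increase it.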

[Bar charts, panels A to F; x-axis: %BUSCOs (0 to 100). Legend: Missing (M), Fragmented (F), Complete (C) and duplicated (D), Complete (C) and single-copy (S). Counts per assembler as printed next to the bars:]

Panel A
Trans-ABySS  C:301 [S:234, D:67], F:314, M:166, n:781
Oases        C:299 [S:136, D:163], F:310, M:172, n:781
SOAP-Trans   C:316 [S:316, D:0], F:287, M:178, n:781
Trinity      C:281 [S:258, D:23], F:311, M:189, n:781
IDBA-Tran    C:296 [S:296, D:0], F:289, M:196, n:781
Shannon      C:280 [S:261, D:19], F:303, M:198, n:781
Bridger      C:285 [S:281, D:4], F:306, M:190, n:781
BinPacker    C:50 [S:48, D:2], F:20, M:711, n:781
SPAdes-sc    C:332 [S:332, D:0], F:277, M:172, n:781
SPAdes-rna   C:96 [S:96, D:0], F:315, M:370, n:781

Panel B
Trans-ABySS  C:1458 [S:348, D:1110], F:144, M:109, n:1711
Oases        C:1382 [S:611, D:771], F:189, M:140, n:1711
SOAP-Trans   C:1042 [S:1039, D:3], F:421, M:248, n:1711
Trinity      C:1369 [S:1279, D:90], F:180, M:162, n:1711
IDBA-Tran    C:1070 [S:1069, D:1], F:401, M:240, n:1711
Shannon      C:1088 [S:460, D:628], F:264, M:359, n:1711
Bridger      C:1416 [S:1149, D:267], F:162, M:133, n:1711
BinPacker    C:1416 [S:1146, D:270], F:160, M:135, n:1711
SPAdes-sc    C:1515 [S:1510, D:5], F:112, M:84, n:1711
SPAdes-rna   C:1464 [S:1458, D:6], F:154, M:93, n:1711

Panel C
Trans-ABySS  C:1119 [S:732, D:387], F:97, M:224, n:1440
Oases        C:1108 [S:546, D:562], F:84, M:248, n:1440
SOAP-Trans   C:1058 [S:1042, D:16], F:134, M:248, n:1440
Trinity      C:1094 [S:858, D:236], F:124, M:222, n:1440
IDBA-Tran    C:930 [S:908, D:22], F:241, M:269, n:1440
Shannon      C:1049 [S:804, D:245], F:95, M:296, n:1440
Bridger      C:1103 [S:978, D:125], F:108, M:229, n:1440
BinPacker    C:262 [S:203, D:59], F:16, M:1162, n:1440
SPAdes-sc    C:1077 [S:1053, D:24], F:139, M:224, n:1440
SPAdes-rna   C:878 [S:859, D:19], F:205, M:357, n:1440

Panel D
Trans-ABySS  C:4104 [S:2079, D:2025], F:316, M:1772, n:6192
Oases        C:3588 [S:1321, D:2267], F:682, M:1922, n:6192
SOAP-Trans   C:2625 [S:2151, D:474], F:1403, M:2164, n:6192
Trinity      C:3925 [S:1401, D:2524], F:457, M:1810, n:6192
IDBA-Tran    C:1682 [S:1677, D:5], F:1895, M:2615, n:6192
Shannon      C:3385 [S:2302, D:1083], F:674, M:2133, n:6192
Bridger      C:3909 [S:2360, D:1549], F:471, M:1812, n:6192
BinPacker    C:2009 [S:1010, D:999], F:105, M:4078, n:6192
SPAdes-sc    C:2357 [S:2347, D:10], F:1378, M:2457, n:6192
SPAdes-rna   C:2564 [S:2551, D:13], F:1236, M:2392, n:6192

Panel E
Trans-ABySS  C:4135 [S:1938, D:2197], F:289, M:1768, n:6192
Oases        C:4055 [S:704, D:3351], F:320, M:1817, n:6192
SOAP-Trans   C:3667 [S:3362, D:305], F:670, M:1855, n:6192
Trinity      C:3718 [S:1873, D:1845], F:686, M:1788, n:6192
IDBA-Tran    C:2134 [S:2128, D:6], F:1841, M:2217, n:6192
Shannon      C:3777 [S:1108, D:2669], F:460, M:1955, n:6192
Bridger      C:3979 [S:2471, D:1508], F:428, M:1785, n:6192
BinPacker    C:3590 [S:1909, D:1681], F:210, M:2392, n:6192
SPAdes-sc    C:3493 [S:3481, D:12], F:866, M:1833, n:6192
SPAdes-rna   C:3617 [S:3606, D:11], F:702, M:1873, n:6192

Panel F
Trans-ABySS  C:563 [S:290, D:273], F:90, M:18, n:671
Oases        C:613 [S:86, D:527], F:36, M:22, n:671
SOAP-Trans   C:289 [S:191, D:98], F:273, M:109, n:671
Trinity      C:597 [S:203, D:394], F:52, M:22, n:671
IDBA-Tran    C:226 [S:226, D:0], F:303, M:142, n:671
Shannon      C:242 [S:143, D:99], F:65, M:364, n:671
Bridger      C:525 [S:316, D:209], F:118, M:28, n:671
BinPacker    C:527 [S:256, D:271], F:115, M:29, n:671
SPAdes-sc    C:393 [S:393, D:0], F:218, M:60, n:671
SPAdes-rna   C:363 [S:363, D:0], F:233, M:75, n:671

Figure 4.3: Selected BUSCO assessment results for E. coli (A), C. albicans (B), A. thaliana (C),
H. sapiens (D), HuH7 cells infected with EBOV 7 h post infection (E) and flux simulated reads
of human chromosome 1 (F). The numbers indicate the absolute number of complete (C), complete
and single-copy (S), complete and duplicated (D), fragmented (F), and missing (M) BUSCOs; n is
the total number of BUSCOs searched. BUSCO results for all other data sets can be found in the
Electronic Supplement, Fig. S7.
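
The count strings shown in Fig. 4.3 can be converted into the percentages plotted on the x-axis with a small helper. The following sketch parses one such string (the Trans-ABySS entry of panel A); the function names and the consistency check are our own additions for illustration.

```python
import re

BUSCO_PATTERN = re.compile(
    r"C:(?P<C>\d+)\s*\[S:(?P<S>\d+),\s*D:(?P<D>\d+)\],\s*"
    r"F:(?P<F>\d+),\s*M:(?P<M>\d+),\s*n:(?P<n>\d+)"
)


def parse_busco(summary):
    """Parse a string such as 'C:301 [S:234, D:67], F:314, M:166, n:781'."""
    match = BUSCO_PATTERN.search(summary)
    if match is None:
        raise ValueError(f"not a BUSCO count string: {summary!r}")
    counts = {key: int(value) for key, value in match.groupdict().items()}
    # Sanity checks: S + D must equal C, and C + F + M must equal n.
    if counts["S"] + counts["D"] != counts["C"]:
        raise ValueError("single-copy and duplicated do not add up to complete")
    if counts["C"] + counts["F"] + counts["M"] != counts["n"]:
        raise ValueError("categories do not add up to n")
    return counts


def busco_percentages(counts):
    """Percentage of each plotted category (S, D, F, M) relative to n."""
    return {key: 100.0 * counts[key] / counts["n"] for key in ("S", "D", "F", "M")}


counts = parse_busco("C:301 [S:234, D:67], F:314, M:166, n:781")
for category, percent in busco_percentages(counts).items():
    print(f"{category}: {percent:.1f} %")
```

Expressing the counts relative to n makes panels with different BUSCO sets (e.g., n:781 versus n:6192) directly comparable.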


Oases
Oases was also run in MK and -strand_specific mode if suitable (Tab. S3).
The re-mapping rate was good (>85 %) for most data sets; however, for the simulated
human data (73.26 %), the HSA-EBOV-23h data (70.05 %) and the E. coli data
(49.16 %) it dropped well below this threshold. For the E. coli data set, a similar
behavior could be observed for SOAPdenovo-Trans (56.62 %) and IDBA-Tran
(34.31 %).
Oases introduced the highest number of ambiguous bases in the assemblies in
comparison to the other tools and ranks among the last places regarding the
TransRate statistics. Based on the optimal score calculated by TransRate,
Oases occupies the last place for six out of the seven evaluated data sets.
Oases ranks in the last third of the RSEM-EVAL scores calculated by
Detonate.
Based on the Oases assemblies, a comparatively high number of complete
BUSCOs could be detected; however, many duplicate hits are included, which might
be a result of the MK approach (Fig. 4.3). Oases assembled the highest number of
complete BUSCOs for the simulated data set (∼90 %), but also had the highest
number of duplicate BUSCOs within these hits (∼80 %).
Regarding the selected metrics, Oases performed best for the human simulated
data (7/20), the HSA-EBOV-3h and -23h samples (6/20 each) and the plant data
(6/17). The metric score for the HSA-EBOV-7h data set may be comparatively low
because we were not able to calculate rnaQUAST statistics for this assembly. Oases
achieved an OMS of only 29.6 (Fig. 4.2).

SOAPdenovo-Trans
SOAPdenovo-Trans was run on a single k-mer, because the tool has no built-in
function to merge assemblies from multiple k-mers. Strand-specific assembly is not
supported; according to the authors, this is planned for a future release to further
improve the algorithm [63]. The re-mapping rate was generally good (>85 %), except
for the E. coli data set.
SOAPdenovo-Trans performed quite well based on the TransRate statistics.
Almost all of the conducted assemblies achieved high scores for the percentage
of good mappings, the percentage of uncovered bases, the number of ambiguous
bases and the optimal score calculated by TransRate. In most cases,
only the SPAdes assemblies outperformed SOAPdenovo-Trans regarding the
TransRate metrics.
The RSEM-EVAL scores vary depending on the assembled RNA-Seq data set.
For the HSA-EBOV-23h and M. musculus samples SOAPdenovo-Trans performed
well, whereas for the bacterial, the fungal, the plant and the simulated RNA-Seq
data the tool is among the last three assemblers regarding the RSEM-EVAL metric.
SOAPdenovo-Trans ranks in the midfield regarding the number of assembled
complete BUSCOs. The number of CD BUSCOs is very low (Fig. 4.3),
which correlates with the tool's ability to detect different isoforms (see mean isoform
coverage calculated with rnaQUAST, Tab. S5). However, this might also be a result
of the single k-mer approach.


SOAPdenovo-Trans achieved a good OMS of 45.6 (Fig. 4.2). The assembler
performed well on all evaluated data sets (metric scores between 6 and 9) and only
showed a lower score for the artificial data set of human chromosome 1 (3/20).

Trinity
Trinity was run on a single k-mer and, if suitable, in strand-specific mode on each
data set (Tab. S3). The re-mapping rate was generally good and ranged between
85.56 % (E. coli) and 97.29 % (C. albicans).
Trinity assemblies rank in the midfield regarding the TransRate metrics;
in some cases (C. albicans, HSA-EBOV-23h) the assemblies are even among
the top four optimal TransRate scores.
Trinity performed very well on almost all data sets (except HSA-EBOV-3h)
by scoring among the top three RSEM-EVAL values.
Trinity performed well regarding the detection of complete BUSCOs for most
of the data sets (Fig. 4.3). For the eukaryotic data sets, approximately half
of the detected complete BUSCOs are included multiple times in the assembly,
which could be a result of the sub-graphs Trinity relies on to detect different
isoforms of one transcript.
The accumulated metric scores for the Trinity assemblies resulted in one of
the top three scores (OMS=50.3, Fig. 4.2). Trinity achieved the best score for
the M. musculus data set (10/20), which might not be surprising, because this
data set was among those used for the evaluation of the tool in the Trinity
paper [223]. Trinity also achieved good scores for the artificial data set (9/20)
and the HSA-EBOV-23h data (9/20). Interestingly, Trinity performed generally
well on the virus-infected data sets, except for the 3 h sample (2/20).

IDBA-Tran
IDBA-Tran was run with multiple k-mers and has no option for strand-specific
assembly.
For the E. coli (34.31 %), C. albicans (86.34 %), H. sapiens (64.61 %), and
H. sapiens EBOV 7 h (76.39 %) data sets, the tool showed the lowest re-mapping rates
in comparison to all other assemblers. The best mapping rate was achieved for
A. thaliana with 89.04 %. All other mapping rates are between 48.37 % (human
EBOV 23 h) and 85.34 % (human simulated).
IDBA-Tran shows the lowest percentage of uncovered bases in the assemblies,
meaning that the contigs constructed by the tool are highly accurate. Accordingly,
the number of ambiguous bases is very low. Furthermore, some of the IDBA-Tran
assemblies rank within the top three assemblies regarding the optimal score
calculated by TransRate. The optimal scores of the IDBA-Tran assemblies are
comparable to the SOAPdenovo-Trans scores. Overall, the TransRate metrics
of the IDBA-Tran assemblies are good.
IDBA-Tran performed poorly regarding the Detonate RSEM-EVAL calculations.
For the E. coli, C. albicans, M. musculus, H. sapiens and HSA-EBOV-7h
data sets IDBA-Tran is placed last regarding this metric, and it never reaches the
top five (Tab. S8).


Furthermore, IDBA-Tran is one of the tools with the lowest number of complete
BUSCOs and the highest number of missing BUSCOs (Fig. 4.3 and Fig. S7).
Among the few complete BUSCOs, the assembler included almost no duplicates.
Therefore, it seems that IDBA-Tran (although an MK approach)
does not perform well in constructing full-length transcripts and different isoforms.
Furthermore, the number of fragmented BUSCOs in the IDBA-Tran assemblies is
comparatively high.
IDBA-Tran is placed in the midfield of all metric scores (OMS=40.3, Fig. 4.2)
and showed the best performance for the artificial data set (8/20), the E. coli data
(7/17), and the M. musculus data (7/20).

Shannon
The Shannon assembler was used with the single default k-mer value and, if
suitable, in strand-specific mode (--ss).
Shannon showed the most variable re-mapping rates, ranging between 30.77 % for
the human simulated data set and 96.51 % for A. thaliana. Interestingly, Shannon
had a low mapping rate on the simulated data, whereas all the other tools (except
Oases, 73.26 %) showed a mapping rate >85 %.
The Shannon assemblies do not result in good TransRate optimal scores.
For most of the data sets, the Shannon assemblies rank in the lower third of
optimal scores. However, the percentage of uncovered bases lies within the midfield
of all scores and Shannon does not introduce many ambiguous bases into the
assembled transcriptome.
The RSEM-EVAL scores of Shannon vary among the assembled data sets. For
some assemblies, the tool performed very well (H. sapiens, HSA-EBOV-3h and
-7h), whereas for others (A. thaliana, HSA-EBOV-23h, simulated data) it failed
completely regarding this metric.
Shannon ranks in the midfield regarding the number of assembled complete
BUSCOs; however, the tool showed a relatively high number of duplicated
hits (Fig. 4.3). So far, this behavior was mainly observed for the MK approaches
Trans-ABySS and Oases. Interestingly, for the simulated data, Shannon
showed the highest number of missing BUSCOs in comparison to the other
assemblers (Fig. S7).
Shannon achieved one of the lowest accumulated metric scores (OMS=22.6,
Fig. 4.2). The best metric scores were obtained for the assemblies of the
HSA-EBOV-3h and -7h data sets (7/20 and 9/20). All other metric scores are below
5.

Bridger
Bridger can only handle single k-mer values between 19 and 32, with a default of
25. Whereas this range might be acceptable for most assembly applications of
short-read RNA-Seq data, longer k-mers can be advantageous, especially for longer
reads (as produced by an Illumina MiSeq, >150 nt). Here, we used the default
k-mer size in the assemblies performed by Bridger. If possible, we also used
the strand-specific option of the tool (--SS_lib_type); however, for some of the


strand-specific RNA-Seq data sets (M. musculus and H. sapiens) Bridger failed,
so we executed the tool in the default unstranded mode. This version of the tool
seems to have a problem handling strand-specific paired-end data. The
strand-specific assembly of the single-end E. coli data (--SS_lib_type F) ran
without problems.
Bridger showed quite good re-mapping rates, ranging from 87.35 % (E. coli) to
96.72 % (C. albicans).
Over almost all TransRate metrics, the Bridger assemblies rank in the
midfield of scores, far away from the top scores produced by the
SPAdes, SOAPdenovo-Trans and IDBA-Tran assemblies.
Bridger assemblies are among the top four RSEM-EVAL scores over all data
sets; therefore, the tool performs generally well according to this metric.
Furthermore, Bridger performed well in the detection of complete BUSCOs,
with a moderate number of duplicated hits. The number of missing BUSCOs is low
(Fig. 4.3).
Bridger performed best for the E. coli data set (6/17) and the M. musculus
data set (6/20). However, the general performance of the tool is comparatively
modest, underlined by the lowest overall metric score of all assemblers (OMS=22,
Fig. 4.2).

BinPacker
BinPacker, built on the principles of Bridger, was also executed on a single
k-mer value and, if suitable, in strand-specific mode (-m F|RF).
The re-mapping rate of BinPacker was quite low for some data sets.
For HSA-EBOV-3h (36.6 %) and M. musculus (54.31 %) BinPacker showed the
lowest mapping rates in comparison to the other tools. For all other data sets, the
mapping rate varies between 67.15 % (A. thaliana) and 96.66 % (C. albicans).
The BinPacker assemblies behave similarly to the Bridger assemblies regarding
the TransRate metrics, but are slightly worse in direct comparison, placing
BinPacker among the worst three tools according to the TransRate statistics.
On the other hand, BinPacker introduces only a low number of ambiguous bases
into the assemblies.
BinPacker ranks in the midfield or on the last places regarding the RSEM-EVAL
score. Only on the human simulated data does the tool achieve a score similar
to Bridger and reach the third place (behind Trinity and Trans-ABySS).
Regarding the number of complete BUSCO detections, BinPacker performed
well on the C. albicans, HSA-EBOV-7h and human simulated data sets, but
completely failed for the others (Fig. 4.3). In these cases, up to over 90 % of the
BUSCOs included in the respective database could not be identified in the
BinPacker assemblies. Regarding the ortholog detection, BinPacker had the worst
performance.
Regarding the selected metrics, the performance of BinPacker is similar to the
performance of Bridger (OMS=28, Fig. 4.2). This observation is not surprising,
because BinPacker is built on the principles of Bridger. BinPacker showed
a more consistent behaviour than Bridger regarding the metric scores, but
only reached scores between 3 and 5 (Fig. 4.2).


SPAdes-sc and -rna


Although originally developed for single-cell genome assemblies and smaller bacterial-
sized genomes, we also included SPAdes in our evaluation. It was previously re-
ported that the assembler performs well on RNA-Seq data when used in the single-
cell mode [216]. This might be due to the uneven coverage optimization imple-
mented for single-cell data, that may also fit very well the behavior of low and high
expressed transcripts. Furthermore, SPAdes has a special RNA-Seq mode. There-
fore, we evaluated the performance of SPAdes in single-cell (--sc; SPAdes-sc)
and RNA-Seq (--rna; SPAdes-rna) mode. Here, we will discuss both parameter
options of the assembler together.
The re-mapping rates for both SPAdes parameter options rank among the
top mapping rates for all data sets (Fig. S4). Furthermore, the mapping rates
of SPAdes-sc and SPAdes-rna are on a comparable level, although in
-rna mode only a single k-mer is used. The mapping rates range between
87.66 % (E. coli, SPAdes-sc) and 97.25 % (H. sapiens simulated, SPAdes-sc)
and are consistently above 90 % except for the bacterial data set. For the
bacterial data set, only Trans-ABySS outperforms SPAdes regarding the mapping rate.
Based on the TransRate metrics, SPAdes builds the most accurate assemblies.
For almost all data sets, the SPAdes-sc and -rna assemblies achieved the
highest percentage of good mappings and the lowest percentage of uncovered bases,
followed by SOAPdenovo-Trans. For five out of the seven paired-end data sets,
SPAdes-rna achieved the best optimal score calculated by TransRate (in the
other two cases the single-cell mode performed slightly better). Furthermore, the
SPAdes assemblies introduced only a low to moderate number of ambiguous
bases in the final contigs.
The RSEM-EVAL scores of the SPAdes assemblies vary widely among the different
RNA-Seq data sets. For some samples, SPAdes-sc achieves a better score than
SPAdes-rna, and vice versa. For the bacterial data set, SPAdes-rna ranks
fourth and for the plant data even third. For the H. sapiens data set, SPAdes
did not perform well in either mode regarding the RSEM-EVAL score. For the
HSA-EBOV-3h data set, SPAdes-rna ranks second, outperformed only by
Trans-ABySS.
SPAdes ranks in the midfield of BUSCO detections, with the --sc mode
performing generally better than the --rna mode (Fig. 4.3). When only comparing
the amount of complete single-copy orthologs, the SPAdes assemblies generally
outperform the other assemblers. Furthermore, in both modes, SPAdes detects
only very few duplicated complete BUSCOs. On the E. coli, A. thaliana, and
M. musculus data sets, SPAdes-rna missed a higher number of BUSCOs in
comparison to the other tools. This might be a result of the single k-mer used in
--rna mode. SPAdes-sc is one of the best-performing tools for the detection of
complete single-copy BUSCOs in the C. albicans transcriptome.
SPAdes-sc ranks within the top three tools regarding the summed-up metric
score (OM S=53.3, Fig. 4.2), only outperformed by Trans-ABySS (OM S=60).
Furthermore, SPAdes-sc reached one of the highest metric scores for the
C. albicans assembly (12 points). SPAdes-sc also achieved the highest metric
score for the E. coli assembly among all assembly tools. This might be an
advantage of the SPAdes algorithm, which was originally developed for small,
bacterial-sized genomes. Like SPAdes in single-cell mode, SPAdes-rna also
performed generally well on almost all data sets. However, its lowest scores of 5
were achieved for the single-end data. Therefore, it seems that SPAdes in RNA
mode works better with paired-end RNA-Seq data. SPAdes-rna also reached one of
the highest metric scores (12) for the real human data set (without any
infection). For this data set, the single-cell mode of SPAdes achieved a low
score of only 5 points (Fig. 4.2). Based on these observations, we suggest that
the RNA mode of SPAdes should be applied for larger eukaryotic RNA-Seq data sets.
It remains unclear why SPAdes in --rna mode works with only a single
k-mer size (55 by default), although multiple k-mers can be used in genome
assembly mode. In the online manual, the authors strongly recommend not changing
this parameter (see footnote 10). However, the algorithm might be further
improved by also allowing multiple k-mer values in the --rna mode of SPAdes.

Usability
We further rated our experiences regarding the installation and usability of each
tool (Tab. 4.1). These experiences might be subjective; nevertheless, we want to
share them here to give inexperienced users an idea of how difficult it is to get
each tool installed and executed. Some of the tools rely on many dependencies
and/or were difficult to compile, at least on our system without administrative
permissions (Shannon, SOAPdenovo-Trans, Trans-ABySS), while others could be
installed straight out of the box (SPAdes). Furthermore, some assemblers need
additional parameter files for execution (SOAPdenovo-Trans), are circuitous to run
(Trans-ABySS, Oases, SOAPdenovo-Trans), needed additional preprocessing
steps of the reads for some of the data sets (IDBA-Tran assumes paired-end
reads to be in order forward–reverse), or simply did not terminate for all the data
sets (Bridger), while others could be executed in a straightforward manner
without problems (Trinity, SPAdes, BinPacker, IDBA-Tran).
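The read-order requirement mentioned for IDBA-Tran can be met by interleaving the forward and reverse read files. The following is a minimal sketch of such a preprocessing step, assuming plain four-line FASTQ records; the function names are illustrative and this is not the script used in this study.

```python
from itertools import islice, zip_longest
from typing import Iterable, Iterator, List


def fastq_records(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield FASTQ records as lists of four lines (header, sequence, '+', qualities)."""
    it = iter(lines)
    while True:
        record = list(islice(it, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield [line.rstrip("\n") for line in record]


def interleave(forward: Iterable[str], reverse: Iterable[str]) -> Iterator[str]:
    """Yield the lines of an interleaved file: forward read, then its reverse mate."""
    for fwd, rev in zip_longest(fastq_records(forward), fastq_records(reverse)):
        if fwd is None or rev is None:
            raise ValueError("forward and reverse input differ in read count")
        yield from fwd
        yield from rev
```

Both arguments can be open file handles, so the inputs are streamed rather than loaded into memory. The sketch does not check that the read names of each pair match.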
Bridger failed in the path search step for some of the generated sub files.
Therefore, we combined the transcript output manually, because this is the last
step of the tool anyway. Furthermore, we had to start Bridger twice for each
data set, because the tool crashed each time after the first start but continued
with the assembly when started a second time on the same output folder.
In the past, Oases and Trans-ABySS were always circuitous to run, because
the corresponding genome assemblers Velvet and ABySS needed to be executed
first with multiple k-mers. These difficulties have been mitigated to some extent
by new wrapper scripts provided by the developers, which automatically execute the
underlying genome assemblers.

Computational efficiency
Since de novo transcriptome assembly can involve the analysis of large sequencing
data, computational efficiency is an important benchmark, especially for deep
sequencing projects and large sample sizes. Furthermore, it is highly recommended
to run multiple assemblies with different tools and parameter settings (for example
different k-mers), so computation time is an important property of each tool.
Electronic Supplement Fig. S10 summarizes the computation time and the maximum
memory peak of all data sets and assemblers.

10
http://spades.bioinf.spbau.ru/release3.10.1/rnaspades_manual.html

Runtime. SOAPdenovo-Trans appeared to be by far the fastest algorithm,
followed by IDBA-Tran, BinPacker, SPAdes-sc, SPAdes-rna, Shannon and
Bridger. As expected, older tools such as Oases and Trans-ABySS, which are
additionally based on a multiple k-mer strategy (MK), are comparatively slow. If
those tools were executed on only one k-mer, their runtime would be comparable to
or even lower than that of the other assemblers. Of course, SOAPdenovo-Trans can
also be run on different k-mers, but no automatic merge function for the different
assemblies is implemented. The Trinity runtime lies between the faster tools and
the slower MK approaches, although the tool relies on one k-mer only. Although
also based on an MK strategy, IDBA-Tran and SPAdes-sc are much faster than
the older MK algorithms and can compete with the other tools in terms of
speed.

Maximum memory consumption. IDBA-Tran appeared to be the tool with the
lowest memory consumption, estimated over all data sets. Shannon showed very
high memory peaks, especially for the larger data sets (more than 100 GB for the
EBOV-infected human samples, see Fig. S10), followed by Oases and Trinity.
The other assemblers ran with comparable amounts of memory.
When running Trinity, we observed very high memory peaks in the first phase of
the assembly (meaning the first seconds up to a few minutes, depending on the size
of the input data set). For example, in the first five minutes of the execution
on all human data sets we noticed memory peaks of ∼240 GB with Trinity (data
not shown). Immediately after this initial peak, the memory consumption drops
to a comparatively normal level (see Fig. S10; here we removed the high initial
peak for Trinity to get a better overall comparison of the memory usage of all
assemblers). The high memory consumption in the first phase might be due to the
many individual de Bruijn graphs built by Trinity based on partitions of the
sequence data [223].
Practitioners should take care to plan sufficient computing power and time when
running many tools with different parameter settings, especially when processing
projects with high sequencing depth and large sample sizes.
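How runtime and memory peaks were recorded is not described in this section. One possible way to collect both values for a single assembler call on a Unix system is sketched below; the helper name is hypothetical, and the command shown is a stand-in for a real assembler call.

```python
import resource
import subprocess
import sys
import time
from typing import List, Tuple


def run_and_measure(cmd: List[str]) -> Tuple[float, int]:
    """Run cmd and return (wall-clock seconds, peak resident set size of child processes)."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    elapsed = time.perf_counter() - start
    # ru_maxrss covers all child processes waited for so far and reports the
    # largest peak among them; the unit is kilobytes on Linux, bytes on macOS.
    peak = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return elapsed, peak


if __name__ == "__main__":
    # Stand-in command that allocates roughly 50 MB
    seconds, peak = run_and_measure([sys.executable, "-c", "x = bytearray(50_000_000)"])
    print(f"wall time: {seconds:.2f} s, peak RSS: {peak}")
```

Because the reported peak is the maximum over all children of the calling process, each tool should be measured from a fresh process if several tools are compared.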

Detour: Viral de novo assembly


Although it was neither the main focus of this study nor within the scope of the
compared tools, we were interested in how the different assemblers deal with viral
contamination in RNA-Seq data and whether they are able to construct viral genomes.
Therefore, we used Blast to search for contigs in the assemblies that match the
full genome of the Ebola virus, strain Zaire, Mayinga (GenBank: NC_002549).
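To illustrate how such a search can be summarized, the sketch below extracts the longest hit against a reference from BLAST tabular output (`-outfmt 6`). It assumes the contigs were used as queries and the viral genome as subject; the function name and the example rows are made up for illustration and are not results of this study.

```python
import csv
from typing import Iterable, Optional, Tuple

# Default column order of BLAST tabular output (-outfmt 6)
BLAST_COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
                 "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def longest_hit(rows: Iterable[str], subject_id: str) -> Optional[Tuple[str, int, float]]:
    """Return (contig ID, alignment length, percent identity) of the longest hit."""
    best: Optional[Tuple[str, int, float]] = None
    reader = csv.DictReader(rows, fieldnames=BLAST_COLUMNS, delimiter="\t")
    for row in reader:
        if row["sseqid"] != subject_id:
            continue
        hit = (row["qseqid"], int(row["length"]), float(row["pident"]))
        if best is None or hit[1] > best[1]:
            best = hit
    return best


# Made-up rows for illustration only
example = [
    "contig_1\tNC_002549.1\t99.90\t8500\t8\t0\t1\t8500\t1\t8500\t0.0\t15600",
    "contig_2\tNC_002549.1\t99.95\t18900\t9\t0\t1\t18900\t20\t18919\t0.0\t34800",
]
print(longest_hit(example, "NC_002549.1"))
```

The subject identifier depends on how the reference FASTA header was written, so it is passed in as a parameter rather than hard-coded.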
The Ebola virus (EBOV) consists of a single-stranded RNA genome with negative
orientation that is approximately 19 kb in size and encodes seven structural
proteins [224]. As we assembled the three human samples infected with EBOV at
three different time points individually, we were able to investigate how the
different assemblers perform on varying amounts of viral reads in the data
(3 h ∼0.1 % viral reads, 7 h ∼2 %, 23 h ∼20 %). Details about the amount of viral
reads in each data set can be found in Sec. 5.2 and Appendix A.2.
Interestingly, with a higher amount of viral reads, the performance of most of
the assembly tools dropped. For example, Trans-ABySS was able to construct
the full EBOV genome out of the 3 h (18,926 nt, 99.984% sequence similarity) and
7 h (18,903 nt, 99.974%) data set, but failed on the 23 h data set with a viral read
contamination of roughly 20 % (many small contigs, longest hit: 8,500 nt).
In general, Trans-ABySS, SOAPdenovo-Trans, Shannon, Bridger, BinPacker
and SPAdes (-sc and -rna mode) performed well and constructed the full
EBOV genome out of the 3 h data set. In the BinPacker assembly we found
only one homologous sequence with a length of 18,896 nt and 99.984% similarity.
Exactly the same contig was found in the assembly of Bridger, the precursor tool
of BinPacker. Oases produced many small contigs and a 16,149 nt hit with
similarity to the EBOV genome.
On the 7 h data set, with ∼2 % viral reads, Trans-ABySS, SOAPdenovo-Trans
and Shannon performed best. However, the longest hit found in the
Shannon assembly comprises only 17,107 nt, accompanied by many other fragments of
different sizes with similarity to parts of the EBOV genome. Bridger and
BinPacker were only able to construct the same 10 kbp partial EBOV genome.
SPAdes-sc and SPAdes-rna assembled viral contigs up to a length of 12 kbp and
14 kbp, respectively.
Out of the 23 h data set (∼20 % viral reads), only SOAPdenovo-Trans was
able to construct the full EBOV genome (18,901 nt, 99.53%), although the assembly
also included many small contigs with similarity to the viral genome. Bridger and
BinPacker constructed contigs with lengths of 14.8 kbp and 12 kbp, respectively.
None of the other assembly tools was able to construct longer contigs with
similarity to the EBOV genome out of this data set.
Interestingly, Trinity was the only tool that was not able to construct any
full-length EBOV genome out of the three data sets.
In summary, SOAPdenovo-Trans performed very well on all three data sets by
constructing accurate full-length contigs with high similarity to the EBOV genome.
Therefore, it could be interesting to evaluate the performance of SOAPdenovo-Trans
for the construction of RNA viral genomes out of meta-transcriptomic RNA-Seq data
in the future. If the amount of viral reads is low (∼0.1 %), all assembly tools ex-
cept Trinity, Oases and IDBA-Tran produced accurate viral contigs with high
similarity to the EBOV genome and a length >18 kbp.

4.1.4 Conclusions and future perspectives


In this section, we presented a large-scale comparative study by applying ten
de novo assembly tools to eight RNA-Seq data sets comprising different kingdoms of
life and one artificial data set of human chromosome 1. Overall, we calculated
more than 200 single assemblies and evaluated their performance on different
metrics. All results are summarized in a comprehensive Electronic Supplement (see
footnote 11) that is easily extensible by more RNA-Seq data sets, new assembler
versions, parameter settings and tools.
With the selection of the RNA-Seq data sets, we aimed to represent a broad
variety of different species and Illumina short-read sequencing setups. The data sets
consist of read data with different insert sizes, single-end and paired-end layouts,
strand-specific and non-strand-specific protocols, and various read lengths and
sequencing depths.
We further evaluated reference-based metrics and de novo metrics that are based
only on the input read data and the final assembly itself. There is still no general
consensus on which metrics should be used for an appropriate evaluation of de novo
transcriptome assemblies. To complicate matters, we observed that some metrics
contradict each other, such as the optimal assembly score calculated by TransRate
and the RSEM-EVAL score of Detonate. For example, assemblies of the Homo
sapiens simulated data set achieved the best RSEM-EVAL scores for Shannon,
Trans-ABySS and Trinity, whereas IDBA-Tran performed worst (Electronic
Supplement Tab. S9 and Appendix Tab. B.8). However, IDBA-Tran achieved the
third-best optimal TransRate score, only outperformed by the two SPAdes
assemblies, and Shannon ranks next to last regarding this metric.
We conclude that a careful selection of evaluation metrics is necessary to select the
best-performing results out of multiple assembly runs. Based on our observations,
we suggest using reference-free metrics such as those provided by the TransRate
software. Generally, the optimal assembly score calculated by TransRate seems to
be a good measure for the quality of an assembly. Assemblies that needed fewer
contigs for a comprehensive description of the whole transcriptome also achieved
good TransRate scores in most cases. However, this score can currently only be
calculated for paired-end RNA-Seq data. If reference-based metrics are to be
included, the mean isoform coverage calculated by rnaQUAST as well as the BUSCO
scores are good metrics for identifying the best assembly results.
SPAdes, although originally developed as a de novo assembly tool for small
genomes, also produced highly accurate transcriptome assemblies in both modes, for
single-cell (SPAdes-sc) and RNA-Seq data (SPAdes-rna). Interestingly, the
single-cell mode outperformed the RNA mode for some of the data sets. This might
be a result of the missing multiple k-mer approach in the RNA mode. Therefore,
extending the transcriptome assembly mode of SPAdes to take advantage of multiple
k-mers, as already implemented in the genome and single-cell mode, might further
improve the performance of the tool for de novo transcriptome assembly.
Taking a closer look at the BUSCO results, SPAdes produced in both modes
the lowest amount of complete and duplicated transcripts (Fig. 4.3). This could
indicate that SPAdes merges highly similar transcripts into single contigs, thereby
losing similar isoforms. This behavior can also be observed when looking at the
mean isoform coverage calculated with rnaQUAST (Electronic Supplement Tab. S5
and S9, Appendix Tab. B.1–B.8). Generally, SPAdes ranks in the midfield or
worse regarding this metric. In particular, SPAdes-rna does not perform well in
constructing different isoforms, most likely as a result of the missing multiple
k-mer approach.

11
http://www.rna.uni-jena.de/supplements/the_dark_art/
Oases also performed generally well, for example regarding the
BUSCO results (Fig. 4.3). However, the tool produced the highest amount of
complete and duplicated hits, which might indicate that highly similar isoforms
derived from the multiple k-mer approach are not efficiently merged. For all data
sets, Oases also produced the highest number of contigs, but did not achieve the
best database coverage in all test cases. For example, the Oases assembly of the
H. sapiens data set comprises ∼207,000 transcripts with a length >1000 bp,
covering 8 % of the reference transcripts (Electronic Supplement Tab. S9, Tab. 4.2).
In comparison, the Trans-ABySS assembly needs only ∼68,000 contigs to achieve a
database coverage of 29 %. Therefore, Oases can create good assembly results, but
also produces big assemblies with many contigs that might complicate and confuse
downstream analyses.
SOAPdenovo-Trans was the fastest tool on all data sets and outperformed
all other assemblers regarding the runtime (Electronic Supplement Fig. S10).
Combined with its moderate memory consumption, this makes SOAPdenovo-Trans
the most resource-efficient tool evaluated in this study. However, it might be
interesting to run multiple k-mer assemblies with SOAPdenovo-Trans and use another
assembly merge strategy (e.g., adopted from Oases or Trans-ABySS) to merge the
final transcripts resulting from each run. In general, multiple k-mer approaches
performed better than single k-mer approaches.
Here, we summarize some key conclusions from the comparative study:
(I) No tool performed dominantly best for all data sets. However, Trans-ABySS,
Trinity, SOAPdenovo-Trans and SPAdes performed consistently among
the best assembly tools (Fig. 4.2).

(II) SOAPdenovo-Trans performed best for the construction of the viral RNA
genome at all three time points tested.

(III) SOAPdenovo-Trans has the lowest runtime, followed by IDBA-Tran,
BinPacker, SPAdes and Bridger.

(IV) Based on our results, we recommend applying different tools and parameter
settings for de novo transcriptome assembly, followed by an evaluation of the
output transcripts and selection of the best-performing results. This general idea
needs to be investigated in more detail in future studies, because the selection
of the best assemblies based on appropriate metrics and the subsequent clustering
procedure (without loss of isoforms, building of chimeric transcripts, or
redundancy) are still challenging and open tasks. In Sec. 4.2, a
proof-of-concept is presented, comparing the performance of different clustering
approaches on the single assemblies of the M. musculus data set presented
here.
The complementary performance of the top-performing tools motivated the
development of an ensemble method that combines the best-performing methods to
achieve an overall better assembly. Therefore, we developed the idea of a pipeline
that automatically selects the top-performing assemblies (or only the best
transcripts from each assembly) based on various metrics and clusters them based on
sequence similarity to achieve a more comprehensive assembly.
A common problem of many comparative studies is that they can only provide
limited suggestions based on the tools and data sets that were available at
the time they were carried out. If new versions of an assembly tool or completely
new tools are released, there is almost no way to estimate the advantages
(or disadvantages) without carrying out a new comparison study. Therefore, next
to the main focus of this study (comparing de novo short-read assembly tools on
various RNA-Seq data sets), we developed a pipeline that allows an easy integration
of updated versions and novel assembly tools in the comparison process. All
metrics, figures and tables are created by scripts and finally combined with the
help of an in-house ruby script to build up the electronic supplement. Evaluation
metrics can be changed or additional metrics (next to the 20 selected by us) can be
easily added. If a new assembly tool needs to be included in the comparison, we
just execute it on each of the selected data sets and use the resulting multiple
FASTA files as input for the evaluation pipeline. The electronic supplement would
then be automatically extended by the evaluation results for the new assembly tool.
Therefore, we can easily extend the comparison presented here and update all
tables and figures in the electronic supplement to investigate the performance of
upcoming tools for de novo assembly of short-read RNA-Seq data.
In the following section, we combine all assembly results, as well as only the
best-performing ones (based on our selected metrics), of the M. musculus data set
to serve as a proof-of-concept. The different parameter settings for clustering
the single assemblies are evaluated. The information gathered from this
proof-of-concept study may further contribute to the development of new cluster
methods for the construction of transcriptome assemblies.
For the large bioinformatics community working in the area of RNA-Seq, the
development of a high-performing (accurate and fast) de novo transcriptome cluster
workflow to automatically select and combine the output of top-performing assembly
tools remains an important and challenging task.


4.2 Cluster de novo transcriptome assemblies: a proof-of-concept
In Sec. 4.1, we comprehensively compared ten de novo assembly approaches on nine
short-read RNA-Seq data sets of various species. Whereas some of the assemblers
performed generally well on most of the data sets (like Trans-ABySS,
SOAPdenovo-Trans, Trinity, SPAdes in single-cell and RNA mode), the performance of
other tools was mixed. Some tools worked better on smaller data sets, whereas others
were able to assemble good results for larger eukaryotic data in an appropriate time.
Based on our evaluation, we can conclude that different assembly tools and param-
eter settings should be tested to find the best assembly for a given RNA-Seq data
set. Standardized metrics are needed that allow for an automatic detection of the
best assemblies (or the best contigs within a given assembly) that can be further
efficiently merged to obtain a final and more comprehensive de novo transcriptome
assembly.
Here, we present our idea of clustering the contigs of different assembly tools
to overcome the specific limitations of each assembler. As a proof-of-concept, we
chose the Mus musculus data set already described and used for the evaluation
of the ten de novo assembly tools in Sec. 4.1. We tested the performance of
CD-HIT-EST on different subsets of the individual mouse assemblies generated in
Sec. 4.1. Furthermore, we show how a preselection of the best contigs (according to
TransRate [229]) of each individual assembly or the selection of the top assemblies
(according to the metrics defined in Sec. 4.1.2) influences the final assembly. We
further used a new tool called Transfuse (publication in progress), which combines
the preselection of the best contigs of each assembly (based on TransRate metrics)
with an automatic merging step based on the paired-end information of the read
data.
Based on the results presented here, we aim to develop a modular pipeline to
automatically execute different assembly tools and parameter settings, selecting the
best assemblies and/or contigs, and clustering them to generate a more comprehen-
sive de novo transcriptome assembly out of short-read RNA-Seq data.
The cluster approaches presented here are based on and compared to the single
assemblies discussed in Sec. 4.1. Therefore, we also refer to the Electronic
Supplement presented in Sec. 4.1 (see footnote 12) whenever reasonable.

4.2.1 How to improve assemblies?


One approach to overcome some of the limitations of a single de novo assembly tool
is the usage of multiple k-mer values to build the de Bruijn graph, the common
backbone of most current assembly tools [205]. Contigs of the different k-mer
assemblies can be merged to obtain a more complete view of the whole transcriptome,
including lowly expressed transcripts that may be covered better by smaller
k-mers. However, how to efficiently merge different assemblies without losing highly
similar isoforms is still a challenging task.
12
http://www.rna.uni-jena.de/supplements/the_dark_art/

84
4.2. Cluster de novo transcriptome assemblies: a proof-of-concept

Recent studies have already used such merging approaches to combine multiple
assemblies for annotation, comparative studies, quantification and differential gene
expression [226, 232–235]. In most cases, the multiple FASTA files resulting from all
assembly runs are concatenated into one big FASTA file and clustered by sequence
similarity with CD-HIT-EST [146]. However, the quality of assemblies resulting
from such clustering approaches has never been systematically investigated. Another
problem is the high redundancy that can be introduced by combining the output of
multiple assembly runs.

CD-HIT: Cluster Database at High Identity with Tolerance


A frequently used tool for clustering transcript sequences to produce a set of
non-redundant representative sequences is CD-HIT-EST [146, 236]. CD-HIT (Cluster
Database at High Identity with Tolerance) comprises a software suite for the fast
clustering and comparison of large sets of amino acid or nucleotide sequences. In
2001 and 2002, Li et al. published two papers [237, 238] describing ultrafast
clustering algorithms for protein sequences called CD-HI and CD-HIT, respectively.
The underlying algorithms were further extended to cluster large sets of DNA/RNA
sequences, implemented in the program CD-HIT-EST as part of the CD-HIT package.
The extension -EST stands for Expressed Sequence Tag, a term used in genetics
to describe a short sub-sequence of a cDNA (complementary DNA, produced from
an RNA before sequencing). The application of the clustering algorithm to
Next-Generation Sequencing data was described in detail in 2012 [236].
The input of CD-HIT-EST is a multiple FASTA file. From this, the program
produces a set of non-redundant representative sequences as output. Additionally,
a cluster file describing the sequence group of each non-redundant representative
sequence is provided. The idea of the algorithm is to reduce the overall size of a
sequence database (like the concatenated FASTA files of multiple transcriptome
assemblies) without losing sequence information, by removing only redundant
(or highly similar) sequences. To this end, CD-HIT-EST allows for a user-defined
similarity threshold (-c) to cluster the sequences (default -c = 0.9).
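As a minimal sketch of this concatenate-and-cluster approach, the helpers below tag the FASTA headers of each assembly with the assembler name and build a `cd-hit-est` call for a given threshold. The function names, the `|` separator and the file names are assumptions made for illustration; only the options `-i`, `-o` and `-c` of `cd-hit-est` are used.

```python
from typing import Iterable, Iterator, List


def tag_headers(lines: Iterable[str], assembler: str) -> Iterator[str]:
    """Prefix FASTA headers with the assembler name so IDs stay unique after concatenation."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            yield f">{assembler}|{line[1:]}"
        elif line:
            yield line


def cd_hit_est_command(fasta_in: str, fasta_out: str, identity: float) -> List[str]:
    """Build a cd-hit-est call with a user-defined similarity threshold (-c)."""
    if not 0.0 < identity <= 1.0:
        raise ValueError("identity must be in (0, 1]")
    return ["cd-hit-est", "-i", fasta_in, "-o", fasta_out, "-c", str(identity)]


# Hypothetical file names, for illustration only
print(" ".join(cd_hit_est_command("all_assemblies.fa", "clustered_95p.fa", 0.95)))
```

Tagging the headers keeps track of which assembler produced each representative sequence, which is otherwise lost once the files are concatenated.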
The comprehensive clustering of a sequence database involves all-by-all
comparisons and is very time consuming. However, CD-HIT avoids many pairwise sequence
alignments by utilizing a short word filter. A pairwise alignment is only conducted
if two given sequences share at least a certain number of short words, that is, a
specific number of dinucleotides, trinucleotides, and so on. For example, to achieve
an identity of 85 % over a 100 nt window, two sequences need to have at least 70
identical dinucleotides, 55 identical trinucleotides, and 25 identical
pentanucleotides (see footnote 13). If this does not hold for two sequences, no
pairwise alignment needs to be constructed, because simple word counting already
shows that the similarity of the two sequences is below the identity threshold of
85 %. Combined with an index table, the counting of short words can be done very
efficiently and makes CD-HIT extremely fast. However, the avoidance of many pairwise
sequence alignments based on the short word filter is a heuristic. Therefore, the
final cluster representation might not be optimal. A short summary of the algorithm
can be found at http://weizhong-lab.ucsd.edu/cd-hit/wiki/doku.php?id=cd-hit_user_guide.

13
see http://weizhong-lab.ucsd.edu/cd-hit/wiki/doku.php?id=cd-hit_user_guide
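The numbers in the example above follow from a simple counting argument: each mismatch can destroy at most k of the overlapping words of length k. The sketch below reproduces the quoted values under the approximation that a window of length L contains about L words (the exact count is L − k + 1); it illustrates the idea only and is not the CD-HIT implementation.

```python
from collections import Counter


def min_shared_words(window: int, identity: float, k: int) -> int:
    """Approximate lower bound on identical k-mers for a given identity over a window."""
    mismatches = round(window * (1.0 - identity))
    # Each mismatch can affect at most k overlapping words of length k
    return max(window - k * mismatches, 0)


def shared_words(a: str, b: str, k: int) -> int:
    """Count the k-mers that two sequences have in common, with multiplicity."""
    words_a = Counter(a[i:i + k] for i in range(len(a) - k + 1))
    words_b = Counter(b[i:i + k] for i in range(len(b) - k + 1))
    return sum((words_a & words_b).values())


for k in (2, 3, 5):
    print(k, min_shared_words(100, 0.85, k))
```

A filter in this spirit would skip the alignment of two sequences whenever `shared_words` falls below `min_shared_words` for the chosen threshold.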
The CD-HIT-EST program is commonly used in transcriptomic studies to cluster
the output of multiple assemblies and to reduce the redundancy of a combined
assembly [226, 232–235]. However, its performance has never been systematically
investigated. For example, multiple k-mer assembly approaches like Oases and
Trans-ABySS tend to produce many (short) transcripts and therefore assemblies
with a high number of contigs (see Sec. 4.1). By using CD-HIT-EST, the size of
such assemblies can be reduced and short sequences that are just a part of larger
ones can be clustered. However, a major problem is the presence of similar isoforms
in one assembly. CD-HIT-EST cannot distinguish between isoforms and therefore
they might be erroneously clustered. For example, consider a gene comprising three
exons, with an isoform A consisting of all three exons and an isoform B consisting
only of the first exon. It is then very likely that CD-HIT-EST would cluster both
isoforms and B would be lost in the final assembly.
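The isoform example can be reproduced with a toy greedy clustering that processes sequences from longest to shortest, in the spirit of CD-HIT's greedy incremental strategy. The sketch is deliberately simplified: it uses exact substring containment instead of an identity threshold and alignments, ignores strand, and the exon sequences are made up.

```python
from typing import Dict, List


def greedy_containment_clusters(seqs: Dict[str, str]) -> Dict[str, List[str]]:
    """Toy clustering: longest first; a sequence joins the first representative containing it."""
    clusters: Dict[str, List[str]] = {}
    for name in sorted(seqs, key=lambda n: len(seqs[n]), reverse=True):
        for representative in clusters:
            if seqs[name] in seqs[representative]:
                clusters[representative].append(name)
                break
        else:
            # No representative contains this sequence, so it starts a new cluster
            clusters[name] = [name]
    return clusters


# Made-up exon sequences, for illustration only
exon1, exon2, exon3 = "ATGGCCAAGT", "CCTTGAGGAC", "TTGACCGTAA"
isoforms = {"isoform_A": exon1 + exon2 + exon3, "isoform_B": exon1}
print(greedy_containment_clusters(isoforms))
```

Only the representative of each cluster would be written to the final assembly, so isoform B disappears even though it is a valid transcript.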

Transfuse
Transfuse is currently under development (see footnote 14) and based on the output
of TransRate [229], a transcriptome evaluation pipeline already used for the
calculation of several metrics in Sec. 4.1 and briefly explained in Sec. 4.1.2.
Transfuse merges multiple de novo transcriptome assemblies. The input consists of
multiple assemblies calculated with different de novo assemblers, or with different
parameter settings of the same assembler. The output is a single high-quality
assembly representing the transcriptome.
To cluster the multiple assemblies, Transfuse does not only rely on the contigs
(like CD-HIT-EST), but also takes the reads used to perform the transcriptome
assemblies as an input. The idea is to first calculate reference-free assembly scores
for each single assembly and transcript with TransRate, followed by the clustering
of the transcripts that pass the filter. A multiple sequence alignment is calculated
for each cluster, and then a splice graph is resolved for each alignment to generate
merged contigs. According to the authors, the tool is still in development but it
appears to be performing well and the manuscript is in preparation.
Currently, Transfuse, like TransRate, can only work with paired-end RNA-Seq data.

4.2.2 Description of used cluster approaches


For this proof-of-concept, we used the M. musculus RNA-Seq data set presented and
evaluated in Sec. 4.1. The data set consists of ∼52 million strand-specific paired-end
reads with a length of 76 bp (Electronic Supplement Tab. S1) and was assembled
with ten different de novo assemblers (see footnote 15). Details and statistics
about each single assembly can be found in Sec. 4.1 and the corresponding
Electronic Supplement.
14
https://github.com/cboursnell/transfuse
15
If we count again the two different modes of SPAdes (single-cell and RNA) as two different
assemblers.


Fig. 4.4 visualizes the workflow presented in Sec. 4.1 and how the different
clustering setups were conducted here. (1) – At first, we clustered the
transcripts of all ten assemblies with CD-HIT-EST (Fig. 4.4). (2) – In a second
approach, we selected only the best six assemblies according to our results
in Sec. 4.1 before clustering again with CD-HIT-EST. The top six assemblies were
Trinity, Trans-ABySS, SOAPdenovo-Trans, IDBA-Tran, SPAdes-sc, and
SPAdes-rna and are additionally marked in Fig. 4.4. (3) – In a third approach, we
selected from all ten assemblies only those contigs for clustering with CD-HIT-EST
that were previously defined as good based on the TransRate scoring. (4) – In a last
setup involving the clustering with CD-HIT-EST, we selected only the TransRate
good contigs from the best assemblies (a combination of (2) and (3)) for clustering.
For each approach, the CD-HIT-EST clustering was performed with three similarity
thresholds (-c parameter): 1.0, 0.99 and 0.95. (5) – We further used the
unpublished Transfuse pipeline (utilizing the TransRate scoring scheme) to merge all
ten assemblies. In summary, we built 13 merged assemblies (Fig. 4.4).
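The count of 13 merged assemblies follows from four CD-HIT-EST setups at three thresholds each, plus one Transfuse run. The sketch below enumerates them; the label pattern follows names such as CD-HIT-95p and CD-HIT-TransRate-Metrics-100p used in the evaluation, while the prefix `CD-HIT-TransRate` for setup (3) is an assumption made by analogy.

```python
from itertools import product
from typing import List

# Setups (1) to (4); the prefix for setup (3) is assumed by analogy
CD_HIT_SETUPS = {
    "CD-HIT": "all contigs of all ten assemblies",
    "CD-HIT-Metrics": "all contigs of the six best assemblies",
    "CD-HIT-TransRate": "TransRate 'good' contigs of all ten assemblies",
    "CD-HIT-TransRate-Metrics": "TransRate 'good' contigs of the six best assemblies",
}
THRESHOLDS_PERCENT = [100, 99, 95]  # passed to cd-hit-est as -c 1.0, 0.99 and 0.95


def merged_assembly_names() -> List[str]:
    """List the labels of all merged assemblies, including the Transfuse run (5)."""
    names = [f"{prefix}-{pct}p" for prefix, pct in product(CD_HIT_SETUPS, THRESHOLDS_PERCENT)]
    names.append("Transfuse")
    return names


print(len(merged_assembly_names()))
```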
For evaluation and comparison of the merged assemblies with those of the single
assembly tools, we used the same metrics and evaluation software as previously
described in Sec. 4.1.

4.2.3 Evaluation of merged assemblies


Re-mapping rate
The Hisat2 [74] mapper was used to re-align the reads to each corresponding
assembly. Among the single assembly tools, SPAdes-sc achieved the highest
re-mapping rate of 94.8 %, followed by Trans-ABySS (94.0 %) and Trinity (92.83 %)
(see Electronic Supplement Fig. S4). By merging the single assemblies as presented
in Fig. 4.4, we were able to increase the re-mapping rate up to 96.03 % (CD-HIT-95p)
by clustering all ten single assemblies with CD-HIT-EST and a similarity threshold
of 95 % (Tab. 4.3). The lowest re-mapping rate of 89.93 % was achieved by
the most restrictive clustering approach (CD-HIT-TransRate-Metrics-95p), where
we only clustered the contigs defined as good by TransRate out of the six best
assemblies according to our previously defined metrics with a similarity threshold of
95 %. Nevertheless, this re-mapping rate is still higher than the re-mapping rate of
four single assembly tools (Electronic Supplement Fig. S4).
Of course, a high re-mapping rate does not necessarily represent a good assembly
quality, because chimeric and nonsense transcripts can also increase the re-mapping
rate. Further metrics need to be considered; however, the re-mapping rate can give
first insights into the performance of an assembler.

Detonate
We investigated the RSEM-EVAL scores calculated by Detonate for each of the
13 clustered assemblies (Tab. 4.3). The scores range between -2.10 (CD-HIT-
Metrics-99p and Transfuse) and -2.60 (CD-HIT-TransRate-Metrics-100p). The best
RSEM-EVAL scores for single assemblies were achieved by Trans-ABySS (-2.14),
Trinity (-2.26) and Bridger (-2.37) (Electronic Supplement Tab. S8 and Ap-

Chapter 4. Transcriptome Assembly

[Figure 4.4: workflow diagram. Input (Sec. 4.1): mouse RNA-Seq data set, 2x 43 Mio reads,
76 bp, strand-specific. Preprocessing: FastQC, Prinseq. De novo single assemblies:
Trans-ABySS, SOAPdenovo-Trans, Bridger, BinPacker, Trinity, Oases, SPAdes-rna, SPAdes-sc,
Shannon, IDBA-Tran. Merging routes: (1) CD-HIT-EST on all assemblies without any
preselection; (2) select 'best' assemblies (metrics); (3) select 'good' contigs from all
assemblies (TransRate); (4) select 'good' contigs from 'best' assemblies (TransRate +
metrics); (5) Transfuse. CD-HIT-EST clustering at 100 %, 99 % and 95 %. Merged assemblies:
CD-HIT-{100p, 99p, 95p}, CD-HIT-Metrics-{100p, 99p, 95p}, CD-HIT-TransRate-{100p, 99p, 95p},
CD-HIT-TransRate-Metrics-{100p, 99p, 95p}, Transfuse.]

Figure 4.4: Proof-of-concept transcriptome pipeline. We chose the Mus musculus data set pre-
sented in Sec. 4.1 and the corresponding assemblies to evaluate the performance of different clus-
tering approaches and parameter settings to build a merged assembly. (1) – we clustered the tran-
scripts of all ten assemblies with CD-HIT-EST [146]. (2) – before clustering with CD-HIT-EST,
we selected only the best six assemblies according to the 20 metrics defined in Sec. 4.1. The top
six assemblies are marked with a star. (3) – From all ten assemblies we selected only those contigs
for clustering that were previously defined as good based on the TransRate scoring. (4) – We
selected only the TransRate good contigs from the best assemblies (combination of (2) and (3)) for
clustering. The clustering was performed on three similarity thresholds (−c parameter): 1.0, 0.99
and 0.95. (5) – We further used the unpublished Transfuse pipeline (utilizing the TransRate
scoring scheme) to merge all ten assemblies. In summary, we built 13 merged assemblies.

Table 4.3: Selected metrics based on the output of rnaQUAST, Hisat2, Detonate, TransRate and BUSCO for the transcripts clustered with 13 different
approaches (Fig. 4.4) based on the single assemblies created previously for the Mus musculus RNA-Seq strand-specific paired-end library with read length
76 bp (Sec. 4.1). In each row the top three values are indicated with bold italic. We rated those approaches as top scoring that ended up with the lowest
amount of transcripts, because the goal of merging multiple assemblies should be to reduce the size of the overall assembly by still keeping the correct
representatives for each transcript. The RSEM-EVAL score is multiplied by 10⁹. C+SC – complete and single-copy BUSCOs.

                               CD-HIT                 CD-HIT-Metrics            CD-HIT-TransRate       CD-HIT-TransRate-Metrics Transfuse
                         100p      99p      95p     100p      99p      95p     100p      99p      95p     100p      99p      95p
Hisat2
  Mapping rate (%)      95.88    96.03    96.01    96.01    95.94    96.01    92.52    92.38    92.50    90.08    89.93    90.04    93.86
rnaQUAST
  # Transcripts       418,712  328,644  270,784  207,814  153,326  128,506   89,990   64,988   55,947   67,155   49,167   43,108   84,587
  >1000 bp            123,651   89,973   73,251   57,236   33,897   24,969   36,419   24,328   21,603   26,054   16,995   15,263   41,935
  Database coverage      0.23    0.191    0.162    0.238    0.208    0.185    0.147    0.117    0.109     0.15    0.125    0.119    0.181
  Misassemblies        59,484   57,891   53,545    1,582    1,475    1,289    5,210    5,186    5,072      321      308      290    7,051
  Isoform coverage      0.556    0.522    0.484    0.588    0.571    0.545    0.579    0.558    0.552    0.588    0.574    0.571    0.649
TransRate
  Optimal score         0.003    0.007    0.011    0.010    0.067    0.127   0.0539    0.166    0.221    0.094    0.268    0.341    0.014
Detonate
  RSEM-EVAL             -2.49    -2.35    -2.29    -2.19    -2.10    -2.11    -2.33    -2.29    -2.30    -2.60    -2.58    -2.59    -2.10
BUSCO
  C+SC BUSCOs             493    1,178    1,665      761    2,064    2,745    1,309    2,873    3,347    1,667    3,172    3,507    1,167
  Missing BUSCOs        1,866    1,854    1,849    1,892    1,893    1,894    1,920    1,921    1,922    1,962    1,964    1,965    1,872

pendix Tab. B.4), whereas BinPacker (-4.89) and IDBA-Tran (-5.03) performed
worst. Most RSEM-EVAL scores calculated for the merged assemblies lie in the range
of the top three scores of the single assemblies. RSEM-EVAL scores better than the
score achieved with the top-performing tool regarding this metric (Trans-ABySS)
were obtained with the CD-HIT-Metrics-99p clustering (-2.10), the Transfuse
clustering (-2.10), and the CD-HIT-Metrics-95p clustering (-2.11). In general, merging
only the transcripts of the preselected best performing assembly tools (Fig. 4.4)
according to the 20 metrics defined in Sec. 4.1 worked well and outperformed the other
clustering approaches according to the RSEM-EVAL score. Interestingly, the merged
assembly calculated by Transfuse achieved the same top score (-2.10) as the
CD-HIT-Metrics-99p approach. Using only the good contigs (defined by TransRate) of
the best performing assemblers seems to be too restrictive to achieve better
RSEM-EVAL scores (-2.58, -2.59, -2.60). The RSEM-EVAL scores were multiplied by
10⁹ for more clarity and easy comparison. Furthermore, within three of the four
CD-HIT-EST clustering setups (best assemblies, good contigs, good contigs of best
assemblies) the best RSEM-EVAL score was achieved with an identity threshold of
0.99; only for the clustering of all assemblies did the 0.95 threshold score slightly
better (-2.29 vs. -2.35, Tab. 4.3).
In summary, the Transfuse merging and the clustering of all contigs of the best
performing assemblies, preselected by a combination of different metrics, achieved
the best Detonate evaluation results (Tab. 4.3).

TransRate
For the single assemblies, the best TransRate scores were achieved for SPAdes-rna
(0.43), SOAPdenovo-Trans (0.40), and SPAdes-sc (0.37) (Electronic Supple-
ment Tab. S9 and Appendix Tab. B.4). BinPacker (0.13), Trans-ABySS (0.09)
and Oases (0.02) performed worst. For the merged assemblies, there was not a sin-
gle TransRate score calculated that outperforms the results of the top three single
assembly tools regarding this metric (Tab. 4.3). The best score was achieved by the
CD-HIT-TransRate-Metrics-95p clustering (0.34), followed by CD-HIT-TransRate-
Metrics-99p (0.27), and CD-HIT-TransRate-95p (0.22). The clustering approach
involving the transcripts of all ten single assemblies did not perform well regarding
the TransRate scoring. The CD-HIT-99p and CD-HIT-100p assemblies achieved
a TransRate score below 0.01. The best TransRate scores were achieved for the
merged assemblies with metric preselection and the best contig filter of TransRate.
Surely, this does also introduce a bias in the evaluation. If we only select the tran-
scripts for merging that previously achieved a good scoring with TransRate, the
resulting assemblies should perform better when again evaluated with the same
metric. However, using only the TransRate-defined good contigs of the six best
performing assembly tools (Fig. 4.4) did greatly improve the TransRate score
(Tab. 4.3). For example, the CD-HIT-TransRate-95p clustering achieved a score
of 0.22, whereas the CD-HIT-TransRate-Metrics-95p achieved a score of 0.34. The
same holds for the CD-HIT-EST clustering with an identity threshold of 99 % (0.17
using the good transcripts of all assemblies and 0.27 using only the good contigs
of the six best assemblies) and with a threshold of 100 % (0.05 vs. 0.09).
Interestingly, the Transfuse merging of all ten single assemblies did not perform well
(0.01), although the tool is based on the output of TransRate.


Based on the TransRate evaluation, we conclude that the preselection of the
best performing assemblies according to a collection of different metrics is an
important step before clustering.

rnaQUAST
First of all, the number of transcripts in the merged assemblies varies widely between
43,108 (CD-HIT-TransRate-Metrics-95p) and 418,712 (CD-HIT-100p) (Tab. 4.3).
This behaviour can be easily explained by the amount of restriction each clustering
approach applies on the input data. In the most restrictive approach (CD-HIT-
TransRate-Metrics-95p) only few transcripts are used as input and clustered with
a low identity threshold of 95 %. Therefore, only 15,263 transcripts with a length
>1000 bp are left in the final assembly, whereas the CD-HIT-100p approach still
comprises 123,651 transcripts. Those numbers are especially interesting for further
downstream analyses like the annotation of the transcripts or differential gene ex-
pression estimations. The goal of merging the transcripts of multiple assemblies
should be to finally obtain as few transcript sequences as possible without losing
important information, such as different isoforms.
Whereas most clustering approaches introduced only a moderate amount of mis-
assembled transcripts (290 (CD-HIT-TransRate-Metrics-95p) – 7,051 (Transfuse)) in
the final assembly, the simple CD-HIT-EST clustering approaches of all transcripts
of all ten single assemblies resulted in >53,000 misassemblies (Tab. 4.3). When com-
paring with the number of misassemblies of the single assemblies (Tab. B.4), this high
number most likely derives from the Oases assembly (52,665 misassemblies). The
lowest numbers of misassemblies were achieved for SPAdes-rna (30), IDBA-Tran
(41), and SOAPdenovo-Trans (61). Therefore, all clustering approaches seem to
introduce novel misassemblies or accumulate present ones in the final assemblies.
However, the CD-HIT-TransRate-Metrics approaches performed best regarding the
amount of misassemblies (290–321, Tab. 4.3).
The mean isoform coverage calculated by rnaQUAST revealed Transfuse (0.649),
CD-HIT-Metrics-100p (0.588), and CD-HIT-TransRate-Metrics-100p (0.588) as the
top performing approaches regarding this metric (Tab. 4.3). In summary, almost all
tools performed with only slight differences in the mean isoform coverage rate. How-
ever, for the single assemblies mean isoform coverage values of 0.81 (BinPacker),
0.66 (Trinity), and 0.62 (Trans-ABySS) were achieved (Electronic Supplement
Tab. S9, Appendix Tab. B.4).
In conclusion, the CD-HIT-EST algorithm does not seem to be optimal for recovering
highly similar isoforms. As Transfuse also takes the paired-end read information
into account, the resolution of different isoforms is much higher (0.649) in comparison
to the other twelve clustering approaches. Therefore, CD-HIT-EST should be used
carefully with a high identity threshold in order to keep similar isoforms in the
merged assembly.
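
Why highly similar isoforms are easily collapsed can be made concrete with the
identity definition used for the −c parameter: the number of identical nucleotides in
the alignment divided by the full length of the shorter sequence. The following Python
sketch is a strong simplification and not the CD-HIT-EST algorithm. It assumes an
ungapped alignment and simply slides the shorter sequence along the longer one,
whereas CD-HIT-EST uses word filtering and gapped alignments. The sequences are
invented toy examples.

```python
def ungapped_identity(seq_a: str, seq_b: str) -> float:
    """Best ungapped identity, relative to the length of the shorter sequence.

    Simplified stand-in for the identity definition quoted above:
    identical nucleotides in the alignment / length of the shorter sequence.
    """
    shorter, longer = sorted((seq_a, seq_b), key=len)
    if not shorter:
        raise ValueError("sequences must not be empty")
    best_matches = 0
    # Slide the shorter sequence along the longer one (no gaps allowed).
    for offset in range(len(longer) - len(shorter) + 1):
        window = longer[offset:offset + len(shorter)]
        matches = sum(x == y for x, y in zip(shorter, window))
        best_matches = max(best_matches, matches)
    return best_matches / len(shorter)


# Toy sequences (invented): a long transcript, a shorter variant that is fully
# contained in it, and the same variant carrying a single mismatch.
long_transcript = "ACGTACGTTTGACCA"
contained_variant = "ACGTTTGA"
mismatch_variant = "ACGTTAGA"

print(ungapped_identity(long_transcript, contained_variant))
print(ungapped_identity(long_transcript, mismatch_variant))
```

Under this definition a shorter variant that is fully contained in a longer transcript
reaches an identity of 1.0 and would therefore fall into the same cluster even at the
strictest threshold, which is in line with the reduced isoform coverage observed for the
CD-HIT-EST-based approaches.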

BUSCO
The BUSCO assessment results are shown in Fig. 4.5. When only comparing the
single assembly tools (Fig. 4.5A), Trans-ABySS (3,989), Trinity (3,921) and


Bridger (3,914) produced the assemblies with most complete orthologs to the
BUSCO Euarchontoglires data set. Trans-ABySS, Trinity, Oases and Bridger
also assembled many complete and duplicated transcripts, whereas IDBA-Tran,
Shannon, SPAdes-rna and Oases included many fragmented hits.
By combining the transcripts of the single assemblies, we were able to increase
the maximum number of complete BUSCOs from 3,989 (Trans-ABySS) to 4,111
(CD-HIT-95p, Fig. 4.5B). However, higher redundancy is introduced in the clustered
assemblies, as many more complete and duplicated transcripts are detected
by BUSCO (1,969 for Trans-ABySS and 3,598 for CD-HIT-100p). By lowering
the similarity threshold of CD-HIT-EST, this number can be reduced (for example
2,446 complete and duplicated BUSCOs for CD-HIT-95p, Fig. 4.5).
By using the metrics defined in Sec. 4.1 as a filter and additionally only the
contigs defined as good by the TransRate evaluation, we could heavily reduce the
overall amount of contigs in the final assembly, without losing too much sensitivity
and specificity of the assembly. Whereas the CD-HIT-100p assembly still comprises
418,712 sequences, the CD-HIT-TransRate-Metrics-95p consists of only 43,108 contigs.
The amount of ambiguous N bases is also decreasing (in this case from 426
million to 53 million).
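
The redundancy discussed above can be read directly from the BUSCO summary
strings shown in Fig. 4.5. The following Python sketch parses such a string, checks
that the categories add up (C = S + D and C + F + M = n), and derives the fraction
of complete BUSCOs that are duplicated. The regular expression and the function
names are our own choices for this illustration and are not part of BUSCO.

```python
import re

# Pattern for summary strings such as
# "C:3989 [S:2020, D:1969], F:274, M:1929, n:6192" (notation as in Fig. 4.5).
BUSCO_PATTERN = re.compile(
    r"C:(?P<C>\d+)\s*\[S:(?P<S>\d+),\s*D:(?P<D>\d+)\],\s*"
    r"F:(?P<F>\d+),\s*M:(?P<M>\d+),\s*n:(?P<n>\d+)"
)


def parse_busco(summary: str) -> dict:
    """Parse a BUSCO summary string into integer counts and sanity-check them."""
    match = BUSCO_PATTERN.search(summary)
    if match is None:
        raise ValueError(f"not a BUSCO summary: {summary!r}")
    counts = {key: int(value) for key, value in match.groupdict().items()}
    if counts["S"] + counts["D"] != counts["C"]:
        raise ValueError("single-copy + duplicated does not equal complete")
    if counts["C"] + counts["F"] + counts["M"] != counts["n"]:
        raise ValueError("complete + fragmented + missing does not equal n")
    return counts


def duplication_rate(counts: dict) -> float:
    """Fraction of complete BUSCOs that are found more than once."""
    return counts["D"] / counts["C"]


# Summary strings copied from Fig. 4.5.
SUMMARIES = {
    "Trans-ABySS": "C:3989 [S:2020, D:1969], F:274, M:1929, n:6192",
    "CD-HIT-100p": "C:4091 [S:493, D:3598], F:235, M:1866, n:6192",
    "CD-HIT-95p": "C:4111 [S:1665, D:2446], F:232, M:1849, n:6192",
}

parsed = {name: parse_busco(text) for name, text in SUMMARIES.items()}
for name, counts in parsed.items():
    print(f"{name}: {counts['C']} complete, {duplication_rate(counts):.1%} duplicated")
```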

4.2.4 A possible cluster-assembly pipeline and future work


Based on our observations and the evaluation results presented in Sec. 4.1 for the
single assembly tools and here for the merged assemblies, we developed the idea of a
cluster-assembly pipeline for the automation of the whole assembly process and
transcriptome optimization.
First of all, the selected evaluation metrics are very important. The Detonate
and TransRate results showed that an appropriate selection of the best performing
assemblies can greatly improve the merging process and the final assembly results.
Therefore, the calculation of many different assemblies with various tools and
parameter settings (such as different k-mers) followed by an appropriate selection of
the best performing ones according to reference-free metrics should be a key feature
of a possible pipeline. In order to calculate many different assemblies, the assembly
tool should have an acceptable run time and moderate memory consumption.
Among others, SOAPdenovo-Trans performed well regarding these requirements
(Sec. 4.1). Another open question is how many of the best assemblies should be
selected for merging. Possibly, this could also be implemented as an iterative
process: 1) sort the assemblies according to selected evaluation metrics, 2) merge the
transcripts of the best assemblies, starting with the two best ones, 3) after each
merging step evaluate the performance of the resulting assembly and stop if the
quality decreases.
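
The iterative process outlined above could look as follows. This Python sketch is
only one possible reading of the three steps and not an existing implementation. An
assembly is modeled as a plain list of contig sequences, and the `score` and `merge`
callables are hypothetical stand-ins for a reference-free evaluation metric and a
merging tool. The toy scoring function at the end exists only to make the example
runnable.

```python
from typing import Callable, Dict, List, Tuple

Assembly = List[str]  # an assembly is modeled as a list of contig sequences


def iterative_merge(
    assemblies: Dict[str, Assembly],
    score: Callable[[Assembly], float],
    merge: Callable[[Assembly, Assembly], Assembly],
) -> Tuple[List[str], Assembly, float]:
    """Greedy merging: add assemblies in rank order until the score drops."""
    # 1) sort the assemblies according to the evaluation metric (best first)
    ranked = sorted(assemblies, key=lambda name: score(assemblies[name]), reverse=True)
    if len(ranked) < 2:
        raise ValueError("at least two assemblies are required")

    # 2) start by merging the two best assemblies
    used = ranked[:2]
    current = merge(assemblies[used[0]], assemblies[used[1]])
    current_score = score(current)

    # 3) keep adding the next best assembly; stop if the quality decreases
    for name in ranked[2:]:
        candidate = merge(current, assemblies[name])
        candidate_score = score(candidate)
        if candidate_score < current_score:
            break
        used.append(name)
        current, current_score = candidate, candidate_score
    return used, current, current_score


# --- toy example (hypothetical data, only to exercise the function) ---
EXPECTED = {"AAA", "CCC", "GGG"}  # pretend these are the true transcripts


def toy_score(assembly: Assembly) -> float:
    contigs = set(assembly)
    return len(contigs & EXPECTED) - len(contigs - EXPECTED)


def toy_merge(left: Assembly, right: Assembly) -> Assembly:
    return list(dict.fromkeys(left + right))  # union without duplicates


toy_assemblies = {
    "a": ["AAA", "CCC"],
    "b": ["CCC", "GGG"],
    "c": ["AAA", "TTT", "TAT"],
}
used, merged, final_score = iterative_merge(toy_assemblies, toy_score, toy_merge)
print(used, merged, final_score)
```

In a real pipeline the stopping rule would probably need a tolerance, and several
metrics would have to be combined into a single ranking. Both aspects are deliberately
left open in this sketch.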
With this proof-of-concept, we showed by example that merging the output of
different assembly runs can improve the overall assembly. Furthermore, we suggest
that not only different parameter settings for a single assembler should be used,
but also different tools and algorithms. After the generation of many assemblies,
the different results need to be evaluated appropriately and the best results finally
merged. The merging process is an important step and the CD-HIT-EST heuristic


[Figure 4.5: stacked bar charts; x-axis: % BUSCOs (0–100); categories: complete (C) and
single-copy (S), complete (C) and duplicated (D), fragmented (F), missing (M). Bar labels:]

A Single assemblies
Trans-ABySS   C:3989 [S:2020, D:1969], F:274, M:1929, n:6192
Oases         C:3066 [S:1529, D:1537], F:650, M:2476, n:6192
SOAP-Trans    C:3616 [S:3454, D:162], F:589, M:1987, n:6192
Trinity       C:3921 [S:2323, D:1598], F:326, M:1945, n:6192
IDBA-Tran     C:2679 [S:2661, D:18], F:1325, M:2188, n:6192
Shannon       C:2829 [S:2333, D:496], F:829, M:2534, n:6192
Bridger       C:3914 [S:2836, D:1078], F:352, M:1926, n:6192
BinPacker     C:310 [S:217, D:93], F:11, M:5871, n:6192
SPAdes-sc     C:3604 [S:3592, D:12], F:596, M:1992, n:6192
SPAdes-rna    C:2596 [S:2582, D:14], F:774, M:2822, n:6192

B Merged assemblies
B.1 CD-HIT-100p                     C:4091 [S:493, D:3598], F:235, M:1866, n:6192
    CD-HIT-99p                      C:4107 [S:1178, D:2929], F:231, M:1854, n:6192
    CD-HIT-95p                      C:4111 [S:1665, D:2446], F:232, M:1849, n:6192
B.2 CD-HIT-Metrics-100p             C:4049 [S:761, D:3288], F:251, M:1892, n:6192
    CD-HIT-Metrics-99p              C:4032 [S:2064, D:1968], F:267, M:1893, n:6192
    CD-HIT-Metrics-95p              C:4026 [S:2745, D:1281], F:272, M:1894, n:6192
B.3 CD-HIT-TransRate-100p           C:3980 [S:1309, D:2671], F:292, M:1920, n:6192
    CD-HIT-TransRate-99p            C:3973 [S:2873, D:1100], F:298, M:1921, n:6192
    CD-HIT-TransRate-95p            C:3970 [S:3347, D:623], F:300, M:1922, n:6192
B.4 CD-HIT-TransRate-Metrics-100p   C:3873 [S:1667, D:2206], F:357, M:1962, n:6192
    CD-HIT-TransRate-Metrics-99p    C:3860 [S:3172, D:688], F:368, M:1964, n:6192
    CD-HIT-TransRate-Metrics-95p    C:3855 [S:3507, D:348], F:372, M:1965, n:6192
B.5 Transfuse                       C:4089 [S:1167, D:2922], F:231, M:1872, n:6192

Figure 4.5: BUSCOs for M. musculus single and clustered assemblies. (A) shows the BUSCO
results for each single assembly tool (compare Fig. 4.3). The six assemblies that performed best
according to the metrics presented in Sec. 4.1 are marked with a star. In (B) the BUSCO hits for
the different merging approaches are shown. B.1 – All ten assemblies produced by the single tools
shown in (A) are clustered with CD-HIT-EST. B.2 – Only the six best assemblies (according to the
metrics defined in Sec. 4.1, marked in (A)) are clustered by CD-HIT-EST. B.3 – Only those contigs
defined as good by TransRate are clustered. B.4 – Good TransRate contigs were only selected
from the six best assemblies and clustered. For each of the four merging approaches (B.1–B.4)
CD-HIT-EST was applied with three different sequence identity thresholds (−c parameter: 1.0,
0.99, 0.95). This sequence identity threshold is calculated as the number of identical nucleotides in
the alignment divided by the full length of the shorter sequence. In B.5 the results for the merging
by Transfuse are given. For a description of the BUSCO assessment results see Sec. 4.1.2.


has its limitations, for example regarding the efficient merging of isoforms.
The Transfuse software (currently under development) already goes in this
direction; however, the pipeline can only work with paired-end data at the moment.
Many RNA-Seq projects are still based on single-end data or on sequencing
techniques different from Illumina’s paired-end protocol. Therefore, the usage of
Transfuse is currently restricted to a limited set of NGS projects.
This proof-of-concept and the scripts already developed here will be used as
a starting point for the implementation of an automated pipeline for the efficient
calculation, evaluation and clustering of de novo transcriptome assemblies [17]. To
achieve this, we will 1) define meaningful reference-free metrics, 2) automate the
detection of the best assemblies and/or contigs to 3) finally merge them into a
comprehensive de novo transcriptome assembly based on short-read RNA-Seq data.

Chapter 5

Differential Gene Expression

This chapter is based on our publications “Differential effects of vitamins A and D
on the transcriptional landscape of human monocytes during infection with A. fumigatus,
C. albicans and E. coli” [6], “Massive effect on lncRNAs in human monocytes
during fungal and bacterial infections and in response to vitamins A and D” [5],
“Differential transcriptional responses to Ebola and Marburg virus infection in cells
from bats and humans” [4], and “Description of the transcriptomic landscape of
the microbat Myotis daubentonii in response to interferon stimulation and an
infection with the Rift Valley fever virus” [15]*. We will focus mainly on the topics
Preprocessing & Quality Control, Mapping & Quantification, Differential Gene
Expression, Assembly, Gene Annotation, Pathways, and Visualization (see overview
Fig. 2.2; B–G and M).
In the first part of this chapter (Sec. 5.1), a project conducted in cooperation
with Klassert et al. and Riege et al. is presented. While both publications are based
on the same NGS data and similar bioinformatic methods, the main topics differ.
Together with Riege et al., we focused on the differential expression of long
non-coding RNAs and antisense transcripts, while the work conducted together with
Klassert et al. focused on protein-coding genes. Here, we will mainly present
the results obtained in Klassert et al. [6]. The results are in parts also presented in
the master thesis of Julia Bräuer [239]. All wet lab experiments, including infection,
RNA extraction, sequencing and qPCR validation were done in the lab of Prof.
Dr. Hortense Slevogt in Jena. Here, we present a state-of-the-art RNA-Seq and
differential gene expression analysis. From the early start of the project, reference
genomes of all species and corresponding annotations were available, as well as
biological replicates for statistical calculations and strand-specific RNA-Seq data,
allowing for a straightforward analysis without any prior assembly steps needed.
In contrast to the other NGS projects presented in this thesis, here the Ion Torrent
sequencing technique was used instead of Illumina. One advantage of Ion Torrent
data is the comparably longer read length; however, the general throughput is
lower [29]. Furthermore, Ion Torrent reads vary in length, whereas Illumina reads
are of equal size. This section is accompanied by a comprehensive
Electronic Supplement¹.

* unpublished work, publication in progress


1 available at http://www.rna.uni-jena.de/supplements/fungi_infection/

Chapter 5. Differential Gene Expression

The second part of this chapter (Sec. 5.2) was performed in cooperation with
the virology group of Prof. Dr. Stephan Becker at the Philipps University Marburg.
A project is presented that confronted us with much more complicated problems.
First, at the start of this project no genome of the fruit bat Rousettus
aegyptiacus was available, so we decided to construct a comprehensive de novo
transcriptome assembly with various tools (see Chapter 4) to find differentially expressed
genes of this bat. Furthermore, the lack of biological replicates and the high
replication rate of the Ebola virus made this project one of the most challenging ones
during my PhD. Nevertheless, in this section we present an integrated approach that
combines the analysis of genome reference data with transcriptome assembly data
for human and bat, respectively. In this project, we also exemplarily showed that
differentially expressed genes identified with a genome and a transcriptome reference
approach are actually comparable. This project is accompanied by a comprehensive
Electronic Supplement as well as an interactive gene observer, both available
online².

5.1 Differential effects of vitamins A and D on the
transcriptional landscape of human monocytes
during infection with A. fumigatus, C. albicans
and E. coli
Vitamin A and vitamin D are essential nutrients with a wide range of pleiotropic
effects in humans. Beyond their well-documented roles in cellular differentiation,
embryogenesis, tissue maintenance and bone/calcium homeostasis, both vitamins
have attracted considerable attention due to their association with immunological
traits. Nevertheless, our knowledge of their immunomodulatory potential during
infection is restricted to single gene-centric studies, which do not reflect the com-
plexity of immune processes. In the present study, we performed a comprehensive
RNA-Seq-based approach to define the whole immunomodulatory role of vitamins A
and D during infection. Using human monocytes as host cells, we characterized the
differential role of both vitamins upon infection with three different pathogens: As-
pergillus fumigatus, Candida albicans and Escherichia coli. Both vitamins showed
an unexpected ability to counteract the pathogen-induced transcriptional responses.
Upon infection, we identified 346 and 176 immune-relevant genes that were regulated
by atRA and vitD, respectively. This immunomodulatory activity was dependent
on the inflammatory stimulus, allowing us to distinguish regulatory patterns which
were specific for each stimulatory setting. Moreover, we explored possible direct
and indirect mechanisms of vitamin-mediated regulation of the immune response.
Our findings highlight the importance of vitamin-monitoring in critically ill patients.
Moreover, our results underpin the potential of atRA and vitD as therapeutic op-
tions for anti-inflammatory treatment.

2 available at http://www.rna.uni-jena.de/supplements/filovirus_human_bat/

5.1. Differential effects of vitamins on human monocytes after infections

5.1.1 Biological effects of vitamin A and D


Vitamin A and vitamin D are known to exert pleiotropic effects on many biological
processes [240, 241]. Their broad functional potential relies mainly on their capac-
ity to facilitate transcriptional changes by binding their specific nuclear receptors:
the retinoic acid receptors (RARs) and the vitamin D receptor, respectively [240–
244]. While vitamin A is involved in functions such as reproduction, embryogene-
sis, cellular differentiation, and maintenance of body tissues [245–247], vitamin D
is crucial for the regulation of bone and calcium homeostasis [248]. Besides these
well-known functions, both vitamins have been associated with a regulatory role
in immunity [249, 250]. Over the last decades, an increasing effort has been
devoted to better define the involvement of both vitamins in the regulation of the
immune response, especially since vitamin A deficiency and seasonal variations
in vitamin D levels have been associated with an increased susceptibility to severe
infectious diseases [251–253].
Animal models and clinical trials demonstrated the protective effect of vitamin
supplementation in infectious diseases [254–257]. Indeed, vitamin A or D supple-
mentation has been proposed as treatment option in diseases such as measles and
tuberculosis, respectively [254, 258]. Meanwhile, in vitro studies have attempted
to characterize the mechanisms underlying such protective effects. Most studies
pointed towards an immunomodulatory effect of both vitamins, after the identi-
fication of candidate genes regulated by vitamin A or D metabolites [249, 250].
Vitamin A has been shown to decrease LPS-induced expression of pro-inflammatory
cytokines such as TNFα and IL6, or chemokines like MIP-1α and β in human
macrophages and dendritic cells [259]. Recently, a similar effect was demonstrated
in monocytes upon fungal infection, reporting the down-regulation of C. albicans-
induced TNFα and IL6 expression in response to all-trans retinoic acid (atRA),
the active metabolite of vitamin A [260]. Meanwhile, most of the current under-
standing of vitamin D-mediated immunomodulation derives from studies related to
Mycobacterium tuberculosis infections [261]. Besides its ability to induce the
expression of Cathelicidin or β-defensins [250, 261], vitamin D is also able to regulate the
expression of TNFα and IL6 in human monocytes upon bacterial stimulation [262].
In fungal infections, vitamin D has also been shown to modulate the production of
cytokines such as IL6, TNFα, and IFNγ in monocytes [263].
Although these studies are evidence for the significant impact of vitamins on
the immune response of human leukocytes to bacterial and fungal pathogens, they
are all qPCR-based (i.e. single gene-centric) and therefore limited. Moreover, some
findings of candidate genes for the vitamin-mediated control of immune functions
could not be replicated between studies. For example, Oeth et al. (1998) could
not detect any atRA-mediated changes in the TNFα production by monocytes after
stimulation with LPS [264]. In addition, single gene-centric studies cannot exhibit
the complexity of cellular interactions and pathways regulating immune processes.
In the present work, we performed a high-throughput approach based on RNA-
Seq to define the whole immunomodulatory potential of vitamins A and D during
infection. Therefore, we analyzed their differential impact on infections of bacterial
and fungal origin. The bacterium E. coli is one of the most common etiologic agents
of sepsis [265], while C. albicans and A. fumigatus are among the most important


causes of systemic mycoses [220]. During these systemic infections, monocytes play
a central role in the host defense contributing not only to pathogen recognition,
but also as phagocytes and effector cells [266]. Hence, in this exhaustive study, we
analyzed the noteworthy immunomodulatory role of vitamins on human monocytes.

5.1.2 Study design


AtRA and 1α,25(OH)₂D₃ (vitamin D) were purchased from Sigma-Aldrich (Germany)
and dissolved in absolute ethanol. FITC-conjugated monoclonal mouse anti-human
CD14 antibody, APC-conjugated monoclonal mouse anti-human CD16 antibody,
and FITC-conjugated mouse IgG1 κ isotype control antibody were purchased
from eBioscience (USA). APC-conjugated mouse IgG1 κ isotype control antibody
was purchased from Biolegend (USA).

Preparation of fungi and bacteria


Overnight cultures of C. albicans (SC5314) in YPD medium were washed three times
with PBS and resuspended at 10⁸ yeasts/ml in RPMI 1640 GlutaMAX medium
(Gibco, UK) supplemented with 10 % fetal bovine serum (FBS; Biochrom, Germany).
A. fumigatus (AF293) was grown on AMM plates at 30 °C for 6 d. Conidiospores
were harvested by rinsing the plates with water + 0.05 % Tween-20 (Sigma-Aldrich,
Germany) and filtered through 70-µm and 30-µm pre-separation filters (Miltenyi
Biotec, UK) to obtain a single-cell suspension. Conidia were then washed twice in
PBS and resuspended at 10⁷ conidia/ml in RPMI 1640 GlutaMAX medium
supplemented with 10 % FBS (Biochrom, Germany). Germlings were obtained by
incubation of conidia at 37 °C under continuous shaking for 6–8 h. They were then
centrifuged and resuspended at 10⁸ cells/ml in fresh RPMI 1640 GlutaMAX medium
supplemented with 10 % FBS.
Overnight culture of Escherichia coli (isolate 018:K1:H7) in LB medium was
washed three times in PBS and resuspended in RPMI 1640 GlutaMAX medium
supplemented with 10 % FBS. The concentration of bacteria was adjusted to 10⁹ cfu/ml.
All pathogens were heat-killed by incubation at 65 °C for 30 min and immediately
used for stimulation assays.

Monocyte isolation
Human monocytes were isolated from 500 ml fresh whole blood (drawn within 1 h
before use) of healthy male donors. Blood was layered onto an equal volume of 1-Step
Polymorphs (Accurate Chemical & Scientific Corporation, USA) and centrifuged at
650 × g for 35 min. After centrifugation, the peripheral blood mononuclear cells
(PBMCs) were collected, and normal osmolarity was restored by adding an equal
volume of 0.45 % cold NaCl. After erythrocyte lysis using a hypotonic buffer, cells
were washed twice in cold PBS and counted using a Neubauer chamber. Cell viability
of >95 % was assessed by trypan blue staining. Monocytes were isolated from the
PBMCs using the monocyte isolation kit II and quadro-MACS (Miltenyi Biotec,
UK), following the manufacturer’s instructions.


Ethics statement
The blood of healthy male donors was drawn after written informed consent, in
accordance with the Declaration of Helsinki. All protocols were approved by the
Ethics Committee of the University Hospital Jena (permit number: 3639-12/12).

Stimulation assays
Monocytes were resuspended at 5 × 106 cells/ml in RPMI 1640 GlutaMAX medium
(Gibco, UK) supplemented with 10 % FBS (Biochrom, Germany) and 1 % Peni-
cillin/Streptomycin (Thermo Fisher Scientific, USA). They were seeded on 6-well
plates (VWR International, Germany) and allowed to equilibrate at 37 °C and
5 % CO₂ for 2 h. Cells were then pre-incubated with 1 µM atRA or 1α,25(OH)₂D₃
for 30 min. Then, the heat-killed pathogens were added at a pathogen:host ratio of
1:1 for C. albicans yeast and A. fumigatus germ tubes, and 10:1 in case of E. coli
stimulation. After 6 h of incubation at 37 °C and 5 % CO₂, cell viability of >90 % was
confirmed by trypan blue staining, and the monocytes were harvested for RNA isolation.
The whole experimental workflow is depicted in Fig 5.1.
In total, we had four different immune-stimulatory settings (w/o infection, A. fu-
migatus infection, C. albicans infection and E. coli infection), in each of which we
aimed to address the effect of vitamin A (atRA) or vitamin D supplementation.

Figure 5.1: Experimental workflow. Human monocytes were isolated from fresh whole blood and
purity of the cells was analyzed by flow cytometry. Upper scatterplot: Forward scatter (FSC)
and side scatter (SSC) measurement. Lower scatterplot: Fluorescence intensities of cells stained
with FITC-conjugated CD14 antibody and APC-conjugated CD16 antibody. Monocytes were
then pre-incubated with vitamin A (atRA) or vitamin D, followed by stimulation with heat-killed
A. fumigatus, C. albicans or E. coli for 6 h. Poly-(A) RNA was isolated from the monocytes and
subjected to RNA sequencing.

RNA sequencing
RNA was isolated from 5 × 10⁶ monocytes using the RNeasy Mini Kit (Qiagen, Germany).
An additional step was included to remove the residual genomic DNA using
DNaseI (Qiagen, Germany). Total RNA was quantified using a Nanodrop ND-
1000 spectrophotometer (Thermo Fisher Scientific, USA). The quality of the RNA
samples (RNA Integrity Number (RIN) values ≥ 7.0) was measured using a Tape
Station 2200 (Agilent Technologies, USA). Poly-(A) RNA was purified from 2 µg of
total RNA using the Dynabeads mRNA DIRECT Micro Purification Kit (Thermo
Fisher Scientific, USA), according to manufacturer’s instructions. Quality control
for the depletion of rRNA was carried out using High Sensitivity RNA Screen Tapes
(Agilent Technologies, USA).
Strand-specific whole transcriptome libraries were prepared using the Ion Total
RNA-Seq Kit v2.0 (Thermo Fisher Scientific, USA). RNase III was employed to
fragment the purified RNA. Ion adapters were ligated to the resulting fragments, and
reverse transcription was performed using the SuperScript III Enzyme Mix (Thermo
Fisher Scientific, USA). Barcoded primers were used to amplify the libraries with the
Platinum PCR High Fidelity polymerase (Thermo Fisher Scientific, USA). Size
distribution analysis and quantification of the final barcoded libraries were performed on
D1000 Screen Tapes on the Tape Station 2200 (Agilent Technologies, USA). Library
templates were clonally amplified on Ion Sphere particles using the Ion PI Hi-Q Chef
Kit and Ion Chef instrument (Thermo Fisher Scientific, USA), loaded onto Ion PI
Chips and sequenced on an Ion Proton Sequencer (Thermo Fisher Scientific, USA).
For sequencing, a total of 48 samples were multiplexed on 12 chips. The raw sequence
data in FASTQ format are stored in the Sequence Read Archive (SRA) at the National
Center for Biotechnology Information (NCBI) and can be accessed via the NCBI
homepage (https://www.ncbi.nlm.nih.gov/; accession number: SRP076532).

5.1.3 Bioinformatic analysis of RNA-Seq data


Read preprocessing and rRNA filtering
Raw reads in FASTQ format were quality controlled with FastQC [267] (v0.11.3)
and trimmed with a window size of 10 using PRINSEQ [268] (v0.20.3) (Q ≥ 20,
length ≥ 20). The resulting reads were aligned to reference rRNA databases using
SortMeRNA [40] (v2.0). After quality control, up to 19,242,435 reads with a length of
20–373 bp and a GC content of ∼50 % were obtained for each sample (see Electronic
Supplement³, Tab. S13).
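
The trimming criteria stated above (window size 10, Q ≥ 20, length ≥ 20) can be illustrated with a minimal sketch. This is not PRINSEQ's implementation: the function names, the 3'-end-only trimming and the use of the window mean are assumptions made for illustration.

```python
def trim_3prime(qualities, window=10, min_q=20):
    """Return how many bases to keep after trimming the 3' end.

    Assumed rule (illustrative only): remove one base at a time from the
    3' end while the mean quality of the last `window` bases is below `min_q`.
    """
    end = len(qualities)
    while end > 0:
        start = max(0, end - window)
        win = qualities[start:end]
        if sum(win) / len(win) >= min_q:
            break
        end -= 1
    return end


def preprocess(seq, qualities, window=10, min_q=20, min_len=20):
    """Trim a read; return (seq, qualities) or None if it becomes too short."""
    keep = trim_3prime(qualities, window, min_q)
    if keep < min_len:
        return None
    return seq[:keep], qualities[:keep]


if __name__ == "__main__":
    # Hypothetical read: 30 good bases followed by 10 low-quality bases.
    quals = [30] * 30 + [5] * 10
    print(preprocess("A" * 40, quals))
```

Reads that fall below the minimum length after trimming are discarded rather than kept as short fragments, which mirrors the length ≥ 20 criterion.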

Mapping
The quality controlled and rRNA cleaned reads were aligned to reference genomes
using Segemehl [75] (v0.2.0). The mapping was performed against the human
genome version GRCh38, downloaded from Ensembl (release 80). Indices were
built together with the corresponding pathogen genomes, depending on the samples
to be mapped. For E. coli, the complete genome of strain K-12 substr. MG1655
(NC_000913.3) was downloaded from NCBI. Genomes of A. fumigatus and C. albicans
were obtained from aspergillusgenome.org (21.05.2015) and candidagenome.org
(Ca21, 28.05.2015). All mappings were performed with default parameters and the
-splits option of Segemehl to allow for multiple spliced read alignments.

3
http://www.rna.uni-jena.de/supplements/fungi_infection/

Differential gene expression


We used HTSeq-Count [77] (v0.6.0) to quantify strand-specific and uniquely mapped
reads at the exon level. As reference, the full human Ensembl annotation (version
GRCh38.80), including both protein- and non-coding genes, was used. Overall,
60,604 annotated features, comprising 19,825 protein-coding and 24,150 non-coding
genes, were included in the annotation.
The raw read counts and rRNA cleaned read counts from each sample were
normalized based on the library size and tested for significantly differentially
expressed genes (adjusted p-value ≤ 0.05) using DESeq2 [269] (v1.10.1) and various
Bioconductor [270] packages in R.
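
The library-size normalization mentioned above can be sketched with the median-of-ratios approach that DESeq2 uses by default to estimate size factors. The code below is a simplified stand-alone illustration, not a call into DESeq2; the function name and the toy counts are hypothetical.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: dict mapping sample name -> list of raw counts (same gene order).
    Genes with a zero count in any sample are ignored, because their
    geometric mean across samples is zero.
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])

    # Log of the per-gene geometric mean across samples (None if unusable).
    log_geo_means = []
    for g in range(n_genes):
        values = [counts[s][g] for s in samples]
        if all(v > 0 for v in values):
            log_geo_means.append(sum(math.log(v) for v in values) / len(values))
        else:
            log_geo_means.append(None)

    factors = {}
    for s in samples:
        log_ratios = [math.log(counts[s][g]) - lg
                      for g, lg in enumerate(log_geo_means) if lg is not None]
        factors[s] = math.exp(median(log_ratios))
    return factors


if __name__ == "__main__":
    toy = {"sample_a": [10, 20, 30, 0], "sample_b": [20, 40, 60, 5]}
    sf = size_factors(toy)
    normalized = {s: [c / sf[s] for c in toy[s]] for s in toy}
    print(sf, normalized)
```

Dividing each sample's counts by its size factor puts the samples on a common scale, so that a sample sequenced twice as deeply does not appear to express every gene twice as strongly.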

Gene filtering
In order to filter out low-expressed mRNAs, we calculated for each gene the
transcripts per million (TPM) value to eliminate potential biases due to the
transcript length in normalized read counts [79].
\[
  \mathrm{TPM}_i = \frac{c_i}{l_i} \cdot \left( \frac{1}{\sum_{j \in N} \frac{c_j}{l_j}} \right) \cdot 10^6
\]

where c_i is the raw read count of gene i, l_i is the length of gene i and N is the
set of all genes in the given annotation.
For each gene, we calculated four mean TPM values (TPM_M), based on the
12 samples corresponding to control or one of the three different infection types.
Subsequently, for each stimulatory setting we used TPM = 5 as a minimum limit for
detectability [31] of transcripts.
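
The TPM definition and the detectability filter described above can be sketched as follows. The function names and the toy numbers are hypothetical, and treating a mean TPM of exactly 5 as detectable (≥ rather than >) is an assumption.

```python
def tpm(counts, lengths):
    """TPM per gene: (c_i / l_i) / sum_j(c_j / l_j) * 1e6."""
    rates = [c / l for c, l in zip(counts, lengths)]
    total = sum(rates)
    return [r / total * 1e6 for r in rates]


def mean_tpm(per_sample_tpm):
    """Gene-wise mean over the TPM vectors of all samples in one setting."""
    n = len(per_sample_tpm)
    return [sum(values) / n for values in zip(*per_sample_tpm)]


def detectable(mean_values, threshold=5.0):
    """Flag genes whose mean TPM reaches the threshold (assumed inclusive)."""
    return [v >= threshold for v in mean_values]


if __name__ == "__main__":
    lengths = [1000, 2000, 7000]             # hypothetical gene lengths
    sample_1 = tpm([100, 200, 700], lengths)
    sample_2 = tpm([0, 400, 1400], lengths)
    means = mean_tpm([sample_1, sample_2])
    print(means, detectable(means))
```

Because every sample's TPM values sum to one million, the mean TPM of a setting stays comparable between settings even when sequencing depth differs.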

Principal component analyses


Principal component analyses (PCA) were performed subsequent to the corresponding
DESeq2 runs on selected gene subsets (e.g. only protein-coding genes or only
specific GO terms) and with different variance cutoffs in R. Three-dimensional PCAs
were plotted with the scatterplot3d package in R. We further investigated the
influence of different variance cutoffs on the PCA. By default, the standard R plotting
function is based on the 500 most variable genes in the data set. This hard cutoff
might not be optimal when only a few or very many genes are differentially
expressed. Therefore, we created animated PCA plots⁴ to check at which variance
thresholds the principal components of the PCA might change.
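
The variance cutoff discussed above amounts to ranking genes by their variance across samples and keeping the top n before the PCA is computed. The sketch below shows only this selection step for several cutoffs, not the PCA itself; the function names and the toy matrix are hypothetical.

```python
from statistics import variance


def top_variable_genes(matrix, n_top):
    """Return the n_top gene names with the highest sample variance.

    matrix: dict mapping gene name -> list of expression values across samples.
    """
    ranked = sorted(matrix, key=lambda gene: variance(matrix[gene]), reverse=True)
    return ranked[:n_top]


def selection_by_cutoff(matrix, cutoffs):
    """Gene selection for each cutoff, e.g. as input for one animation frame."""
    return {n: top_variable_genes(matrix, n) for n in cutoffs}


if __name__ == "__main__":
    toy = {
        "gene_a": [1.0, 1.0, 1.0, 1.0],
        "gene_b": [0.0, 10.0, 0.0, 10.0],
        "gene_c": [4.0, 5.0, 4.0, 5.0],
    }
    for n, genes in selection_by_cutoff(toy, [1, 2, 3]).items():
        print(n, genes)
```

Running a PCA on each of these nested gene sets and comparing the resulting components is one way to see at which cutoff the picture changes.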
4
http://www.rna.uni-jena.de/supplements/fungi_infection/pca_animation.gif.
During this project we built various animated 2D and 3D PCA plots and developed the
idea of an interactive web service for animated PCA visualization, which is
currently in preparation [19].


Identification of co-regulated genes

To compare the effect of atRA and vitamin D during different infections, log2 fold
changes (FC) as computed by DESeq2 were visualized using scatter plots in R. The
scatter plots were overlaid with contour plots for a two-dimensional kernel density
estimate (kde2d; MASS package) using the default parameters. Outliers were labeled
with the respective gene names. Box plots of certain gene expression patterns were
visualized with the help of [85].

Heat maps

Gene-wise hierarchical clustering was performed on variance-stabilized read counts
to build heat maps of selected differentially expressed gene sets (adjusted p-value
≤ 0.05). The input matrices for each heat map were scaled by row to visualize
changes in expression at the gene level.
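
Row scaling as described above is commonly done by converting each gene's values to z-scores across samples. The following is a minimal sketch under that assumption (centering on the row mean and dividing by the sample standard deviation); the function name and input values are hypothetical.

```python
from statistics import mean, stdev


def scale_rows(matrix):
    """Convert each row (gene) to z-scores across its columns (samples).

    Rows without any variation are mapped to zeros to avoid division by zero.
    """
    scaled = []
    for row in matrix:
        m, s = mean(row), stdev(row)
        if s == 0:
            scaled.append([0.0] * len(row))
        else:
            scaled.append([(x - m) / s for x in row])
    return scaled


if __name__ == "__main__":
    # Two hypothetical genes with very different absolute expression levels.
    print(scale_rows([[1.0, 2.0, 3.0], [100.0, 200.0, 300.0]]))
```

After scaling, both example genes yield the same row, which is the intended effect: the heat map colors then reflect the pattern of change per gene rather than its absolute expression level.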

Differential gene expression and key pathway analysis

Pairwise comparisons were carried out to address the effect of each pathogen stimula-
tion (unstimulated samples versus pathogen-stimulated samples) and also the effect
of the vitamin-mediated regulation in each stimulatory setting (pathogen-stimulated
samples versus pathogen-/vitamin-stimulated samples).
K-means clustering was performed on variance-stabilized read counts to build
a heatmap (selected gene set with adjusted p-value ≤ 0.05) in R. For this, the
pheatmap function was applied with the kmeans option and Euclidean clustering
distance of the rows. Beforehand, the model-based optimal number of clusters was
determined using Mclust of the mclust package in R [271, 272]. The assigned
genes of the resulting clusters were annotated by gene ontology analysis using the
PANTHER (Protein ANalysis THrough Evolutionary Relationships) classification
system⁵ and the Partek Genomics Suite 6.6 (Partek, USA). Pathway analysis
was performed using the Partek Pathways software tool (Partek, USA), which
employs the KEGG pathway database. Furthermore, for all genes that are affected by
either atRA or vitamin D treatment during any infection setting, STRING networks
were generated using the STRING database⁶. In order to identify key pathways
involved in either atRA- or vitD-mediated immunomodulation across the three
infections, we applied the KeyPathwayMiner tool [276, 277]. For each vitamin,
we used the log2 fold changes derived from the comparisons of pathogen-stimulated
samples versus pathogen-/vitamin-stimulated samples. All three input tables for
each pathogen (and each vitamin, respectively) were logically connected with AND,
and the parameters K and L were kept as default. All nodes of the resulting networks
represent genes that were either significantly down-regulated by the vitamins or
inferred by KeyPathwayMiner to connect subnetworks.

5
http://pantherdb.org/; http://geneontology.org/; [273, 274]
6
http://string-db.org [275]


Reverse transcription and quantitative PCR

Stimulation assays were repeated for an earlier time point. After three hours of stim-
ulation, RNA was isolated as previously described. Complementary DNA (cDNA)
was synthesized from 1.5 µg of RNA using the High Capacity cDNA Reverse Tran-
scription Kit (Applied Biosystems, UK) following manufacturer’s instructions. For
PCR analysis, specific primers for each target gene were designed using the online
Primer-BLAST tool of the National Center for Biotechnology Information⁷. In
order to improve the PCR efficiency, possible secondary structures of the amplicons
were taken into account by characterizing their nucleotide sequence using the Mfold
algorithm [278].
To quantify the relative expression of each gene, a Corbett Rotor-Gene 6000
(Qiagen, Germany) was used as RealTime qPCR apparatus. Each sample was an-
alyzed in a total reaction volume of 20 µl containing 10 µl of 2× SensiMix SYBR
Master Mix (Bioline, UK) and 0.2 µM of each primer. All qPCRs were set up using
a CAS-1200 pipetting robot (Qiagen, Germany). The cycling conditions were 95 °C
for 10 min followed by 40 cycles of 95 °C for 15 s, 60 °C for 20 s and 72 °C for 20 s. For
each experiment, an RT-negative sample was included as control. The specificity of
the qPCRs was assessed by melting curve analysis. The relative expression of the
target genes was analyzed using a modified Pfaffl method [279, 280]. To determine
significant differences in the mRNA expression between different experimental
conditions, the relative quantity (RQ) for each sample was calculated using the formula
1/E^Ct, where E is the efficiency and Ct the threshold cycle. The RQ was then
normalized to the housekeeping gene peptidylprolyl isomerase B (PPIB). The stability
of the housekeeping gene was assessed using the BestKeeper algorithm [281]. The
normalized RQ (NRQ) values were log2-transformed for further statistical analysis
with GraphPad PRISM v5.0. Statistical analysis was performed using repeated
measures ANOVA and Bonferroni correction.
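
The relative quantification described above can be written out as a small worked example. It assumes that E is the per-cycle amplification factor (2.0 for perfect doubling) and that normalization means dividing the RQ of the target gene by the RQ of the housekeeping gene; the function names and Ct values are hypothetical.

```python
import math


def relative_quantity(efficiency, ct):
    """RQ = 1 / E^Ct, with E as amplification factor per cycle (assumed)."""
    return 1.0 / (efficiency ** ct)


def log2_nrq(e_target, ct_target, e_ref, ct_ref):
    """log2 of the RQ of the target gene normalized to the reference gene."""
    rq_target = relative_quantity(e_target, ct_target)
    rq_ref = relative_quantity(e_ref, ct_ref)
    return math.log2(rq_target / rq_ref)


if __name__ == "__main__":
    # Hypothetical Ct values: target crosses the threshold 5 cycles after PPIB.
    print(log2_nrq(e_target=2.0, ct_target=25.0, e_ref=2.0, ct_ref=20.0))
```

With equal and perfect efficiencies, the log2 NRQ reduces to the negative Ct difference between target and reference, which is why the log2 scale is convenient for the subsequent statistics.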

5.1.4 Results and discussion


By using an RNA-Seq-based approach, we assessed the impact of different fungal
and bacterial pathogens on the transcriptional response of human monocytes, with
a focus on the potential of vitamins A and D to modulate this response (workflow
depicted in Fig 5.1). This modulatory role was explored after six hours of stimulation
in four different settings: without infection, upon A. fumigatus stimulation, upon
C. albicans stimulation, and upon E. coli stimulation.
In the entire transcriptional dataset, a total of 6,076 protein-coding genes showed
significant differential expression (p < 0.05) in any of the comparisons conducted.
Principal component analysis (PCA) disclosed the immunological challenge as the
main source of variance observed in our dataset, as demonstrated by the strong regu-
latory impact of the pathogens (PC1, Fig 5.2A). Nevertheless, we could also observe
a significant effect of the vitamins on the transcriptional regulation, especially for
vitamin A (atRA) (PC2 and PC3; Fig 5.2A).

7
NCBI, http://www.ncbi.nlm.nih.gov/tools/primer-blast/


Figure 5.2: Bird’s eye view of transcriptome changes upon stimulation with vitamin during infec-
tion. Our results demonstrate a huge impact of the pathogens and vitamins on the transcriptional
landscape of human monocytes. Furthermore, the transcriptional regulation by the vitamins is de-
pendent on the pathogenic stimulus. (A) 3-dimensional Principal Component Analysis (3D-PCA)
of the top 300 most variant genes was plotted with the scatterplot3d package in R [282]. The
first three principal components (PC1-PC3) account for ∼78 % of the total variance of the data.
(B) Representation of the total number of genes as bars (left y-axis) and the ratio of up-/down-
regulated genes as diamonds and triangles (right y-axis) in response to atRA and vitD during all
stimulatory settings. (C) Venn diagram showing the overlap of the atRA-regulated genes (FC > 2,
p < 0.05) during A. fumigatus stimulation (A.f., blue), C. albicans stimulation (C.a., green), E. coli
stimulation (E.c., magenta) or in absence of pathogen stimulation (w/o inf., orange). (D) Venn
diagram showing the overlap of the vitD-regulated genes (FC > 2, p < 0.05) during A. fumigatus
stimulation (A.f., blue), C. albicans stimulation (C.a., green), E. coli stimulation (E.c., magenta)
or in absence of pathogen stimulation (w/o inf., orange).

Transcriptional regulation by vitamins is altered during infection

Next, we assessed the total number of genes regulated by vitamins (FC > 2;
p < 0.05) in each of the immune-stimulatory settings (no infection, A. fumigatus
infection, C. albicans infection and E. coli infection), as well as the direction
of the regulation (up- vs. down-). While vitamin A stimulation led to a predominant
up-regulation of gene expression in absence of infection, vitamin D was more
active at repressing gene expression (Fig 5.2B). Upon immunological challenge, this
situation was reversed, with atRA showing higher potential as repressor of gene
expression. In contrast, the number of vitD-regulated genes decreased drastically
upon infection, and the up-regulated genes were almost twice as numerous as the
down-regulated ones (Fig 5.2B). These results suggest that the impact of vitamins on
the transcriptional profile of monocytes might be highly dependent on whether the
cells are under immune challenge or not. This hypothesis is reinforced when we
compare the genes regulated by the vitamins in each of the immune-stimulatory
settings (Fig 5.2C,D).
For vitamin A, we observed that 118 genes were regulated by atRA in all different
settings. Nevertheless, we also observed a surprisingly high number of genes
specifically regulated in each of the settings, with up to 614 genes regulated by atRA
exclusively upon E. coli infection, 182 upon C. albicans challenge and 228 upon
A. fumigatus infection (Fig 5.2C). In the case of vitamin D, it is notable that the
regulatory potential is reduced upon infection. A total of 518 genes were exclusively
regulated by vitD in the absence of immune challenge. Moreover, only 67 (5.8 %)
of the vitD-regulated genes were shared by all different settings, again suggesting a
high dependency on the stimulatory environment.

Immunomodulation as primary function of vitamins upon infection


K-means clustering of the 6,076 differentially expressed genes (DEGs) coupled with
gene ontology (GO) enrichment analysis allowed us to gain insight into the functional
relevance of different groups of vitamin-regulated genes, depending on their role
during infection (Fig 5.3).
Clusters with a strong stimulatory effect of vitamins A or D, but little impact
of the immune challenge, were governed by genes involved in cellular processes,
such as inorganic ion homeostasis, membrane organization or intracellular transport
(Clusters 7 and 10; Fig 5.3). Interestingly, the regulatory impact of the vitamins
on these clusters seems to regress as infection comes into play, especially during
bacterial challenge. On the other hand, and as expected, gene clusters displaying
a strong impact of the different pathogens were composed of genes belonging to
immune-relevant GO categories. Clusters that were characterized by a main effect
of A. fumigatus challenge (Clusters 6, 8, 13 and 18; Fig 5.3) were governed by genes
involved in amino acid metabolism, the immune-relevant Wnt-pathway signaling
and mechanisms of entrance into the host cell, among others. Clusters 11 and 15
(Fig 5.3) are composed of genes mainly related to the defense against C. albicans.
In these clusters we could identify a highly significant overrepresentation of genes
belonging to the type-I IFN signaling pathway, leukocyte chemotaxis and cytokine
production. A main impact of E. coli stimulation was observed in clusters 9 and
12 (Fig 5.3), characterized by genes relevant in the IL6 response, leukocyte migra-
tion and leukocyte activation, among others. Importantly, in most of these gene
clusters we could identify an important regulatory contribution by both vitamins,
especially for vitamin A. Moreover, in several clusters the impact of both vitamins
became apparent only upon pathogen stimulation, suggesting a differential role of
vitamins during infection when compared to cellular homeostasis (i.e. in absence of
immunological challenge).
Figure 5.3: Heatmap of K-means clustering of DEGs and subsequent GO enrichment analysis.
K-means clustering was performed on variance-stabilized read counts to build a heatmap for the
6,076 differentially expressed protein-coding genes (adjusted p-value < 0.05). A priori, the
model-based optimal number of K = 18 was determined. The clustering of the rows is based on
Euclidean distance. The colors in the map represent row-scaled expression levels: blue indicates
the lowest expression, white indicates intermediate expression, and red indicates the highest
expression. Selected clusters were analyzed with regard to their biological function by GO
enrichment analysis. Most enriched GO categories are shown for representative groups of clusters
displaying their fold enrichment.

Subsequently, we performed GO analysis on all vitamin A- or vitamin D-regulated
genes (FC > 2; p < 0.05) during infection. A total of 1,573 genes were differentially
regulated by atRA under any of the pathogen settings. GO analysis of these genes
revealed the Immune System Process (GO:0002376) as the most enriched GO category,
followed by Response to Stimulus (GO:0050896) (Fig 5.5A). A similar enrichment
was obtained also for the 624 vitD-regulated genes (Fig 5.5B), demonstrating the
important role of both vitamins in immune processes. Moreover, Immune System
Process was also the most prominent category among the genes regulated by atRA
during C. albicans and E. coli infections when settings were analyzed separately
(Fig 5.5C). For vitamin D, Immune System Process was the top-enriched GO category
in all settings. In addition, most of the immune-relevant genes regulated by
vitamins A and D are highly expressed in monocytes (Fig. 5.4). KEGG pathway analysis
of vitamin-regulated genes showed the Cytokine-Cytokine Receptor Interaction
as the pathway with highest enrichment score for both vitamins. Other significantly
enriched pathways included Chemokine signaling, TNF signaling and Hematopoietic
cell lineage, among several other immune-relevant processes. Thus, pathway
analysis underpins the remarkable impact of the vitamins on immune functions.


Figure 5.4: Expression plots (MA plots), showing the vitamin-dependent transcriptional profiles.
The scatter plots display the mean expression (x-axis) and log2 fold changes (y-axis) of differentially
expressed genes in response to atRA and vitD under each of the stimulatory settings (w/o infection,
A. fumigatus infection, C. albicans infection, E. coli infection). Red dots represent significantly
(p < 0.05) regulated genes. Blue dots represent DEGs belonging to the Gene Ontology (GO)
category GO:0002376 (Immune System Process).

Figure 5.5: Gene ontology analysis of the vitamin-induced transcriptional changes during
infection. Analysis revealed the Immune System Process as the most affected GO category in
response to vitamin treatment. (A) GO enrichment analysis of all atRA-regulated genes
(1,573 genes, FC > 2, p < 0.05) during any of the three analyzed infections. Percentage of the
total enrichment scores are shown for the top five GO categories (biological process).
(B) GO enrichment analysis of all vitD-regulated genes (624 genes, FC > 2, p < 0.05) during any
of the three analyzed infections. (C) Top atRA-regulated GO categories during each of the
infections. (D) Top vitD-regulated GO categories during each of the infections.

Counteracting the transcriptional response against pathogens

The question remained as to the direction and extent of the vitamin-mediated
regulation in each stimulatory setting. In order to address these questions, we
subsequently analyzed the differential expression of all those immune-relevant genes that
were regulated by both the vitamins and the pathogens. Interestingly, there was a
huge overlap of genes regulated by both stimuli. Thus, of the 235 immune-relevant
genes (GO:0002376) that were regulated by atRA during E. coli infection, up to
195 genes (83 %) were also regulated by the pathogen itself. Similar overlaps were
observed during fungal infections with up to 70.5 % and 72.7 % for A. fumigatus
and C. albicans stimulation, respectively. For vitamin D, these overlaps with the
infections were 73.6 %, 64.4 % and 74.2 % for A. fumigatus, C. albicans and E. coli
challenge, respectively. By plotting fold changes relative to their unstimulated
controls, we could discriminate between counteractive and synergistic effects between
the vitamins and the pathogenic stimulus in each case (Fig 5.6).
In all settings, the vast majority of the immune-relevant genes were up-regulated
after pathogen challenge (Fig. 5.4), as expected, and this effect was reversed by the
vitamins. Especially atRA showed an important counteractive effect against the
pathogen challenge. AtRA counteracted the effect of the pathogens in 78 % of the
genes regulated also by A. fumigatus, 65 % of the genes regulated by C. albicans,
and 85 % of the genes regulated by E. coli. Similar results were obtained for
vitD-mediated regulation, with 69 %, 62 % and 68 %, respectively (Fig 5.6). Moreover,
this effect becomes even more apparent when analyzing the expression of genes
belonging to the GO category Immune Response (GO:0006955), especially in the case
of vitamin A (see Fig. 5.7). This significant counteractive effect suggests an
important immunomodulatory potential for both vitamins during bacterial and fungal
infections.

Figure 5.6: Vitamins A and D strongly counteract the transcriptional response of human
monocytes to pathogens. Graphic representation of the expression dynamic of immune-relevant
genes (GO:0002376) differentially regulated by both the pathogens and the vitamins. Patterns
are divided by the type of correlation observed between the effects of pathogen and vitamin
stimulations: counteractive effect (up-regulation by pathogen and down-regulation by vitamin,
or vice versa) and synergistic effect (same direction observed in the differential expression
induced by pathogen and vitamin stimulations). Pie charts show the proportion of genes
depicting counteractive effects (red) and synergistic effects (green).
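
The distinction between counteractive and synergistic effects can be expressed as a simple sign comparison of two log2 fold changes per gene. The sketch below assumes that the first value compares pathogen-stimulated with unstimulated cells and the second compares pathogen-/vitamin-stimulated with pathogen-stimulated cells; the gene names and values are invented for illustration and are not taken from the data set.

```python
def classify(lfc_pathogen, lfc_vitamin):
    """Label one gene by the direction of two log2 fold changes.

    lfc_pathogen: pathogen vs. unstimulated control (assumed contrast)
    lfc_vitamin:  pathogen + vitamin vs. pathogen alone (assumed contrast)
    """
    if lfc_pathogen == 0 or lfc_vitamin == 0:
        return "undefined"
    same_direction = (lfc_pathogen > 0) == (lfc_vitamin > 0)
    return "synergistic" if same_direction else "counteractive"


def counteractive_fraction(genes):
    """Fraction of genes with opposite directions among the classifiable ones."""
    labels = [classify(p, v) for p, v in genes.values()]
    defined = [label for label in labels if label != "undefined"]
    return sum(label == "counteractive" for label in defined) / len(defined)


if __name__ == "__main__":
    # Hypothetical genes: (log2FC by pathogen, log2FC by vitamin).
    toy = {
        "gene_a": (3.2, -1.5),
        "gene_b": (2.0, -2.1),
        "gene_c": (-1.2, -1.4),
        "gene_d": (4.0, -1.1),
    }
    print(counteractive_fraction(toy))
```

Applied to the genes regulated by both stimuli in one infection setting, this fraction corresponds to the kind of proportion summarized in the pie charts of Fig 5.6.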


[Figure 5.7, panels A, B and C: heat maps with gene names as row labels. Columns per panel:
control, pathogen, pathogen + atRA, pathogen + vitD (pathogen: A. fumigatus in A, C. albicans
in B, E. coli in C). Color scale: log2 fold change from 1 to −1.]

Figure 5.7: Hierarchical clustering of differentially expressed genes of GO:0006955 (Immune
Response). Heat map of all genes differentially regulated (DEGs) by both the pathogens and any of
the vitamins in each infection model. (A) during infection with A. fumigatus and treatment with
either vitamin A or D; (B) during infection with C. albicans and treatment with either vitamin A
or D; (C) during infection with E. coli and treatment with either vitamin A or D.


Differential role of vitamins A and D as immunomodulators


Vitamin-dependent regulation of inflammatory mediators during infec-
tion. Across all three pathogen settings, atRA was able to significantly modulate
(FC > 2; p < 0.05) the expression of 346 genes belonging to GO:0002376 (Immune
System Process). Of these genes, 39 were common for all three infections, whereas
42 genes were regulated by atRA only upon A. fumigatus infection and 36 dur-
ing C. albicans infection. Upon E. coli challenge, atRA regulated significantly more
genes, with 136 being specific for that infection (Fig 5.8A). These genes included cy-
tokines such as IL1A, IL15, IL19, IL20, IL23A, IL24, CSF1 and IFNG, chemokines
like CXCL10 and IL8, as well as metalloproteases such as MMP1, among others
(Fig 5.8B). Genes exclusively regulated by atRA during fungal infections included
CXCL6 for A. fumigatus challenge, and the fungal pattern recognition receptor
Dectin-2 (CLEC6A) as well as the type-I interferons IFNB1 and IFNA14 during
C. albicans infection. The cytokine-coding gene IL12A was also regulated by atRA
exclusively upon C. albicans challenge, ranking among the top down-regulated genes
in the whole dataset (FC = −42.2, adjusted p = 1.9E−049).
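
The gene counts reported above can be cross-checked with simple set arithmetic. The following minimal sketch uses only the numbers given in the text (346 atRA-regulated genes in total, 39 shared by all three infections, and 42, 36 and 136 infection-specific genes). The function name is an illustrative assumption and not part of the original analysis. It derives how many genes must fall into the regions of the Venn diagram that are shared by exactly two infections.

```python
def genes_in_exactly_two(total, shared_by_all, specific_counts):
    """Genes lying in the pairwise-only regions of a three-set Venn diagram."""
    return total - shared_by_all - sum(specific_counts)


# Counts reported in the text for atRA-regulated genes of GO:0002376
ATRA_TOTAL = 346
ATRA_SHARED_BY_ALL = 39
ATRA_SPECIFIC = {"A. fumigatus": 42, "C. albicans": 36, "E. coli": 136}

atra_pairwise = genes_in_exactly_two(
    ATRA_TOTAL, ATRA_SHARED_BY_ALL, ATRA_SPECIFIC.values()
)
print(atra_pairwise)  # 346 - 39 - (42 + 36 + 136) = 93
```

Under these counts, 93 genes would be regulated by atRA in exactly two of the three infection settings.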

Figure 5.8: Immunomodulatory footprint of vitamin A during infection. AtRA shows a pathogen-
specific regulatory role on immune-relevant genes, leading to an overall down-regulation of
cytokines, chemokines and matrix metalloproteases and an up-regulation of complement-related
genes. (A) Venn diagram showing the overlap and number of atRA-regulated immune-relevant
genes (GO:0002376) in each of the infections analyzed: A. fumigatus (blue), C. albicans (green)
and E. coli (magenta). (B) Network based on experimental and database-derived knowledge
(edges) generated with the STRING database. Node symbols distinguish genes regulated by atRA
in at least two of the pathogenic infection settings from atRA-regulated genes during
C. albicans, E. coli or A. fumigatus infection; red: down-regulation, green: up-regulation.

As shown in Fig 5.8B, atRA led to an important down-regulation of several
immune-relevant genes when compared to their expression upon pathogen challenge
alone. Almost all cytokines were down-regulated by atRA, and a similar pattern
was observed for the chemokines and metalloproteases. Interestingly, genes involved
in complement activity, such as C5AR1, were rather up-regulated by addition of
atRA. Among the top up-regulated genes we also identified a member of the
immunoregulatory CD300 molecules, CD300A, up-regulated by atRA in all settings,
up to 32-fold.
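
The fold changes quoted in the text (for example FC = −42.2 or "up to 32-fold") and the log2 fold changes shown in the heat maps are two views of the same quantity. The sketch below converts between them. It assumes the common convention that a negative FC denotes down-regulation by the stated factor; this convention and the function name are assumptions made for illustration, not definitions taken from this study.

```python
import math


def signed_fc_to_log2(fc):
    """Convert a signed fold change (|FC| >= 1) into a log2 fold change.

    Assumed convention: FC = -x means down-regulation by a factor of x.
    """
    if abs(fc) < 1:
        raise ValueError("signed fold changes are expected to satisfy |FC| >= 1")
    return math.copysign(math.log2(abs(fc)), fc)


print(signed_fc_to_log2(32))                # 5.0
print(round(signed_fc_to_log2(-42.2), 1))   # -5.4
```

Under this convention, a 32-fold up-regulation corresponds to a log2 fold change of 5, and FC = −42.2 to roughly −5.4.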
Vitamin D regulated fewer immune-relevant genes (GO:0002376) than atRA, with
a total of 176 genes across all three settings. Of these, 38 were specific to
A. fumigatus infection, 22 to C. albicans infection and 61 to E. coli challenge,
while 26 of the vitD-regulated genes were common to all stimulatory settings
(Fig 5.9A). As shown in Fig 5.9B, we can also observe a rather down-regulatory
effect of vitD on cytokines, chemokines and metalloproteases. Several of the
cytokines regulated by atRA were also regulated in the same way by vitD. These
included IL6, IL1A, IL12B, CCL1, CXCL1 and CXCL2, among others. Other
immune-relevant genes were exclusive for vitD regulation, such as the
antimicrobial peptide Cathelicidin (CAMP) or the chemokine CCL8 (Fig 5.9B),
both up-regulated by vitD.

Figure 5.9: Immunomodulatory footprint of vitamin D during infection. VitD shows a pathogen-
specific regulatory role on immune-relevant genes, leading to an overall down-regulation of
cytokines, chemokines and matrix metalloproteases. (A) Venn diagram showing the overlap and
number of vitD-regulated immune-relevant genes (GO:0002376) in each of the infections analyzed:
A. fumigatus (blue), C. albicans (green) and E. coli (magenta). (B) Network based on
experimental and database-derived knowledge (edges) generated with the STRING database. Node
symbols distinguish genes regulated by vitD in at least two of the pathogenic infection settings
from vitD-regulated genes during C. albicans, E. coli or A. fumigatus infection; red:
down-regulation, green: up-regulation.

Taken together, both vitamins play an important role as modulators of the
transcriptional response against fungi and gram-negative bacteria, with a strong
impact on specific cytokine- and chemokine-expression, depending on the stimula-
tory setting. Moreover, for both vitamins, we could identify consensus inhibitory
networks across the three different pathogenic stimulations using the KeyPath-
wayMiner [276, 277] tool. These networks confirmed the important regulatory
role of both vitamins on the TNF signaling, metalloprotease production, and IFN
pathways.


To address whether transcriptional mechanisms might indirectly contribute to
the vitamin-mediated immunomodulation, we investigated the expression profiles of
central regulators of relevant signaling cascades at an earlier stage. We were able
to identify several genes which were differentially expressed in response to atRA
after three hours of stimulation. These included the activating kinase BTK and the
phosphatase PPP3CA, as well as the inhibitory phosphatases PTPN7, DUSP1 and
DUSP7. The regulatory receptors LILRB1 and CD300A were also already regulated
by atRA at this early stage (Fig. 5.10 and 5.11).

Comparison of the immunomodulatory potential of atRA and vitD. Next,
we compared the effects exerted by both vitamins on the transcriptional response
during infection. We observed an important overlap in the regulatory potential
of both vitamins. As shown in Fig 5.12A, 444 of the vitD-regulated genes
were also regulated by atRA. This fraction represents 71.2 % of the vitD-regulated
genes, but only 28.2 % of the genes regulated by atRA. When we focused only on
the immune-relevant genes, the distribution was similar. 217 genes were exclusively
regulated by atRA and only 47 genes were vitD-specific. A total of 129 genes were
regulated by both vitamins (Fig 5.12B). When we analyzed the regulation of these
129 genes in each of the pathogen settings, most of the genes were regulated in the
same direction by both vitamins (Fig 5.12C). Nevertheless, we could also observe
differential modulatory effects by each of the vitamins as to the direction and extent
of the regulation. Genes such as EDN1, BCL2 and CCL22 were up-regulated by
vitD and down-regulated by atRA. Other genes such as CCL23 or CD300A were
strongly up-regulated only by atRA in all settings. Differences in the magnitude
of the regulation became apparent for genes such as CD14, which was strongly up-
regulated by vitD during both fungal infections. In the C. albicans infection, we
could observe significant differences in the magnitude of down-regulation of genes
such as IL12A, CCL1 and IFNA14. A similar scenario is observed for E. coli stimu-
lation, with a stronger impact of atRA on the differential expression of most genes,
as shown for CSF2 and IL12B (Fig 5.12C).
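
The Venn counts for the immune-relevant genes given above are internally consistent with the totals reported earlier (346 atRA-regulated and 176 vitD-regulated genes of GO:0002376). The following sketch recomputes these totals and the shared fractions from the three reported counts; the function and variable names are illustrative assumptions.

```python
def overlap_fractions(only_a, only_b, both):
    """Return the shared fraction of set A and of set B in a two-set Venn diagram."""
    total_a = only_a + both
    total_b = only_b + both
    return both / total_a, both / total_b


# Immune-relevant genes (GO:0002376) as reported in the text
atra_only, vitd_only, shared = 217, 47, 129

frac_atra, frac_vitd = overlap_fractions(atra_only, vitd_only, shared)
print(atra_only + shared)  # 346 atRA-regulated immune-relevant genes
print(vitd_only + shared)  # 176 vitD-regulated immune-relevant genes
print(f"shared: {frac_atra:.1%} of atRA-regulated, {frac_vitd:.1%} of vitD-regulated genes")
```

As for the complete gene sets, the shared genes make up a much larger part of the vitD-regulated set than of the atRA-regulated set.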

5.1.5 Conclusions
We used a high-throughput RNA-Seq-based approach to characterize the whole
immunomodulatory potential of the vitamins A and D during infections of bacterial
and fungal origin. Using human monocytes as a host-cell model, we analyzed the
differential role of both vitamins upon four different stimulatory settings: upon
A. fumigatus infection, upon C. albicans infection, upon E. coli infection or in
the absence of any inflammatory stimulus. Gene ontology and pathway analyses of the
differential expression patterns were carried out to define the regulatory role of these
vitamins upon each infection type, and to identify their underlying mechanisms.
We observed an important and specific impact of the inflammatory stimulus on
the vitamin-mediated regulation of transcription. This was especially true for
vitamin D, where infection drastically reduced the number of vitD-regulated genes
when compared to its regulation in the absence of inflammatory stimulus (Fig 5.2B).
The relation of up- vs. down-regulated genes was also shifted upon infection, with
more up-regulated than down-regulated genes in response to vitD upon infection.
Interestingly, the contrary could be observed for vitamin A, where atRA led to
slightly more repression of gene expression than activation, only during infection
(Fig 5.2B). The surprising number of vitamin A-down-regulated genes cannot be
explained by the classical mechanisms of RAR-mediated regulation of transcription,
with main focus on ligand-dependent activation of gene expression [240]. Moreover,
the recently identified non-genomic effects of atRA also rely on the activation of
signaling cascades [240]. Indeed, nuclear receptor-mediated repression and
trans-repression pathways are largely unknown, and only few mechanisms have been
described for glucocorticoid receptors, peroxisome proliferator-activated receptor
(PPAR)γ and liver-X receptors, but none for RARs [283]. Moreover, the question
remained as to what extent certain groups of genes might be specifically up- or
down-regulated by the vitamins during infection.

[Figure 5.10: eight panels, each with the x-axis categories control, A. fumigatus,
C. albicans and E. coli. Legend: infections (control, C. albicans, A. fumigatus,
E. coli) and vitamins (control, atRA, vitD).]

Figure 5.10: Analysis of the expression profiles of immunomodulatory genes in response to atRA
and vitD after three hours of stimulation. Relative mRNA expression levels of selected genes were
measured by qPCR. Data were obtained from five independent experiments, each performed with
cells from different donors. Statistical analysis was carried out by using repeated measures ANOVA
and Bonferroni correction. Results are presented as mean ± SEM of the fold change relative to the
control (unstimulated cells). For RNA-Seq based expression patterns of these genes see Fig. 5.11.
*** p ≤ 0.001, ** p ≤ 0.01, * p ≤ 0.05.

[Figure 5.11: eight box-plot panels (BTK, STAT1, PPP3CA, CD300A, LILRB1, PTPN7, DUSP1,
DUSP7), each with the x-axis categories control, A. fumigatus, C. albicans and E. coli and a
logarithmically scaled expression axis. Legend: infections (control, C. albicans, A. fumigatus,
E. coli) and vitamins (control, atRA, vitD).]

Figure 5.11: Analysis of the RNA-Seq expression profiles of immunomodulatory genes in response
to atRA and vitD after six hours of stimulation. The box plots show the normalized expression
(x-axis) and the log2 fold change (y-axis) for the same genes as selected in Fig. 5.10 (central
regulators of the main signaling cascades identified in this study). The RNA-Seq abundances
estimated here six hours after infection agree well with the qPCR-measured patterns already
observed after three hours and presented in Fig. 5.10.

Figure 5.12: Comparison of the immunomodulatory potentials of vitamin A and vitamin D. During
infection, an important overlap among the gene sets regulated by both vitamins could be revealed.
Nevertheless, differential modulatory effects by each of the vitamins could be observed among
several immune-relevant genes. (A) Venn diagram showing the overlap of atRA- and vitD-regulated
protein-coding genes during any of the infections. (B) Venn diagram showing the overlap of atRA-
and vitD-regulated genes belonging to GO category GO:0002376 (Immune System Process). (C)
Scatter plots display the differential effects of each vitamin on the expression of immune-relevant
genes. Each axis depicts the vitamin-induced log2 fold change as compared to the corresponding
pathogen stimulation alone. Genes written in red indicate significant differences in their regulation
induced by each vitamin (real FC > 3 between the fold changes induced by atRA and vitD during
infections).

K-means clustering coupled with gene ontology enrichment analysis allowed us
to identify different patterns of vitamin-mediated regulation depending on
concomitant pathogen stimulation. Predominant enrichment of metabolic pathways was
observed in gene clusters with no or minimal impact of pathogen stimulation, while
the immunomodulatory role of vitamins was highlighted across pathogen-triggered
clusters (Fig 5.3). Overall, during inflammation, the GO category that was most
enriched by vitamin stimulation was Immune System Process (GO:0002376). For
vitamin D this enrichment could be confirmed for each of the infections. For atRA,
Immune System Process ranked first during C. albicans and E. coli infection, but
only 4th upon A. fumigatus infection (Fig 5.5). This slight shift might be explained
by a generally lower transcriptional response to A. fumigatus, with fewer
immune-relevant genes susceptible to atRA-mediated regulation. Nevertheless,
the huge importance of both vitamins as immunomodulators becomes even more
evident considering the number of immune-relevant genes (GO:0002376) regulated
by atRA and vitD during infection: 346 and 176 genes, respectively. In addition,
pathway analysis underpinned the notion that immune response is the biological
function most regulated by both vitamins. Although the immunomodulatory effect
of both vitamins has already been described for single genes in different cell
models [259, 260, 262, 263, 284–286], the dimension of this regulatory function has
not been previously reported.
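
Enrichment of a GO category within a gene cluster, as referred to above, is commonly assessed with a one-sided hypergeometric test. The sketch below illustrates that generic test with the standard library only. It is not a restatement of the exact procedure or software used in this study, and all numbers in the usage example are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(population, category_size, sample_size, hits):
    """P(X >= hits) when sample_size genes are drawn without replacement
    from a population containing category_size genes of the category."""
    upper = min(category_size, sample_size)
    tail = sum(
        comb(category_size, k) * comb(population - category_size, sample_size - k)
        for k in range(hits, upper + 1)
    )
    return tail / comb(population, sample_size)


# Hypothetical numbers, chosen only for illustration:
# 20,000 annotated genes, 2,000 of them in the category,
# a cluster of 500 genes containing 100 category members.
p_value = hypergeom_enrichment_p(20_000, 2_000, 500, 100)
print(p_value)
```

A small p-value indicates that the category is represented in the cluster more often than expected by chance; in practice, such p-values are additionally corrected for multiple testing.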
For all stimulations, we could describe an overwhelming and still unreported
counteractive effect between the pathogen- and the vitamin-driven regulation of
immune-relevant genes. Both vitamins, especially vitamin A, counteracted the tran-
scriptional regulation in response to the pathogens (Fig 5.6). We also observed this
behavior for antisense transcripts and lncRNAs [5]. The most prevalent expression
dynamic was defined by genes that were up-regulated by the pathogens, with this
effect being reversed by vitamin A. The functional classification of the atRA-regulated
genes allowed us to identify the cytokines as the best representatives of this dy-
namic. AtRA down-regulated almost all cytokines, and a similar tendency was ob-
served also for the chemokines. On the other hand, complement-activity genes were
rather up-regulated by atRA. These findings might suggest a scenario in which atRA
could lead to an attenuation of the immune response, in terms of pro-inflammatory
cytokine release and immune cell recruitment, but sustain effective phagocytosis.
This type of immunomodulation might have large-scale clinical potential, especially
during systemic infections with hyper-inflammatory response, including C. albicans
and E. coli infections. During the last decade, much effort has been devoted to
developing new therapeutic strategies for severe sepsis treatment, highlighting the need
for new immunomodulatory agents [287], especially since most of the proposed im-
munomodulators have failed to disclose any clinical benefit or have shown limited
clinical efficacy [288, 289].
We could observe a high degree of specificity in the vitamin-mediated
regulatory patterns between the different infection models. On the one hand, E. coli
infection increased the capability of vitamins to regulate gene expression, as
compared to the other two infections. This might be attributed to the fact that
E. coli stimulation led to the strongest transcriptional response, thereby presenting
more targets susceptible to regulation by atRA or vitD. On the other hand, we could
identify genes regulated by the vitamins in a pathogen-specific manner. Type-I
interferons (IFN), for instance, were down-regulated by both vitamins exclusively
upon C. albicans infection. The type-I IFN response was identified in our dataset
as a key signature of the C. albicans-induced inflammation, which is in agreement
with previous reports [290]. Furthermore, Majer et al. could demonstrate that during
systemic Candida infection type-I IFN responses are associated with increased
hyper-inflammation, tissue damage, and lethality [291], highlighting the therapeutic
need for immunomodulators able to repress this response.
In order to investigate possible indirect mechanisms leading to vitamin-mediated
down-regulation of pro-inflammatory cytokines, we analyzed the expression profiles
of central regulators of the main signaling cascades. These included the Jak-STAT
signaling, the NFκB signaling or the MAPK signaling pathways, which upon vitamin
stimulations showed significant enrichment in at least one of the stimulatory settings
(data not shown). As shown by qPCR analysis and the RNA-Seq data presented
here, important activators of these pathways, such as BTK and PPP3CA were
down-regulated by atRA after only three hours of stimulation (Fig. 5.10) and still
regulated after six hours according to the RNA-Seq data (Fig. 5.11).
Interestingly, both genes were also identified as potential key players of the regu-
latory role of atRA in the consensus network inferred by the KeyPathwayMiner [276,
277] tool. We could also demonstrate the up-regulation of several inhibitory
phosphatases by atRA. These included PTPN7, DUSP1 and DUSP7, which have been
shown to regulate the MAPK signaling pathway [292, 293]. In addition, the ITIM-bearing
inhibitory receptors CD300A and LILRB1 were already up-regulated by atRA after
three hours of stimulation. Both receptors have been shown to modulate immune
responses [294, 295]. The transcriptional modulation of these genes might explain,
at least in part, the regulatory effects of atRA observed after 6 h in our transcrip-
tome data. Interestingly, none of them were differentially expressed upon vitamin D
treatment, despite the reports pointing towards DUSP1 as central player in vitD-
mediated regulation of immune functions [262]. An additional mechanism which
could contribute to the immunomodulatory action of vitamins would be monocyte
subpopulation differentiation. Nevertheless, despite a strong up-regulation of CD14,
especially by vitamin D, we failed to find a particular pattern of subpopulation shift
when analyzing additional markers, such as CCR2 and CD16, at this early stage
(data not shown). Further studies are needed to disclose the mechanisms of vita-
min D-mediated repression of gene expression.
In conclusion, in the present study we have comprehensively characterized the
immunomodulatory potential of vitamins A and D during infection. We observed
that this potential is dependent on inflammatory stimulus and is to some extent
specific for each of the pathogen scenarios tested. We described the undocumented
ability of both vitamins to counteract the inflammatory response triggered by each
pathogen. As a result, we observed a strong anti-inflammatory activity, especially
by atRA, leading to the down-regulation of several cytokine- and chemokine-coding
genes among others. Our study identified vitamins A and D as potent immunomod-
ulators, that might be of particular importance in systemic infections, where the
dysregulation of the immune response is responsible for the fatal outcome [287, 291,
296]. Moreover, recent studies have described important inadequacies of retinol [297]
and vitamin D [298] in critically ill patients, and associated these deficiencies with
increased risk of mortality. Thus, monitoring the serum levels of vitamins A and D,
as well as their adequate supplementation in individuals admitted to intensive care
units, might have far-reaching prophylactic and therapeutic implications in severe
infections.
Accession codes. The raw Ion Torrent sequence data in FASTQ format are stored
in the Sequence Read Archive (SRA) at the National Center for Biotechnology
Information (NCBI) and can be accessed at the NCBI homepage (accession number:
SRP076532).


5.2 Differential transcriptional responses to Ebola and Marburg virus infection in bat and human cells
During my PhD time I had the great opportunity to contribute to the ongoing
development of antiviral drugs during the 2014 West African Ebola outbreak. Compared
to other RNA-Seq projects, this project was not as straightforward regarding the
bioinformatic analyses that were needed to comprehensively analyze the data. One
problem was the lack of an appropriate reference genome of the fruit bat Rousettus
aegyptiacus, which was simply not available in 2014 when the project started.
Therefore, we decided to construct a de novo transcriptome assembly (Chapter 4)
for this bat species, annotated the obtained transcripts and used this as a
reference for mapping, read quantification and differential expression detection. This
approach led directly to the challenging task of comparing the expression between
homologous human and bat genes derived from a genome and a transcriptome
reference, respectively. Another problem was the lack of biological replicates. In some
of the comparisons we used the different time points as replicates for normalization;
however, the statistical evaluation remained complex.

The Fast-track, in-silico “Fight against Ebola”. After the start of the 2014
outbreak, we decided to speed up our analysis, which otherwise would have taken
up to three years. To accomplish this, 30 scientists with experience in analyzing
RNA-Seq data came together to “Fight against Ebola” and manually investigated
each gene. During this “hackathon”, we investigated 1,500 genes (7.5 % of human
protein-coding genes) in great detail. Each gene was analyzed using the IGV and
UCSC browsers. The gene, its synonyms, functional information, screenshots of
genomic locations, isoforms, fold changes and maximum read counts can be found
in a comprehensive Electronic Supplement.

Figure 5.13: Thanks to all the great “Ebola Fighters” who came together in November 2014 in
Jena to work for one week systematically on RNA-Seq data of Ebola and Marburg virus infected
human and bat cells.

Here, I would like to take the opportunity to thank again all the great people
who joined us for one week in Jena to “Fight against Ebola” (some of them are
shown in Fig 5.13). As a result of this “hackathon”, for whose coordination and
realization I was mainly responsible, an overwhelming number of analyses was
performed by researchers with different expertise. Here, a condensed summary
of all the insights obtained during Ebola and Marburg virus infection of human and
bat cells will be given [4].


The comprehensive Electronic Supplement, comprising an interactive database
for browsing all human genes with manual descriptions of their expression⁸ and a
detailed gene viewer⁹, is available online. As many references to the Electronic
Supplement are used in this section, online material is additionally indicated with
the prefix “ES” (Electronic Supplement) for more clarity. This section is further
accompanied by Appendix A.

5.2.1 The 2014 Ebola outbreak in West Africa


An Ebola virus (EBOV) outbreak in West Africa of unprecedented severity resulted
in over 28,600 cases and 11,300 deaths as of April 2016¹⁰. EBOV and Marburg
virus (MARV) are closely related filoviruses, with a nucleotide identity of 49.5 %.
They contain single-stranded RNA genomes with a negative orientation that are
approximately 19 kb in size and encode seven structural proteins [224]. The natural
hosts of filoviruses are presumed to be bats [302], which may be the origin of the
recent EBOV outbreak in West Africa [303]. Rousettus aegyptiacus is the natural
reservoir of MARV [304–306], and it survives MARV infections without any signs
of the disease [307, 308]. However, humans with filovirus infections experience a
severe fever and vascular leakage, with high fatality rates [309]. Surprisingly little is
known about the response of human cells to EBOV and MARV infections, and the
response in bat cells has not been investigated at all. Barrenas et al. [299] used a
Next-Generation sequencing approach to understand the cellular immune response
of vaccinated cynomolgus macaques after an EBOV challenge. Microarray-based
studies have described the differential expression of several known cellular genes after
EBOV infections in mice and rhesus monkeys [300, 301, 310, 311]. However, the
overall response of human and bat cells to filovirus infections is not yet known. Key
proteins in the infection process and their regulatory circuits have not been defined,
and the transcriptional landscape of non-coding RNAs (ncRNAs) and alternative
mRNA isoforms is unexplored. Cellular targets are an attractive alternative for the
development of antiviral drugs because viruses cannot adapt to a change in the host
cell as easily as they can develop a resistance to an antiviral drug over the course
of treatment [312, 313]. To establish effective antiviral strategies, it is necessary
to understand how the infected cells respond to filovirus infections and how this
response differs between bats and humans. We explored the cellular regulatory
response mechanisms by sequencing the full transcriptomes of immortalized cells
of human and bat origin at three different time points post infection (p.i.). We
constructed a full de novo transcriptome assembly based on the RNA-Seq data of
R. aegyptiacus.
Here, we provide a systematic report on (1) a genome-wide analysis of EBOV,
MARV, human, and bat transcripts, as well as (2) single genes that show strong
differential regulation, (3) the regulatory transcription factors, and (4) the
corresponding pathways that are involved in the response to EBOV and MARV
infections. We focus on the transcriptional differences and similarities between
EBOV- and MARV-infected human and bat cells.

⁸ http://www.rna.uni-jena.de/supplements/filovirus_human_bat/
⁹ http://www.rna.uni-jena.de/supplements/filovirus_human_bat/igo.php
¹⁰ http://www.cdc.gov/vhf/ebola/outbreaks/2014-west-africa/previous-case-counts.html

[Figure 5.14: four panels. (A) Workflow: HuH7 or R06E-J cells; Mock, MARV or EBOV, MOI = 3;
sampling at 3 h, 7 h and 23 h p.i.; rate of infection (IFA); viral propagation (PCR); RNA
isolation; DNase digestion; quality/quantity check; transcriptome analyses. (B) Filovirus-specific
real-time PCR: Ct values at 3 h, 7 h and 23 h for human and fruit bat cells infected with MARV or
EBOV; bar labels range from 15.7 to 25.8. (C) EBOV and (D) MARV immunofluorescence images for
Mock, 3 h, 7 h and 23 h p.i., annotated with approximately 90 %, 90 %, 99 % and 70 % infected
cells.]

Figure 5.14: Monitoring sample preparation. (A) HuH7 and R06E-J cells were infected with
MARV or EBOV (MOI = 3) or left uninfected (Mock). Samples were collected after 3, 7 and 23 h
post infection (p.i.). (B) RNA of infected and uninfected cells was isolated at 3, 7 and 23 h p.i.,
checked for quality and quantity, and filovirus-specific real-time PCR was performed according to
Panning et al. [314]. (C) and (D) To determine the number of infected cells, immunofluorescence
analyses were performed with infected cells grown on one coverslip within each well used for RNA
preparation. Infected cells were visualized (red) using mouse monoclonal antibodies against EBOV
(C) or MARV (D) nucleoproteins and fluorescently tagged secondary antibodies. DAPI staining
was used to visualize cell nuclei (blue). Ct – cycle threshold.
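
Panel B of Fig. 5.14 reports cycle threshold (Ct) values. As a rough reading aid, and assuming an ideal amplification efficiency (a doubling of product per cycle, which is an assumption and not a value reported in this study), a difference of ΔCt cycles corresponds to an approximately 2^ΔCt-fold difference in starting template. The Ct values in the sketch are hypothetical and the function name is illustrative.

```python
def fold_difference(ct_a, ct_b, efficiency=2.0):
    """Template amount of sample A relative to sample B.

    Assumes a constant amplification factor per cycle (2.0 = perfect doubling).
    A lower Ct means more starting template.
    """
    return efficiency ** (ct_b - ct_a)


# Hypothetical Ct values, chosen only for illustration
ct_early, ct_late = 25.0, 20.0
print(fold_difference(ct_late, ct_early))  # 2 ** 5 = 32.0
```

In this hypothetical case, a drop of five cycles between two time points would indicate about 32 times more detectable viral RNA.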

5.2.2 Sequencing and assembly


Cells and RNA extractions

We infected 4×105 HuH7 cells [315] (a human hepatoma cell line) and 4×105 R06E-
J cells [316] (an embryonic cell line from R. aegyptiacus) with EBOV (Ebola virus
strain Zaire, Mayinga, GenBank: NC_002549) or MARV (Lake Victoria Marburg
virus, Leiden, GenBank: JN408064.1 [317]) at a multiplicity of infection (MOI) of
three (Fig. 5.14A). We used HuH7 cells because EBOV infections in humans in-
duce the majority of their histopathological features in the liver, and these cells are
highly susceptible to filovirus infections [318]. We used immortalized cells because
primary cells from bats (macrophages and dendritic cells) are not available in the
large quantities we required for our RNA-Seq analyses (9 samples from each cell
type). Our analyses consisted of computational and extensive manual investiga-
tions as shown in Fig. 5.15. At 3, 7 and 23 h p.i., cells were harvested, and total
RNA was isolated using an RNeasy Mini Kit (QIAGEN) according to the manu-
facturer’s instructions. These time points correspond to the different stages of the
viral replication cycle (Fig. 5.16). Replication and transcription take place after 3 h,
proteins are produced at 7 h, which may regulate further transcription, and a com-

5.2. Differential expression in EBOV/MARV infected human and bat cells

plete replication cycle occurs after 23 h. DNaseI digestion was performed. At each
time point, RNA was also isolated from non-infected (Mock) control cells. Qual-
ity controls were performed to ensure proper infection rates and viral propagation.
Real-time PCR was used to detect filovirus RNA (polymerase genes) [314] and to
demonstrate the amplification of viral RNA over the time course of the infections
(Fig. 5.14B). Ct-values are inversely related to the amount of RNA detected, and
these values for the MARV-infected HuH7 cells at 3 and 7 h p.i. were lower than the
values for the MARV-infected R06E-J cells. The Ct-values for the EBOV-infected
cells showed no clear difference. An immunofluorescence analysis (IFA) of the cells
was performed using mouse monoclonal antibodies directed against nucleoproteins
of EBOV (B6C5, 1:20) and MARV (59-9-10, 1:100). An anti-mouse secondary anti-
body coupled with Alexa 594 (1:500) was used to detect these viral nucleoproteins,
and DAPI (4′,6-diamidino-2-phenylindole) staining was used to visualize cell nuclei
(1 mg/ml, 1:2000). IFAs of the MARV and EBOV nucleoproteins revealed that an
MOI of 3 was sufficient to initially infect a high percentage of cells (90 % of human
and bat cells infected with EBOV, 70 % of bat cells and 99 % of human cells infected
with MARV, Fig. 5.14C/D). The quantity and quality of the RNA were assessed
using a NanoDrop® spectrophotometer and an Agilent Bioanalyzer. Nine samples
at different time points were generated from human (HuH7) and bat (R06E-J) cells:

• HuH7-Mock-3h | -7h | -23h • R06E-J-Mock-3h | -7h | -23h

• HuH7-EBOV-3h | -7h | -23h • R06E-J-EBOV-3h | -7h | -23h

• HuH7-MARV-3h | -7h | -23h • R06E-J-MARV-3h | -7h | -23h

Sample preparation and sequencing

The total RNA of the 18 samples was shipped to LGC Genomics for the construc-
tion of cDNA libraries. Ribo-Zero was used for rRNA depletion, and the Illumina
TruSeq kit was used for library construction. Illumina sequencing was performed in
a 2 × 100 nt paired-end mode on a HiSeq 2000 system. R06E-J cells were stimulated
with interferons, PolyIC or thapsigargin to mimic the induction of the interferon
system or a stress response by the endoplasmic reticulum (ER) of the cells. Prior to
stimulation, R06E-J cells were examined for interferon competence via a vesicular
stomatitis virus (VSV) bioassay. The cells secreted cytokines after PolyIC transfec-
tion, and those cytokines partially protected R06E-J cells from VSV infection (data
not shown). RNA was isolated from these cells, pooled with the 9 previously men-
tioned R. aegyptiacus cell samples and shipped to GATC Biotech for normalization
and sequencing on an Illumina MiSeq system (2 × 300 nt mode). This library of
longer paired-end reads was used to improve the de novo transcriptome assembly of
R. aegyptiacus. All reads were preprocessed based on their Phred quality score. At
the 3’ end, bases with a quality score <20, a 5’-bias and poly-A tails were removed
with PRINSEQ [44] (v0.20.3). Quality was assessed and controlled before and after
processing with FastQC [43] (v0.10.1).
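
The 3' trimming rule described above can be illustrated with a minimal Python sketch. It is not the PRINSEQ implementation; the function names, the Phred+33 encoding and the minimum poly-A length of five bases are assumptions made only for this illustration.

```python
def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality]


def trim_three_prime(seq, qual, min_q=20, offset=33):
    """Remove bases from the 3' end while their Phred score is below min_q."""
    scores = phred_scores(qual, offset)
    end = len(seq)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], qual[:end]


def trim_poly_a(seq, qual, min_len=5):
    """Strip a trailing run of A's if it is at least min_len bases long.

    min_len is a hypothetical parameter, not a value taken from the text.
    """
    stripped = seq.rstrip("A")
    if len(seq) - len(stripped) >= min_len:
        return stripped, qual[:len(stripped)]
    return seq, qual
```

A read is first trimmed by quality and then checked for a poly-A tail; reads that become empty would be discarded in a real workflow.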

Chapter 5. Differential Gene Expression

Genome and annotation data


The human genome GRCh37/hg19 was downloaded from the UCSC [319] FTP server¹¹.
The annotation data were obtained from the NCBI (GRCh37 patch release 5) and
Ensembl (GRCh37 release 75). The genomic sequence of Pteropus vampyrus (Pva,
GCA_000151845.1), the closest relative of R. aegyptiacus (both Megachiroptera,
see Fig. 6.5) with well-established annotation files, was downloaded
from the UCSC site¹² and used for the homology search. The genome sequence
of R. aegyptiacus was published in early 2016 by the Boston University School of
Medicine. We used all scaffolds and the corresponding annotation data downloaded
from the NCBI database¹³ for mapping and differential gene expression analysis.
The genomic sequence and annotation data for the Zaire Ebola virus (KM034562.1)
were extracted from the UCSC Ebola Genome Portal¹⁴, which is based on the
2014 West African outbreak [320]. Genome and annotation data for the Lake Victoria
Marburg virus Leiden (JN408064.1) were obtained from the NCBI GenBank
database.

De novo transcriptome assemblies


The nine HiSeq libraries each for H. sapiens and R. aegyptiacus underwent quality
control assessments and were used for transcriptome assembly (see Electronic
Supplement¹⁵, Tab. ES1A). Long reads of the pooled MiSeq libraries were included in
the assembly process for R. aegyptiacus. For the bat HiSeq libraries, 372,082,040
paired-end reads were assembled de novo with Velvet [217] (v1.2.10), followed by
Oases [60] (v0.2.08), the ABySS/Trans-ABySS [52, 61] pipeline (v1.5.1/v1.4.8),
SOAPdenovo-Trans [63] (v1.0.3) and Trinity [223] (v20131110) using default
parameters and, where possible, multiple k-mer values (25/35/45/55/65/75).
R. aegyptiacus-derived MiSeq paired-end reads (38,028,488) were preprocessed and
assembled using Mira [64] (v4.0.0). The resulting contigs from each assembly tool were
merged and clustered by sequence similarity using CD-HIT-EST [146]
(-c 0.95, v4.6) to improve the quality of the final assembly. The final R. aegyptiacus
assembly contained 977,787 contigs (human: 986,920 contigs), which is
similar to the results of Lee et al. [321]. Of these, 277,595 contigs had a length
greater than 1,000 bp. The bat assembly had a maximum contig length of 36,073 bp
with an N50 of 3,923 bp. We used QUAST [322] (v2.3) to calculate several statistics for
the independent and merged transcriptome assemblies (Tab. ES1B).
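
For readers unfamiliar with the N50 statistic reported above, the following Python sketch shows one common way to compute it from a list of contig lengths. It illustrates the definition only and is not how QUAST is implemented; the function names are chosen for this example.

```python
def n50(lengths):
    """Return the N50: the length L such that contigs of length >= L
    together cover at least half of the total assembly length."""
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


def count_longer_than(lengths, threshold=1000):
    """Count contigs strictly longer than the given threshold (in bp)."""
    return sum(1 for length in lengths if length > threshold)
```

Applied to the contig lengths of an assembly, the two functions would yield values comparable to the N50 and the number of contigs above 1,000 bp quoted in the text.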

A comparison between the genome and de novo transcriptome assemblies of humans and bats
To assess the quality of our de novo transcriptome assemblies, we used various read
count thresholds over all mapped HuH7 and R06E-J samples (Tab. A.3) to extract
transcript subsets from the H. sapiens and R. aegyptiacus genomes, respectively.
¹¹ ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/
¹² ftp://hgdownload.cse.ucsc.edu/goldenPath/pteVam1/
¹³ ftp://ftp.ncbi.nlm.nih.gov/genomes/Rousettus_aegyptiacus/
¹⁴ ftp://hgdownload.cse.ucsc.edu/goldenPath/eboVir3/
¹⁵ http://www.rna.uni-jena.de/supplements/filovirus_human_bat/

[Figure 5.15, graphic not reproduced. Panels of the pipeline diagram: (1) Data acquisition, (2) De novo transcriptome assembly, (3) Mapping, (4) Differential gene expression analysis, (5) Homology search in bat, (6) Manual inspection.]

Figure 5.15: Methods pipeline. (1) Data acquisition: Total RNA from HuH7 and R06E-J cell lines 3, 7 and 23 h p.i. was depleted of ribosomal RNA
and sequenced. We controlled the quality and trimmed the data with PRINSEQ and FastQC. (2) For bat RNA, we assembled a de novo transcriptome
by adding pooled MiSeq to HiSeq data using various assembly tools and parameter settings. (3) Mapping was performed for Mock-, EBOV-, and
MARV-treated cells onto human/bat genomes and the bat transcriptome with Segemehl and TopHat. (4) Differential gene expression analysis was
performed by counting uniquely mapped reads and applying a DESeq analysis. The results were used for clustering and scatter/group plot analyses. (5)
Homology searches in bats were performed for all significantly differentially expressed genes from (4) and for the genes that were presumed to be involved
in the response to infection based on the literature and an enriched pathway analysis. The R. aegyptiacus genome and coding sequences from P. vampyrus
were used to validate and detect homologous sequences in the bat transcriptome. Detected homologs were used for the differential gene-expression analysis.
We also investigated the quality of the transcriptome assembly by comparing the human and R. aegyptiacus genomes with the corresponding assembly. (6)
During the manual inspection, we identified the synonyms of gene names and noted their existence in the relevant pathways. Each candidate gene was
manually investigated in the IGV and UCSC browsers for the human and bat samples from all time points. We report the conservation of genes according to
the 100 Species Vertebrate Multiz Alignment to chimp, mouse, dog, elephant and chicken sequences. We searched for nucleotide modifications (differential
SNPs, posttranscriptional modifications), intronic transcripts and regulators, alternative splicing and isoforms, and upstream and downstream transcript
characteristics.

We denoted these filtered subsets as expressed and blasted (E-value < 10⁻¹⁰) them
against the de novo transcriptome assemblies of the human or bat cells. We defined
a transcript (derived from the genomic sequence) as valid, and therefore correctly
assembled, if we obtained a minimum of one blast hit with an alignment length
>90 % of the query. For the human transcriptome assembly, we found between
93.0 % and 98.1 % of the expressed transcripts, and for R. aegyptiacus 81.3–94.0 %.
Therefore, the transcriptome assemblies were of sufficient quality. The results for
different transcript subsets are shown in Tab. A.3. Most of the missing transcripts
can be explained by a low read coverage in comparison to the length of the transcript
or a non-uniform distribution of reads along the transcript. These transcripts may be
assembled as partial contigs (alignment length ≤ 90 %). The higher number of valid
transcripts derived from the human genome can be explained by its better annotation
and assembly status compared to that of the relatively new R. aegyptiacus genome
at the scaffold level.
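
The validity criterion described above can be sketched in Python as follows. The sketch assumes BLAST tabular output with the default twelve columns (query ID in column 1, alignment length in column 4, E-value in column 11) and a separately supplied dictionary of query lengths; these inputs and the function names are assumptions for illustration, not a description of the scripts actually used.

```python
def valid_transcripts(blast_lines, query_lengths,
                      max_evalue=1e-10, min_fraction=0.9):
    """Return the set of query transcripts with at least one hit whose
    alignment length exceeds min_fraction of the query length."""
    valid = set()
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip empty lines and comment lines
        fields = line.rstrip("\n").split("\t")
        query = fields[0]
        alignment_length = int(fields[3])
        evalue = float(fields[10])
        if (evalue < max_evalue
                and alignment_length > min_fraction * query_lengths[query]):
            valid.add(query)
    return valid


def recovered_fraction(blast_lines, query_lengths):
    """Fraction of expressed transcripts counted as correctly assembled."""
    return len(valid_transcripts(blast_lines, query_lengths)) / len(query_lengths)
```

Transcripts that only produce shorter hits would correspond to the partial contigs mentioned in the text.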

5.2.3 Differential gene expression


Genome and transcriptome mapping
RNA-Seq data for the HuH7 and R06E-J samples were mapped to the concatenated
virus-host genome file (for each combination of the two viruses and three
genomes/transcriptomes) in the following two ways: (1) using TopHat [73] (v2.0.11) with the
default parameters and (2) using Segemehl [75] (v0.1.9) with the split read option
-S. The SAM files were indexed and sorted using SAMtools [76]
(v0.1.19). The ViennaNGS [323] toolbox (v0.10) was used for processing and visual-
ization of the mapped RNA-Seq data. Uniquely mapped reads aligned by TopHat
were used for all statistical calculations and differential gene-expression analyses.
The Segemehl mappings were used to support the detailed analyses of special
splice variants because of the tool’s capability for mapping multi-split reads. The
read mapping rates and statistics can be found in Tab. A.1. Although the read counts
of some genes (e.g., NPC1) differed between the R. aegyptiacus genome and the
transcriptome assembly, the calculated fold changes used in the downstream analyses
differed only slightly across all genes.
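
As an illustration of how uniquely mapped reads might be selected from a SAM file, the sketch below keeps alignments that carry the optional tag NH:i:1 (number of reported alignments equal to one). That this tag was the criterion used in the analysis is an assumption; the text only states that uniquely mapped TopHat reads were used. The function names are hypothetical.

```python
def is_uniquely_mapped(sam_line):
    """True if the alignment is mapped and carries the tag NH:i:1."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    if flag & 0x4:  # FLAG bit 0x4 marks an unmapped read
        return False
    return "NH:i:1" in fields[11:]  # optional tags start at column 12


def count_unique(sam_lines):
    """Count uniquely mapped alignments, skipping '@' header lines."""
    return sum(
        1 for line in sam_lines
        if line.strip() and not line.startswith("@") and is_uniquely_mapped(line)
    )
```

Because the tag is matched exactly, values such as NH:i:10 are not mistaken for unique alignments.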

De novo transcript annotation


We used Cufflinks [46] (v2.2.1) in de novo mode to predict transcript loci based
on the TopHat mappings of all HuH7 samples. Cuffmerge was used to combine
the Cufflinks-assembled transcripts into a final transcriptome assembly of 18,391
locations for the human genome.

Differential gene expression analysis


For a differential gene expression analysis of hg19, we used reads of Mock, EBOV
and MARV samples uniquely mapped by TopHat together with the gene annota-
tion data from the NCBI. Raw read counting was performed using bedtools [324]
(v2.21.0) multicov with the split option. The resulting read counts were directly
passed to the R/Bioconductor (v2.14) package DESeq [81]. Due to the lack of
replicates, we performed pairwise comparisons between all of the different infection
conditions and time points with a false discovery rate of 0.1. To compare the charac-
teristics between HuH7 and R06E-J cells at 3, 7 and 23 h p.i., we used the EBOV and
MARV samples at the identical time points as replicates in the DESeq analysis (padj
≤ 0.1). In addition to using gene annotations from the NCBI, we processed the de
novo gene loci obtained from Cufflinks in an identical manner. Furthermore, we
used Cuffquant and Cuffnorm to calculate FPKM values for each locus from the
Cuffmerge -G results. In addition to the normalized read-count-based method,
we calculated the maximum read peak for each gene and sample using bedtools
genomecov (-d -split). We used the notation “=” when the difference between
the two samples was <15 %, “↑/↓” when there was up to a two-fold difference, and
“n↑” when the difference was greater than two-fold and up to n-fold.
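
The notation can be made concrete with a small Python sketch. It assumes that the difference is measured as the ratio of the larger to the smaller value and that n is the fold difference rounded up to the next integer; both are assumptions for illustration, since the text does not specify these details.

```python
import math


def peak_notation(reference, sample):
    """Return "=", an arrow, or "n" plus an arrow for two positive values."""
    if reference <= 0 or sample <= 0:
        raise ValueError("both values must be positive")
    fold = max(reference, sample) / min(reference, sample)
    arrow = "↑" if sample > reference else "↓"
    if fold - 1 < 0.15:   # difference below 15 %
        return "="
    if fold <= 2:         # up to a two-fold difference
        return arrow
    return f"{math.ceil(fold)}{arrow}"  # up to n-fold, n rounded up
```

For example, a maximum read peak of 450 in a sample against 100 in the reference would be written as 5↑ under these assumptions.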

Clustering of differentially expressed genes

Fold changes in gene expression were calculated by adding a pseudocount of 1 to
the raw read counts of each gene. Differentially expressed genes in infected cells,
compared to their expression in their Mock control cells, were determined using the
DESeq package as described above, and only genes with padj < 0.1 were considered
differentially expressed for a particular contrast (infection versus Mock). For the
clustering of fold changes in gene expression, the log2-fold change (FC) for each
gene g was calculated as $\log_2(g_{\mathrm{infection}}) - \log_2(g_{\mathrm{mock}})$. The log2-fold changes in the
expression of genes having a log2-fold change greater than five in at least one of the
contrasts were visualized via hierarchical clustering. Euclidean distances and the
Ward method for agglomeration were used (Fig. ES2A).
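
The fold-change calculation with a pseudocount can be sketched as follows. The data layout (a dictionary of read-count pairs per contrast) and the use of the absolute log2-fold change for the filter are assumptions for this illustration; the clustering step itself is not shown.

```python
import math


def log2_fold_change(infected, mock, pseudocount=1):
    """log2(infected + pseudocount) - log2(mock + pseudocount)."""
    return math.log2(infected + pseudocount) - math.log2(mock + pseudocount)


def strongly_regulated(counts, threshold=5.0):
    """Keep genes whose |log2FC| exceeds the threshold in at least one contrast.

    counts maps gene -> {contrast: (infected_count, mock_count)}.
    """
    selected = {}
    for gene, contrasts in counts.items():
        fcs = {name: log2_fold_change(inf, mock)
               for name, (inf, mock) in contrasts.items()}
        if any(abs(fc) > threshold for fc in fcs.values()):
            selected[gene] = fcs
    return selected
```

The pseudocount keeps the logarithm defined for genes without reads in one of the two samples.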

Identification of co-regulated genes

To compare the differential gene expression for the Mock/EBOV and Mock/MARV
treatments in HuH7 and R06E-J cells, log2-fold changes, as computed by DESeq,
were visualized using scatter plots in R (x-value: FC of Mock/EBOV; y-value:
FC of Mock/MARV). The scatterplots were overlaid with contour plots for a two-
dimensional kernel estimate (kde2d; MASS package) using the default parameters.
Outliers are labeled with their respective gene names in the Electronic Supplement
(Fig. ES4B).

Search for human-gene homologs within bat transcripts

To compare human and bat genes, we defined homologous loci between human genes
and the transcripts in the R. aegyptiacus assembly. A direct comparison between
human genes and R. aegyptiacus transcripts left several orthologous pairs
unidentified. To improve ortholog detection, we used the annotated genes of P. vampyrus
from Ensembl as an intermediate, exploiting the close relationship between the two
bat species (Fig. 6.5). Sequence comparisons were performed using BLASTn from
BLAST+ (v2.2.27+), and hits with an E-value < 10⁻⁵⁰ were considered orthologs. We
defined homologous genes between the human and recently published R. aegyptiacus
genomes.
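
The idea of using P. vampyrus genes as an intermediate can be sketched in Python. The sketch assumes that BLAST hits are available as (query, subject, E-value) tuples and that the best hit per query is taken; the best-hit rule and all names are assumptions for illustration only.

```python
def best_hits(hits, max_evalue=1e-50):
    """Map each query to its subject with the smallest E-value below the cutoff."""
    best = {}
    for query, subject, evalue in hits:
        if evalue < max_evalue and (query not in best or evalue < best[query][1]):
            best[query] = (subject, evalue)
    return {query: subject for query, (subject, _) in best.items()}


def chain_orthologs(human_to_pva, pva_to_rae):
    """Link human genes to R. aegyptiacus transcripts via P. vampyrus genes."""
    return {human: pva_to_rae[pva]
            for human, pva in human_to_pva.items()
            if pva in pva_to_rae}
```

Human genes without a sufficiently strong hit in either step simply remain without an assigned bat transcript.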

Comprehensive gene analysis


We obtained a comprehensive set of candidate genes based on more than 600 sig-
nificantly differentially expressed genes identified by the DESeq analysis combined
with ∼900 genes extracted from a literature search and a review of immune system-
related pathways. We manually investigated differential gene expression, changes
in expression profiles, nucleotide changes, conservation in other species, intron-exon
structures, alternatively spliced isoforms, intronic transcripts and 5’-UTR/3’-UTR
discrepancies with the help of the UCSC and IGV [325] (v2.3.39) browsers. Each
gene, its synonyms, functional information, screenshots of genomic locations, iso-
forms, fold changes and maximum read counts are listed in the Electronic Supple-
ment (nine samples per species).

Pathway enrichment
We examined which of the differentially expressed genes were over-represented in
KEGG database [326] pathways. Gene set enrichment analyses were performed us-
ing a hypergeometric test. FDR-corrected p-values were significant at 0.05. We set
the threshold to a value of p < 0.1 using a hypergeometric test and FDR correc-
tions [327]. The evaluation was performed with the R-package GAGE [86] to obtain
KEGG pathway information and with pathview [87] to allow for visualization.
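
A hypergeometric over-representation test with FDR correction can be sketched in plain Python. The Benjamini-Hochberg procedure is used here as an assumption about the FDR correction, and the function names are chosen for this example; the sketch does not reproduce the R packages named above.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """P(X >= k): probability of at least k pathway genes among n
    differentially expressed genes, with K pathway genes among N in total."""
    upper = min(n, K)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / comb(N, n)


def benjamini_hochberg(pvalues):
    """Return FDR-adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # from the largest to the smallest p-value
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

Each pathway yields one p-value, and the adjusted values are then compared with the chosen significance threshold.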

Motif activity response analysis


A motif activity response analysis with MARA was performed as previously de-
scribed [328] using the reads uniquely mapped by TopHat as discussed above. We
performed MARA on the EBOV and MARV samples separately to obtain indepen-
dent motif rankings and target predictions for each virus. We also performed one
analysis that considered all samples at once to compare the motif activity changes
observed after EBOV and MARV infections. The activities obtained from this analysis
were used to create the activity change comparison plots. To relate the activity
of a motif $m$ in virus-infected cells ($A_{m,\mathrm{Virus}}$) to its activity in the corresponding
Mock control ($A_{m,\mathrm{Mock}}$), the activity change ($A_{m,\Delta}$) was calculated as follows:

$$A_{m,\Delta} = A_{m,\mathrm{Virus}} - A_{m,\mathrm{Mock}}$$

with the error calculated as:

$$E_{m,\Delta} = \sqrt{E_{m,\mathrm{Virus}}^{2} + E_{m,\mathrm{Mock}}^{2}}$$
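
The two formulas above translate directly into a short Python function. The function and argument names are chosen for this illustration and are not part of MARA.

```python
import math


def activity_change(a_virus, e_virus, a_mock, e_mock):
    """Return the activity change of a motif and its propagated error.

    The change is the difference of the two activities; the error is the
    square root of the sum of the squared individual errors.
    """
    delta = a_virus - a_mock
    error = math.sqrt(e_virus ** 2 + e_mock ** 2)
    return delta, error
```

For example, activities of 2.0 and 0.5 with errors of 3.0 and 4.0 give a change of 1.5 with an error of 5.0.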

qRT-PCR
We repeated the infection of HuH7 and R06E-J cells and isolated RNA for qRT-PCR
analyses. An IFA of these cells (Fig. ES7E) revealed that the infection rates were
lower than in the previous experiment used for RNA-Seq. Only ∼40 % (∼70 %)
of the R06E-J cells were infected with EBOV (MARV) in comparison to ∼90 %
(∼70 %) in the first experiment (Fig. 5.14C). The infection rates were only slightly
lower for HuH7 cells (Fig. ES7E). RNA was isolated and reverse transcribed with
random hexamers. For qRT-PCR analyses of NPC1 and TLR3 gene expression,
degenerate primers that amplified both the human and the bat mRNA sequences
were used. HAVCR1 gene expression was measured using human- and bat-specific
primers. Expression values were normalized to 18S rRNA levels. All primer se-
quences are listed in File ES7A. qPCRs were performed on an Applied Biosystems
7500 Real-Time PCR System using SYBR Green chemistry (iTaq mix, BioRad)
according to the manufacturer’s instructions. Mean values from triplicate analy-
ses were calculated and quantified with the system’s built-in software according to
standard curves constructed for each amplicon from serial cDNA dilutions. Relative
quantification values, melting curves and agarose gel pictures for the three genes are
presented in the Electronic Supplement, Sec. ES7.
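
Quantification against a standard curve, followed by normalization to 18S rRNA, can be sketched as below. The instrument's built-in software was used in the study; this sketch only illustrates the principle with an ordinary least-squares fit of Ct against log10 of the relative input, and all names are hypothetical.

```python
import math


def fit_standard_curve(quantities, cts):
    """Fit Ct = slope * log10(quantity) + intercept by least squares."""
    xs = [math.log10(q) for q in quantities]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(cts) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, cts))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def quantity_from_ct(ct, slope, intercept):
    """Invert the standard curve to obtain a relative quantity."""
    return 10 ** ((ct - intercept) / slope)


def normalized_expression(target_cts, reference_cts, target_curve, reference_curve):
    """Mean triplicate Ct -> quantity for target and reference, then the ratio."""
    target_q = quantity_from_ct(sum(target_cts) / len(target_cts), *target_curve)
    reference_q = quantity_from_ct(sum(reference_cts) / len(reference_cts),
                                   *reference_curve)
    return target_q / reference_q
```

Here the reference curve would be the one constructed for the 18S rRNA amplicon, and the target curve the one for the gene of interest.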

5.2.4 Results and discussion


Differences in early viral replication velocities
[Figure 5.16, graphic not reproduced. Timeline of the filovirus replication cycle (attachment, endocytosis, primary transcription, protein synthesis, formation of nucleocapsids, first replication cycle finished, cell death) with the three sampling points, and bar charts of the fold change in viral RNA between 3 and 7 h and between 7 and 23 h p.i.]

Figure 5.16: Human HuH7 cells support an earlier onset of filoviral RNA synthesis than bat
R06E-J cells. The first viral replication cycle is finished after 15–18 h, when virions are released
from the host cells. Between 3 and 7 h p.i. of EBOV-infected R06E-J cells, we observed an ∼4.8×
increase in the number of reads that mapped uniquely to the EBOV genome. This indicates that
EBOV genes are rapidly replicated and transcribed in the bat cells in the first 4 h p.i. (see
Tab. A.2 for normalized read counts). We observed a further 41× increase in reads between 7 and
23 h p.i. in R06E-J cells. So, this RNA synthesis rate slows down within this next 16 h (compared
to 4.8⁴ ≈ 530×). In comparison, unique reads mapping to the EBOV genome in HuH7 cells increased
15.6× between 3 and 7 h p.i. and a further 15.5× in the following 16 h. This result indicates a
significant increase in the RNA synthesis rate of viral RNAs in the first few hours and a marked
decrease in the following hours.

By comparing viral RNA levels in HuH7 and R06E-J cells, we determined that EBOV and
MARV replicate more rapidly in HuH7 than in R06E-J cells between 3 and 7 h p.i. RNA
synthesis slowed down in both species in the subsequent 16 h (Fig. 5.16), and transcript
levels were nearly identical at 23 h p.i. (Tab. A.2). The EBOV and MARV viral proteins were
more abundant in HuH7 than in R06E-J cells at 7 h p.i. and were present at similar levels
at 23 h p.i. (Fig. 5.14C and D) as determined using IFAs. Early increases in the levels of
viral RNA and proteins in HuH7 cells may be attributed to a faster rate of early replication
of filoviruses in HuH7 cells compared to the rate in R06E-J cells. The differences
in replication velocities may contribute to the different susceptibilities of HuH7
cells and R06E-J cells to filovirus infection. However, it does not explain which
differences in the cellular regulatory response mechanisms are responsible for the
higher RNA synthesis rate in HuH7 cells. We observed few regulatory effects at 3
and 7 h p.i. However, more than 1,670 genes were up- or down-regulated at 23 h
p.i. in EBOV-infected HuH7 cells, whereas only 74 genes showed altered expression
patterns in MARV-infected samples (Fig. 5.17A). Chemokine ligands (CXCL1 and
CXCL5), transcription factors (e.g., FOSL1, FOS, FOSB, ATF3), and genes with
diverse functions (e.g., PPP1R15A, dual-specificity phosphatases (DUSPs)) were
among the 55 genes that were most strongly regulated after filovirus infections in
HuH7 cells (log2 FC > 5, Fig. ES2A). This finding is supported by previous reports
[300, 311]. Each gene can be viewed along with all relevant details at our
web tool¹⁶.

Differential expression of 2,500 genes


We aimed to answer three questions based on gene transcription levels: (1) What
happens during filovirus infections in the host? (2) What are the differences in host-
cell responses between EBOV and MARV infections? (3) What are the differences in
how HuH7 and R06E-J cells respond after EBOV and MARV infections? Over 2,500
differentially expressed genes were found to be involved in the response to filovirus
infections (Tab. A.4–A.10, details in Electronic Supplement). The most significantly
differentially expressed genes during the course of both EBOV and MARV infections
between HuH7 and R06E-J cells include genes for transcription factors that regulate
the expression of kinases and their antagonists and genes involved in ubiquitination
processes (Fig. 5.17B). Although we confirmed the previously reported dysregulation
of genes involved in coagulation [300, 301, 311], these were not among the most
differentially expressed genes.
With the exception of a few genes, we confirmed previously observed data based
on microarray analyses of EBOV-infected HuH7 cells [301], human macrophages
[311], rhesus monkey cells [310], and mouse cells [300].

Regulation of transcription factors


Many transcription factors were found to be among the most differentially expressed
genes. Therefore, we assume that they play an important role in the changes ob-
served after filoviral infection. To identify the regulatory factors that are responsible
for the observed transcriptional changes after filoviral infection in HuH7 cells, we
performed motif activity response analyses with MARA [328]. MARA infers the regula-
tory impact (also called activity) of a regulatory motif from the changes in expression
of the predicted downstream genes (targets) of that motif. For a curated collection
of ∼190 mammalian transcription factor binding motifs and ∼90 miRNA seed fam-
ilies, we found 76 and 38 motifs with a change in activity (z-value >1.5) in response
to EBOV and MARV infections, respectively (see Electronic Supplement, Sec. ES5).
Genes that are regulated by such motifs are involved in NFκB signaling or cell-cycle
¹⁶ www.rna.uni-jena.de/supplements/filovirus_human_bat/igo.php


[Figure 5.17, graphic not reproduced. (A) Bar charts of the numbers of up-regulated and down-regulated genes at 3, 7 and 23 h after Ebola and Marburg infection. (B) Heat map; color scale: Z-score of log2(FC). (D) Chart of molecular functions: binding, protein binding, chromatin binding, nucleic acid binding, transporter activity, catalytic activity, structural molecule activity, receptor activity, signal transducer activity.]

(C)
Gene       Samples     FC      Supplement         Type
ATF3       EV 3-23h     5.89   A6, C7, J8         Transcription factor
FOS        EV 7-23h     6.09   A4, E10, I7        Transcription factor
FOSB       EV 7-23h     6.89   A2, E3, I3         Transcription factor
PPP1R15A   EV 3-23h     5.37   B2, E9, I10, J1    Protein phosphatase
DUSP1      EV 7-23h     4.57   B7, F7             Protein phosphatase
DUSP8      EV 7-23h     5.01   B9, F4, J7         Protein phosphatase
NFKB2      EV 3-23h     4.08   B10                Transcription factor
CHAC1      MV 3-23h     3.70   C2                 Cation transport
TRIB3      MV 3-23h     4.76   C1                 Protein kinase
SQSTM1     EV 7-23h     2.92   C6                 Ubiquitination
CDH6       EV 3-23h    -2.97   C5                 Cadherin

Figure 5.17: Significantly regulated genes in cells infected with EBOV or MARV. (A) Number of
strongly regulated human genes after infection with EBOV or MARV. There were only a few genes
that were significantly regulated (padj <0.1) at 3 and 7 h p.i. in both EBOV- and MARV-infected
cells compared with their expression in Mock-treated cells. At 23 h p.i., the number of regu-
lated genes was higher (1,678) in EBOV-infected cells than in cells infected with MARV. Adding
these ∼1,600 strongly regulated genes to the findings from analyses comparing the different time
points or viruses resulted in approximately 2,500 genes being identified as significantly differentially
transcribed. (B) Heat map of row-scaled log2-fold changes in expression in infected HuH7 and
R06E-J samples against the corresponding Mock samples (e.g., column three shows the fold change
between HuH7 Mock-treated cells and HuH7 EBOV-treated cells at 23 h p.i.). The input matrix is
scaled within the rows to visualize changes in expression at the gene level. Fold changes are based
on unique genome read counts of H. sapiens and R. aegyptiacus. Genes without a clear homologous
sequence in the R. aegyptiacus genome or transcriptome assembly are marked with a star. We
identified homologous locations (LOC107508087, LOC107515336, LOC107498547) for three genes
that were not directly annotated in the R. aegyptiacus genome (EP300, RPS17L, MX1). These
locations were identified using our de novo transcriptome assembly (red boxes). We indicated the
molecular function of each gene based on the color scheme presented in (D). (C) Highly regulated
genes in EBOV- and MARV-infected HuH7 and R06E-J cells. FC – log2-fold change based on
DESeq normalized read counts. See Appendix (Tab. A.5 and A.6) and corresponding entries for
detailed information. (D) The PANTHER database (v11.0) [329] was used to assign molecular
functions to each of the 64 genes in (B). We further subdivided the dominant group of genes
that we identified to have a general binding function. During filovirus infections, the most promi-
nent regulatory effects were observed for genes encoding transcription factors, those regulating the
NFκB and MAPK pathways, their DUSP inhibitors and growth factors (Fig. ES2A and Tab. A.5
and A.6, full tables in the Electronic Supplement). In addition, changes were also observed for
genes that regulate protein translation (RPS17, PPP1R15A), ubiquitination (TRAF6, SQSTM1),
autophagocytosis (SQSTM1) and cation transport (CHAC1, ATP2B4). We also observed the
strong up-regulation of genes that are involved in energy transfer (e.g., RASGEF1B). Details can
be found in Tab. A.5–A.8.


regulation (Fig. 5.18). Interestingly, we found that the FOS/JUN motif has
significantly increased activity after both EBOV and MARV infections. This result
indicates that genes having FOS/JUN motif binding sites in their promoter region
are primarily up-regulated (Fig. 5.18). Consistent with this, transcription factors
that are associated with this motif (e.g., FOSB) were up-regulated in infected cells
(Fig. 5.17B). The transcription factor AP1, a homo- or heterodimer of differentially
expressed FOS and JUN, plays important roles in different viral infections [330, 331].
Other motifs, such as the KLF12- and the NRF1-associated motifs (Fig. 5.18), are
more specific to EBOV and MARV, respectively, reflecting the differences in the
impacts of these two viruses on the transcriptional landscape of infected cells. In
summary, we found various motifs, including the antiviral signaling-associated mo-
tif for NFκB, to have significant changes in activity (Fig. 5.18). For each motif, we
provide associated regulators and target genes (Sec. ES5).

The antiviral mRNA response is mostly unchanged


Unexpectedly, apart from the effect on transcription factors, the majority of genes
and pathways that are relevant for viral infections did not demonstrate significant
regulatory changes in response to filoviral infection (Sec. ES6). The majority of
genes involved in innate immune responses (from Kuri et al. [332]) were either not
expressed in the human cell cultures examined here or were only slightly differ-
entially regulated during infection (Tab. A.11). Consistent with this observation,
some of these genes (e.g., IFITM1 /2, OAS1 -3 ) are not included in our bat tran-
scriptome assembly, suggesting they were not transcribed. However, we did observe
significantly different RNA levels for several bat homologs of innate immune response
genes in filovirus-infected R06E-J cells in comparison with their levels in HuH7 cells.
For example, DDX58 and ADAR demonstrated lower, and NMI higher, expression
levels in EBOV-infected R06E-J cells (Tab. A.11). The differences in the mRNA
concentrations of these transcripts in infected HuH7 and R06E-J cells may play a
role in the defense mechanisms that are active during filovirus infections. Genes
encoding proteins initiating pathways known to respond to viral infections, such as
DDX58 (Fig. ES6.8), NFκB (Fig. ES6.10 and Fig. ES6.17), and MAPK pathways
(Fig. ES6.2–6.7), were not induced during filovirus infections in our study. Only
those genes of the examined pathways that encode proteins that act as key players
(see Fig. 5.17B and C) were up-regulated in EBOV-infected HuH7 cells relative to
their expression in R06E-J cells.
The Ebola viral protein VP35 inhibits DDX58 signaling, which determines the
outcome of infection in human cells [333–335]. We inspected the DDX58 pathway,
which induces ISRE3, AP1 (FOS and JUN ), NFκB and interferon β activation in
cells [336, 337] (Fig. ES6.8). Although we observed differences between HuH7 and
R06E-J cells in the levels of DDX58 and ISYNA1 mRNA, all but one gene (IKK)
related to the DDX58 pathway were expressed, but they were not differentially
affected. We noted an up-regulation of the FOS (38 X) and JUN (9 X) genes in
EBOV-infected HuH7 cells 23 h p.i. (Fig. 5.17B). We observed no significant change
in the genes specific to the DDX58 pathway at the transcriptomic level. We therefore
assume that regulation of the DDX58 pathway at the transcriptomic level is not a
driving factor leading to the frequent fatalities observed in humans, which are not
observed in bats, subsequent to filoviral infection.

Ebola Top Motifs       Z-Score   Marburg Top Motifs            Z-Score
E2F1..5.p2             5.24      NRF1.p2                       4.73
NFKB1_REL_RELA.p2      4.36      E2F1..5.p2                    4.53
KLF12.p2               4.20      YY1.p2                        3.55
hsa-miR-129-5p         3.37      NFY{A,B,C}.p2                 3.07
PITX1..3.p2            3.26      TFDP1.p2                      2.90
TFAP2{A,C}.p2          3.23      SP1.p2                        2.70
GATA1..3.p2            3.13      ELK1,4_GABP{A,B1}.p3          2.64
SRF.p3                 3.12      HNF4A_NR2F1,2.p2              2.61
POU5F1.p2              2.98      hsa-miR-129-5p                2.55
TFAP4.p2               2.86      FOS_FOS{B,L1}_JUN{B,D}.p2     2.44

[Figure 5.18, plots: inferred activity change (y-axis) against time p.i. [hours] (x-axis; 3, 7
and 23 h), each shown with the sequence logo of the motif, for E2F1..5.p2,
FOS_FOS{B,L1}_JUN{B,D}.p2, NFKB1_REL_RELA.p2, KLF12.p2, NRF1.p2 and YY1.p2.]

Figure 5.18: Motif activity response analysis. The table shows the top significant motifs after the
infection of HuH7 cells with EBOV (red) or MARV (blue) compared with the response in Mock
controls. Regulated motifs are predicted to target (1) the cell cycle (E2F1..5.p2) by down-regulating
CDC6, PCNA and MCM6; (2) NFκB signaling (NFKB1_REL_RELA.p2) by targeting CXCL
isoforms, ELF3, NFκB isoforms, FOSL2 and JUN; (3) EGR1 expression in EBOV-infected cells
(KLF12.p2, YY1.p2 and others); or (4) chromatin organization in MARV-infected cells (NRF1.p2,
YY1.p2 and others). For selected motifs, the inferred activity changes (points +/- 1 SD) after
EBOV or MARV infection relative to the corresponding Mock controls are shown for the different
time points (3, 7 or 23 h p.i.) adjacent to and below the table. Selected regulatory motifs, the
associated genes and their important targets (including their fold change between two time points)
can be viewed in Sec. ES5 and are summarized in File ES5D.
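The distinction between shared and virus-specific motifs can be illustrated with the Z-scores listed in the table of Figure 5.18. The following is only a minimal sketch: the values are transcribed from that table, while the function name `split_top_motifs` is a hypothetical helper introduced here. Note that a motif missing from one top-ten list may still change in activity, as reported above for the FOS/JUN motif.

```python
# Z-scores of the top motifs, transcribed from the table of Figure 5.18.
EBOV_TOP = {
    "E2F1..5.p2": 5.24, "NFKB1_REL_RELA.p2": 4.36, "KLF12.p2": 4.20,
    "hsa-miR-129-5p": 3.37, "PITX1..3.p2": 3.26, "TFAP2{A,C}.p2": 3.23,
    "GATA1..3.p2": 3.13, "SRF.p3": 3.12, "POU5F1.p2": 2.98, "TFAP4.p2": 2.86,
}
MARV_TOP = {
    "NRF1.p2": 4.73, "E2F1..5.p2": 4.53, "YY1.p2": 3.55,
    "NFY{A,B,C}.p2": 3.07, "TFDP1.p2": 2.90, "SP1.p2": 2.70,
    "ELK1,4_GABP{A,B1}.p3": 2.64, "HNF4A_NR2F1,2.p2": 2.61,
    "hsa-miR-129-5p": 2.55, "FOS_FOS{B,L1}_JUN{B,D}.p2": 2.44,
}


def split_top_motifs(first, second):
    """Return (shared, only_first, only_second) motif lists.

    Hypothetical helper: shared motifs are sorted by name, the
    list-specific motifs by decreasing Z-score.
    """
    shared = sorted(set(first) & set(second))
    only_first = sorted(set(first) - set(second), key=first.get, reverse=True)
    only_second = sorted(set(second) - set(first), key=second.get, reverse=True)
    return shared, only_first, only_second


if __name__ == "__main__":
    shared, only_ebov, only_marv = split_top_motifs(EBOV_TOP, MARV_TOP)
    print("In both top lists:", shared)
    print("Only in the EBOV list:", only_ebov)
    print("Only in the MARV list:", only_marv)
```

With these lists, KLF12.p2 appears only in the EBOV list and NRF1.p2 only in the MARV list, in agreement with the text.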


We investigated the relationship of the DDX58 protein to other proteins at the
mRNA level, such as TRIM25, which interacts with the DDX58 pathway [338].
TRIM-family proteins are induced by interferons and are involved in antiviral cel-
lular responses [339]. We found that several members of the TRIM-family were
up-regulated in EBOV-infected cells between 3–7 h p.i. but were down-regulated
from 7–23 h p.i. (Tab. A.12). The corresponding bat homologs were either not ex-
pressed (e.g., TRIM71 ) or were not significantly differentially expressed between
R06E-J and HuH7 cells (e.g., TRIM25 ).
In line with previous reports [340], our results show that filoviruses neither induce
nor block apoptosis (Fig. ES6.33), as genes involved in apoptosis were not significantly
regulated during EBOV or MARV infections, with the exception of BBC3, which
was up-regulated 9.6 X in EBOV-infected HuH7 cells. However, it is important to
note that both cell lines were immortalized, which may explain the minimal effects
on genes involved in apoptosis.

Host genes demonstrate a similar reaction after EBOV/MARV infection


The overall reaction of EBOV- and MARV-infected HuH7 cells at 3 h was similar to
the reaction in R06E-J cells at 7 h (Fig. 5.19). The gene coverage-dependent plot
illustrates the greater viral replication velocity in HuH7 cells when compared with
the velocity in R06E-J cells. This response may be influenced by viral attachment
and entry processes. We investigated the differences and similarities during filovirus
infection in HuH7 and R06E-J cells.


HuH7. We determined that the majority of the host genes reacted in a similar
manner in response to EBOV and MARV infections, which may explain the common
symptoms caused by these viruses in humans. CYR61 was among the most up-
regulated genes in HuH7 cells and was usually highly expressed at 3 and 23 h p.i.,
which correspond to the periods of inflammation and wound repair, respectively [341]
(Fig. ES4C). The cytokine genes IL8 and IL32 responded in both MARV- and
EBOV-infected HuH7 cells, showing a significant up-regulation (Fig. 5.17B). IL32
expression can be induced by IL8 and is involved in the apoptosis of T cells in EBOV-
infected patients [342, 343]. It is also up-regulated in response to influenza A virus
infections. The up-regulation of IL8 results in the activation of pro-inflammatory
pathways [344, 345]. We identified NRAV up-regulation 7 h p.i. This long non-
coding RNA was recently reported as a key regulator of antiviral innate immunity
that acts via the suppression of interferon-stimulated gene transcription [346]. We
propose that a cellular component exists, in addition to the filoviral inhibition of
the innate immune system by VP24 and VP35 [347–351], that results in the same
inhibition of innate immunity. At 23 h p.i. we identified several highly up- and
down-regulated genes, which we were unable to categorize (see Sec. ES4). We also
determined that ANXA3 was markedly up-regulated. Annexin A3 is an inhibitor of
phospholipase A2 and possesses anti-coagulant properties [352].
Our data indicate an initial inflammatory response (3 h p.i.) [332], followed by a
repression of antiviral defenses (7 h p.i.) with the majority of up- and down-regulated
gene expression occurring at 23 h p.i. in the HuH7 cells (Fig. 5.19).

R06E-J. R06E-J cells responded differently to filovirus infections than HuH7 cells.
At 3 h p.i., we identified down-regulations of nuclear receptors involved in cell prolif-
eration and differentiation (e.g., NR4A3 [353]), and the cell-cycle-regulating ubiqui-
tin ligase ANAPC10, which controls the progression through mitosis [354]. BLCAP,
which controls cell proliferation, apoptosis and the cell cycle [355], was also down-
regulated at 3 h p.i. HAVCR1 (previously known as TIM-1 ), which was down-
regulated at 7 h p.i., is a receptor for many viruses, including filoviruses [356] and
Dengue virus [357]. We observed that various histone genes (e.g., HIST1H2B6,
HIST1H1C ) were down-regulated at 23 h p.i. This may be an epigenetic signal that
could induce cell death. NPR3 was also down-regulated in R06E-J cells 23 h after
filovirus infection. This gene is involved in the regulation of blood volume and pres-
sure, cardiac function and some metabolic and growth processes [358]. Additional
information about the significant co-regulation of genes occurring during filovirus
infection in HuH7 and R06E-J cells can be found in Sec. ES4.
These findings may describe the major differences between human and bat cells
that occur during filovirus infection.

Differences in host cell responses after EBOV and MARV infections


Out of the 35 examined pathways (Fig. ES6.2–6.33), the JAK/STAT, PPP1R15A
and DUSP pathways demonstrated significant differential regulation during infection
(Fig. 5.20). However, these pathways did not demonstrate identical activities in both
EBOV and MARV infections, and the responses of these pathways cannot explain
the common disease symptoms [309] induced by both viruses.

[Figure 5.19: scatter plots arranged in two rows of three panels (3 h, 7 h and 23 h p.i.); x-axis:
Mock vs. EBOV expression (log2); y-axis: Mock vs. MARV expression (log2); both axes range
from -2 to 2.]

Figure 5.19: Common gene regulation patterns after filovirus infection. The scatterplots demonstrate the fold changes in expression as determined by
DESeq of coding and non-coding RNAs in MARV- and EBOV-infected cells compared with expression in Mock controls 3, 7 and 23 h after EBOV and
MARV infections. We observed similar expression patterns in HuH7 cells at 3 h p.i. and in the bat cell line 7 h p.i., suggesting that the progress of filovirus
infection is slower in R06E-J cells. The scatter plot derived from the differential expression analysis of HuH7 cells at 23 h p.i. shows the large number of
differentially expressed genes. A detailed view of the figures (including genes outside of the plotted range of fold changes) can be found in the Electronic
Supplement, Fig. ES4B. Genes demonstrating a similar expression after infection with EBOV and MARV and with an abs(log2(FC)) > 1 are marked in
red. Black line: y=x; Dotted line: regression line.
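The marking rule and the regression line described for Figure 5.19 can be sketched as follows. This is an illustration only: we assume that "similar expression" means the same direction of regulation in both infections, the function names are hypothetical, and the gene names and values in the usage example are invented rather than taken from the data.

```python
def concordant_genes(log2fc_ebov, log2fc_marv, threshold=1.0):
    """Genes with abs(log2(FC)) > threshold in both infections and the same sign.

    Assumption: 'similar expression' is read as 'same direction of regulation'.
    """
    marked = []
    for gene in sorted(set(log2fc_ebov) & set(log2fc_marv)):
        x, y = log2fc_ebov[gene], log2fc_marv[gene]
        if abs(x) > threshold and abs(y) > threshold and x * y > 0:
            marked.append(gene)
    return marked


def regression_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept) of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


if __name__ == "__main__":
    # Invented log2 fold changes (Mock vs. infected), for illustration only.
    ebov = {"gene_a": 1.5, "gene_b": -1.2, "gene_c": 0.3, "gene_d": 1.4}
    marv = {"gene_a": 1.8, "gene_b": -1.6, "gene_c": 0.2, "gene_d": -1.1}
    print(concordant_genes(ebov, marv))
    genes = sorted(ebov)
    print(regression_line([ebov[g] for g in genes], [marv[g] for g in genes]))
```

A regression slope close to 1 would indicate that the host genes respond to both viruses with similar magnitude, which is what the comparison with the line y=x in the figure conveys.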

The JAK/STAT pathway. In EBOV-infected HuH7 cells, all genes coding for
members of the JAK/STAT pathway were slightly induced between 3 and 7 h p.i.
(Fig. 5.20A). However, the JAK/STAT system in R06E-J cells demonstrated only
a minimal response to EBOV and MARV infections. The EBOV protein VP24 has
a negative impact on STAT1 signaling [359–361], and STAT1 /2 were found to be
down-regulated at 23 h compared with their expression at 7 h p.i. in the HuH7 cell
line. Downstream genes in this pathway, such as EP300 and PIM1, were highly
up-regulated at 23 h at the mRNA level in EBOV-infected HuH7 cells (Fig. 5.17B).
mRNA of the signaling receptor IFNGR2 was up-regulated, and mRNA of the
interacting JAK2 was down-regulated at 23 h p.i. in EBOV-infected HuH7 cells.
This regulation may be attributed to the activation of feedback from PIM1 via
CISH to the receptor IFNGR2 [362, 363] at the protein level.
MARV infections trigger a different reaction in HuH7 cells than do EBOV infec-
tions: the PIM1 mRNA and the receptor IFNGR2 are down-regulated (Fig. 5.20A).
Most strikingly, while we currently do not know whether VP24 also interacts with
STAT1 in bats, the complete JAK/STAT pathway is mostly unaffected at the mRNA
level during EBOV infections in R06E-J cells (Fig. ES6.22).

The PPP1R15A pathway. The expression of PPP1R15A was markedly up-
regulated in EBOV-infected HuH7 cells but only slightly induced in filovirus-infected
R06E-J cells (Fig. 5.17B). We do not currently know how the activity of PPP1R15A
is linked to EBOV or MARV infections. Fig. 5.20B highlights this extreme differen-
tial expression (45 X), which suggests that this gene may be a previously unidentified
key player in the response of EBOV-infected HuH7 cells.
For MARV-infected HuH7 and R06E-J cells, we did not detect such a large
change in PPP1R15A mRNA levels. PPP1R15A is involved in the regulation of
programmed cell death [364], and genes involved in apoptosis were not deregulated
(Fig. ES6.33), which might be because only immortalized cell lines were used in these
experiments. Another important activity of PPP1R15A is the negative feedback
control it exerts on eIF2α phosphorylation, which regulates protein translation [365].
EIF2α was not significantly differentially expressed in our samples (Fig. ES6.12).
Protein kinase R, which phosphorylates eIF2α, was slightly up- and down-regulated
in EBOV- and MARV-infected cells, respectively. It is possible that the previously
described inhibition of mTOR by PPP1R15A [366, 367] is responsible for the fact
that the mTOR pathway was not found to be active in the filovirus-infected HuH7
and R06E-J cell lines (Fig. ES6.26).

The DUSP pathway. The most striking difference between EBOV-infected HuH7
and R06E-J cells was observed among the mRNAs encoding the various DUSPs,
which represents a possible key to the contrasting innate immune responses of hu-
man and bat cells [368, 369] (Fig. 5.20C). An up-regulated expression of DUSP1
has been observed for vaccinia virus-infected cells [370] and EBOV-infected human
macrophages [311]. DUSPs are critical regulators of several cellular pathways be-
cause they inhibit central immune activator genes such as MAPK8, MAPK14 and
MAPK1 /3 (also known as ERK2 /1 ) [369]. While the expression levels of these three
immune activator genes did not change significantly p.i., the drastic up-regulation
at 23 h p.i. of DUSP genes in HuH7 (up to 25 X) but not R06E-J (up to 3 X) cells
is worth noting.
We hypothesize that the following sequence of events occurs during EBOV in-
fections. Upon EBOV invasion of the host cells, an antiviral response is induced, in-
cluding the activation of NFκB and MAPK. Genes of other innate immune response
pathways (e.g., JAK/STAT, DDX58 ) are then suppressed. After EBOV infection,
DUSPs are highly up-regulated in HuH7 cells, correlating with the down-regulation
of MAPK8, MAPK1 /3 and MAPK14 mRNA levels. When translated, these genes
are responsible for the innate immune response [372]. PPP1R15A plays a central role
by binding to the receptor TGFBR1 and inhibiting additional components of the in-
nate immune system. Compared with HuH7 cells, R06E-J cells demonstrate almost
no or only a very slight up-regulation of PPP1R15A and DUSPs, very likely leading
to a stable antiviral response. Furthermore, the mRNA levels of all JAK/STAT
pathway genes in R06E-J cells remain constant during EBOV infection compared
with the levels in HuH7 cells. The viral protein VP24 inhibits STAT1 activity in hu-
mans, blocking signaling into the nucleus. We suggest that a possible feedback loop
exists that increases the number of IFNGR2 receptors in HuH7 cells. The robust
expression of interferon-stimulated genes could orchestrate the antiviral response of
the infected cell [373].

Differences in baseline expression levels between human and bat cell lines

To validate the RNA-Seq-derived read counts and observed differences in the base-
line expression levels of certain genes between the HuH7 and R06E-J cell lines,
we performed qRT-PCR analyses for the putative EBOV receptors NPC1 [374] and
HAVCR1 [375] and the toll-like receptor TLR3 on Mock and EBOV samples 3 and
23 h p.i. (Sec. ES7). We compared the 18S-normalized mRNA levels of these genes
(File ES7B) with the RNA-Seq-derived read counts from human and bat cells and
found a strong overall correlation. Based on our data, all three genes are expressed
in both cell lines. We observed that NPC1 is clearly more abundant in HuH7 cells
than in the R. aegyptiacus cell line. TLR3 is expressed at a greater level in R06E-J
cells than in the human cell line, but was also not differentially expressed. These
results support our RNA-Seq data. Interestingly, we observed differences in the
melting curves for TLR3 and HAVCR1 (Fig. ES7B), which could be due to dif-
ferences in the amplified sequences of human and bat RNA as identical degenerate
primers were used for the amplification of TLR3 from HuH7 and R06E-J cells. For
HAVCR1 expression, we observed a slight down-regulation between 3 and 23 h p.i.
in the R06E-J cells. However, further studies with different cell lines and primary
cells are required.
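The comparison of qRT-PCR and RNA-Seq measurements described above can be sketched as follows. This is not the procedure used in this study, which is documented in Sec. ES7. We assume here the common ΔCt normalization against 18S with an amplification efficiency of about 100 % and a Pearson correlation on log2-transformed values. All function names and numbers are hypothetical.

```python
import math


def relative_level(ct_gene, ct_18s):
    """18S-normalized level as 2^-(Ct_gene - Ct_18S).

    Assumption: roughly 100 % amplification efficiency (doubling per cycle).
    """
    return 2.0 ** -(ct_gene - ct_18s)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


if __name__ == "__main__":
    # Invented (Ct_gene, Ct_18S) pairs and read counts for three samples.
    ct_pairs = [(24.0, 12.0), (27.0, 12.5), (30.0, 12.2)]
    read_counts = [5200, 810, 95]
    qpcr_log2 = [math.log2(relative_level(g, r)) for g, r in ct_pairs]
    rnaseq_log2 = [math.log2(c + 1) for c in read_counts]
    print(round(pearson(qpcr_log2, rnaseq_log2), 3))
```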
We observed clear differences in the baseline expression of some genes between
the HuH7 and R06E-J cell lines. However, we avoided the complications this issue
may cause when comparing homologous human and bat genes by focusing on the
calculated log2 -fold changes instead of directly comparing read counts.
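A small numerical sketch shows why fold changes are more robust to baseline differences than raw read counts. The counts below are invented, and the pseudocount is an assumption of this sketch. The fold changes in this study were derived from normalized counts with DESeq, not with this simplified formula.

```python
import math


def log2_fold_change(count_infected, count_mock, pseudocount=1.0):
    """log2 ratio of infected to Mock counts.

    The pseudocount (an assumption of this sketch) avoids division by zero
    for genes that are not expressed in one of the conditions.
    """
    return math.log2((count_infected + pseudocount) / (count_mock + pseudocount))


if __name__ == "__main__":
    # Invented counts: the baseline (Mock) level differs tenfold between the
    # two cell lines, but the response to infection is identical.
    human = {"mock": 799, "infected": 3199}
    bat = {"mock": 79, "infected": 319}
    print(log2_fold_change(human["infected"], human["mock"]))
    print(log2_fold_change(bat["infected"], bat["mock"]))
```

Comparing the infected read counts directly (3199 vs. 319) would suggest a strong difference between the cell lines, whereas the within-cell-line log2 fold changes are the same.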

[Figure 5.20: pathway diagrams in three panels (A, B, C) with a legend. Genes shown include
IFNGR2, JAK2, STAT1, STAT2, EP300, PIM1 and mTOR; TGFBR1, PPP1R15A, ZFYVE9 and
TGFB3; DUSP1, DUSP4, DUSP5, DUSP6, DUSP8, DUSP10 and DUSP16; and MAPK1, MAPK3,
MAPK8 and MAPK14. Cellular compartments shown: plasma membrane, cytoplasm and nucleus.]

Figure 5.20: Effects of filovirus infections on JAK/STAT, PPP1R15A, and DUSP pathways. (A) The JAK/STAT pathway. The JAK/STAT pathway
shows a common trend in expression levels: STAT1, STAT2 and JAK2 were up-regulated (↑) between 3 and 7 h p.i. and then down-regulated (↓) between
7 and 23 h p.i. in EBOV-infected HuH7 cells. The cytokine receptor IFNGR2 is not regulated between 3 and 7 h (=) and shows a 2 X up-regulation between
7 and 23 h (2 ↑) (Fig. ES6.22). (B) The PPP1R15A pathway. Growth arrest and DNA damage 34 (GADD34, officially known as PPP1R15A) can be
rapidly induced by several types of cellular stress. In R06E-J cells, PPP1R15A was slightly up-regulated (2 X) due to EBOV infection after 23 h; in HuH7
cells, we observed a strong up-regulation (45 X) in EBOV-infected cells and no up-regulation in MARV-infected cells. (C) The DUSP pathway. DUSP1,
8 and 10 demonstrate the highest specificity for MAPKs (MAPK14 and MAPK8 ). DUSP1 is localized in the nucleus, whereas DUSP8 and DUSP10 are
also available in the cytosol. The nuclear DUSPs are thought to be inducible phosphatases [369], and the implications of DUSPs during viral infections
have been demonstrated for DUSP1, which is up-regulated during Epstein-Barr virus [371] and vaccinia virus infections [370]. In response to the vaccinia
virus, DUSP1 is actively involved in antiviral countermeasures of the host cell via the regulation of MAPK phosphorylation. Legend: Boxes indicate
up/down-regulation from 3 to 7 h and from 7 to 23 h p.i. in EBOV-infected HuH7 cells (red); MARV-infected HuH7 cells (green); and EBOV-infected R06E-J
cells (blue). For cases where the expression level changed by more than 15 %, an arrow indicates the direction of regulation (↑/↓). When the expression
level changed by more than 100 % (2-fold change in transcription), the number beside the arrow indicates the fold change. “=” indicates expression changes
of <15 %. Squares around gene names indicate differential expression within the HuH7 cell line (red), between EBOV- and MARV-infected cells (green)
and between HuH7 and R06E-J cells (blue).
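The labeling rule stated in the legend of Figure 5.20 can be written as a small function. This is a sketch under stated assumptions: the input is the ratio of the later to the earlier expression level, the fold change of a down-regulation is taken as the reciprocal of that ratio, and the fold change is rounded to an integer for display. The name `regulation_label` is hypothetical.

```python
def regulation_label(ratio):
    """Return '=', an arrow, or 'N arrow' for an expression ratio (later / earlier).

    Assumptions of this sketch: a change of less than 15 % is shown as '=';
    larger changes get an arrow; from a 2-fold change on, the rounded fold
    change is printed beside the arrow. For down-regulation the fold change
    is the reciprocal of the ratio.
    """
    if ratio <= 0:
        raise ValueError("expression ratio must be positive")
    if abs(ratio - 1.0) < 0.15:
        return "="
    arrow = "↑" if ratio > 1.0 else "↓"
    fold = ratio if ratio > 1.0 else 1.0 / ratio
    if fold >= 2.0:
        return f"{fold:.0f} {arrow}"
    return arrow


if __name__ == "__main__":
    for ratio in (1.05, 1.5, 2.0, 45.0, 0.8, 0.25):
        print(ratio, regulation_label(ratio))
```

For example, the 45 X up-regulation of PPP1R15A would be rendered as "45 ↑", whereas a change of 5 % would be rendered as "=".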

The filovirus infection network


In analyzing the protein-level connections among the hundreds of significantly up-
regulated genes in relation to the connections at the transcriptional level and in
regard to the differences in the analyzed pathways, many specific and sometimes
surprising relationships between key players were observed. We identified a vast
and complex network of interacting host genes that may explain the fatal outcome
of the EBOV and MARV infections. We have summarized the genes that are signifi-
cantly regulated at the mRNA level in the “filovirus infection network” illustrated in
Fig. 5.21. The connections shown are based on known protein/protein interactions
from the literature. We summarized the differences between the different time points
over the course of infections in HuH7 and R06E-J cells as well as between EBOV and
MARV infections. This summary is not complete, because several highly regulated
genes, such as SKP2, which inhibits CDKN1B, CYCE and CDKN1A, were not con-
nected in our filovirus infection network, probably because of missing information in
the literature. When investigating other pathways, such as those involving MAPK,
NFκB, focal adhesion, or TGFβ, we observed that the portions of the pathways that
were active in the nucleus were differentially regulated between HuH7 and R06E-J
cells during filovirus infections.

5.2.5 Conclusions
The Ebola and Marburg filoviruses cause severe and often fatal infections in humans,
whereas bats, shown to be carriers, do not develop disease symptoms after infection.
As a first step towards identifying the cellular response that allows bats to survive
a filovirus infection, we provide a systematic overview of the genes that are differen-
tially expressed between human and bat cells during EBOV and MARV infections
at three time points p.i. Our investigations are based on 18 full transcriptomic
datasets.
In addition to the state-of-the-art RNA-Seq data analysis, comprising read count-
ing, normalization and calculations of fold changes, we investigated 1,500 genes
(∼7 % of human genes) in detail, overlapping them with the 2,500 genes poten-
tially affected during filovirus infections. For each gene, we investigated the follow-
ing aspects: (1) gene synonyms, (2) functional information, (3) different isoforms,
(4) characteristics in 5’/3’-UTR, (5) intronic transcripts, (6) single-nucleotide ex-
changes, (7) ncRNA detection, (8) description of novel genes, (9) genomic context,
(10) conservation, (11) expression profile changes, and (12) homologous gene detec-
tion in R. aegyptiacus. The result of this multidimensional bioinformatics analysis
is a comprehensive Electronic Supplement17 that provides quick insights into how
individual genes of interest are regulated during EBOV and MARV infections via
transcriptional changes and the generation of alternatively spliced forms. We were
only able to investigate expression patterns in two immortalized cell lines of differ-
ent tissue origin (humans, liver; bat, embryonic). The data collected and presented
here serve as valuable sources of information for generating and testing hypotheses
concerning the regulatory circuits that are active in filovirus-infected cells and may
accelerate filovirus research.

17
www.rna.uni-jena.de/supplements/filovirus_human_bat/igo.php

[Figure 5.21: network diagram of interacting host genes and protein complexes, including
MAPK1, MAPK3, MAPK8, MAPK14, DUSP1, DUSP4, DUSP5, DUSP6, DUSP8, DUSP10,
DUSP16, NFKB1, RELA, JUN, FOS, ATF3, EGR1, TGFB1, TGFB3, TGFBR1, PPP1R15A,
EP300, SKP2, CDKN1A and CDKN1B.]

Figure 5.21: The filovirus infection network. Key players of EBOV and MARV infection in HuH7 and R06E-J cells at the transcriptional level and their interactions
at the protein level are displayed. We combined significantly differentially expressed genes from known pathways and the literature. Nearly all of the highly
deregulated key players of the most significant pathways of this study (e.g., MAPK, NFκB, JAK/STAT and DUSP) are part of this filovirus infection network.
We were not able to connect all genes (e.g., SKP2, which inhibits CDKN1B, CYCE and CDKN1A). We found seven members of the DUSP pathway (DUSP1,
4, 5, 6, 8, 10 and 16) to be highly deregulated and involved in many interactions, mostly repressing MAPK8, MAPK1/MAPK3 and MAPK14. We
found several transcription factors (e.g., FOS, JUN, ATF3) to be strongly up-regulated during EBOV infection. Cilloniz et al. performed global gene
expression analysis of spleen samples of mice infected with different mouse-adapted EBOVs [300]. Here, we mainly confirm the results for HuH7 cells.
Yellow background: receptors and associated proteins; blue background: nuclear proteins; red boxes: significant differential expression in HuH7
cells between 3, 7 and 23 h; green boxes: significant differential expression between EBOV and MARV; blue boxes: significant differential expression
between HuH7 and R06E-J cells. A detailed picture, gene descriptions and gene regulations can be viewed in Fig. ES6.1.


EBOV and MARV replicate more rapidly in HuH7 cells than in R06E-J cells at
the early stages of infection. This behavior is especially distinct for EBOV. This
trend was confirmed based on the detection of larger quantities of viral nucleopro-
teins in HuH7 than in R06E-J cells as assessed via IFA. The slower replication rate
of filoviruses in bat cells suggests that they have more time to establish a success-
ful antiviral defense, whereas humans are almost immediately overwhelmed by the
virus.
Transcription factors are the most differentially expressed genes in human cells
after filovirus infection, and the activity of various transcription factor binding mo-
tifs changes after filoviral infection. Some of these changes are common to both
filoviruses. However, other changes are specific to EBOV or MARV infections.
We mapped the differential expression data of single genes to known innate
immune-response pathways and to those we found significantly enriched. Here,
we report pathways that are strongly deregulated between EBOV-infected HuH7
and R06E-J cells. However, expected deregulated pathways, with commonalities in
transcriptional responses to EBOV and MARV infections but differences between
HuH7 and R06E-J cells, were not observed overall. We condensed the data from this
analysis into a single figure representing the filovirus infection network. This network
provides valuable insights into previously undescribed interactions and responses to
filovirus infections.
Although we examined the expression pattern of three genes (HAVCR1, NPC1,
TLR3 ) by performing qRT-PCR analyses of the HuH7 and R06E-J cell lines, further
investigations (e.g., in similar or different cell types, primary cells, or cells of the
immune system) are required to improve the compiled data, as are in vivo studies
on this topic.
In summary, we present a systematic and comprehensive computational
study that provides a basis for further research into the pathogenesis of filoviruses
and contributes significantly to the field of virology in general.


5.3 A short NGS design detour: RVFV infection in bat cells
Here, we take a short detour to contrast the RNA-Seq data presented in Sec. 5.2
with a comparable experimental setup (bat cells infected with a virus and sampled
at different time points) that used different NGS design parameters.
In this RNA-Seq project, we were interested in the transcriptional response of
Myotis daubentonii (Microchiroptera, see Fig. 6.5) bat cells infected with a Rift
Valley Fever Virus Clone 13 (RVFV) [15]. The virus belongs to the Bunyaviridae
family, a group of single-stranded RNA viruses with a genome of negative orientation
approximately 12 kb in size [376]. In contrast to filoviruses such as EBOV, RVFV
has a tripartite genome comprising the L, M and S segments. The
three genomic segments code for six major proteins. Here, a naturally attenuated
RVFV isolate, altered in the small segment S and avirulent in mice and hamsters,
was used for infection [377].
RVFV causes Rift Valley fever, a viral disease with mild to severe symptoms.
Mild symptoms, such as fever and headaches, last about one week, whereas
severe symptoms may include loss of sight, severe headaches, and bleeding
together with liver problems. In 50 % of cases, severe infections involving bleeding
are fatal [378].
At the start of this ongoing project, we again had to decide on the NGS design.
As this project was set up after the Ebola project presented in Sec. 5.2, we used
the experience gained from analyzing the Ebola RNA-Seq data to change certain
design parameters. First, we were now able to obtain three biological replicates
per condition to improve the statistical power of the analysis; in the Ebola project,
replicates were difficult to obtain under S4 laboratory safety conditions. Second,
we sequenced 50 bp single-end reads, in contrast to the 100 bp paired-end reads
used in the Ebola project. As we did not aim to build a comprehensive de novo
transcriptome assembly for M. daubentonii, single-end data of this read length are
sufficient for reference-based mapping and quantification, and less expensive.
Longer and paired-end reads are of course an advantage, but the saved budget
allowed us to sequence replicates and two different library types. Third, instead of
a single library preparation protocol, we combined two cDNA library preparations:
(1) depletion of rRNAs with the Illumina TruSeq Stranded Total RNA with Ribo-Zero
kit (rRNA−) and (2) enrichment of small RNAs such as miRNAs with the Illumina
TruSeq Small RNA kit (smallRNA). In this way, we are not restricted to molecules
with a polyA tail and can also observe the expression of small RNAs. In Sec. 5.2,
only the rRNA− protocol was used, so adequate detection of small RNAs was not
possible from the outset. Finally, all samples were sequenced with preserved strand
specificity, another major advantage over the Ebola RNA-Seq data, which allows us
to detect small non-coding RNAs and antisense transcripts much more efficiently.
Overall, samples of six different conditions were prepared and sequenced in
triplicate, comprising uninfected control samples (MOCK), interferon-stimulated
samples (IFN), and samples infected with the RVFV Clone 13 (RVFV) at 6 h and 24 h
post infection. Briefly, the RNA-Seq data were quality controlled and mapped with
STAR [39] to the reference genome of Myotis lucifugus, a closely related species
with an available genome. Reads were quantified with featureCounts [78], and the
DESeq2 [81] pipeline was used to call DEGs. Preliminary data of the rRNA− and
smallRNA libraries [15] are shown and briefly compared in Fig. 5.22.
With this short detour, we want to point out again the importance of a well-
thought-out study design, which can greatly improve the downstream bioinformatic
analyses. In some cases, factors like the budget or safety requirements in the laboratory
(e.g. when working with deadly viruses) set limits to the NGS study design (as
for the project presented in Sec. 5.2). In such cases, a state-of-the-art analysis of
the RNA-Seq data is theoretically possible, but in most cases does not untangle
a comprehensive story from the data. Much more manual work and specialized
methods are needed to obtain meaningful results (as shown in Sec. 5.2). Therefore,
factors like the sequencing depth, the number of biological replicates, the chosen
protocols for molecule enrichment, and the chosen read length are always important
parameters to consider and to optimize within the given terms and conditions of a
new NGS project.

[Figure 5.22. Panel A: MA plots for the smallRNA and rRNA− libraries, comparing
Mock vs RVFV 6h and Mock vs RVFV 24h; x-axis: mean normalized counts, y-axis:
log2 fold change. Panel B: PCA plots for the smallRNA and rRNA− libraries, with
samples labeled by condition (IFN, Mock, RVFV) and timepoint (6h, 24h); extracted
axis labels: PC1 45.1% and 67.8% variance, PC2 11.7% and 17.2% variance.]

Figure 5.22: Preliminary data comparing the smallRNA (left) and rRNA− (right)
protocols used for library preparation. (A) MA plots visualizing the mean expression per gene (x-
axis, unique normalized counts) against the log2 fold change (y-axis). In the first row, control and
RVFV-infected samples 6 h post infection (p.i.) are compared. At this early stage of infection, only
a few genes are significantly (red) dysregulated. The second row shows the comparison between
control and RVFV 24 h p.i., revealing a large number of differentially expressed genes, especially in
the rRNA-depleted (rRNA−) samples. Of interest are the different expression patterns between the
smallRNA and rRNA− protocols: small RNAs are less highly expressed, but also show high
fold changes. (B) PCA plots of all 18 samples based on the smallRNA (left) and rRNA− (right)
libraries. A clear separation between the IFN-stimulated samples and the Mock and RVFV samples
can be observed. Also, the Mock and RVFV samples 6 h p.i. cluster together, reflecting the slow
response of the cells at early steps of infection when viral replication has just started. At 24 h p.i. the
RVFV samples separate from the rest. With the help of the PCAs, one outlier within the RVFV
24 h triplicates could be identified. Preliminary data was obtained from [15].
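
The two quantities plotted in an MA plot can be illustrated with a small, self-contained sketch. It is not the DESeq2 implementation: it only mimics the idea of median-of-ratios size factors and then derives, per gene, the mean normalized count and a log2 fold change. The function names, the pseudocount, and the toy layout of the count table are assumptions made for illustration.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors (simplified sketch).

    counts maps a gene name to a list of raw counts, one entry per sample.
    """
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for values in counts.values():
        if min(values) <= 0:
            continue  # genes with a zero count cannot enter the geometric mean
        log_geo_mean = sum(math.log(v) for v in values) / n_samples
        for j, v in enumerate(values):
            log_ratios[j].append(math.log(v) - log_geo_mean)
    return [math.exp(median(r)) for r in log_ratios]


def ma_values(counts, group_a, group_b, pseudocount=1.0):
    """Return {gene: (mean normalized count, log2 fold change of B over A)}.

    group_a and group_b are lists of sample indices; the pseudocount is an
    assumption that avoids division by zero and is not a DESeq2 shrinkage.
    """
    factors = size_factors(counts)
    result = {}
    for gene, values in counts.items():
        norm = [v / f for v, f in zip(values, factors)]
        mean_a = sum(norm[j] for j in group_a) / len(group_a)
        mean_b = sum(norm[j] for j in group_b) / len(group_b)
        mean_all = sum(norm) / len(norm)
        lfc = math.log2((mean_b + pseudocount) / (mean_a + pseudocount))
        result[gene] = (mean_all, lfc)
    return result
```

Calling `ma_values(counts, group_a=[0, 1], group_b=[2, 3])` on a table with two control and two infected samples yields one (x, y) point per gene; significance calling, as done by DESeq2 in the text, is deliberately left out of this sketch.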

Chapter 6

Single Nucleotide Investigations


This chapter is based on our publications “PoSeiDon: a web server for the detection
of evolutionary recombination events and positive selection” [9]*, “Evolution and
antiviral specificity of interferon-induced Mx proteins of bats” [7] and “Evolution
and antiviral specificity of interferon-induced Mx proteins in rodents” [18]*. Here,
we will mainly focus on the topics Alignment, Phylogeny, Positive Selection, Genetic
Variation, and Visualization (see overview Fig. 2.2; H, I, K–M).
In the previous part of this thesis (Chapters 3–5) we discussed various high-
throughput sequencing applications, all based on Next-Generation Sequencing data
of different setups (DNA vs. RNA, single-end vs. paired-end, non- vs. strand-specific,
polyA vs. rRNA−, ...). Such NGS studies can provide comprehensive insights into
a broad range of biological processes and can help to tackle various questions
depending on the type of analyses performed. Thousands of genes can be tested
for differential expression in parallel. In contrast, in this chapter we will dig
even deeper inside the Black Box by taking a closer look at more restricted data
sets, single genes and specific nucleotide positions.
Besides high-throughput workflows and whole genome/transcriptome studies,
approaches that focus on specific genes or even single nucleotide positions are still
highly important. Therefore, an NGS study should not end with a table showing
differentially expressed genes and their significance. Rather, NGS should be seen as a
helpful tool to identify top candidate genes and kingpins for further, more detailed
investigations. Often, this step is neglected at the end of a huge NGS project and
important details are simply lost in big data tables and expression heat maps.
This chapter may encourage readers to really go into their data (e.g. by checking
top candidate genes again in a genome browser like IGV [325]) and to not rely solely on
the reported p-values and significance tables of a differential expression tool.
In this chapter, we will present two selected and closely entangled projects [7,
9], complemented by data from Müther et al. [18]. We will focus on the detection of
recombination events and positively selected sites in protein-coding genes. In theory,
candidate genes for such an analysis can be identified by NGS in a differential gene
expression study (see also Chapter 5 and Fig. 2.14). A possible use case could
involve the NGS-based detection of a differentially expressed immunorelevant gene
(e.g. Mx1) between a control and a virus-infected sample. In a next step, such a
candidate gene can be further investigated for positively selected sites that might be
the result of a host-virus ‘arms-race’ during evolution.
* unpublished work, publication in progress


In Sec. 6.1 we present our pipeline called PoSeiDon, that allows for the detection
of putative recombination events and positively selected sites in an alignment of
protein-coding homologous sequences. The pipeline was developed during my work
on two other projects [7, 18] and is now publicly available1 as an easy-to-use web
server [9].
In the last section of this chapter (Sec. 6.2), we present a comprehensive study of
the immunorelevant gene Mx1 in 13 bat species. Here, we focused again on bats (see
Sec. 5.2), because those flying mammals seem to be an important model organism
to study host-virus ‘arms-races’ as they show no symptoms when becoming infected
by viruses such as EBOV or MARV [4]. In this project, the PoSeiDon pipeline
(Sec. 6.1) was applied to detect positively selected sites in the Mx1 gene of bats.
The project was conducted together with Jonas Fuchs and Prof. Dr. Georg Kochs,
who performed the wet lab work and experiments at the Institute of Virology in
Freiburg.
Besides the single nucleotides investigated within the projects presented here,
we were also working on other more restricted and specialized topics that will only
be mentioned briefly here.
In one project, conducted together with Dr. Daniel Steinbach from the Univer-
sitätsklinikum Jena, we are working on a different type of NGS data than presented
so far, derived from so-called whole exome sequencing. Here, we aimed to identify
somatic single nucleotide variants and insertions/deletions between the exomes of
bladder cancer patients of different tumor states [14]. The exome comprises the part
of the genome that is formed by exons, i.e. the sequences that remain in a mature
RNA after transcription and splicing. It consists of all the DNA that is transcribed
into mature RNA in any cell at any time point. Therefore, the exome differs
from the transcriptome (examined via RNA-Seq), where only RNA that has been
transcribed in a specific cell population at a specific time point is visible. The project is
again accompanied by a comprehensive Electronic Supplement2, providing interac-
tive online tables to check for specific variant positions in the human genome.
In another project, we focused on a single ncRNA class, tRNAs (transfer
RNAs), instead of looking at many (e.g. all annotated) genes in parallel (as in
Chapter 5). Even more specifically, we only considered the genome of a single cell
organelle, the mitochondrion, instead of looking at the full genome level. On this limited
data set, we tried to detect so-called remolding events in alignments of tRNAs by
utilizing maximum likelihood functions. My main contribution to this project was
the calculation of alignments of tRNAs and the implementation of a novel maximum
likelihood based algorithm called MLRD (maximum likelihood remolding detection)
for identifying the position of a remolding event by utilizing a previously calculated
phylogenetic tree. Furthermore, I was mainly responsible for the visualization of
the alignments, trees and detected remolding events. More details about the evo-
lutionary process of tRNA remolding and mitochondrial genomes in general can be
found in our publication [3] and in the thesis of my former colleague Dr. Abdullah
H. Sahyoun [20].

1 The PoSeiDon web server: http://www.rna.uni-jena.de/poseidon
2 available at http://www.rna.uni-jena.de/supplements/urology_all/

6.1 PoSeiDon: a web server for the detection of evolutionary recombination events and positive selection
PoSeiDon is an easy-to-use web service that helps researchers to detect recombina-
tion events and sites under positive selection in protein-coding sequences. Using
only homologous sequences, the pipeline automatically builds a multiple sequence
alignment, estimates a best-fitting substitution model and performs a recombination
analysis followed by the construction of all corresponding phylogenies. Finally,
significantly positively selected sites are detected under varying models for the full alignment
and possible recombination fragments. The outcome of PoSeiDon is a user-friendly
web page, providing all intermediate results and data files and graphically displaying
recombination events and positively selected sites. PoSeiDon is freely available at
http://www.rna.uni-jena.de/poseidon. The pipeline is implemented in
Ruby and processes the output of various tools.

6.1.1 Positive selection and recombination


Selection pressure continuously affects the evolution of genes and can be studied
in multiple ways [379]. One approach involves the comparison of orthologs to detect
sites (codons) that underwent positive (diversifying) selection. When positive selection
has occurred, the ratio between the non-synonymous substitution rate (dN) and the
synonymous substitution rate (dS) is shifted. For instance, certain amino acid
changes are favored if they increase the host's fitness against a pathogen [380].
The dN/dS ratio (or ω) may then reach values significantly greater than 1, and we call
such a site positively selected. The detection of such sites allows researchers to gain
insights into the evolution of a gene and might also help to develop countermeasures
against pathogens that are in an ‘arms-race’ with their host (Fig. 6.1).

[Figure 6.1: schematic of the Mx1 variants of four host species (Species 1 to 4) facing a
virus; legend: Mx1-specific binding, no binding, infection.]

Figure 6.1: Exemplarily shown is an ‘arms-race’ scenario between a host gene (Mx1, for a de-
tailed evolutionary analysis of this gene in bats see Sec. 6.2) and a virus. The dynamin-like GT-
Pase Mx1, an interferon-stimulated gene (ISG), inhibits several viruses by blocking early steps
of their replication cycle. Viral infections are thought to be one of the major forces driving se-
lective pressure acting on living organisms. So-called ‘arms races’ between host and virus result
in high selection pressure on the host to evolve a defence against the pathogens, while the virus
itself establishes countermeasures to evade the host immune system.

Since recombination can have a profound impact on evolutionary processes
and can adversely affect phylogenetic reconstruction and the accurate detection
of positive selection [381], screening for breakpoints to define recombinant parts
within an alignment should be a default step in each comparative evolutionary study.
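
For reference, the ratio used throughout this section can be written compactly. The three regimes below are the standard textbook interpretation of ω and are not specific to PoSeiDon; the symbols $d_N$ and $d_S$ correspond to dN and dS in the text.

```latex
\[
  \omega = \frac{d_N}{d_S},
  \qquad
  \begin{cases}
    \omega < 1 & \text{purifying (negative) selection},\\
    \omega = 1 & \text{neutral evolution},\\
    \omega > 1 & \text{positive (diversifying) selection}.
  \end{cases}
\]
```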
A comprehensive evolutionary analysis of significantly positively selected sites
consists of several steps, including (1) in-frame alignment; (2) INDEL correction; (3)
phylogenetic tree calculation; (4) selection of a best-fitting nucleotide substitution
model; (5) detection of topological incongruence and breakpoint selection to describe
putative recombination events; (6) calculation of positively selected sites (ω > 1)
under varying models; and (7) assessment of their impact on the selective pressure acting on
the whole alignment. Thus, such an analysis involves dozens of different tools and
parameter settings. Additionally, the output of many well-established and widely
used tools in this field of evolutionary science is not easy to interpret and to process.
In particular, the accurate detection and handling of putative recombination events is
a challenging but important task.
Only a few web servers for the detection of positive selection exist so far. Those
web servers either do not automatically combine all of the described steps [382], do
not take possible recombination events into account [383, 384], or focus only on the
detection of positive selection in prokaryotic genomes [385].
Here, we present PoSeiDon, a web-based and easy-to-use pipeline that allows
researchers to perform comprehensive evolutionary studies by automatically tak-
ing care of all tasks mentioned above. PoSeiDon not only detects positively
selected sites in an alignment of homologous sequences, but also possible recombi-
nation events that could otherwise adversely affect the positive selection detection.
The input is a single FASTA file consisting of protein-coding DNA sequences with
a correct open reading frame. The output is visualized in a user-friendly HTML
page, including all results and intermediate files generated within the pipeline.

6.1.2 Pipeline and implementation


PoSeiDon comprises an assembly of different scripts and tools (Fig. 6.2), which
allow for the detection of recombination and positive selection in protein-coding se-
quences. Starting from homologous coding sequences provided by the user, we build
a multiple sequence alignment guided by amino acid information with TranslatorX
(v1.1) [386], using Muscle (v3.8.31) [387] to align the sequences. The resulting in-
frame nucleotide alignment is cleaned for INDELs. A best-fitting substitution model
is selected using MODELTEST [388], which is part of the HyPhy suite (v2.2) [389].
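
The idea of an alignment "guided by amino acid information" can be sketched as follows: the protein sequences are aligned first, and the codons of each nucleotide sequence are then threaded back onto the aligned protein so that gaps always span whole codons. The function below is a hypothetical illustration of that principle only; it is not TranslatorX code, and the name `thread_codons` and the toy sequences are assumptions.

```python
def thread_codons(nt_seq, aligned_protein, gap="-"):
    """Map the codons of nt_seq onto an aligned protein sequence.

    Every residue in the aligned protein consumes one codon; every gap
    column becomes a three-character gap, which keeps the reading frame.
    """
    codons = [nt_seq[i:i + 3] for i in range(0, len(nt_seq), 3)]
    n_residues = sum(1 for aa in aligned_protein if aa != gap)
    if len(nt_seq) % 3 != 0 or len(codons) != n_residues:
        raise ValueError("coding sequence and protein alignment do not match")
    codon_iter = iter(codons)
    return "".join(gap * 3 if aa == gap else next(codon_iter)
                   for aa in aligned_protein)


# Toy example: two short coding sequences and a made-up protein alignment.
aligned = {
    "seq1": thread_codons("GTTATGAAG", "VM-K"),
    "seq2": thread_codons("GTACTGGCTAAA", "VLAK"),
}
```

Because gaps are only ever introduced in multiples of three, the resulting nucleotide alignment stays in frame, which is the property the downstream codon models rely on.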
As recombination can have a profound impact on the evolutionary history of
sequences [381], we check the alignment for possible breakpoints using GARD [390,
391] under the previously selected substitution model. GARD is also part of the
HyPhy package and can be used to screen for phylogenetic incongruences to define
breakpoints due to putative recombination events in a multiple sequence alignment
(Fig. 6.3). All breakpoints are tested for significant topological incongruence using
a Kishino-Hasegawa (KH) test [392]. KH-insignificant breakpoints most frequently
arise from variation in branch lengths between segments. The user can choose to
also take KH-insignificant breakpoints into account, because we already observed
interesting positively selected sites in fragments without any significant topological
incongruence. KH-insignificant fragments are marked in the final output, as they
might not arise from real recombination events.

[Figure 6.2: workflow diagram. User input (FASTA) → Alignment (TranslatorX, Muscle;
nucleotide and amino acid alignments) → Recombination (model test, GARD, KH test;
resulting fragments) → Tree (RAxML, Newick Utilities) → Positive Selection (CODEML;
NS site models M0, M1a, M2a, M7, M8, M8a; codon frequencies F61, F1x4, F3x4;
likelihood ratio tests M1a vs M2a, M7 vs M8, M8a vs M8; chi-squared test; dN/dS
ratios; BEB) → Output (annotated codon alignment, shown here for Fragment 1 and
Fragment 2).]

Figure 6.2: Workflow of the PoSeiDon pipeline and example output. The PoSeiDon pipeline comprises in-frame alignment of homologous protein-coding
sequences, detection of putative recombination events and evolutionary breakpoints, phylogenetic reconstructions and detection of positively selected sites
in the full alignment and all possible fragments. Finally, all results are combined and visualized in a user-friendly and clear HTML web page. The resulting
alignment fragments are indicated with colored bars in the HTML output.

[Figure 6.3. Panel A: species tree (scale: substitutions/site) with Phyllostomidae
(Artibeus jamaicensis, Sturnira lilium, Carollia perspicillata), Vespertilionidae
(Myotis davidii, Myotis daubentonii, Myotis lucifugus, Myotis brandtii, Eptesicus
fuscus, Pipistrellus spec.) and Pteropodidae (Pteropus alecto, Rousettus aegyptiacus,
Hypsignathus monstrosus, Eidolon helvum). Panel B: model averaged support (y-axis)
against breakpoint location (x-axis, 0 to 2000) for the alignment positions 1-1947,
with breakpoints at 270 and 549. Panel C: schematic trees for the fragments at
positions 1-270 (90 aa), 271-549 (92 aa) and 550-1947 (466 aa).]

Figure 6.3: (A) Species tree of 13 bat species. (B) Putative breakpoints in the nucleotide align-
ment identified with GARD. (C) Schematic view of phylogenetic trees derived from the alignment
fragments based on the identified breakpoints. The largest fragment (550-1947) of the alignment
follows the topology of the species tree, whereas the other parts show different evolutionary set-
tings. Different topologies can have an impact on the positive selection detection. Therefore, in
PoSeiDon, the full alignment as well as each recombinant fragment are analyzed independently.
The example is based on data from Fuchs et al. [7].


We tested the PoSeiDon pipeline on the Mx1 sequences of 13 rodent species [18]
and were able to observe a recombination breakpoint (aa pos. 152) in the alignment
leading to significant topological incongruence. When splitting the initial alignment
into two parts according to the breakpoint, the first part showed no positive selection,
whereas the second part was positively selected under all tested models and codon
frequencies3, underpinning the impact that possible recombination events can have on
the detection of positive selection.
Positions of putative breakpoints that would destroy the open reading frame
are adjusted to the next valid position. Phylogenetic reconstructions on the full
alignment and all fragments are performed with RAxML (v8.0.25) [162] using the
GTRGAMMA model for nucleotide sequences and PROTGAMMAWAG for amino acids.
All calculations are performed with 1 000 bootstrap replicates. Optional outgroup
rooting can be applied by the user. The Newick Utilities suite (v1.6) [163] is used
to visualize the calculated trees in different formats.
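
The breakpoint adjustment described above can be sketched in a few lines. The sketch assumes that a breakpoint is given as the 1-based position of the last nucleotide of the left-hand fragment and that "next valid position" means the next position that completes a codon; both are assumptions for illustration, and the function names are hypothetical rather than taken from the PoSeiDon source.

```python
def adjust_breakpoint(position):
    """Shift a 1-based breakpoint forward so that it ends a complete codon."""
    remainder = position % 3
    return position if remainder == 0 else position + (3 - remainder)


def fragments(breakpoints, alignment_length):
    """Split an in-frame alignment into (start, end) fragments, 1-based inclusive."""
    adjusted = sorted({adjust_breakpoint(b) for b in breakpoints})
    bounds, start = [], 1
    for b in adjusted:
        if start <= b < alignment_length:
            bounds.append((start, b))
            start = b + 1
    bounds.append((start, alignment_length))
    return bounds
```

With the breakpoints shown in Fig. 6.3, `fragments([270, 549], 1947)` gives the three ranges 1-270, 271-549 and 550-1947, each of which has a length divisible by three and can therefore be analyzed with codon models on its own.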
Positive selection is analyzed on the full alignment and each of the fragments
separately. Maximum-likelihood tests to detect positive selection under varying site
models are performed with CODEML (M1a vs. M2a, M7 vs. M8) implemented
within the PAML suite (v4.8) [393]. Furthermore, we implemented the M8a vs. M8
test proposed by [394] as additional model test in PoSeiDon. Different statistical
site models that do not allow (neutral models) or allow (selection models) a class of
codons to evolve with ω > 1 are compared. Furthermore, varying codon frequency
models are applied to simulate different nucleotide substitution rates. A gene is
declared to be positively selected if the neutral model can be rejected in favour
of the positive selection model based on a likelihood ratio test. Then, a Bayes
empirical Bayes (BEB) approach [395] is applied to calculate posterior probabilities
(PP) that a codon comes from the site class with ω > 1. Positively selected sites
with an assigned PP > 0.95 are depicted as significant.

3 see http://www.rna.uni-jena.de/supplements/mx1_rodents/html/full_aln/recomb.html and [18]

[Figure 6.4: domain scheme of Mx1 (labels: N, B, G domain, B, Stalk with loop L4, B, C)
above the PoSeiDon codon alignment view of loop L4 (codon positions 544 to 580 of 13 bat
sequences), annotated with BEB posterior probabilities per codon frequency model (F3x4,
F1x4, F61) and the amino acid color legend. In-figure note: exemplarily shown is a part
of the hypervariable loop L4 region of bat Mx1; significant (posterior probability > 0.95)
positively selected sites are marked.]

Figure 6.4: Exemplary view of the PoSeiDon visualization of the Mx1 gene, zoomed in on the
hypervariable loop region L4 in the Stalk domain (Sec. 6.2). The loop is already known to interact
with certain (RNA) viruses and to block early steps of the viral replication cycle. Using the
evolution-guided pipeline described here, we were able to identify the loop L4 of Mx1 as a hot spot for
positive selection in bats [7], as previously also shown for primates [396]. For details see Sec. 6.2.
By splitting the alignment at possible recombination events identified by GARD, we also found
strong evidence of positive selection in the N-terminal region of bat Mx1 (see exemplary Tab. 6.1,
frag1).
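
The likelihood ratio test described above reduces to a short calculation once the log likelihoods of the neutral and the selection model are known. The sketch below assumes two degrees of freedom, which is the value commonly used for the M1a vs. M2a and M7 vs. M8 comparisons and is consistent with the p-values listed in Tab. 6.1; it does not cover the M8a vs. M8 test, where a different reference distribution applies. For two degrees of freedom the chi-squared survival function has the closed form exp(-x/2), so no external library is needed. The function names are hypothetical.

```python
import math


def lrt_statistic(lnl_neutral, lnl_selection):
    """Twice the difference of the log likelihoods (selection minus neutral)."""
    return 2.0 * (lnl_selection - lnl_neutral)


def p_value_df2(statistic):
    """P-value of a chi-squared test with 2 degrees of freedom.

    For df = 2 the survival function is exactly exp(-x / 2). Negative
    statistics (numerical noise in the optimisation) are treated as 0.
    """
    return math.exp(-max(statistic, 0.0) / 2.0)


def is_positively_selected(lnl_neutral, lnl_selection, alpha=0.05):
    """Reject the neutral model if the LRT p-value falls below alpha."""
    return p_value_df2(lrt_statistic(lnl_neutral, lnl_selection)) < alpha
```

As a plausibility check against Tab. 6.1, a test statistic of 0.09 gives `p_value_df2(0.09)` of about 0.956, matching the frag2 row for F1X4, while a statistic of 82.85 gives a p-value far below 0.001.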
We graphically summarize all positively selected sites under varying frequency
models in the output (Fig. 6.2 and 6.4). Thus, we give the user the opportunity
to investigate sites that would be dismissed from the output when using a PP
threshold. For example, such sites could be located in regulatory domains of the final
protein, yielding a lower PP value due to insufficient species sampling [397]. The
final output of PoSeiDon is based on a heavily modified version of the TranslatorX
HTML output (Fig. 6.4). The amino acid color code is adapted from TranslatorX.
All commands executed within the pipeline are summarized in the final output,
allowing advanced users to adjust the predefined parameters.
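
How sites could be grouped by their posterior probability across the three codon frequency models is sketched below. The thresholds 0.95 and 0.99 are taken from the text and Tab. 6.1; the function name, the labels, and the example values are hypothetical and serve only to show that sites below the threshold are kept for inspection rather than discarded.

```python
def classify_sites(posteriors, significant=0.95, highly_significant=0.99):
    """Label each site by its best BEB posterior probability across models.

    posteriors maps a site name to {codon frequency model: PP}. Returns
    {site: (label, models in which PP exceeds the significance threshold)}.
    """
    labels = {}
    for site, per_model in posteriors.items():
        best = max(per_model.values())
        supporting = sorted(m for m, pp in per_model.items() if pp > significant)
        if best > highly_significant:
            label = "PP > 0.99"
        elif best > significant:
            label = "PP > 0.95"
        else:
            label = "below threshold"  # still reported, not dropped
        labels[site] = (label, supporting)
    return labels


# Illustrative values only; they are not results from the thesis.
example = {
    "site_A": {"F61": 0.977, "F1x4": 0.991, "F3x4": 0.998},
    "site_B": {"F61": 0.960, "F1x4": 0.900, "F3x4": 0.930},
    "site_C": {"F61": 0.720, "F1x4": 0.540, "F3x4": 0.530},
}
```

Keeping the per-model list next to the label makes it easy to see whether a site is supported under all codon frequency models or only under one of them.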

6.1.3 Conclusions
Here we present PoSeiDon, an easy-to-use, web-based pipeline for the accurate de-
tection of site-specific positive selection and recombination events in protein-coding
sequences. The input is a multi-FASTA file of homologous coding sequences that
is automatically converted into a codon-based alignment. Since recombination can
have a profound impact on the evolutionary history of sequences, we initially check
the alignment for topological incongruence to define putative recombination break-
points. The whole evolutionary analysis of PoSeiDon is performed independently
for the full alignment and all possible fragments. PoSeiDon automatically calcu-
lates maximum likelihood-based phylogenetic trees for all alignments, estimates ω
values at each site and computes their impact on the positive selection. All iden-
tified sites and their significance values are projected onto the codon and amino
acid alignment of the input sequences to allow visual identification of evolutionary
hot-spots with high ω values. Additionally, publication-ready PDF and LaTeX tables
are provided, including all breakpoints and significantly positively selected sites (see
Tab. 6.1). All results are summarized in a user-friendly and clear manner, allowing
researchers to study positive selection in various genes.
Table 6.1: Example of a final output table generated by PoSeiDon. Shown are results of the evo-
lutionary analyses for positively selected sites for the Mx1 gene in bats [7]. P-values were obtained
by performing chi-squared tests on twice the difference of the computed log likelihood values of
the models disallowing (M7) or allowing (M8) dN/dS > 1. The BEB column lists rapidly evolving
sites with a dN/dS > 1 and a posterior probability > 0.95, determined by the Bayes Empirical
Bayes approach implemented in Codeml. Amino acids refer to Myotis daubentonii. Note that INDELs and
the stop codon were removed from the alignment prior to evolutionary analysis. Shown positions
were mapped back to the alignment with gaps. Fragments arising from insignificant breakpoints
(adjusted p-value > 0.1) are marked with an asterisk. Detailed output of the pipeline can be found
at: http://www.rna.uni-jena.de/supplements/mx1_bats/full_aln and is discussed
comprehensively in Sec. 6.2.

Region                M7 vs M8  M7 vs M8  % sites    avg(ω)  M8 BEB
                      (χ2)      p-value   with ω>1           (PP > 0.95 / > 0.99)

F61
full (aa 1–672)       82.85     < 0.001   8.06       2.75    L138; D141; R205; A209; S361;
                                                             R433; D436; F439; R443;
                                                             S559; E562; S569; L570;
                                                             Q571; Q572; T573; S574;
                                                             S575; A577; D578; T581
frag1* (aa 1–106)     17.01     < 0.001   38.07      1.92    A16; T17; D19; P22; S25;
                                                             H26; P27; G31; G38; L40;
                                                             E41; L44; N46; S47; Q51
frag2* (aa 107–201)   0.39      0.821     NA         NA      NA
frag3* (aa 202–672)   82.48     < 0.001   8.1        3.0     A209; D436; R443; E562;
                                                             S569; L570; Q572; T573;
                                                             S574; S575; D578; T581

F1X4
full (aa 1–672)       76.34     < 0.001   7.05       2.79    R205; D436; F439; R443;
                                                             E562; L570; Q572; T573;
                                                             S574; S575; D578; T581
frag1* (aa 1–106)     20.33     < 0.001   31.04      2.12    A16; T17; D19; P22; S25;
                                                             H26; P27; G31; G38; L40; E41;
                                                             L44; N46; S47; Q51
frag2* (aa 107–201)   0.09      0.956     NA         NA      NA
frag3* (aa 202–672)   84.88     < 0.001   7.76       3.02    R205; D436; F439; R443;
                                                             E562; L570; Q572; T573;
                                                             S574; S575; D578; T581

F3X4
full (aa 1–672)       101.39    < 0.001   6.26       3.45    R205; A209; S361; F439; R443;
                                                             T494; A549; E562; S569;
                                                             L570; Q572; T573; S574;
                                                             S575; D578; T581
frag1* (aa 1–106)     24.05     < 0.001   21.3       2.76    A16; T17; D19; P22; A23;
                                                             S25; H26; P27; G31; G38;
                                                             L40; L44; N46
frag2* (aa 107–201)   0.19      0.908     NA         NA      NA
frag3* (aa 202–672)   112.69    < 0.001   6.59       3.83    R205; A209; S361; D436;
                                                             F439; R443; T494; A549;
                                                             E562; S569; L570; Q572;
                                                             T573; S574; S575; D578;
                                                             T581

6.2 Evolution and antiviral specificity of interferon-induced Mx proteins of bats against Ebola-, Influenza-, and other RNA viruses
Bats are a natural reservoir for various viruses that rarely cause clinical symptoms
in bats, but comprise zoonotic pathogens like Ebola (Sec. 5.2) or Rabies virus. It has
been speculated that the interferon system might play a key role in controlling viral
replication in bats. We speculate that the interferon-induced Mx proteins might be
key antiviral factors of bats and have coevolved with bat-borne viruses. This study
evaluates for the first time a large set of bat Mx1 proteins spanning three major
bat families for their antiviral potential, including activity against Ebola virus and
bat influenza A-like viruses, and describes their phylogenetic relationship, revealing
patterns of positive selection that suggest a coevolution with viral pathogens. By
understanding the molecular mechanisms of the innate resistance of bats against
viral diseases, we might gain important insights into how to prevent and fight human
zoonotic infections caused by bat-borne viruses.
Bats (Chiroptera) are one of the most abundant, ecologically diverse and globally
distributed groups of animals among all vertebrates [402]. Except for the polar regions,
bats can be found on all continents [403]. Although bats represent 20 % of the
mammals [404], only 12 of the 900 species are sequenced. Bats have evolved a number
of biological features that are unique among mammals, including echolocation and the
ability to fly. They occupy a broad range of different ecological niches and show a
relatively long lifespan and a natural resistance against various pathogenic viruses
(e.g. Ebola virus [402]). Until recently, their evolutionary origin remained mostly
unclear and was still under discussion [405, 406].
The origin of Chiroptera has been dated to the Cretaceous period, with an explosive
diversification during the Eocene [407]. Traditionally, based on morphological
attributes, bats were divided into two monophyletic subordinal groups (see Fig. 6.5,
'MM' classification): Megachiroptera (megabats), containing only a single family
(Pteropodidae) of non-laryngeal echolocating [408] and fruit-eating bats; and
Microchiroptera (microbats), containing 19 families of echolocating bats. Additional
support comes from a molecular analysis of mitochondrial cytochrome b of 648 bats [398]
and a large-scale analysis of 27 genes and morphological features [409].

Figure 6.5: Computationally available genomes of 12 bat species and their phylogenetic
relationship according to MM [398] and YY [399] classification. Detailed figure can be
found in the Supplemental Material of Mostajo et al. [12]. MM: Micro-/Megachiroptera
classification; YY: Yango-/Yinpterochiroptera classification. Phylogenetic tree for
genus Pteropus and Myotis adopted from [398, 400, 401].
[Species shown in the figure: Pteropus alecto, Pteropus vampyrus, Eidolon helvum,
Rousettus aegyptiacus, Rhinolophus ferrumequinum, Megaderma lyra, Myotis brandtii,
Myotis lucifugus, Myotis davidii, Eptesicus fuscus, Miniopterus natalensis,
Pteronotus parnellii.]

More recently, the order Chiroptera has been classified into Yinpterochiroptera
(Pteropodidae and Rhinolophoidae) and Yangochiroptera [403] based on overwhelming
molecular evidence (see Fig. 6.5, 'YY' classification). This phylogenetic arrangement
renders the echolocating bats paraphyletic and is supported by 13.7 kb of
17 nuclear gene fragments [403], 2 320 CDSs [399], KCNQ4 of 15 species [410], and
18 mitochondrial genomes [411].
Bats serve as a reservoir for various, often zoonotic viruses, including significant
human pathogens such as Ebola- and influenza viruses. However, for unknown
reasons, viral infections rarely cause clinical symptoms in bats. A tight control
of viral replication by the host innate immune defense might contribute to this
phenomenon. Transcriptomic studies revealed the presence of the interferon-induced
antiviral myxovirus resistance (Mx) proteins in bats, but detailed functional aspects
have not been assessed. To provide evidence that bat Mx proteins might act as key
factors to control viral replication we cloned Mx1 cDNAs from three bat families,
Pteropodidae, Phyllostomidae and Vespertilionidae. Phylogenetically these bat Mx1
genes cluster closely with their human ortholog MxA. Using transfected cell cultures,
minireplicon systems, virus-like particles and virus infections, we determined the
antiviral potential of the bat Mx1 proteins. Bat Mx1 showed a significant reduction
of polymerase activity of viruses circulating in bats, including Ebola- and influenza
A-like viruses. The related Thogoto virus, however, which is not known to infect bats,
was not inhibited by bat Mx1. Further, we provide evidence for positive selection
in bat Mx1 genes that might explain species-specific antiviral activities of these
proteins. Together, our data suggest a role for Mx1 in controlling these viruses in
their bat hosts.

6.2.1 Bats and the Mx1 gene


Bats are hosts to a broad range of different viruses and thereby also constitute a reservoir
of potentially zoonotic pathogens [402, 412, 413]. For example, rhabdoviruses including
Rabies and other lyssaviruses are commonly detected in various bat species [414–
416]. Highly pathogenic filoviruses of the Marburgvirus and Ebolavirus genus have
been detected in African fruit bats of the Pteropodidae family [417, 418]. There is
serological evidence that bats of the Phyllostomidae family found in Guatemala and
Peru are frequently infected with previously unknown influenza A-like viruses [419,
420]. Moreover, serological and viral nucleic acid sequencing data in African and
Asian bats of the Rhinolophidae and Vespertilionidae families suggest infection with
bunyaviruses of the Nairo- and Hantavirus genera [421–423]. However, most of these
pathogens, with the exception of Rabies- and bat lyssaviruses [416, 424], seem to
be under strong host control as they do not cause obvious disease in bats [425].
Therefore, the innate immune response may contribute to an early control of these
viruses [426], preventing disease but eventually allowing viral persistence and shedding.
Unveiling the potential mechanisms leading to this phenomenon is of utmost
importance to understand the ecology of zoonotic viruses, to explain the pathogenicity
of these viruses for humans and other dead-end hosts, and potentially also to
identify new ways to control these viruses once they have crossed the species barrier.
The interferon (IFN) system orchestrates the innate antiviral defense of mammalian
hosts through induction of IFN-stimulated genes (ISGs). Invading viral pathogens
are sensed by cellular pattern recognition receptors, leading to the secretion of type I
(IFN-α/β) and type III (IFN-λ) interferons. These cytokines activate their respective
receptors and induce an intracellular antiviral state, thereby suppressing virus
replication. Critical components of this innate antiviral defense system have also
been described in bats [427–432]. Moreover, constantly elevated expression of type I
IFNs in Pteropus alecto was recently described, and thus constitutive expression of
ISGs might contribute to an early stringent control of invading viral pathogens in
bats [433].
Myxovirus resistance (Mx) genes are prominent antiviral ISGs that are exclusively
induced by type I and type III IFN [434, 435]. Mx proteins are large GTPases
that were initially described as inhibitors of influenza virus replication [436]. Most
mammals encode two Mx paralogs called Mx1 and Mx2, or MxA and MxB for
the human gene products. Their structures resemble that of dynamin-like large
GTPases with an N-terminal globular GTP-binding (G) domain and a C-terminal
stalk that are both connected by a bundle-signaling element (BSE) [437]. More
recent studies broadened the antiviral spectrum of Mx proteins ranging from tick-
transmitted orthomyxoviruses, rhabdoviruses, and bunyaviruses to HIV and large
DNA viruses [436]. The antiviral action of Mx proteins is based on recognition
of viral target structures like viral ribonucleoprotein complexes, leading to mislo-
calization or even disruption of the viral structures [438–442]. Accordingly, escape
from Mx restriction has been described for human influenza A viruses (FLUAV)
and HIV-1 through mutations in the viral nucleocapsid proteins, the main structural
component of the ribonucleoprotein complexes [443, 444].
To better understand how bats are capable of controlling viral infections and
coexisting with potentially damaging pathogens, we analyzed the antiviral function
of bat Mx1 proteins and compared their activity with that of the well characterized
human MxA ortholog. Mx1 cDNAs from seven different bat species that were pre-
viously described as hosts of zoonotic viral pathogens were tested for inhibition of a
range of viruses that are related to viral agents commonly found in bats. We found
interesting differences in their antiviral specificity that are discussed with respect to
viral persistence in the bat hosts.

6.2.2 Material and methods


Wet lab experiments were conducted by Jonas Fuchs in partial fulfilment of the
requirements for a Ph.D. degree from the Faculty of Biology of the University of
Freiburg, Germany in the lab of Prof. Dr. Georg Kochs. Details about used cell lines,
IFN treatment, RNA extraction, transfection, western blot analyses, immunofluorescence
analyses and the different minireplicon systems can be found in [7]. The following
Mx1 cDNA sequences were generated in the present study (GenBank accessions):
C. perspicillata Mx1: KR362561; E. helvum Mx1: KR362562; H. monstrosus Mx1: KR362563;
M. daubentonii Mx1: KR362564; P. pipistrellus Mx1: KR362565;
R. aegyptiacus Mx1: KR362566; S. lilium Mx1: KR362567.
In the following, we will focus on the bioinformatic methods performed in the
lab of Prof. Dr. Manja Marz.

Data collection, alignments and evolutionary analysis


The following Mx1 sequences were obtained from GenBank and incorporated in the
evolutionary analysis: Myotis lucifugus (XM_006104436), Myotis davidii (XM_006754325),
Myotis brandtii (XM_005885691), Pteropus alecto (NM_001290174), Eptesicus fuscus
(XM_008145691). We downloaded RNA-Seq data of Artibeus jamaicensis [445], the Jamaican
fruit bat of the family Phyllostomidae, from EMBL-EBI (SRR539297, HiSeq 2000, 100 bp
paired-end reads) and built a de novo transcriptome assembly (following ideas
presented in Chapter 4) with three different assembly tools: Velvet (v1.2.10) [217]
with Oases (v0.2.08) [60], SOAPdenovo-Trans (v1.0.3) [63] and Trinity (v20131110) [223],
each with default parameters and multiple k-mer values. The resulting contigs of each
assembly tool were merged and clustered by sequence identity using CD-HIT-EST
(-c 0.95, v4.6) [146], resulting in a final assembly comprising 462,445 contigs. The
assembly was then searched with BLASTN using the Mx1 CDS of closely related bat
species (C. perspicillata, S. lilium) as queries. We identified full-length open
reading frames of homologous Mx1 genes in the de novo transcriptome assembly of
A. jamaicensis and incorporated the best matching sequence in our analyses. Mx1
homologous sequences of other mammals were downloaded from Ensembl, comprising
Homo sapiens (ENSG00000157601), Pan troglodytes (ENSPTRG00000013927),
Canis lupus (ENSCAFG00000010172), Equus caballus (ENSECAG00000011776),
Bos taurus (ENSBTAG00000030913), Mus musculus (ENSMUSG00000000386,
ENSMUSG00000023341), Rattus norvegicus (ENSRNOG00000001959, ENSRNOG00000001963),
Sus scrofa (ENSSSCG00000012077), Felis catus (ENSFCAG00000008068) and
Ovis aries (ENSOARG00000010283). For M. musculus and R. norvegicus both Mx genes
were used because of their known homology to human MX1.
In-frame multiple sequence alignments of all 13 bat Mx1 cDNA sequences and the
additional mammalian Mx1 genes were computed with TranslatorX (v1.1) [386] and
the aligner Muscle (v3.8.31) [387]. All alignments were automatically adjusted, and
gaps and stop codons were removed prior to the subsequent analyses. To determine the
evolutionary context of bat and other mammalian Mx1 genes, we constructed maximum
likelihood trees using RAxML (v8.0.25) [162] under the GTRGAMMA substitution
model with 1000 bootstrap replicates.
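
The removal of gaps and stop codons from an in-frame alignment can be illustrated with a minimal Python sketch. It is not the procedure used by TranslatorX or by the pipeline; it assumes that a codon column is dropped as soon as any sequence carries a gap or a stop codon there, and that the standard genetic code applies. The function name and the toy sequences are hypothetical.

```python
# Stop codons of the standard genetic code (assumption for this sketch).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def clean_codon_alignment(aln):
    """Drop every codon column that holds a gap or a stop codon in any sequence.

    aln maps sequence names to aligned, in-frame nucleotide strings.
    """
    names = list(aln)
    seqs = [aln[name].upper() for name in names]
    length = len(seqs[0])
    if any(len(seq) != length for seq in seqs) or length % 3 != 0:
        raise ValueError("expected an in-frame alignment of equal-length sequences")

    kept = []  # start positions of the codon columns that survive
    for start in range(0, length, 3):
        codons = [seq[start:start + 3] for seq in seqs]
        if any("-" in codon or codon in STOP_CODONS for codon in codons):
            continue
        kept.append(start)

    return {name: "".join(seq[i:i + 3] for i in kept)
            for name, seq in zip(names, seqs)}


# Toy alignment (hypothetical): column 3 holds a gap, column 5 a stop codon.
toy = {
    "bat_a": "ATGGCT---AAATGA",
    "bat_b": "ATGGCAGGGAAATGA",
}
cleaned = clean_codon_alignment(toy)
print(cleaned)
```

Removing whole codon columns keeps the reading frame intact, which the codon-based site models applied afterwards require.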

Recombination events and positive selection analysis


We used the GARD tool [390] to detect possible recombination events in the bat
Mx1 alignment. First, an automatic model selection was applied to suggest the
best-fitting nucleotide substitution model for the alignment. Then, the in-frame
alignment and the corresponding model were passed to the GARD algorithm to estimate
recombination breakpoints using the general discrete model of site-to-site rate
variation with 3 rate classes. Furthermore, we performed an RDP (v4.80) analysis [446]
to validate the GARD results.


We performed maximum likelihood tests on nested 'site' models implemented in
CODEML, part of the PAML software suite [393], and in the HyPhy suite [389] to detect
significantly positively selected sites under varying models in the full bat alignment
and in the corresponding fragments previously identified with GARD. Alignment-specific
input trees were calculated with RAxML. The coding sequence alignments were fit
to paired 'site' models that disallow (M1a, M7) or allow (M2a, M8) positive selection
(ω > 1). A likelihood ratio test between paired models was performed to derive
p-values for each alignment. If a significant difference (p < 0.01) between models M7
and M8 was detected, a Bayes Empirical Bayes (BEB) analysis was performed to
identify codons in the alignment with ω > 1 and to calculate their impact on the
positive selection (significant if posterior probability ≥ 0.95). We ran CODEML
multiple times with different starting ω values (0, 0.5, 1, 1.5, 2.0) and under varying
codon frequency models (F3x4, F1x4, F61).
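
The repeated CODEML runs described above form a small grid of starting ω values and codon frequency models. The following Python sketch generates one control file text per combination. It is an illustration only and not the code of the pipeline: the file names are hypothetical, only a subset of control-file options is written, and the CodonFreq codes (1 = F1X4, 2 = F3X4, 3 = codon table) follow our reading of the PAML documentation.

```python
from itertools import product

# CodonFreq codes of the codeml control file as we understand them:
# 1 = F1X4, 2 = F3X4, 3 = codon table (F61).
CODON_FREQ = {"F1x4": 1, "F3x4": 2, "F61": 3}
START_OMEGAS = [0, 0.5, 1, 1.5, 2.0]  # starting values listed in the text


def codeml_ctl(seqfile, treefile, outfile, codon_freq, omega):
    """Return the text of a codeml control file for the nested site models."""
    settings = {
        "seqfile": seqfile,
        "treefile": treefile,
        "outfile": outfile,
        "runmode": 0,           # use the supplied tree
        "seqtype": 1,           # codon sequences
        "CodonFreq": CODON_FREQ[codon_freq],
        "model": 0,             # one omega ratio for all branches
        "NSsites": "1 2 7 8",   # M1a, M2a, M7 and M8 in one run
        "icode": 0,             # universal genetic code
        "fix_omega": 0,         # omega is estimated
        "omega": omega,         # starting value
    }
    return "\n".join(f"{key} = {value}" for key, value in settings.items()) + "\n"


# Hypothetical input file names; one control file per grid cell.
runs = {
    (freq, omega): codeml_ctl("mx1_bats.phy", "mx1_bats.tree",
                              f"codeml_{freq}_w{omega}.out", freq, omega)
    for freq, omega in product(CODON_FREQ, START_OMEGAS)
}
print(len(runs), "control files")
```

Comparing the likelihoods and BEB sites across all grid cells is what allows a statement about robustness with respect to the starting values and the codon frequency model.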
Nevertheless, one must be aware that a sparse sampling of Mx1 sequences could
reduce the power to detect positively selected sites [397]. Therefore, we extended
the novel bat Mx1 sequences presented in this study by additional sequences from
public databases to achieve a final set of 13 Mx1 cDNAs from three bat families.
These were used for the positive selection analysis. Likelihood ratio tests with
2 degrees of freedom between M7 (null model; positive selection not allowed)
and M8 (positive selection model) showed a significant (P < 0.05) rejection of the
null model for the full Mx1 bat alignment (Fig. 6.6A). We therefore conclude that
positive selection can be adequately detected in this 13-species data set, although
the need for more sequenced bat genes is obvious.
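
The likelihood ratio test with 2 degrees of freedom can be reproduced in a few lines of Python. This is a minimal sketch and not the implementation used in CODEML or PoSeiDon; the log-likelihood values are hypothetical and were chosen so that the test statistic matches values reported in Table 6.1. For 2 degrees of freedom the chi-squared survival function has the closed form exp(-x/2), so no statistics library is needed.

```python
import math


def lrt_df2(lnl_null, lnl_alt):
    """Likelihood ratio test for nested models differing by two parameters.

    Returns the statistic 2 * (lnL_alt - lnL_null) and its p-value under a
    chi-squared distribution with 2 degrees of freedom.
    """
    statistic = max(0.0, 2.0 * (lnl_alt - lnl_null))
    # Survival function of chi-squared with 2 df: P(X > x) = exp(-x / 2).
    p_value = math.exp(-statistic / 2.0)
    return statistic, p_value


# Hypothetical log-likelihoods giving a statistic of 101.39 (M7 vs. M8, full alignment).
stat_full, p_full = lrt_df2(lnl_null=-12050.695, lnl_alt=-12000.0)

# Hypothetical log-likelihoods giving a statistic of 0.09 (a conserved fragment).
stat_frag2, p_frag2 = lrt_df2(lnl_null=-1000.0, lnl_alt=-999.955)

print(f"{stat_full:.2f}", p_full < 0.001)
print(f"{stat_frag2:.2f}", round(p_frag2, 3))
```

A statistic of 0.09 yields a p-value of about 0.956, which agrees with the corresponding frag2 entry in Table 6.1 and confirms that the tabulated p-values are consistent with 2 degrees of freedom.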
All of the procedures described above were generalized and implemented in
a web-based pipeline called PoSeiDon [9], freely available at
http://www.rna.uni-jena.de/poseidon (see Sec. 6.1).

6.2.3 Results and discussion


Sequence comparison of bat Mx1 cDNAs

The nucleotide sequences showed relatively high identities to other members of the
same bat family in a phylogenetic tree analysis (Fig. 6.7A). Moreover, the deduced
Mx1 amino acid sequences showed about 70% identity to Mx1 of other bat families
(Fig. 6.7C). An alignment of the amino acid sequences of all available bat Mx1
sequences (Fig. 6.8) reveals high sequence similarities all over the molecule except
for the highly variable N-terminal part before the first BSE and the loop L4. The
phylogenetic analysis revealed that the bat Mx1 genes form a separate branch within the
mammalian Mx1 tree (Fig. 6.7B). The close similarity to other mammalian Mx1
proteins is reflected by the sequence identities of the bat Mx1 sequences to the
human MxA sequence (Fig. 6.7C). Interestingly, the cDNA clones of all bat Mx1
genes showed allelic variations that were mostly silent, but led in some species to
amino acid changes (Fig. 6.7C). According to our sequence alignment analyses, we
defined the cDNA clones with the most common nucleotide or amino acid variation
at particular positions as allele 1 which was used for further characterization.
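
Sequence identities such as those reported in Fig. 6.7C can be computed from two rows of a multiple alignment. The Python sketch below is an illustration under a stated assumption: columns in which either sequence carries a gap are ignored, which is only one of several common conventions and not necessarily the one used for the figure. The function name and the toy sequences are hypothetical.

```python
def percent_identity(row_a, row_b):
    """Percent identity of two aligned sequences, ignoring columns with a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only the columns where both sequences have a residue.
    pairs = [(a, b) for a, b in zip(row_a, row_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


# Toy example (hypothetical residues): 4 of 5 ungapped columns are identical.
identity = percent_identity("MKT-LV", "MRTALV")
print(identity)
```

Other conventions divide by the full alignment length or by the length of the shorter sequence, so identities from different tools are only comparable when the denominator is known.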


(A) Mx1, F3x4, 13 bat species

Region               M7 vs M8 (χ²)   M7 vs M8 p-value   % sites with ω > 1   avg(ω)
full (aa 1-649)      101.39          < 0.001            6.26                 3.45
frag1 (aa 1-90)      24.05           < 0.001            21.30                2.76
frag2 (aa 91-183)    0.19            0.908              NA                   NA
frag3 (aa 184-649)   112.69          < 0.001            6.59                 3.83

M8 BEB (PP > 0.95/ > 0.99):
full (aa 1-649):     R191; A195; S347; F425; R429; T480; A535; E548; S553; L554; Q556;
                     T557; S558; S559; D562; T565
frag1 (aa 1-90):     A4; T5; D7; P10; A11; S13; H14; P15; G19; G26; L28; L32; N34
frag2 (aa 91-183):   none
frag3 (aa 184-649):  R191; A195; S347; D422; F425; R429; T480; A535; E548; S553; L554;
                     Q556; T557; S558; S559; D562; T565

(B) [Plot: model-averaged support (0 to 0.9) against breakpoint location in nucleotides
(0 to 2000) for the alignment positions 1-1947; supported breakpoints at 270 and 549,
corresponding to the fragment boundaries at aa 90 and 183 of 649.]

(C) [Schematic of the Mx1 primary structure from N to C terminus: B, G domain, B, Stalk,
L4, B; fragment boundaries at aa 90 and 183; rapidly evolving sites marked by asterisks.]

Figure 6.6: (A) Results of the evolutionary analysis for positively selected sites in full length bat
Mx1 (aa 1-649) of 13 bat Mx1 coding sequences as presented in Fig. 6.7. In addition, positive
selection was analyzed for the fragments of the GARD analysis, frag1 (aa 1-90), frag2 (91-183),
frag3 (184-649) with respect to the sequence of M. daubentonii. P-values were achieved by perform-
ing chi-squared tests on twice the difference of the computed log likelihood values of the models
disallowing (M7) or allowing (M8) dN/dS (ω) >1. The BEB column lists rapidly evolving sites
with a dN/dS >1 and a posterior probability >0.95, determined by the Bayes Empirical Bayes
implemented in CODEML. Indels were removed from the alignment prior to evolutionary analyses.
Amino acid positions correspond to the full length Mx1 sequence of M. daubentonii. (B) Evidence
of evolutionary break points in the bat Mx1 alignment of the 13 bat Mx1 sequences (as shown
in Fig. 6.7) identified with GARD. The best-fitting nucleotide substitution model (HKY85) was
applied. The GARD fragments and the supported breakpoints are indicated. (C) Illustration of
the primary structure of bat Mx1 adapted to the crystal structure of MxA [437]: The unstruc-
tured regions (gray), the bundle signaling elements (B, red), the G domain (orange), the stalk
(green, blue) and loop 4 (L4). The rapidly evolving sites are indicated as arrowheads (analysis
of full-length bat Mx1 sequences, Fig. 6.7A) or asterisks (for the GARD fragments, Fig. 6.6B).
Breakpoints of recombination identified by GARD are indicated by vertical dotted lines.


(A), (B) [Phylogenetic trees with bootstrap support values; the clades are labelled
Pteropodidae, Phyllostomidae and Vespertilionidae.]

(C)
Species            Family             Identity to E. helvum Mx1   Identity to human MxA
C. perspicillata   Phyllostomidae     69.1 %                      70.2 %
S. lilium          Phyllostomidae     67.6 %                      69.0 %
M. daubentonii     Vespertilionidae   73.1 %                      76.0 %
P. pipistrellus    Vespertilionidae   71.9 %                      77.3 %
E. helvum          Pteropodidae       -                           73.6 %
H. monstrosus      Pteropodidae       79.4 %                      72.1 %
R. aegyptiacus     Pteropodidae       80.6 %                      71.5 %

Allelic variants listed in the panel: S367G and I415M; L559S and V562L; I593V; no
variant (-) for the remaining four species. [The assignment of these variants to the
species columns is not recoverable from the extracted figure.]

Figure 6.7: (A) Phylogenetic tree of bat Mx1 using a nucleotide sequence alignment with human
MxA as an outgroup. (B) Phylogenetic tree of mammalian Mx1 nucleotide sequences. (C) Allelic
variants of bat Mx1 and their sequence identity to E. helvum Mx1 and human MxA as determined
via multiple protein alignment.


[Alignment image: Mx1 amino acid sequences of the 13 bat species Carollia perspicillata,
Sturnira lilium, Artibeus jamaicensis, Pipistrellus pipistrellus, Eptesicus fuscus,
Myotis brandtii, Myotis lucifugus, Myotis davidii, Myotis daubentonii, Hypsignathus
monstrosus, Rousettus aegyptiacus, Eidolon helvum and Pteropus alecto, shown in
consecutive blocks; lines below the blocks mark the BSE, G domain, stalk and L4 regions.]

Figure 6.8: Amino acid sequence alignment of bat Mx1, based on the alignment on nucleotide
level used for the positive selection analyses. In-frame multiple sequence alignments of all 13 bat
Mx1 cDNA sequences were conducted using TranslatorX and the aligner Muscle. Lines below
indicate structural domains of Mx1 according to the crystal structure of human MxA [437] (BSE:
bundle signaling element; L4: Loop 4). For a complete output of the positive selection analyses
performed with PoSeiDon visit http://www.rna.uni-jena.de/supplements/mx1_bats/full_aln [9].


Antiviral effect against Ebola virus (EBOV)

Within this thesis we present only one antiviral assay from this study as an example,
showing the effect of bat Mx1 against Ebola virus (EBOV). Details about the
antiviral activity of bat Mx1 against other viruses such as Vesicular stomatitis virus,
orthomyxoviruses, and bunyaviruses can be found in Fuchs et al. [7]. A comprehensive
analysis of the whole transcriptome response of human and bat cells to EBOV
(and also Marburg virus) infections is given in Sec. 5.2.

Old World fruit bats were identified as possible reservoirs of EBOV [417, 418].
Therefore, the antiviral activity of bat Mx1 against EBOV was determined using a
newly established VLP assay [447]. An artificial tetracistronic minigenome encoding
the viral VP24, VP40 and GP structural proteins and a Renilla luciferase reporter was
co-transfected with expression plasmids encoding the viral RNA polymerase L, the
phosphoprotein VP35, the transcription initiation factor VP30 and the nucleoprotein
NP. These helper plasmids support transcription and replication of the viral
minigenome as well as packaging and formation of VLPs. To test the effect of Mx1 on
EBOV replication, Mx1 expression plasmids were co-transfected with viral helper
plasmids encoding L-polymerase, VP35, VP30, NP, and the cellular TIM-1 adhesion
factor. The latter was included to enhance susceptibility of the cells to infection
with EBOV-VLPs at 24 h post transfection. Using this approach, a robust Renilla
luciferase expression was detected in the cell lysates (Fig. 6.9). Omission of the L
helper plasmid abolished expression of the luciferase encoded by the minigenome,
indicating the dependency of reporter gene expression on viral polymerase activity.
Co-expression of human MxA or the three bat Mx1 proteins reduced luciferase activity
to about 50% compared to the activity measured in the presence of the respective
inactive mutants, indicating that bat Mx1 are able to control EBOV polymerase activity.

[Bar chart "EBOV: VLP infection": Renilla activity (% relative to empty vector, axis
0 to 100) for the empty vector controls and the indicated Mx constructs, annotated
with the values 0.53, 0.52, 0.51 and 0.50 and significance marks; Western blot panels
for FLAG (Mx1) and actin below.]

Figure 6.9: EBOV VLP infection. 293T cells (12-well format) were co-transfected with
50 ng NP, 375 ng L, 50 ng VP35, 30 ng VP30, 90 ng TIM-1 and 300 ng of the indicated
Mx1 expression plasmids 24 h prior to infection with EBOV VLPs. As a control, cells
were treated with comparable amounts of VLP preparation produced in the absence of
the L construct (-L). At 24 h post infection the activity of the Renilla luciferase
encoded by the viral genome was determined. The empty vector control (without Mx
expression) was set to 100%. Mx1 expression was controlled by Western blot analysis.
Significance was calculated with a one-sided Student's t-test (n=3, *p ≤ 0.05 and
**p ≤ 0.01).


Evolutionary analysis of bat Mx1


After confirming the antiviral capacity of our bat Mx1 proteins against a diverse
range of RNA viruses (for details see [7]), we used our Mx1 cDNA sequences and, in
addition, other publicly available bat Mx1 sequences to analyze the evolution of the
bat Mx1 genes with respect to their phylogenetic relationship. We tested whether
the rate of non-synonymous changes (dN) exceeded that of synonymous changes (dS) using
the PAML software suite [393], with a dN/dS (ω) ratio greater than 1 indicating positive
selection. Details about positive selection detection and recombination analysis can
be found in Sec. 6.1. Analysis of the 13 bat Mx1 sequences (Fig. 6.7A) revealed 16
positively selected sites, mostly in the stalk, with nine of these positions concentrated
in the flexible loop L4 (Fig. 6.6A and C, arrowheads). The calculated likelihood
values and posterior probabilities were robust under varying codon frequency models
(F3x4, F1x4, F61) and initial ω values.
A recent phylogenetic analysis of mammalian Mx1 genes detected multiple recombination
events between Mx paralogs that might influence the evolutionary history of
different Mx fragments and therefore the comparative analysis of Mx orthologs [396].
We analyzed the bat Mx1 sequences for recombination events using GARD, a genetic
algorithm for recombination detection [390], and identified two recombination
breakpoints (Fig. 6.6B). Both breakpoints are located near exon/intron boundaries
when the genomic sequence of M. lucifugus (ENSMLUG00000011447) is used as a
reference. Breakpoint 1 is located at the 5' end of exon 3 and breakpoint 2 at the
3' end of exon 4. Therefore, fragment 1 comprises exons 1 and 2 and the first five
nucleotides of exon 3. Fragment 2 comprises the rest of exon 3 and exon 4, except the
last 15 nt at the 3' end of exon 4. Fragment 3 comprises those 15 nt of exon 4 and
exons 5 to 13. Breakpoints in similar regions of mammalian Mx genes have recently
been described for rodent and primate Mx homologs [396]. We independently repeated
this recombination analysis of the bat Mx1 sequences using the RDP tool [446] and
confirmed the most significant breakpoints with slightly shifted positions compared
to the GARD analysis: 246 instead of 270 and 554 instead of 549. The identified cDNA
fragments code for important functional regions in the Mx structure. Fragment 1
encodes the first helix of the BSE, fragment 2 highly conserved elements of the
GTP-binding domain, and fragment 3 the C-terminal part of the G domain and the stalk
(Fig. 6.6C). However, the breakpoints detected in our bat Mx1 cDNA analysis did not
show a significant topological incongruence based on a Kishino-Hasegawa (KH) test [392]
applied after the GARD analysis. Most likely, such KH-insignificant breakpoints
arise due to variation in branch lengths between fragments. This could indicate
forms of recombination or other processes such as heterotachy.
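
The relation between the GARD breakpoints given in nucleotides and the amino acid ranges of the fragments in Fig. 6.6 (aa 1-90, 91-183 and 184-649) is plain arithmetic on a 1947 nt alignment, that is 649 codons. The following Python sketch makes this step explicit. The function name is hypothetical, and the sketch assumes that a breakpoint position marks the last nucleotide of the fragment to its left.

```python
def fragments_from_breakpoints(alignment_length_nt, breakpoints_nt):
    """Convert nucleotide breakpoints into 1-based, inclusive codon ranges.

    A breakpoint at position p is taken as the last nucleotide of the fragment
    to its left (assumption of this sketch), so p must be a multiple of 3.
    """
    bounds = [0] + sorted(breakpoints_nt) + [alignment_length_nt]
    fragments = []
    for left, right in zip(bounds, bounds[1:]):
        if left % 3 != 0 or right % 3 != 0:
            raise ValueError("position does not fall on a codon boundary")
        fragments.append((left // 3 + 1, right // 3))
    return fragments


# GARD breakpoints at 270 and 549 in the 1947 nt bat Mx1 alignment.
gard_fragments = fragments_from_breakpoints(1947, [270, 549])
print(gard_fragments)
```

The helper deliberately raises an error for positions that do not fall on a codon boundary, such as the RDP breakpoint at 554, because splitting a codon-based analysis there would first require a rounding rule.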
An extended analysis for positions under positive selection was performed separately
for the fragments obtained by the GARD analysis. Interestingly, fragment 1 now
showed several residues under positive selection within the N-terminal region
in addition to the positions identified in the C-terminal fragment 3 (Fig. 6.6A, C,
asterisks). We performed the same calculations for the fragments resulting from the
RDP analysis, with no impact on the significant positively selected sites identified
above. Again, fragment 2 showed strong purifying selection of all residues, attesting
to its evolutionary conservation. These two hot spots of positive selection might be
of specific importance for evolutionary adaptation processes in the host-pathogen
arms race between bats and their viral infectious agents.

6.2.4 Conclusions
Bats are recognized as an important reservoir of potentially zoonotic viruses [448].
However, it is unclear how bats deal with the infection, replication and possible
persistence of these viruses which often induce severe to fatal infections in humans
upon zoonotic transmission. Here we evaluated the role of the IFN-induced bat
Mx1 proteins in the control of diverse RNA viruses and found a broad antiviral
spectrum similar to the human MxA ortholog against orthomyxo-, rhabdo-, filo-,
and bunyaviruses (details about each virus can be found in Fuchs et al. [7]). Our
phylogenetic analysis of Mx1 proteins grouped the bat Mx1 sequences according
to their affiliation with the three different bat families [449]. Within the individual
families, bat Mx1 showed around 80% sequence identity. Between the families, the
bat Mx1 sequences displayed a reduced (around 70%) identity, which is comparable
with the identity to other mammalian Mx1 proteins, e.g. the human MxA. This
may reflect the long, about 40 to 50 million years, independent evolutionary history
and diversification of Mx1 genes in the individual Chiroptera families [407].
A detailed examination of the mammalian Mx1 sequences revealed a high pro-
portion of invariant amino acid residues under purifying selection, supporting the
view of Mx proteins as dynamin-like molecular machines with a sophisticated, highly
conserved structure [437] that allows evolution-driven variations only in a few flex-
ible regions [396, 450]. Accordingly, when analyzing the bat Mx1 sequences, we
identified residues under positive selection in two variable and surface exposed re-
gions. Most residues under positive selection were identified in the N terminus ahead
of the first BSE helix and in the C-terminal loop L4 (Fig. 6.6). The accumulation
of residues under positive selection in these two surface exposed, variable regions of
Mx1 [437] indicates the structural flexibility of these two regions compared to the
overall high structural conservation of the majority of the bat Mx1 molecules. In-
terestingly, positively selected positions in the N terminus were identified only when
individual gene segments were analyzed. Using GARD analysis, we detected two
breakpoints resulting in three gene segments, suggesting exchanges between ancient
bat Mx paralogs by recombination events (Fig. 6.6B and C). A comparable analysis
using the orthologous primate MxA genes identified clusters of positively selected
residues in the L4 loop, which has been previously identified as a determinant of
Mx antiviral specificity [450]. Positively selected positions in the N-terminal region
were identified in the paralogous primate MxB genes [396]. The various positively
selected residues in these two regions of bat Mx1 molecules argue for a long-standing
conflict of bats with diverse viral agents [451], which circulated or are still present
among bats.
The overall efficient inhibition of various viral pathogens by bat Mx1 indicates
that this ISG plays a central role in the control of viral replication in bats. Of note,
we cannot speculate about the identity of the pathogens that drove the ancient arms
race with bat Mx1, resulting in positively selected positions in today’s bat species.
Since the bat Mx1 cDNAs showed rather comparable antiviral activities against the
limited spectrum of RNA viruses employed in the present study, our results cannot
fully explain the evolution of bat Mx1 proteins. Follow-up studies should extend the
spectrum of viruses tested for bat Mx1 sensitivity and refine the connection of
Mx1 antiviral capacity with the genetic evolution of these important components of
the innate antiviral defense.

Chapter 7

Conclusions and future perspectives

With its unprecedented throughput in combination with high scalability,
parallelization, high sensitivity and speed, Next-Generation Sequencing (NGS) allows
researchers to study biological systems and processes at a level never possible be-
fore. Complex genomic and transcriptomic research questions require a depth of
information far beyond the scope of traditional DNA sequencing technologies like
Sanger Sequencing. NGS has filled the gap and emerged as an everyday research
tool to address versatile biological questions.
However, the large amount of data produced by current NGS systems needs to be
evaluated, stored, and processed efficiently. Finally, the data needs to be presented
in a way that other researchers can comprehensively interpret and constructively
work with those data. Therefore, NGS has great power, but the generation of
meaningful results out of a tremendous pool of short sequencing reads often remains
an opaque process – comparable to a Black Box – and causes various problems
for researchers.
In Chapter 2, we built the basis for the following chapters by entering the Black
Box of NGS and discussing basic approaches, methodologies, and algorithms.

Genome Assembly. In Chapter 3, we have presented the assemblies of two rather
small, bacterial-sized genomes. The generation of such essential reference genomes
and their corresponding annotations is a crucial task, because subsequent analyses
heavily rely on those references.
In the first part of this chapter, we have presented the full-length genome assem-
bly of Chlamydia gallinacea type strain 08-1274/3 [2]. The availability of the full
genomic representation of this Chlamydia type strain is of high importance for the
whole Chlamydia community, because the sequence of this reference strain will be
helpful in many future experiments and comparative studies. The novel sequence
data of C. gallinacea type strain 08-1274/3 and its plasmid p1274 have been de-
posited in NCBI GenBank with the accession numbers CP015840 and CP015841,
respectively.
In the second part of this chapter, we have given comprehensive insights into
the Mycobacterium avium subsp. paratuberculosis genome using new NGS data of
a sheep strain (JIII-386) from Germany [1, 8]. In comparison to the C. gallinacea
genome assembly presented before, we were not able to achieve a complete genomic
representation without any gaps for this bacterium. We conducted an extensive
annotation of protein- and non-coding genes for this new assembly and seven other
Mycobacteria genomes. We showed that the combination of different annotation
tools can improve the overall annotation. Further, we comprehensively compared
all eight Mycobacteria genomes and provided deep insights into the gene composition
and phylogeny of this pathogen.
In the future, the assembled genomes and corresponding annotations of protein-
and non-coding genes of both bacteria will be invaluable for the Chlamydia and
Mycobacteria communities. For the Mycobacteria genome assembly, we showed that
the usage of different assembly tools and parameter settings can help to improve
the overall assembly. Currently, the approach used for the Chlamydia
assembly is being adapted to the assembly of further Chlamydia strains, which will
form the basis of a new comparative genome study.

Transcriptome Assembly. In Chapter 4, we switched from genome assembly
to the assembly of RNA-Seq data. Many current NGS projects focus on the con-
struction of a transcriptome assembly for annotation and expression quantification
instead of constructing a full genomic representation for the species of interest. If
no reference is available, a de novo transcriptome assembly can be constructed from
the RNA-Seq data.
In the first part of this chapter, we comprehensively compared the performance
of ten de novo assembly tools on nine different RNA-Seq data sets originating from
different Illumina sequencing setups and various kingdoms of life [16]. We calculated
more than 200 single assemblies and evaluated the performance of each tool with the
help of different metrics. We found that a clear definition of powerful evaluation
metrics for de novo transcriptome assemblies is an important task and that the
transcriptomic community currently lacks a unified standard. We selected 20 metrics
that we consider powerful and identified specific assembly tools that generally
performed well on most of the tested data sets. In summary,
Trans-ABySS [61], Trinity [62], SOAPdenovo-Trans [63],
and SPAdes [54] performed best. Based on our results, we recommend executing
multiple assembly tools with different parameter settings (most importantly different
k-mer values), and selecting and combining the best-performing assemblies in a merged
assembly for further analyses.
However, the application of multiple assembly tools and the automatic selection
of the best assemblies and/or contigs for the clustering step is still a challenging
task. In the second part of this chapter, we have presented a possible transcrip-
tome assembly pipeline, utilizing multiple tools and parameter settings [17]. As a
proof-of-concept, we evaluated the performance of different clustering approaches
on the assemblies of a mouse RNA-Seq data set. We found that an appropriate
preselection of the best-performing assembly results based on various metrics can
greatly improve the transcript clustering and finally the whole de novo transcriptome
assembly.
In the near future, we will implement this de novo cluster-transcriptome assem-
bly pipeline for short-read RNA-Seq data and extensively test its performance. The
pipeline will be implemented in a modular way so that newly emerging assembly
tools can be easily integrated. To further improve the selection of the best assembly
results, we will adjust and extend our set of evaluation metrics. For example,
metrics like the total number of transcripts in an assembly or the N50 value
can influence the metric score without reflecting assembly quality. The N50 value
can be easily increased by adding nucleotide stretches of low complexity to the
contigs of an assembly.
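The sensitivity of the N50 value to such padding can be illustrated with a minimal Python sketch. The contig lengths and the amount of padding are invented for illustration, and the function below is a generic N50 implementation, not the code used in our evaluation pipeline.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    The N50 is the length L such that all contigs of length >= L
    together cover at least half of the total assembly length.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical toy assembly (contig lengths in nucleotides).
contigs = [500, 400, 300, 200, 100]
baseline = n50(contigs)

# Append 1,000 low-complexity bases (e.g. a poly-A stretch) to every contig.
# No biological information is added, but the N50 rises.
padded = [length + 1000 for length in contigs]
inflated = n50(padded)

print(baseline, inflated)
```

In this toy example the N50 rises although the padded contigs carry no additional biological signal, which is why the N50 should not be used as a stand-alone quality criterion.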
Furthermore, we can easily extend our evaluation pipeline and add novel de novo
assembly tools to further improve and complement our comparisons. For example,
a new de novo transcriptome assembly tool called IsoTree [452] was presented in
May 2017 and will be incorporated. If the tool performs well, it will be integrated
in our assembly pipeline.

Differential Gene Expression. Another application of NGS data (especially of
RNA-Seq data) involves the estimation of transcript abundances and the comparison
of expression levels of biological samples from different conditions. This information
can be used to detect significantly differentially expressed genes (DEGs). Such
applications are generally based on the mapping of short RNA-Seq reads to a reference
genome (Chapter 3) or transcriptome (Chapter 4). Reads mapping to a certain
feature, such as a gene, can be quantified and the counts (representing expression levels)
can be compared between samples of different conditions. For example, a control
condition can be compared with a virus-infected sample to identify genes that play
an important role during the infection.
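The comparison of counts between two conditions can be sketched as follows. This is a simplified illustration under stated assumptions: the gene names and counts are invented, library sizes are normalized to counts per million (CPM), and a pseudocount avoids division by zero. It computes fold changes only and does not replace the statistical testing performed by dedicated DEG tools.

```python
import math


def counts_per_million(counts):
    """Scale the raw read counts of one sample to counts per million (CPM)."""
    total = sum(counts.values())
    return {gene: count * 1_000_000 / total for gene, count in counts.items()}


def log2_fold_change(control, treated, pseudocount=1.0):
    """Return log2(treated / control) per gene, computed on CPM values."""
    cpm_control = counts_per_million(control)
    cpm_treated = counts_per_million(treated)
    return {
        gene: math.log2(
            (cpm_treated[gene] + pseudocount) / (cpm_control[gene] + pseudocount)
        )
        for gene in cpm_control
    }


# Hypothetical read counts per gene for a control and an infected sample.
control = {"geneA": 100, "geneB": 300, "geneC": 600}
infected = {"geneA": 800, "geneB": 600, "geneC": 600}

lfc = log2_fold_change(control, infected)
for gene, value in sorted(lfc.items()):
    print(gene, round(value, 2))
```

Note that geneB doubles in raw counts but shows no change after normalization, because the infected library is twice as large. Without biological replicates, such fold changes cannot be assigned reliable significance values.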
In Chapter 5, we have presented two comprehensive DEG studies. In the first
part, we investigated the differential effects of vitamins A (atRA) and D (vitD) on
the transcriptional landscape of human monocytes during the infection with
Aspergillus fumigatus, Candida albicans and Escherichia coli [5, 6]. This study
represents a straightforward DEG analysis, because reference genomes and appropriate
gene annotations of the human and the pathogens are available. Therefore, there
was no need to construct a transcriptome assembly beforehand. In this study, we
have defined the whole immunomodulatory role of atRA and vitD during the infec-
tion with A. fumigatus, C. albicans and E. coli. Strikingly, both vitamins showed
an unexpected ability to counteract the pathogen-induced transcriptional responses.
Moreover, we investigated the possible direct and indirect mechanisms of vitamin-
mediated regulation of the immune response. Our findings highlight the importance
of vitamin monitoring in critically ill patients. The results of this RNA-Seq study
underpin the potential of atRA and vitD as therapeutic options for anti-inflammatory
treatment, which should and will be investigated in more detail in the future.
In the second part, we studied the transcriptional changes during an infection
with two filoviruses (Ebola and Marburg virus) which can result in a severe and often
fatal infection in humans [4]. However, bats are natural hosts and survive filovirus
infections without obvious symptoms. The molecular basis of this striking difference
in the response to filovirus infections is not well understood. Within this section, we
reported a systematic overview of DEGs, activity motifs and pathways in human and
bat cells infected with Ebola and Marburg viruses, and we demonstrated that the
replication of filoviruses is more rapid in human cells than in bat cells. We found out
that the most strongly regulated genes upon filovirus infection are chemokine ligands
and transcription factors. Based on this unique RNA-Seq data set, we provided a
resource that can be used by other researchers to identify cellular responses that
might allow bats to survive filovirus infections.
The comprehensive study presented here was only manageable in such a short
time with the help of 30 experienced scientists who came together in 2014 to “Fight
against Ebola” and manually investigated a tremendous number of human and bat
genes. During this “hackathon” organized by our group, 1,500 genes (7.5 % of human
protein-coding genes) were investigated in great detail; they form the backbone of
this outstanding study.
In comparison to the first DEG study presented in this chapter, the filovirus
project confronted us with much more challenging tasks. At the start of this project,
no genome of the fruit bat Rousettus aegyptiacus was available, so we decided to
construct a comprehensive de novo transcriptome assembly with various tools (as
described in Chapter 4). We annotated this assembly and used it to find
significant DEGs for this bat species. Furthermore, the lack of biological replicates
and the high replication rate of the Ebola virus made this project one of the most
challenging ones presented in this thesis. On the other hand, we had the great
opportunity to contribute to the development of antiviral drugs during the 2014
West African Ebola outbreak. This project closely interlinked the analysis of
genome reference data with the analysis of transcriptome
assembly data for human and bat. We showed by example that DEGs identified with
a genomic and a transcriptomic approach are actually comparable. Currently, the
best candidate genes and pathways identified by our study are being investigated
further in the wet lab of Prof. Dr. Stephan Becker in Marburg.

Single Nucleotide Investigations. NGS studies such as those presented in Chapter 5
can provide comprehensive insights into a broad range of biological processes
and help to tackle various questions depending on the type of performed analysis.
Thousands of genes can be tested for differential expression in parallel. However,
besides those high throughput workflows and whole genome/transcriptome studies,
approaches that focus on specific genes or even single nucleotide positions are still
of high importance. In fact, NGS should be seen as a helpful tool to identify top
candidate genes and kingpins for further, more detailed investigations.
In Chapter 6, we have delved even deeper into the Black Box by taking a closer
look at more restricted data sets, single genes and specific nucleotide positions. In
the first part, we have presented our pipeline for detection of evolutionary recombi-
nation events and positively selected sites in protein-coding sequences. This pipeline,
called PoSeiDon [9], was further implemented as an easy-to-use web server and is
now publicly available at www.rna.uni-jena.de/poseidon. PoSeiDon auto-
matically builds a multiple sequence alignment, estimates a best-fitting substitution
model and performs a recombination analysis followed by the construction of all cor-
responding phylogenies. Finally, we detect significant positively selected sites with
various models for the full alignment and possible recombination fragments. With
the help of our web service, researchers can easily detect positively selected sites
and can get insights into the evolution of a specific gene. For example, this new
knowledge might be helpful to develop countermeasures against pathogens that are
in an ‘arms race’ with their host.
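A common way to decide whether a model that allows positively selected sites fits an alignment significantly better than a model without them is a likelihood-ratio test between the two nested models. The following Python sketch illustrates this test under explicit assumptions: the two models differ by two free parameters, the test statistic is compared against a chi-square distribution with two degrees of freedom, and the log-likelihood values are invented. It is not the PoSeiDon implementation.

```python
import math


def lrt_df2(lnl_null, lnl_alt):
    """Likelihood-ratio test for nested models differing by two parameters.

    Returns the test statistic 2 * (lnL_alt - lnL_null) and its p-value.
    For two degrees of freedom, the chi-square survival function has the
    closed form exp(-x / 2), so no external library is required.
    """
    statistic = 2.0 * (lnl_alt - lnl_null)
    if statistic <= 0:
        # The alternative model does not fit better than the null model.
        return 0.0, 1.0
    return statistic, math.exp(-statistic / 2.0)


# Hypothetical log-likelihoods of a null model (no positively selected
# site class) and an alternative model (with such a class).
stat, p_value = lrt_df2(lnl_null=-2350.0, lnl_alt=-2342.0)
print(stat, p_value)
```

Only if the alternative model is significantly better is it meaningful to inspect which individual alignment positions are likely to be under positive selection.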

In the second part of this chapter, we extensively discussed the evolution and
antiviral specificity of a single gene of bats coding for the interferon-induced Mx
protein and its reaction against an infection with Ebola-, Influenza-, and other
RNA viruses [7]. As already presented in Sec. 5.2, bats are a natural reservoir
for various viruses, including zoonotic pathogens like Ebola or Rabies virus, that
rarely cause clinical symptoms in bats. It has been speculated that the interferon
system might play a key role in controlling viral replication in bats. In this project,
we showed that the interferon-induced Mx proteins are indeed key antiviral factors
in bats and have co-evolved with bat-borne viruses. For the first time, we evaluated
a large set of bat Mx1 proteins spanning three major bat families for their antiviral
potential. We described their phylogenetic relationship by revealing patterns of
positive selection. Our pipeline PoSeiDon was applied to detect recombination
events and positively selected sites in the bat Mx1 gene [7], as well as in the Mx1
gene of rodents [18].
We have already collected multiple ideas for improving the PoSeiDon web
server. First of all, we plan to implement branch-site models into the pipeline. At
the moment, we can only test whether a whole gene is under positive selection and
if so, we can statistically determine which single positions in the alignment have a
high impact on the positive selection. With the integration of branch-site models, we
would be able to detect if specific branches in a phylogenetic tree are under positive
selection. Also, we will provide a version of PoSeiDon for local installation and
execution. We plan to distribute a Docker container for Linux/Windows/MacOS,
which can be easily downloaded and executed without the need of locally installing
and compiling all required dependencies.
Furthermore, one objective of all projects presented throughout this thesis was
to find appropriate visualizations for the data and results. Whenever reasonable,
the obtained results are accompanied by comprehensive and interactive electronic
supplements to encourage other researchers to investigate the results in a productive
and transparent way. The general idea is to provide functions that allow a quick
examination of large amounts of data, to expose trends and to find patterns and
correlations within the data. Effective data visualization is an important part of the
decision-making process and helps to gain further insights and to picture possible
answers to certain life-science questions.
Besides the short-read NGS data this thesis is mainly based on, sequencing
technologies have emerged in the last few years that are able to produce reads of
tremendous length [29]. Such long-read sequencing technologies, as developed by
PacBio and Nanopore, overcome the length limitation of other NGS approaches like
Illumina, and are able to produce reads of a length of >40,000 bases. As those
techniques remain considerably more expensive and have lower throughput and higher
error rates than other platforms, the universal adoption of these technologies is still
limited. However, costs and error rates are continuously decreasing and with the
emergence of such new technologies, existing problems will be exacerbated and new
problems will arise, which need to be computationally tackled in the future.
In this thesis, we have presented a broad variety of bioinformatic approaches,
not exclusively related to NGS data. From far away, all the main results presented
here (Chapters 3–6) deal with different species, data sets, computational topics and
biological questions. However, we showed how to connect the different approaches
and methods in order to obtain a more comprehensive picture to answer certain
biological questions. Besides the large number of results presented in this
thesis, one of our main aims was to encourage the reader to really look into the
data and to not only trust reported significance values and fold changes obtained
from an NGS study. Combining different approaches from complementary fields, such
as genomics, transcriptomics and single nucleotide investigations, has the greatest
potential to produce comprehensive and helpful results, and to bring some light
into the Dark Art of Next-Generation Sequencing.

Bibliography

[1] Petra Möbius, Martin Hölzer, Marius Felder, Gabriele Nordsiek, Marco Groth,
Heike Köhler, Kathrin Reichwald, Matthias Platzer, and Manja Marz. “Compre-
hensive insights in the Mycobacterium avium subsp. paratuberculosis genome using
new WGS data of sheep strain JIII-386 from Germany”. In: Genome biology and
evolution 7.9 (2015), pp. 2585–2601.
[2] Martin Hölzer, Karine Laroucau, Heather Huot Creasy, Sandra Ott, Fabien Vo-
rimore, Patrik M Bavoil, Manja Marz, and Konrad Sachse. “Whole-genome se-
quence of Chlamydia gallinacea type strain 08-1274/3”. In: Genome Announcements
4.4 (2016), e00708–16.
[3] Abdullah H Sahyoun, Martin Hölzer, Frank Jühling, Christian Höner zu Siederdis-
sen, Marwa Al-Arab, Kifah Tout, Manja Marz, Martin Middendorf, Peter F Stadler,
and Matthias Bernt. “Towards a comprehensive picture of alloacceptor tRNA re-
molding in metazoan mitochondrial genomes”. In: Nucleic acids research 43.16
(2015), pp. 8044–8056.
[4] Martin Hölzer, Verena Krähling, et al. “Differential transcriptional responses to
Ebola and Marburg virus infection in bat and human cells”. In: Scientific Reports
6 (2016), p. 34589.
[5] Konstantin Riege, Martin Hölzer, Tilman E. Klassert, Emanuel Barth, Julia
Bräuer, Maximilian Collatz, Franziska Hufsky, Nelly B. Mostajo, Magdalena Stock,
Bertram Vogel, Hortense Slevogt, and Manja Marz. “Massive Effect on LncRNAs
in Human Monocytes During Fungal and Bacterial Infections and in Response to
Vitamins A and D”. In: Scientific Reports 7 (2017), p. 40598.
[6] Tilman E. Klassert, Julia Bräuer, Martin Hölzer, Magdalena Stock, Konstantin
Riege, Christina Zubiría-Barrera, Mario M. Müller, Silke Rummler, Christine Skerka,
Manja Marz, and Hortense Slevogt. “Differential Effects of vitamins A and D on
the Transcriptional Landscape of Human Monocytes during Infection”. In: Scientific
Reports 7 (2017), p. 40599.
[7] Jonas Fuchs, Martin Hölzer, Mirjam Schilling, Corinna Patzina, Andreas Schoen,
Thomas Hoenen, Gert Zimmer, Manja Marz, Friedemann Weber, Marcel A. Müller,
and Georg Kochs. “Evolution and antiviral specificity of interferon-induced Mx
proteins of bats against Ebola-, Influenza-, and other RNA viruses.” In: Journal of
Virology (2017), JVI–00361.
[8] Petra Möbius, Elisabeth Liebler-Tenorio, Martin Hölzer, and Heike Köhler. “Eval-
uation of associations between genotypes of Mycobacterium avium subsp. paratu-
berculosis and presence of intestinal lesions characteristic of paratuberculosis”. In:
Veterinary Microbiology 201 (2017), pp. 188–194.

[9] Martin Hölzer and Manja Marz. “PoSeiDon: a web server for the detection of
evolutionary recombination events and positive selection”. In: Bioinformatics (sub-
mitted).
[10] Petra Möbius, Gabriele Nordsiek, Martin Hölzer, Michael Jarek, Manja Marz,
and Heike Köhler. “Complete genome sequence of JII-1961 – a bovine Mycobac-
terium avium subsp. paratuberculosis field isolate from Germany”. In: Genome An-
nouncements (2017), submitted.
[11] The RNA tools and software consortium. “A community-driven catalog of RNA
bioinformatics tools and their ontologies”. In: preparation (2017).
[12] Nelly B Mostajo, Martin Hölzer, Abdullah H Sahyoun, Verena Krähling, Stephan
Becker, and Manja Marz. “A comprehensive annotation of non-coding RNAs in
bats”. In: preparation (2017).
[13] Sebastian Bartschat, Clara Bermudez-Santana, Anke Busch, Alexander Donath,
Jan Engelhardt, Andreas R Gruber, Jana Hertel, Michael Hiller, Martin Hölzer,
Franziska Hufsky, Emanuel Barth, Frank Jühling, et al. “Comparative Analysis of
Non-Coding RNAs in Nematodes”. In: preparation (2017).
[14] Martin Hölzer, Manja Marz, and Daniel Steinbach. “Elucidation of the molecular
mechanisms of progression of the non-muscle invasive urothelial carcinoma of the
urinary bladder (NMIBC) and identification of possible prognostic markers and
therapeutic targets by exom and 3’/5’ UTR mutation analyzes”. In: preparation
(2017).
[15] Martin Hölzer, Friedemann Weber, and Manja Marz. “Description of the tran-
scriptomic landscape of the microbat Myotis daubentonii in response to interferon
stimulation and an infection with the Rift Valley fever virus”. In: Journal of Virology
(in preparation).
[16] Martin Hölzer and Manja Marz. “The Dark Art of de novo Transcriptome As-
sembly: A Comprehensive Across-species Comparison of Short Read RNA-Seq As-
semblers”. In: preparation (2017).
[17] Martin Hölzer and Manja Marz. “GOAssembler: A Method Pipeline for the Con-
struction, Evaluation and Clustering of de novo Transcriptome Assemblies”. In:
preparation (2018).
[18] Barbara Müther, Martin Hölzer, Manja Marz, and Georg Kochs. “Evolution and
antiviral specificity of interferon-induced Mx proteins in rodents”. In: Journal of
Virology (in preparation).
[19] Martin Hölzer, Ruman Gerst, and Manja Marz. “PCAGO: An interactive web
service to analyze RNA-Seq data with principal component analysis”. In: prepara-
tion (2017).
[20] Abdullah H Sahyoun. “Computational investigations into the evolution of mito-
chondrial genomes”. MA thesis. Leipzig: University Leipzig, 2015.
[21] Jon Ison, Matúš Kalaš, Inge Jonassen, Dan Bolser, Mahmut Uludag, Hamish Mc-
William, James Malone, Rodrigo Lopez, Steve Pettifer, and Peter Rice. “EDAM:
an ontology of bioinformatics operations, types of data and identifiers, topics and
formats”. In: Bioinformatics 29.10 (2013), pp. 1325–1332.

[22] Christine Durinx, Jo McEntyre, Ron Appel, Rolf Apweiler, Mary Barlow, Niklas
Blomberg, Chuck Cook, Elisabeth Gasteiger, Jee-Hyub Kim, Rodrigo Lopez, et al.
“Identifying ELIXIR Core Data Resources”. In: F1000Research 5 (2016).
[23] Mark D Adams, Jenny M Kelley, et al. “Complementary DNA sequencing: ex-
pressed sequence tags and human genome project”. In: Science 252.5013 (1991),
p. 1651.
[24] Francis S Collins, Michael Morgan, and Aristides Patrinos. “The Human Genome
Project: lessons from large-scale biology”. In: Science 300.5617 (2003), pp. 286–290.
[25] Jeremy Schmutz, Jeremy Wheeler, Jane Grimwood, Mark Dickson, Joan Yang,
Chenier Caoile, Eva Bajorek, Stacey Black, Yee Man Chan, Mirian Denys, et al.
“Quality assessment of the human genome sequence”. In: Nature 429.6990 (2004),
pp. 365–368.
[26] Jay Shendure and Hanlee Ji. “Next-generation DNA sequencing”. In: Nature biotech-
nology 26.10 (2008), pp. 1135–1145.
[27] Michael L Metzker. “Sequencing technologies – the next generation”. In: Nature
reviews genetics 11.1 (2010), pp. 31–46.
[28] HPJ Buermans and JT Den Dunnen. “Next generation sequencing technology: ad-
vances and applications”. In: Biochimica et Biophysica Acta (BBA)-Molecular Basis
of Disease 1842.10 (2014), pp. 1932–1941.
[29] Sara Goodwin, John D McPherson, and W Richard McCombie. “Coming of age:
ten years of next-generation sequencing technologies”. In: Nature Reviews Genetics
17.6 (2016), pp. 333–351.
[30] Anthony Rhoads and Kin Fai Au. “PacBio sequencing and its applications”. In:
Genomics, proteomics & bioinformatics 13.5 (2015), pp. 278–289.
[31] Ali Mortazavi, Brian A Williams, Kenneth McCue, Lorian Schaeffer, and Barbara
Wold. “Mapping and quantifying mammalian transcriptomes by RNA-Seq”. In:
Nature methods 5.7 (2008), pp. 621–628.
[32] Zhong Wang, Mark Gerstein, and Michael Snyder. “RNA-Seq: a revolutionary tool
for transcriptomics”. In: Nature Reviews Genetics 10.1 (2009), pp. 57–63.
[33] David C Corney. “RNA-seq using next generation sequencing”. In: Mater Methods
3 (2013), p. 203.
[34] Nicholas J Croucher, Maria C Fookes, Timothy T Perkins, Daniel J Turner, Samuel
B Marguerat, Thomas Keane, Michael A Quail, Miao He, Sammey Assefa, Jürg
Bähler, et al. “A simple method for directional transcriptome sequencing using
Illumina technology”. In: Nucleic acids research 37.22 (2009), e148–e148.
[35] Peter JA Cock, Christopher J Fields, Naohisa Goto, Michael L Heuer, and Peter M
Rice. “The Sanger FASTQ file format for sequences with quality scores, and the
Solexa/Illumina FASTQ variants”. In: Nucleic acids research 38.6 (2010), pp. 1767–
1771.
[36] John C Marioni, Christopher E Mason, Shrikant M Mane, Matthew Stephens, and
Yoav Gilad. “RNA-seq: an assessment of technical reproducibility and comparison
with gene expression arrays”. In: Genome research 18.9 (2008), pp. 1509–1517.
[37] Paul L Auer and RW Doerge. “Statistical design and analysis of RNA sequencing
data”. In: Genetics 185.2 (2010), pp. 405–416.

[38] Kimberly Robasky, Nathan E Lewis, and George M Church. “The role of replicates
for error mitigation in next-generation sequencing”. In: Nature Reviews Genetics
15.1 (2014), pp. 56–62.
[39] Alexander Dobin, Carrie A Davis, Felix Schlesinger, Jorg Drenkow, Chris Zaleski,
Sonali Jha, Philippe Batut, Mark Chaisson, and Thomas R Gingeras. “STAR: ul-
trafast universal RNA-seq aligner”. In: Bioinformatics 29.1 (2013), pp. 15–21.
[40] Evguenia Kopylova, Laurent Noé, and Hélène Touzet. “SortMeRNA: fast and ac-
curate filtering of ribosomal RNAs in metatranscriptomic data”. In: Bioinformatics
28.24 (2012), pp. 3211–3217.
[41] Jack A Gilbert and Margaret Hughes. “Gene expression profiling: metatranscrip-
tomics”. In: High-Throughput Next Generation Sequencing: Methods and Applica-
tions (2011), pp. 195–205.
[42] Matthew Kanke, Jeanette Baran-Gale, Jonathan Villanueva, and Praveen Sethupa-
thy. “miRquant 2.0: an Expanded Tool for Accurate Annotation and Quantification
of MicroRNAs and their isomiRs from Small RNA-Sequencing Data”. In: Journal
of Integrative Bioinformatics 13.5 (2016), p. 307.
[43] S Andrews et al. “FastQC: A quality control tool for high throughput sequence
data”. In: Reference Source (2010).
[44] Robert Schmieder and Robert Edwards. “Quality control and preprocessing of
metagenomic datasets.” eng. In: Bioinformatics 27.6 (Mar. 2011), pp. 863–864.
doi: 10.1093/bioinformatics/btr026.
[45] Marcel Martin. “Cutadapt removes adapter sequences from high-throughput
sequencing reads”. In: EMBnet.journal 17.1 (2011), pp. 10–12.
[46] Cole Trapnell, Brian A. Williams, Geo Pertea, Ali Mortazavi, Gordon Kwan, Mar-
ijke J. van Baren, Steven L. Salzberg, Barbara J. Wold, and Lior Pachter. “Tran-
script assembly and quantification by RNA-Seq reveals unannotated transcripts
and isoform switching during cell differentiation.” eng. In: Nat Biotechnol 28.5
(May 2010), pp. 511–515. doi: 10.1038/nbt.1621.
[47] Mitchell Guttman, Manuel Garber, Joshua Z Levin, Julie Donaghey, James Robin-
son, Xian Adiconis, Lin Fan, Magdalena J Koziol, Andreas Gnirke, Chad Nusbaum,
et al. “Ab initio reconstruction of cell type-specific transcriptomes in mouse reveals
the conserved multi-exonic structure of lincRNAs”. In: Nature biotechnology 28.5
(2010), pp. 503–510.
[48] Zhenyu Li, Yanxiang Chen, Desheng Mu, Jianying Yuan, Yujian Shi, Hao Zhang,
Jun Gan, Nan Li, Xuesong Hu, Binghang Liu, et al. “Comparison of the two major
classes of assembly algorithms: overlap–layout–consensus and de–bruijn–graph”. In:
Briefings in functional genomics 11.1 (2012), pp. 25–37.
[49] Phillip EC Compeau, Pavel A Pevzner, and Glenn Tesler. “How to apply de Bruijn
graphs to genome assembly”. In: Nature biotechnology 29.11 (2011), pp. 987–991.
[50] Pavel A Pevzner, Haixu Tang, and Michael S Waterman. “An Eulerian path ap-
proach to DNA fragment assembly”. In: Proceedings of the National Academy of
Sciences 98.17 (2001), pp. 9748–9753.
[51] Daniel R Zerbino and Ewan Birney. “Velvet: algorithms for de novo short read
assembly using de Bruijn graphs”. In: Genome Research 18.5 (2008), pp. 821–829.

[52] Jared T Simpson, Kim Wong, Shaun D Jackman, Jacqueline E Schein, Steven JM
Jones, and İnanç Birol. “ABySS: a parallel assembler for short read sequence data”.
In: Genome Research 19.6 (2009), pp. 1117–1123.
[53] Ruibang Luo et al. “SOAPdenovo2: an empirically improved memory-efficient short-
read de novo assembler”. In: GigaScience 1.1 (2012), p. 18.
[54] Anton Bankevich, Sergey Nurk, Dmitry Antipov, Alexey A Gurevich, Mikhail
Dvorkin, Alexander S Kulikov, Valery M Lesin, Sergey I Nikolenko, Son Pham,
Andrey D Prjibelski, et al. “SPAdes: A new genome assembly algorithm and its
applications to single-cell sequencing”. In: Journal of Computational Biology 19.5
(2012), pp. 455–477.
[55] Sergey Nurk, Anton Bankevich, Dmitry Antipov, Alexey Gurevich, Anton Ko-
robeynikov, Alla Lapidus, Andrey Prjibelsky, Alexey Pyshkin, Alexander Sirotkin,
Yakov Sirotkin, et al. “Assembling Genomes and mini-metagenomes from highly
chimeric reads”. In: Research in Computational Molecular Biology. Springer. 2013,
pp. 158–170.
[56] İnanç Birol, Shaun D Jackman, Cydney B Nielsen, Jenny Q Qian, Richard Varhol,
Greg Stazyk, Ryan D Morin, Yongjun Zhao, Martin Hirst, Jacqueline E Schein,
et al. “De novo transcriptome assembly with ABySS”. In: Bioinformatics 25.21
(2009), pp. 2872–2877.
[57] David R Bentley, Shankar Balasubramanian, Harold P Swerdlow, Geoffrey P Smith,
John Milton, Clive G Brown, Kevin P Hall, Dirk J Evers, Colin L Barnes, Helen R
Bignell, et al. “Accurate whole human genome sequencing using reversible termi-
nator chemistry”. In: Nature 456.7218 (2008), pp. 53–59.
[58] Jeffrey Martin, Vincent M Bruno, Zhide Fang, Xiandong Meng, Matthew Blow,
Tao Zhang, Gavin Sherlock, Michael Snyder, and Zhong Wang. “Rnnotator: an au-
tomated de novo transcriptome assembly pipeline from stranded RNA-Seq reads”.
In: BMC genomics 11.1 (2010), p. 663.
[59] Steven L Salzberg and James A Yorke. “Beware of mis-assembled genomes”. In:
Bioinformatics 21.24 (2005), pp. 4320–4321.
[60] Marcel H Schulz, Daniel R Zerbino, Martin Vingron, and Ewan Birney. “Oases:
robust de novo RNA-seq assembly across the dynamic range of expression levels”.
In: Bioinformatics 28.8 (2012), pp. 1086–1092.
[61] Gordon Robertson, Jacqueline Schein, Readman Chiu, Richard Corbett, Matthew
Field, Shaun D Jackman, Karen Mungall, Sam Lee, Hisanaga Mark Okada, Jenny Q
Qian, et al. “De novo assembly and analysis of RNA-seq data”. In: Nature Methods
7.11 (Nov. 2010), pp. 909–912.
[62] M G Grabherr et al. “Full-length transcriptome assembly from RNA-seq data with-
out a reference genome”. In: Nature Biotechnology 29.7 (May 2011), pp. 644–652.
[63] Yinlong Xie et al. “SOAPdenovo-Trans: de novo transcriptome assembly with short
RNA-Seq reads”. In: Bioinformatics 30.12 (June 2014), pp. 1660–1666. doi:
10.1093/bioinformatics/btu077.
[64] Bastien Chevreux, Thomas Pfisterer, Bernd Drescher, Albert J Driesel, Werner
EG Müller, Thomas Wetter, and Sándor Suhai. “Using the miraEST assembler for
reliable and automated mRNA transcript assembly and SNP detection in sequenced
ESTs”. In: Genome research 14.6 (2004), pp. 1147–1159.

[65] Yann Surget-Groba and Juan I Montoya-Burgos. “Optimization of de novo
transcriptome assembly from next-generation sequencing data”. In: Genome Research
20.10 (2010), pp. 1432–1440.
[66] Geng Chen, Kang Ping Yin, Charles Wang, and Tie Liu Shi. “De novo transcrip-
tome assembly of RNA-Seq reads with different strategies”. In: Science China Life
Sciences 54.12 (2011), pp. 1129–1133. doi: 10.1007/s11427-011-4256-9.
[67] Stephen F Altschul, Warren Gish, Webb Miller, Eugene W Myers, and David J
Lipman. “Basic local alignment search tool”. In: Journal of molecular biology 215.3
(1990), pp. 403–410.
[68] Eric P Nawrocki, Diana L Kolbe, and Sean R Eddy. “Infernal 1.0: inference of RNA
alignments”. In: Bioinformatics 25.10 (2009), pp. 1335–1337.
[69] Konstantin Riege and Manja Marz. “GORAP”. In: progress (2015).
[70] Nuno A Fonseca, Johan Rung, Alvis Brazma, and John C Marioni. “Tools for
mapping high-throughput sequencing data”. In: Bioinformatics (2012), bts605.
[71] Heng Li and Richard Durbin. “Fast and accurate short read alignment with Burrows–
Wheeler transform”. In: Bioinformatics 25.14 (2009), pp. 1754–1760.
[72] Ben Langmead and Steven L Salzberg. “Fast gapped-read alignment with Bowtie
2”. In: Nature methods 9.4 (2012), pp. 357–359.
[73] Cole Trapnell, Lior Pachter, and Steven L. Salzberg. “TopHat: discovering splice
junctions with RNA-Seq”. In: Bioinformatics 25.9 (May 2009), pp. 1105–1111.
doi: 10.1093/bioinformatics/btp120.
[74] Daehwan Kim, Ben Langmead, and Steven L Salzberg. “HISAT: a fast spliced
aligner with low memory requirements”. In: Nature methods 12.4 (2015), pp. 357–
360.
[75] Steve Hoffmann, Christian Otto, Stefan Kurtz, Cynthia M. Sharma, Philipp Kha-
itovich, Jörg Vogel, Peter F. Stadler, and Jörg Hackermüller. “Fast mapping of
short sequences with mismatches, insertions and deletions using index structures”.
In: PLoS Comput Biol 5.9 (Sept. 2009), e1000502. doi: 10.1371/journal.pcbi.1000502.
[76] Heng Li, Bob Handsaker, Alec Wysoker, Tim Fennell, Jue Ruan, Nils Homer, Gabor
Marth, Goncalo Abecasis, Richard Durbin, et al. “The sequence alignment/map
format and SAMtools”. In: Bioinformatics 25.16 (2009), pp. 2078–2079.
[77] Simon Anders, Paul Theodor Pyl, and Wolfgang Huber. “HTSeq–a Python frame-
work to work with high-throughput sequencing data”. In: Bioinformatics (2014),
btu638.
[78] Yang Liao, Gordon K Smyth, and Wei Shi. “FeatureCounts: an efficient general pur-
pose program for assigning sequence reads to genomic features”. In: Bioinformatics
30.7 (2014), pp. 923–930.
[79] Bo Li, Victor Ruotti, Ron M Stewart, James A Thomson, and Colin N Dewey.
“RNA-Seq gene expression estimation with read mapping uncertainty”. In: Bioin-
formatics 26.4 (2010), pp. 493–500.
[80] Günter P Wagner, Koryu Kin, and Vincent J Lynch. “Measurement of mRNA
abundance using RNA-seq data: RPKM measure is inconsistent among samples”.
In: Theory in Biosciences 131.4 (2012), pp. 281–285.

[81] Simon Anders and Wolfgang Huber. “Differential expression analysis for sequence
count data”. In: Genome Biol 11.10 (2010), R106. doi: 10.1186/gb-2010-11-10-r106.
[82] Mark D Robinson, Davis J McCarthy, and Gordon K Smyth. “edgeR: a Biocon-
ductor package for differential expression analysis of digital gene expression data”.
In: Bioinformatics 26.1 (2010), pp. 139–140.
[83] Matthew E Ritchie, Belinda Phipson, Di Wu, Yifang Hu, Charity W Law, Wei Shi,
and Gordon K Smyth. “Limma powers differential expression analyses for RNA-
sequencing and microarray studies”. In: Nucleic acids research (2015), gkv007.
[84] Robert C Gentleman, Vincent J Carey, Douglas M Bates, Ben Bolstad, Marcel
Dettling, Sandrine Dudoit, Byron Ellis, Laurent Gautier, Yongchao Ge, Jeff Gentry,
et al. “Bioconductor: open software development for computational biology and
bioinformatics”. In: Genome biology 5.10 (2004), R80.
[85] Melanie A Huntley, Jessica L Larson, Christina Chaivorapol, Gabriel Becker, Michael
Lawrence, Jason A Hackney, and Joshua S Kaminker. “ReportingTools: an au-
tomated result processing and presentation toolkit for high-throughput genomic
analyses”. In: Bioinformatics 29.24 (2013), pp. 3220–3221.
[86] Weijun Luo, Michael S Friedman, Kerby Shedden, Kurt D Hankenson, and Peter J
Woolf. “GAGE: generally applicable gene set enrichment for pathway analysis”.
In: BMC Bioinformatics 10 (2009), p. 161. doi: 10.1186/1471-2105-10-161.
[87] Weijun Luo and Cory Brouwer. “Pathview: an R/Bioconductor package for pathway-
based data integration and visualization”. In: Bioinformatics 29.14 (2013), pp. 1830–
1831. doi: 10.1093/bioinformatics/btt285.
[88] Ryan R Wick, Mark B Schultz, Justin Zobel, and Kathryn E Holt. “Bandage:
interactive visualization of de novo genome assemblies”. In: Bioinformatics (2015),
btv383.
[89] Konrad Sachse, Patrik M Bavoil, Bernhard Kaltenboeck, Richard S Stephens, Cho-
Chou Kuo, Ramon Rosselló-Móra, and Matthias Horn. “Emendation of the family
Chlamydiaceae: proposal of a single genus, Chlamydia, to include all currently rec-
ognized species”. In: Systematic and applied microbiology 38.2 (2015), pp. 99–103.
[90] Konrad Sachse, Karine Laroucau, Konstantin Riege, Stefanie Wehner, Meik Dilcher,
Heather Huot Creasy, Manfred Weidmann, Garry Myers, Fabien Vorimore, Nadia
Vicari, et al. “Evidence for the existence of two new members of the family Chlamy-
diaceae and proposal of Chlamydia avium sp. nov. and Chlamydia gallinacea sp.
nov.” In: Systematic and applied microbiology 37.2 (2014), pp. 79–88.
[91] Karine Laroucau, Fabien Vorimore, Rachid Aaziz, Angela Berndt, Evelyn Schubert,
and Konrad Sachse. “Isolation of a new chlamydial agent from infected domestic
poultry coincided with cases of atypical pneumonia among slaughterhouse workers
in France”. In: Infection, Genetics and Evolution 9.6 (2009), pp. 1240–1247.
[92] Virginie Hulin, Sabrina Oger, Fabien Vorimore, Rachid Aaziz, Bertille de Bar-
beyrac, Jacques Berruchon, Konrad Sachse, and Karine Laroucau. “Host prefer-
ence and zoonotic potential of Chlamydia psittaci and C. gallinacea in poultry”.
In: Pathogens and disease 73.1 (2015), pp. 1–11.
[93] Konrad Sachse, K Laroucau, and D Vanrompay. “Avian Chlamydiosis”. In: Curr
Clin Microbiol Reports 2 (2015), pp. 10–21.

[94] Weina Guo, Jing Li, Bernhard Kaltenboeck, Jiansen Gong, Weixing Fan, and
Chengming Wang. “Chlamydia gallinacea, not C. psittaci, is the endemic chlamydial
species in chicken (Gallus gallus)”. In: Scientific reports 6 (2016), p. 19638.
[95] K Laroucau, R Aaziz, L Meurice, V Servas, I Chossat, H Royer, B de Barbeyrac, V
Vaillant, JL Moyen, F Meziani, et al. “Outbreak of psittacosis in a group of women
exposed to Chlamydia psittaci–infected chickens”. In: Euro Surveill (2014).
[96] Chin Lung Lu, Kun-Tze Chen, Shih-Yuan Huang, and Hsien-Tai Chiu. “CAR: con-
tig assembly of prokaryotic draft genomes using rearrangements”. In: BMC bioin-
formatics 15.1 (2014), p. 381.
[97] Kazutaka Katoh and Daron M Standley. “MAFFT multiple sequence alignment
software version 7: improvements in performance and usability”. In: Molecular bi-
ology and evolution 30.4 (2013), pp. 772–780.
[98] Torsten Seemann. “Prokka: rapid prokaryotic genome annotation”. In: Bioinformat-
ics (2014), btu153.
[99] CJ Clarke and D Little. “The pathology of ovine paratuberculosis: gross and his-
tological changes in the intestine and other tissues”. In: Journal of comparative
pathology 114.4 (1996), pp. 419–437.
[100] Marie-Françoise Thorel, Micah Krichevsky, and Véronique Vincent Lévy-Frébault.
“Numerical Taxonomy of Mycobactin-Dependent Mycobacteria, Emended Descrip-
tion of Mycobacterium avium, and Description of Mycobacterium avium subsp.
avium subsp. nov., Mycobacterium avium subsp. paratuberculosis subsp. nov., and
Mycobacterium avium subsp. silvaticum subsp. nov.” In: International Journal of
Systematic Bacteriology 40.3 (1990), pp. 254–260.
[101] Wouter Mijs, Petra de Haas, Rudi Rossau, Tridia Van der Laan, Leen Rigouts,
Françoise Portaels, and Dick van Soolingen. “Molecular evidence to support a
proposal to reserve the designation Mycobacterium avium subsp. avium for bird-
type isolates and ’M. avium subsp. hominissuis’ for the human/porcine type of
M. avium.” In: International Journal of Systematic and Evolutionary Microbiology
52.5 (2002), pp. 1505–1518.
[102] Christine Y Turenne, Desmond M Collins, David C Alexander, and Marcel A Behr.
“Mycobacterium avium subsp. paratuberculosis and M. avium subsp. avium are
independently evolved pathogenic clones of a much broader group of M. avium
organisms”. In: Journal of bacteriology 190.7 (2008), pp. 2479–2487.
[103] Chia-wei Wu, Jeremy Glasner, Michael Collins, Saleh Naser, and Adel M Talaat.
“Whole-genome plasticity among Mycobacterium avium subspecies: insights from
comparative genomic hybridizations”. In: Journal of Bacteriology 188.2 (2006),
pp. 711–723.
[104] Michael Paustian, Xiaochun Zhu, Srinand Sreevatsan, Suelee Robbe-Austerman,
Vivek Kapur, and John Bannantine. “Comparative genomic analysis of Mycobac-
terium avium subspecies obtained from multiple host species”. In: BMC Genomics
9.1 (2008), p. 135.
[105] Chung-Yi Hsu, Chia-Wei Wu, and Adel M Talaat. “Genome-wide sequence vari-
ation among Mycobacterium avium subspecies paratuberculosis isolates: a better
understanding of Johne’s disease transmission dynamics”. In: Frontiers in microbi-
ology 2 (2011), pp. 236–236.

[106] Desmond M Collins, Diana M Gabric, and Geoffrey W de Lisle. “Identification of
two groups of Mycobacterium paratuberculosis strains by restriction endonuclease
analysis and DNA hybridization”. In: Journal of Clinical Microbiology 28.7 (1990),
pp. 1591–1596.
[107] Karen Stevenson, Valerie M Hughes, Lucía de Juan, Neil F Inglis, Frank Wright,
and J Michael Sharp. “Molecular characterization of pigmented and nonpigmented
isolates of Mycobacterium avium subsp. paratuberculosis”. In: Journal of clinical
microbiology 40.5 (2002), pp. 1798–1804.
[108] Elena Castellanos, Alicia Aranaz, Katherine A Gould, Richard Linedale, Karen
Stevenson, Julio Alvarez, Lucas Dominguez, Lucia de Juan, Jason Hinds, and Tim
J Bull. “Discovery of stable and variable differences in the Mycobacterium avium
subsp. paratuberculosis type I, II, and III genomes by pan-genome microarray anal-
ysis”. In: Applied and environmental microbiology 75.3 (2009), pp. 676–686.
[109] Iker Sevilla, Joseba M Garrido, Marivi Geijo, and Ramon A Juste. “Pulsed-field gel
electrophoresis profile homogeneity of Mycobacterium avium subsp. paratuberculosis
isolates from cattle and heterogeneity of those from sheep and goats”. In: BMC
microbiology 7.1 (2007), p. 18.
[110] Isabel Fritsch, Gabriele Luyven, Heike Köhler, Walburga Lutz, and Petra Möbius.
“Suspicion of Mycobacterium avium subsp. paratuberculosis transmission between
cattle and wild-living red deer (Cervus elaphus) by multitarget genotyping”. In:
Applied and environmental microbiology 78.4 (2012), pp. 1132–1139.
[111] RJ Whittington, AF Hope, DJ Marshall, CA Taragel, and I Marsh. “Molecular
epidemiology of Mycobacterium avium subsp. paratuberculosis: IS900 restriction
fragment length polymorphism and IS1311 polymorphism analyses of isolates from
animals and a human in Australia”. In: Journal of clinical microbiology 38.9 (2000),
pp. 3240–3248.
[112] L De Juan, A Mateos, L Dominguez, JM Sharp, and K Stevenson. “Genetic diversity
of Mycobacterium avium subspecies paratuberculosis isolates from goats detected by
pulsed-field gel electrophoresis”. In: Veterinary microbiology 106.3 (2005), pp. 249–
257.
[113] L De Juan, J Alvarez, A Aranaz, A Rodriguez, B Romero, J Bezos, A Mateos, and
L Dominguez. “Molecular epidemiology of Types I/III strains of Mycobacterium
avium subspecies paratuberculosis isolated from goats and cattle”. In: Veterinary
microbiology 115.1 (2006), pp. 102–110.
[114] Petra Möbius, Isabel Fritsch, Gabriele Luyven, Helmut Hotzel, and Heike Köhler.
“Unique genotypes of Mycobacterium avium subsp. paratuberculosis strains of Type
III”. In: Veterinary Microbiology 139.3 (2009), pp. 398–404.
[115] Elena Castellanos, Beatriz Romero, Sabrina Rodríguez, Lucía De Juan, Javier Be-
zos, Ana Mateos, Lucas Domínguez, and Alicia Aranaz. “Molecular characterization
of Mycobacterium avium subspecies paratuberculosis Types II and III isolates by
a combination of MIRU–VNTR loci”. In: Veterinary microbiology 144.1 (2010),
pp. 118–126.

[116] Pallab Ghosh, Chungyi Hsu, Essam J Alyamani, Maher M Shehata, Musaad A Al-
Dubaib, Abdulmohsen Al-Naeem, Mahmoud Hashad, Osama M Mahmoud, Khalid
BJ Alharbi, Khalid Al-Busadah, et al. “Genome-wide Analysis of the Emerging
Infection with Mycobacterium avium subspecies paratuberculosis in the Arabian
Camels (Camelus dromedarius)”. In: PloS one 7.2 (2012), e31947.
[117] Richard J Whittington, D Jeff Marshall, Paul J Nicholls, Ian B Marsh, and Leslie A
Reddacliff. “Survival and dormancy of Mycobacterium avium subsp. paratuberculo-
sis in the environment”. In: Applied and Environmental Microbiology 70.5 (2004),
pp. 2989–3004.
[118] RW Pickup, G Rhodes, TJ Bull, S Arnott, K Sidi-Boumedine, M Hurley, and J
Hermon-Taylor. “Mycobacterium avium subsp. paratuberculosis in lake catchments,
in river water abstracted for domestic use, and in effluent from domestic sewage
treatment works: diverse opportunities for environmental cycling and human expo-
sure”. In: Applied and environmental microbiology 72.6 (2006), pp. 4067–4077.
[119] Glenn Rhodes, Hollian Richardson, John Hermon-Taylor, Andrew Weightman, An-
drew Higham, and Roger Pickup. “Mycobacterium avium subspecies paratuberculo-
sis: human exposure through environmental and domestic aerosols”. In: Pathogens
3.3 (2014), pp. 577–595.
[120] H Shankar, SV Singh, PK Singh, AV Singh, JS Sohal, and RJ Greenstein. “Presence,
characterization, and genotype profiles of Mycobacterium avium subspecies paratu-
berculosis from unpasteurized individual and pooled milk, commercial pasteurized
milk, and milk products in India by culture, PCR, and PCR-REA methods”. In:
International Journal of Infectious Diseases 14.2 (2010), e121–e126.
[121] Birgit Stief, Petra Möbius, Heidemarie Türk, Uwe Hörügel, Carina Arnold, and
Dietrich Pöhle. “Paratuberculosis in a miniature donkey (Equus asinus f. asinus)”.
In: Berl. Münch. Tierärztl. Wschr. 7.1–2 (2012), pp. 38–44.
[122] Ken Over, Philip G Crandall, Corliss A O’Bryan, and Steven C Ricke. “Current
perspectives on Mycobacterium avium subsp. paratuberculosis, Johne’s disease, and
Crohn’s disease: a review”. In: Critical reviews in microbiology 37.2 (2011), pp. 141–
156.
[123] Raja Atreya, Michael Bülte, Gerald-F Gerlach, Ralph Goethe, Mathias W Hornef,
Heike Köhler, Jochen Meens, Petra Möbius, Elke Roeb, Siegfried Weiss, et al.
“Facts, myths and hypotheses on the zoonotic nature of Mycobacterium avium sub-
species paratuberculosis”. In: International Journal of Medical Microbiology 304.7
(2014), pp. 858–867.
[124] Lingling Li, John P Bannantine, Qing Zhang, Alongkorn Amonsin, Barbara J May,
David Alt, Nilanjana Banerji, Sagarika Kanjilal, and Vivek Kapur. “The complete
genome sequence of Mycobacterium avium subspecies paratuberculosis”. In: Proceed-
ings of the National Academy of Sciences of the United States of America 102.35
(2005), pp. 12344–12349.
[125] James W Wynne, Torsten Seemann, Dieter M Bulach, Scott A Coutts, Adel M
Talaat, and Wojtek P Michalski. “Resequencing the Mycobacterium avium subsp.
paratuberculosis K10 genome: improved annotation and revised genome sequence”.
In: Journal of bacteriology 192.23 (2010), pp. 6319–6320.

[126] James W Wynne, Tim J Bull, Torsten Seemann, Dieter M Bulach, Josef Wagner,
Carl D Kirkwood, and Wojtek P Michalski. “Exploring the zoonotic potential of
Mycobacterium avium subspecies paratuberculosis through comparative genomics”.
In: PloS one 6.7 (2011), e22171.
[127] John P Bannantine, Chia-wei Wu, Chungyi Hsu, Shiguo Zhou, David C Schwartz,
Darrell O Bayles, Michael L Paustian, David P Alt, Srinand Sreevatsan, Vivek
Kapur, et al. “Genome sequencing of ovine isolates of Mycobacterium avium sub-
species paratuberculosis offers insights into host association”. In: BMC genomics
13.1 (2012), p. 89.
[128] John P Bannantine, Lingling Li, Michael Mwangi, Rebecca Cote, JA Raygoza
Garay, and Vivek Kapur. “Complete Genome Sequence of Mycobacterium avium
subsp. paratuberculosis, Isolated from Human Breast Milk”. In: Genome Announc.
2(1) (2014), e01252–13. doi: 10.1128/genomeA.01252-13.
[129] Karen Dohmann, Birgit Strommenger, Karen Stevenson, Lucia de Juan, Janin
Stratmann, Vivek Kapur, Tim J Bull, and Gerald-Friedrich Gerlach. “Character-
ization of genetic differences between Mycobacterium avium subsp. paratuberculo-
sis type I and type II isolates”. In: Journal of clinical microbiology 41.11 (2003),
pp. 5215–5223.
[130] Ian B Marsh, John P Bannantine, Michael L Paustian, Mark L Tizard, Vivek
Kapur, and Richard J Whittington. “Genomic comparison of Mycobacterium avium
subsp. paratuberculosis sheep and cattle strains by microarray hybridization”. In:
Journal of bacteriology 188.6 (2006), pp. 2290–2293.
[131] Makeda Semret, Christine Y Turenne, Petra de Haas, Desmond M Collins, and
Marcel A Behr. “Differentiating host-associated variants of Mycobacterium avium
by PCR for detection of large sequence polymorphisms”. In: Journal of clinical
microbiology 44.3 (2006), pp. 881–887.
[132] David C Alexander, Christine Y Turenne, and Marcel A Behr. “Insertion and dele-
tion events that define the pathogen Mycobacterium avium subsp. paratuberculosis”.
In: Journal of bacteriology 191.3 (2009), pp. 1018–1025.
[133] C M Sharma, S Hoffmann, F Darfeuille, J Reignier, S Findeiss, A Sittka, S Chabas,
K Reiche, J Hackermüller, R Reinhardt, P F Stadler, and J Vogel. “The pri-
mary transcriptome of the major human pathogen Helicobacter pylori”. In: Nature
464.7286 (2010), pp. 250–255. doi: 10.1038/nature08756.
[134] Marcus Lechner, Astrid Nickel, Stefanie Wehner, Konstantin Riege, Wieseke, Be-
nedikt M Beckmann, Roland K Hartmann, and Manja Marz. “Genomewide com-
parison and novel ncRNAs in Aquificales”. In: BMC genomics 15.1 (2014), p. 522.
[135] Stefanie Wehner, Gopala K. Mannala, Xiaoxing Qing, Madhugiri Ramakanth, Tri-
nad Chakraborty, Mobarak Abu Mraheil, Torsten Hain, and Manja Marz. “Detec-
tion of very long antisense transcripts by whole transcriptome RNA-Seq analysis of
Listeria monocytogenes by semiconductor sequencing technology”. In: PLoS ONE
9.10 (2014), e108639. doi: 10.1371/journal.pone.0108639.
[136] Kristine B Arnvig and Douglas B Young. “Identification of small RNAs in My-
cobacterium tuberculosis”. In: Molecular Microbiology 73.3 (2009), pp. 397–408.
[137] Kristine B Arnvig and Douglas B Young. “Regulation of pathogen metabolism by
small RNA”. In: Drug Discovery Today: Disease Mechanisms 7.1 (2010), e19–e24.

[138] Dmitriy Ignatov, Sofia Malakho, Konstantin Majorov, Timofey Skvortsov, Alexan-
der Apt, and Tatyana Azhikina. “RNA-Seq Analysis of Mycobacterium avium Non-
Coding Transcriptome”. In: PloS one 8.9 (2013), e74209.
[139] S Englund, G Bölske, A Ballagi-Pordany, and K-E Johansson. “Detection of My-
cobacterium avium subsp. paratuberculosis in tissue samples by single, fluorescent
and nested PCR based on the IS900 gene”. In: Veterinary microbiology 81.3 (2001),
pp. 257–271.
[140] Petra Möbius, Gabriele Luyven, Helmut Hotzel, and Heike Köhler. “High genetic
diversity among Mycobacterium avium subsp. paratuberculosis strains from Ger-
man cattle herds shown by combination of IS900 restriction fragment length poly-
morphism analysis and mycobacterial interspersed repetitive unit-variable-number
tandem-repeat typing”. In: Journal of clinical microbiology 46.3 (2008), pp. 972–
981.
[141] Virginie C Thibault, Maggy Grayon, Maria Laura Boschiroli, Christine Hubbans,
Pieter Overduin, Karen Stevenson, Maria Cristina Gutierrez, Philip Supply, and
Franck Biet. “New variable-number tandem-repeat markers for typing Mycobac-
terium avium subsp. paratuberculosis and M. avium strains: comparison with IS900
and IS1245 restriction fragment length polymorphism typing”. In: Journal of clin-
ical microbiology 45.8 (2007), pp. 2404–2410.
[142] Alongkorn Amonsin, Ling Ling Li, Qing Zhang, John P Bannantine, Alifiya S
Motiwala, Srinand Sreevatsan, and Vivek Kapur. “Multilocus short sequence re-
peat sequencing approach for differentiating among Mycobacterium avium subsp.
paratuberculosis strains”. In: Journal of clinical microbiology 42.4 (2004), pp. 1694–
1702.
[143] Dick van Soolingen, PW Hermans, PE De Haas, DR Soll, and JD Van Embden.
“Occurrence and stability of insertion sequences in Mycobacterium tuberculosis com-
plex strains: evaluation of an insertion sequence-dependent DNA polymorphism as
a tool in the epidemiology of tuberculosis.” In: Journal of clinical microbiology 29.11
(1991), pp. 2578–2586.
[144] Marten Boetzer, Christiaan V Henkel, Hans J Jansen, Derek Butler, and Walter
Pirovano. “Scaffolding pre-assembled contigs using SSPACE”. In: Bioinformatics
27.4 (2011), pp. 578–579.
[145] Te-Chin Chu, Chen-Hua Lu, Tsunglin Liu, Greg C Lee, Wen-Hsiung Li, and Arthur
Chun-Chieh Shih. “Assembler for de novo assembly of large genomes”. In: Proc Natl
Acad Sci 110.36 (2013), E3417–E3424.
[146] Weizhong Li and Adam Godzik. “Cd-hit: a fast program for clustering and compar-
ing large sets of protein or nucleotide sequences”. In: Bioinformatics 22.13 (2006),
pp. 1658–1659.
[147] Mitchell A Yakrus and Robert C Good. “Geographic distribution, frequency, and
specimen source of Mycobacterium avium complex serotypes isolated from patients
with acquired immunodeficiency syndrome.” In: Journal of clinical microbiology
28.5 (1990), pp. 926–929.
[148] Aaron CE Darling, Bob Mau, Frederick R Blattner, and Nicole T Perna. “Mauve:
multiple alignment of conserved genomic sequence with rearrangements”. In: Genome
research 14.7 (2004), pp. 1394–1403.

[149] Aaron E Darling, Bob Mau, and Nicole T Perna. “progressiveMauve: multiple
genome alignment with gene gain, loss and rearrangement”. In: PloS one 5.6 (2010),
e11147.
[150] M Lechner. “Detection of orthologs in large-scale analysis”. Master’s thesis.
University of Leipzig, 2009.
[151] Marcus Lechner, Sven Findeiß, Lydia Steiner, Manja Marz, Peter F Stadler, and
Sonja J Prohaska. “Proteinortho: Detection of (Co-) orthologs in large-scale anal-
ysis”. In: BMC Bioinformatics 12.1 (2011), p. 124.
[152] Kazutaka Katoh, Kazuharu Misawa, Kei-ichi Kuma, and Takashi Miyata. “MAFFT:
a novel method for rapid multiple sequence alignment based on fast Fourier trans-
form”. In: Nucleic acids research 30.14 (2002), pp. 3059–3066.
[153] Paul P Gardner, Jennifer Daub, John G Tate, Eric P Nawrocki, Diana L Kolbe,
Stinus Lindgreen, Adam C Wilkinson, Robert D Finn, Sam Griffiths-Jones, Sean
R Eddy, et al. “Rfam: updates to the RNA families database”. In: Nucleic acids
research 37.suppl 1 (2009), pp. D136–D140.
[154] Wade Winkler, Ali Nahvi, and Ronald R Breaker. “Thiamine derivatives bind mes-
senger RNAs directly to regulate bacterial gene expression”. In: Nature 419.6910
(2002), pp. 952–956.
[155] Zasha Weinberg, Jeffrey E Barrick, Zizhen Yao, Adam Roth, Jane N Kim, Jeremy
Gore, Joy Xin Wang, Elaine R Lee, Kirsten F Block, Narasimhan Sudarsan, et al.
“Identification of 22 candidate structured RNAs in bacteria using the CMfinder
comparative genomics pipeline”. In: Nucleic acids research 35.14 (2007), pp. 4809–
4819.
[156] Zasha Weinberg, Joy X Wang, Jarrod Bogue, Jingying Yang, Keith Corbino, Ryan
H Moy, Ronald R Breaker, et al. “Comparative genomics reveals 104 candidate
structured RNAs from bacteria, archaea, and their metagenomes”. In: Genome Biol
11.3 (2010), R31.
[157] Jeffrey E Barrick, Keith A Corbino, Wade C Winkler, Ali Nahvi, Maumita Mandal,
Jennifer Collins, Mark Lee, Adam Roth, Narasimhan Sudarsan, Inbal Jona, et al.
“New RNA motifs suggest an expanded scope for riboswitches in bacterial genetic
control”. In: Proceedings of the National Academy of Sciences of the United States
of America 101.17 (2004), pp. 6421–6426.
[158] Dilmurat Yusuf, Manja Marz, Peter Stadler, and Ivo Hofacker. “Bcheck: a wrapper
tool for detecting RNase P RNA genes”. In: BMC genomics 11.1 (2010), p. 432.
[159] Karin Lagesen, Peter Hallin, Einar Andreas Rødland, Hans-Henrik Stærfeldt, Tor-
bjørn Rognes, and David W Ussery. “RNAmmer: consistent and rapid annotation
of ribosomal RNA genes”. In: Nucleic acids research 35.9 (2007), pp. 3100–3108.
[160] Todd M Lowe and Sean R Eddy. “tRNAscan-SE: a program for improved detection
of transfer RNA genes in genomic sequence”. In: Nucleic acids research 25.5 (1997),
pp. 955–964.
[161] Sam Griffiths-Jones. “RALEE—RNA ALignment editor in Emacs”. In: Bioinfor-
matics 21.2 (2005), pp. 257–259.
[162] Alexandros Stamatakis. “RAxML version 8: a tool for phylogenetic analysis and
post-analysis of large phylogenies”. In: Bioinformatics 30.9 (May 2014), pp. 1312–
1313. doi: 10.1093/bioinformatics/btu033.

[163] Thomas Junier and Evgeny M Zdobnov. “The Newick utilities: high-throughput
phylogenetic tree processing in the UNIX shell”. In: Bioinformatics 26.13 (July
2010), pp. 1669–1670. doi: 10.1093/bioinformatics/btq243.
[164] Makeda Semret, David C Alexander, Christine Y Turenne, Petra de Haas, Pieter
Overduin, Dick van Soolingen, Debby Cousins, and Marcel A Behr. “Genomic poly-
morphisms for Mycobacterium avium subsp. paratuberculosis diagnostics”. In: Jour-
nal of clinical microbiology 43.8 (2005), pp. 3704–3712.
[165] J Shine and L Dalgarno. “The 3’-terminal sequence of Escherichia coli 16S ribo-
somal RNA: complementarity to nonsense triplets and ribosome binding sites”. In:
Proceedings of the National Academy of Sciences 71.4 (1974), pp. 1342–1346.
[166] JW Dale et al. “Mobile genetic elements in Mycobacteria”. In: The European respi-
ratory journal. Supplement 20 (1995), 633s–648s.
[167] Srinand Sreevatsan, XI Pan, Kathryn E Stockbauer, Nancy D Connell, Barry N
Kreiswirth, Thomas S Whittam, and James M Musser. “Restricted structural gene
polymorphism in the Mycobacterium tuberculosis complex indicates evolutionarily
recent global dissemination”. In: Proceedings of the National Academy of Sciences
94.18 (1997), pp. 9869–9874.
[168] Laura Rindi and Carlo Garzelli. “Genetic diversity and phylogeny of Mycobacterium
avium”. In: Infection, Genetics and Evolution 21 (2014), pp. 375–383.
[169] EP Green, MLV Tizard, MT Moss, J Thompson, DJ Winterbourne, JJ McFadden,
and J Hermon-Taylor. “Sequence and characteristics of IS900, an insertion element
identified in a human Crohn’s disease isolate of Mycobacterium paratuberculosis”.
In: Nucleic Acids Research 17.22 (1989), pp. 9063–9073.
[170] Makeda Semret, Christine Y Turenne, and Marcel A Behr. “Insertion sequence
IS900 revisited”. In: Journal of clinical microbiology 44.3 (2006), pp. 1081–1083.
[171] Ingrid Olsen, Tone Bjordal Johansen, Helen Billman-Jacobe, Sigrun Fredsvold
Nilsen, and Berit Djønne. “A novel IS element, ISMpa1, in Mycobacterium avium
subsp. paratuberculosis”. In: Veterinary microbiology 98.3 (2004), pp. 297–306.
[172] Kai Papenfort and Jörg Vogel. “Regulatory RNA in bacterial pathogens”. In: Cell
host & microbe 8.1 (2010), pp. 116–127.
[173] Paolo Miotto, Francesca Forti, Alessandro Ambrosi, Danilo Pellin, Diogo F Veiga,
Gabor Balazsi, Maria L Gennaro, Clelia Di Serio, Daniela Ghisotti, and Daniela M
Cirillo. “Genome-wide discovery of small RNAs in Mycobacterium tuberculosis”. In:
PloS one 7.12 (2012), e51950.
[174] J Hindley. “Fractionation of 32P-labelled ribonucleic acids on polyacrylamide gels
and their characterization by fingerprinting”. In: J Mol Biol 30.1 (1967), pp. 125–
136.
[175] G G Brownlee. “Sequence of 6S RNA of E. coli”. In: Nat New Biol 229.5 (1971),
pp. 147–149.
[176] A T Cavanagh, A D Klocko, X Liu, and K M Wassarman. “Promoter specificity for
6S RNA regulation of transcription is determined by core promoter sequences and
competition for region 4.2 of sigma70”. In: Mol Microbiol 67.6 (2008), pp. 1242–
1256. doi: 10.1111/j.1365-2958.2008.06117.x.

[177] N Gildehaus, T Neusser, R Wurm, and R Wagner. “Studies on the function of the
riboregulator 6S RNA from E. coli: RNA polymerase binding, inhibition of in vitro
transcription and synthesis of RNA-directed de novo transcripts”. In: Nucleic Acids
Res 35.6 (2007), pp. 1885–1896. doi: 10.1093/nar/gkm085.
[178] A E Trotochaud and K M Wassarman. “6S RNA function enhances long-term cell
survival”. In: J Bacteriol 186.15 (2004), pp. 4978–4985. doi: 10.1128/JB.186.15.4978-4985.2004.
[179] A E Trotochaud and K M Wassarman. “6S RNA regulation of pspF transcription
leads to altered cell survival at high pH”. In: J Bacteriol 188.11 (2006), pp. 3936–
3943. doi: 10.1128/JB.00079-06.
[180] Stefanie Wehner, Katrin Damm, Roland K Hartmann, and Manja Marz. “Dissemi-
nation of 6S RNA among Bacteria”. In: RNA biology 11.11 (2014), pp. 1467–1478.
[181] David Alland, David W Lacher, Manzour Hernando Hazbón, Alifiya S Motiwala,
Weihong Qi, Robert D Fleischmann, and Thomas S Whittam. “Role of large se-
quence polymorphisms (LSPs) in generating genomic diversity among clinical iso-
lates of Mycobacterium tuberculosis and the utility of LSPs in phylogenetic analy-
sis”. In: Journal of clinical microbiology 45.1 (2007), pp. 39–46.
[182] Torsten M Eckstein, John T Belisle, and Julia M Inamine. “Proposed pathway
for the biosynthesis of serovar-specific glycopeptidolipids in Mycobacterium avium
serovar 2”. In: Microbiology 149.10 (2003), pp. 2797–2807.
[183] Elzbieta Krzywinska and Jeffrey S Schorey. “Characterization of genetic differences
between Mycobacterium avium subsp. avium strains of diverse virulence with a
focus on the glycopeptidolipid biosynthesis cluster”. In: Veterinary microbiology
91.2 (2003), pp. 249–264.
[184] IB Marsh and RJ Whittington. “Deletion of an mmpL gene and multiple associated
genes from the genome of the S strain of Mycobacterium avium subsp. paratuber-
culosis identified by representational difference analysis and in silico analysis”. In:
Molecular and cellular probes 19.6 (2005), pp. 371–384.
[185] Michael L Paustian, John P Bannantine, and V Kapur. “Paratuberculosis: Organ-
ism, Disease, Control”. In: CAB International, 2010. Chap. 8. Mycobacterium avium
subsp. paratuberculosis Genome. isbn: 9781845936136.
[186] Eugenie Dubnau, Patricia Fontán, Riccardo Manganelli, Sonia Soares-Appel, and
Issar Smith. “Mycobacterium tuberculosis genes induced during infection of human
macrophages”. In: Infection and immunity 70.6 (2002), pp. 2787–2795.
[187] Lalita Ramakrishnan, Nancy A Federspiel, and Stanley Falkow. “Granuloma-specific
expression of Mycobacterium virulence proteins from the glycine-rich PE-PGRS
family”. In: Science 288.5470 (2000), pp. 1436–1439.
[188] Yongjun Li, Elizabeth Miltner, Martin Wu, Mary Petrofsky, and Luiz E Bermudez.
“A Mycobacterium avium PPE gene is associated with the ability of the bacterium
to grow in macrophages and virulence in mice”. In: Cellular microbiology 7.4 (2005),
pp. 539–548.
[189] Michael L Paustian, John P Bannantine, Vivek Kapur, MA Behr, DM Collins, et
al. “Mycobacterium avium subsp. paratuberculosis genome”. In: Paratuberculosis:
organism, disease and control. CAB, Oxfordshire, UK (2010), pp. 73–81.


[190] Chen Tian and Xie Jian-ping. “Roles of PE_PGRS family in Mycobacterium tuber-
culosis pathogenesis and novel measures against tuberculosis”. In: Microbial patho-
genesis 49.6 (2010), pp. 311–314.
[191] Stewart T Cole. “Comparative and functional genomics of the Mycobacterium tu-
berculosis complex”. In: Microbiology 148.10 (2002), pp. 2919–2928.
[192] Michael J Brennan and Giovanni Delogu. “The PE multigene family: a ’molecular
mantra’ for mycobacteria”. In: Trends in microbiology 10.5 (2002), pp. 246–249.
[193] Giovanni Delogu, Maurizio Sanguinetti, Cinzia Pusceddu, Alessandra Bua, Michael
J Brennan, Stefania Zanetti, and Giovanni Fadda. “PE_PGRS proteins are differ-
entially expressed by Mycobacterium tuberculosis in host tissues”. In: Microbes and
infection 8.8 (2006), pp. 2061–2067.
[194] Nicolaas C Gey van Pittius, Samantha L Sampson, Hyeyoung Lee, Yeun Kim, Paul
D Van Helden, and Robin M Warren. “Evolution and expansion of the Mycobac-
terium tuberculosis PE and PPE multigene families and their association with the
duplication of the ESAT-6 (esx) gene cluster regions”. In: BMC evolutionary biology
6.1 (2006), p. 95.
[195] Pradeep Reddy Marri, John P Bannantine, Michael L Paustian, and G Brian Gold-
ing. “Lateral gene transfer in Mycobacterium avium subspecies paratuberculosis”.
In: Canadian journal of microbiology 52.6 (2006), pp. 560–569.
[196] ST Cole, R Brosch, J Parkhill, T Garnier, C Churcher, D Harris, SV Gordon, K
Eiglmeier, S Gas, CE Barry 3rd, et al. “Deciphering the biology of Mycobacterium
tuberculosis from the complete genome sequence”. In: Nature 393.6685 (1998),
pp. 537–544.
[197] Ruth Hershberg, Mikhail Lipatov, Peter M Small, Hadar Sheffer, Stefan Niemann,
Susanne Homolka, Jared C Roach, Kristin Kremer, Dmitri A Petrov, Marcus W
Feldman, et al. “High functional diversity in Mycobacterium tuberculosis driven by
genetic drift and human demography”. In: PLoS biology 6.12 (2008), e311.
[198] David Stucki and Sebastien Gagneux. “Single nucleotide polymorphisms in My-
cobacterium tuberculosis and the need for a curated database”. In: Tuberculosis
93.1 (2013), pp. 30–39.
[199] JS Sohal, SV Singh, PK Singh, and AV Singh. “On the evolution of ’Indian Bison
type’ strains of Mycobacterium avium subspecies paratuberculosis”. In: Microbio-
logical research 165.2 (2010), pp. 163–171.
[200] Christine Y Turenne, Makeda Semret, Debby V Cousins, Desmond M Collins, and
Marcel A Behr. “Sequencing of hsp65 distinguishes among subsets of the Mycobac-
terium avium complex”. In: Journal of clinical microbiology 44.2 (2006), pp. 433–
440.
[201] IB Marsh and RJ Whittington. “Genomic diversity in Mycobacterium avium: single
nucleotide polymorphisms between the S and C strains of M. avium subsp. paratu-
berculosis and with M. a. avium”. In: Molecular and cellular probes 21.1 (2007),
pp. 66–75.
[202] Jeffrey A Martin and Zhong Wang. “Next-generation transcriptome assembly”. In:
Nature Reviews Genetics 12.10 (2011), pp. 671–682.
[203] Brian J Haas, Michael C Zody, et al. “Advancing RNA-seq analysis”. In: Nature
biotechnology 28.5 (2010), p. 421.


[204] Leandro Lima, Blerina Sinaimeri, Gustavo Sacomoto, Helene Lopez-Maestre, Camille
Marchet, Vincent Miele, Marie-France Sagot, and Vincent Lacroix. “Playing hide
and seek with repeats in local and global de novo transcriptome assembly of short
RNA-Seq reads”. In: Algorithms for Molecular Biology 12.1 (2017), p. 2.
[205] Qiong-Yi Zhao, Yi Wang, Yi-Meng Kong, Da Luo, Xuan Li, and Pei Hao. “Optimiz-
ing de novo transcriptome assembly from short-read RNA-Seq data: a comparative
study”. In: BMC bioinformatics 12.14 (2011), S2.
[206] Sujai Kumar and Mark L Blaxter. “Comparing de novo assemblers for 454 tran-
scriptome data”. In: BMC genomics 11.1 (2010), p. 571.
[207] Ruiqiang Li, Chang Yu, Yingrui Li, Tak-Wah Lam, Siu-Ming Yiu, Karsten Kris-
tiansen, and Jun Wang. “SOAP2: an improved ultrafast tool for short read align-
ment”. In: Bioinformatics 25.15 (2009), pp. 1966–1967.
[208] BingXin Lu, ZhenBing Zeng, and Tieliu Shi. “Comparative study of de novo assem-
bly and genome-guided assembly strategies for transcriptome reconstruction based
on RNA-Seq”. In: Science China. Life sciences 56.2 (2013), p. 143.
[209] Kaitlin Clarke, Yi Yang, Ronald Marsh, LingLin Xie, and Ke K Zhang. “Compar-
ative analysis of de novo transcriptome assembly”. In: Science China. Life sciences
56.2 (2013), p. 156.
[210] Sufang Wang and Michael Gribskov. “Comprehensive evaluation of de novo tran-
scriptome assembly programs and their effects on differential gene expression anal-
ysis”. In: Bioinformatics (2016), btw625.
[211] Juntao Liu, Guojun Li, Zheng Chang, Ting Yu, Bingqiang Liu, Rick McMullen,
Pengyin Chen, and Xiuzhen Huang. “BinPacker: Packing-Based De Novo Tran-
scriptome Assembly from RNA-seq Data.” In: PLoS Comput Biol 12 (2 2016),
e1004772. issn: 1553-7358. doi: 10.1371/journal.pcbi.1004772.
[212] Zheng Chang, Guojun Li, Juntao Liu, Yu Zhang, Cody Ashby, Deli Liu, Carole L
Cramer, and Xiuzhen Huang. “Bridger: a new framework for de novo transcriptome
assembly using RNA-seq data.” In: Genome Biol 16 (2015), p. 30. issn: 1474-760X.
doi: 10.1186/s13059-015-0596-2.
[213] Yu Peng, Henry CM Leung, Siu-Ming Yiu, Ming-Ju Lv, Xin-Guang Zhu, and Fran-
cis YL Chin. “IDBA-tran: a more robust de novo de Bruijn graph assembler for
transcriptomes with uneven expression levels”. In: Bioinformatics 29.13 (2013),
pp. i326–i334.
[214] Zhaleh Safikhani, Mehdi Sadeghi, Hamid Pezeshk, and Changiz Eslahchi. “SSP:
An interval integer linear programming for de novo transcriptome assembly and
isoform discovery of RNA-seq reads”. In: Genomics 102.5 (2013), pp. 507–514.
[215] Sreeram Kannan, Joseph Hui, Kayvon Mazooji, Lior Pachter, and David Tse.
“Shannon: An Information-Optimal de novo RNA-Seq Assembler”. In: bioRxiv
(2016), p. 039230.
[216] Elena Bushmanova, Dmitry Antipov, Alla Lapidus, Vladimir Suvorov, and Andrey
D Prjibelski. “rnaQUAST: a quality assessment tool for de novo transcriptome
assemblies.” In: Bioinformatics 32 (14 2016), pp. 2210–2212. issn: 1367-4811. doi:
10.1093/bioinformatics/btw218.


[217] Daniel R. Zerbino and Ewan Birney. “Velvet: algorithms for de novo short read
assembly using de Bruijn graphs.” eng. In: Genome Res 18.5 (May 2008), pp. 821–
829. doi: 10.1101/gr.074492.107.
[218] Maureen K Thomason, Thorsten Bischler, Sara K Eisenbart, Konrad U Förstner,
Aixia Zhang, Alexander Herbig, Kay Nieselt, Cynthia M Sharma, and Gisela Storz.
“Global transcriptional start site mapping using differential RNA sequencing reveals
novel antisense RNAs in Escherichia coli”. In: Journal of bacteriology 197.1 (2015),
pp. 18–28.
[219] Paul Flicek, M Ridwan Amode, Daniel Barrell, Kathryn Beal, Simon Brent, Denise
Carvalho-Silva, Peter Clapham, Guy Coates, Susan Fairley, Stephen Fitzgerald,
et al. “Ensembl 2012”. In: Nucleic acids research (2011), gkr991.
[220] MA Pfaller and DJ Diekema. “Rare and emerging opportunistic fungal pathogens:
concern for resistance beyond Candida albicans and Aspergillus fumigatus”. In:
Journal of clinical microbiology 42.10 (2004), pp. 4419–4431.
[221] Fabien Cottier, Alrina Shin Min Tan, Jinmiao Chen, Josephine Lum, Francesca
Zolezzi, Michael Poidinger, and Norman Pavelka. “The transcriptional stress
response of Candida albicans to weak organic acids”. In: G3: Genes|Genomes|Genetics
5.4 (2015), pp. 497–505.
[222] Zhibing Lai, Craig M Schluttenhofer, Ketaki Bhide, Jacob Shreve, Jyothi Thimma-
puram, Sang Yeol Lee, Dae-Jin Yun, and Tesfaye Mengiste. “MED18 interaction
with distinct transcription factors regulates multiple plant functions”. In: Nature
communications 5 (2014).
[223] Manfred G. Grabherr et al. “Full-length transcriptome assembly from RNA-Seq
data without a reference genome.” eng. In: Nat Biotechnol 29.7 (July 2011), pp. 644–
652. doi: 10.1038/nbt.1883.
[224] H. Feldmann, H. D. Klenk, and A. Sanchez. “Molecular biology and evolution of
filoviruses”. In: Arch. Virol. Suppl. 7 (1993), pp. 81–100.
[225] Thasso Griebel, Benedikt Zacher, Paolo Ribeca, Emanuele Raineri, Vincent Lacroix,
Roderic Guigó, and Michael Sammeth. “Modelling and simulating generic RNA-
Seq experiments with the flux simulator.” In: Nucleic Acids Res 40 (20 2012),
pp. 10073–10083. issn: 1362-4962. doi: 10.1093/nar/gks666.
[226] Satshil B Rana, Frank J Zadlock IV, Ziping Zhang, Wyatt R Murphy, and Car-
olyn S Bentivegna. “Comparison of De Novo Transcriptome Assemblers and k-mer
Strategies Using the Killifish, Fundulus heteroclitus”. In: PloS one 11.4 (2016),
e0153104.
[227] Ratan Chopra, Gloria Burow, Andrew Farmer, Joann Mudge, Charles E Simpson,
and Mark D Burow. “Comparisons of de novo transcriptome assemblers in diploid
and polyploid species using peanut (Arachis spp.) RNA-seq data”. In: PloS one
9.12 (2014), e115055.
[228] Joanna Moreton, Stephen P Dunham, and Richard D Emes. “A consensus approach
to vertebrate de novo transcriptome assembly from RNA-seq data: assembly of
the duck (Anas platyrhynchos) transcriptome”. In: Frontiers in genetics 5 (2014),
p. 190.
[229] Richard Smith-Unna, Chris Boursnell, Rob Patro, Julian M Hibberd, and Steven
Kelly. “TransRate: reference-free quality assessment of de novo transcriptome as-
semblies”. In: Genome research 26.8 (2016), pp. 1134–1144.


[230] Bo Li, Nathanael Fillmore, Yongsheng Bai, Mike Collins, James A Thomson, Ron
Stewart, and Colin N Dewey. “Evaluation of de novo transcriptome assemblies from
RNA-Seq data”. In: Genome biology 15.12 (2014), p. 553.
[231] Felipe A Simão, Robert M Waterhouse, Panagiotis Ioannidis, Evgenia V Krivent-
seva, and Evgeny M Zdobnov. “BUSCO: assessing genome assembly and annotation
completeness with single-copy orthologs”. In: Bioinformatics (2015), btv351.
[232] Shu Chen, J Scott McElroy, Fenny Dane, and Leslie R Goertzen. “Transcriptome
Assembly and Comparison of an Allotetraploid Weed Species, Annual Bluegrass,
with its Two Diploid Progenitor Species, Poa supina Schrad and Poa infirma Kunth”. In: The Plant Genome
9.1 (2016).
[233] Sunetra Das, Natalie L Pitts, Megan R Mudron, David S Durica, and Donald L
Mykles. “Transcriptome analysis of the molting gland (Y-organ) from the blackback
land crab, Gecarcinus lateralis”. In: Comparative Biochemistry and Physiology Part
D: Genomics and Proteomics 17 (2016), pp. 26–40.
[234] Oliver Rupp, Jennifer Becker, Karina Brinkrolf, Christina Timmermann, Nicole
Borth, Alfred Pühler, Thomas Noll, and Alexander Goesmann. “Construction of
a public CHO cell line transcript database using versatile bioinformatics analysis
pipelines”. In: PloS one 9.1 (2014), e85568.
[235] Laura S Robertson and Robert S Cornman. “Transcriptome resources for the frogs
Lithobates clamitans and Pseudacris regilla, emphasizing antimicrobial peptides
and conserved loci for phylogenetics”. In: Molecular ecology resources 14.1 (2014),
pp. 178–183.
[236] Limin Fu, Beifang Niu, Zhengwei Zhu, Sitao Wu, and Weizhong Li. “CD-HIT:
accelerated for clustering the next-generation sequencing data”. In: Bioinformatics
28.23 (2012), pp. 3150–3152.
[237] Weizhong Li, Lukasz Jaroszewski, and Adam Godzik. “Clustering of highly homol-
ogous sequences to reduce the size of large protein databases”. In: Bioinformatics
17.3 (2001), pp. 282–283.
[238] Weizhong Li, Lukasz Jaroszewski, and Adam Godzik. “Tolerating some redundancy
significantly speeds up clustering of large protein databases”. In: Bioinformatics
18.1 (2002), pp. 77–82.
[239] Julia Bräuer. “Differential effects of vitamins A and D in human monocytes dur-
ing infection – gene expression profiling using RNA sequencing”. MA thesis. Jena:
Friedrich Schiller University Jena, 2016.
[240] Z Al Tanoury, A Piskunov, and C Rochette-Egly. “Vitamin A and retinoid sig-
naling: genomic and nongenomic effects thematic review series: Fat-soluble vita-
mins: Vitamin A.” In: Journal of lipid research 54 (2013), pp. 1761–1775. doi:
10.1194/jlr.r030833.
[241] S Christakos, P Dhawan, A Verstuyf, L Verlinden, and G Carmeliet. “Vitamin D:
Metabolism, molecular mechanism of action, and pleiotropic effects.” In: Physiol
Rev 96 (2016), pp. 365–408. doi: 10.1152/physrev.00014.2015.
[242] CM Bunce, G Brown, and M Hewison. “Vitamin D and hematopoiesis.” In: Trends
in Endocrinology & Metabolism 8 (1997), pp. 245–251. doi: 10.1016/s1043-
2760(97)00066-0.


[243] M Hewison. “Vitamin D and the immune system: new perspectives on an old
theme.” In: Endocrinology and metabolism clinics of North America 39 (2010),
pp. 365–379. doi: 10.1016/j.ecl.2010.02.010.
[244] S Christakos, P Dhawan, Y Liu, X Peng, and A Porta. “New insights into the
mechanisms of vitamin D action.” In: Journal of cellular biochemistry 88 (2003),
pp. 695–705. doi: 10.1002/jcb.10423.
[245] M Clagett-Dame and D Knutson. “Vitamin A in reproduction and development.”
In: Nutrients 3 (2011), pp. 385–428. doi: 10.3390/nu3040385.
[246] M Mark, NB Ghyselinck, and P Chambon. “Function of retinoic acid receptors
during embryonic development.” In: Nucl Recept Signal 7 (2009), e002. doi: 10.
1621/nrs.07002.
[247] A Sommer and KS Vyas. “A global clinical view on vitamin A and carotenoids.”
In: The American journal of clinical nutrition 96 (2012), 1204S–1206S. doi: 10.
3945/ajcn.112.034868.
[248] R Bouillon and T Suda. “Vitamin D: calcium and bone homeostasis during evolu-
tion.” In: BoneKEy reports 3 (2014). doi: 10.1038/bonekey.2013.214.
[249] JA Hall, JR Grainger, SP Spencer, and Y Belkaid. “The role of retinoic acid in
tolerance and immunity.” In: Immunity 35 (2011), pp. 13–22. doi: 10.1016/j.
immuni.2011.07.002.
[250] B Prietl, G Treiber, TR Pieber, and K Amrein. “Vitamin D and immune function.”
In: Nutrients 5 (2013), pp. 2502–2521. doi: 10.3390/nu5072502.
[251] P Glasziou and D Mackerras. “Vitamin A supplementation in infectious diseases:
a meta-analysis.” In: Bmj 306 (1993), pp. 366–370. doi: 10.1136/bmj.306.
6874.366.
[252] WB Grant. “Variations in vitamin D production could possibly explain the
seasonality of childhood respiratory infections in Hawaii.” In: The Pediatric infectious
disease journal 27 (2008), p. 853.
[253] PA Danai, S Sinha, M Moss, MJ Haber, and GS Martin. “Seasonal variation in
the epidemiology of sepsis.” In: Critical care medicine 35 (2007), pp. 410–415. doi:
10.1097/01.ccm.0000253405.17038.43.
[254] E Villamor and WW Fawzi. “Vitamin A supplementation: implications for mor-
bidity and mortality in children.” In: Journal of Infectious Diseases 182 (2000),
S122–S133. doi: 10.1086/315921.
[255] R Semba. “Vitamin A and immunity to viral, bacterial and protozoan infections.”
In: Proceedings of the Nutrition Society 58 (1999), pp. 719–727. doi: 10.1017/
s0029665199000944.
[256] W Waters, M Palmer, B Nonnecke, D Whipple, and R Horst. “Mycobacterium bovis
infection of vitamin D-deficient nos2-/- mice.” In: Microbial pathogenesis 36 (2004),
pp. 11–17. doi: 10.1016/j.micpath.2003.08.008.
[257] Joan Hui Juan Lim, Sharada Ravikumar, Yan-Ming Wang, Thomas Paulraj Tham-
boo, Lizhen Ong, Jinmiao Chen, Jessamine Geraldine Goh, Sen Hee Tay, Lufei
Chengchen, Mar Soe Win, et al. “Bimodal Influence of Vitamin D in Host Re-
sponse to Systemic Candida Infection—Vitamin D Dose Matters.” In: Journal of
Infectious Diseases (2015), jiv033.


[258] Alexandra Yamshchikov, Nirali Desai, Henry Blumberg, Thomas Ziegler, and Vin
Tangpricha. “Vitamin D for treatment and prevention of infectious diseases: a sys-
tematic review of randomized controlled trials.” In: Endocrine Practice 15.5 (2009),
pp. 438–449.
[259] Kacper A Wojtal, Lutz Wolfram, Isabelle Frey-Wagner, Silvia Lang, Michael Scharl,
Stephan R Vavricka, and Gerhard Rogler. “The effects of vitamin A on cells of
innate immunity in vitro”. In: Toxicology in Vitro 27.5 (2013), pp. 1525–1532.
[260] Tilman E Klassert, Anja Hanisch, Julia Bräuer, Esther Klaile, Kerstin A Heyl,
Michael M Mansour, Jenny M Tam, Jatin M Vyas, and Hortense Slevogt. “Modula-
tory role of vitamin A on the Candida albicans-induced immune response in human
monocytes”. In: Medical microbiology and immunology 203.6 (2014), pp. 415–424.
[261] Adrian F Gombart. “The vitamin D–antimicrobial peptide pathway and its role in
protection against infection”. In: Future microbiology 4.9 (2009), pp. 1151–1165.
[262] Yong Zhang, Donald YM Leung, Brittany N Richers, Yusen Liu, Linda K Remigio,
David W Riches, and Elena Goleva. “Vitamin D inhibits monocyte/macrophage
proinflammatory cytokine production by targeting MAPK phosphatase-1”. In: The
Journal of Immunology 188.5 (2012), pp. 2127–2135.
[263] Ai-Leng Khoo, Louis YA Chai, Hans JPM Koenen, Bart-Jan Kullberg, Irma Joosten,
André JAM van der Ven, and Mihai G Netea. “1, 25-dihydroxyvitamin D3 modu-
lates cytokine production induced by Candida albicans: impact of seasonal variation
of immune responses”. In: Journal of Infectious Diseases 203.1 (2011), pp. 122–130.
[264] Paul Oeth, Jin Yao, Sao-Tah Fan, and Nigel Mackman. “Retinoic acid selectively
inhibits lipopolysaccharide induction of tissue factor gene expression in human
monocytes”. In: Blood 91.8 (1998), pp. 2857–2865.
[265] Florian B Mayr, Sachin Yende, and Derek C Angus. “Epidemiology of severe sepsis”.
In: Virulence 5.1 (2014), pp. 4–11.
[266] Natalya V Serbina, Ting Jia, Tobias M Hohl, and Eric G Pamer. “Monocyte-
mediated defense against microbial pathogens”. In: Annu. Rev. Immunol. 26 (2008),
pp. 421–452.
[267] S Andrews, F Krueger, A Segonds-Pichon, F Biggins, and S Wingett. “FastQC:
a quality control tool for high throughput sequence data.” In: Cambridge, UK:
Babraham Institute (2014). doi: 10.1093/bioinformatics/btw627.
[268] Robert Schmieder and Robert Edwards. “Quality control and preprocessing of
metagenomic datasets”. In: Bioinformatics 27.6 (2011), pp. 863–864.
[269] Michael I Love, Wolfgang Huber, and Simon Anders. “Moderated estimation of fold
change and dispersion for RNA-seq data with DESeq2”. In: Genome biology 15.12
(2014), p. 550.
[270] Simon Anders, Davis J McCarthy, Yunshun Chen, Michal Okoniewski, Gordon
K Smyth, Wolfgang Huber, and Mark D Robinson. “Count-based differential ex-
pression analysis of RNA sequencing data using R and Bioconductor”. In: Nature
protocols 8.9 (2013), pp. 1765–1786.
[271] Chris Fraley and Adrian E Raftery. “Model-based clustering, discriminant analysis,
and density estimation”. In: Journal of the American statistical Association 97.458
(2002), pp. 611–631.


[272] Chris Fraley, AE Raftery, and L Scrucca. “Normal mixture modeling for model-
based clustering, classification, and density estimation”. In: Department of Statis-
tics, University of Washington 23 (2012), p. 2012.
[273] Paul D Thomas, Michael J Campbell, Anish Kejariwal, Huaiyu Mi, Brian Karlak,
Robin Daverman, Karen Diemer, Anushya Muruganujan, and Apurva Narechania.
“PANTHER: a library of protein families and subfamilies indexed by function”. In:
Genome research 13.9 (2003), pp. 2129–2141.
[274] Paul D Thomas, Anish Kejariwal, Michael J Campbell, Huaiyu Mi, Karen Diemer,
Nan Guo, Istvan Ladunga, Betty Ulitsky-Lazareva, Anushya Muruganujan, Steven
Rabkin, et al. “PANTHER: a browsable database of gene products organized by
biological function, using curated protein family and subfamily classification”. In:
Nucleic acids research 31.1 (2003), pp. 334–341.
[275] Damian Szklarczyk, Andrea Franceschini, Stefan Wyder, Kristoffer Forslund, Da-
vide Heller, Jaime Huerta-Cepas, Milan Simonovic, Alexander Roth, Alberto San-
tos, Kalliopi P Tsafou, et al. “STRING v10: protein–protein interaction networks,
integrated over the tree of life”. In: Nucleic acids research (2014), gku1003.
[276] Nicolas Alcaraz, Josch Pauling, Richa Batra, Eudes Barbosa, Alexander Junge,
Anne GL Christensen, Vasco Azevedo, Henrik J Ditzel, and Jan Baumbach. “Key-
PathwayMiner 4.0: condition-specific pathway analysis by combining multiple omics
studies and networks with Cytoscape”. In: BMC systems biology 8.1 (2014), p. 99.
[277] Nicolas Alcaraz, Markus List, Martin Dissing-Hansen, Marc Rehmsmeier, Qihua
Tan, Jan Mollenhauer, Henrik J Ditzel, and Jan Baumbach. “Robust de novo path-
way enrichment with KeyPathwayMiner 5”. In: F1000Research 5 (2016).
[278] Michael Zuker. “Mfold web server for nucleic acid folding and hybridization predic-
tion”. In: Nucleic acids research 31.13 (2003), pp. 3406–3415.
[279] Michael W Pfaffl. “A new mathematical model for relative quantification in real-
time RT–PCR”. In: Nucleic acids research 29.9 (2001), e45–e45.
[280] Ivo Rieu and Stephen J Powers. “Real-time quantitative RT-PCR: design, calcula-
tions, and statistics”. In: The Plant Cell 21.4 (2009), pp. 1031–1033.
[281] Michael W Pfaffl, Ales Tichopad, Christian Prgomet, and Tanja P Neuvians. “De-
termination of stable housekeeping genes, differentially regulated target genes and
sample integrity: BestKeeper–Excel-based tool using pair-wise correlations”. In:
Biotechnology letters 26.6 (2004), pp. 509–515.
[282] U Ligges and M Mächler. “Scatterplot3d – an R package for visualizing multivariate
data”. In: Journal of Statistical Software 8 (2003), pp. 1–20. doi: 10.18637/jss.
v008.i11.
[283] Christopher K Glass and Kaoru Saijo. “Nuclear receptor transrepression pathways
that regulate inflammation in macrophages and T cells”. In: Nature Reviews Im-
munology 10.5 (2010), pp. 365–376.
[284] H Israel, C Odziemiec, and M Ballow. “The effects of retinoic acid on immunoglob-
ulin synthesis by human cord blood mononuclear cells”. In: Clinical immunology
and immunopathology 59.3 (1991), pp. 417–425.


[285] Harry D Dawson, Gary Collins, Robert Pyle, Michael Key, Ashani Weeraratna,
Vishwa Deep-Dixit, Celeste N Nadal, and Dennis D Taub. “Direct and indirect
effects of retinoic acid on human Th2 cytokine and chemokine expression by human
T lymphocytes”. In: BMC immunology 7.1 (2006), p. 27.
[286] Yu-Chien Tsai, Hui-Wen Chang, Tai-Tsung Chang, Min-Sheng Lee, Yu-Te Chu,
and Chih-Hsing Hung. “Effects of all-trans retinoic acid on Th1-and Th2-related
chemokines production in monocytes”. In: Inflammation 31.6 (2008), pp. 428–433.
[287] Wibke Schulte, Jürgen Bernhagen, and Richard Bucala. “Cytokines in sepsis: po-
tent immunoregulators and potential therapeutic targets–an updated view”. In:
Mediators of inflammation 2013 (2013).
[288] EJ Giamarellos-Bourboulis. “Clarithromycin: A Promising Immunomodulator in
Sepsis”. In: (2009), pp. 111–118.
[289] Anastasia Antonopoulou and Evangelos J Giamarellos-Bourboulis. “Immunomo-
dulation in sepsis: state of the art and future perspective”. In: Immunotherapy 3.1
(2011), pp. 117–128.
[290] Sanne P Smeekens, Aylwin Ng, Vinod Kumar, Melissa D Johnson, Theo S Plantinga,
Cleo Van Diemen, Peer Arts, Eugène TP Verwiel, Mark S Gresnigt, Karin Fransen,
et al. “Functional genomics identifies type I interferon pathway as central for host
defense against Candida albicans”. In: Nature communications 4 (2013), p. 1342.
[291] Olivia Majer, Christelle Bourgeois, Florian Zwolanek, Caroline Lassnig, Dontscho
Kerjaschki, Matthias Mack, Mathias Müller, and Karl Kuchler. “Type I interfer-
ons promote fatal immunopathology by regulating inflammatory monocytes and
neutrophils during Candida infections”. In: PLoS Pathog 8.7 (2012), e1002811.
[292] Juan José Muñoz, Céline Tárrega, Carmen Blanco-Aparicio, and Rafael Pulido.
“Differential interaction of the tyrosine phosphatases PTP-SL, STEP and HePTP
with the mitogen-activated protein kinases ERK1/2 and p38alpha is determined by
a kinase specificity sequence and influenced by reducing agents”. In: Biochemical
Journal 372.1 (2003), pp. 193–201.
[293] Kate L Jeffrey, Montserrat Camps, Christian Rommel, and Charles R Mackay.
“Targeting dual-specificity phosphatases: manipulating MAP kinase signalling and
immune responses”. In: Nature reviews Drug discovery 6.5 (2007), pp. 391–403.
[294] Katie J Anderson and Rachel L Allen. “Regulation of T-cell immunity by leuco-
cyte immunoglobulin-like receptors: innate immune receptors for self on antigen-
presenting cells”. In: Immunology 127.1 (2009), pp. 8–17.
[295] Francisco Borrego. “The CD300 molecules: an emerging family of regulators of the
immune system”. In: Blood 121.11 (2013), pp. 1951–1960.
[296] Elisabeth Esteban, Ricard Ferrer, Laia Alsina, and Antonio Artigas. “Immunomod-
ulation in sepsis: the role of endotoxin removal by polymyxin B-immobilized car-
tridge”. In: Mediators of inflammation 2013 (2013).
[297] C Ribeiro Nogueira, A Ramalho, E Lameu, CADS Franca, C David, and E Acciolly.
“Serum concentrations of vitamin A and oxidative stress in critically ill patients
with sepsis”. In: Nutr Hosp 24.3 (2009), pp. 312–7.


[298] Takuhiro Moromizato, Augusto A Litonjua, Andrea B Braun, Fiona K Gibbons,
Edward Giovannucci, and Kenneth B Christopher. “Association of low serum
25-hydroxyvitamin D levels and sepsis in the critically ill”. In: Critical care medicine
42.1 (2014), pp. 97–107.
[299] Fredrik Barrenas, Richard R Green, Matthew J Thomas, G Lynn Law, Sean C
Proll, Flora Engelmann, Ilhem Messaoudi, Andrea Marzi, Heinz Feldmann, and
Michael G Katze. “Next generation sequencing reveals a controlled immune re-
sponse to Zaire Ebola virus challenge in cynomolgus macaques immunized with
VSV∆G/EBOVgp”. In: Clinical and Vaccine Immunology (2015), pp. CVI–00733.
[300] C. Cilloniz, H. Ebihara, C. Ni, G. Neumann, M. J. Korth, S. M. Kelly, Y. Kawaoka,
H. Feldmann, and M. G. Katze. “Functional genomics reveals the induction of
inflammatory response and metalloproteinase gene expression during lethal Ebola
virus infection”. In: J. Virol. 85.17 (Sept. 2011), pp. 9060–9068.
[301] J. C. Kash, E. Muhlberger, V. Carter, M. Grosch, O. Perwitasari, S. C. Proll, M. J.
Thomas, F. Weber, H. D. Klenk, and M. G. Katze. “Global suppression of the host
antiviral response by Ebola- and Marburgviruses: increased antagonism of the type
I interferon response is associated with enhanced virulence”. In: J. Virol. 80.6 (Mar.
2006), pp. 3009–3020.
[302] X. Pourrut, M. Souris, J. S. Towner, P. E. Rollin, S. T. Nichol, J. P. Gonzalez, and
E. Leroy. “Large serological survey showing cocirculation of Ebola and Marburg
viruses in Gabonese bat populations, and a high seroprevalence of both viruses in
Rousettus aegyptiacus”. In: BMC Infect. Dis. 9 (2009), p. 159.
[303] A. M. Saez et al. “Investigating the zoonotic origin of the West African Ebola
epidemic”. In: EMBO Mol Med (Dec. 2014).
[304] J. S. Towner, X. Pourrut, C. G. Albarino, C. N. Nkogue, B. H. Bird, G. Grard,
T. G. Ksiazek, J. P. Gonzalez, S. T. Nichol, and E. M. Leroy. “Marburg virus
infection detected in a common African bat”. In: PLoS ONE 2.8 (2007), e764.
[305] J. S. Towner et al. “Isolation of genetically diverse Marburg viruses from Egyptian
fruit bats”. In: PLoS Pathog. 5.7 (July 2009), e1000536.
[306] B. R. Amman et al. “Seasonal pulses of Marburg virus circulation in juvenile Rouset-
tus aegyptiacus bats coincide with periods of increased risk of human infection”. In:
PLoS Pathog. 8.10 (2012), e1002877.
[307] J. T. Paweska, P. Jansen van Vuren, J. Masumu, P. A. Leman, A. A. Grobbelaar, M.
Birkhead, S. Clift, R. Swanepoel, and A. Kemp. “Virological and serological findings
in Rousettus aegyptiacus experimentally inoculated with Vero cells-adapted Hogan
strain of Marburg virus”. In: PLoS ONE 7.9 (2012), e45479.
[308] B. R. Amman, M. E. Jones, T. K. Sealy, L. S. Uebelhoer, A. J. Schuh, B. H.
Bird, J. D. Coleman-McCray, B. E. Martin, S. T. Nichol, and J. S. Towner. “Oral
shedding of Marburg virus in experimentally infected Egyptian fruit bats (Rousettus
aegyptiacus)”. In: J. Wildl. Dis. (Nov. 2014).
[309] A. A. Ansari. “Clinical features and pathobiology of Ebolavirus infection”. In: J.
Autoimmun. 55C (Dec. 2014), pp. 1–9.
[310] J. Y. Yen, S. Garamszegi, J. B. Geisbert, K. H. Rubins, T. W. Geisbert, A. Honko,
Y. Xia, J. H. Connor, and L. E. Hensley. “Therapeutics of Ebola hemorrhagic
fever: whole-genome transcriptional analysis of successful disease mitigation”. In:
J. Infect. Dis. 204 Suppl 3 (Nov. 2011), S1043–1052.


[311] V. Wahl-Jensen, S. Kurz, F. Feldmann, L. K. Buehler, J. Kindrachuk, V. DeFilippis,
J. da Silva Correia, K. Fruh, J. H. Kuhn, D. R. Burton, and H. Feldmann.
“Ebola virion attachment and entry into human macrophages profoundly effects
early cellular gene expression”. In: PLoS Negl Trop Dis 5.10 (Oct. 2011), e1359.
[312] S. Ludwig. “Disruption of virus-host cell interactions and cell signaling pathways
as an anti-viral approach against influenza virus infections”. In: Biol. Chem. 392.10
(Oct. 2011), pp. 837–847.
[313] S. Ludwig. “Will omics help to cure the flu?” In: Trends Microbiol. 22.5 (May 2014),
pp. 232–233.
[314] M. Panning et al. “Diagnostic reverse-transcription polymerase chain reaction kit
for filoviruses based on the strain collections of all European biosafety level 4 lab-
oratories”. In: J. Infect. Dis. 196 Suppl 2 (Nov. 2007), pp. 199–204.
[315] H. Nakabayashi, K. Taketa, K. Miyano, T. Yamane, and J. Sato. “Growth of human
hepatoma cells lines with differentiated functions in chemically defined medium”.
In: Cancer Res. 42.9 (Sept. 1982), pp. 3858–3863.
[316] I. Jordan, V. J. Munster, and V. Sandig. “Authentication of the R06E fruit bat cell
line”. In: Viruses 4.5 (May 2012), pp. 889–900.
[317] A. Timen, M. P. Koopmans, A. C. Vossen, G. J. van Doornum, S. Gunther, F.
van den Berkmortel, K. M. Verduin, S. Dittrich, P. Emmerich, A. D. Osterhaus,
J. T. van Dissel, and R. A. Coutinho. “Response to imported case of Marburg
hemorrhagic fever, the Netherlands”. In: Emerging Infect. Dis. 15.8 (Aug. 2009),
pp. 1171–1175.
[318] R. B. Martines, D. L. Ng, P. W. Greer, P. E. Rollin, and S. R. Zaki. “Tissue and
cellular tropism, pathology and pathogenesis of Ebola and Marburg Viruses”. In:
J. Pathol. (Oct. 2014).
[319] W James Kent, Charles W. Sugnet, Terrence S. Furey, Krishna M. Roskin, Tom
H. Pringle, Alan M. Zahler, and David Haussler. “The human genome browser at
UCSC.” eng. In: Genome Res 12.6 (June 2002), pp. 996–1006. doi: 10.1101/gr.229102.
Article published online before print in May 2002.
[320] S. K. Gire et al. “Genomic surveillance elucidates Ebola virus origin and trans-
mission during the 2014 outbreak”. In: Science 345.6202 (Sept. 2014), pp. 1369–
1372.
[321] Albert K Lee, Kirsten A Kulcsar, Oliver Elliott, Hossein Khiabanian, Elyse R
Nagle, Megan EB Jones, Brian R Amman, Mariano Sanchez-Lockhart, Jonathan
S Towner, Gustavo Palacios, et al. “De novo transcriptome reconstruction and
annotation of the Egyptian rousette bat”. In: BMC genomics 16.1 (2015), p. 1.
[322] Alexey Gurevich, Vladislav Saveliev, Nikolay Vyahhi, and Glenn Tesler. “QUAST:
quality assessment tool for genome assemblies.” eng. In: Bioinformatics 29.8 (Apr.
2013), pp. 1072–1075. doi: 10.1093/bioinformatics/btt086.
[323] Michael T. Wolfinger, Jörg Fallmann, Florian Eggenhofer, and Fabian Amman.
“ViennaNGS: A toolbox for building efficient next-generation sequencing analysis
pipelines”. In: F1000Research 4.50 (2015). doi: 10.12688/f1000research.6157.1.

[324] Aaron R. Quinlan and Ira M. Hall. “BEDTools: a flexible suite of utilities for
comparing genomic features.” eng. In: Bioinformatics 26.6 (Mar. 2010), pp. 841–
842. doi: 10.1093/bioinformatics/btq033.
[325] Helga Thorvaldsdóttir, James T. Robinson, and Jill P. Mesirov. “Integrative Ge-
nomics Viewer (IGV): high-performance genomics data visualization and explo-
ration.” eng. In: Brief Bioinform 14.2 (Mar. 2013), pp. 178–192. doi: 10.1093/
bib/bbs017.
[326] M. Kanehisa and S. Goto. “KEGG: kyoto encyclopedia of genes and genomes.” eng.
In: Nucleic Acids Res 28.1 (Jan. 2000), pp. 27–30.
[327] Yoav Benjamini and Yosef Hochberg. “Controlling the False Discovery Rate: A
Practical and Powerful Approach to Multiple Testing”. In: Journal of the Royal Sta-
tistical Society. Series B (Methodological) 57.1 (1995), pp. 289–300. issn: 00359246.
doi: 10.2307/2346101.
[328] Piotr J Balwierz, Mikhail Pachkov, Phil Arnold, Andreas J Gruber, Mihaela Za-
volan, and Erik van Nimwegen. “ISMARA: automated modeling of genomic sig-
nals as a democracy of regulatory motifs.” In: Genome research 24.5 (Mar. 2014),
pp. 869–84. issn: 1549-5469. doi: 10.1101/gr.169508.113.
[329] Huaiyu Mi, Sagar Poudel, Anushya Muruganujan, John T Casagrande, and Paul
D Thomas. “PANTHER version 10: expanded protein families and functions, and
analysis tools”. In: Nucleic acids research 44.D1 (2016), pp. D336–D342.
[330] J. de Wilde, J. De-Castro Arce, P. J. Snijders, C. J. Meijer, F. Rosl, and R. D.
Steenbergen. “Alterations in AP-1 and AP-1 regulatory genes during HPV-induced
carcinogenesis”. In: Cell. Oncol. 30.1 (2008), pp. 77–87.
[331] B. Varshney and S. K. Lal. “SARS-CoV accessory protein 3b induces AP-1 tran-
scriptional activity through activation of JNK and ERK pathways”. In: Biochem-
istry 50.24 (June 2011), pp. 5419–5425.
[332] T. Kuri, X. Zhang, M. Habjan, L. Martinez-Sobrido, A. Garcia-Sastre, Z. Yuan,
and F. Weber. “Interferon priming enables cells to partially overturn the SARS
coronavirus-induced block in innate immune activation”. In: J. Gen. Virol. 90.Pt
11 (Nov. 2009), pp. 2686–2694.
[333] W. B. Cardenas, Y. M. Loo, M. Gale, A. L. Hartman, C. R. Kimberlin, L. Martinez-
Sobrido, E. O. Saphire, and C. F. Basler. “Ebola virus VP35 protein binds double-
stranded RNA and inhibits alpha/beta interferon production induced by RIG-I
signaling”. In: J. Virol. 80.11 (June 2006), pp. 5168–5178.
[334] P. Ramanan, R. S. Shabman, C. S. Brown, G. K. Amarasinghe, C. F. Basler, and
D. W. Leung. “Filoviral immune evasion mechanisms”. In: Viruses 3.9 (Sept. 2011),
pp. 1634–1649.
[335] P. Luthra, P. Ramanan, C. E. Mire, C. Weisend, Y. Tsuda, B. Yen, G. Liu, D. W.
Leung, T. W. Geisbert, H. Ebihara, G. K. Amarasinghe, and C. F. Basler. “Mutual
antagonism between the Ebola virus VP35 protein and the RIG-I activator PACT
determines infection outcome”. In: Cell Host Microbe 14.1 (July 2013), pp. 74–84.
[336] M. L. Schmitz, M. Kracht, and V. V. Saul. “The intricate interplay between RNA
viruses and NF-κB”. In: Biochim. Biophys. Acta 1843.11 (Nov. 2014), pp. 2754–
2764.

[337] S. Reikine, J. B. Nguyen, and Y. Modis. “Pattern Recognition and Signaling Mech-
anisms of RIG-I and MDA5”. In: Front Immunol 5 (2014), p. 342.
[338] M. U. Gack, Y. C. Shin, C. H. Joo, T. Urano, C. Liang, L. Sun, O. Takeuchi,
S. Akira, Z. Chen, S. Inoue, and J. U. Jung. “TRIM25 RING-finger E3 ubiquitin
ligase is essential for RIG-I-mediated antiviral activity”. In: Nature 446.7138 (Apr.
2007), pp. 916–920.
[339] K. Ozato, D. M. Shin, T. H. Chang, and H. C. Morse. “TRIM family proteins and
their emerging roles in innate immunity”. In: Nat. Rev. Immunol. 8.11 (Nov. 2008),
pp. 849–860.
[340] Judith Olejnik, Jesus Alonso, Kristina M. Schmidt, Zhen Yan, Wei Wang, Andrea
Marzi, Hideki Ebihara, Jinghua Yang, Jean L. Patterson, Elena Ryabchikova, and
Elke Mühlberger. “Ebola virus does not block apoptotic signaling pathways.” eng.
In: J Virol 87.10 (May 2013), pp. 5384–5396. doi: 10.1128/JVI.01461-12.
[341] J. I. Jun and L. F. Lau. “Taking aim at the extracellular matrix: CCN proteins as
emerging therapeutic targets”. In: Nat Rev Drug Discov 10.12 (Dec. 2011), pp. 945–
963.
[342] Chiho Goda, Taisuke Kanaji, Sachiko Kanaji, Go Tanaka, Kazuhiko Arima, Shigeaki
Ohno, and Kenji Izuhara. “Involvement of IL-32 in activation-induced cell death in
T cells”. In: International immunology 18.2 (2006), pp. 233–240.
[343] N. Wauquier, P. Becquart, C. Padilla, S. Baize, and E. M. Leroy. “Human fatal
Zaire Ebola virus infection is associated with an aberrant innate immunity and with
massive lymphocyte apoptosis”. In: PLoS Negl Trop Dis 4.10 (2010).
[344] Wei Li, Yan Liu, Muhammad Mahmood Mukhtar, Rui Gong, Ying Pan, Sahibzada
T Rasool, Yecheng Gao, Lei Kang, Qian Hao, Guiqing Peng, et al. “Activation of
interleukin-32 pro-inflammatory pathway in response to influenza A virus infection”.
In: PLoS One 3.4 (2008), e1985.
[345] W. Li, W. Sun, L. Liu, F. Yang, Y. Li, Y. Chen, J. Fang, W. Zhang, J. Wu, and Y.
Zhu. “IL-32: a host proinflammatory factor against influenza viral replication is up-
regulated by aberrant epigenetic modifications during influenza A virus infection”.
In: J. Immunol. 185.9 (Nov. 2010), pp. 5056–5065.
[346] J. Ouyang, X. Zhu, Y. Chen, H. Wei, Q. Chen, X. Chi, B. Qi, L. Zhang, Y. Zhao,
G. F. Gao, G. Wang, and J. L. Chen. “NRAV, a long noncoding RNA, modu-
lates antiviral responses through suppression of interferon-stimulated gene tran-
scription”. In: Cell Host Microbe 16.5 (Nov. 2014), pp. 616–626.
[347] Christopher F Basler, Xiuyan Wang, Elke Mühlberger, Victor Volchkov, Jason
Paragas, Hans-Dieter Klenk, Adolfo Garcia-Sastre, and Peter Palese. “The Ebola
virus VP35 protein functions as a type I IFN antagonist”. In: Proceedings of the
National Academy of Sciences 97.22 (2000), pp. 12289–12294.
[348] S. P. Reid, L. W. Leung, A. L. Hartman, O. Martinez, M. L. Shaw, C. Carbonnelle,
V. E. Volchkov, S. T. Nichol, and C. F. Basler. “Ebola virus VP24 binds karyopherin
alpha1 and blocks STAT1 nuclear accumulation”. In: J. Virol. 80.11 (June 2006),
pp. 5156–5167.
[349] Christopher F. Basler and Gaya K. Amarasinghe. “Evasion of interferon responses
by Ebola and Marburg viruses.” eng. In: J Interferon Cytokine Res 29.9 (Sept.
2009), pp. 511–520. doi: 10.1089/jir.2009.0076.

[350] Michael Schümann, Thorsten Gantke, and Elke Mühlberger. “Ebola virus VP35
antagonizes PKR activity through its C-terminal interferon inhibitory domain.”
eng. In: J Virol 83.17 (Sept. 2009), pp. 8993–8997. doi: 10.1128/JVI.00523-09.
[351] M. Mateo, S. P. Reid, L. W. Leung, C. F. Basler, and V. E. Volchkov. “Ebolavirus
VP24 binding to karyopherins is required for inhibition of interferon signaling”. In:
J. Virol. 84.2 (Jan. 2010), pp. 1169–1175.
[352] R. Kubisch, L. Meissner, S. Krebs, H. Blum, M. Gunther, A. Roidl, and E. Wag-
ner. “A Comprehensive Gene Expression Analysis of Resistance Formation upon
Metronomic Cyclophosphamide Therapy”. In: Transl Oncol 6.1 (Feb. 2013), pp. 1–
9.
[353] T. Nomiyama, T. Nakamachi, F. Gizard, E. B. Heywood, K. L. Jones, N. Ohkura,
R. Kawamori, O. M. Conneely, and D. Bruemmer. “The NR4A orphan nuclear
receptor NOR1 is induced by platelet-derived growth factor and mediates vascular
smooth muscle cell proliferation”. In: J. Biol. Chem. 281.44 (Nov. 2006), pp. 33467–
33476.
[354] L. Jin, A. Williamson, S. Banerjee, I. Philipp, and M. Rape. “Mechanism of ubiquitin-
chain formation by the human anaphase-promoting complex”. In: Cell 133.4 (May
2008), pp. 653–665.
[355] J. Yao, L. Duan, M. Fan, J. Yuan, and X. Wu. “Overexpression of BLCAP induces
S phase arrest and apoptosis independent of p53 and NF-kappaB in human tongue
carcinoma : BLCAP overexpression induces S phase arrest and apoptosis”. In: Mol.
Cell. Biochem. 297.1-2 (Mar. 2007), pp. 81–92.
[356] A. S. Kondratowicz et al. “T-cell immunoglobulin and mucin domain 1 (TIM-1) is
a receptor for Zaire Ebolavirus and Lake Victoria Marburgvirus”. In: Proc. Natl.
Acad. Sci. U.S.A. 108.20 (May 2011), pp. 8426–8431.
[357] Laurent Meertens, Xavier Carnec, Manuel Perera Lecoin, Rasika Ramdasi, Florence
Guivel-Benhassine, Erin Lew, Greg Lemke, Olivier Schwartz, and Ali Amara. “The
TIM and TAM families of phosphatidylserine receptors mediate dengue virus entry”.
In: Cell host & microbe 12.4 (2012), pp. 544–557.
[358] Naveen L Pereira, Dong Lin, Linda Pelleymounter, Irene Moon, Gail Stilling, Bruce
W Eckloff, Eric D Wieben, Margaret M Redfield, John C Burnett, Vivien C Yee,
et al. “Natriuretic Peptide Receptor-3 Gene (NPR3) Nonsynonymous Polymor-
phism Results in Significant Reduction in Protein Expression Because of Acceler-
ated Degradation”. In: Circulation: Cardiovascular Genetics 6.2 (2013), pp. 201–
210.
[359] W. Xu et al. “Ebola virus VP24 targets a unique NLS binding site on karyopherin
alpha 5 to selectively compete with nuclear import of phosphorylated STAT1”. In:
Cell Host Microbe 16.2 (Aug. 2014), pp. 187–200.
[360] A. P. Zhang, Z. A. Bornholdt, T. Liu, D. M. Abelson, D. E. Lee, S. Li, V. L.
Woods, and E. O. Saphire. “The ebola virus interferon antagonist VP24 directly
binds STAT1 and has a novel, pyramidal fold”. In: PLoS Pathog. 8.2 (Feb. 2012),
e1002550.
[361] R. S. Shabman, E. E. Gulcicek, K. L. Stone, and C. F. Basler. “The Ebola virus
VP24 protein prevents hnRNP C1/C2 binding to karyopherin α1 and partially
alters its nuclear import”. In: J. Infect. Dis. 204 Suppl 3 (Nov. 2011), S904–910.

[362] K. J. Peltola, K. Paukku, T. L. Aho, M. Ruuska, O. Silvennoinen, and P. J. Koskinen.
“Pim-1 kinase inhibits STAT5-dependent transcription via its interactions
with SOCS1 and SOCS3”. In: Blood 103.10 (May 2004), pp. 3744–3750.
[363] R. A. Piganis, N. A. De Weerd, J. A. Gould, C. W. Schindler, A. Mansell, S. E.
Nicholson, and P. J. Hertzog. “Suppressor of cytokine signaling (SOCS) 1 inhibits
type I interferon (IFN) signaling via the interferon alpha receptor (IFNAR1)-
associated tyrosine kinase Tyk2”. In: J. Biol. Chem. 286.39 (Sept. 2011), pp. 33811–
33818.
[364] D. A. Liebermann and B. Hoffman. “Myeloid differentiation (MyD)/growth arrest
DNA damage (GADD) genes in tumor suppression, immunity and inflammation”.
In: Leukemia 16.4 (Apr. 2002), pp. 527–541.
[365] I. Novoa, H. Zeng, H. P. Harding, and D. Ron. “Feedback inhibition of the unfolded
protein response by GADD34-mediated dephosphorylation of eIF2alpha”. In: J. Cell
Biol. 153.5 (May 2001), pp. 1011–1022.
[366] R. Watanabe, Y. Tambe, H. Inoue, T. Isono, M. Haneda, K. Isobe, T. Kobayashi,
O. Hino, H. Okabe, and T. Chano. “GADD34 inhibits mammalian target of ra-
pamycin signaling via tuberous sclerosis complex and controls cell survival under
bioenergetic stress”. In: Int. J. Mol. Med. 19.3 (Mar. 2007), pp. 475–483.
[367] R. Mukai and T. Ohshima. “HTLV-1 HBZ positively regulates the mTOR signaling
pathway via inhibition of GADD34 activity in the cytoplasm”. In: Oncogene 33.18
(May 2014), pp. 2317–2328.
[368] R. Lang, M. Hammer, and J. Mages. “DUSP meet immunology: dual specificity
MAPK phosphatases in control of the inflammatory response”. In: J. Immunol.
177.11 (Dec. 2006), pp. 7497–7504.
[369] K. I. Patterson, T. Brummer, P. M. O’Brien, and R. J. Daly. “Dual-specificity
phosphatases: critical regulators with diverse cellular targets”. In: Biochem. J. 418.3
(Mar. 2009), pp. 475–489.
[370] A. Caceres, B. Perdiguero, C. E. Gomez, M. V. Cepeda, C. Caelles, C. O. Sorzano,
and M. Esteban. “Involvement of the cellular phosphatase DUSP1 in vaccinia virus
infection”. In: PLoS Pathog. 9.11 (2013), e1003719.
[371] X. Chen, W. L. Ma, S. Liang, Z. J. Liao, T. Shang, and W. L. Zheng. “Effect
of Epstein-Barr virus reactivation on gene expression profile of nasopharyngeal
carcinoma”. In: Ai Zheng 27.1 (Jan. 2008), pp. 1–7.
[372] J. S. Arthur and S. C. Ley. “Mitogen-activated protein kinases in innate immunity”.
In: Nat. Rev. Immunol. 13.9 (Sept. 2013), pp. 679–692.
[373] F. Weber, G. Kochs, and O. Haller. “Inverse interference: how viruses fight the
interferon system”. In: Viral Immunol. 17.4 (2004), pp. 498–515.
[374] Jan E Carette, Matthijs Raaben, Anthony C Wong, Andrew S Herbert, Gregor
Obernosterer, Nirupama Mulherkar, Ana I Kuehne, Philip J Kranzusch, April M
Griffin, Gordon Ruthel, et al. “Ebola virus entry requires the cholesterol transporter
Niemann-Pick C1”. In: Nature 477.7364 (2011), pp. 340–343.
[375] Shuai Yuan, Lei Cao, Hui Ling, Minghao Dang, Yao Sun, Xuyuan Zhang, Yutao
Chen, Liguo Zhang, Dan Su, Xiangxi Wang, et al. “TIM-1 acts a dual-attachment
receptor for Ebolavirus by interacting directly with viral GP and the PS on the
viral envelope”. In: Protein & cell 6.11 (2015), pp. 814–824.

[376] Brian H Bird, Thomas G Ksiazek, Stuart T Nichol, and N James MacLachlan. “Rift
Valley fever virus”. In: Journal of the American Veterinary Medical Association
234.7 (2009), pp. 883–893.
[377] Rolf Muller, Jean-Francois Saluzzo, Nora Lopez, Thomas Dreier, Michael Turell,
Jonathan Smith, and Michele Bouloy. “Characterization of clone 13, a naturally
attenuated avirulent isolate of Rift Valley fever virus, which is altered in the small
segment”. In: The American journal of tropical medicine and hygiene 53.4 (1995),
pp. 405–411.
[378] GH Gerdes. “Rift Valley fever”. In: Revue scientifique et technique (International
Office of Epizootics) 23.2 (2004), pp. 613–623.
[379] Joseph J Vitti, Sharon R Grossman, and Pardis C Sabeti. “Detecting natural se-
lection in genomic data.” eng. In: Annu Rev Genet 47 (2013), pp. 97–120. doi:
10.1146/annurev-genet-111212-133526.
[380] Matteo Fumagalli, Manuela Sironi, Uberto Pozzoli, Anna Ferrer-Admetlla,
Linda Pattini, and Rasmus Nielsen. “Signatures of environmental
genetic adaptation pinpoint pathogens as the main selective pressure through hu-
man evolution.” eng. In: PLoS Genet 7.11 (Nov. 2011), e1002355. doi: 10.1371/
journal.pgen.1002355.
[381] Daniel Shriner, David C Nickle, Mark A Jensen, and James I Mullins. “Poten-
tial impact of recombination on sitewise approaches for detecting positive natural
selection.” eng. In: Genet Res 81.2 (Apr. 2003), pp. 115–121.
[382] Wayne Delport, Art F Y Poon, Simon D W Frost, and Sergei L Kosakovsky Pond.
“Datamonkey 2010: a suite of phylogenetic analysis tools for evolutionary biol-
ogy.” eng. In: Bioinformatics 26.19 (Oct. 2010), pp. 2455–2457. doi: 10.1093/
bioinformatics/btq429.
[383] Adi Doron-Faigenboim, Adi Stern, Itay Mayrose, Eran Bacharach, and Tal Pupko.
“Selecton: a server for detecting evolutionary forces at a single amino-acid site.”
eng. In: Bioinformatics 21.9 (May 2005), pp. 2101–2103.
[384] Adi Stern, Adi Doron-Faigenboim, Elana Erez, Eric Martz, Eran Bacharach, and
Tal Pupko. “Selecton 2007: advanced models for detecting positive and purifying
selection using a Bayesian inference approach.” eng. In: Nucleic Acids Res 35.Web
Server issue (July 2007), W506–W511. doi: 10.1093/nar/gkm382.
[385] Fei Su, Hong-Yu Ou, Fei Tao, Hongzhi Tang, and Ping Xu. “PSP: rapid identifi-
cation of orthologous coding genes under positive selection across multiple closely
related prokaryotic genomes.” eng. In: BMC Genomics 14 (Dec. 2013), p. 924. doi:
10.1186/1471-2164-14-924.
[386] Federico Abascal, Rafael Zardoya, and Maximilian J Telford. “TranslatorX: multi-
ple alignment of nucleotide sequences guided by amino acid translations.” eng. In:
Nucleic Acids Res 38.Web Server issue (July 2010), W7–13. doi: 10.1093/nar/
gkq291.
[387] Robert C Edgar. “MUSCLE: multiple sequence alignment with high accuracy and
high throughput.” eng. In: Nucleic Acids Res 32.5 (2004), pp. 1792–1797. doi:
10.1093/nar/gkh340.
[388] D Posada and KA Crandall. “MODELTEST: testing the model of DNA substitu-
tion.” eng. In: Bioinformatics 14.9 (1998), pp. 817–818.

[389] Sergei L Kosakovsky Pond, Simon D W Frost, and Spencer V Muse. “HyPhy:
hypothesis testing using phylogenies.” eng. In: Bioinformatics 21.5 (Mar. 2005),
pp. 676–679. doi: 10.1093/bioinformatics/bti079.
[390] Sergei L Kosakovsky Pond, David Posada, Michael B Gravenor, Christopher H
Woelk, and Simon D W Frost. “GARD: a genetic algorithm for recombination
detection.” eng. In: Bioinformatics 22.24 (Dec. 2006), pp. 3096–3098. doi:
10.1093/bioinformatics/btl474.
[391] Sergei L Kosakovsky Pond, David Posada, Michael B Gravenor, Christopher H
Woelk, and Simon D W Frost. “Automated phylogenetic detection of recombination
using a genetic algorithm.” eng. In: Mol Biol Evol 23.10 (Oct. 2006), pp. 1891–1901.
doi: 10.1093/molbev/msl051.
[392] H Kishino and M Hasegawa. “Evaluation of the maximum likelihood estimate of
the evolutionary tree topologies from DNA sequence data, and the branching order
in Hominoidea.” eng. In: J Mol Evol 29.2 (Aug. 1989), pp. 170–179.
[393] Ziheng Yang. “PAML 4: phylogenetic analysis by maximum likelihood.” eng. In:
Mol Biol Evol 24.8 (Aug. 2007), pp. 1586–1591. doi: 10.1093/molbev/msm088.
[394] Willie J Swanson, Rasmus Nielsen, and Qiaofeng Yang. “Pervasive adaptive evolu-
tion in mammalian fertilization proteins”. In: Molecular biology and evolution 20.1
(2003), pp. 18–20.
[395] Ziheng Yang, Wendy S W Wong, and Rasmus Nielsen. “Bayes Empirical Bayes
Inference of Amino Acid Sites Under Positive Selection.” eng. In: Mol Biol Evol
22.4 (Apr. 2005), pp. 1107–1118. doi: 10.1093/molbev/msi097.
[396] Patrick S Mitchell, Janet M Young, Michael Emerman, and Harmit S Malik. “Evo-
lutionary analyses suggest a function of MxB immunity proteins beyond lentivirus
restriction”. In: PLoS Pathog 11.12 (2015), e1005304.
[397] Ross M McBee, Shea A Rozmiarek, Nicholas R Meyerson, Paul A Rowley, and
Sara L Sawyer. “The Effect of Species Representation on the Detection of Positive
Selection in Primate Gene Data Sets.” eng. In: Mol Biol Evol 32.4 (Apr. 2015),
pp. 1091–1096. doi: 10.1093/molbev/msu399.
[398] Ingi Agnarsson, Carlos M Zambrana-Torrelio, Nadia Paola Flores-Saldana, and
Laura J May-Collado. “A time-calibrated species-level phylogeny of bats (Chi-
roptera, Mammalia)”. In: PLoS currents 3 (2011).
[399] G. Tsagkogeorga, J. Parker, E. Stupka, J. A. Cotton, and S. J. Rossiter. “Phy-
logenomic analyses elucidate the evolutionary relationships of bats”. In: Curr. Biol.
23.22 (Nov. 2013), pp. 2262–2267.
[400] F. C. Almeida, N. P. Giannini, N. B. Simmons, and K. M. Helgen. “Each flying fox
on its own branch: a phylogenetic tree for Pteropus and related genera (Chiroptera:
Pteropodidae)”. In: Mol. Phylogenet. Evol. 77 (Aug. 2014), pp. 83–95.
[401] M. Ruedi, B. Stadelmann, Y. Gager, E. J. Douzery, C. M. Francis, L. K. Lin, A.
Guillen-Servent, and A. Cibois. “Molecular phylogenetic reconstructions identify
East Asia as the cradle for the evolution of the cosmopolitan genus Myotis (Mam-
malia, Chiroptera)”. In: Mol. Phylogenet. Evol. 69.3 (Dec. 2013), pp. 437–449.
[402] Charles H Calisher, James E Childs, Hume E Field, Kathryn V Holmes, and Tony
Schountz. “Bats: important reservoir hosts of emerging viruses”. In: Clinical micro-
biology reviews 19.3 (2006), pp. 531–545.

[403] Emma C Teeling, Mark S Springer, Ole Madsen, Paul Bates, Stephen J O’Brien,
and William J Murphy. “A molecular phylogeny for bats illuminates biogeography
and the fossil record”. In: Science 307.5709 (2005), pp. 580–584.
[404] Don E Wilson and DeeAnn M Reeder. Mammal species of the world: a taxonomic
and geographic reference. Vol. 12. Johns Hopkins University Press, 2005.
[405] John E McCormack, Brant C Faircloth, Nicholas G Crawford, Patricia Adair Gowaty,
Robb T Brumfield, and Travis C Glenn. “Ultraconserved elements are novel phy-
logenomic markers that resolve placental mammal phylogeny when combined with
species-tree analysis”. In: Genome research 22.4 (2012), pp. 746–754.
[406] Mariana F Nery, Dimar J Gonzalez, Federico G Hoffmann, and Juan C Opazo.
“Resolution of the laurasiatherian phylogeny: evidence from genomic data”. In:
Molecular phylogenetics and evolution 64.3 (2012), pp. 685–689.
[407] Nancy B Simmons. “An Eocene big bang for bats”. In: Science 307.5709 (2005),
pp. 527–528.
[408] Nancy B Simmons and J H Geisler. “Phylogenetic relationships of Icaronycteris,
Archaeonycteris, Hassianycteris and Palaeochiropteryx to extant bat lineages, with
comments on the evolution of echolocation and foraging strategies in Microchi-
roptera”. In: Bulletin of the American Museum of Natural History 235 (1998), pp. 1–
182.
[409] Maureen A O’Leary, Jonathan I Bloch, John J Flynn, Timothy J Gaudin, Andres
Giallombardo, Norberto P Giannini, Suzann L Goldberg, Brian P Kraatz, Zhe-Xi
Luo, Jin Meng, et al. “The placental mammal ancestor and the post–K-Pg radiation
of placentals”. In: Science 339.6120 (2013), pp. 662–667.
[410] Zhen Liu, Shude Li, Wei Wang, Dongming Xu, Robert W Murphy, and Peng Shi.
“Parallel evolution of KCNQ4 in echolocating bats”. In: PLoS One 6.10 (2011),
e26618.
[411] Michal Szczesniak, Misako Yoneda, Hiroki Sato, Izabela Makalowska, Shigeru Kyuwa,
Sumio Sugano, Yutaka Suzuki, Wojciech Makalowski, and Chieko Kai. “Character-
ization of the mitochondrial genome of Rousettus leschenaulti”. In: Mitochondrial
DNA (2013), pp. 1–2.
[412] Lin-Fa Wang, Peter J Walker, and Leo LM Poon. “Mass extinctions, biodiversity
and mitochondrial function: are bats ’special’ as reservoirs for emerging viruses?”
In: Current opinion in virology 1.6 (2011), pp. 649–657.
[413] James W Wynne and Lin-Fa Wang. “Bats and viruses: friend or foe?” In: PLoS
Pathog 9.10 (2013), e1003651.
[414] Tabea Binger, Augustina Annan, Jan Felix Drexler, Marcel Alexander Müller, René
Kallies, Ernest Adankwah, Robert Wollny, Anne Kopp, Hanna Heidemann, Dickson
Dei, et al. “A novel rhabdovirus isolated from the straw-colored fruit bat Eidolon
helvum, with signs of antibodies in swine and humans”. In: Journal of virology 89.8
(2015), pp. 4588–4597.
[415] Paul M Arguin, Kristy Murray-Lillibridge, ME Miranda, Jean S Smith, Alan B
Calaor, and Charles E Rupprecht. “Serologic evidence of Lyssavirus infections
among bats, the Philippines”. In: Emerging Infectious Diseases 8.3 (2002), pp. 258–
262.

[416] Hume Field, Brad McCall, and Janine Barrett. “Australian bat lyssavirus infection
in a captive juvenile black flying fox”. In: Emerging infectious diseases 5.3 (1999),
p. 438.
[417] Eric M Leroy, Brice Kumulungui, Xavier Pourrut, Pierre Rouquet, Alexandre Has-
sanin, Philippe Yaba, André Délicat, Janusz T Paweska, Jean-Paul Gonzalez, and
Robert Swanepoel. “Fruit bats as reservoirs of Ebola virus”. In: Nature 438.7068
(2005), pp. 575–576.
[418] Xavier Pourrut, Marc Souris, Jonathan S Towner, Pierre E Rollin, Stuart T Nichol,
Jean-Paul Gonzalez, and Eric Leroy. “Large serological survey showing cocircula-
tion of Ebola and Marburg viruses in Gabonese bat populations, and a high sero-
prevalence of both viruses in Rousettus aegyptiacus”. In: BMC infectious diseases
9.1 (2009), p. 159.
[419] Suxiang Tong, Yan Li, Pierre Rivailler, Christina Conrardy, Danilo A Alvarez
Castillo, Li-Mei Chen, Sergio Recuenco, James A Ellison, Charles T Davis, Ian
A York, et al. “A distinct lineage of influenza A virus from bats”. In: Proceedings
of the National Academy of Sciences 109.11 (2012), pp. 4269–4274.
[420] Suxiang Tong, Xueyong Zhu, Yan Li, Mang Shi, Jing Zhang, Melissa Bourgeois,
Hua Yang, Xianfeng Chen, Sergio Recuenco, Jorge Gomez, et al. “New world bats
harbor diverse influenza A viruses”. In: PLoS Pathog 9.10 (2013), e1003657.
[421] Sabrina Weiss, Peter T Witkowski, Brita Auste, Kathrin Nowak, Natalie Weber,
Jakob Fahr, Jean-Vivien Mombouli, Nathan D Wolfe, Jan Felix Drexler, Christian
Drosten, et al. “Hantavirus in bat, Sierra Leone”. In: (2012).
[422] Wen-Ping Guo, Xian-Dan Lin, Wen Wang, Jun-Hua Tian, Mei-Li Cong, Hai-Lin
Zhang, Miao-Ruo Wang, Run-Hong Zhou, Jian-Bo Wang, Ming-Hui Li, et al. “Phy-
logeny and origins of hantaviruses harbored by bats, insectivores, and rodents”. In:
PLoS Pathog 9.2 (2013), e1003159.
[423] Marcel A Müller, Stéphanie Devignot, Erik Lattwein, Victor Max Corman, Gaël
D Maganga, Florian Gloza-Rausch, Tabea Binger, Peter Vallo, Petra Emmerich,
Veronika M Cottontail, et al. “Evidence for widespread infection of African bats
with Crimean-Congo hemorrhagic fever-like viruses”. In: Scientific reports 6 (2016).
[424] MF Almeida, LFA Martorelli, CC Aires, PC Sallum, EL Durigon, and E Massad.
“Experimental rabies infection in haematophagous bats Desmodus rotundus”. In:
Epidemiology and infection 133.03 (2005), pp. 523–527.
[425] Judith N Mandl, Rafi Ahmed, Luis B Barreiro, Peter Daszak, Jonathan H Epstein,
Herbert W Virgin, and Mark B Feinberg. “Reservoir host immune responses to
emerging zoonotic viruses”. In: Cell 160.1 (2015), pp. 20–35.
[426] ML Baker, Tony Schountz, and L-F Wang. “Antiviral immune responses of bats: a
review”. In: Zoonoses and public health 60.1 (2013), pp. 104–116.
[427] Susanne E Biesold, Daniel Ritz, Florian Gloza-Rausch, Robert Wollny, Jan Fe-
lix Drexler, Victor M Corman, Elisabeth KV Kalko, Samuel Oppong, Christian
Drosten, and Marcel A Müller. “Type I interferon reaction to viral infection in
interferon-competent, immortalized cell lines from the African fruit bat Eidolon
helvum”. In: PloS one 6.11 (2011), e28131.

[428] Christopher Cowled, Michelle Baker, Mary Tachedjian, Peng Zhou, Dieter Bulach,
and Lin-Fa Wang. “Molecular characterisation of Toll-like receptors in the black
flying fox Pteropus alecto”. In: Developmental & Comparative Immunology 35.1
(2011), pp. 7–18.
[429] Christopher Cowled, Michelle L Baker, Peng Zhou, Mary Tachedjian, and Lin-
Fa Wang. “Molecular characterisation of RIG-I-like helicases in the black flying
fox, Pteropus alecto”. In: Developmental & Comparative Immunology 36.4 (2012),
pp. 657–664.
[430] Anthony T Papenfuss, Michelle L Baker, Zhi-Ping Feng, Mary Tachedjian, Gary
Crameri, Chris Cowled, Justin Ng, Vijaya Janardhana, Hume E Field, and Lin-Fa
Wang. “The immune gene repertoire of an important viral reservoir, the Australian
black flying fox”. In: BMC genomics 13.1 (2012), p. 261.
[431] Peng Zhou, Christopher Cowled, Lin-Fa Wang, and Michelle L Baker. “Bat Mx1
and Oas1, but not Pkr are highly induced by bat interferon and viral infection”.
In: Developmental & Comparative Immunology 40.3 (2013), pp. 240–247.
[432] Jinju Li, Guangxu Zhang, Dalong Cheng, Hua Ren, Min Qian, and Bing Du.
“Molecular characterization of RIG-I, STAT-1 and IFN-beta in the horseshoe bat”.
In: Gene 561.1 (2015), pp. 115–123.
[433] Peng Zhou, Mary Tachedjian, James W Wynne, Victoria Boyd, Jie Cui, Ina Smith,
Christopher Cowled, Justin HJ Ng, Lawrence Mok, Wojtek P Michalski, et al.
“Contraction of the type I IFN locus and unusual constitutive expression of IFN-α
in bats”. In: Proceedings of the National Academy of Sciences (2016), p. 201518240.
[434] Dirk Holzinger, Carl Jorns, Silke Stertz, Stéphanie Boisson-Dupuis, Robert Thimme,
Manfred Weidmann, Jean-Laurent Casanova, Otto Haller, and Georg Kochs. “In-
duction of MxA gene expression by influenza A virus requires type I or type III
interferon signaling”. In: Journal of virology 81.14 (2007), pp. 7776–7785.
[435] Markus Mordstein, Georg Kochs, Laure Dumoutier, Jean-Christophe Renauld, Søren
R Paludan, Kevin Klucher, and Peter Staeheli. “Interferon-λ contributes to innate
immunity of mice against influenza A virus but not against hepatotropic viruses”.
In: PLoS Pathog 4.9 (2008), e1000151.
[436] Otto Haller, Peter Staeheli, Martin Schwemmle, and Georg Kochs. “Mx GTPases:
dynamin-like antiviral machines of innate immunity”. In: Trends in microbiology
23.3 (2015), pp. 154–163.
[437] Song Gao, Alexander von der Malsburg, Alexej Dick, Katja Faelber, Gunnar F
Schröder, Otto Haller, Georg Kochs, and Oliver Daumke. “Structure of Myxovirus
resistance protein A reveals intra-and intermolecular domain interactions required
for the antiviral function”. In: Immunity 35.4 (2011), pp. 514–525.
[438] Judith Verhelst, Eef Parthoens, Bert Schepens, Walter Fiers, and Xavier Saelens.
“Interferon-inducible protein Mx1 inhibits influenza virus by interfering with func-
tional viral ribonucleoprotein complex assembly”. In: Journal of virology 86.24
(2012), pp. 13445–13455.
[439] Georg Kochs and Otto Haller. “Interferon-induced human MxA GTPase blocks
nuclear import of Thogoto virus nucleocapsids”. In: Proceedings of the National
Academy of Sciences 96.5 (1999), pp. 2082–2086.

[440] Georg Kochs, Christian Janzen, Heinz Hohenberg, and Otto Haller. “Antivirally
active MxA protein sequesters La Crosse virus nucleocapsid protein into perinu-
clear complexes”. In: Proceedings of the National Academy of Sciences 99.5 (2002),
pp. 3153–3158.
[441] Martin Schwemmle, Kirsten C Weining, Marc F Richter, Beat Schumacher, and
Peter Staeheli. “Vesicular stomatitis virus transcription inhibited by purified MxA
protein”. In: Virology 206.1 (1995), pp. 545–554.
[442] Thomas Fricke, Tommy E White, Bianca Schulte, Daniel A de Souza Aranha Vieira,
Adarsh Dharan, Edward M Campbell, Alberto Brandariz-Nuñez, and Felipe Diaz-
Griffero. “MxB binds to the HIV-1 core and prevents the uncoating process of
HIV-1”. In: Retrovirology 11.1 (2014), p. 68.
[443] Benjamin Mänz, Dominik Dornfeld, Veronika Götz, Roland Zell, Petra Zimmer-
mann, Otto Haller, Georg Kochs, and Martin Schwemmle. “Pandemic influenza A
viruses escape from restriction by human MxA through adaptive mutations in the
nucleoprotein”. In: PLoS Pathog 9.3 (2013), e1003279.
[444] Idoia Busnadiego, Melissa Kane, Suzannah J Rihn, Hannah F Preugschas, Joseph
Hughes, Daniel Blanco-Melo, Victoria P Strouvelle, Trinity M Zang, Brian J Wil-
lett, Chris Boutell, et al. “Host and viral determinants of Mx2 antiretroviral activ-
ity”. In: Journal of virology 88.14 (2014), pp. 7738–7752.
[445] Timothy I Shaw, Anuj Srivastava, Wen-Chi Chou, Liang Liu, Ann Hawkinson,
Travis C Glenn, Rick Adams, and Tony Schountz. “Transcriptome sequencing and
annotation for the Jamaican fruit bat (Artibeus jamaicensis)”. In: PloS one 7.11
(2012), e48472.
[446] Darren P Martin, Ben Murrell, Michael Golden, Arjun Khoosal, and Brejnev Muhire.
“RDP4: Detection and analysis of recombination patterns in virus genomes”. In:
Virus Evolution 1.1 (2015), vev003.
[447] Ari Watt, Felicien Moukambi, Logan Banadyga, Allison Groseth, Julie Callison,
Astrid Herwig, Hideki Ebihara, Heinz Feldmann, and Thomas Hoenen. “A novel
life cycle modeling system for Ebola virus shows a genome length-dependent role
of VP24 in virus infectivity”. In: Journal of virology 88.18 (2014), pp. 10511–10524.
[448] Angela D Luis, David TS Hayman, Thomas J O’Shea, Paul M Cryan, Amy T
Gilbert, Juliet RC Pulliam, James N Mills, Mary E Timonin, Craig KR Willis,
Andrew A Cunningham, et al. “A comparison of bats and rodents as reservoirs of
zoonotic viruses: are bats special?” In: Proc. R. Soc. B. Vol. 280. 1756. The Royal
Society. 2013, p. 20122753.
[449] Kate E Jones, Andy Purvis, Ann MacLarnon, Olaf R P Bininda-Emonds, and
Nancy B Simmons. “A phylogenetic supertree of the bats (Mammalia: Chiroptera)”.
In: Biological Reviews 77.2 (2002), pp. 223–259.
[450] Patrick S Mitchell, Corinna Patzina, Michael Emerman, Otto Haller, Harmit S Ma-
lik, and Georg Kochs. “Evolution-guided identification of antiviral specificity de-
terminants in the broadly acting interferon-induced innate immunity factor MxA”.
In: Cell host & microbe 12.4 (2012), pp. 598–604.
[451] Patrick S Mitchell, Michael Emerman, and Harmit S Malik. “An evolutionary per-
spective on the broad antiviral specificity of MxA”. In: Current opinion in micro-
biology 16.4 (2013), pp. 493–499.

[452] Jin Zhao, Haodi Feng, Daming Zhu, Chi Zhang, and Ying Xu. “IsoTree: De Novo
Transcriptome Assembly from RNA-Seq Reads”. In: International Symposium on
Bioinformatics Research and Applications. Springer. 2017, pp. 71–83.

Appendix A

Fight against Ebola

Further supplemental material, including input data, detailed figures, and
complete gene tables, is available at:
http://www.rna.uni-jena.de/supplements/filovirus_human_bat/

Table A.1: Read statistics. Read counts and assembly/mapping statistics for all 18 HiSeq samples and the additional MiSeq library for R. aegyptiacus.
We mapped all corresponding samples to the H. sapiens and R. aegyptiacus genomes with TopHat and segemehl. Additionally, we built de novo
transcriptome assemblies for both species. For R. aegyptiacus, a de novo transcriptome assembly was computed based on the HiSeq and pooled MiSeq
reads. MiSeq data were assembled with Mira only. For each assembly tool, the number of contigs (>= 0 bp, >= 1000 bp) and the N50 value are listed. For
TopHat and segemehl, overall read mapping statistics are provided. A large number of reads in the EBOV 23 h sample mapped to the EBOV genome.
Detailed statistics can be found in the electronic supplement: www.rna.uni-jena.de/supplements/filovirus_human_bat.
HuH7 (Homo sapiens) R06E-J (Rousettus aegyptiacus)
Sample Mock EBOV MARV Mock EBOV MARV
3h 7h 23 h 3h 7h 23 h 3h 7h 23 h 3h 7h 23 h 3h 7h 23 h 3h 7h 23 h
Read data (million reads)
raw 40.5 38.0 39.0 34.4 49.9 53.0 44.2 48.4 36.3 50.4 44.0 48.5 41.5 38.4 39.3 37.5 48.7 45.6
proc. 38.4 36.0 36.9 32.8 46.6 50.1 41.8 45.7 34.3 47.8 41.4 45.5 39.4 36.0 37.4 35.5 45.9 43.3
Mapping on human genome (overall read mapping rate in %) Mapping on bat genome (overall read mapping rate in %)
TopHat 89.4 90.8 91.3 89.9 88.9 55.7 90.6 89.3 88.9 90.6 90.7 92.2 90.2 91.0 72.4 91.3 91.1 89.1
segemehl 95.3 95.6 95.1 95.1 93.1 58.3 95.4 94.5 92.8 97.5 92.0 97.2 97.2 96.7 76.9 97.3 96.9 95.4

Cell line R06E-J (Rousettus aegyptiacus)
Samples Mock EBOV MARV pooled
3h 7h 23 h 3h 7h 23 h 3h 7h 23 h MiSeq
Read data (million reads)
raw 50.4 44.0 48.5 41.5 38.4 39.3 37.5 48.7 45.6 38.2
processed 47.8 41.4 45.5 39.4 36.0 37.4 35.5 45.9 43.3 38.0
de novo transcriptome assembly
Assembler     contigs >= 0 bp    contigs >= 1000 bp    N50
Oases         370 200            180 458               3 875
TransABySS    790 204            169 324               1 788
SOAP-Trans    699 418            147 144               3 261
Trinity       484 826            188 534               5 071
Mira          162 861            21 987                774
Combined      977 787            277 595               3 923
Mapping on bat transcriptome (overall read mapping rate in %)
TopHat 94.6 94.7 95.2 94.6 95.0 95.8 95.5 95.2 94.8 –
segemehl 98.5 97.0 97.6 98.5 96.6 98.4 98.2 97.6 97.4 –
Table A.2: Number of reads mapping to the viral genomes. For R06E-J samples, we used Blastn+ to find contigs within the R. aegyptiacus
transcriptome assembly which represent the full EBOV (contig610) and MARV (contig5818) genome, respectively. Read counts were normalized by library
size. Read maximum peaks were calculated for each sample. Interestingly, EBOV seems to replicate much faster in human cells compared to bat cells
between 3 and 7 h (15.6X). However, EBOV decreases its transcription speed again in the following 16 h (15.5X) (see Fig. 5.14B and Fig. 5.16 in the main
text). Similarly, MARV replicates faster between 3 and 7 h in human cells (7.6X) than bat cells (4.3X). The RNA profiles mapping to the viral genomes
are astonishingly similar, showing no mutations and only a minor fraction of reads mapping to the 5’ and 3’ UTR of the genome, showing the difference
between genomic and transcriptomic level. Read counts are based on unique TopHat mappings.

           HuH7 EBOV (KM034562v1)           HuH7 MARV (JN408064)             R06E-J EBOV (contig610)          R06E-J MARV (contig5818)
Sample     # reads     peak     norm.       # reads     peak     norm.       # reads     peak     norm.       # reads     peak     norm.
Mock 3 h   3 689       182      96.07       134         9        3.49        3 956       124      82.76       158         11       3.31
Mock 7 h   1 897       102      52.69       104         8        2.89        4 722       151      119.85      155         13       3.74
Mock 23 h  3 469       148      94.01       128         6        3.47        4 868       164      106.99      289         10       6.35
EBOV 3 h   28 009      1 653    853.93                                       39 274      1 156    948.65
EBOV 7 h   619 370     43 222   13 291.20                                    162 618     7 260    4 517.17
EBOV 23 h  10 334 085  429 012  206 269.16                                   6 853 608   228 449  183 251.55
MARV 3 h                                    37 504      1 794    897.22                                       3 896       126      109.75
MARV 7 h                                    313 238     13 683   6 854.22                                     21 654      782      471.76
MARV 23 h                                   701 757     24 435   20 459.39                                    848 647     22 119   19 599.24
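
The normalized values in Table A.2 are consistent with dividing each read count by the number of processed reads of the respective library in millions (Table A.1). This relation is inferred from the numbers and is not stated explicitly in the caption. Under that assumption, the following minimal sketch reproduces the fold increases quoted for EBOV in HuH7 cells; the function and variable names are illustrative only.

```python
# Sketch: library-size normalization and fold increase between time points.
# Assumption (inferred from the table values): norm. = viral reads divided by
# the number of processed reads of the library, given in millions.

def normalize(viral_reads, processed_million):
    """Return viral reads per million processed reads."""
    return viral_reads / processed_million

# HuH7 EBOV samples: (reads on the EBOV genome from Table A.2,
#                     processed reads in millions from Table A.1)
huh7_ebov = {
    "3 h": (28_009, 32.8),
    "7 h": (619_370, 46.6),
    "23 h": (10_334_085, 50.1),
}

norm = {time: normalize(reads, lib) for time, (reads, lib) in huh7_ebov.items()}

# Fold increase of normalized viral reads between consecutive time points
fold_3_to_7 = norm["7 h"] / norm["3 h"]
fold_7_to_23 = norm["23 h"] / norm["7 h"]

print({time: round(value, 2) for time, value in norm.items()})
print(round(fold_3_to_7, 1), round(fold_7_to_23, 1))
```

The same two divisions applied to the other columns give the remaining fold increases mentioned in the caption.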

Table A.3: Comparison of genome and de novo transcriptome assemblies. From the
genomic sequences of H. sapiens and R. aegyptiacus we selected different sets of expressed genes
using various filter thresholds: 1) we selected transcripts from the genome with at least N ∈
{100, 1000, 5000} uniquely mapped reads in one sample (= ∃) or 2) accumulated all uniquely mapped
reads over all samples (= Σ∀). The selected transcript sets were further blasted against the
corresponding de novo transcriptome assembly of human and bat, respectively. We defined a
transcript (derived from the genomic sequence) as true positive and therefore correctly assembled,
if we got at least one blast hit with an alignment length > 90%.
∃ sample Σ∀ samples
read count ≥ 100 ≥ 1000 ≥ 5000 ≥ 100 ≥ 1000 ≥ 5000
H. sapiens 96.54% 97.39% 98.17% 93.0% 97.18% 98.08%
R. aegyptiacus 88.26% 92.8% 94.02% 81.25% 90.20% 92.19%
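
The true-positive criterion of Table A.3 can be sketched as follows. The sketch assumes that the 90 % threshold refers to the alignment length relative to the length of the genome-derived transcript, which the caption does not spell out. The transcript identifiers and hit values are made up for illustration and are not taken from the thesis data.

```python
# Sketch: fraction of reference transcripts that count as correctly assembled.
# Assumption: a transcript is recovered if at least one BLAST hit against the
# de novo assembly has an alignment length > 90 % of the transcript length.

def recovered_fraction(transcript_lengths, hits, min_fraction=0.9):
    """transcript_lengths: dict id -> length in nt.
    hits: iterable of (transcript id, alignment length) pairs."""
    recovered = {
        query
        for query, alignment_length in hits
        if alignment_length > min_fraction * transcript_lengths[query]
    }
    return len(recovered) / len(transcript_lengths)

# Hypothetical example data
lengths = {"tx1": 1000, "tx2": 2000, "tx3": 500, "tx4": 1500}
hits = [("tx1", 950), ("tx1", 400), ("tx2", 1700), ("tx3", 480), ("tx4", 1400)]

print(recovered_fraction(lengths, hits))
```

In this toy example tx2 fails the criterion because its best hit covers only 85 % of the transcript.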

Table A.4: Differential gene expression. Differential expression levels of NCBI-annotated
protein-coding genes and ncRNAs, de novo Cufflinks-predicted genes, and hand-selected genes
of interest from the literature. From the 2 364 de novo gene loci showing differential expression,
92 % could be mapped back to already annotated genes (hg19 annotations). Thus, we detected
189 (8 %) unannotated gene loci with significant differential expression.
DEG – differentially expressed genes (padj < 0.1); FC – mean fold change of differentially expressed
genes (top 300 and all) calculated with DESeq (padj < 0.1); for each gene the maximum fold
change obtained over all combinations of time points and infections was used.

NCBI ncRNAs de novo genes of interest Total


HuH7 samples
# genes 25 051 2 349 18 391 1 508 47 299
# DEG 2 492 20 2 364 167 5 043
FC Top 300 4.74 2.69 4.58 2.53
standard deviation ± 0.84 ± 1.02 ± 0.91 ± 1.09
FC total 2.53 2.69 2.35 2.53
standard deviation ± 1.03 ± 1.02 ± 1.04 ± 1.09
R06E-J samples
# genes 11 358 499 10 496 915 23 268
# DEG 641 8 368 58 1 075
FC Top 300 2.28 3.13 5.61 1.64
standard deviation ± 0.69 ± 1.46 ± 2.14 ± 0.64
FC total 1.77 3.13 5.04 1.64
standard deviation ± 0.68 ± 1.46 ± 2.29 ± 0.64
Manually inspected
400 117 170 793 1480

Table A.5: Top 10 key players of human and bat infection. Comparison between all
conditions and time points within one species. The read_max values are based on multiple mapped
reads, and the candidates listed here are filtered for a read_max of at least 100 reads in one
sample. Fold changes for human samples are based on uniquely mapped reads. Interestingly, genes
coding for histones are up-regulated between 7 h and 23 h in all samples including Mock.
HIST2H4B – highly regulated in Mock and EBOV, probably cell induced and independent of
infection; CENPE – the other samples are fairly constant at a read_max of around 500;
superscript-sized labels (e.g. B8) – rank of the gene among the top 10 of the following list (sorted by
read_max); FC – log2 fold change based on DESeq normalized read counts; norm_reads – DESeq
normalized read counts; change_max – ratio of the two read_max values; read_max – maximum
number of reads mapping to one nucleotide position of this gene; Mo – Mock; EV – EBOV;
MV – MARV. Genes specified by a number refer to the corresponding LOC, for example
LOC338651. Further details about differential expression can be obtained from the various
tables and pathway figures in the electronic supplement.

Gene Samples FC norm_reads change_max read_max


EBOV and MARV on human cells (sorted by fold change)
A1 PZP EV 3 h/23 h 8.38 6.84 2283.13 2.8660 664 1903
A2 FOSB B8 EV 7 h/23 h 6.89 20.32 2416.03 63.25 4 253
A3 RPS17 Mo 3 h/23 h -6.85 343.00 2.98 -2.0222 1727 854
A4 FOS B3 EV 7 h/23 h 6.09 36.82 2190.53 63.8333 6 383
A5 AREG EV 7 h/23 h 6.01 1.63 105.27 20.3333 27 549
A6 ATF3 B1 EV 3 h/23 h 5.89 176.81 10457.63 73.7368 19 1401
A7 338651 B4 EV 3 h/23 h 5.85 10.27 590.85 24.5600 25 614
read_max EV 7 h/23 h 5.80 10.57 590.85 34.1111 18 614
A8 TMEM88 EV 7 h/23 h 5.74 1.63 86.85 8.6145 83 715
A9 MYCN EV 7 h/23 h -5.69 3530.65 68.43 -21.2500 340 16
A10 GZMM EV 3 h/23 h 5.67 2.28 104.42 2.28369 141 323
EBOV and MARV on human cells (sorted by read_max)
B2 PPP1R15A EV 3 h/23 h 5.37 496.95 18529.31 53.1282 39 2072
B5 EGR1 EV 3 h/23 h 4.91 136.88 4110.94 31.6154 13 411
B6 NR4A1 EV 3 h/23 h 4.44 227.96 4466.49 28.1538 13 366
B7 DUSP1 EV 7 h/23 h 4.57 364.87 7569.54 25.3051 59 1493
B9 DUSP8 EV 7 h/23 h 5.01 281.28 9032.49 24.0769 26 626
B10 NFKB2 EV 3 h/23 h 4.08 443.38 6769.7 23.0000 32 736
EBOV and MARV on bat cells (sorted by fold change)
C1 TRIB3 D2 MV 3 h-23 h 4.76 8.90 240.54 14.0000 3 42
C2 CHAC1 D1 MV 3 h-23 h 3.70 61.18 796.86 21.9286 14 307
C3 DDIT4 D3 Mo 7 h-23 h 3.66 47.74 609.84 11.4444 9 103
C4 HIST1H4A Mo 7 h-23 h 3.06 40.85 339.59 5.4815 27 148
C5 CDH6 D6 EV 3 h-23 h -2.97 4279.83 546.64 -6.8868 365 53
C6 SQSTM1 D4 EV 7 h-23 h 2.92 716.19 5435.56 9.4000 80 752
C7 ATF3 Mo 3 h-23 h 2.89 28.56 211.33 5.1111 46 9
read_max EV 3 h-23 h 2.37 99.81 515.84 5.0800 25 127
C8 HIST2H4B D7 Mo 7 h-23 h 2.83 49.02 349.36 4.7097 31 146
read_max EV 3 h-7 h 2.40 20.14 106.22 6.5556 9 59
C9 CYP1B1 MV 3 h-23 h -2.76 1239.22 182.81 -3.8857 136 35
C10 DUSP5 D8 MV 3 h-23 h 2.74 188.00 1252.58 6.2069 29 180
EBOV and MARV on bat cells (sorted by read_max)
D5 PLEKHA4 EV 7 h-23 h 1.58 76.66 228.41 8.8461 13 115
D9 MICAL1 MV 3 h-23 h 1.74 254.74 852.84 5.8824 17 100
D10 CENPE Mo 3 h-23 h -1.96 30412.24 7837.38 -5.8202 1036 178
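
The two derived columns of Table A.5 can be recomputed from the listed values. The following minimal sketch assumes that FC is the log2 ratio of the two normalized read counts and that change_max is the ratio of the larger to the smaller read_max value, with a negative sign for a decrease. The sign convention is inferred from the table entries, and the function names are illustrative.

```python
import math

# Sketch: recomputing FC and change_max for a pair of samples.
# Assumption: the first value belongs to the earlier sample of the pair.

def log2_fold_change(norm_reads_a, norm_reads_b):
    """log2 ratio of the normalized read counts (second vs. first sample)."""
    return math.log2(norm_reads_b / norm_reads_a)

def change_max(read_max_a, read_max_b):
    """Signed ratio of the read_max values: positive for an increase,
    negative for a decrease. Both values must be greater than zero."""
    if read_max_b >= read_max_a:
        return read_max_b / read_max_a
    return -(read_max_a / read_max_b)

# PZP, EV 3 h/23 h: norm_reads 6.84 and 2283.13, read_max 664 and 1903
print(round(log2_fold_change(6.84, 2283.13), 2))
print(round(change_max(664, 1903), 4))

# RPS17, Mo 3 h/23 h: read_max 1727 and 854 (decrease)
print(round(change_max(1727, 854), 4))
```

Because the inputs are the rounded table values, small deviations from the DESeq output in the last digit are possible for other rows.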


Table A.6: Top 10 differences between EBOV/Mock, MARV/Mock and EBOV/MARV
in human cells. Genes with the highest differential expression between Mock samples and EBOV-
and MARV-infected samples, respectively. In addition, genes with the highest differential
expression between EBOV- and MARV-infected samples are summarized. By manual inspection, we
found CXCL8, AKR1B10 and AKR1B15 to play only a minor role, since we detected only
low-level transcription and mapping artifacts (AKR1B10 and AKR1B15 are located next to each
other, and reads were mapped twice). AMOTL2 is part of the MAPK signalling pathway (see
electronic supplement). LOC100507347 refers to a protein with unknown function (also described as
BC078172). Abbreviations as in Tab. A.5.

Gene Samples FC norm_reads change_max read_max


Mock vs. EBOV infection (sorted by fold change)
E1 ANKRD1 F1 23 h 7.41 55.60 9427.27 111.33300 15 1670
E2 RPS17 23 h 6.86 2.98 346.09 2.28923 854 1955
E3 FOSB 23 h 6.73 22.83 2416.03 63.25 4 253
E4 PZP 3h -6.45 598.72 6.84 0.71084 472 664
E5 CXCL8 23 h 6.40 10.38 754.70 57.0 3 171
E6 MYCN 23 h -6.17 4917.46 68.43 0.04290 373 16
E7 338651 F2 23 h 6.05 8.94 590.85 55.8182 11 614
E8 AREG F6 23 h 5.80 1.99 110.54 39.2000 10 392
E9 PPP1R15A F3 23 h 5.70 414.04 18529.31 51.8000 40 2072
E10 FOS 23 h 5.58 52.92 2190.53 25.5333 15 383
Mock vs. EBOV infection (sorted by read_max)
F4 DUSP8 23 h 5.34 222.39 9032.49 48.1538 13 626
F5 CXCL5 23 h 5.24 83.40 3158.21 40.6000 10 406
F7 DUSP1 23 h 5.34 216.88 7569.54 37.3250 40 1493
F8 AREG 23 h 0.00 0.00 105.27 34.3125 16 549
F9 AMOTL2 23 h 4.68 914.39 23387.89 33.3171 41 1366
F10 CREB5 23 h 4.98 225.37 7117.82 32.2000 10 322
Mock vs. MARV (sorted by fold change)
G1 AKR1B10 H1 23 h 6.46 34.75 3050.01 65.5714 7 459
G2 RPS17 23 h 6.05 2.98 196.81 1.2424 854 106
G3 AKR1B15 H4 23 h 5.77 2.98 163.10 65.5 2 131
G4 ANXA1 23 h 5.48 5.96 265.31 0.8041 97 78
G5 NCF2 23 h 4.76 17.87 484.96 0.8171 164 134
G6 CXXC1 H2 23 h 4.57 96.30 2282.34 21.0000 19 399
G7 ANXA3 H3 23 h 4.49 63.54 1430.95 19.2222 9 173
G8 100507347 H6 23 h 4.08 1.99 33.71 8.1500 20 163
G9 F2RL2 H7 23 h 3.86 295.86 4288.50 7.8400 25 196
G10 CXCL5 H5 23 h 3.82 83.40 1179.77 12.2000 10 122
Mock vs. MARV (sorted by read_max)
H8 GPX2 23 h 3.13 124.10 1089.52 6.9118 34 235
H9 ANKRD1 23 h 3.79 55.60 770.93 6.3333 15 95
H10 PTGR1 23 h 2.66 1551.78 9778.51 4.7797 177 846
EBOV vs. MARV infection (sorted by fold change)
I1 PZP 3h 7.36 6.84 1124.58 1.4142 664 939
I2 AKR1B10 J2 23 h 7.18 21.05 3050.01 28.6875 16 459
I3 FOSB J4 23 h -6.53 2416.03 26.10 -62.25 253 4
I4 CXXC1 J5 23 h 6.12 32.90 2282.34 24.9375 16 399
I5 AREG 23 h -5.60 105.27 2.17 -11.9348 549 46
I6 GZMM 23 h -5.26 82.90 2.17 -2.4030 322 134
I7 FOS J3 23 h -5.13 2190.53 67.99 -27.3571 383 14
I8 GPX2 23 h 5.11 31.58 1089.52 12.3684 19 235
I9 F2RL2 23 h 4.90 143.44 4288.50 12.2500 16 196
I10 PPP1R15A 23 h -4.80 18529.31 719.21 -30.9254 2072 67
EBOV vs. MARV infection (sorted by read_max)
J1 PPP1R15A 23 h -4.80 18529.31 719.21 -30.9254 2072 67
J6 338651 23 h -4.28 590.85 30.45 -24.5600 614 25
J7 DUSP8 23 h -4.07 9032.49 537.15 -24.0769 626 26
J8 ATF3 23 h -4.31 10457.63 527.36 -20.9104 1401 67
J9 ANKRD1 23 h -3.61 9427.27 770.93 -17.5789 1670 95
J10 AREG 23 h -3.50 110.54 9.79 -14.5185 392 27

Table A.7: Top 15 differences between human and bat cells. To investigate genes that were
differentially expressed between human and Rousettus aegyptiacus tissues, we compared
R. aegyptiacus transcripts with the corresponding human genes. R. aegyptiacus transcripts were
identified by homology to annotated Pteropus vampyrus genes. Most of the top 15 genes differing
between human and bat cells after infection with EBOV and MARV are shut down completely in
either human or bat cells. No gene, except RELN, is part of Tab. A.5 or Tab. A.6, indicating that
these genes are not differentially expressed during infection, but rather reflect general differences
between the cell lines HuH7 and R06E-J. The genes are associated with calcium-regulated pathways
(ATP2B4), acyl-CoA pathways (ACADSB), transcription factors (HNF4A), adenylate kinase (AK4),
possibly for nucleotide synthesis, the cell cycle (CCND2), keratins, i.e. fibrous proteins forming a
structural framework (KRT5, KRT75), or are involved in actin pathways (ACTA2). FC – log2 fold
change based on DESeq normalized read counts; norm_reads – DESeq normalized read counts;
EV – EBOV; MV – MARV; Mo – Mock. For the complete table, see the electronic supplement.

Gene Samples FC norm_reads


Human cells Bat cells
Human vs. Bat after EBOV infection
RELN 23 h -14.63 32405.90 1.28
ATP2B4 7h -13.73 13614.49 0.00
ACADSB 7 h -13.59 12305.64 0.00
HNF4A 7h -13.24 9692.00 0.00
CCND2 23 h 13.18 0.00 9253.02
TRIM71 7h -13.07 8592.89 0.00
AK4 7h -12.97 8049.84 0.00
ACTA2 7h 12.95 1.63 12899.06
DAB2 23 h -12.65 6415.12 0.00
COCH 3h -12.54 5956.65 0.00
KRT5 23 h 12.52 0.00 5888.52
KRT75 3h 12.39 0.00 5369.48
BMP2 23 h -12.38 5336.06 0.00
SULT1C4 23 h -12.19 4672.84 0.00
CXCL10 23 h 12.15 0.00 4559.14
Human vs. Bat after MARV infection
ACTA2 3h 14.20 0.00 18826.36
ATP2B4 7h -13.71 13373.93 0.00
HNF4A 7h -13.39 10724.09 0.00
CCND2 23 h 13.38 1.09 11619.63
AK4 23 h -13.13 8965.17 0.00
RELN 23 h -13.12 8899.93 0.00
KRT5 23 h 13.12 0.00 8915.03
KRT75 23 h 13.02 1.09 9033.99
TRIM71 7h -12.95 7886.39 0.00
ACADSB 23 h -12.74 5964.10 0.87
MAGED1 3 h 11.86 0.00 3719.89
PTPRZ1 3h 11.82 0.00 3607.53
COCH 23 h -10.01 6290.30 6.12
PDPN 7h 11.02 0.00 2077.09
BMP2 7h -11.75 3446.80 0.00

Table A.8: Comparison of human and bat cells (EBOV and MARV as replicates) infected with filoviruses (3h, 23h). Although we observed
various differences in gene expression profiles between EBOV- and MARV-infected cells, both infections share the same disease symptoms. To find genes
that are differentially expressed between human and bat during filovirus infection, we treated EBOV and MARV samples (from the same time point)
as replicates for DESeq analysis (padj ≤ 0.1). Genes are sorted by the maximum fold change of 3 h and 23 h p.i. More than half of the top 30 genes are
related to actin, connective tissues and cell-cell interaction. Since we observed these massive differences also between human-Mock cells and bat-Mock
cells, they might originate from the differences between the cell lines HuH7 and R06E-J. To overcome this cell line artifact, we removed genes differentially expressed
between the Mock samples (human vs. bat cells, padj < 0.1) and list 30 manually selected genes in Tab. A.9. Moreover, we used EBOV and MARV samples
at same time points as replicates to analyze the impact of filovirus infection compared to Mock in the human cell line (Tab. A.10). FC – log2 fold change
based on DESeq normalized read counts; norm_reads – DESeq normalized read counts; read_max – maximum number of reads mapping to one nucleotide
position of this gene; EV – EBOV; MV – MARV. For the complete table, see the electronic supplement. Genes related to actin, connective tissues and
cell-cell interaction are marked.
norm_reads read_max
human bat human bat
Gene Sample FCmax EV+MV EV+MV EV MV EV MV Function
COL5A1 23 h 16.39 0.42 36585.11 0 2 904 1288 connective tissues
ATP1A3 23 h 16.25 0.48 37903.82 1 0 2524 2370 cation Na+ /K+ transport
ACTA1 23 h 15.82 1.45 84020.59 2 0 15143 18700 actin, alpha skeletal muscle
COL6A3 23 h 15.26 0.42 16721.46 1 1 307 369 connective tissues
EEF1A2 3h 15.15 6.1 221259.05 2 3 31523 30185 Elongation factor 1-alpha 2
CCND2 23 h 14.96 0.42 13558.56 0 1 1790 2633 cell cycle
MYO10 3h 14.45 0.34 7571.73 24 37 285 178 actin-based, filopodia
RELN 23 h -14.19 15418.01 0.82 502 234 1 0 cell-cell interaction
ACADL 3h 13.91 0.34 5231.1 0 1 567 492 Acyl CoA
PTK7 23 h 13.7 1.33 17812.49 1 1 694 930 tyrosine protein kinase
COL4A2 3h 13.7 0.34 4529.52 0 1 156 108 connective tissues
GPM6A 23 h 13.4 0.42 4589.59 1 2 877 942 membrane glycoprotein
KRT75 23 h 13.39 0.91 9771.54 12 9 904 1621 extracellular matrix
MAP3K13 23 h -13.32 11719.28 1.15 2743 2328 1 2 serine/threonine kinase
ACTA2 3h 13.22 2.64 25179.38 10 12 2882 3400 actin, alpha smooth muscle
RASA3 3h 13.22 0.44 4196.36 2 0 303 202 GTPase activating
ACTG2 23 h 13.13 1.27 11391.44 0 2 1112 1901 actin, cytoskeleton
KIT 23 h 13.12 0.48 4306.14 1 0 214 221 cytokine receptor
PXDN 23 h 13.11 0.97 8596.75 5 2 320 625 peroxidasin homolog
ADAM12 3h 13.11 0.88 7802.54 3 7 395 396 cell-cell interaction
SPG20 23 h 13.06 0.42 3621.9 1 1 230 265 microtubulin, GTP
CACNA2D1 3h 13.0 0.44 3615.05 9 6 191 184 Ca2+ channel complex
LOXL1 3h 12.9 0.88 6733.85 2 0 529 464 connective tissues
HTR1D 3h 12.89 0.44 3351.02 1 0 322 496 serotonin receptor
PTPN13 3h 12.88 2.0 15025.71 10 12 418 312 cytoskeleton, GTPase
SLC26A5 3h -12.77 4089.95 0.59 1085 1480 1 1 prestin, motor protein
IQGAP2 3h -12.72 7910.54 1.17 440 500 2 1 Ras-GTPase
COL1A1 3h 12.72 14.31 96443.9 4 4 3277 1784 connective tissues
TMEM47 3h 12.72 0.34 2285.25 0 1 377 529 transmembrane protein
KCNA4 3h 12.69 1.02 6731.18 1 2 333 251 hexokinase
Table A.9: Comparison of filovirus-infected human and bat samples (EBOV and MARV as replicates; 3 h, 23 h). To find
genes that are differentially expressed between human and bat during filovirus infection, we treated EBOV and MARV samples (from the same time point)
as replicates for DESeq analysis (padj ≤ 0.1). We reduced the influence of the different cell types by removing all genes from the initial list (Tab. A.8)
which were also detected as significantly differentially expressed between Mock samples (Mock 3 h, 7 h and 23 h used as replicates for human and bat samples, respectively). Examples in this list
are manually selected from both lists. Genes are sorted by the maximum fold change of 3 h and 23 h p.i.
Rk – Rank/position in the corresponding sample list. Abbreviations as in Tab. A.8. For the complete table, see the electronic supplement.

norm_reads read_max
human bat human bat
Gene Sample Rk FCmax EV+MV EV+MV EV MV EV MV Function
ALPK3 23 h 1 -5.94 3121.47 50.99 177 58 11 5 kinase, adenovirus related
ARHGAP20 23 h 2 5.82 0.85 48.12 16 12 15 8 GTPase activated protein
SCN4A 23 h 3 5.17 1.7 61.25 2 2 10 14 sodium channel
TCTEX1D4 3h 1 4.94 0.44 13.51 10 2 4 5 connecting phosphatase
OSGIN1 3h 2 -4.47 513.5 23.1 58 176 12 44 oxidative stress, inhibits growth
SLC12A3 23 h 4 4.3 14.62 287.7 71 145 20 40 sodium chloride carrier
SLC16A11 23 h 5 4.29 1.39 27.19 3 2 6 6 carrier monocarboxylate
CCDC78 23 h 6 -4.23 30.8 1.65 6 3 1 2 unknown function
IGSF6 23 h 7 -4.22 30.75 1.65 5 5 3 2 immunoglobulin, inflammatory
UNC13A 23 h 8 4.17 0.97 17.41 7 9 11 8 vesicle, exocytosis

NEIL1 23 h 9 -3.9 42.93 2.87 6 5 5 4 endonuclease, modulated by virus
METRN 3h 4 -3.85 31.13 2.16 11 11 4 14 cell differentiation
ELN 23 h 10 3.79 0.97 13.37 3 2 4 6 elastin, cell-cell
SLC40A1 23 h 11 -3.78 6409.48 465.85 755 1770 59 82 carrier, iron
C11orf52 23 h 12 -3.73 22.87 1.72 10 10 1 3 together with HSP transcribed
SLC10A1 23 h 13 -3.73 26.07 1.97 5 3 5 4 carrier, NA2+ , entry point HBV/HDV
IGSF6 3h 5 -3.72 19.03 1.44 5 5 3 2 immunoglobulin
MAP6 23 h 15 3.56 0.97 11.48 13 3 6 6 microtubule associated protein
TMEM27 23 h 16 -3.55 23.1 1.97 29 14 4 3 transmembrane
TMOD4 23 h 17 3.53 10.37 119.62 4 6 17 29 tropomodulin, related muscle actin
GRIN2D 23 h 19 -3.25 63.38 6.66 10 7 5 5 glutamate receptor
CLEC4A 23 h 20 -3.24 24.01 2.54 6 9 7 4 cell-cell, immune system
UBC 23 h 24 -3.14 31161.4 3539.27 12040 4245 774 554 ubiquitin
CEP72 23 h 25 -3.11 1003.72 116.45 115 52 21 23 microtubuli, centromer
MAST4 23 h 26 3.11 602.04 5186.53 31 44 168 84 microtubuli
ELF3 23 h 27 -3.1 1012.45 118.16 218 66 19 30 TF, effector of ERBB2 pathway
GLDN 23 h 28 -3.06 28.62 3.44 7 7 21 5 Ranvier nodes along myelinated axons
TRAF4 3h 15 -2.32 1743.78 348.73 731 201 97 136 activation of NFκB + MAPKs
PLIN2 23 h 54 -2.24 4646.53 983.85 1199 473 152 209 lipid storage
TRIB1 3h 17 -2.18 909.52 201.26 296 147 118 70 Ser/Thr protein kinase
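
The filtering step behind Table A.9 amounts to a set difference followed by a ranking on the absolute fold change. The sketch below illustrates this with hypothetical gene identifiers and values; it does not reproduce the DESeq analysis itself, and the function names are illustrative.

```python
# Sketch: remove genes that already differ between the Mock samples of both
# cell lines, then rank the remaining genes by absolute log2 fold change.
# All gene names and numbers below are hypothetical.

def remove_cell_line_effects(infected_deg, mock_deg):
    """infected_deg: dict gene -> log2 fold change (human vs. bat, infected).
    mock_deg: set of genes differentially expressed between the Mock samples."""
    return {gene: fc for gene, fc in infected_deg.items() if gene not in mock_deg}

def rank_by_abs_fold_change(deg):
    """Return gene names sorted by decreasing absolute fold change."""
    return sorted(deg, key=lambda gene: abs(deg[gene]), reverse=True)

infected = {"geneA": 16.4, "geneB": -5.9, "geneC": 5.8, "geneD": -14.2}
mock = {"geneA", "geneD"}

filtered = remove_cell_line_effects(infected, mock)
print(rank_by_abs_fold_change(filtered))
```

Ranking on the absolute value matches the order of Table A.9, where negative and positive fold changes alternate by magnitude.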
Table A.10: Comparison of filovirus infection to Mock samples (EBOV and MARV as replicates). Comparison of filovirus (EBOV and MARV
treated as replicates) infected samples at 23 h p.i. against Mock samples (3 h, 7 h and 23 h treated as replicates) of human cell samples (padj < 0.1), to
find genes differentially expressed in both filovirus-infected cells compared to Mock. Genes sorted by the maximum fold change and filtered manually for
interesting hits. Abbreviations as in Tab. A.9. For the complete table, see the electronic supplement.
norm_reads read_max
Gene Rank FCmax MO3,7,23 EV23+MV23 MOread_max EVread_max MVread_max Function
SBK3 1 4.68 2.02 51.73 2 11 6 kinase
SULT1E1 2 -4.56 25.56 1.09 6 7 4 sulfotransferase
PLAU 4 -3.91 175.46 11.7 33 29 23 urokinase, degra. of ex. matrix
FMNL1 8 3.67 35.67 455.05 5 47 25 cytokinesis
ANXA3 20 2.79 226.65 1562.27 66 296 173 cell growth
MYCNOS 21 -2.74 296.44 44.27 73 57 75 viral related oncogene

MYCN 31 -2.44 3825.08 702.93 373 340 322 transcription factor


CYP1A1 32 -2.44 2652.15 489.17 236 395 562 cytochrome p450, electron
GDF15 43 2.24 404.62 1915.81 84 431 332 cell growth, inflammation
PEG10 53 -2.09 148275.9 34921.14 6192 7865 6568 retrotransposon-derived protein
SKP2 66 -1.92 14370.05 3798.2 1621 1233 1536 s-phase kinase-associated
Table A.11: Expression of genes involved in IFN-induction and -signaling. The IFN
signaling pathway and the induced antiviral effector proteins are important antiviral defence
mechanisms [332]. We checked the genes involved in IFN signal transduction, immune/antiviral
response and ISGylation for differential expression during EBOV and MARV infection. We
found many genes not to be expressed (IFIH1, IRF7, GBP1, IFI16, IFI27, IFI35, IFI44, IFI44L,
IFITM1, IFITM2, OAS1, OAS2, OAS3, OASL, TRIM21 and HERC6 ). However, several genes
were up-regulated between 3 h and 7 h p.i. and down-regulated between 7 h and 23 h p.i. (STAT-1,
STAT-2, ADAR, IFIT1, IFIT5, MX1, MX2, TRIM22, TRIM25, UBE2L6 and USP18 ). Listed
genes were selected according to Weber et al. [332]. First characters refer to the expression between
3 h and 7 h p.i., second characters to the expression between 7 h and 23 h p.i. Numbers correspond
to the read maximum of the sample. ↑ – up-regulated; ↓ – down-regulated; = – equal expression;
0 – no expression. Numbers preceding arrows indicate up-/down-regulation by more than 200 %
(2 – 200 %, 3 – 300 % and so on).

Category/gene Human Bat


MOCK EBOV MARV MOCK EBOV MARV
IFN signal transduction
DDX58 (RIG-I) =↑ 112 ↑↑ 156 =↑ 133 00 4 00 8 4 ↑↓ 11
IFIH1 (MDA5) 00 – 00 – 00 – NA – NA – NA –
IRF7 00 – 00 – 00 – NA – NA – NA –
IRF9 ↑= 53 ↑↑ 123 ↑↓ 74 NA – NA – NA –
NMI ↑2 ↓ 24 == 18 =↓ 21 ↓↓ 244 ↓= 179 == 184
STAT-1 =↓ 343 ↑↓ 389 =2 ↓ 433 == 200 =↑ 231 ↑= 250
STAT-2 ↑= 76 2 ↑2 ↓ 88 ↑= 79 == 74 ↓= 50 ↑= 84
STAT-3 ↑↓ 166 ↑= 203 ↑↓ 183 == 242 =↑ 318 ↑= 256
Immune/antiviral response
ADAR == 1041 ↑2 ↓ 1320 =↓ 1329 ↓= 164 ↓↑ 118 ↑↑ 150
EIF2AK2 (PKR) ↓↓ 122 =↑ 107 =↓ 111 NA – NA – NA –
GBP1 00 – 00 – 00 – NA – NA – NA –
IFI16 00 – 00 – 00 – NA – NA – NA –
IFI27 00 – 00 – 00 – NA – NA – NA –
IFI35 ↓= 199 =↑ 275 ↑= 325 ↑0 16 ↑↓ 11 ↑↑ 15
IFI44 00 – 00 – 00 – NA – NA – NA –
IFI44L 00 – 00 – 00 – NA – NA – NA –
IFIT1 ↑↑ 260 ↑4 ↓ 245 ↑↓ 273 NA – NA – NA –
IFIT5 ↓↓ 59 =3 ↓ 41 =↓ 55 NA – NA – NA –
IFITM1 00 – 00 – 00 – NA – NA – NA –
IFITM2 00 – 00 – 00 – NA – NA – NA –
IFITM3 =↓ 15 03 ↑ 24 =↓ 14 NA – NA – NA –
MX1 00 – ↑↓ 22 ↑2 ↓ 22 =↑ 150 ↓↑ 159 2 ↑↑ 268
MX2 ↑= 241 ↑↓ 271 ↑↓ 343 ↑= 134 ↓↑ 156 2 ↑↑ 256
OAS1 00 – 00 – 00 – NA – NA – NA –
OAS2 00 – 00 – 00 – 00 – 00 – 00 –
OAS3 00 – 00 – 00 – NA – NA – NA –
OASL 00 – 00 – 00 – ↓2= 15 00 9 ↑↑ 19
PLSCR1 ↓↑ 51 ↑↓ 47 ↑= 48 NA – NA – NA –
RSAD2 (Cig5) 00 – 00 – 00 – 00 – 00 – 00 –
SP100 2 ↑= 26 ↑↑ 44 == 34 ↓↑ 84 =↑ 93 ↑↑ 115
PMP22 ↑↓ 15 ↑↑ 16 =↓ 17 NA – NA – NA –
Ubiquitylation and ISGylation
HERC5 ↓= 14 ↑= 16 == 11 ↓↑ 24 ↓↑ 23 ↑= 23
HERC6 00 – 00 – 00 – ↓↓ 45 ↓= 46 == 40
ISG15 00 – =4 ↑ 36 0= 11 NA – NA – NA –
UBE2L6 ↑= 89 ↑↓ 76 ↑3 ↓ 117 ↓↑ 20 ↑↓ 22 3 ↓3 ↑ 24
USP18 ↑= 70 ↑2 ↓ 106 ↑↓ 77 NA – NA – NA –


Table A.12: The most regulated TRIM genes. TRIM proteins were recently reviewed by
Ozato et al.[339]. They represent a superfamily of tripartite motif-containing proteins with more
than 60 members from which several are known to be required for the restriction of lentivirus
infections. Based on their emerging role in innate immunity, we investigated their features. We
identified at least 11 TRIM genes (TRIM2, 6, 8, 15, 16L, 25, 32, 34, 38, 45, 47, 54, 67, 71 ) to be
differentially regulated. TRIM14, 21 and 22 were not reported to be differentially expressed, but
show interesting features in a small level of transcripts (see electronic supplement). Classical fold
change values are reported in the electronic supplement. EV – EBOV; hum – human; read_max
– maximum number of reads mapping to one nucleotide position of this gene.

read_max
TRIM Sample 3h 7h 23 h Remarks
TRIM2 hum-EV 143 184 164 TRIM2 localizes to cytoplasmic filaments
bat-EV 106 99 110
TRIM6 hum-EV 85 104 57 Down-regulation for EBOV 23 h, a read-through transcript
from this gene into the downstream TRIM34 gene has been
observed, which is here not the case
bat-EV NA NA NA
TRIM8 hum-EV 120 116 383 TRIM8 localizes to nuclear bodies; strong up-regulation for
EBOV 23 h
bat-EV 437 648 523
TRIM14 hum-EV 107 160 223
bat-EV 15 23 24
TRIM15 hum-EV 10 12 33 TRIM15 localizes to the cytoplasm
bat-EV NA NA NA
TRIM16L hum-EV 32 26 15
bat-EV 109 119 200 putative homolog
TRIM21 hum-EV <10 <10 <10
bat-EV 45 48 62
TRIM22 hum-EV 84 160 83
bat-EV 60 68 73
TRIM25 hum-EV 80 103 26 TRIM25 localizes to the cytoplasm; interacts with DDX58 ;
similar pattern after MARV infection, containing mir-3614 in
3’UTR
bat-EV 319 255 299 a much higher and constant level of transcription than human
cells
TRIM32 hum-EV 65 63 34 TRIM32 localizes to cytoplasmic bodies; Mock 23 h
& EBOV 23 h down-regulated, MARV 23 h up-regulated
(read_max:142)
bat-EV 128 111 120
TRIM34 hum-EV 9 13 11 here no read-through transcript from the upstream TRIM6
gene
bat-EV NA NA NA
TRIM38 hum-EV 14 15 14 almost no expression
bat-EV NA NA NA
TRIM45 hum-EV 11 19 15 TRIM45 may function as a transcriptional repressor of the
mitogen-activated protein kinase pathway; almost no expression
bat-EV 46 30 54 putative homolog
TRIM47 hum-EV 12 18 17 almost no expression
bat-EV 26 23 25 putative homolog
TRIM54 hum-EV 0 0 0 may be important for the regulation of titin kinase and
microtubule-dependent signal pathways in striated muscles; no
expression
bat-EV NA NA NA
TRIM67 hum-EV 17 39 41 up-regulated in EBOV 7 h
bat-EV NA NA NA
TRIM69 hum-EV 10 14 18 Only the first two exons are transcribed, possibly a splice vari-
ant
bat-EV NA NA NA No homolog in Pva and Rae
TRIM71 hum-EV 506 860 283 E3 ubiquitin protein ligase; MARV-infected cells stay at about
read_max=750
bat-EV 0 0 0

Appendix B

The Dark Art of de novo transcriptome assembly

Further supplemental material, including descriptions of the used RNA-Seq
data sets and assembly tools, detailed tables and figures of all evaluations
conducted with rnaQUAST, TransRate, DETONATE, and BUSCO, as well as runtime
and memory information, is available at:
http://www.rna.uni-jena.de/supplements/the_dark_art/

Table B.1: Selected metrics based on the output of rnaQUAST, Hisat2, DETONATE, TransRate and BUSCO for the transcripts assembled by all ten tools
on the Escherichia coli RNA-Seq data set. Details and many more statistics complementing this evaluation can be found in the electronic supplement,
content S4–S8. In each row the top three values are indicated with bold italic. The RSEM-EVAL score is multiplied by 10^9. The number of ambiguous bases
is given in thousands. N50 – the length of the shortest contig in the assembly, so that the accumulated bases of all contigs of this length or longer cover
50 % of all the bases in the assembly. F1 score – a measure of a test’s accuracy. An F1 score of 1 would mean that all nucleotides/contigs in the estimated
true assembly were recovered with at least 90 % identity. KC score – k-mer compression score reflecting the similarity of each assembly to DETONATE's
estimated “true” assembly.


Trans-ABySS Oases SOAP-Trans Trinity IDBA-Tran Shannon Bridger BinPacker SPAdes-sc SPAdes-rna
k-mer size 25,35,45,55,65 25,35,45,55,65 default default 25,35,45,55,65 default default default default default
Hisat2
Overall mapping rate 85.56 49.16 95.7 56.62 87.35 71.09 34.31 76.69 87.66 87.91
rnaQUAST
Transcripts > 1000 bp 367 743 460 439 363 66 414 372 486 51
Database coverage 0.33 0.33 0.04 0.23 0.32 0.01 0.18 0.32 0.19 0.21
Misassemblies 13 100 20 1 52 8 0 54 12 3
Mismatches per transcript 0.39 0.71 0.28 0.27 2.05 0.19 0.28 0.13 0.17 0.07
Average alignment length 469.99 457.65 308.92 364.63 472.64 1100.77 540.83 482.56 552.48 194.25
Mean isoform coverage 0.65 0.58 0.22 0.45 0.64 0.73 0.51 0.64 0.51 0.39
TransRate
N50 560 1051 1039 958 582 1809 680 598 755 7795
Reference coverage 0.32 0.31 0.03 0.19 0.31 0.01 0.16 0.3 0.17 0.18
Mean ORF percentage 75.13 65.72 72.56 71.99 74.59 45.13 71.99 73.34 69.35 82.09
Optimal score NA NA NA NA NA NA NA NA NA NA
Percentage good mappings NA NA NA NA NA NA NA NA NA NA
Percentage bases uncovered NA NA NA NA NA NA NA NA NA NA
Number of ambiguous bases 2255 3602 3153 2579 2117 191 2069 2145 2378 2238
DETONATE
Nucleotide F1 0.63 0.49 0.6 0.74 0.64 0.06 0.66 0.62 0.71 0.57
Contig F1 0.03 0.03 0.03 0.06 0.03 0 0.03 0.03 0.06 0.03
KC score 0.82 0.79 0.82 0.86 0.81 0.16 0.83 0.81 0.88 0.65
RSEM EVAL -1.75 -2.45 -1.48 -3.4 -1.74 -2.62 -4.37 -1.97 -2 -1.76
BUSCO
Complete single-copy 258 136 234 316 281 48 296 261 332 96
Missing BUSCOs 189 172 166 178 190 711 196 198 172 370
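
The N50 definition given in the caption of Table B.1 translates directly into a short computation. The following sketch is for illustration only and is not the implementation used by TransRate; the example contig lengths are made up.

```python
# Sketch: N50 as defined in the caption, i.e. the length of the shortest
# contig such that all contigs of this length or longer together cover
# at least 50 % of the bases in the assembly.

def n50(contig_lengths):
    """Return the N50 value for a list of contig lengths."""
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest to the shortest contig and accumulate bases.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("N50 is undefined for an empty assembly")

# Hypothetical assembly with 1500 bases in total
print(n50([100, 200, 300, 400, 500]))
```

Here the contigs of length 500 and 400 together hold 900 of the 1500 bases, so the N50 is 400.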
Table B.2: Selected metrics based on the output of rnaQUAST, Hisat2, DETONATE, TransRate and BUSCO for the transcripts assembled by all ten tools
on the Candida albicans RNA-Seq data set. Details and many more statistics complementing this evaluation can be found in the electronic supplement,
content S4–S8. In each row the top three values are indicated with bold italic. The RSEM-EVAL score is multiplied by 10^9. The number of ambiguous bases
is given in thousands. N50 – the length of the shortest contig in the assembly, so that the accumulated bases of all contigs of this length or longer cover
50 % of all the bases in the assembly. F1 score – a measure of a test’s accuracy. An F1 score of 1 would mean that all nucleotides/contigs in the estimated
true assembly were recovered with at least 90 % identity. KC score – k-mer compression score reflecting the similarity of each assembly to DETONATE's
estimated “true” assembly.

Trans-ABySS Oases SOAP-Trans Trinity IDBA-Tran Shannon Bridger BinPacker SPAdes-sc SPAdes-rna
k-mer size 25,35,45,55,65 25,35,45,55,65 default default 25,35,45,55,65 default default default default default
Hisat2
Overall mapping rate 97.29 93.61 98.56 95.12 96.72 96.66 86.34 96.51 97.5 96.98
rnaQUAST
Transcripts > 1000 bp 3504 9813 8384 2769 4038 4267 2949 4077 3496 3482
Database coverage 0.52 0.6 0.66 0.56 0.48 0.47 0.53 0.54 0.49 0.52
Misassemblies 34 613 667 9 236 259 8 362 64 39
Mismatches per transcript 1.55 2.12 1.34 0.57 2 2.28 0.81 1 1.82 1.24
Average alignment length 957.38 941.6 798.11 567.17 1131.33 1189.55 922.1 678.95 1279.36 843.75
Mean isoform coverage 0.82 0.82 0.84 0.76 0.81 0.82 0.77 0.77 0.85 0.81
TransRate
N50 1573 1629 1666 1349 1824 1846 1315 1093 2105 1911
Reference coverage 0.2 0.24 0.39 0.2 0.17 0.17 0.2 0.3 0.19 0.2
Mean ORF percentage 82.35 79.54 81.25 84.15 79.4 78.65 83.37 87.69 76.75 78.96
Optimal score 0.45 0.06 0.02 0.48 0.36 0.37 0.41 0.05 0.54 0.53
Percentage good mappings 0.75 0.11 0.05 0.83 0.67 0.66 0.72 0.16 0.88 0.87
Percentage bases uncovered 0.2 0.87 0.83 0.02 0.31 0.36 0 0.48 0 0.01
Number of ambiguous bases 9753 26010 23615 8947 10651 11072 8652 12793 9179 9538
DETONATE
Nucleotide F1 0.72 0.51 0.58 0.73 0.74 0.73 0.73 0.64 0.74 0.74
Contig F1 0.08 0.08 0.08 0.06 0.06 0.07 0.05 0.07 0.06 0.07
KC score 0.68 0.62 0.75 0.6 0.68 0.68 0.55 0.72 0.7 0.66
RSEM EVAL -3.39 -4.19 -3.37 -4.48 -3.54 -3.55 -6.11 -3.56 -3.78 -4
BUSCO
Complete single-copy 1279 611 348 1039 1149 1146 1069 460 1510 1458
Missing BUSCOs 162 140 109 248 133 135 240 359 84 93
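
The N50 definition given in the table captions can be illustrated with a minimal sketch. The function name `n50` and the toy contig lengths are assumptions made for illustration only; they are not taken from TransRate or from the assemblies evaluated here.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly, given the lengths of its contigs.

    N50 is the length of the shortest contig such that all contigs of
    this length or longer together cover at least 50 % of the assembly.
    """
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig downwards and stop as soon as the
    # accumulated length reaches half of the total assembly length.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy example (hypothetical lengths): 1500 bases in total;
# 500 + 400 = 900 is the first running sum that reaches 750.
toy_lengths = [100, 200, 300, 400, 500]
print(n50(toy_lengths))
```

Because N50 depends only on the length distribution, a few very long or chimeric contigs can raise it, so it is best read together with the misassembly and coverage rows of the same table.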

Appendix B. The Dark Art of de novo transcriptome assembly

Table B.3: Selected metrics based on the output of rnaQUAST, Hisat2, DETONATE, TransRate and BUSCO for the transcripts assembled by all ten
tools on the Arabidopsis thaliana RNA-Seq data set. Details and many more statistics complementing this evaluation can be found in the electronic
supplement, content S4–S8. In each row, the top three values are indicated in bold italic. The RSEM-EVAL score is multiplied by 10^9. The number of
ambiguous bases is given in thousands. N50 – the length of the shortest contig in the assembly such that the accumulated bases of all contigs of this length
or longer cover 50 % of all the bases in the assembly. F1 score – a measure of a test's accuracy. An F1 score of 1 would mean that all nucleotides/contigs
in the estimated true assembly were recovered with at least 90 % identity. KC score – k-mer compression score reflecting the similarity of each assembly to
DETONATE's estimated “true” assembly.


Trans-ABySS Oases SOAP-Trans Trinity IDBA-Tran Shannon Bridger BinPacker SPAdes-sc SPAdes-rna
k-mer size 25,35,45,55,65 25,35,45,55,65 default default 25,35,45,55,65 default default default default default
Hisat2
Overall mapping rate 92.99 91.93 97.74 85.31 90.64 67.15 89.04 51.53 94.86 94.92
rnaQUAST
Transcripts > 1000 bp 15268 32647 20100 8709 12882 3775 9577 12018 11154 9063
Database coverage 0.31 0.32 0.33 0.3 0.26 0.05 0.29 0.25 0.28 0.29
Misassemblies 1208 6804 8055 51 1995 1233 201 769 1223 106
Mismatches per transcript 1.27 3.47 0.62 2.43 5.9 0.72 0.42
0.1 0.14 0.15
Average alignment length 979.22 1232.69 740.81 561.75 990.35 1312.17 848.36 1044.9 936.65 566.67
Mean isoform coverage 0.68 0.73 0.64 0.59 0.67 0.75 0.65 0.7 0.68 0.63
TransRate
N50 1517 1832 1654 1318 1628 1879 1282 1607 1633 1389
Reference coverage 0.18 0.19 0.21 0.15 0.13 0.03 0.15 0.15 0.15 0.14
Mean ORF percentage 74.59 72.07 72.04 80.78 74.66 66.84 80.71 76.08 74.57 80.49
Optimal score NA NA NA NA NA NA NA NA NA NA
Percentage good mappings NA NA NA NA NA NA NA NA NA NA
Percentage bases uncovered NA NA NA NA NA NA NA NA NA NA
Number of ambiguous bases 40711 78762 54460 28632 33008 9081 27666 30178 29654 28079
DETONATE
Nucleotide F1 0.68 0.42 0.58 0.77 0.72 0.26 0.76 0.63 0.79 0.74
Contig F1 0.05 0.03 0.04 0.04 0.04 0 0.03 0.04 0.06 0.04
KC score 0.71 0.66 0.75 0.62 0.66 0.48 0.66 0.37 0.72 0.7
RSEM EVAL -5.34 -6.01 -4.5 -7.33 -5.4 -9.63 -6.71 -1.34 -5.78 -5.38
BUSCO
Complete single-copy 858 546 732 1042 978 203 908 804 1053 859
Missing BUSCOs 222 248 224 248 229 1162 269 296 224 357
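
The F1 score named in the captions is conventionally the harmonic mean of precision and recall. The sketch below shows only that arithmetic; the function name `f1_score` and the example values are assumptions for illustration and do not reproduce how DETONATE derives precision and recall from alignments.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall, both given in [0, 1]."""
    if precision + recall == 0:
        # No correct predictions at all: define F1 as 0 instead of
        # dividing by zero.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical values: an assembly recovering every reference base
# (recall 1.0) while only half of its own bases are correct
# (precision 0.5) is pulled towards the weaker of the two numbers.
print(round(f1_score(0.5, 1.0), 2))
```

The harmonic mean penalises an imbalance between the two components, so an assembly cannot reach a high F1 by being either very large or very conservative.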
Table B.4: Selected metrics based on the output of rnaQUAST, Hisat2, DETONATE, TransRate and BUSCO for the transcripts assembled by all ten tools
on the Mus musculus RNA-Seq data set. Details and many more statistics complementing this evaluation can be found in the electronic supplement,
content S4–S8. In each row, the top three values are indicated in bold italic. The RSEM-EVAL score is multiplied by 10^9. The number of ambiguous bases
is given in thousands. N50 – the length of the shortest contig in the assembly such that the accumulated bases of all contigs of this length or longer cover
50 % of all the bases in the assembly. F1 score – a measure of a test's accuracy. An F1 score of 1 would mean that all nucleotides/contigs in the estimated
true assembly were recovered with at least 90 % identity. KC score – k-mer compression score reflecting the similarity of each assembly to DETONATE's
estimated “true” assembly.

Trans-ABySS Oases SOAP-Trans Trinity IDBA-Tran Shannon Bridger BinPacker SPAdes-sc SPAdes-rna
k-mer size 25,35,45,55,65 25,35,45,55,65 default default 25,35,45,55,65 default default default default default
Hisat2
Overall mapping rate 92.83 89.29 94 90.98 91.66 54.31 70.64 86.6 94.8 91.98
rnaQUAST
Transcripts > 1000 bp 26155 51832 32136 12603 18266 2037 12294 16915 2630 8692
Database coverage 0.22 0.02 0.25 0.1 0.08 0.01 0.1 0 0.01 0.09
Misassemblies 453 52668 690 61 2643 628 41 1927 248 30
Mismatches per transcript 0.7 0.63 0.32 0.16 0.91 4.61 0.19 0.8 33.2 0.13
Average alignment length 1296.66 406.08 789.42 519.96 1073.4 1972.81 811.99 877.71 636.58 356.91
Mean isoform coverage 0.66 0.25 0.62 0.38 0.47 0.81 0.43 0.22 0.31 0.35
TransRate
N50 2794 1622 2571 2678 2771 3467 1501 1879 2093 2356
Reference coverage 0.23 0.02 0.25 0.1 0.1 0.02 0.09 0 0.09 0.09
Mean ORF percentage 51.94 45.36 53.12 51.45 49.56 38.86 54.28 55.04 50.33 60.82
Optimal score 0.15 0.02 0.09 0.4 0.18 0.13 0.29 0.18 0.37 0.43
Percentage good mappings 0.35 0.06 0.21 0.74 0.42 0.3 0.51 0.38 0.72 0.73
Percentage bases uncovered 0.62 0.91 0.66 0.12 0.41 0.6 0.02 0.43 0.03 0.09
Number of ambiguous bases 91174 188947 111356 53611 66954 6514 45526 52948 51123 40869
DETONATE
Nucleotide F1 0.4 0.19 0.39 0.51 0.45 0.07 0.5 0.38 0.52 0.42
Contig F1 0.01 0 0.02 0.02 0.01 0 0.01 0.01 0.01 0.01
KC score 0.69 0.47 0.69 0.55 0.6 0.28 0.53 0.56 0.68 0.59
RSEM EVAL -2.26 -3.33 -2.14 -2.63 -2.37 -4.88 -5.02 -2.93 -3.43 -3.25
BUSCO
Complete single-copy 2323 1529 2020 3454 2836 217 2661 2333 3592 2582
Missing BUSCOs 1945 2476 1929 1987 1926 5871 2188 2534 1992 2822
Table B.5: Selected metrics based on the output of rnaQUAST, Hisat2, DETONATE, TransRate and BUSCO for the transcripts assembled by all ten tools
on the Homo sapiens + EBOV 3h RNA-Seq data set. Details and many more statistics complementing this evaluation can be found in the electronic
supplement, content S4–S8. In each row, the top three values are indicated in bold italic. The RSEM-EVAL score is multiplied by 10^9. The number of
ambiguous bases is given in thousands. N50 – the length of the shortest contig in the assembly such that the accumulated bases of all contigs of this length
or longer cover 50 % of all the bases in the assembly. F1 score – a measure of a test's accuracy. An F1 score of 1 would mean that all nucleotides/contigs
in the estimated true assembly were recovered with at least 90 % identity. KC score – k-mer compression score reflecting the similarity of each assembly to
DETONATE's estimated “true” assembly.


Trans-ABySS Oases SOAP-Trans Trinity IDBA-Tran Shannon Bridger BinPacker SPAdes-sc SPAdes-rna
k-mer size 25,35,45,55,65 25,35,45,55,65 default default 25,35,45,55,65 default default default default default
Hisat2
Overall mapping rate 89.32 86.92 94.02 89.8 89.95 36.6 78.41 90.23 92.55 92.37
rnaQUAST
Transcripts > 1000 bp 26903 98196 31481 13110 20434 1646 13508 28248 13366 12028
Database coverage 0.16 0.18 0.19 0.14 0.12 0.01 0.14 0.14 0.14 0.14
Misassemblies 2318 41408 11298 196 3690 592 281 3936 1639 718
Mismatches per transcript 1.15 2.81 1.56 8.4 0.87 1.35 1.18
0.84 0.47 0.86
Average alignment length 754.83 1263.54 505.05 464.16 880.23 2871.75 720.92 885.59 610.84 517.24
Mean isoform coverage 0.47 0.55 0.44 0.38 0.44 0.74 0.45 0.52 0.43 0.42
TransRate
N50 1358 2794 2172 2486 2540 5067 1092 2193 1463 1589
Reference coverage 0.09 0.12 0.12 0.07 0.07 0 0.07 0.11 0.07 0.07
Mean ORF percentage 57.54 48.7 55.56 52.67 51.85 44.28 54.47 58.17 51.23 55.05
Optimal score 0.22 0.02 0.11 0.36 0.2 0.05 0.43 0.05 0.58 0.59
Percentage good mappings 0.43 0.05 0.25 0.74 0.39 0.14 0.71 0.13 0.86 0.87
Percentage bases uncovered 0.48 0.94 0.59 0.11 0.48 0.83 0.01 0.69 0.02 0.01
Number of ambiguous bases 92866 318936 130492 62753 83187 6752 54494 104126 67514 62896
DETONATE
Nucleotide F1 0.49 0.21 0.45 0.56 0.49 0.05 0.55 0.39 0.6 0.58
Contig F1 0.02 0.01 0.05 0.04 0.01 0 0.01 0.03 0.01 0.02
KC score 0.44 0.4 0.56 0.45 0.51 0.22 0.44 0.52 0.51 0.55
RSEM EVAL -1.45 -1.72 -1.19 -1.34 -1.24 -2.95 -1.84 -1.26 -1.32 -1.24
BUSCO
Complete single-copy 1501 972 2309 3316 2477 106 2151 1145 3621 3629
Missing BUSCOs 2103 1892 1839 1934 1862 5976 2216 2096 1856 1975
Table B.6: Selected metrics based on the output of rnaQUAST, Hisat2, DETONATE, TransRate and BUSCO for the transcripts assembled by all ten tools
on the Homo sapiens + EBOV 7h RNA-Seq data set. Details and many more statistics complementing this evaluation can be found in the electronic
supplement, content S4–S8. In each row, the top three values are indicated in bold italic. The RSEM-EVAL score is multiplied by 10^9. The number of
ambiguous bases is given in thousands. N50 – the length of the shortest contig in the assembly such that the accumulated bases of all contigs of this length
or longer cover 50 % of all the bases in the assembly. F1 score – a measure of a test's accuracy. An F1 score of 1 would mean that all nucleotides/contigs
in the estimated true assembly were recovered with at least 90 % identity. KC score – k-mer compression score reflecting the similarity of each assembly to
DETONATE's estimated “true” assembly. The rnaQUAST statistics are missing for the Oases assembly because rnaQUAST crashed every time we executed
it on this data set.

Trans-ABySS Oases SOAP-Trans Trinity IDBA-Tran Shannon Bridger BinPacker SPAdes-sc SPAdes-rna
k-mer size 25,35,45,55,65 25,35,45,55,65 default default 25,35,45,55,65 default default default default default
Hisat2
Overall mapping rate 91.75 85.78 94.3 90.52 89.14 84.63 76.39 90.99 92.97 92.94
rnaQUAST
Transcripts > 1000 bp 34490 NA 38048 15251 23487 22928 16181 35785 16051 13995
Database coverage 0.18 NA 0.21 0.16 0.12 0.09 0.15 0.16 0.14 0.15
Misassemblies 2043 NA 7193 198 5132 5522 310 5070 2222 819
Mismatches per transcript 1.42 NA 0.88 0.49 1.5 3.77 0.91 1.49 1.19 0.88
Average alignment length 951.5 NA 518.79 454.55 807.15 2445.81 697.57 879.65 577.97 497.39
Mean isoform coverage 0.49 NA 0.46 0.38 0.42 0.67 0.45 0.53 0.42 0.42
TransRate
N50 2287 3352 2205 2422 2550 4065 1020 2287 1257 1359
Reference coverage 0.1 0.12 0.13 0.08 0.08 0.06 0.08 0.12 0.08 0.08
Mean ORF percentage 52.7 43.82 51.88 49.38 47.64 47.66 50.41 54.69 47.14 51.18
Optimal score 0.17 0.02 0.08 0.33 0.18 0.11 0.39 0.04 0.53 0.55
Percentage good mappings 0.34 0.04 0.19 0.71 0.37 0.3 0.67 0.1 0.82 0.84
Percentage bases uncovered 0.59 0.95 0.62 0.13 0.47 0.76 0.01 0.71 0.02 0.02
Number of ambiguous bases 135010 413418 161502 77171 101788 82239 67564 134108 84058 78158
DETONATE
Nucleotide F1 0.45 0.2 0.45 0.55 0.49 0.3 0.56 0.37 0.6 0.58
Contig F1 0.01 0.01 0.05 0.04 0.01 0 0.01 0.03 0.01 0.02
KC score 0.53 0.45 0.58 0.49 0.51 0.5 0.43 0.55 0.52 0.51
RSEM EVAL -1.73 -2.35 -1.63 -1.85 -1.8 -1.94 -2.79 -1.73 -2.02 -1.91
BUSCO
Complete single-copy 1873 704 1938 3362 2471 1909 2128 1108 3481 3606
Missing BUSCOs 1788 1817 1768 1855 1785 2392 2217 1955 1833 1873
Table B.7: Selected metrics based on the output of rnaQUAST, Hisat2, DETONATE, TransRate and BUSCO for the transcripts assembled by all ten tools
on the Homo sapiens + EBOV 23h RNA-Seq data set. Details and many more statistics complementing this evaluation can be found in the electronic
supplement, content S4–S8. In each row, the top three values are indicated in bold italic. The RSEM-EVAL score is multiplied by 10^9. The number of
ambiguous bases is given in thousands. N50 – the length of the shortest contig in the assembly such that the accumulated bases of all contigs of this length
or longer cover 50 % of all the bases in the assembly. F1 score – a measure of a test's accuracy. An F1 score of 1 would mean that all nucleotides/contigs
in the estimated true assembly were recovered with at least 90 % identity. KC score – k-mer compression score reflecting the similarity of each assembly to
DETONATE's estimated “true” assembly.


Trans-ABySS Oases SOAP-Trans Trinity IDBA-Tran Shannon Bridger BinPacker SPAdes-sc SPAdes-rna
k-mer size 25,35,45,55,65 25,35,45,55,65 default default 25,35,45,55,65 default default default default default
Hisat2
Overall mapping rate 93.68 70.05 95.55 93.1 92.43 81.57 48.37 31.29 94.33 92.34
rnaQUAST
Transcripts > 1000 bp 30896 83530 33314 13418 20560 10862 14196 15007 13993 12316
Database coverage 0.16 0.17 0.19 0.14 0.12 0.04 0.13 0.11 0.13 0.14
Misassemblies 1599 22306 4973 147 3735 3028 225 598 1704 565
Mismatches per transcript 1.38 2.6 1.49 3.8 0.94 1.29 0.96
0.89 0.51 0.63
Average alignment length 1040.23 1445.77 563.12 480.92 874.3 2974.03 724.8 699.82 623.06 527.92
Mean isoform coverage 0.49 0.56 0.46 0.38 0.43 0.72 0.45 0.47 0.43 0.42
TransRate
N50 2555 3512 2278 2483 2733 4709 1103 1419 1460 1528
Reference coverage 0.09 0.11 0.12 0.07 0.07 0.03 0.07 0.08 0.07 0.07
Mean ORF percentage 53.94 47.19 53.62 51.42 48.88 46.97 52.33 61.58 48.36 52.72
Optimal score 0.32 0.03 0.05 0.4 0.2 0.07 0.24 0.05 0.41 0.43
Percentage good mappings 0.51 0.11 0.13 0.81 0.41 0.33 0.42 0.12 0.63 0.66
Percentage bases uncovered 0.62 0.94 0.62 0.13 0.48 0.82 0.01 0.47 0.03 0.02
Number of ambiguous bases 119663 301780 133425 64547 87621 43913 57271 60817 71110 65429
DETONATE
Nucleotide F1 0.44 0.22 0.45 0.55 0.48 0.2 0.55 0.38 0.59 0.58
Contig F1 0.01 0.01 0.04 0.04 0.01 0 0.01 0.03 0.01 0.02
KC score 0.72 0.21 0.77 0.69 0.74 0.71 0.16 0.1 0.32 0.43
RSEM EVAL -1.6 -3.86 -1.44 -1.59 -1.54 -2.01 -4.31 -4.94 -3.13 -2.82
BUSCO
Complete single-copy 1971 939 2078 3238 2581 883 2034 1426 3354 3391
Missing BUSCOs 1976 2014 1952 2042 1978 4392 2419 2764 2012 2103
Table B.8: Selected metrics based on the output of rnaQUAST, Hisat2, DETONATE, TransRate and BUSCO for the transcripts assembled by all ten tools
on the Homo sapiens flux simulated RNA-Seq data set. Details and many more statistics complementing this evaluation can be found in the electronic
supplement, content S4–S8. In each row, the top three values are indicated in bold italic. The RSEM-EVAL score is multiplied by 10^9. The number of
ambiguous bases is given in thousands. N50 – the length of the shortest contig in the assembly such that the accumulated bases of all contigs of this length
or longer cover 50 % of all the bases in the assembly. F1 score – a measure of a test's accuracy. An F1 score of 1 would mean that all nucleotides/contigs
in the estimated true assembly were recovered with at least 90 % identity. KC score – k-mer compression score reflecting the similarity of each assembly to
DETONATE's estimated “true” assembly.

Trans-ABySS Oases SOAP-Trans Trinity IDBA-Tran Shannon Bridger BinPacker SPAdes-sc SPAdes-rna
k-mer size 25,35,45,55,65 25,35,45,55,65 default default 25,35,45,55,65 default default default default default
Hisat2
Overall mapping rate 95.79 73.26 99.55 91.68 94.02 93.03 85.34 30.77 97.25 96.34
rnaQUAST
Transcripts > 1000 bp 9860 28143 7734 4263 5167 7424 2740 2341 2630 2657
Database coverage 0.48 0.59 0.56 0.41 0.35 0.37 0.38 0.19 0.37 0.38
Misassemblies 185 4094 118 49 533 785 8 66 56 124
Mismatches per transcript 1 2.11 0.41 0.31 1.65 1.85 0.1 0.39 0.2 0.4
Average alignment length 2340.04 2090.55 1090.25 1009.08 1581.3 1755.16 859.97 1061.01 992.74 738.02
Mean isoform coverage 0.82 0.74 0.7 0.61 0.66 0.66 0.59 0.66 0.58 0.54
TransRate
N50 3821 4266 2653 3036 3136 3364 1444 2137 2996 2735
Reference coverage 0.3 0.38 0.47 0.18 0.19 0.2 0.18 0.13 0.16 0.17
Mean ORF percentage 42.66 37.43 44.83 36.28 42.22 43.57 47.62 47.37 40.16 42.78
Optimal score 0.1 0.01 0.17 0.21 0.14 0.13 0.35 0.06 0.46 0.45
Percentage good mappings 0.22 0.05 0.41 0.47 0.33 0.28 0.6 0.14 0.81 0.78
Percentage bases uncovered 0.55 0.83 0.34 0.17 0.22 0.43 0.01 0.17 0.01 0.03
Number of ambiguous bases 35252 110723 26848 15259 18025 25851 10489 7801 10744 11203
DETONATE
Nucleotide F1 0.53 0.22 0.65 0.73 0.71 0.6 0.78 0.45 0.79 0.79
Contig F1 0.06 0.05 0.1 0.08 0.05 0.05 0.05 0.07 0.05 0.07
KC score 0.89 0.74 0.94 0.6 0.82 0.82 0.6 0.26 0.78 0.73
RSEM EVAL -2.78 -4.61 -2.25 -5.27 -3.66 -3.64 -7.52 -1.15 -4.55 -5.18
BUSCO
Complete single-copy 203 86 290 191 316 256 226 143 393 363
Missing BUSCOs 22 22 18 109 28 29 142 364 60 75

Curriculum vitae
Education
since 07/2013 Ph.D. Student
Friedrich Schiller University Jena
Prof. Dr. Manja Marz
RNA Bioinformatics and High Throughput Analysis
2013 Diploma in Bioinformatics (Dipl.-Bioinf.)
Diploma thesis:
“Datenmanagement von Massenspektren und Fragmentierungsbäumen
mit BExIS.” (Prof. Dr. Sebastian Böcker)
2007 - 2013 Study of Bioinformatics
Friedrich Schiller University Jena
2006 Diploma qualifying for university admission
Friedrich-Fröbel-Gymnasium Bad Blankenburg

Conferences and Workshops

05/2017 Bioinformatics Mittelerde Meeting Poster
Leipzig, Germany
04/2017 Hacken: “Stay Young or Die Trying” – Hackathon on Aging Coordinator
Jena, Germany
03/2017 27th Annual Meeting of the Society for Virology Poster
Marburg, Germany
03/2017 1st EVBC Meeting Facilitator
Jena, Germany
02/2017 32nd Winterseminar der Bioinformatik Talk
Bled, Slovenia
10/2016 14th Herbstseminar der Bioinformatik Talk
Doubice, Czech Republic
04/2016 26th Annual Meeting of the Society for Virology Poster
Münster, Germany
02/2016 31st Winterseminar der Bioinformatik Talk
Bled, Slovenia
10/2015 13th Herbstseminar der Bioinformatik Attendance
Doubice, Czech Republic
09/2015 Elixir Hackathon: RNA tools registry Attendance
Copenhagen, Denmark
03/2015 25th Annual Meeting of the Society for Virology Talk
Bochum, Germany
02/2015 30th TBI Winterseminar Talk
Bled, Slovenia
11/2014 “Fight against Ebola” – in silico Coordinator
Jena, Germany
10/2014 12th Herbstseminar der Bioinformatik Talk
Doubice, Czech Republic
02/2014 29th TBI Winterseminar Talk
Bled, Slovenia
10/2013 11th Herbstseminar der Bioinformatik Talk
Doubice, Czech Republic

Declaration of Honour

I hereby declare

• that I am familiar with the doctoral regulations (Promotionsordnung) of the Faculty,

• that I prepared the dissertation myself, did not adopt any text passages or
results from third parties or from my own examination papers without marking
them as such, and have indicated all aids, personal communications, and
sources that I used in my work,

• that I did not make use of the assistance of a doctoral consultant and that
third parties have not received, either directly or indirectly, any monetary
benefits from me for work related to the content of the submitted
dissertation,

• that I have not yet submitted the dissertation as an examination paper for a
state or other scientific examination.

The following persons supported me in selecting and evaluating the material and in
preparing the manuscript:

Manja Marz

I have not submitted the same, a similar, or a different thesis as a dissertation
at another university.

Jena, 30 June 2017

Martin Hölzer
