
Genestack User Tutorials

Release 1.0

Genestack

Aug 16, 2019


Contents

1 Genestack key concepts
   1.1 Rich metadata system
   1.2 Public data
   1.3 Interactive data analysis and visualisation
   1.4 Format-free files
   1.5 Reproducibility

2 Get started with Genestack
   2.1 How to find relevant data?
   2.2 How to import data?
   2.3 How to build and run a pipeline?
   2.4 How to reproduce your work?
   2.5 How to share and manage your data?

3 Case studies
   3.1 Testing differential gene expression
   3.2 Whole-genome bisulfite sequencing data analysis
   3.3 Whole-exome sequencing data analysis
   3.4 Whole-genome sequencing data analysis
   3.5 Microbiome data analysis
   3.6 Expression microarray data analysis with Microarray Explorer

4 Genestack Platform overview
   4.1 Manage applications
   4.2 Manage groups
   4.3 Manage users

5 Importing data
   5.1 Step 1: Getting data into the platform
   5.2 Step 2: Format recognition
   5.3 Step 3: Editing metainfo
   5.4 Data browser
   5.5 File manager
   5.6 Edit metadata manually
   5.7 Import metainfo data from your computer
   5.8 Compose file names using metainfo keys
   5.9 Make a subset
   5.10 Access control model
   5.11 Managing users
   5.12 Managing groups
   5.13 Sharing files with a group

6 Introduction to NGS data analysis

7 Applications review
   7.1 Sequencing data
   7.2 Microarray data

8 FAQ
   8.1 Where do I find data that was shared with me?
   8.2 How do I reuse a data flow?
   8.3 How can I initialize files?
   8.4 How do I create a data flow?
   8.5 What is the difference between BWA and Bowtie2?
   8.6 How does Genestack process paired-end reads?
   8.7 What is the difference between a Dataset and a folder?
   8.8 What is the difference between masked and unmasked reference genomes?
   8.9 What is the difference between Data Flow Runner and Data Flow Editor?

9 References

Welcome to the Genestack user guide!


Genestack is a platform for managing, processing and visualising biological data, with a focus on next-generation sequencing (NGS) and microarray data, and it aims to lower the entry barrier to bioinformatics. Our community edition is freely available online for academic and non-commercial users.
No coding skills are required to use the platform, but a basic understanding of the sequencing data analysis process and the tools involved is crucial for correct data analysis and interpretation of the results.
This guide is intended for end-users of the platform. It covers the key concepts of the platform and gives a detailed walkthrough of the user interface and of the different bioinformatics pipelines available on Genestack.

CHAPTER 1

Genestack key concepts

Genestack is a platform to manage, analyse and visualise bioinformatics data, with a focus on next-generation sequencing (NGS) and microarray data. Genestack was built around several key concepts, which are described below.

1.1 Rich metadata system

Bioinformatics data always comes with complex metadata, which is essential for search and analysis.
On Genestack, each file has metadata directly attached to it, which can store biological information as well as technical
information (file owner, application parameters, etc.). Metadata values can support different formats (text, numeric
data, dates, measurements with units, etc.), and they can be viewed and edited on Genestack, as well as easily imported
from an Excel or CSV spreadsheet.
Genestack files can be searched and filtered by metadata. Additionally, controlled vocabularies and ontologies can be
used to validate metadata, to ensure harmonisation of metadata and improve the quality of the search results. Finally,
metadata templates can be defined by users to ensure that all the collaborators of a group are using consistent metadata
conventions.

1.2 Public data

Genestack is pre-loaded with millions of publicly available datasets from major repositories like ArrayExpress, GEO, SRA, and ENA, as well as numerous reference genomes for multiple organisms from Ensembl and UCSC. In practice, this means that the platform can serve as a data repository that allows users to work on private and public data seamlessly. Our Data Browser application helps you easily find not only the datasets you are interested in but also analyses performed on those data.


1.3 Interactive data analysis and visualisation

On Genestack, you will find many applications to process bioinformatics data, but also a range of graphical, interactive applications that help users better understand their data. These range from our FastQC Report application, which visualises the quality of raw or preprocessed sequencing reads, to our Variant Explorer application, which performs real-time interactive filtering of genomic variants by type, impact, quality, frequency, etc. All the Genestack applications are described in the section Applications review.

1.4 Format-free files

When doing bioinformatics, a lot of effort is often spent getting files into the right format to work with the right software.
In Genestack, we address this issue by dealing with high-level biological objects instead of traditional file formats. For instance, when you upload sequencing reads to Genestack, whether they are in FASTQ or SRA format, paired-end or not, compressed or not, they will appear in Genestack as “Raw Reads”. Subsequently, you will only be able to use these files with applications that know how to work with Raw Reads; if any format conversion needs to be done, it is done automatically and internally, and you will not have to worry about it.
Therefore, we say that files in Genestack are “format-free” in the sense that, as a user of the system, you do not have to think about file format compatibility and conversion inside Genestack, because Genestack does that for you.

Note: Formatting Hell


You might be wondering why we made our platform format-free and why this is such a big deal. In the current
landscape of bioinformatics there seems to be a never-ending number of formats your data might be saved in. There
are a few prominent formats used in NGS data analysis, like FASTQ, BAM and VCF. But very often new programs
come with new formats. Bioinformaticians say they spend almost 80% of their time worrying about data grooming
and file reformatting and only 20% on actual data analysis. Now, that’s insane, isn’t it? On Genestack you don’t have
to worry about formats at all — our platform takes care of all the routine tasks so that you can focus on your work.

1.5 Reproducibility

Every file on Genestack “remembers” how it was made: all applications, tool versions, parameters, and source files
are recorded in the metadata of each file.
This way, you can select any file in Genestack and visualize its “provenance”, a graph representing exactly what steps
were taken to produce this file, which source files and which applications were used to produce it.
Genestack hosts multiple tool versions at any given time in case you want to reproduce past results.



CHAPTER 2

Get started with Genestack

In this tutorial, we would like to introduce you to the core features of Genestack Platform. You will learn how our
system deals with files, how it helps you organise and manage your data and how to share data with your colleagues.
You will see how easy it is to work on private and public data simultaneously and seamlessly, and how to reproduce
complex analyses with data flows, a built-in mechanism for capturing and replaying your research.

2.1 How to find relevant data?

From the Dashboard you can go to the Data Browser, the application that makes searching for relevant biological data fast and effective. Although at this point you may have no private files, you can access the public data available on the platform.
Currently, we have a comprehensive collection of publicly available datasets imported from NCBI GEO, ENA, SRA and ArrayExpress. We also provide other useful data for bioinformatic analysis, such as reference genomes and annotations.

One of the key features of Genestack is that all files are format-free objects: raw reads, mapped sequences, statistics, genome annotations, genomic data, codon tables, and so forth. All files have rich metadata, different for each file type.


Let us take a look at an example. Apply the “Reference genomes” filter to see pre-loaded reference genomes. There is no single, standard, commonly accepted file format for storing and exchanging genomic sequences and features: the sequence can be stored in FASTA, EMBL or GenBank format, while genomic features (introns, exons, etc.) can be represented in, for example, GFF or GTF files. Each of these formats has its own flavours and versions, occasionally suffering from incompatibilities and conflicts. In Genestack you no longer have to worry or know about file formats.
A Reference Genome file contains packed sequence and genomic features. When data such as reference genomes is imported into Genestack (and several different formats can be imported), it is “packed” into a Genestack file, meaning all reference genomes behave identically, regardless of any differences in the underlying physical formats. You can browse reference genomes with our Genome Browser, use them to map raw sequencing reads, analyse variations, and add and manage rich annotations such as Gene Ontology terms, and you never have to think about formats again.
Click a dataset name to view the associated metainformation in the Edit Metainfo app. Some metadata fields are filled in by our curators, some are available for you to edit, and some are computed when files are initialised.

Besides, you can explore the metadata of any file wherever you are in the platform using the View metainfo option in the context menu.


2.2 How to import data?

Now let’s discuss importing data into the platform. On the Dashboard you can find an Import data option. Clicking it takes you to the Import Data app page.

There are various options for importing your data. You can drag and drop or select files from your computer, import data from a URL, or use previous uploads.

After data is uploaded and imported, the platform automatically recognises file formats and transforms the files into biological data types such as raw reads, mapped reads, reference genomes and so on. This means you will not have to worry about formats at all, which will most likely save you a lot of time. If some files are not recognised, you can manually assign them to a specific data type by drag and drop.


On the next Edit metainfo step, you can describe the uploaded data. Using an Excel-like spreadsheet, you can edit the file metainfo and add new attributes, for example, cell type or age.

Another option for importing your data is to use import templates. On the Dashboard you can find an Add import template option. Import templates allow you to specify required and optional metainfo attributes for different file kinds. When you scroll down to the bottom of the page, you will see an Add import template button.


2.3 How to build and run a pipeline?

All files on Genestack are created by applications. When an application creates a new file, it specifies what should happen when the file is initialised: a script, a download, indexing, a computation. In practice, this means that uninitialised files are cheap and quick to create; they can be configured, used as inputs to applications to create other files, and then, later, computed all at once. Let’s look at an example. Go to the public experiment library and choose the “Whole genome sequencing of human (russian male)” dataset.

Click the Analyse button and then select Trim Adaptors and Contaminants from the list of suggested applications. If you want to analyse only some of the files from a given dataset, select the files you are interested in and use Make a subset on the dataset.


Regardless of the input you start with, at this step you do not have to begin initialisation right away. In fact, you can use the file created by the app as an input to other applications and continue building the pipeline. Notice that you can edit the analysis parameters on the app page. You can change them because the file is not yet initialised, i.e. the computation – in this case, trimming – has not yet been started. After initialisation has completed, these parameters are fixed. Because these parameters are saved in the metainfo, they can later be used to reproduce your work exactly.

To start initialisation of a newly created file, click on the name of the file and select Start initialisation.


To use this file as an input for a different application, for example to map the trimmed raw reads to a reference genome,
you should click on Add step and select the “Spliced Mapping with Tophat2” application.

As a result, another dataset called “Spliced Mapping with Tophat2” is created and waits to be initialised. On the application page you can check whether the system suggested the correct reference genome and, if not, select the correct one.


This dataset, in turn, can be used as an input for a different application. As the last step of the analysis you could, for example, identify genetic variants by adding the “Variant Calling” app. In order to see the entire data flow we have just created, click on the name of the last created file, go to Manage and select File Provenance.


It will show you the steps that have been completed, and the ones that still need to be initialised. To initialise only one of the steps, click on the given cell, then on Actions, and select Start initialisation. To initialise all of the uninitialised dependencies, simply press Start initialisation at the top.


You can track the progress of your computations using Task Manager that can be found at the top of the page.

2.4 How to reproduce your work?

Now, let’s talk about reproducibility. We will show you how to take any data file in the Genestack Platform and repeat the analysis steps that led up to it on different data.
Let’s go back to the genetic variants file we created with the Variant Calling app. To find analysis results, you can go to Recent Results on the Dashboard, find the dataset in the Data Browser, or go to the “My datasets” folder in the File Manager. You can also find it in the tutorial folder. Rather than viewing its provenance like we did before, let’s see if we can reuse the provenance. To do this, select the file, go to Manage and choose Create new Data Flow.


In the next screen you will see the data flow we have previously created.


The data flow editor has one core goal: to help you create more files using this diagram. To do this, you will need to make some decisions for the boxes in the diagram via the Action menu. If you want to select different files, go to Choose another file. If you want to keep the original file, simply do not change anything.


In this example, we will use this data flow to produce variant calls for another raw sequencing data file, FS02, reproducing the entire workflow including trimming low-quality bases, spliced mapping and variant calling. All you need to do is choose another input file and click the Run data flow button at the top of the page. You will be given a choice: you can initialise the entire data flow now or delay initialisation.

If you decide to delay the initialisation until later, you will be brought back to the Data Flow Runner page, where you can initialise individual files by clicking on the file name and selecting Start initialisation.


2.5 How to share and manage your data?

To share data we use groups: a group is a shared project between two or more users. To manage existing groups or create a new one, click on the Genestack logo in the upper left corner and select Manage groups in the shortcuts menu.


On the Manage Groups page, if you have no groups so far, click the Create group button and create the first one.

Right away we have a new group:


And we can add a new member to this newly created group:

Now your group looks like this:

No confirmation is needed: any user in your organisation can create a group and add other users from your organisation to it. You are the group administrator of any group that you create. As the group administrator, you can add or remove other users from your group, or change their permissions, e.g. make them administrators, or make them “sharing” or “non-sharing” users.
All groups appear as folders under Shared with me in File Manager, and the moment you add a user to a group they
will see the group’s folder in their File Manager.


Group folders are the same as all other folders in the system: you can add files to and remove files from group folders just as with any other regular folder. There is an important point to note, though: adding a file to a group folder is not the same as sharing it with the group.
To share one or more files with a group, select them and click Share in the context menu. Some applications (e.g. Data Browser, Metainfo Editor) have a Share button as well. In the window that opens, choose the group you want to share the selected files with.

After that, click Share and specify with the Link option whether you want to add the shared data to the group folder.


If you choose to link the shared files, all group members will see the shared data at the top level of the group folder. If you do not link the files, they will still be shared, but members of the group will not see them in the group folder, although they can access the data via search. Moreover, you can always add shared files to group folders later.
It is very easy to share data with users in the same organisation: you simply create a group and share files, and all group members see the shared data immediately. What about sharing across organisations? Say you work in a hospital research group and have imported some valuable pathogenic specimen sequence data into the Genestack Platform, and you want to share it with your colleagues in a pharma company who work on novel drugs to kill the pathogen. It is easy to set up a new cross-organisational group or to turn an existing group into one. When you add new users, simply type in the email address of the user from the other organisation. The Genestack Platform will autocomplete only users from your organisation, not from others. This is a security feature: it means that no one from another organisation can find out who from yours is registered in the Genestack Platform. After you enter the user’s email, you should create an invitation and send it to the other organisation:

Your organisation administrator will need to approve the invitation, and then the other organisation’s administrator will
have to approve it, too. After confirmation of collaboration by organisation administrators of both parties, the group


becomes a cross-organisational group and other users can be added easily. The inviting organisation’s administrator will see the following on their group management screens:

Once they confirm the outgoing invitation, the other organisation’s administrator will see the same in their Incoming
invitations section and will have to confirm it as well. After both confirmations, the new group has members from
both organisations:

Note that you can change the status of users from your organisation, but not from other organisations. A cross-organisational group can have multiple participating organisations. Adding each new participating organisation requires the approval of the administrators of all organisations already in the group, as well as that of an administrator from the organisation being invited. Once the approvals are in, sharing is easy. So you can easily collaborate across organisational (enterprise) boundaries, with appropriate administrative controls in place.



CHAPTER 3

Case studies

3.1 Testing differential gene expression

One of the most widespread applications of RNA-Seq technology is differential gene expression (DGE) analysis. By studying how gene expression levels change across different experimental conditions, we can gain clues about gene function and learn how genes work together to carry out biological processes.
In this tutorial we will use Genestack applications to identify differentially expressed (DE) genes and further annotate
them according to biological process, molecular function, and cellular component.
The whole analysis includes the following steps:
1. Setting up an RNA-Seq experiment
2. Quality control of raw reads
3. Preprocessing of raw reads
4. Mapping RNA-Seq reads onto a reference genome
5. Quality control of mapped reads
6. Calculating read coverage for genes
7. Differential gene expression analysis
Let’s deal with these steps one by one.

3.1.1 Setting up an RNA-seq experiment

The first step is to choose an RNA-Seq dataset. There are more than 110,000 datasets (see the Public experiments folder) imported from various biological databases and public repositories (e.g. ArrayExpress, GEO, SRA, and ENA). All these data are absolutely free to use. The Data Browser app allows you to explore existing RNA-Seq data and provides filters that help you find the most appropriate data to use further in the analysis. Besides searching through all the data available on the platform, you can also upload your own data with the Import app.


Feel free to find all the data for this tutorial in the folder Testing Differential Gene Expression on Genestack Platform. To find it, go to the File Manager app and open the “Tutorials” folder located in Public Data.
The RNA-Seq experiment we will use comes from Hibaoui et al. 2013 and is publicly available on Genestack. Open it in the Metainfo Editor to see more details:

The authors investigated the transcriptional signature of Down syndrome (trisomy 21) during development, and analysed mRNA of induced pluripotent stem cells (iPSCs) derived from fetal fibroblasts of monozygotic twins discordant for trisomy 21: three replicates from iPSCs carrying the trisomy, and four replicates from normal iPSCs. They identified down-regulated genes expressed in trisomic samples and involved in multiple developmental processes, specifically in nervous system development. Genes up-regulated in Twin-DS-iPSCs are mostly related to the regulation of transcription and different metabolic processes.
To reproduce these results, we will use the Differential Gene Expression Analysis data flow. But first, let’s check the quality of the raw reads to decide whether we should improve it or not.

3.1.2 Quality control of raw reads

Raw sequencing reads can include PCR primers, adaptors, low-quality bases, duplicates and other contaminants coming from the experimental protocols. That is why we recommend checking the quality of your raw data, looking at such aspects as GC percentage, per base sequence quality scores, and other quality statistics. The easiest way to do this is to run the Raw Reads QC data flow:
After generating reports, you will be able to review various statistics and plots for each sample.
Per sequence GC content. In a random library you expect a roughly normal GC content distribution. An unusually shaped or shifted distribution could indicate contamination.
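The per-sequence GC metric behind such a report is simple to compute. As an illustration, here is a minimal Python sketch of the idea (the reads below are made-up examples, not data from this dataset, and this is not Genestack’s actual code):

```python
def gc_content(seq):
    """Return the GC fraction (0.0-1.0) of a single read."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

# Made-up example reads; a real report computes this value for
# every read in the FASTQ file and plots the distribution.
reads = ["ATGCGC", "AATTAA", "GGGCCC"]
for read in reads:
    print(read, round(gc_content(read), 2))
```

In a random library the resulting distribution is roughly normal; contamination with another genome or with adaptor sequence typically shows up as a second peak or a shifted curve.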


The per base sequence quality plot depicts the range of quality scores for each position in the reads. A good sample will have quality scores all above 28:

The per sequence quality scores plot shows the frequencies of quality scores in a sample. If the reads are of good quality, the peak on the plot should be shifted as far to the right as possible. In our case, the majority of Illumina reads have a good quality, in the range from 35 to 40 (the best score is 41):


Per base sequence content shows the proportions of the four nucleotides at each position. In a random library you expect no nucleotide bias, and the lines should be almost parallel to each other:

There is a bias at the beginning of the reads, which is common for RNA-Seq data. It occurs during RNA-Seq library preparation, when “random” primers are annealed to the start of sequences. These primers are not truly random, which leads to variation at the beginning of the reads.
The sequence duplication levels plot represents the proportion of the library made up of sequences with different duplication levels. Sequences with 1, 2, 3, 4, etc. duplicates are grouped to give the overall duplication level.


Looking at these plots, you may notice that 15% of sequences are duplicated more than 10 times, 6% of sequences are repeated more than 100 times, etc. The overall rate of duplication is about 40%. Nevertheless, when analysing transcriptome sequencing data we should not remove these duplicates, because we do not know whether they represent PCR duplicates or genuinely high gene expression in our samples.
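For intuition, the grouping behind a duplication-levels plot can be sketched in a few lines of Python (a toy illustration of the idea, with made-up reads, not the platform’s actual implementation):

```python
from collections import Counter

# Made-up example reads: "ACGT" occurs three times, "TTTT" twice, "GGGG" once.
reads = ["ACGT", "ACGT", "ACGT", "TTTT", "TTTT", "GGGG"]

occurrences = Counter(reads)             # read -> number of copies
levels = Counter(occurrences.values())   # duplication level -> number of distinct reads

for level in sorted(levels):
    print(f"{levels[level]} distinct read(s) duplicated {level} time(s)")
```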
We have run QC on all the data in the experiment and collected the reports in the Raw reads QC reports for Hibaoui et al (2013) folder.

3.1.3 Preprocessing of raw reads

QC reports not only provide you with information on data quality but can also help you decide how to preprocess the data in order to improve its quality and get more reliable results in further analysis. There are various Genestack applications for preprocessing; we will use the Trim Adaptors and Contaminants app as the first preprocessing step.


Once the quality of the raw data has been checked, we can go back to the main Differential Gene Expression Analysis data flow and choose the sources: 7 raw reads files from the tested dataset and a human reference genome. You can select the input data from existing datasets or upload files directly into the data flow using Import Data.


After that, we run the data flow to create all the files.


All resulting files are collected in the Trimmed raw reads for Hibaoui et al (2013) folder.

3.1.4 Mapping RNA-seq reads onto a reference genome

When all the files have been created, you could run the whole analysis from here by choosing Expression Navigator for genes. But first, let’s align the RNA-Seq reads to the reference genome across splice junctions and then explore the mappings in Genome Browser.
Find the Spliced Mapping step and click on “7 files”. In the “Explore” section, choose Genome Browser and start initialisation there.
there.
We run Spliced Mapping app with default parameters. To change them go to the app page and choose Edit parameters
button. If you want to learn more about the app and its options, click on the app name and then on About application.

Find the completed Mapped Reads files in the Mapped reads files for Hibaoui et al (2013) folder. Let’s open some of
them in Genome Browser to analyse read coverage on chromosome 21 in the region chr21:30007376-40007694 (10 Mb):


Here is a combined track for all trisomic and control samples:


As you see, the majority of chr21 genes are indeed more expressed in the trisomic samples than in the euploid ones,
which is consistent with the overall up-regulation of chr21 genes in individuals with Down syndrome.

3.1.5 Quality control of mapped reads

An optional step is to check how the mapping went using the Mapped Reads QC Report app. You can “generate reports”
for each mapping separately or just run the Mapped Reads Quality Control data flow for multiple samples.
The output report includes mapping statistics such as:
1. Mapped reads: total reads which mapped to the reference genome;
2. Unmapped reads: total reads which failed to map to the reference genome;
3. Mapped reads with mapped mate: total paired reads where both mates were mapped;
4. Mapped reads with partially mapped mate: total paired reads where only one mate was mapped;
5. Mapped reads with “properly” mapped mate: total paired reads where both mates were mapped with the
expected orientation;
6. Mapped reads with “improperly” mapped mate: total paired reads where one of the mates was mapped with
unexpected orientation.
The Coverage by chromosome plot shows the percentage of bases covered (y-axis) by at least N (x-axis) reads.
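The data behind such a plot can be derived from per-base read depths. A small sketch, assuming a precomputed list of depths such as a pileup over one chromosome (`coverage_curve` is a hypothetical helper, not part of the platform):

```python
def coverage_curve(depths, max_n=5):
    """Percentage of bases covered by at least N reads, for N = 1..max_n.

    `depths` is a per-base read-depth list (assumed input shape,
    e.g. one integer per reference position).
    """
    total = len(depths)
    return {n: 100.0 * sum(1 for d in depths if d >= n) / total
            for n in range(1, max_n + 1)}

depths = [0, 1, 2, 2, 3, 5, 5, 8]  # toy per-base depths
print(coverage_curve(depths, max_n=3))  # {1: 87.5, 2: 75.0, 3: 50.0}
```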

For paired reads, you can look at insert size statistics, such as median and mean insert sizes, median absolute deviation
and standard deviation of insert size. The Insert size distribution plot is generated:


We have already prepared all QC reports for the mapped reads and put them in the Mapped reads QC reports for Hibaoui
et al (2013) folder. You can open all of them in the Multiple QC Report app to view mapping statistics interactively:


Overall, more than 80 % of the reads are mapped; this includes reads with properly and partially mapped mates. Fewer
than 11 % of reads are unmapped across the samples. Additionally, you can sort your samples by QC statistics or
metainfo values. Read more about what the app does in our blog post about interactive sequencing quality control
reports.

3.1.6 Calculate read coverage for genes

After mapping, we can count reads mapped to annotated features (genes, exons, etc.) by running the Quantify Raw
Coverage in Genes app. To run the app, click on “7 files” and then “Start initialization”.

For our analysis we counted reads mapping within exons, grouped them by gene_id, and assigned reads to all exons
they overlap. We calculated read coverage in all samples and collected the resulting files in the Raw gene counts for
Hibaoui et al (2013) folder.

3.1.7 Differential gene expression analysis

The final step is to identify genes whose expression patterns differ between experimental conditions. In this tutorial,
we are looking for variation in gene expression in trisomic samples compared to the control ones.
Open the Expression Navigator file, re-group the samples and start the analysis:
We prepared two Differential Expression Statistics files (considering the DE genes reported by both packages) and
stored them in the Differential gene expression analysis for Hibaoui et al (2013) folder. As an example, let’s analyse


the DE genes reported by the DESeq2 package. You can see the table with the top genes that are differentially expressed
in one particular group compared to the average of the other groups. The table shows the corresponding Log FC (log fold
change), Log CPM (log counts per million), p-value, and FDR (false discovery rate) for each gene. Genes with positive
Log FC are considered up-regulated in the selected group; ones with negative Log FC are down-regulated. In the
“Trisomy 21” group we identified 4426 genes with low expression (NR2F1, XIST, NEFM, etc.) and 4368 highly over-
expressed genes (ZNF518A, MYH14, etc.). By selecting the checkbox next to a gene, more detailed information about
that gene is displayed: its Ensembl identifier, description and location.
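To make these columns concrete, here is the unmoderated arithmetic behind Log CPM and Log FC. Note that edgeR and DESeq2 apply additional normalisation and shrinkage, so this sketch only illustrates the definitions, not the packages' exact computations:

```python
import math

def log_cpm(count, library_size, prior=0.5):
    """log2 counts-per-million, with a small prior count to avoid log(0)."""
    return math.log2((count + prior) / (library_size + 1.0) * 1e6)

def log_fc(mean_cpm_group, mean_cpm_rest):
    """log2 fold change of a group versus the average of the other groups."""
    return math.log2(mean_cpm_group / mean_cpm_rest)

# Toy example: a gene averaging 300 CPM in the trisomic group
# versus 100 CPM in the control group.
print(round(log_fc(300.0, 100.0), 2))  # 1.58 -> positive, i.e. up-regulated
```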


There are several options to filter and sort the genes displayed in the “Top Differentially Expressed Genes” table. You
can filter them by minimum Log FC, minimum CPM and regulation type. By default, the genes are ranked by their FDR.
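The filtering and ranking logic can be mimicked in a few lines. A sketch over a toy table; the field names (`log_fc`, `fdr`, etc.) are assumptions for illustration, not the app's actual data model:

```python
def filter_genes(table, min_log_fc=0.0, min_log_cpm=0.0,
                 max_p=1.0, regulation="any"):
    """Filter a DE table by |logFC|, logCPM, p-value and regulation
    direction, then rank by FDR (the app's default ordering)."""
    rows = [g for g in table
            if abs(g["log_fc"]) >= min_log_fc
            and g["log_cpm"] >= min_log_cpm
            and g["p"] <= max_p
            and (regulation == "any"
                 or (regulation == "up" and g["log_fc"] > 0)
                 or (regulation == "down" and g["log_fc"] < 0))]
    return sorted(rows, key=lambda g: g["fdr"])

table = [
    {"gene": "XIST",    "log_fc": -3.1, "log_cpm": 5.0, "p": 1e-8, "fdr": 1e-6},
    {"gene": "ZNF518A", "log_fc":  2.4, "log_cpm": 3.2, "p": 1e-6, "fdr": 1e-4},
    {"gene": "ACTB",    "log_fc":  0.1, "log_cpm": 9.0, "p": 0.8,  "fdr": 0.9},
]
up = filter_genes(table, min_log_fc=2, min_log_cpm=2, regulation="up")
print([g["gene"] for g in up])  # ['ZNF518A']
```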


Let’s find the genes most over-expressed in the “Trisomy 21” group by lowering the Max P-value threshold and
increasing the Min LogFC and Min LogCPM thresholds. Change Regulation to “Up”, set both Min LogFC and Min
LogCPM to ‘2’ and sort by LogFC. Consistent with the paper’s results, a number of zinc finger protein genes are
up-regulated in Twin-DS-iPSCs:


The interactive counts graph shows a gene’s normalised counts across samples. This allows you to observe how a gene’s
expression level varies within and across groups. Select several genes to compare expression level distributions between
them:


If you move the cursor to the top right corner of the graph, three icons will appear:
1. The filter icon lets you filter the graph data by samples, groups, features, etc.
2. The data icon displays all the data contained in the graph and allows you to save the table data locally.
3. The camera icon lets you save the displayed graph locally. Add labels to the graph and change its appearance by
modifying the parameters displayed when you right-click the graph area.
This is the end of this tutorial. We hope you found it useful and that you are now ready to make the most out of our
platform. If you have any questions and comments, feel free to email us at support@genestack.com. Also we invite
you to follow us on Twitter @genestack.

3.2 Whole-genome bisulfite sequencing data analysis

Investigating the DNA methylation pattern is extremely valuable because this epigenetic system regulates gene expres-
sion. It is involved in embryonic development, genomic imprinting, X chromosome inactivation and cell differentiation.
Since methylation takes part in many normal cellular functions, aberrant DNA methylation may be associated with a
wide spectrum of human diseases, including cancer. A strong interplay between methylation patterns and other
epigenetic systems provides a unique picture of active and suppressed genes. Indeed, the methylomes of a malignant
cell and a healthy one are different. Moreover, both hyper- and hypomethylation events are associated with cancer.

As DNA methylation patterns are reversible and can be modified under the influence of the environment, epigenetics
opens new possibilities in the diagnosis and therapy of cancers and other severe diseases.

Bisulfite sequencing approaches are currently considered the “gold standard”, allowing a comprehensive understanding
of the methylome landscape. Whole-genome bisulfite sequencing (WGBS) provides single-base resolution of methylated
cytosines across the entire genome.

In this step-by-step tutorial we will show you how to analyse and interpret bisulfite sequencing data with the Genestack
Platform.
The overall pipeline includes the following steps:
1. Setting up a WGBS experiment


Fig. 1: Adapted from Lorenzen J.M., Martino F., Thum T. Epigenetic modifications in cardiovascular disease. Basic
Res Cardiol 107:245 (2012)


2. Quality control of bisulfite sequencing reads


3. Preprocessing of the raw reads: trimming adaptors, contaminants and low quality bases
4. Bisulfite sequencing mapping of the preprocessed reads onto a reference genome
5. Merging the mapped reads
6. Quality control of the mapped reads
7. Methylation ratio analysis
8. Exploring the genome methylation levels in Genome Browser
The details will be elaborated in the sections below. To follow along, go to the Tutorials folder in the Public data. Then
select the Whole-Genome Bisulfite Sequencing Data Analysis on Genestack Platform folder, which contains all the
tutorial files we talk about here. There you can find the processed files, explore the results, and repeat the analysis steps
on data of your own interest with the WGBS data analysis (for Rodriguez et al., 2014) data flow.

3.2.1 Setting up a WGBS experiment

For this tutorial we picked the data set by Rodriguez et al., 2014 from the Genestack collection of Public Experiments.
Feel free to reproduce the workflow on any other relevant data set that you can find with the Data Browser. If you
cannot find the experiment you need, or you intend to analyse your own data, use our Import Data application to
upload files from your computer or from a URL.

In this experiment the authors applied WGBS of genomic DNA to investigate the mechanisms that could promote changes
in DNA methylation and contribute to cell differentiation and malignant transformation. They investigated the cytosine
methylation profile in normal precursors of leukemia cells, hematopoietic stem cells (HSCs).

The team discovered novel genomic features they called DNA methylation canyons: uncommonly large DNA regions
of low methylation. Canyons are distinct from CpG islands and associated with genes involved in development.
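Conceptually, canyon calling amounts to scanning a chromosome for long runs of low-methylation CpGs. A toy sketch of that idea; the thresholds below are illustrative, not the exact values or procedure used by the authors:

```python
def find_canyons(cpgs, max_ratio=0.1, min_span=3500):
    """Scan (position, methylation_ratio) pairs along one chromosome and
    report maximal runs of low-methylation CpGs spanning at least
    `min_span` bp. Thresholds are illustrative only."""
    canyons, run = [], []
    for pos, ratio in cpgs + [(None, 1.0)]:  # sentinel flushes the last run
        if ratio < max_ratio:
            run.append(pos)
        else:
            if run and run[-1] - run[0] >= min_span:
                canyons.append((run[0], run[-1]))
            run = []
    return canyons

cpgs = [(1000, 0.02), (2500, 0.05), (5200, 0.01),  # low-methylation run
        (5400, 0.85),                              # methylated CpG ends it
        (9000, 0.03), (9500, 0.04)]                # run too short to count
print(find_canyons(cpgs))  # [(1000, 5200)]
```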
In mammals, a methyl group is added to cytosine residues by DNA methyltransferase (DNMT) enzymes. As the de
novo DNA methyltransferase Dnmt3a is known to be crucial for normal HSC differentiation, and the Dnmt3a gene is
often mutated in human leukemias, the authors further explored how loss of Dnmt3a influences canyons.
They compared DNA methylation patterns in wild-type and Dnmt3a-knockout mouse HSCs. It turned out that the
loss of Dnmt3a results in methylation changes at the edges of canyons and can influence canyon size.
Now let’s start reproducing these results with the data flows pre-prepared by Genestack.
To learn more, open the experiment in the Metainfo Editor:


Fig. 2: Adapted from Jeong M. & Goodell M.A. New answers to old questions from genome-wide maps of DNA
methylation in hematopoietic cells. Exp Hematol 42(8):609-617 (2014)

3.2.2 Quality control of bisulfite sequencing reads

Quality control of raw reads is an essential step in our pipeline to ensure that the data is of high quality and suitable
for further analysis. The raw data may be contaminated with PCR primers and adapter dimers, or it can contain bases
of poor quality. Moreover, the bisulfite sequencing protocol relies on sodium bisulfite conversion of unmethylated
cytosines, which are subsequently read as thymines; this reduces sequence complexity and can cause errors during
alignment.
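The conversion step itself is easy to model in silico, which makes the loss of sequence complexity obvious. A toy model in which methylated cytosine positions are protected from conversion (real protection acts on 5-methylcytosine, typically at CpG sites; the position set here is a simplifying assumption):

```python
def bisulfite_convert(seq, methylated_positions=frozenset()):
    """In-silico bisulfite conversion: unmethylated cytosines read out
    as thymines after amplification, methylated ones stay cytosines.
    `methylated_positions` holds 0-based indices of protected Cs."""
    return "".join(
        "T" if base == "C" and i not in methylated_positions else base
        for i, base in enumerate(seq))

# Only the cytosine at index 2 is methylated and thus protected:
print(bisulfite_convert("ACCGTC", methylated_positions={2}))  # ATCGTT
```

After conversion the reads are strongly depleted of C, which is exactly the skew the Per base sequence content plots show later in this section.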
With our platform you can quickly and easily obtain an interactive FastQC Report that covers different quality aspects
of the raw data, such as general sequence quality, total number of reads, GC content distribution, read quality distribution
and much more. To start the analysis we can use the public Raw Reads Quality Control data flow. There is no need to
repeat the same steps for each sample: the pipelines for all 16 assays from our experiment can easily be built
simultaneously using this public data flow.
In the Data Flow Runner page you can see 16 raw reads samples from the experiment.


To start the computation, click the Run Data Flow button and create the resulting files. You will be prompted either to
initialize the created files at once or to delay initialization until later.


The computation will take a while. You can track the progress of report generation in our Task Manager, found at the
top of the page. All created QC reports are located in the “Created files” folder. You can explore each report in the
FastQC Report app or compare QC statistics for several samples using the Multiple QC Report app.

If you decide to delay initialization, you can generate statistical reports for all the chosen samples directly on the
Multiple QC Report app page or from the Data Flow Runner page by clicking on “16 files” and selecting “Start
initialization”. Thereby, with just one click, the process will begin for all the selected samples.


We will demonstrate example QC reports using files previously prepared by our team. Go to the folder with processed
files for this tutorial and click on the complete Multiple QC Report for Raw Reads that we created for all 16 raw reads
files in order to compare their quality.


Looking at the plot we can see the number of nucleotides counted for each individual sample obtained from Dnmt3a-KO
(blue) or WT (red) HSCs. Additionally, on the app page you can specify the statistics and metainfo displayed on the
plot and sort the samples by the QC statistics or metainfo keys of your choice.
Now let’s look at the FastQC report for one of the assays, for example “ko3a_b2l4 Bisulfite-Seq”. The Per sequence GC
content graph shows the GC content across the whole length of each read. Ideally, we would see a normal distribution
of GC content. Our results show some deviation from the normal distribution: the unusually sharp shape of the central
peak may indicate the presence of contaminants in our library, for example adaptor dimers.


On the Per base sequence quality plots we can see that all bases in our sequences have a quality score equal to or greater
than 30, which corresponds to 99.9 % base-calling accuracy. The quality degrades over the last bases, but this is
expected behaviour given the sequencing chemistry.

The Per sequence quality score graph shows the average quality distribution over the set of sequences. It helps us see
if there are any problems with the sequencing run; for example, a significant proportion of low-quality sequences can be a
signal of a systematic problem. In our case the overwhelming majority of reads are of high quality (more than 30).

Let’s move on to the Per base sequence content graphs. The fact that our data failed this metric indicates that the base
distribution is not uniform, namely that the difference between A and T, or G and C, is greater than 20 %. Indeed, we
can see fluctuations in base composition over the entire read length. This should not alarm us, because bisulfite treatment
converts most of the cytosines to thymines, which obviously affects the base composition. Looking at the plot we can
see that the proportion of thymines is approximately 50 %, while cytosines are almost absent.


The Sequence duplication levels metric allows us to assess the duplication level as well as the number of sequences that
are not unique in the raw data. According to the plot, more than 30 % of the sequences in the assay are not unique.
Such a high duplication level can be linked to PCR artefacts, contaminants, or sequencing of the same area several
times.

The application also detects Overrepresented sequences that may correspond to primer or adapter contamination.
Indeed, two over-represented sequences were found in our assay. Here they are:


These contaminants can strongly influence the results of analysis and should be trimmed.
All prepared FastQC reports for all the samples are stored in the FastQC reports for Rodriguez et al., 2014 folder.

3.2.3 Preprocessing of raw reads

After checking the quality of our data, we can proceed with appropriate steps to improve the original raw data in order
to get reliable results in the downstream analysis.
The authors analysed two biological replicates for each of two murine phenotypes: wild-type (WT) HSCs and conditional
Dnmt3a-knockout (KO) HSCs. Moreover, each biological replicate of the WT or Dnmt3a-null condition has several
technical replicates. Let’s select the raw reads “m12_b4l1 Bisulfite-Seq”, “m12_b4l2 Bisulfite-Seq” and “m12_b3
Bisulfite-Seq”, which are the three technical replicates for the second biological replicate of WT HSCs in our experiment,
and right-click on them. The “Data Flow for WGBS data analysis (for Rodriguez et al., 2014)” public data flow created
for this tutorial is freely available on the platform; you can find it in the tutorial folder. To explore the data flow in
more detail you can use the Data Flow Runner app, where all the steps of our pipeline are schematically represented.


In the first block you will see the source files we have just selected. You also need to specify the reference genome onto
which our reads will be mapped: click Choose sources, find the appropriate murine reference genome and click Select.

Let’s run the data flow by clicking the corresponding button and take a closer look at all the steps of our pipeline. As
described below, we will run this data flow several times to obtain methylation ratios for the biological replicates of
the two tested phenotypes separately. The first part of our pipeline is preprocessing of the raw sequencing data. Based
on the QC statistics, we highly recommend removing adapters and contaminants, trimming low-quality bases
and removing duplicates. We remove duplicates during Methylation Ratio Analysis, but you can also use the
separate preprocessing application Remove Duplicated Reads. First, we can easily remove the overrepresented
sequences found in the WGBS data using the Trim adapters and contaminants app.
Next, to avoid mismatches during read mapping, we should remove low-quality bases from the sequencing reads. The
Trim low quality bases application lets you get rid of bases whose phred33 quality score falls below a threshold
corresponding to a 1 % error rate.
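The quality-to-error relationship is Q = -10·log10(p), so Q20 corresponds to the 1 % threshold mentioned above, and in FASTQ phred33 encoding a score is the ASCII code minus 33. A simplified 3′-end trim illustrating the idea (not the app's exact algorithm):

```python
def phred33_scores(quality_string):
    """Decode a FASTQ phred33 quality string into integer scores."""
    return [ord(c) - 33 for c in quality_string]

def trim_3prime(seq, qual, min_q=20):
    """Trim low-quality bases from the 3' end. Q20 corresponds to a
    1 % base-calling error rate, since p = 10 ** (-Q / 10)."""
    scores = phred33_scores(qual)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], qual[:end]

# 'I' encodes Q40, '#' encodes Q2:
print(trim_3prime("ACGTACGT", "IIIIII##"))  # ('ACGTAC', 'IIIIII')
```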
All preprocessed files are freely accessible in the Trim adaptors for Rodriguez et al., 2014 and Trim low quality bases
for Rodriguez et al., 2014 folders.

3.2.4 Mapping reads onto a reference genome

Following the preprocessing, our data is of improved quality and we can move on to the next step: alignment of the
trimmed reads onto the reference genome.
We run Bisulfite Sequencing Mapping with default parameters. Click on the app name to go to its page, where you can
change the alignment parameters and learn more about the app.
All the Mapped Reads can be found in the Mapped reads for Rodriguez et al., 2014 folder.

3.2.5 Merging of the mapped reads obtained from technical replicates

It is essential for any experiment to have a good and adequate control. Using replicates can improve the quality and
precision of your experiment, especially when comparing several experimental conditions. There are two types of
replicates: technical and biological. When the same biological sample is prepared and sequenced separately, this gives
us technical replicates; when different biological samples are treated the same way but prepared separately, we are
talking about biological replicates. If there are no changes in sample preparation, technical replicates can be considered
as controls for the samples under the same experimental conditions. As we remember, our experiment contains
both biological and technical replicates for 12-month-old wild-type HSCs and Dnmt3a-KO HSCs. As the authors do not
mention using different experimental conditions for the technical replicates, we can merge them before calculating
methylation ratios. We will not merge biological replicates, because significant biological variability may be present
between two samples grown separately.
For each biological replicate, create a subset including the technical replicates to be merged. As a result, you will
get four merged mapped reads files covering both analysed murine phenotypes. You can also use the prepared Merged
Mapped Reads by opening the Merged mapped reads for Rodriguez et al., 2014 folder.

3.2.6 Quality control of mapped reads

Before proceeding to the next analysis step, we recommend checking the quality of the mapped reads: reads might look
fine in the raw reads quality control check, while some issues, such as low coverage, homopolymer biases or
experimental artifacts, only appear after alignment.
Sequencing coverage is an important quality metric, as biases in sample preparation, sequencing and read mapping can
result in genome regions that lack coverage or in regions with much higher coverage than theoretically expected. You
can explore the coverage of the merged mapped reads in our Genome Browser: go to the created folder, click on the
name of a merged mapped reads file, go to the Explore option and select Genome Browser. Alternatively, go to our
tutorial folder, find the Coverage for merged mapped reads for Rodriguez et al., 2014 file and open it in the Genome
Browser app to explore the results for initialized mapped reads.
In Genome Browser you have the option of viewing a region by its coordinates or, alternatively, searching for a given
gene or transcript name. Let’s check the coverage at the location of the large methylation-depleted region associated
with the Pax6 gene.

High-quality reads should have uniform and sufficiently high coverage. Our last three samples are well suited for further
processing. At the same time, the coverage for the first sample, biological replicate 2 of the 12-month-old HSCs, is
uneven, with gaps and regions covered up to 133x. Such results may introduce biases during the methylation ratio
calculation.


You can explore the mapping quality of each individual file: right-click a merged reads file name, go to Explore and
select Mapped reads QC report. On the opened app page, click "generate reports" to start the calculation of quality
control statistics. To check the quality of all mapped reads or merged mapped reads simultaneously, use the Mapped
Reads Quality Control public data flow, as we did previously for Raw Reads Quality Control. The mapped reads QC
report describes the results of aligning the preprocessed reads onto the reference genome; for paired-end reads, for
example, it contains mapping and insert size statistics, a coverage by chromosome plot and the insert size distribution.
When the computation is finished, you can open the Mapped reads QC report for each individual file or explore the
mapping statistics for all of them using Multiple QC Plotter. To do that, find the generated mapped reads QC reports,
select all four, and open them together in Multiple QC Plotter via the context menu. Below you can see the QC report
for merged mapped reads generated by our team:


As we can see, around 65 % of the reads are mapped in each sample, and of these at most 6 % are improperly mapped.
Remember that you can explore QC reports not only for merged mapped reads but also for mapped reads. All the
prepared QC reports are located in the Mapped reads QC for Rodriguez et al., 2014 and Merged Mapped Reads QC
for Rodriguez et al., 2014 folders.

3.2.7 Methylation ratio analysis

When the sequencing reads for both murine phenotypes have been mapped onto the reference genome with sufficient
quality, we can move on to the very last step in our pipeline: determining the DNA methylation status of every single
cytosine position on both strands. To do that, go back to the Data Flow Runner page. Click on Methylation Ratio
Analysis to go to the app page, where you can see the source files and the command-line options, which can easily be
changed.
Then return to the data flow, click on Action, then Add files, choose the remaining merged mapped reads, and start
initialization. During this step we apply several options to remove technical biases in the WGBS data:
1. Trim N end-repairing fill-in bases set to “3”. This option trims 3 bases from the read end to remove the DNA
overhangs created during read end-repair in library preparation. This is important because the end-repair procedure
may introduce artefacts if the repaired bases contain methylated cytosines.
2. Report only unique mappings.
3. Discard duplicated reads, to remove duplicated reads that have identical sequences and could be an artefact of
library preparation. Such reads map to the same position and can distort the results of downstream analysis.
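At its core, the methylation ratio at a cytosine is simply the fraction of covering reads in which the base was protected from conversion (i.e. still reads ‘C’). A minimal sketch of that calculation:

```python
def methylation_ratio(methylated, unmethylated):
    """Methylation ratio at one cytosine position: reads still calling 'C'
    (protected from bisulfite conversion) over all reads covering it."""
    total = methylated + unmethylated
    return methylated / total if total else float("nan")

# A cytosine covered by 12 reads, 9 of which still read 'C':
print(methylation_ratio(9, 3))  # 0.75
```

Real tools additionally account for strand, sequence context (CpG/CHG/CHH) and the filtering options listed above, but each reported ratio reduces to this fraction.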
The folder Methylation Ratios for Rodriguez et al., 2014 contains all the resulting files of methylation ratios estimation.

3.2.8 Exploring the genome methylation levels in Genome browser

We can explore the distribution of genome methylation levels computed for both murine phenotypes in Genome Browser.
As mentioned before, “Canyons” are large unmethylated DNA regions inside a highly methylated locus; they often
harbour genes, such as Hoxa9 and Meis1, that play a role in hematopoiesis and are deregulated in human leukemia.
Some Canyons can be exceptionally large: for example, the one associated with the Pax6 homeobox gene, which encodes
a homeobox-containing protein regulating transcription, extends over 25 kb:


Let’s compare our methylation ratio distribution in this region with the authors’ results:

To further examine the impact of Dnmt3a loss on Canyon size, the authors compared low-methylated DNA regions in
HSCs with conditional inactivation of the Dnmt3a gene to those in WT cells:


This investigation revealed that methylation loss in Dnmt3a-KO HSCs leads to the formation of new Canyons. Lack
of Dnmt3a does not affect regions inside Canyons, but it results in changes at Canyon edges. Canyon boundaries
become hotspots of differential methylation: Canyon size can decrease due to hypermethylation or increase due
to hypomethylation.
Moreover, at DNA regions containing clusters of Canyons in WT HSCs, larger Canyons (“Grand Canyons”) can be
formed. We can see this in the example of the HoxB region, in which Canyons are interrupted by short stretches of
higher methylation.
All these findings suggest that Dnmt3a is crucial for maintaining methylation at Canyon boundaries.
Now, let’s take a look at the original track for the same Canyon cluster to compare the results:


This experiment is part of a larger study of changes in the DNA methylation profile that combined several methodologies:
for example, whole-genome bisulfite sequencing and CMS-seq to reveal the genome-wide distribution of mCs and
hmCs, and RNA-Seq to analyse the expression of Canyon-associated genes. This incredible work was turned into a
research paper, and the data sets can be found in our Public Experiments.
This is the end of this tutorial. We hope you found it useful and that you are now ready to make the most out of our
platform. If you have any questions and comments, feel free to email us at support@genestack.com. Also we invite
you to follow us on Twitter @genestack.

References

• Challen G.A. et al. Dnmt3a is essential for hematopoietic stem cell differentiation. Nat Genet 44:23-31 (2012)
• De Carvalho D.D. et al. DNA Methylation Screening Identifies Driver Epigenetic Events of Cancer Cell Survival.
Cancer Cell 21(5):655-667 (2012)
• Ehrlich M. DNA methylation in cancer: too much, but also too little. Oncogene 21:5400-5413 (2002)
• Jeong M. et al. Large conserved domains of low DNA methylation maintained by Dnmt3a. Nat Genet
46(1):17-23 (2014)
• Jeong M. & Goodell M.A. New answers to old questions from genome-wide maps of DNA methylation in
hematopoietic cells. Exp Hematol 42(8):609-617 (2014)
• Kulis M., Esteller M. DNA methylation and cancer. Adv Genet 70:27-56 (2010)
• Ley T.J. et al. DNMT3A mutations in acute myeloid leukemia. N Engl J Med 363:2424-2433 (2010)

3.3 Whole-exome sequencing data analysis

As one of the most widely used targeted sequencing methods, whole-exome sequencing (WES) has become more and
more popular in clinical and basic research. Although the exome (the protein-coding portion of the genome) makes up
only ~1 % of the genome, it contains about 85 % of known disease-related variants (van Dijk E.L. et al, 2014), making
whole-exome sequencing a fast and cost-effective alternative to whole-genome sequencing (WGS).
In this tutorial we’ll provide a comprehensive description of the various steps required for WES analysis, explain how
to build your own data flow and, finally, discuss the results obtained from such an analysis.

3.3.1 Setting up an exome sequencing experiment

First off, let’s choose the exome sequencing data. You can upload your own data using the Import button or search
through all the public experiments we have on the platform. Our analysis will be based on data from Clark et al. 2011.
Let’s find this experiment on the platform and open it in the Metainfo Editor:


The authors compared the performance of three major commercial exome sequencing platforms: Agilent’s SureSelect
Human All Exon 50Mb, Roche/Nimblegen’s SeqCap EZ Exome Library v2.0 and Illumina’s TruSeq Exome Enrichment,
all applied to the same human blood sample. They found that the Nimblegen platform provides increased
enrichment efficiency for detecting variants but covers fewer genomic regions than the other platforms. However, Agilent
and Illumina are able to detect a greater total number of variants than the Nimblegen platform. Exome sequencing and
whole-genome sequencing were also compared, demonstrating that WES allows the detection of additional variants
missed by WGS; vice versa, a number of WGS-specific variants are not identified by exome sequencing. This
study serves to assist the community in selecting the optimal exome-seq platform for their experiments, as well as
showing that whole-genome experiments benefit from being supplemented with WES experiments.

3.3.2 Whole-exome sequencing data analysis pipeline

A typical data flow of WES analysis consists of the following steps:


1. Quality control of raw reads
2. Preprocessing of raw reads
3. Mapping reads onto a reference genome
4. Targeted sequencing quality control
5. Quality control of mapped reads
6. Post-alignment processing
7. Variant calling
8. Effect annotation of the found variants
9. Variant prioritisation in Variant Explorer
Let’s look at each step separately to get a better idea of what it really means.


3.3.3 Quality control of raw reads

Low-quality reads, PCR primers, adaptors, duplicates and other contaminants that can be found in raw sequencing
data may compromise downstream analysis. Therefore, quality control (QC) is an essential step in your analysis to
understand relevant properties of the raw data, such as quality scores, GC content, base distribution, etc. In order
to assess the quality of the data we’ll run the Raw Reads QC data flow:
The Genestack FastQC application generates basic statistics and many useful data diagnosis plots. Here are some of them for
the sample enriched with Agilent SureSelect 50Mb:

Basic statistics tells you about basic data metrics such as read type, number of reads, GC content and total sequence
length.
The Sequence length distribution module reports whether all sequences have the same length or not.
The Per sequence GC content graph shows the GC distribution over all sequences. A roughly normal distribution indicates a
normal random library. However, if the data is contaminated or there is some systematic bias, as in our case, you’ll
see an unusually shaped or shifted GC distribution:


Per base sequence quality plots show the quality scores across all bases at each position in the reads. By default you
see the low-quality zone and the mean quality line. If the median is less than 25 or the lower quartile is less than 10, you’ll
get warnings.
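
This warning rule can be expressed directly. Below is a hedged sketch (nearest-rank quartiles; FastQC’s exact quantile method may differ) of flagging one read position from its Phred scores:

```python
def quartiles(phred_scores):
    """Median and lower quartile of the Phred scores observed at one
    read position (simple nearest-rank estimates)."""
    s = sorted(phred_scores)
    return s[len(s) // 2], s[len(s) // 4]

def flag_position(phred_scores):
    """Apply the thresholds described above: warn if the median is
    below 25 or the lower quartile is below 10."""
    median, lower_quartile = quartiles(phred_scores)
    return "warn" if median < 25 or lower_quartile < 10 else "ok"

print(flag_position([30, 32, 35, 28, 31]))  # ok
print(flag_position([8, 12, 20, 24, 26]))   # warn
```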

The Per sequence quality scores report lets you see the frequencies of quality values in a sample. The reads are of good
quality if the peak on the plot is shifted to the right, towards the maximum quality score. In our case, almost all of the reads are of
good quality (>30):


Per base sequence content plots show nucleotide frequencies for each base position in the reads. In a random library,
there should be only a little difference between the A, T, C and G nucleotides, and the lines representing them should run
parallel to each other. The black N line indicates the content of unknown N bases, which shouldn’t be present in
the library. In our case, you can notice the absence of unknown nucleotides and a slight difference in A-T and G-C
frequencies:
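
To make the idea concrete, per-position base frequencies can be tallied as below. This is an illustrative sketch assuming equal-length reads, not the FastQC code:

```python
def base_content(reads):
    """Per-position nucleotide frequencies (%), the data behind a
    'Per base sequence content' plot; assumes equal-length reads."""
    positions = []
    for pos in range(len(reads[0])):
        column = [read[pos] for read in reads]
        positions.append({base: 100.0 * column.count(base) / len(column)
                          for base in "ACGTN"})
    return positions

freqs = base_content(["ACGT", "AAGT"])
print(freqs[0]["A"])  # 100.0 — every read starts with A
print(freqs[1]["C"])  # 50.0
```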

Sequence duplication levels plots represent the percentage of the library made up of sequences with different duplication
levels. In simple words, 44 % of reads are unique, 26 % of reads are repeated twice, 13 % three times, 4 %
more than 10 times, etc. All these duplicates are grouped to give the overall duplication level. You can use the Filter
Duplicated Reads application to remove duplicates in raw reads data; however, we’ll get rid of them after the mapping
step.
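
The grouping described above boils down to two counting passes. Here is a minimal sketch (our own helper, not the FastQC algorithm, which also subsamples and truncates reads):

```python
from collections import Counter

def duplication_levels(reads):
    """Percentage of the library found at each duplication level:
    count copies per distinct sequence, then group sequences that
    share the same copy number."""
    copies = Counter(reads)                  # sequence -> copy number
    level_counts = Counter(copies.values())  # copy number -> distinct sequences
    total = len(reads)
    return {level: 100.0 * level * n / total
            for level, n in sorted(level_counts.items())}

reads = ["A", "A", "B", "C", "C", "C"]
print(duplication_levels(reads))  # roughly {1: 16.7, 2: 33.3, 3: 50.0}
```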


The application also detects overrepresented sequences that may indicate primer or adaptor contamination.
We have run QC on all the data in the experiment and put the reports in the Raw reads QC reports for Clark et al (2011)
folder, so that you can open all of them in the Multiple QC Report application to analyse the results:

You can see that the total number of exome sequencing reads is 124,112,466 for Agilent SureSelect, 184,983,780 for Nimblegen
SeqCap and 112,885,944 for the Illumina TruSeq platform. The whole-genome library yielded more than one billion
total raw reads.

3.3.4 Preprocessing of raw reads

With the comprehensive raw reads QC reports generated by the FastQC app, you’re able to determine whether any
preprocessing steps such as trimming, filtering, or adaptor clipping are necessary prior to alignment. Here is the list of all the
preprocessing apps Genestack offers to improve the quality of your raw reads:


Our preprocessing procedure will include the ‘Trim Adaptors and Contaminants’ step. Once the quality of the raw data has
been checked, let’s start planning and building our Whole Exome Sequencing Analysis data flow:
To build any data flow in Genestack, choose one of the samples and start to preprocess or analyse it. Each app suggests
adding the next analytical step or using relevant viewers:


Note that you can create as many files as you want and run the computation process later. Now let’s create a data flow
from the pipeline we built. For the last file created, choose Create New Data Flow in the Manage section:

This takes us to the Data Flow Editor app page, where you can rename and describe your pipeline and change sources.
Let’s ‘clear files’ and click ‘Run dataflow’.


Now we’re on the Data Flow Runner application page. Just choose the sources — the experiment assays and a human reference
genome — and click Run Data Flow. Note that if you choose several raw reads files, multi-sample variant calling
will be performed. However, in order to compare our results, we need to run this data flow for each sample separately.
After that, the app suggests choosing the explore app in which you can either start initialization now for your whole
analysis or delay it till later:

Let’s delay it. The app then suggests choosing an app from which you can also start the computation:


In order to start the computation for each data flow step separately, click on the file name and choose Start initialization.
All the data are preprocessed and stored in the Trimmed raw reads for Clark et al (2011) folder.

3.3.5 Mapping reads onto a reference genome

After raw data QC and preprocessing, the next step is to map the exome sequencing data to the reference genome with
high efficiency and accuracy. Genestack supports two Unspliced mappers: one is based on Bowtie2, the other uses the
BWA alignment package. We’ll use the latter, since it is fast and allows gapped alignments, which are essential for
accurate identification of SNPs and indels (insertions/deletions). The following video illustrates how to start the computation
at this step and analyse the mapping results in Genome Browser:
When the mappings are complete, open all four files in Genome Browser to compare their read coverage. Let’s look for
a specific gene or region, for example the HBA1 and HBA2 genes encoding the alpha-globin chains of hemoglobin. With
WGS technology, you can see coverage in both protein-coding and non-coding sequences:


As for WES technology, you are interested only in the exome. That’s why you see coverage for the HBA1 and HBA2 coding
regions and do not see it in the non-coding ones. To compare read coverage between the different enrichment platforms, you
can build a coverage track:

In most cases you’ll see significant coverage for the sample enriched with Nimblegen. Moreover, each platform targets
particular exomic segments based on combinations of the RefSeq, UCSC, Ensembl and other databases. That’s why
you may expect differences in coverage for specific gene-coding regions. To further use the mapped reads, go to the
Mapped reads for Clark et al (2011) folder.

3.3.6 Targeted sequencing quality control

Besides quality control of the raw sequencing reads, it is also crucial to assess whether the target capture has been
successful, i.e. whether most of the reads actually fell on the target, whether the targeted bases reached sufficient coverage, etc.
By default, the Targeted Sequencing Quality Control application computes enrichment statistics only for reads mapped on the exome. If
you go to the app page, change the value to ‘Both exome and target file’ and select the appropriate target annotation file,
you’ll get both exome and target enrichment statistics. To do this step, you can ‘generate reports’ for each mapping
separately or run our Targeted Sequencing Quality Control public data flow (with default values) for several samples
at once and analyse the output reports in the Multiple QC Report app:
In this tutorial, we are looking at the three exome enrichment platforms from Agilent, Nimblegen and Illumina and
assessing their overall targeting efficiency by measuring base coverage over all targeted bases and on-target coverage for
each platform:


A typical target-enrichment WES experiment results in ~90 % of target bases covered at 1x. This value
tends to decrease as the coverage threshold increases. How fast this percentage decreases with increasing coverage
depends on the specific experimental design.
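
These percentages are straightforward to compute once per-base read depths over the target are known (e.g. exported from a coverage tool). A hedged sketch with toy depths:

```python
def covered_fraction(depths, thresholds=(1, 2, 10, 50)):
    """Percentage of targeted bases covered at or above each coverage
    threshold, given one read-depth value per targeted base."""
    total = len(depths)
    return {t: 100.0 * sum(d >= t for d in depths) / total
            for t in thresholds}

depths = [0, 1, 3, 12, 55, 60, 2, 9]  # toy per-base depths on the target
print(covered_fraction(depths))  # {1: 87.5, 2: 75.0, 10: 37.5, 50: 25.0}
```
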
Not surprisingly, all the technologies give high coverage of their respective target regions, with the Nimblegen platform
giving the highest coverage: about 94 % of the targeted bases were covered at least twice, 93 % at 10x and 87
% at 50x. With Agilent, 91 % of bases were covered at 2x, 86 % at 10x and 66 % at 50x. With Illumina TruSeq
enrichment, 91 % of bases were covered at 2x, 86 % at 10x and only 50 % at 50x. These results are very similar to
those in the paper (Clark M.J. et al, 2011):

Regarding the overall percentage of reads mapped on the target, in a typical experiment one may expect ~70 %.
Looking at the plot, you see the highest values, 77 % and 74 %, for the samples enriched with the Nimblegen and Agilent
platforms, respectively. For Illumina TruSeq, on the other hand, only 48 % of reads are mapped on the target region.
It is also not surprising that we see the highest mean on-target coverage (at the 2x threshold) for the Nimblegen samples,
since this platform contains overlapping oligonucleotide probes that cover the bases it targets multiple times, making
it the highest-density platform of the three. Agilent baits reside immediately adjacent to one another across the target
exon intervals. Illumina relies on paired-end reads to extend outside the bait sequences and fill in the gaps (Clark M.J.
et al, 2011):


The target annotations used in this tutorial can be found in Public Data, in the Genome annotations folder, or in the Target
Annotations for Clark et al (2011) tutorial folder, where we put them for your convenience.
Besides the target enrichment statistics, you can assess the percentage of exome bases with coverage starting from
2x and the overall proportion of reads mapped on the exome:

All targeted sequencing QC reports are collected in the Mapped reads enrichment reports for Clark et al (2011) folder.

3.3.7 Quality control of mapped reads

The reads may look fine at the raw reads quality control step, but some biases only show up during the mapping process:
low coverage, experimental artifacts, etc. Detecting such aberrations is an important step, because it allows
you to choose an appropriate downstream analysis.
You can ‘generate reports’ for each mapping separately or just run the Mapped Reads Quality Control data flow for
multiple samples and analyse the output reports in the Multiple QC Report app:
The output report includes mapping statistics such as:
1. Mapped reads: total reads which mapped to the reference genome;
2. Unmapped reads: total reads which failed to map to the reference genome;


3. Mapped reads with mapped mate: total paired reads where both mates were mapped;
4. Mapped reads with partially mapped mate: total paired reads where only one mate was mapped;
5. Mapped reads with “properly” mapped mate: total paired reads where both mates were mapped with the
expected orientation;
6. Mapped reads with “improperly” mapped mate: total paired reads where one of the mates was mapped with
unexpected orientation.
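
These categories can be recovered from the flag field of SAM/BAM records. Below is a simplified sketch using the flag bits defined in the SAM format specification (ignoring secondary and supplementary alignments):

```python
# SAM flag bits, per the SAM format specification
PAIRED, PROPER_PAIR, UNMAPPED, MATE_UNMAPPED = 0x1, 0x2, 0x4, 0x8

def classify(flag):
    """Bucket one alignment record into the report categories above."""
    if flag & UNMAPPED:
        return "unmapped"
    if not flag & PAIRED:
        return "mapped (single-end)"
    if flag & MATE_UNMAPPED:
        return "mapped with partially mapped mate"
    if flag & PROPER_PAIR:
        return "mapped with properly mapped mate"
    return "mapped with improperly mapped mate"

print(classify(0x1 | 0x2))  # mapped with properly mapped mate
print(classify(0x1 | 0x8))  # mapped with partially mapped mate
```
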
The Coverage by chromosome plot shows read coverage at each base on each chromosome and patch (if present),
drawn as lines in different colours:

If your reads are paired, the application additionally calculates insert size statistics, such as the median and mean insert
sizes, the median absolute deviation and the standard deviation of the insert size. The Insert size distribution plot shows the
insert size length frequencies:
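
These four statistics reduce to standard descriptive measures over the observed fragment lengths. A minimal sketch (function name is ours):

```python
import statistics

def insert_size_stats(sizes):
    """Median and mean insert size, median absolute deviation (MAD)
    and standard deviation, as reported for paired-end mappings."""
    median = statistics.median(sizes)
    return {
        "median": median,
        "mean": statistics.mean(sizes),
        "mad": statistics.median([abs(s - median) for s in sizes]),
        "sd": statistics.stdev(sizes),
    }

print(insert_size_stats([180, 200, 210, 190, 220]))
# median 200, mean 200, mad 10, sd ~15.8
```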


All complete QC reports for mapped reads are stored in the Mapped reads QC reports for Clark et al (2011) folder. You can
open all of them at once in the Multiple QC Report app to interactively analyse and compare mapping statistics between
samples:


Speaking of the mapping results, for each sample almost all of the reads are mapped properly, and there is only a small percentage
of partially or improperly mapped reads.

3.3.8 Post-alignment processing

After mapping reads to the reference genome, it’s recommended to remove duplicates before variant calling, in
order to eliminate PCR-introduced bias due to uneven amplification of DNA fragments. That’s why we run the
Remove Duplicated Mapped Reads app.
Preprocessed mapped reads are stored in the Filtered mapped reads for Clark et al (2011) folder.

3.3.9 Variant calling

After duplicate removal, the next step is to identify different genomic variants, including SNVs, indels, MNVs, etc.
For this, we’ll use the Variant Calling application based on samtools mpileup:
The app automatically scans every position along the genome, computes all the possible genotypes from the
aligned reads, and calculates the probability that each of these genotypes is truly present in your sample. Then genotype
likelihoods are used to call the SNVs and indels.
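
To give a feel for the idea (this is a deliberately naive model, not samtools’ actual one, which also weights bases by their quality scores), consider likelihoods for the ten diploid genotypes at one position:

```python
from itertools import combinations_with_replacement

def call_genotype(pileup_bases, error_rate=0.01):
    """Naive diploid genotype call for one position: each observed base
    is assumed to come from one of the two alleles with equal chance,
    with a flat sequencing error rate."""
    def p_obs(obs, allele):
        # probability of observing `obs` given the true allele
        return 1 - error_rate if obs == allele else error_rate / 3
    likelihoods = {}
    for a1, a2 in combinations_with_replacement("ACGT", 2):
        lk = 1.0
        for obs in pileup_bases:
            lk *= 0.5 * p_obs(obs, a1) + 0.5 * p_obs(obs, a2)
        likelihoods[a1 + a2] = lk
    return max(likelihoods, key=likelihoods.get)

print(call_genotype("AAAAAAAA"))  # AA (homozygous site)
print(call_genotype("AAAATTTT"))  # AT (heterozygous site)
```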

We run Variant Calling with the default parameters: identifying multi-allelic SNPs and indels, excluding non-variant sites
and not considering anomalous read pairs. The maximum read depth per position is set to 250 and the minimum number of
gapped reads for an indel candidate is 1. Base alignment quality (BAQ) recalculation is turned on by default; it helps
to rule out false positive SNP calls due to alignment artefacts near small indels. For more information about the app
and its options, click on the app name and then on About application.
When the files are complete, you can analyse the variants in Genome Browser:


The Genome Browser application allows you to investigate the variants interactively: see how many mutations are in a
particular gene or region, review information about the detected variants such as average mapping quality and raw
read depth, and compare variant enrichment between samples. Analysing the variants in Genome Browser, you can notice
a large number of both WES-specific and WGS-specific SNVs. We identified variants for each sample
separately and put them in the Variants for Clark et al (2011) folder.

3.3.10 Effect annotation

After the variants are detected, use the Effect Annotation application, which is based on the SnpEff tool. The app annotates variants and
predicts the effects they produce on genes, such as amino acid changes, impact, functional class, etc. To review this
information, open Variants with predicted effects in the View report application:
Let’s analyse the annotated variants for the sample enriched with Nimblegen. The output report contains a summary covering the tool
version, number of variants, number of effects, change rate, and other information.
The Change rate details table shows the length, changes and change rate for each chromosome and patch (if present).
Here are the change rate details for the first 10 chromosomes:


The app calculates the total number of variants as well as the numbers of homo-/heterozygous single nucleotide polymorphisms
(SNPs), multiple nucleotide polymorphisms (MNPs), insertions (INS), deletions (DEL) and combinations of SNPs and
indels at a single position (MIXED), and records them in the Number of changes by type table:

SNVs represent the most numerous sequence variations in the human exome. TruSeq detected the highest number of
SNVs, followed by Agilent and Nimblegen. Most of them are SNPs. For example, in the Nimblegen sample there are
~555,000 SNPs and ~40,000 insertions and deletions combined. No significant difference in the ratio of heterozygous
to homozygous variants between platforms was observed. However, in the WGS sample many more variants were
detected (3.8 million SNPs and about 600,000 indels).
The Number of effects by impact table shows the count and percentage of variants that have high, low or moderate impact, or
are tagged as modifiers:

As a rule, a mutation has high impact if it causes significant changes such as a frame shift, stop codon formation, or
deletion of a large part (over 1 %) of a chromosome or even a whole exon. Variants with low impact do not change
the function of the encoded protein; these are usually synonymous mutations. Moderate variants do not affect the protein
structure significantly but change the effectiveness of the protein’s function; they mostly include missense mutations, codon
deletions or insertions, etc. So-called modifiers are mutations in introns, intergenic, intragenic and other non-coding
regions.
For the sample enriched with Nimblegen, only about 0.04 % of all annotated variants have high impact, while more than
97 % of mutations are modifiers. We see the same percentage of modifiers in the WES and WGS samples.
The output report also contains information about the count and percentage of missense, nonsense and silent mutations.
You can find this in the Number of effects by functional class table:


For the Nimblegen sample, the app detected ~50 % point mutations in which a single nucleotide change results in a codon
that codes for a different amino acid (missense mutations). More than 50 % are silent mutations, which do
not significantly alter the protein. And only ~0.3 % are nonsense mutations: they change a codon into
a stop (or nonsense) codon, so that a truncated, incomplete, and usually nonfunctional protein is
produced. We notice almost the same percentages of missense, nonsense and silent mutations for the other WES and WGS
samples.
The next table, Number of effects by type and region, shows how many variants exist for each type (codon deletion, codon
insertion, etc.) and each region (e.g. exon, intron):

Variations histogram additionally illustrates what regions of genome are mostly affected:


Most variants are detected in introns. That can be explained by the fact that the platform baits sometimes extend
outside the exon targets.
The Quality histogram, like the one below, shows the distribution of quality scores for the detected variants:

This one is asymmetrical: there are more than 160,000 variants with a quality of 10, plus many small peaks at lower
and higher qualities.
The application also reports a Coverage histogram for the detected variants:

All variants have a coverage of 2 or more.


The next histogram, Insertions and deletions length, shows the size distribution of the detected indels:


For the Nimblegen sample, we identified more than 40,000 indels, of which ~24,000 were deletions of up to 12 bases and
the rest were insertions of up to 12 bases. More indels were identified after Illumina TruSeq enrichment
(~80,000), followed by the Agilent (~57,000) and Nimblegen platforms. These findings agree with the paper’s results:

Moreover, most insertions and deletions were 1 base in size. Notably, there is a slight enrichment at indel sizes of 4
and 8 bases in the total captured DNA data, which is also consistent with published results (Clark M.J. et al, 2011; Mills
R.E. et al, 2006).
In the Base change (SNPs) table, the app records how many and which single nucleotide polymorphisms were detected:


There is a slight increase in G→A/C→T transitions and a slight decrease in G→C/C→G transversions in both the whole-exome
and whole-genome samples.
The Transitions vs transversions (Ts/Tv) section reports the number of transitions, the number of transversions and their
ratio, both for SNPs and for all variants.
Transitions are mutations within the same type of nucleotide: pyrimidine-to-pyrimidine mutations (C↔T) and purine-to-purine
mutations (A↔G). Transversions are mutations from a pyrimidine to a purine or vice versa. The table represents
these values taking into account only SNP variants.

Below the table, you can find the same information for all variants. For the WES data, the Ts/Tv ratio of total variants
ranged from 1.6 to 1.8 and was lower than the estimated ~2.6. This can be explained by the fact that the platforms target
sequences outside coding exons (for the Nimblegen sample, 60 % of variants were found in introns). For the
WGS data, however, the ratio is equal to 2, as expected (Ebersberger I. et al, 2002).
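
Counting transitions and transversions from a list of (reference, alternate) SNP pairs is a one-liner; a sketch:

```python
# the four transition pairs; everything else is a transversion
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def ts_tv_ratio(snps):
    """Transition/transversion ratio over (ref, alt) SNP pairs."""
    ts = sum((ref, alt) in TRANSITIONS for ref, alt in snps)
    tv = len(snps) - ts
    return ts / tv

snps = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C")]
print(ts_tv_ratio(snps))  # 3.0
```
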
Looking at the Frequency of alleles histogram, you can see how many alleles appear once (singletons), twice
(doubletons), etc.:


In all samples, most of the variants are represented as singletons. Some variants (fewer than 400,000 for WES, and
about 1.5 million for WGS) have two alternate alleles.
The Codon changes table shows which and how many reference codons have been replaced. Here is just a fragment of this
table:


Reference codons are shown in rows, changed codons in columns. The most frequent changes are highlighted
in red. For example, 811 ‘ACG’ reference codons have been replaced by the ‘ACA’ triplet. If you compare this
information between our samples, you’ll find the same types and almost the same numbers of codon changes across the
WES samples.
In the Amino acid changes table, you can see the type and number of amino acid changes. A row indicates the reference amino
acid, a column the changed amino acid.

For example, 957 alanines (A, Ala) have been replaced by threonine (T, Thr) in the Nimblegen sample. The number and
type of amino acid changes look quite similar across the WGS and the different WES samples.
The Changes by chromosome plots show the number of variants per 10,000 Kb along the whole chromosome length.
Such a histogram is generated for each chromosome and patch present in the reference genome. Here is an example
plot for chromosome 1:

Besides the above-mentioned plots and tables, you can also see Details by gene.
We annotated the variants, calculating the effects they produce on known genes, and put them in the Variants with predicted
effects for Clark et al (2011) folder.


3.3.11 Variant prioritisation in Variant Explorer

The variants can also be interactively analysed in the Genestack Variant Explorer application:
Let’s select the Illumina sample and open it in Variant Explorer to look at the detected variants:

In total, 1,350,608 mutations were identified. Imagine that we are interested only in high-quality nonsense variants:
click the ‘QUALITY’ header to apply sorting and set ‘NONSENSE’ in ‘FUNCTIONAL CLASS’. You’ll see that the number
of mutations decreases significantly: we are left with only 104 nonsense variants:

You can use other filters and sorting criteria and look through the ‘Filters history’ to check how many variants were

detected after applying a specific filter, compared to the number of mutations at the previous filtering step:

Once the variants are sorted and filtered, you can share them with your colleagues, export them as a TSV file by clicking
‘Download table’, and attach them to your papers and other reports.
So, what can we conclude from our findings? Are the results for the WES samples really comparable to the WGS one?
Are there any key differences in performance between the three enrichment platforms? And which target capture
technology is better to select when planning an exome experiment?
Answering these questions, we found that neither whole-exome nor whole-genome technology managed to cover all
sequence variants. First, WGS cannot and will not replace exome sequencing, as due to genome characteristics there
will always be regions that are not covered sufficiently for variant calling. WES, in turn, shows high coverage, but
only over the target regions. Second, WGS has its value in identifying variants in regions that are not covered by
exome enrichment technologies. These can be regions where enrichment fails, non-coding regions, as well as regions
that are not present in the current exome designs. That’s why, to cover truly all variants, it might be worth
doing both WGS and WES experiments in parallel; the two technologies complement each other.
In general, all the technologies performed well. Our results demonstrated that they give a very high level of targeting
efficiency, with the Nimblegen technology demonstrating the highest and adequately covering the largest
proportion of its target bases. Therefore, Nimblegen is superior to the Agilent and Illumina TruSeq platforms for
research restricted to the regions that it covers. The technologies target different exomic features, but all of them cover
a large portion of the overall exome, with Illumina achieving the best exome coverage (~60 %). Moreover, the results
showed that the Agilent and Illumina platforms appeared to detect a higher total number of variants than
the Nimblegen one. That’s why the question of which enrichment platform is best must be answered with respect to all
these specific parameters.
This is the end of this tutorial. We hope you found it useful and that you are now ready to make the most of our
platform. If you have any questions or comments, feel free to email us at support@genestack.com. We also invite
you to follow us on Twitter @genestack.

References

• Clark M.J., et al. Performance comparison of exome DNA sequencing technologies. Nature Biotechnology
2011; 29(10):908–914
• Ebersberger I., et al. Genomewide comparison of DNA sequences between humans and chimpanzees. The
American Journal of Human Genetics 2002; 70:1490–1497
• Mills R.E., et al. An initial map of insertion and deletion (INDEL) variation in the human genome. Genome
Research 2006; 16:1182–1190

• van Dijk E.L., et al. Ten years of next-generation sequencing technology. Trends in Genetics 2014; 30:418-426

3.4 Whole-genome sequencing data analysis

Understanding genetic variations, such as single nucleotide polymorphisms (SNPs), small insertions-deletions (InDels),
multi-nucleotide polymorphisms (MNPs), and copy number variants (CNVs), helps to reveal the relationships between
genotype and phenotype. Currently, high-throughput whole-genome sequencing (WGS) and whole-exome sequencing
(WES) are widely used approaches to investigate the impact of DNA sequence variations on human diversity, identify
genetic variants associated with complex or Mendelian diseases, and reveal the variation across diverse human
populations. Both techniques have significant advantages and limitations, but balancing cost- and
time-effectiveness against the desired results helps in choosing the optimal sequencing approach. WES may cost less
than WGS because it covers only protein-coding regions and generates less raw data, but WGS provides a more comprehensive
picture of the genome, considering both non-coding and coding genomic regions. It also allows the identification
of SVs and CNVs that may be missed by WES. Moreover, WGS yields more uniform and reliable coverage. All
in all, WGS is a more universal method than WES.
This tutorial will guide you through the genetic variant discovery workflow on Genestack. We will analyse a dataset
by Dogan et al. including high-coverage (35x) WGS data from a Turkish individual. The experiment can be found in
Public Experiments, a regularly updated collection of freely accessible experiments imported from SRA, ENA, GEO and
ArrayExpress. Genestack enables you to work on public and private data seamlessly. Import your own sequencing
data, mapped reads or genetic variants data with our data importer.
The genetic variants analysis pipeline includes the following steps:
1. Quality control of raw reads
2. Preprocessing of the raw reads
3. Unspliced mapping of the preprocessed reads onto a reference genome
4. Post-alignment processing
5. Quality control of the mapped reads
6. Variant calling
7. Variant annotation


8. Variant filtering and prioritisation


The raw sequencing assays from the Dogan et al. experiment, all processed data and reports are located in the WGS Data
Analysis on Genestack Platform folder.

3.4.1 Quality control of raw sequencing reads

Poorly identified bases, low-quality sequences and contaminants (such as adaptors) in the raw sequencing data can
affect downstream analysis, leading to erroneous results and conclusions. Before starting the WGS analysis, we will
check the initial data quality and decide how to improve the downstream analysis using a variety of preprocessing options.
The FastQC Report app is based on the FastQC tool and produces several statistics characterising raw data quality: Phred
score distribution per base and per sequence, GC content distribution, per base sequence content, read length distribution,
and sequence duplication level. We will compute quality control statistics with the FastQC Report app for both
assays from the dataset. In order to do so, open the dataset in the Metainfo Editor, click on the Analyse button and
select the “FastQC Report” app from the list of suggested options.

Alternatively, feel free to run the “Raw Reads Quality Control” public data flow, which you can find on the Dashboard. The
app page presents the quality control part of the pipeline in graphical form. To generate QC reports, click on the Run
Data Flow button and then on Start initialization now.


If you don’t want to generate QC reports now, click the Delay initialization till later button.

In this case you can start the initialization later, for example from the Multiple QC Report app, which allows you to
explore the obtained results for both samples at the same time.


The calculations can also be started directly from the Multiple QC Report app page by clicking “(re)-start computation if
possible”.

Follow the progress of your tasks in Task Manager.

When the computations are finished, QC reports for both sequencing runs will appear on the Multiple QC app page.


Explore reports for each individual assay in FastQC Report app by clicking on the app or file name in the Task
Manager. Alternatively, go to Created files folder and look for a folder containing the files created for “Raw Reads
Quality Control” data flow. To describe raw reads quality control statistics we will use the reports from our tutorial
folder previously prepared by our team. To start with, we will open both of them in Multiple QC Report app that
interactively represents QC statistics for several raw assays at once.

You can select samples of interest, for example ones that are suitable for further analysis, and put them in the separate
folder by click on the New folder with selection button.

For paired reads the quality control report contains statistics such as total nucleotide count, GC content, number of
reads, and number of distinct reads. Using the Multiple QC Report app you can sort assays by the QC keys mentioned
above and by metainfo keys, such as “method” or “organism”. Now that we have a general impression of the quality of
the raw reads, we can go deeper and get more detailed statistics using the FastQC report for each individual sequencing
run. The FastQC report contains several quality control metrics outlined below:
• Basic statistics of the raw data, for example the total number of reads processed and GC content;
• Sequence length distribution, describing the distribution of fragment sizes in the analysed sequencing assay;
• Per sequence GC content plot, displaying the GC content across the whole length of each individual read;
• Per base sequence quality plots, depicting the range of quality scores for each base at each position in the
analysed sequencing assay;


• Per sequence quality scores plot, allowing the detection of poor-quality sequences among the total sequences;
• Per base sequence content plots, representing the relative number of A, C, T, and G for each position in the tested
sample;
• Sequence duplication level plots, representing the proportion of non-unique sequences present in the library;
• Overrepresented sequences, providing information on sequences that make up more than 0.1 % of the total
and may either have a high biological significance or indicate contamination of the library.
The table located on the left side of the page shows which reports raise warnings or failures. In this case these are
the Per base sequence content, Sequence duplication levels and Overrepresented sequences metrics. Raw data
for both sequencing runs failed the per base sequence content metric. Ideally, in a random library we would see four
parallel lines representing the relative base composition. Fluctuations at the beginning of reads in the tested sample
may be caused by adapter sequences or other contamination of the library.
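As a sanity check, the per-position base composition that FastQC plots can be computed from any collection of reads. The sketch below (with made-up toy reads) illustrates the metric itself, not FastQC's actual implementation:

```python
from collections import Counter

def per_base_content(reads):
    """Return, for each read position, the fraction of A/C/G/T across all reads.

    In a random library the four fractions should stay roughly flat along the
    read; spikes near the 5' end often point to adapter or primer contamination.
    """
    length = max(len(r) for r in reads)
    counts = [Counter() for _ in range(length)]
    for read in reads:
        for i, base in enumerate(read):
            counts[i][base] += 1
    result = []
    for c in counts:
        total = sum(c.values())
        result.append({b: c[b] / total for b in "ACGT"})
    return result

# Toy example: every read starts with the same base, as with a shared primer
content = per_base_content(["ATGC", "AGGC", "ACGC"])
print(content[0]["A"])  # 1.0: position 0 is 100 % A
```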

The warning reported for the sequence duplication metric for the first sequencing run indicates that the number
of non-unique sequences in the assay has reached more than 20 % of the total. The average duplication levels for the read
mates are 1.50x and 1.48x. The sequence duplication plot represents the relative number of sequences having different
duplication levels; for WGS experiments, which are generally characterised by even coverage, this graph should quickly
drop to zero. Duplicates can arise from PCR amplification bias introduced during library preparation or from reading
the same sequence several times.
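The duplication figures above can be illustrated with a small sketch: count how many reads are exact copies of an already-seen sequence. This is a simplification of FastQC's duplication module (which bins levels and subsamples), using toy reads:

```python
from collections import Counter

def duplication_stats(reads):
    """Return (fraction of non-unique reads, mean duplication level).

    A read is counted as non-unique if it is a copy of a sequence that has
    already been seen; the mean level is reads per distinct sequence.
    """
    counts = Counter(reads)
    total = len(reads)
    distinct = len(counts)
    non_unique = total - distinct  # every copy beyond the first occurrence
    return non_unique / total, total / distinct

# 5 reads, 3 distinct sequences: 2/5 of reads are duplicates
frac, mean_level = duplication_stats(["AAA", "AAA", "CCC", "GGG", "AAA"])
print(frac)  # 0.4
```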

Lastly, according to the reports, the first sequencing run, compared to the second one, contains some over-represented
sequences — sequences that are highly duplicated in a sample. In total, the app identified 1,052,139 sequences
consisting of ‘N’-bases.
The mentioned issues can be fixed by appropriate preprocessing of the raw data. In this case, we will trim
low-quality bases at the read ends and remove adaptors and contaminants. Moreover, we will filter reads by quality
score, so that further analysis only considers high-quality reads (Q20 or better). Despite differences
in the raw data quality, we will apply the same preprocessing steps to both samples. Note that after any
preprocessing step you can check its influence on the quality of the raw reads using the FastQC app.
Now that we have checked the quality of sequencing assays and decided on the appropriate preprocessing steps, it is
time to create the pipeline for genetic variants analysis of WGS data from the raw data preprocessing to the genetic
variants annotation and filtering.

3.4.2 Building the genetic variants analysis pipeline

To start the pipeline, open the “Homo sapiens Genome sequencing and variation” dataset in the Metainfo Editor, click
Analyse, and select the first preprocessing app — Trim Adaptors and Contaminants.


On the Trim Adaptors and Contaminants app page you can explore the list of created and source files, edit parameters
and continue building the pipeline.

You can also choose an app to explore the results, for example, to discover how this step affects the initial quality of
the raw reads with the FastQC Report app. If you want to learn more about the application, click on its name and go to
About application.


You can use the created trimmed files as inputs for other applications. Click on the Add step button and select
the next preprocessing app — Trim Low Quality Bases. This will take you to the Trim Low Quality Bases app
page. Proceed in the same way and add all the desired steps to the pipeline until you reach the final one — Effect
Prediction.


Don’t forget to set the parameters for each app in the pipeline and select the appropriate reference genome, in this case
the H. sapiens reference genome (GRCh37.75), which will be used by the Unspliced Mapping with BWA, Variant Calling and
Effect Prediction apps. You can return to any added app by clicking on its name. The final output file, containing
genetic variants and their possible effects on annotated genes, can be opened with the Variant Explorer and Genome
Browser apps.

To be able to re-use a manually built pipeline, you can create a data flow. Click on the resulting file name on the final
Effect Prediction app page and go to Manage, then Create new Data Flow.


The created data flow will open in the Data Flow Editor, where the pipeline for genetic variants investigation
using WGS is graphically represented. Click on the Run data flow button to create all the files for the pipeline.


This will take you to the Data Flow Runner page. To run the pipeline click on the Run Data Flow button at the bottom
of the data flow.


After that you will be prompted to either start the computation now or delay it till later:

We will postpone the analysis and focus on each step of the WGS data analysis pipeline. Later we can start initialization
directly from one of the suggested apps, such as Variant Explorer, Genome Browser or Effect Prediction.

You can verify the processing parameters on each individual app page before running the pipeline. To do this, click on
Edit file list and open the file with the app that created it:


Data Flow Runner allows you to start initialization up to any step of the pipeline. We recommend checking the
mapping quality after removing duplicates from the mapped reads, to make sure the reads can be used in variant
calling and effect prediction. To do this, click on 2 files in the Remove Duplicated Mapped Reads section and
start initialization from the right-click context menu. Follow the process in the Task Manager. Regardless of the status
of the analysis, all the created data flow files will be located in the corresponding folder inside the Created files folder.


Note that there is a data flow file including all the mentioned preprocessing and analysis steps, previously prepared by
the Genestack team. This data flow is called WGS data analysis for Dogan et al. (2014) and you can find it in our
tutorial folder.


Now let’s talk about each of the analysis steps we included in the pipeline in greater detail.

3.4.3 Preprocessing of the raw reads

Often overlooked, preprocessing of raw data is essential because it improves the original data quality and,
consequently, the results of the downstream analysis. To prepare raw reads for variant calling and annotation we will
run several preprocessing apps: Trim Adaptors and Contaminants, Trim Low Quality Bases and Filter by Quality
Score. First, let’s explore the parameters of the Trim Adaptors and Contaminants app.
The Trim Adaptors and Contaminants app finds and clips adapters and contaminating sequences from raw reads. The
authors of the experiment trimmed and filtered reads with Trimmomatic so that reads with a high quality score and a
minimum length of 36 bp after trimming were kept. We will apply the default parameters, discarding reads shorter than
15 bp. You can change the minimum length of the trimmed sequence on the app page.
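The minimum-length rule amounts to a simple filter applied after trimming; a minimal sketch with toy reads, using the 15 bp default mentioned above:

```python
def filter_by_length(reads, min_length=15):
    """Keep only reads that are still long enough after trimming.

    The 15 bp default mirrors the app default mentioned above; the original
    study kept reads of at least 36 bp after trimming with Trimmomatic.
    """
    return [r for r in reads if len(r) >= min_length]

reads = ["ACGT" * 10, "ACG", "A" * 15]  # 40 bp, 3 bp and 15 bp reads
kept = filter_by_length(reads)          # drops only the 3 bp read
print(len(kept))  # 2
```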

The next preprocessing step we plan to perform is removing low-quality bases with the Trim Low Quality Bases
app, based on the seqtk 1.0 tool. It removes low-quality nucleotides from the raw data according to the Phred+33 score,
which encodes the probability that a base was called incorrectly. Currently, this app does not expose any changeable
command line options.


We will finalize the data preprocessing by filtering the trimmed reads by quality score. The app filters out reads
from the input file according to the set Phred+33 quality score threshold. As usual, you can change the default parameters
on the app page. We will eliminate all reads with a quality score below 20, considering only bases called with 99 %
accuracy. By default the “Minimum quality score” is already set to “20”, so we only need to set the command line
option “Percentage of bases to be above the minimum quality score” to “100”.
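Since the filter works on Phred+33-encoded qualities, the Q20/100 % rule can be sketched in a few lines; this is an illustration of the idea, not the app's actual code:

```python
def phred33_scores(quality_string):
    """Decode a FASTQ quality string: ASCII code minus 33 gives the Phred
    score Q = -10 * log10(P_error), so Q20 means 99 % base-call accuracy."""
    return [ord(ch) - 33 for ch in quality_string]

def passes_filter(quality_string, min_q=20, min_percent=100):
    """True if at least `min_percent` % of bases reach `min_q`."""
    scores = phred33_scores(quality_string)
    good = sum(1 for q in scores if q >= min_q)
    return 100 * good / len(scores) >= min_percent

# '5' encodes Q20 (ord('5') = 53, 53 - 33 = 20); '#' encodes Q2
print(passes_filter("5555"))  # True: every base is exactly Q20
print(passes_filter("555#"))  # False: one base falls below Q20
```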

Check the quality of the preprocessed reads with the FastQC Report app to make sure it is satisfactory, or decide on
additional preprocessing steps prior to further analysis. After we have completed all the preprocessing steps,
our raw reads are of better quality, and now it is time to map the analysis-ready reads onto the human
reference genome.


3.4.4 Unspliced mapping reads onto a reference genome

To map the preprocessed reads to the reference genome we will use the Unspliced Mapping with BWA app, which
efficiently and accurately aligns sequencing reads against a whole reference genome without allowing large gaps, such
as splice junctions. The BWA-based aligner currently has a hardcoded command line.

Prior to variant discovery we recommend checking the quality of the mapped reads, because some issues,
such as low coverage, homopolymer biases or experimental artifacts, only appear after the alignment. One such
quality metric is sequencing coverage depth, which determines the confidence of variant calling. The deeper the
coverage, the more reads are mapped to each base, and the higher the reliability and accuracy of base calling. The
assays from the Turkish individual were obtained with high-coverage (35x) WGS. Let’s explore the read coverage of the
generated mapped reads for both runs interactively in Genome Browser, which allows navigation between regions of
the genome. To learn more about navigating in Genome Browser, take a look at our blog post.
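The relation between read count, read length and depth follows the Lander-Waterman estimate C = N x L / G; a quick sketch of what 35x over the human genome implies (the 100 bp read length and 3.1 Gb genome size are illustrative assumptions, not figures from the study):

```python
def mean_coverage(n_reads, read_length, genome_size):
    """Expected mean depth C = N * L / G (Lander-Waterman estimate)."""
    return n_reads * read_length / genome_size

# Roughly how many 100 bp paired-end reads does 35x over ~3.1 Gb require?
genome = 3.1e9
pairs = 35 * genome / (2 * 100)  # ~542 million read pairs
print(round(mean_coverage(2 * pairs, 100, genome)))  # 35
```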


3.4.5 Remove duplicated mapped reads

Sometimes, due to errors in sample or library preparation, reads may come from the exact same input DNA template
and accumulate at the same start position on the reference genome. Any sequencing error will then be multiplied and could
lead to artefacts in the downstream variant calling. Although read duplicates can represent true DNA material, it is
impossible to distinguish them from PCR artifacts, which result from uneven amplification of DNA fragments. To
reduce this harmful effect of duplicates prior to variant discovery we will run the Remove Duplicated Mapped Reads app,
based on the Picard MarkDuplicates tool. To determine duplicates, Picard MarkDuplicates uses the start coordinates and
orientations of both reads of a read pair. Based on identical 5' mapping coordinates it discards all duplicates with
the exception of the “best” copy.
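The selection logic described above can be sketched as follows; this is a simplification of Picard MarkDuplicates (which also accounts for soft-clipping and mate information), using made-up read records:

```python
def remove_duplicates(reads):
    """For each (chrom, 5' position, strand) group, keep the read with the
    highest summed base quality and discard the other copies."""
    best = {}
    for read in reads:
        key = (read["chrom"], read["pos"], read["strand"])
        score = sum(read["base_qualities"])
        if key not in best or score > sum(best[key]["base_qualities"]):
            best[key] = read
    return list(best.values())

reads = [
    {"chrom": "1", "pos": 725878, "strand": "+", "base_qualities": [30, 30]},
    {"chrom": "1", "pos": 725878, "strand": "+", "base_qualities": [40, 40]},  # duplicate; better copy
    {"chrom": "1", "pos": 800000, "strand": "-", "base_qualities": [20, 20]},
]
kept = remove_duplicates(reads)  # 2 reads survive; the higher-quality copy wins
```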

3.4.6 Quality control of mapped reads

As you remember, we ran just part of the pipeline, including preprocessing, alignment and removal of duplicates, to
check whether the mapping quality is good enough for us to move on to variant calling and annotation.
Post-mapping quality control is optional, but it is a very important step. The Mapped Reads QC Report app
produces various QC metrics such as base qualities, insert sizes, mapping qualities, coverage, GC bias and more. It
helps to identify and fix various mapping issues and makes downstream processing easier and more accurate. Find the
created filtered mapped reads (the outputs of the Remove Duplicated Mapped Reads app) in the Created files folder.
As with raw reads quality control, you can explore the results not only in the Mapped Reads QC Report app itself,
but also compare the mapping quality of both tested assays with the Multiple QC Report app. The report appears on the
page as soon as the computation is finished.


Let’s look at the example report for the two sequencing runs from our experiment. Go to the tutorial folder and open the
QC reports for both mapped reads files in the Multiple QC Report app. Use the drop-down menus “Select QC keys to
display” and “Select metainfo to display” to specify which QC metrics and sample-associated information you wish
to see on the plot.

According to the QC check, both technical replicates from our experiment are concordant, with all reads mapped
and 95 % of the reads mapped properly. To obtain more detailed statistics, explore an individual QC report in the Mapped
Reads QC Report app. Let’s explore the mapping quality for the first sequencing run of the Turkish individual sample. On
the app page you will find mapping statistics such as the numbers of mapped, partially mapped and unmapped
mate pairs. Besides general mapping statistics, the individual QC report contains a coverage-by-chromosome plot and, for
paired-end reads, some statistics on insert size and an insert size distribution plot. As we can see, the median insert
size is 364 bp, with a standard deviation of 66.99.

The insert size distribution plot displays the lengths and frequencies of inserts (x- and y-axis, respectively) in the
analysed assay.


After ensuring that our mapped reads are of high enough quality, we can move on to the final stages of our analysis
pipeline — variant identification and effect prediction. Let’s finalize the computations of the pipeline: make
sure to check the parameters of the Variant Calling and Effect Prediction apps, and start initialization of the rest of the files.

3.4.7 Variant calling

Experimental biases can lead to errors in variant calling that mimic true genetic variants. Variant calling on multiple
samples helps increase the accuracy of the analysis by taking the reads from several samples into consideration and
reducing the probability of calling sequencing errors. We run the Variant Calling app on the analysis-ready mapped reads
for both technical replicates with default parameters, which can always be changed on the Variant Calling app page. In the
picture below you can see the source files (the reference genome and both filtered mapped reads files) and the default
command line options.


Track the progress of your tasks in the Task Manager and, as soon as the computation is finished, explore the results of
variant identification using the interactive applications such as Variant Explorer or Genome Browser. Now, let’s take
a look at the results of variant calling in the Genome Browser. Click on the genetic variants file name in the Task
Manager and open it in Genome Browser using the context menu. If some files are not initialized yet, the Genome
Browser page will be empty; you can initialize the files by clicking on Go! to start the process. Tracks representing the
found mutations will appear on the page once the task is finished: a reference track displaying annotated genes with their
coordinates, and a variation track representing genetic variants, their genomic position, average mapping quality and
raw read depth.


Zoom in to explore genetic variants at single-base resolution. For example, looking at the region 1:725878-725972
(95 bp) we can see several SNPs (red) and one 5 bp deletion (blue).

3.4.8 Effect prediction

After variants have been identified, we can annotate them and identify the effects they produce on known genes with the
Effect Prediction app. The app is based on the SnpEff tool, which annotates variants based on their genomic location: intronic,
untranslated regions (5'UTR or 3'UTR), upstream, downstream, splice site, or intergenic regions. It also determines the
effects genetic variants have on genes, such as amino acid replacements or frame shifts. Remember, if some files are still
uninitialized, you can run the analysis from the Effect Prediction page or the Data Flow Runner page.


Let’s now analyse the annotated variants in the genome of the Turkish individual. You can do this with the Report Viewer
application: right-click on the “Variants with predicted effects for Dogan et al. (2014)” file name and go to View Report.
Note that you can also explore the annotated variants in the Genome Browser and Variant Explorer apps.


First of all, the report summary contains some basic information about the analysed file.

In total, 4,389,254 mutations were found in our assay, with one change every 7,014 bases. The most common
variants are SNPs, which make up 3,835,537 of the total. The second most abundant type of genetic variation after
SNPs are InDels: 252,548 insertions and 301,169 deletions were found. According
to the paper, the authors identified 3,642,449 and 4,301,769 SNPs using the Casava and GATK workflows, respectively.
However, in the downstream analysis they used the 3,537,794 variants identified by both methods.


The insertion/deletion length histogram graphically demonstrates the distribution of lengths of all insertions and
deletions. The discovered InDels ranged from -43 to 28 bp in length, with a standard deviation of 5.256. The authors
detected 713,640 InDels (341,382 insertions and 372,258 deletions) ranging from -52 bp to 34 bp in length.

Additionally, we performed filtering by effect to find out the InDel distribution throughout different genomic
locations. Of the identified InDels, 258,680 and 263,835 were in intergenic and intronic regions, respectively. We also
found 69,426 InDels in upstream and 74,162 InDels in downstream gene regions. Only 69 and 78 mutations were
detected in splice site donors and splice site acceptors, respectively. Finally, we detected 6,241 insertions and
deletions in exons. Besides the statistics on the change type of the found mutations, the report also contains quality and
coverage information. The quality histogram shows the quality distribution, with a minimum value of 3 and a maximum value
of 486 for the analysed data:


The following histogram shows coverage. For our data the mean coverage is 28.882 while the maximum coverage is
8,026.

For all the identified genetic variants the app also calculates associated effects and prioritises them by putative
biological impact.

For example, if a found mutation leads to a protein truncation, it could have a high, disruptive effect on the
gene function. However, variants that influence only the protein effectiveness will most likely have only a moderate
effect, and synonymous variants that are unlikely to change the protein behaviour will probably have a low effect. Variants
affecting non-coding genes are considered modifiers. It is important to remember that this grouping doesn’t guarantee
that the high-impact variants are the ones responsible for the analysed phenotype. Genetic variants can
have various effects on the genome: for instance, they can result in codon changes, insertions or deletions, frame
shift mutations, etc. Genetic variants can affect different genomic regions, such as exons, intergenic regions, introns,
untranslated regions, splice sites, upstream and downstream regions. As we can see from the report, most changes in
the Turkish individual genome are located in intronic regions (63.351 % of the total).


As we can see, the vast majority of the identified variants are associated with introns (above 60 %) and there are
no mutations in splice sites. Changes in intergenic regions represent ~17 % of the total, while changes in exons
occur in approximately 2 % of events.

The most frequent base change is G to A (651,754 events), followed by C to T (650,016), T to C (621,506) and A to G
(620,959).


The quality of SNP data can be characterised by the transition/transversion (Ts/Tv) ratio, which for the whole human
genome is typically about 2. Note that this ratio is not universal and can vary between regions; for example, it is
higher for exons.
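Transitions are A/G and C/T interchanges; everything else is a transversion. A small sketch of the ratio computation, checked against the transition and transversion counts reported for the original paper:

```python
def is_transition(ref, alt):
    """A<->G and C<->T are transitions; the other changes are transversions."""
    return {ref, alt} in ({"A", "G"}, {"C", "T"})

def ts_tv_ratio(changes):
    """changes: iterable of (ref_base, alt_base, count) tuples."""
    ts = sum(n for ref, alt, n in changes if is_transition(ref, alt))
    tv = sum(n for ref, alt, n in changes if not is_transition(ref, alt))
    return ts / tv

# Using the totals from Dogan et al.: 2,383,204 Ts and 1,154,590 Tv
print(round(2383204 / 1154590, 2))  # 2.06
```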

Our results are in line with the original paper by Dogan et al., where 2,383,204 transitions and
1,154,590 transversions were identified, resulting in a Ts/Tv ratio of 2.06. The next entry of the report is the codon
replacement table (we have posted a fragment of it below). Rows represent reference codons and columns represent
changed codons. The most common codon change in our data is from GAC to GAT (876 events), resulting in a synonymous change.

The report also contains the amino acid changes table, where reference amino acids are represented by rows and
changed amino acids by columns. For example, row ‘A’ and column ‘E’ show how many Ala residues have
been replaced by Glu. The most common amino acid changes are Val to Ile (780 events), Ala to Thr (722 events) and
Ile to Val (693 events).


Apart from the mentioned statistics and plots, the report also contains allele frequency plots and information on the
change rate per chromosome.

3.4.9 Genetic variants filtering

The resulting genetic variants files, annotated or not, can be opened in the Variant Explorer app. In the Variant Explorer
you can interactively explore the information about the found mutations, as well as sort and filter them by specific factors
such as: locus, type of variant (SNP, INS, DEL, MNP), reference or alternative allele, the Phred-scaled probability that the
alternative allele is called incorrectly, and, for annotated variants, their predicted effect, impact and functional class.
Besides that, the app computes genotype frequencies for homozygous samples with reference and alternative alleles
(GF HOM REF and GF HOM ALT columns, respectively), read depth for homozygous samples with the alternative
allele (DP HOM ALT) and read depth for heterozygous samples (DP HET). To prioritise the found mutations, open an
annotated genetic variants file in the Variant Explorer: right-click on the resulting file name in the Data Flow Runner,
Task Manager or File Browser and select Variant Explorer in the context menu. In total, 4,361,389 variants were
found.


Let’s now use the filters to see how many of these are high-impact variants. Set the filter “Impact” to “high”. As we
can see, out of all the identified variants, 1007 have a high impact.

Let’s now see how many of these are nonsense mutations by applying the “Functional class” filter. Out of all the
high-impact variants, 154 are nonsense mutations.

Let’s see how many of those are found on chromosome 10 by specifying the chromosome in the “Locus” filter. It turns
out that on chromosome 10 there is only one high-impact nonsense mutation. This base change is located
in the CTBP2 gene and results in a premature stop codon.
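The same three filters can be expressed over any tabular variant annotation; the sketch below uses made-up records, and the field names are assumptions rather than the platform's actual schema:

```python
def filter_variants(variants, **criteria):
    """Keep variants whose fields match every given criterion."""
    return [v for v in variants
            if all(v.get(k) == val for k, val in criteria.items())]

# Toy records standing in for an annotated variants table
variants = [
    {"chrom": "10", "impact": "HIGH", "functional_class": "NONSENSE", "gene": "CTBP2"},
    {"chrom": "2",  "impact": "HIGH", "functional_class": "MISSENSE", "gene": "XYZ"},
    {"chrom": "10", "impact": "LOW",  "functional_class": "SILENT",   "gene": "ABC"},
]

high = filter_variants(variants, impact="HIGH")
nonsense = filter_variants(variants, impact="HIGH", functional_class="NONSENSE")
on_chr10 = filter_variants(variants, impact="HIGH",
                           functional_class="NONSENSE", chrom="10")
print(on_chr10[0]["gene"])  # CTBP2
```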

These are all the steps of the WGS data analysis pipeline. You can use the files from our tutorial folder to reproduce
the results. Feel free to perform further prioritisation and play with the filters in Variant Explorer to get more information.
For example, you may want to find out how many InDels result in a frame shift or codon deletion, or explore the variant
distribution in any region of interest. In summary, our analysis identified 3,835,537 SNPs. We also
identified 252,548 insertions and 301,169 deletions ranging from -43 to 28 bp. Although our results are in concordance
with the original paper, there are also some differences in the number of identified mutations and the InDel length
distribution, as mentioned above. Such variation can be explained by the use of different tools. For example, the authors
identified variants with the vendor-supplied Eland-Casava pipeline and the Genome Analysis Toolkit (GATK v2.2), while we
used the Variant Calling application based on SAMtools and BCFtools.
This is the end of this tutorial. We hope you found it useful and that you are now ready to make the most of our
platform. If you have any questions or comments, feel free to email us at support@genestack.com. We also invite
you to follow us on Twitter @genestack.

3.5 Microbiome data analysis

Previously, to study microbes — bacteria, archaea, protists, algae, fungi and even micro-animals — it was necessary to
grow them in a laboratory. However, many of the microorganisms living in complex environments (e.g. gut or saliva),
have proven difficult or even impossible to grow in culture. Recently developed culture-independent methods based on
high-throughput sequencing of 16S/18S ribosomal RNA gene variable regions and internal transcribed spacers (ITS)
enable researchers to identify all the microbes in their complex habitats, or in other words, to analyse a microbiome.
The purpose of this tutorial is to describe the steps required to perform microbiome data analysis, explain how to build
your own data flow, and finally, discuss the results obtained in such an analysis.

3.5.1 Setting up a microbiome experiment

First, we need a good example of microbiome data. You can upload your own data using the Import button or search
through all the available experiments using the Data Browser application. Our analysis will be based on data from
Alfano et al. 2015. Let’s look up this experiment in the platform and open it in the Metainfo Editor:

The researchers examined the microbiome composition of different body regions of two captive koalas: the eye, the
mouth and the gut (through both rectal swabs and feces). Samples were characterized by 16S rRNA high-throughput
Illumina sequencing. First, since koalas frequently suffer from ocular diseases caused by Chlamydia infection, the
scientists examined the eye microbiome and found that it is very diverse, similar to other mammalian ocular microbiomes,
but with a high representation of bacteria from the Phyllobacteriaceae family. Second, the authors determined that, despite
a highly specialized diet consisting almost exclusively of Eucalyptus leaves, koala oral and gut microbial communities
are similar in composition to the microbiomes of the same body regions of other mammals. Also, it was found that
the rectal samples contain all of the microbial diversity present in the faecal ones. And finally, the researchers showed
that the faecal communities of the captive koalas are similar to those of wild koalas, suggesting that captivity may
not influence koala microbial health.

3.5.2 Microbiome data analysis pipeline

A typical data flow for microbiome analysis consists of the following steps:
1. Quality control of raw reads
2. Preprocessing of raw reads
3. Quality control of preprocessed reads
4. Microbiome analysis
Let’s go through each step to get a better idea of what it really means.


3.5.3 Quality control of raw reads

Garbage in, garbage out: your analysis is only as good as your data. Therefore, the first and very
important step in any kind of analysis is quality control (QC). It allows you to look at some relevant properties of the
raw reads, such as quality scores, GC content, base distribution, etc., and check whether any low-quality reads, PCR
primers, adaptors, duplicates or other contaminants are present in the samples. To assess the quality of the
data, we’ll run the Raw Reads QC data flow.
The Genestack FastQC application generates basic statistics and many useful data diagnosis plots. Here are some of
them for the Mouth of Pci_SN265 sample.

Basic statistics reports some sample composition metrics such as read type, number of reads, GC content and total
sequence length. Our sample contains 470,459 paired-end reads, which all together give a total sequence length of
236,170,418 bp. The GC content is 49 %.


The sequence length distribution module gives us information about read lengths in a sample. In our example, all the
reads have the same length, 251 bp.
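These figures are internally consistent: each of the 470,459 pairs contributes two 251 bp reads, which reproduces the total sequence length reported in the basic statistics above.

```python
def total_sequence_length(n_pairs, read_length):
    """Each pair contributes two reads of `read_length` bases."""
    return n_pairs * 2 * read_length

print(total_sequence_length(470_459, 251))  # matches the 236,170,418 bp reported above
```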

Per sequence GC content graph shows GC distribution over all reads. A roughly normal distribution indicates a
normal random library.


However, as in our case, there can be sharp peaks, which usually indicate the presence of adapter, primer or rRNA
contamination. To remove possible contaminants, we’ll run the Trim Adaptors and Contaminants application.
Per base sequence quality plots show the quality scores across all bases at each position in the reads. By default,
only the low-quality zone and mean quality lines are displayed. If the median (the middle quartile, Q2) is less than 20
or the lower quartile (Q1) is less than 5, you’ll get failures (you’ll see a red cross near the metric).
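The failure rule quoted above can be checked per read position; the sketch below uses a simple quartile estimate (FastQC's exact binning differs):

```python
import statistics

def position_fails(qualities):
    """FastQC-style failure: median < 20 or lower quartile < 5.

    `qualities` are the Phred scores observed at one read position
    across all reads in the sample.
    """
    ordered = sorted(qualities)
    median = statistics.median(ordered)
    q1 = ordered[len(ordered) // 4]  # crude lower-quartile estimate
    return median < 20 or q1 < 5

print(position_fails([30, 32, 35, 38]))  # False: a good position
print(position_fails([2, 3, 25, 30]))    # True: the median is below 20
```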

In our sample, the second mates in the paired-end reads have bases of bad quality at the end of the sequences. To get rid
of these bases and improve the read quality we’ll run the Trim Adaptors and Contaminants and Filter by Quality Scores
applications.
The per sequence quality scores report allows you to see the frequencies of quality values in a sample. The reads are of
good quality if the peak on the plot is shifted to the right, towards the maximum quality score.


In our example, the first and second mate reads differ by quality score, but still, almost all of them are of good quality
(>30). We will improve the read quality by running the Filter by Quality Scores application.
Per base sequence content plots show nucleotide frequencies for each base position in the reads. In a random library,
there should be only a small difference between the A, T, C and G nucleotides, and the lines representing them should be
almost parallel to each other. The black N line indicates the content of unknown N bases, which shouldn’t be present in
the library.

Since we analyse 16S microbiome data, all the reads should begin with the same forward primer, which is why we may observe peaks of 100% base content at the first positions. The nucleotide frequency pattern in the rest of the sequence can be explained by the 16S sequence variation of the analysed bacterial community.
Sequence duplication levels plots help us evaluate library enrichment: in other words, the percentage of the library made up of sequences with different duplication levels.

The application also picks up overrepresented sequences, which may represent primer or adaptor contamination, as well as indicate highly expressed genes.

The last two QC metrics — Sequence duplication levels and Overrepresented sequences — should not be used to evaluate 16S microbiome samples. Since we are looking at sequencing data for only a single gene, we expect to see an excess of highly similar sequences and, in turn, to get failures for these modules.
We have run QC on all the data in the experiment and put the reports in the Raw reads QC reports for Alfano et al. (2015) folder, so that you can open all of them in the Multiple QC Report application to analyse the results.
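The idea behind the duplication-levels metric can be sketched as follows (a toy illustration of how a library decomposes into duplication levels; FastQC’s real implementation bins high levels and only samples part of the library):

```python
from collections import Counter

def duplication_levels(sequences):
    """Percentage of the library made up of sequences seen exactly
    1x, 2x, 3x, ... times (a simplified sketch of the idea behind the
    FastQC module)."""
    # duplication level -> number of distinct sequences at that level
    copies = Counter(Counter(sequences).values())
    total = len(sequences)
    return {level: 100.0 * level * n_seqs / total
            for level, n_seqs in sorted(copies.items())}

library = ["ACGT", "ACGT", "ACGT", "TTGA", "TTGA", "CCCA"]
print(duplication_levels(library))  # the triplicated sequence accounts for 50% of reads
```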


You can see that the total number of sequencing reads per sample is quite small, varying in the range of 197,000 to 471,000 paired-end reads. Overall, more than 2.5 million paired-end sequencing reads were generated.

3.5.4 Preprocessing of raw reads

FastQC reports help you understand whether it is necessary to improve the quality of your data by applying trimming,
filtering, adaptor clipping and other preprocessing steps. Here is the list of Genestack preprocess applications available
for raw reads:
• Trim adaptors and contaminants;
• Trim low quality bases;
• Filter by quality scores;
• Trim to fixed length;
• Subsample reads;
• Filter duplicated reads.
Once we have checked the quality of the raw reads, let’s start building the Microbiome data analysis pipeline. Our
preprocessing procedure will include two steps — adaptor trimming and filtering out low-quality reads.
1. Trim adaptors and contaminants
For this, click Analyse and select Trim Adaptors and Contaminants:


This brings you to the application page. At this step, the application will scan the reads for adaptors, primers, N bases or any poor-quality nucleotides at the ends of reads and, based on a log-scaled threshold, will perform clipping.

By default, the application uses an internal list of widely used PCR primers and adaptors that can be considered possible contaminants. The occurrence threshold before adapter clipping is set to 0.0001; it refers to the minimum number of times an adapter needs to be found before clipping is considered necessary.
After trimming, the reads become shorter. In order to discard trimmed reads of length below a specific value, indicate this value in the “Minimum length of the trimmed sequence (bp)” option. We will use the default length of 15 nucleotides.
To get more information about the application and its parameters, you can click the application name at the top of the
page and choose About application.

Trimmed reads for our 8 samples are stored in Trimmed raw reads for Alfano et al (2015) folder.
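Conceptually, adaptor clipping with a minimum-length cut-off works like this toy sketch (an exact-match simplification; the real application scans a whole adaptor/primer list, tolerates mismatches and applies the occurrence threshold described above):

```python
def trim_adaptor(read, adaptor, min_length=15):
    """Clip a read at the first occurrence of an adaptor sequence and
    discard it if the remainder is shorter than min_length.

    Returns the trimmed read, or None if it should be discarded.
    """
    pos = read.find(adaptor)
    if pos != -1:
        read = read[:pos]  # keep only the insert before the adaptor
    return read if len(read) >= min_length else None

# The adaptor here is the start of the common Illumina adapter sequence.
print(trim_adaptor("ACGTACGTACGTACGTAGATCGGAAGAG", "AGATCGGAAGAG"))  # 16 bp insert kept
print(trim_adaptor("ACGTACGTAGATCGGAAGAG", "AGATCGGAAGAG"))          # trimmed to 8 bp < 15 -> None
```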
2. Filter by quality scores
Let’s continue our preprocessing procedure and add the next step, Filter by Quality Scores:


This application will filter out the reads based on their quality scores. For this purpose, we need to specify “Minimum
quality score” and “Percentage of bases to be above the minimum quality score”.

According to the paper, only reads in which at least 75% of the read length has a quality score above 30 were kept for further analysis. Let’s use the same settings: 30 for the minimum quality score and 75% for the percentage threshold.
Learn more about how the application works in the About application section.
All 8 samples after this filtering are collected in Filtered trimmed raw reads for Alfano et al (2015) folder.
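The filtering rule (keep a read only if at least 75% of its bases reach the minimum quality score of 30) can be sketched as follows, assuming Phred+33-encoded FASTQ quality strings (a simplification of the actual application):

```python
def passes_quality_filter(quality_string, min_quality=30, min_percent=75.0):
    """Keep a read only if at least min_percent of its bases have a
    Phred quality of at least min_quality (the rule described above;
    quality strings are assumed to be Phred+33 encoded)."""
    quals = [ord(c) - 33 for c in quality_string]
    good = sum(q >= min_quality for q in quals)
    return 100.0 * good / len(quals) >= min_percent

# Phred+33: '?' encodes Q30, 'I' encodes Q40, '#' encodes Q2.
print(passes_quality_filter("IIII?III"))  # all 8 bases are >= Q30 -> True
print(passes_quality_filter("II##II##"))  # only 50% are >= Q30 -> False
```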

3.5.5 Quality control of preprocessed reads

After the preprocessing steps, we will use the FastQC Report application again to compare the quality of raw and preprocessed reads and to make sure that we have improved read quality.
Here is the FastQC report for the “Mouth of Pci_SN265” sample before preprocessing:

The Per base sequence quality plots depict low-quality bases at the ends of the second mate reads. After trimming and filtering, the overall quality of the reads has been improved (see the Per base sequence quality and Per sequence content modules). We also expect warnings for Sequence length distribution, since the length of the reads has been changed during preprocessing.

You can find all FastQC reports for preprocessed reads in the QC reports of preprocessed raw reads for Alfano et al (2015) folder.
All in all, the quality of the reads has been noticeably improved by preprocessing, and our samples are ready for the downstream microbiome analysis.

3.5.6 Microbiome analysis

The last step in the microbiome analysis pipeline is the identification of microbial species and their abundances in the microbiome samples of the two captive koalas examined. To analyse the taxonomic composition of the microbial communities, the Microbiome analysis application requires a reference database containing previously classified sequences, such as the Greengenes (16S rRNA gene) and UNITE (fungal rDNA ITS region sequences) databases. The application compares the identified OTUs (operational taxonomic units) to the OTUs in the reference database.
Microbiome analysis results in two reports: Research report and Clinical report. In this tutorial we will focus on the
research microbiome analysis report generated for all the tested samples (Microbiome report for 8 files). To explore
the clinical microbiome report you should open a research report for an individual sample and click the View Clinical
Report button.
The research report provides abundance plots representing microbiota composition and microbiological diversity metrics.
Basic statistics describing the tested samples, such as sample count, number of reads per sample and number of
clustered reads per sample are calculated during the analysis process.


You can group samples by relevant metadata keys with the Group samples by option. Besides, you can apply filters to display absolute OTU counts, hide unidentified OTUs or hide partially identified OTUs.
The plot displays the relative abundance of OTUs at the highest taxonomic resolutions: genus (L6 tab) and species (L7 tab). You can change the resolution to the L2 level to see which phyla are the most abundant across the samples. For example, our results show that, at low taxonomic resolution (L2 tab), the composition of microbial communities is similar between samples. Bacteroidetes (8.30–86.73%), Firmicutes (1.46–50.49%) and Proteobacteria (1.38–64.96%) are the most abundant phyla across most of the samples, followed by Actinobacteria (0.02–15.14%) and Fusobacteria (0–10.33%).

You can see these results in the table as well. Click on the Total header in the table to get the most abundant phyla across the samples:

Our findings are consistent with the paper results:

At a higher taxonomic resolution (L7 tab), more than 820 microorganisms were identified. The koala eye and the koala gastrointestinal tract are characterised by distinct microbial communities. The eye microbial community was very diverse: despite the small number of very abundant phyla, the eye microbiome is characterised by a high representation of bacteria from the family Phyllobacteriaceae (34.93%).


To measure the similarity between the bacterial communities we used principal component analysis (PCA) based on
Pearson correlation coefficients:


You may change the PCA type in the upper-left corner of the plot and try other statistics to quantify the compositional
dissimilarity between samples: bray_curtis, abund_jaccard, euclidean, binary_pearson, binary_jaccard.
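For instance, the bray_curtis statistic listed above can be computed from two OTU count vectors as follows (a sketch with made-up counts, not the application’s code):

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two samples given as OTU count
    vectors: 1 - 2*C/(S_a + S_b), where C is the sum of the minimum
    count for each shared OTU. 0 means identical composition, 1 means
    no shared OTUs at all."""
    shared = sum(min(x, y) for x, y in zip(a, b))
    return 1.0 - 2.0 * shared / (sum(a) + sum(b))

print(bray_curtis([6, 4, 0], [6, 4, 0]))   # identical samples -> 0.0
print(bray_curtis([6, 4, 0], [0, 0, 10]))  # no shared OTUs -> 1.0
```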
In the paper, however, the authors used principal coordinate analysis (PCoA) to show the similarity between the koala microbial communities:

Both PCA and PCoA are used to visualize the data, but they apply different mathematical approaches to it.

Note: What is the difference between the PCA and PCoA?


The purpose of PCA is to represent as much of the variation as possible in the first few axes. To do this, the variables are first centred to have a mean of zero, and then the axes are rotated (and re-scaled). In the end, the first axis contains as much variation as possible, the second one contains as much of the remaining variation as possible, and so on.
PCoA uses a different approach, one based on the distances between data points. First, PCoA projects the distances into Euclidean space in a larger number of dimensions (we need n-1 dimensions for n data points). PCoA
puts the first point at the origin and the second one along the first axis, then adds the third one so that its distances to the first two are correct (which means adding a second axis), and so on until all the points are added. To get back to two dimensions, we apply a PCA to the newly constructed points and capture the largest amount of variation from the (n-1)-dimensional space.
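The relationship described in the note can be checked numerically: when classical PCoA is run on plain Euclidean distances, it recovers the same configuration as PCA on the raw data (up to per-axis sign flips). A numpy sketch of both constructions:

```python
import numpy as np

def pca_scores(X, k=2):
    """PCA: centre the variables, then project onto the top-k
    directions of maximal variance (via SVD)."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :k] * s[:k]

def pcoa_scores(D, k=2):
    """Classical PCoA: double-centre the squared distance matrix,
    eigendecompose it and keep the top-k axes."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n       # centring matrix
    B = -0.5 * J @ (D ** 2) @ J               # Gram matrix from distances
    w, V = np.linalg.eigh(B)                  # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:k]
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0))

X = np.array([[2.0, 0.0], [0.0, 1.0], [3.0, 4.0], [5.0, 2.0]])
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # Euclidean distances

pca = pca_scores(X)
pcoa = pcoa_scores(D)
# With Euclidean distances the two embeddings agree up to sign flips:
print(np.allclose(np.abs(pca), np.abs(pcoa), atol=1e-6))  # -> True
```

With non-Euclidean dissimilarities (such as Bray-Curtis), PCoA no longer coincides with PCA of any raw data matrix, which is exactly why the two plots can differ.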

Congratulations! You’ve just gone through the entire tutorial!

3.6 Expression microarray data analysis with Microarray Explorer

This tutorial will show you how to use our Microarray Explorer application for expression microarray data analysis. Microarray Explorer performs comprehensive microarray data analysis, from quality control checks to dose response analysis and differential expression analysis.
The microarray analysis pipeline includes the following steps:
• Normalisation;
• Quality control;
• Differential gene expression analysis;
• Dose response analysis (optional);
• Gene set enrichment analysis.

3.6.1 Setting up a microarray dataset

As an input for the application, you can use either your own data or one of the pre-imported public experiments. You can explore the already existing datasets with the Data Browser application.
The experiment we will use comes from the Library of Integrated Network-Based Cellular Signatures (LINCS) Project: “L1000 Connectivity Map perturbational profiles from Broad Institute LINCS - CD34 - vorinostat - LJP008 - 24h”.
Click the name of the experiment and open it in the Metainfo Editor to learn more:


3.6.2 Data analysis

You can open the input microarray dataset in the Microarray Explorer either by right-clicking the experiment name and selecting Microarray Explorer under “Explore”, or simply with one left-click on the experiment name.
There are two tabs on the opening application page:
1. Description tab, where you can find metadata associated with the data set of your choice.

2. Differential expression tab that allows you to run the pipeline and change default parameters if it is necessary.

You can consider setting the following parameters:


• Group samples by option — to group the samples by the experimental factor specified in the associated meta-
data; for example, to group samples by different dosages of Vorinostat, you should select “Dose” in the list of
suggested metainfo fields.
• Control group option — to specify the control group if needed, e.g. select 0 to compare each group of samples
with Vorinostat against the group of samples without Vorinostat. If you do not set a control group (“No control
group” option), each group will be compared against the average of the other groups.
• Optional report option — if you group the samples by compound dosage, the application suggests producing a dose response analysis report.
• Microarray annotation — to select the microarray annotation that will be used in the analysis. Only annotations relevant to the platform (Affymetrix, Agilent, or L1000) are shown; in this example, you will only see “LINCS L1000 Annotation”.


• Gene set database — to choose the gene set database for the gene set enrichment analysis. In this tutorial, we will choose the Gene Ontology database.
To start the analysis, just click the Analyze button.

At the earliest stage of the microarray data analysis, Microarray Explorer performs normalisation to eliminate some sources of technical variation which can affect the measured gene expression levels. Then, the quality of the normalised microarrays is assessed to detect and remove potential outliers; normalised microarrays of good quality can then be processed downstream.
When the reports are ready, they will appear on the application page. You can explore the results and apply filters to show particular genes and pathways. Change parameters and re-run the analysis if necessary.

You can save the results with the applied filters as an immutable view by clicking Save current view, and share the view with your collaborators by clicking the Share link. As a result, the shared view will appear in your colleague’s Saved view list.

3.6.3 Explore results

Let’s look in more detail at the interactive reports that can be generated by the Microarray Explorer application:


1. Differential Expression report. The report shows the list of top differentially expressed genes which can be
filtered and sorted by Log FC, Log Expr, and FDR parameters.

2. Benchmark dose report. The output of the Dose Response Analysis, representing the compound dosages at which genes start to show significant expression changes, i.e. the Benchmark Doses (BMDs).


3. Gene set enrichment report. The report provides the results of Fisher’s hypergeometric test between significantly differentially expressed genes and gene sets corresponding to pathways and/or biological functions, which allows you to determine which pathways or functions are enriched in differentially expressed genes.

4. Quality control report. This report presents the results of a quality assessment of the microarrays and allows you to detect apparent outlier arrays. To remove outliers, select the arrays that should be excluded (some will already be automatically tagged and selected as possible outliers) and click the Remove outliers button to subset the dataset and re-run the analysis.


5. Differential expression similarity search link. Opens the Differential Expression Similarity Search application, which helps you find experiments characterised by similar differential expression signatures.
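The Fisher/hypergeometric test behind the gene set enrichment report can be sketched with stdlib tools only (the numbers here are made up, and the platform’s implementation and multiple-testing handling may differ):

```python
from math import comb

def enrichment_p_value(n_genes, n_in_set, n_de, n_overlap):
    """One-sided hypergeometric (Fisher) test: probability of seeing at
    least n_overlap differentially expressed (DE) genes inside a gene
    set, given n_genes measured genes, n_in_set genes in the set and
    n_de DE genes overall."""
    total = comb(n_genes, n_de)
    p = sum(comb(n_in_set, k) * comb(n_genes - n_in_set, n_de - k)
            for k in range(n_overlap, min(n_in_set, n_de) + 1))
    return p / total

# 10,000 measured genes, a 100-gene pathway, 500 DE genes, 15 in the pathway:
print(enrichment_p_value(10_000, 100, 500, 15))  # small p-value: 15 overlaps vs ~5 expected by chance
```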


3.6.4 Explore existing reports

If you open an experiment in the Microarray Explorer application and go to the Differential Expression tab, the latest reports you produced or viewed will be shown on the page. However, you can manually switch to other reports (if any have already been created) by clicking the Saved view link. You will immediately see the list of all available Microarray Explorer reports for the given experiment, either generated by you or shared with you by a colleague.



CHAPTER 4

Genestack Platform overview

Clicking on your username (your email address) in the top right corner of the page gives you access to your profile and allows you to sign out of the platform.

In this section you can change your name, password, the name of your organisation and your vendor ID.


Organizations are a way of enforcing group permissions. There are two types of users in an organization – administrators and non-administrators. If you are in the same organization as another user, you can add them to groups you
control and share files with them freely. If you are in different organizations, administrators from both organizations
first need to approve adding them to the group. You can learn more about data sharing, permissions and groups in the
Sharing data and collaboration section.
Vendor IDs are used for application development. Applications you have created will be marked with your vendor ID.
Tasks links to the Task Manager application, where you can monitor running and previous computations.
Data Browser links to the Data Browser application, which allows you to browse public, private and shared data and to search through the wealth of your data using complex queries.
Wherever you are on the platform, you can also access a shortcuts menu by clicking the button in the top left corner of any platform page. It is an easy way to reach the most commonly used applications and folders. Data Browser, Import Data, Manage Applications, Manage Groups, Expression Data Miner, Differential Expression Similarity Search and Import Template Editor, as well as the folders for created and imported files, can all be found here. You can also click User Guide to access the user documentation.


Let’s look deeper into each of these items.

4.1 Manage applications

Here you can view the list of all applications available on the platform – both ones you have written and public ones.
The Developer button will give you the option to choose which version of an application you want to use.

The ‘minified’ option optimizes the loading of the CSS and JS used in the application. You can find more details on minification in a blog post by Dino Esposito.
The Session and User dropdown menus allow you to choose the version of the application you want to use for your current log-in session and for your current user account, respectively. Inherit is the default option, and the order of version choice inheritance is Global → User → Session. If you change the version of an application, you also need to reload it to run the version of your choice.

4.2 Manage groups

In order to share data, we use groups. In the Manage Groups section you can change the settings of your current collaboration groups or create new ones and invite other users to join. You can also view and accept all the invitations you have received from other users. Read more about collaboration on Genestack in the Sharing Data and Collaboration section.

4.3 Manage users

In this section, you can change the passwords of your users or create new users. If you click on Manage Users, you will go to the user management screen. Every user of the Genestack Platform belongs to an organisation. When you signed up to use Genestack via the sign-up dialog, we created a new organisation for you, and you automatically became its first user and its administrator. As an organisation administrator, you can create as many new users for your organisation as you want. For instance, you can create accounts for your colleagues. Being in one organisation means you can share data without any restrictions. The user management screen gives you an overview of all users in your organisation. You can change a user’s password, make any user an administrator or lock a user out of the system.

You can also create new users. Let’s create a Second User by clicking the Create user button.


You will need to set the user name, email and password. Users added this way are immediately confirmed and can log in right away.
You can find out more about managing users on Genestack from this video.



CHAPTER 5

Importing data

Here is a list of file types that can be imported into Genestack. Note that gzipped (.gz) and zipped (.zip) files are also supported.


Genestack file type, description, and supported file formats:
• Continuous Genomic Data — contains information on continuous genome statistics, e.g. GC content (WIGGLE, WIG);
• Discrete Genomic Data — information on discrete regions of the genome with an exact start and end position (BED);
• Gene Expression Signature — includes a list of genes and the expression pattern (Log FC) specific to an organism phenotype, with possibly additional annotation (XLS, TXT, TSV);
• Gene List — stores a list of genes with possibly additional annotation (XLS, TXT, TSV);
• Gene Signature Database — a list of annotated gene sets that can be used in enrichment analysis (GMT);
• Infinium Methylation Beta Values — methylation data matrices containing Beta-value methylation ratios for Illumina Infinium Microarrays (TSV, TXT);
• Infinium Microarray Data — raw intensity data files for Illumina Infinium Microarrays (IDAT);
• Mapped Reads — reads aligned to a specific reference genome (BAM, CRAM);
• Methylation Array Annotation — methylation chip annotation containing information about the association of microarray probes to known genes (TSV);
• Microarray Annotation — annotation file containing information about the association of microarray probes to known genes (TXT, CSV).

Note: Import of Gene Expression Signature and Gene List files


If the file contains both gene names and log fold changes, it is imported as Gene Expression Signature. If the file
only contains gene names, it is imported as Gene List. The importer will look at the headers of the .tsv file to try to
detect which columns may correspond to gene names or log fold changes (common variations are supported such as
‘gene’/‘symbol’ for gene names, and ‘logFC’/’log fold change’ for log fold changes). If it fails to detect them, the
user will be asked to manually choose the file type and specify the file headers corresponding to gene names or log
fold changes. Gene symbols and Ensembl/Entrez gene IDs are currently supported for gene names.
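The detection heuristic described in this note can be sketched as follows (the alias lists here are illustrative; the exact header variants the importer accepts may differ):

```python
def detect_columns(headers):
    """Guess which columns hold gene names and log fold changes, in the
    spirit of the importer behaviour described above."""
    gene_aliases = {"gene", "symbol", "gene symbol", "gene name"}
    logfc_aliases = {"logfc", "log fold change", "log fc"}
    gene_col = logfc_col = None
    for i, h in enumerate(headers):
        key = h.strip().lower()
        if key in gene_aliases and gene_col is None:
            gene_col = i
        elif key in logfc_aliases and logfc_col is None:
            logfc_col = i
    # Both found -> Gene Expression Signature; only genes -> Gene List.
    if gene_col is not None and logfc_col is not None:
        return ("Gene Expression Signature", gene_col, logfc_col)
    if gene_col is not None:
        return ("Gene List", gene_col, None)
    return (None, None, None)  # detection failed: ask the user to choose

print(detect_columns(["Symbol", "logFC", "p.value"]))  # signature: columns 0 and 1
print(detect_columns(["gene", "description"]))         # gene list: column 0
```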

When you import files that are detected as raw sequencing or microarray data, Genestack automatically creates a
dataset, a special type of folder, and adds the assays to it. Additional documents in any format (e.g. PDF, Word, text,
etc.) can be imported as attachments to a dataset. We will discuss the use of attachments below. Some types of files,
namely Reference Genome, Gene List, Gene Expression Signature, Gene Signature Database, Genetic Variations,
Ontology Files, Dictionary, Microarray Annotation, Methylation Array Annotation, Infinium Beta Values, are not
wrapped in datasets on import because they are rarely uploaded and processed as batches.
When you perform analyses on Genestack, other data types, which cannot be imported directly, can be created, such as:
• Affymetrix/Agilent/GenePix Microarrays Normalisation — file with normalized
Affymetrix/Agilent/GenePix microarrays data;
• Differential Expression Statistics — expression statistics for change in expression of individual genes or other
genomic features between groups of samples, such as fold-changes, p-values, FDR, etc.;
• Genome Annotations — a technical file used for matching GO terms and gene symbols to gene coordinates;
• Mapped Read Counts — a file produced from Mapped Reads that contains the number of reads mapped to each feature of a reference sequence.
There are several ways you can access the Import application:
• using the Import data link on the Dashboard;

• clicking the Import button in the File Manager;


• using an import template. We will describe what import template is and how to use it later in the guide.

Data import consists of three steps: first, temporary Upload files with your data are created in the platform; then, the biological data type is assigned to your imported data; finally, you can fill in all the required metadata or import it from a text file.


5.1 Step 1: Getting data into the platform

There are two ways to have your data imported into the platform:
1. Upload data from your computer — select or drag-and-drop files.

2. Import from URLs (FTP or HTTP/HTTPS) — specify URLs for separate files or directories.

Furthermore, you can reuse your previous Upload files instead of uploading the same data again: just select existing files with the Use previous uploads option and then add more data if necessary. This feature can be useful, for example, when you import a dataset with several samples and one of the files was chosen incorrectly or is corrupted, so you would like to replace it. In this case, you only need to upload that one sample again and reuse all the other previously uploaded files.

Note: What is an Upload file?


The Upload file is a temporary file that is automatically created during the data import process. The only purpose of Upload files is to temporarily store the data until the corresponding Genestack files are created and initialized correctly. It is the Genestack files that will be used further in bioinformatic data analysis; that is why the platform can periodically remove Upload files, but no data is lost.

Data uploading from your computer is carried out in multiple streams to increase upload speed. Import from URLs is
performed in the background, which means that even while these files are being uploaded, you can edit their metadata
and use them in pipelines.

If during uploading you lose your Internet connection, you will be able to resume unfinished uploads later.


Click the Import files button to proceed.

5.2 Step 2: Format recognition

After your data is uploaded, Genestack automatically recognizes the file formats and transforms them into biological data types: raw reads, mapped reads, reference genomes, etc. All format conversions are handled internally by Genestack, so you do not have to worry about formats at all.

If files are unrecognized or recognized incorrectly, you can manually assign them to a specific data type: drag the Upload file to the green “Choose type” box at the top of the page.

Choose the data type you find suitable:


Click the Create files button to proceed.

5.3 Step 3: Editing metainfo

At this step, the import has already completed, and you can describe the uploaded data using an Excel-like spreadsheet.

By default, you see all the metainfo fields available for the files; you can fill them in or create new custom columns. Click the Add column button, name the new metainfo field and choose its type (Text, Integer, etc.):


You can also choose to apply a naming scheme. This allows you to generate file names automatically based on other
metainfo attributes.

Metainfo fields can be associated with specific dictionaries and ontologies. We pre-uploaded some public dictionaries
such as the NCBI Taxonomy database for the “Organism” field, the Cellosaurus (a resource on cell lines), the ChEBI
for chemical compounds, and the Cell Ontology (cell types in animals).
We also created our own controlled vocabularies to cover Sex, Method and Platform fields. You can find out more
about ontologies in the Managing metadata section.

5.3.1 Import with templates

You can create your own custom dictionary by importing it into the platform as an OWL, OBO or CSV file and attaching it to the import template.

Note: What is an import template?


Import templates allow you to select which metainfo attributes of your imported files will be tightly controlled (so you don’t lose any information in the process). They also allow you to set default fields for file metadata based on file type (e.g. Datasets, Discrete Genomic Data, Genetic Variations, etc.). Of course, if you’re only importing mapped reads, you don’t need to specify metainfo attributes for other data types.

You can select which import template to use in two ways: from the Dashboard, or during the third step of the import process by right-clicking the import template name (“Default template” is the public one). You can create a copy of an existing import template with the Make a copy option in the context menu.

Genestack will attempt to fill the metainfo fields automatically, but you can always edit the contents manually during the import process. By using metainfo templates, you can make sure that all of your files are adequately and consistently described, so you will not lose any valuable information. For example, here is the list of metainfo attributes used by default to describe Reference Genome data:


The Import template editor application allows you to modify existing import templates and to create new ones with the proper metainfo fields, requirements and controlled vocabularies. To access the application, right-click a template’s name and select Import template editor from the “Manage” submenu. To create a new template on the basis of the default one, you can also click Add import template on the Dashboard.

Now let’s say you wish to create an import template to control the metainfo attributes of raw reads (e.g. you always need to know the tissue and sex of your samples). To do this, click Add import template, then look for the table related to Raw Reads and, for the fields “tissue” and “sex”, change the required fields to Yes. As you can see, the system controls what type of information you can put into your metainfo fields. In this case, for tissue the system will map your entries to the Uberon ontology (an integrative multi-species anatomy ontology) and the metainfo type must be text.


If you want to add other metainfo fields that are not already included in the table, you can do this at the bottom of the table, where there are blank spaces. For each entry, you must specify whether or not the field is required and what its metainfo type is (e.g. text, yes/no, integer).

If you are using a file kind that is not yet listed, you can add a new one by clicking on the Add file kind button. Keep
in mind that file kinds are defined in Genestack — you will not be able to create a template entry for a file kind that is
not used on the platform.
When you are done, click on the blue Import using this template button. This will take you to the Import Data app,
where you can go through the three import steps described above. You can find all the imported files in the “Imported”
folder which can be accessed from the Dashboard and from the File Manager.


5.3.2 Metadata import

Apart from editing metainformation manually, you can also import and validate the metainfo attached to the assays
and to the dataset on the platform.

Click the Import data from spreadsheet button and select a local CSV or Excel file containing the metadata you would like to associate with the imported files.

Note that the names in the first column of the metadata file should exactly match the names of the data samples on the platform, as shown in the first “Name” column. For example, in our case the metainfo for the second sample does not match any assays and is highlighted in red.


Use the Select file option to manually allocate the imported metadata to an appropriate file.

Columns that are mapped to a metainfo field from the dataset’s template (by default data are imported with “Default”
template) are highlighted in green.

At this step, for each column you can specify whether it should be imported, and whether it should be mapped to a metainfo key from the import template, by clicking on the column header.

Click Import when you finish editing the table. As a result, the table on the Metainfo Editor page is filled in with the metadata from the Excel file.


5.3.3 Attachments

While importing a dataset into Genestack, you can also attach various files to it, such as a PDF file with the experiment plan or an R script. When you open your newly-imported dataset, all of the attachments will accompany it. They are safely stored on Genestack, so you can later download them from the platform in case they get lost on your computer.
How to upload an attachment?
Attachments should be uploaded together with the dataset. In the Data Import application, choose the attachments from your computer along with your dataset. The platform will recognize the raw data, and all additional files that are not recognised will be added to the dataset as attachments.

You can also upload more attachments, or remove them, in the Metainfo Editor.


5.3.4 Browsing data

Efficient data search and browsing are at the core of Genestack. The platform provides rapid access to private, shared,
and public data and analysis results.

5.4 Data browser

Our platform provides a rich collection of freely accessible datasets that we imported from various well-known repositories, such as NCBI GEO, ENA, SRA and ArrayExpress. Data is synchronized regularly from these databases, keeping things up-to-date. There are currently more than 3 million sequencing and microarray assays from over 100,000 public datasets indexed in Genestack. All public datasets and assays are accompanied by the original metainformation describing the biological data. Generally, this information is not standardized, which makes operations on biological data, such as browsing, combining assays from several datasets or reproducing an analysis, difficult or even impossible without human involvement. To harmonize raw metadata we apply automated curation, mapping raw entries to controlled terms that we store and maintain in special files called Dictionaries. To prepare these Dictionaries we adopted terms from external ontologies or created them manually. You can also use our standardized and unified terminology to describe your own data or analysis results.
The Data Browser allows you to browse these public datasets, as well as your private data and the data shared with you on Genestack. You can access the Data Browser either from the Dashboard or from the Shortcuts menu on the left-hand side. You can search for relevant data with a free-text query, and further filter datasets by metadata attributes using the checkboxes on the left. These attributes are generated from the metadata associated with the datasets. For instance, you can set the filters “Access”, “Method” and “Organism” to “Public”, “Whole Exome Sequencing” and “Mus musculus”, respectively, to show only publicly accessible mouse WES data.


The Data Browser also allows you to find bioinformatics analysis results associated with raw data. If analyses have been performed on a given dataset and you have access to the results (i.e. they are yours, or they are shared with you), you will find both intermediate results and reports in the Downstream column.

You can also merge data from several datasets into a single combined dataset, or share several datasets with your collaborator at once. To do so, select several datasets and click the Merge... or Share... button, respectively, on the “Briefcase bar” that appears at the bottom of the screen.


If not all the samples meet your search criteria, feel free to create a subset of a dataset with the matching samples and process them separately. To do so, click the link showing the number of matching files in the Data Browser column Matched, then click the Make a subset with matching files button to save the files matching the set filters. You can also make a subset on the Metainfo Editor page.

Clicking on the name of any of the datasets will take you to the Metainfo Editor, where you can view (and possibly
edit) the metadata of this dataset and its assays.


Besides, directly from the Metainfo Editor page you can start building a pipeline step by step via the Analyse button.

If you want to analyse only part of your dataset, select the samples and click the Make a subset button (by default, all subsets are created in the My datasets folder).


Click a subset name to open it with the Metainfo Editor application and edit its metainformation if needed.

If you are the owner of a given dataset, you can add more samples to it by clicking the Upload more files button.

You can also remove files from a dataset: select the files you want to exclude and click the Remove files from dataset button.


If you are sure, confirm the removal by clicking the Remove button. Remember that if the files you are going to exclude from a dataset are not used anywhere else, they will be deleted from the platform with no possibility of restoring them.

If your dataset is made from subsets of other datasets, use the metainfo filters in File Provenance. Open the dataset in File Provenance to see on the basis of which metadata the samples were selected, so you can be sure that no significant data was omitted.


5.5 File manager

Like on any operating system, the File Manager is where you can easily access all of your files, organise them into
folders and open them with various applications.


The panel (tree view) on the left-hand side is the file system navigator. Here you can see many different folders. Some special folders are worth mentioning:
Created files is the folder where any new file created by an application on Genestack goes.
Imported files is where imported data goes, organized by date: all files imported at the same time (during one import action) will be located in the same folder.
Uploads contains all the files you have uploaded into Genestack: FASTQ and BAM files, PDF documents, Excel tables, etc.

Note: What is the difference between uploads and imported files?


When you have just started importing your files (in various formats like FASTQ, BAM, etc.), they all go to a dedicated storage area (the “Uploads” folder). During import, Genestack recognizes these uploaded files and allocates them to appropriate biological types (you can also do this manually), e.g. sequencing assays, mapped reads, etc. These meaningful biological objects are what you work with on Genestack, and they are located in the “Imported files” folder.

The Exports folder contains data ready for export. See the Data export section for more information.
Shared with me gives access to all files that other users have shared with you or that you have shared with other users. See the Sharing data and collaboration section for more details.
The Public Data folder contains all of the goodies we have preloaded on Genestack to make life a bit simpler for our users. This folder contains:


1. Codon tables: currently 18 different tables, such as yeast mitochondrial, vertebrate mitochondrial, blepharisma macronuclear, etc.;
2. Default template: the import template that is used by default during the data import process. It provides the list of optional and required metadata fields for each file kind. An ontology or a dictionary can be associated with metadata keys to validate metainfo;
3. Dictionaries: dictionaries include terms from external ontologies and are used to curate and harmonize metainfo, e.g. sex, platform, NCBI taxonomy;
4. Example results: so you can play around with our platform and see what types of visualizations are available;
5. External databases: sets of sequences with associated annotation, e.g. Greengenes for 16S rRNA;
6. Genome annotations: for a range of different organisms and platforms (for WES analysis);
7. Microarray annotations: annotation lists to be used as translation tables to link probes and common public domain sequences;
8. Public analyses: all files created during re-analysis of previously published datasets;
9. Reference genomes: various reference genomes for the most commonly analysed organisms;
10. Public data flows: all data flows available to our users, including tutorial data flows and the ones found on the Dashboard;
11. Public experiments: a feature we are particularly proud of. We have pre-loaded the platform with thousands and thousands of publicly available datasets from public repositories such as GEO, ArrayExpress, SRA, and ENA. Currently, we have more than 110,000 datasets in our database;
12. Tutorials: the folder contains files we use as examples in various tutorials.
To access the context menu for a given file, you can either right-click or left-click on the respective entry in the file browser. The topmost entry is the application that was used to generate the file, or the application that should be used to view it. The next four entries are submenus for each of the four different types of applications that can be used on the file. Further down are options for viewing and re-using the pipeline used to generate the file. The final section allows you to manage file locations and names. For folders, left-clicking opens the folder, while right-clicking opens the menu. The Add to and Move to actions allow you to link or move a file to a chosen directory.


Note: This does not perform a copy


We use the word “linking” and not “copying” in this context because in Genestack, adding a file to a folder does not
physically create a duplicate of that file (unlike copy-pasting in your traditional operating system). It just adds a link
to that file from the folder (similar to symbolic links on UNIX).
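
The difference between linking and copying can be pictured with a toy model in which a folder stores references to file accessions rather than file contents. This is a conceptual illustration only, not Genestack’s actual storage code, and the accession and folder names below are made up:

```python
# Toy model: folders hold links (file accessions), not copies of the files,
# so the same file can appear in many folders without duplicating its data.

files = {"GSF001": {"name": "reads.fastq", "size_mb": 1200}}   # one physical file
folders = {"Imported files": ["GSF001"], "My project": []}

def add_to(folder, accession):
    """Link a file into a folder; no data is duplicated."""
    folders[folder].append(accession)

add_to("My project", "GSF001")

# The file is now visible in two folders, but stored only once.
containing = [f for f, links in folders.items() if "GSF001" in links]
total_storage = sum(files[a]["size_mb"] for a in set(
    acc for links in folders.values() for acc in links))
print(containing)      # both folders list the file
print(total_storage)   # 1200, not 2400
```

This is why “Show all parent containers” makes sense as an operation: one file can legitimately appear in many folders at once.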

Show all parent containers shows you a list of all the folders in which the current file is linked. The file accession is
a unique identifier attached to each file. Unlike other metainfo attributes, it will never change for any file.

Above the File Manager pane, you can find the Import button. Clicking it takes you to the Import application page,
where you can upload your files, import them into the platform and edit their metainfo.

Next to the Import button, you can see a New Folder button, which lets you create a new folder wherever you want. Another option, New folder with selection, appears when you have selected files and want to put all of them in a separate folder.


The Preprocess, Analyse, Explore and Manage menus at the top of the page correspond to the four groups of
applications that can be used to process and view data. These menus will become available when you select a file.

When you choose a file, the system will suggest applications which can work with the specific file type (e.g. sequencing
assay). However, you still need to think about the nature of the data. For instance, if you want to align a raw WGBS
sequencing assay, Genestack will suggest several mappers, but only the Bisulfite Sequencing Mapping application will
be suitable in this case. To figure out what applications are recommended to process WGBS, WES, RNA-Seq or other
sequencing data, go to the Applications review section of this guide.
File search in the top-right corner allows you to search for files by metadata (names, organism, method). To limit the
search by file type or whether or not the file is shared with you, click on the arrow inside the search box.

Below the search box is a button to access your briefcase. Your briefcase is a place where you can temporarily store files from various folders. To delete an item from your briefcase, hover over it and click the “x” button. To clear all items from the briefcase, select the “Clear all” option.

To add files to your briefcase, hover over an individual file and use the special “briefcase” button, or select several files, right-click on them and choose “Add to briefcase...”.

If you select a file, three additional buttons will show up, allowing you to share the file, delete it, or view its metainfo (the “eye” icon).

Use the Share button to share your data with colleagues (the share button will not be available if you are using a guest
account). Read more about sharing on Genestack in the section Sharing data and collaboration.


The Delete button allows you to remove your files from the system.

The View metainfo button gives you more information about the file: technical (file type, its owner, when the file was
created and modified, etc.), biological (e.g. cell line, cell type, organism, etc.), and file permissions.


5.5.1 Managing metadata

The Metainfo Editor application enables you to explore the metadata of datasets or standalone files. Besides, if you have sufficient permissions, you can edit metadata or import it from a spreadsheet in .xls, .xlsx or .csv format. You can access the Metainfo Editor from anywhere on the platform via the context menu. Moreover, metadata editing is the last step of the data import process (see the Import section for more information). The metadata of the files is shown in Excel-like tables where columns represent metainfo fields, such as ‘Organism’, ‘Cell line’ or ‘Platform’.

5.6 Edit metadata manually

By default, a metainfo data table is based on the Default Import Template, which you can easily replace with a custom one (learn more about templates in the section Importing data). To do so, click on the template’s name, select Change template, and choose the template you want in the pop-up window.

When you start typing in the corresponding cell, the system will suggest terms from our controlled dictionaries where possible. You are free to enter any values; however, we encourage you to use our standardized terminology, which helps you avoid typos and harmonise metadata.


Furthermore, you can add several terms to one metadata field for each file. To do so, enter the first term as usual, click the Add another button, and either add one of the existing fields or create your own (i.e. a custom key).

If you create a new metadata field, you also need to specify its type: for example, for free-text values you should select “Text”, and for numeric values you should use “Integer” or “Decimal”.


Click a column name to sort the metadata, or delete the selected column if needed.


5.7 Import metainfo data from your computer

To begin, click the Import data from spreadsheet button. Then choose a CSV, XLS or XLSX file with the metadata you would like to attach.

Make sure that the sample names in the imported file are the same as the ones shown in the “Name” column of the Metainfo Editor application. Otherwise, any non-matching rows in the imported file will not be imported. They are marked in red, so you can easily fix them by clicking the “Select file” link.

During the metadata import process you can also decide whether a column should be imported, and associate it with another metadata field, by clicking on the name of the column.
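
The matching step can be pictured as a simple join between the spreadsheet’s first column and the sample names already on the platform. This is a hedged sketch of the idea, not the platform’s actual implementation, and the sample names are invented:

```python
# Sketch: match spreadsheet rows to platform samples by the "Name" column.
# Sample and column names are illustrative only.

platform_samples = ["Sample_A", "Sample_B", "Sample_C"]

spreadsheet = [
    {"Name": "Sample_A", "Tissue": "liver"},
    {"Name": "sample-2", "Tissue": "brain"},   # does not match -> shown in red
    {"Name": "Sample_C", "Tissue": "kidney"},
]

def match_rows(rows, samples):
    """Split rows into those matched to a platform sample and the rest."""
    known = set(samples)
    matched = [r for r in rows if r["Name"] in known]
    unmatched = [r for r in rows if r["Name"] not in known]
    return matched, unmatched

matched, unmatched = match_rows(spreadsheet, platform_samples)
print([r["Name"] for r in matched])    # ['Sample_A', 'Sample_C']
print([r["Name"] for r in unmatched])  # ['sample-2']
```

Because the join is exact, even a small difference in spelling or case (“sample-2” vs “Sample_B”) leaves a row unmatched, which is why such rows must be fixed manually via the “Select file” link.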


5.8 Compose file names using metainfo keys

When you have finished describing your samples, you can use the metadata to name them. Click the Apply naming scheme button and select the metainfo fields you want to use to create the names.
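
Composing names from selected metainfo keys amounts to joining the chosen field values per sample. A minimal sketch, with hypothetical field names and separator (the real naming scheme’s formatting may differ):

```python
# Sketch: build sample names from selected metainfo keys.
# The keys and the separator are assumptions for illustration.

def apply_naming_scheme(sample, keys, sep=" / "):
    """Join the values of the chosen metainfo keys into a display name."""
    return sep.join(str(sample.get(k, "unknown")) for k in keys)

sample = {"Organism": "Mus musculus", "Tissue": "liver", "Sex": "female"}
print(apply_naming_scheme(sample, ["Organism", "Tissue", "Sex"]))
# Mus musculus / liver / female
```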


5.9 Make a subset

If you want to analyse only some samples from a dataset, you can make a subset. There are two ways of making subsets: select the samples you want to analyse using the checkboxes and click Make a subset; or open the metainfo summary and specify the metainfo values that will be used as a rule to create the subset and filter out all non-matching files.
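
The second, rule-based way of subsetting can be sketched as filtering samples on metainfo key/value pairs. The field values below are illustrative, not taken from a real dataset:

```python
# Sketch: create a subset by keeping only samples matching metainfo filters.

samples = [
    {"Name": "S1", "Organism": "Homo sapiens", "Tissue": "liver"},
    {"Name": "S2", "Organism": "Mus musculus", "Tissue": "liver"},
    {"Name": "S3", "Organism": "Mus musculus", "Tissue": "brain"},
]

def make_subset(samples, **filters):
    """Keep samples whose metainfo matches every key=value filter."""
    return [s for s in samples
            if all(s.get(k) == v for k, v in filters.items())]

subset = make_subset(samples, Organism="Mus musculus", Tissue="liver")
print([s["Name"] for s in subset])  # ['S2']
```

This also illustrates why harmonized metadata matters: the filter only works if every sample spells “Mus musculus” the same way.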


Once you are happy with the metadata for your files, you can proceed to analyse them by clicking the Use dataset button. You can use the suggested visualisation applications to explore your files, like “FastQC Report” to check the quality of raw reads, use one of the existing public data flows, or build your own pipeline by adding applications step-by-step. Moreover, you can share the files with your collaborators and add them to a folder of your choice.

5.9.1 Sharing data and collaboration

5.10 Access control model

There are three concepts around access control in Genestack: users, groups and organisations. Each user belongs to a single organisation (typically corresponding to the user’s company or institution, or a specific team within the institution). Organisations have two types of users: regular users and administrators, who have the right to add new users and deactivate existing ones.


To check which organisation you belong to, you can go to the Profile page, accessible via the menu which opens when
you click on your email address at the top-right corner of any page.

5.11 Managing users

If you are an administrator of your organisation, the shortcut menu that appears when you click the Genestack logo will also have an additional item, Manage Users, which takes you to the organisation’s user management page.

From there, administrators can add or disable users, and reset passwords.


Sharing in Genestack is done through groups: every user can create any number of groups and add other users to them. Each file in the system can be shared with any number of groups, which can be granted different permissions (read-only, read and write, etc.).

5.12 Managing groups

To manage your groups, click on the Genestack logo at the top-left corner of any screen and select Manage groups.

From there, you can create groups using the Create group button, add or remove people from groups, and change users’ privileges within groups. By default, you will be a group administrator of any group you create.


If you are an administrator of a group, you can click the Add member button to add people to the group. You will be prompted for the e-mail address of the user you want to add. If they are in your organisation, autocomplete suggestions will be provided.

Note: Can I add users from other organisations?


You can also add users from other organisations to a group (“cross-organisation group”). However, in that case, every
user invitation will need to be approved by an organisation administrator of both your organisation and the other user’s
organisation.

Once you have added a user from your organisation to the newly created group, you will also be able to set up their permissions within the group. Within a group, a user can be:
• Non-sharing user (can only view data shared with the group);
• Sharing user (can view data shared with the group, and share data);
• Group administrator (all of the above, and can add or remove group members and change users’ privileges).
By default, newly added users are granted the lowest permission level (Non-sharing user). You can change that using the drop-down next to their name.
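
The three roles form a simple hierarchy, which can be sketched as an ordered capability check. The role names come from the list above; the numeric levels and the `can` function are illustrative assumptions, not Genestack code:

```python
# Sketch: the group-role hierarchy as ordered capability levels.
# Role names follow the documentation; the numeric levels are an assumption.

ROLES = {"non-sharing": 0, "sharing": 1, "admin": 2}
REQUIRED = {"view": 0, "share": 1, "manage_members": 2}

def can(role, action):
    """A role may perform an action if its level meets the requirement."""
    return ROLES[role] >= REQUIRED[action]

print(can("non-sharing", "view"))        # True
print(can("sharing", "share"))           # True
print(can("sharing", "manage_members"))  # False
print(can("admin", "manage_members"))    # True
```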


5.13 Sharing files with a group

If you are a sharing user or an administrator of a group, you can share files with that group. Any file created on
Genestack can be shared.
To share a file, you can click the file name and select the Share option in the context menu. Besides, some apps, such
as Data Browser, Metainfo Editor or File Manager, have a special Share button.

From there, you will be taken to the file sharing dialog, which asks you to select a group to share the file with. By
default, files are shared with read-only permissions (both for data and metadata). But you have the option of giving
members the ability to edit the files in addition to just viewing them.


Once you click the blue Share button, you will be asked whether you would like to link the file into the group’s shared
folder.

If you link the file into that folder, it will be visible to the group’s users when they open that folder (which can make it
easier for them to find it). If you click “No”, the file will not be linked into the group folder but the group’s users will
still be able to find the file through the File Search box (for instance, if you tell them the accession of the file), in File
Provenance and through the Data Browser.
Each group has an associated group folder which you can access from the File Manager under “Shared with me” in
the left-hand side panel.


All files you share with other people, along with all files shared with you, will be located in that folder.
It is also possible to share files directly from the application pages; for example, to share a FastQC Report with your collaborators, click the QC report name and select the Share option in the drop-down list.

Besides, you can share your datasets from the Metainfo Editor page with the Share button.


5.13.1 Building pipelines

Bioinformatic data analysis includes several steps, which vary depending on the type of data and your goals. For instance, WGS data analysis includes the following steps: checking the initial quality of raw reads, preprocessing the data to improve the quality if needed, and aligning the reads to a reference genome, followed by identification and annotation of genetic variants.
With Genestack you can either use one of the data flows or build a pipeline manually, selecting from the customizable applications supported by the system.
Use the Data Browser to find a dataset you would like to analyse and click on it. Then, on the Metainfo Editor page, click the Analyse button to start creating a pipeline. If you want to analyse only part of the dataset, select the assays you wish to analyse and click Make a subset.
Next, select the first application you wish to see in your pipeline. For each individual file, the system suggests only applications that can be used to analyse your data, considering its type and metadata.
Applications on the platform are divided into several categories:
• Preprocess, to prepare the data for the actual analysis;
• Analyse, to perform various kinds of analysis;
• Explore, to visualise QC checks or analysis results;
• Manage, to operate on your files.


This will take you to the application page where you can:
• learn more about the application;
• view and edit application parameters;
• explore your results;
• add further steps to the file data flow (the pipeline).


To proceed, click the Add step button, which will show you the list of all matching applications.

Continue adding steps until you have finished building your pipeline. Each step you add creates new files, which end up in the Created files and My datasets folders. However, these files are not yet ready to use: they need to be initialized first.


5.13.2 Reproducing your work

For any dataset in the system, you can learn where the data came from and replay the exact same analysis on other data.
• File Provenance
The File Provenance application allows you to explore the history of your data and learn how a given dataset was generated. Clicking the New folder with files button creates a folder containing all the files used in the pipeline.

You can also view a text description of the pipeline, including all its steps. Click the View as text button to see which applications, parameters and tools were used at each step of the analysis.


• Data Flow Editor


If you want to reuse the same pipeline on different data, you can create a data flow identical to the pipeline used to create the original file, by selecting the file of interest and choosing Create new Data Flow from the available “Manage” applications. This will open the Data Flow Editor application, which gives a visual representation of the pipeline and allows you to choose your input files, such as raw reads and a reference genome. Note that a range of public reference genomes have already been imported from Ensembl and are readily available on the platform. To add new inputs to the created data flow, click choose sources. At this stage, no files have been created or initialized.


Click the Run dataflow button to continue; it will take you to the Data Flow Runner application.
• Data Flow Runner
The Data Flow Runner application allows you to run the pipeline. Click the Run dataflow button to create all the relevant files in an uninitialized state. A separate file is created for each individual input file at every step of the analysis. You can find them in a separate folder inside the “Created files” folder.


When the files are created, you will be prompted to either start initialization right away or delay it until later. You can check and change parameters only before the computation starts: to do so, click the application name in the corresponding node of the data flow. Once the initialization process has started, no further changes to the files are allowed.


Finally, whether you decide to start the computation or not, you will be offered a list of matching applications to explore the results or continue the analysis.


5.13.3 Public data flows

On our platform, you can find a range of public data flows we have prepared for our users. We cover most of the
common analysis types:
• Whole Genome Methylation Analysis
• Whole Exome Sequencing Analysis
• Single-cell Transcriptomic Analysis
• Prediction of Genetic Variants Effects
• Isoform Expression Statistics
• Genetic Variation Analysis
• Gene Expression Statistics
• Targeted Sequencing Quality Control
• Raw Reads Quality Control
• Mapped Reads Quality Control
• Agilent Microarray Quality Control
• Affymetrix Microarray Quality Control
• Unspliced Mapping
• Spliced Mapping
Clicking on the data flow will take you to the Data Flow Runner where you can add source files and a reference
genome. When you have chosen your files, click on the button marked as Run data flow. If you do not want to change
any settings, you can click Start initialization now. To tweak the parameters and settings of the applications, select
Delay initialization till later. To change the settings, click on the name of the application in the data flow. This will
take you to the application page, where you can select Edit parameters and introduce your changes. When you are
happy with parameters, go back to the data flow and start initialization.

5.13.4 Initialising files

You can initialize files in different ways:


1. Using the Start initialization option in the context menu.
For instance, click on the name of the created dataset at the top of the application page and select Start initialization.


2. Clicking Start initialization now in the Data Flow Runner application.


If you want to save the pipeline, with the specific parameters you used, to re-use on other files, you can create a new data flow with the Data Flow Editor app.


To proceed, click on the Run dataflow button to create all the relevant files for each app in the pipeline. This will take you to the Data Flow Runner page, where you can check or change the parameters of the applications by clicking an app’s name, and then initialize the computations with the Run Data Flow button in the last cell.


Choose the Start initialization now option if you would like to run the computations immediately or Delay initial-
ization till later.


This data flow, along with all your results (once the computations are finished), will be stored in the Created files folder.
3. Using the File Initializer application.
Select the data you are interested in, right-click on it, and choose File Initializer in the Manage section.

The File Initializer reports the status of the files and allows you to initialize those that need it by clicking their respective Go! buttons, or Initialize all to initialize them all at once. Files do not need to be produced by the same applications to be initialized together.

4. Using the Start initialization button in the File Provenance.


Alternatively, you can click on the name of the last created file, go to Manage and choose the File Provenance application. The application displays the pipeline and allows you to run the computation using the Start initialization button. Doing this will begin initialization of all the files (including intermediate files) you created while building this pipeline.


Regardless of the way you start initialization, you can track the progress of your tasks in the Task Manager.

5.13.5 Task manager

In the top-right corner of any page on Genestack, you can see a link called Tasks. It will take you to the Task
Manager, an application which allows you to track the progress of your computations.

In addition, tasks can be sorted and filtered by application name, file name, accession, status, the user who started the
task, last update, and elapsed time.
Statuses in the Task Manager help you keep track of your tasks. Let’s look at what each status means:
• Starting — the computation process has started to run;
• Done — the task has finished successfully;
• Failed — the computation has failed. To find out why, click on View logs;
• Queued — the task is pending execution, for example when it is waiting for dependencies to complete
initialization;
• Running — your task is in progress;
• Blocked by dependency failure — the computation cannot be completed because a task on which this one depends
has failed;
• Killed — the task has been canceled by the user.
You can also view the output and error logs produced for each task. Error logs tell you why a task has failed. Output
logs contain the exact details of what Genestack does with your files during the computation process, which specific
tools and parameters are used, and so on. If a computation finishes successfully, its error log will be empty, while the
output log can still provide some basic information about the output data.

If you change your mind about a computation after it has started, remember that you can kill tasks whenever you
want by clicking the Cancel button next to the task status. To rerun an analysis, click the file name and select Restart
initialization.

5.13.6 Data export

Genestack provides secure data storage, and the Export Data application allows you to safely download both assays
and analysis results, together with their attached metadata, to a local machine.
Select the files you are going to export, right-click on them and choose the Export Data application. On the application
page you will see the status of your files, and if some of them are not initialized, you will be prompted to initialize
them prior to export.

If you change your mind, you can stop the export process by clicking the Cancel button.

The application creates a temporary Export file that contains a special link to download the selected files. All the
Export files are stored in the “Exports” folder.

Sharing the link enables your collaborators to download the data even if they do not have a Genestack account. The
created export file may be removed by the platform after some time. This means that the corresponding download link
will no longer be accessible; however, the data itself will not be affected.

CHAPTER 6

Introduction to NGS data analysis

NGS technologies, such as WGS, RNA-Seq, WES, WGBS, ChIP-Seq, generate significant amounts of output data.
Before we start talking about various applications available on Genestack and how to choose appropriate ones for
your analysis, let’s take a moment to go through the basics of sequencing analysis. To help you better understand
the processes involved, we will use the example of genetic variant analysis for WES (Whole Exome Sequencing)
data. A typical WES data analysis pipeline includes raw reads quality control, preprocessing, mapping, post-alignment
processing, variant calling, followed by variant annotation and prioritization (Bao et al., 2010).
The first thing to do with sequencing data is to assess the quality of the raw reads. For example,
you will get a general view of the number and length of reads, and whether your sample contains any contaminating
or low-quality sequences.
After that, you can do some preprocessing procedures to improve the initial quality of your data. For example, if your
sequencing data is contaminated due to the sequencing process, you may choose to trim adaptors and contaminants
from your data. Quality control and preprocessing are essential steps because if you do not make sure your data is of
good quality to begin with, you cannot fully rely on analysis results.
After you have checked the quality of your data and, if necessary, preprocessed it, the next step is mapping (also
called alignment) of your reads to a reference genome or reference transcriptome. This determines the genomic origin
of the data being studied without the need for de novo assembly, because the obtained reads are compared with a reference
that already exists in a database. For example, in our case, aligning WES reads allows you to discover nucleotides that
vary between the reference sequence and the one being tested. The accuracy of the subsequent variant identification depends
on the mapping accuracy (The 1000 Genomes Project Consortium, 2010).
After you have mapped your reads, it is a good idea to check the mapping quality, as some of the biases in the data
only show up after the mapping step.
Similarly to what you have done before with raw sequencing reads, if you are unsatisfied with the mapping quality, you
can process the mapped reads and, for instance, remove duplicated mapped reads (which could be PCR artifacts).
Post-alignment processing is very important, as it can greatly improve the accuracy and quality of further variant
analysis.
Once the sequence is aligned to a reference genome, the data needs to be analyzed in an experiment-specific fashion.
Here we will use the WES reads mapped against the reference genome to perform variant analysis, including variant
calling and predicting the effects that the identified variants produce on known genes (e.g. amino acid changes or frame shifts). In
this step, you compare your sequence with the reference sequence, look at all the differences, and try to establish how
big an influence these changes have on the gene. For instance, a synonymous variant will probably
have little influence on the gene, as the changed codon produces the same amino acid. However, if it is
a large deletion, you can assume that it will have a large effect on the gene's function.
When it comes to visualising your data, the standard tool for viewing mapped reads and identified variants is
the Genome Browser. Since visualisation is one of the concepts at the core of our platform, on Genestack you will
find a range of other useful tools suited to the nature of your data. For example,
for WES or WGS data, we suggest using the Variant Explorer, which lets you sieve through thousands of variants
and focus on the most important findings.

CHAPTER 7

Applications review

Applications available on Genestack are grouped into four categories:


• Preprocess applications cover tasks such as data prefiltering, subsampling or normalisation, which typically
should be performed before getting into the “heavy-lifting” part of data analysis.
• Analyse applications include key analysis steps like sequence alignment, variant calling, expression quantification,
etc.
• Explore contains all interactive graphical interface applications that allow users to view the results of their
computations, such as applications for visualizing QC reports, the Genome Browser, the Variant Explorer, etc.
• Manage contains applications used to manage your data: applications dealing with data flows, file provenance,
export, metadata editing and so on.
An extended version of each application’s description can be found in the “About application” text for that application.
To view this text for a specific application, click on the application’s name at the top-left corner of the page, and in the
drop-down menu select “About application”.

7.1 Sequencing data

7.1.1 Raw reads quality control and preprocessing

Once you have raw sequencing data in Genestack, the next steps are to check the quality of the reads and, if necessary,
improve it. Let’s go through the applications developed to assess data quality and perform preprocessing.

FastQC report

Action: to perform quality control (QC) of raw sequencing reads. According to the “garbage in, garbage out” rule, if
we begin our analysis with poor quality reads, we should not expect great results at the end. This is why QC is the
essential first step of any analysis.
The FastQC Report application is based on the FastQC tool developed by Simon Andrews at the Babraham Institute.
The application generates a separate FastQC Report for each sample from a tested dataset; you can find them in the
folder “Created files”. You can also explore all of them simultaneously with the Multiple QC Report app. To do so, go
to the “My datasets” folder, select the created “FastQC Report” dataset and open it with Multiple QC Report using the
context menu.
The FastQC Report contains various graphs that visualize the quality of your data. We will go through all of them one by
one and tell you:
1. How they should look for good-quality data;
2. How they may look if there is something wrong with your data;
3. What you can do if the quality is unsatisfactory.
The metrics table gives you quick indicators as to the status of each of the quality metrics calculated.

1. Basic statistics

Information on type and number of reads, GC content, and total sequence length.

2. Sequence length distribution

Reports lengths of all sequences.


Warning: this report will get warnings if the lengths are not identical, but this can usually be ignored, as it is expected
for some sequencing platforms.

3. Per sequence GC content

For data of good quality, the graph will show a normal, bell-shaped distribution.
Warning: when the sum of the deviations from the normal distribution represents more than 15% of the reads, the
report will raise a warning.
Warnings are usually caused by the presence of contaminants. Sharp peaks may represent a very specific
contaminant (e.g. an adaptor), while broader peaks may indicate contamination with a range of contaminants.
Improving data quality: run Trim Adaptors and Contaminants preprocessing application.

4. Per base sequence quality

For data of good quality, the median quality score per base (Phred) should not drop below 20.
Failure: the report will get failures if the lower quartile for quality at any base position is less than 5 or if the median
for any base is less than 20.
Improving data quality: if the quality of the library falls to a low level over the course of a read, the standard solution
is to trim low-quality bases or discard low-quality reads. This can be done using the Trim
Low Quality Bases or Filter by Quality Scores applications, respectively.

5. Per sequence quality scores

Ideally, we expect to see a sharp peak at the very end of the graph (meaning most frequently observed mean quality
scores are above 27).
Warning: the report will raise a warning when the peak is shifted to the left, which means the most frequently observed
mean quality is below 27. This corresponds to a 0.2% error rate.
Improving data quality: perform quality-based trimming or selection using Trim Low Quality Bases or Filter by
Quality Scores applications respectively.

6. Per base sequence content

Ideally, in a random library we would see four parallel lines representing the relative base composition. Fluctuations at
the beginning of reads in the tested sample may be caused by adapter sequences or other contaminations of the library.
A bias at the beginning of the reads is common for RNA-seq data. This occurs during RNA-seq library preparation
when “random” primers are annealed to the start of sequences. These primers are not truly random, and it leads to a
variation at the beginning of the reads.
Warning: a warning will be raised if the difference between A and T, or G and C is greater than 10% at any position.
Improving data quality: if there is instability at the start of the read, the consensus is that no action is necessary. If
variation appears over the course of a read, the Trim Reads to Fixed Length application may be used. If there is persistent
variation throughout the read it may be best to discard it. Some datasets may trigger a warning due to the nature of the
sequence. For example, bisulfite sequencing data will have almost no Cytosines. Some species may be unusually GC
rich or poor and therefore also trigger a warning.

7. Sequence duplication levels

Reports total number of reads, number of distinct reads and mean duplication rates.
Warning: this module will issue a warning if non-unique sequences make up more than 20% of the total.
There are two potential types of duplicates in a library: technical duplicates arising from PCR artifacts, and biological
duplicates, which are natural collisions where different copies of exactly the same sequence are randomly selected.
At the sequence level there is no way to distinguish between these two types, and both will be reported as duplicates
here.
Improving data quality: if the observed duplications are due to primer/adaptor contamination, they can be removed
using the Trim Adaptors and Contaminants application. The Filter Duplicated Reads application can also be used for
DNA sequencing data, but it will distort expression data.

8. Overrepresented sequences

Shows the highly overrepresented sequences (more than 0.1% of total sequence) in the sample.
Warning: if any sequence is found to represent more than 0.1% of the total, a warning will be raised.
There are several possible sources of overrepresented sequences:
• technical biases (one region was sequenced several times; PCR amplification biases);
• a feature of library preparation (e.g. for targeted sequencing);
• natural reasons (RNA-seq libraries can naturally present high duplication rates).
Overrepresented sequences should only worry you if you think they are present due to technical biases.
Improving data quality: procedures and caveats for improving data quality are the same as for sequence duplication
level.
You can explore all the generated FastQC reports at the same time, on one page, with the Multiple QC Report application.
All the FastQC reports are kept together in a “FastQC Report” dataset in the “My Datasets” folder.

Multiple QC report

Action: to display metrics from multiple reports at once. It accepts as input a dataset of QC reports.
Select from a range of QC keys to display on the plot, e.g. Total nucleotide count (mate 1 and 2) or Number of reads
(mate 1 and 2):

You can select which metainfo to display in the plot labels:

Also, samples in the Multiple QC Report can be sorted by metainfo key or specified QC metric.

Finally, you can highlight the interesting reports and put them in a separate folder (New folder with selection button).

When the quality of the raw reads is unsatisfactory, several preprocessing applications available on the platform
can improve it. Here we will walk you through each one and give you a checklist to
use when deciding which to select. After each preprocessing step, you can use the FastQC Report application
again to compare the quality pre- and post-processing (remember that to do this, you need to run a separate
computation, this time using the processed data as source files for the data flow).

Subsample reads

Action: to create a random subset of raw reads.


Command line options:
1. The Random seed value will let you create different subsets with the same number of reads. (default: 100)
2. The Number of reads in subset option tells the application how many reads the output subsample should contain.
(default: 50,000)
Using the same seed and the same number of reads will result in identical subsets.
This application is based on Seqtk.
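As a rough illustration of why the same seed and read count always yield an identical subset, here is a minimal Python sketch (the function name and the plain-string read representation are invented for this example; the real application wraps Seqtk):

```python
import random

def subsample_reads(reads, n, seed=100):
    """Return a reproducible random subset of n reads (illustrative sketch only)."""
    rng = random.Random(seed)      # a fixed seed makes the choice deterministic
    if n >= len(reads):
        return list(reads)
    return rng.sample(reads, n)    # sample without replacement

reads = [f"read_{i}" for i in range(1000)]
a = subsample_reads(reads, 50, seed=100)
b = subsample_reads(reads, 50, seed=100)
assert a == b                      # same seed + same count -> identical subset
```

Changing either the seed or the requested number of reads generally produces a different subset.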

Filter duplicated reads

Action: to discard duplicated sequenced fragments from raw reads data. If a sequence of two paired reads or a single
read occurs multiple times in a library, the output will include only one copy of that sequence.

The Phred quality scores of the kept copy are created by keeping the highest score across all identical reads at each position.
Use this application if you suspect contamination with primers or some other repetitive sequence; this should be
evident from the “Sequence duplication levels” and “Overrepresented sequences” modules of the FastQC report. Keep
in mind that this application should not be used with RNA-seq data, as it will remove genuine differences in expression level.
This tool is based on Tally.
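The quality-merging rule can be sketched in a few lines of Python. Reads are modelled as hypothetical `(sequence, quality)` tuples with Phred+33 scores, whose ASCII order matches their numeric order; the real application is based on Tally, not this code:

```python
def filter_duplicated_reads(reads):
    """Collapse identical sequences, keeping the per-position maximum quality."""
    best = {}
    for seq, qual in reads:
        if seq not in best:
            best[seq] = qual
        else:
            # Phred+33 characters sort in the same order as their scores,
            # so a character-wise max keeps the highest score per position.
            best[seq] = "".join(max(a, b) for a, b in zip(best[seq], qual))
    return list(best.items())

reads = [("ACGT", "IIII"), ("ACGT", "!!J!"), ("TTTT", "FFFF")]
print(filter_duplicated_reads(reads))   # [('ACGT', 'IIJI'), ('TTTT', 'FFFF')]
```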

Filter by quality scores

Action: to discard reads from a Raw Reads file based on Phred+33 quality scores. The application classifies each read
as pass or fail by calculating the quality score distribution of the read.
Command line options:
1. Minimum quality score (Phred+33 range, 0–41) is the quality cutoff value. A score of 20 means that there is
a 1% chance that the corresponding base was called incorrectly by the sequencer. A score of 30 means a 0.1%
chance of an incorrect base call. (default: 20)
2. Percentage of bases to be above the minimum quality score is the percentage of nucleotides in a read that must
have a quality equal to or higher than the chosen minimum quality score. 100% requires all bases in the read to
meet the cutoff; 50% requires the median base quality to be at least the cutoff value. (default: 80)
Let’s take an example to understand how the application works. Here is our read:

The second line represents the nucleotide sequence (10 bases in this case). The fourth line contains quality scores for
each nucleotide in the read.
• If the “Minimum quality score” is equal to 30 and the “Percentage of bases” is equal to 50, this read will not be
discarded, because the median quality of the read is higher than 30.
• If the “Minimum quality score” is equal to 20 and the “Percentage of bases” is equal to 100, the read will be
discarded, because not all bases have quality equal to or higher than 20.
This tool is based on fastq_quality_filter, which is part of the FASTX-Toolkit.
This application is best used when some reads are of low quality but others are of high quality. You should be able to
tell whether this is the case from the shape of the “Per sequence quality scores” plot in the FastQC report. It may
also be worth trying this application if the per base sequence quality is low.
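The pass/fail rule can be expressed compactly. This sketch follows the description above (decode Phred+33, then count bases at or above the cutoff); it is not the actual fastq_quality_filter source, and the function name is invented:

```python
def passes_quality_filter(qual, min_q=20, min_percent=80):
    """Pass a read if at least min_percent of its bases are at or above min_q."""
    scores = [ord(c) - 33 for c in qual]          # decode Phred+33 quality string
    good = sum(1 for s in scores if s >= min_q)   # bases meeting the cutoff
    return good * 100 >= min_percent * len(scores)

# 'I' encodes Q40, '!' encodes Q0:
assert passes_quality_filter("IIIIIIII!!")        # 8 of 10 bases >= Q20: pass
assert not passes_quality_filter("IIIII!!!!!")    # only 5 of 10 bases >= Q20: fail
```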

Trim adaptors and contaminants

Action: to find and trim adaptors and known contaminating sequences from raw reads data.
The application uses an internal list of sequences that can be considered as contaminants. This list is based on the
possible primers and adaptors which the most popular sequencing technologies and platforms use. For instance, it
contains widely used PCR primers and adaptors for Illumina, ABI etc. (see the list of primers and adaptors we
remove).

The occurrence threshold before adaptor clipping is set to 0.0001. It refers to the minimum frequency at which an adaptor
needs to be found before clipping is considered necessary.
Command line options:
Minimum length of the trimmed sequence (bp). The application will discard trimmed reads of length below this
number. (default: 15)
This application is based on fastq-mcf, one of the EA-Utils utilities.
The application is best used when you have irregularities in GC content, in base content at the start of reads, or
duplicated reads. Since this application relies on sequence matching, it should be run first if used in conjunction with
other preprocessing applications.
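A heavily simplified sketch of adaptor clipping (exact-match only; the real fastq-mcf additionally handles partial matches at read ends and mismatches, and uses the occurrence threshold mentioned above). The adaptor shown is a common Illumina adaptor prefix; the function name and read strings are invented for the example:

```python
def trim_adaptor(read, adaptor, min_length=15):
    """Clip the adaptor and everything 3' of it; drop reads that become too short."""
    pos = read.find(adaptor)
    if pos != -1:
        read = read[:pos]                    # clip adaptor and the 3' tail
    return read if len(read) >= min_length else None   # None -> read discarded

insert = "ACGTACGTACGTACGTACGT"              # hypothetical 20 bp insert
adaptor = "AGATCGGAAGAGC"                    # common Illumina adaptor prefix
assert trim_adaptor(insert + adaptor, adaptor) == insert
assert trim_adaptor("ACGT" + adaptor, adaptor) is None  # < 15 bp after clipping
```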

Trim low quality bases

Action: to isolate high-quality regions from raw reads.


The Trim Low Quality Bases application is based on the Phred algorithm: it finds the longest subsequence of each read
in which the estimated error rate is below the error threshold (0.01 by default).
To understand how the application works, let’s take an example. Imagine we have a sequence:

The application will find the fragment of the read where the sum of all error probabilities is no more than 0.01 (in
our case). Here the best fragment is “TAGA” (0.001 × 2 + 0.0001 × 2 = 0.0022), and it becomes the output read.
Every other longer fragment has a sum of error probabilities above the 0.01 cutoff, so it is discarded.
This tool is based on Seqtk and uses the Phred algorithm to pick out the region of highest quality.
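The fragment-selection rule can be reproduced with a naive O(n²) search. This sketch follows the description above rather than Seqtk's actual, more efficient implementation; with Q30 (`?`, error 0.001) and Q40 (`I`, error 0.0001) qualities it recovers the “TAGA” example:

```python
def trim_low_quality(seq, qual, max_error=0.01):
    """Return the longest fragment whose summed error probabilities stay <= max_error."""
    errors = [10 ** (-(ord(c) - 33) / 10) for c in qual]   # Phred+33 -> P(error)
    best = (0, 0)                                          # (start, end) of best fragment
    for i in range(len(seq)):
        total = 0.0
        for j in range(i, len(seq)):
            total += errors[j]
            if total > max_error:
                break
            if j + 1 - i > best[1] - best[0]:
                best = (i, j + 1)
    return seq[best[0]:best[1]]

# '!' = Q0 (error 1.0), '?' = Q30 (0.001), 'I' = Q40 (0.0001)
print(trim_low_quality("GTAGAC", "!??II!"))   # TAGA (0.001*2 + 0.0001*2 = 0.0022)
```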

Trim reads to fixed length

Action: to trim a specific number of bases from the extremities of all reads in a sample.
Command line options:
1. The Keep bases from position option asks you to specify the first base that should be kept. (default: 1)
2. Keep bases to position (set to zero for entire read). Indicate the position of the last nucleotide that should be
kept in the read. (default: 0)
For example, if you set 5 as the first base to keep and 30 as the last base to keep, it means that the application trims all
nucleotides before the 5th position, and all nucleotides after the 30th base.
This tool is based on the fastx_trimmer, which is part of the FASTX-Toolkit.
The Trim Reads to Fixed Length application is helpful when you want to obtain reads of a specific length (regardless of
the quality).
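The trimming itself is just a slice with 1-based, inclusive coordinates. A sketch of the option semantics described above (the function name is illustrative; the real application is fastx_trimmer):

```python
def trim_to_fixed_length(read, keep_from=1, keep_to=0):
    """Keep bases keep_from..keep_to (1-based, inclusive); keep_to=0 means read end."""
    end = len(read) if keep_to == 0 else keep_to
    return read[keep_from - 1:end]

# Keep bases 5..30: everything before position 5 and after position 30 is trimmed.
read = "A" * 4 + "C" * 26 + "G" * 10
assert trim_to_fixed_length(read, 5, 30) == "C" * 26
assert trim_to_fixed_length(read) == read          # defaults keep the whole read
```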

7.1.2 Mapped reads quality control and preprocessing

If you are analysing mapped reads, we recommend checking for any biases introduced during the mapping
process (e.g. low coverage, experimental artifacts) and preprocessing the mapped reads accordingly.

Mapped reads QC report

Action: to perform quality control (QC) of mapped reads.


We follow a similar procedure to the one used to generate FastQC reports. After selecting the mapped reads we wish
to check the quality of, we can run the Mapped Reads QC public data flow.
An individual Mapped Reads QC report contains some technical information about source data, tools used and data
flow.
It also includes a range of mapping statistics. For single-end reads, the following QC metrics are calculated:
1. Total number of reads: how many reads were used to map to the reference genome;
2. Unmapped reads: total number of reads which failed to map to the reference genome;
3. Mapped reads: total number of reads aligned to the reference genome;
4. Uniquely mapped reads: total number of reads aligned exactly 1 time to the reference genome;
5. Multi-hit mapped reads: total number of reads aligned more than 1 time to the reference genome.
For paired-end data, you will see the following statistics instead:
1. Total number of mate pairs: how many paired-end reads were used to map to the reference genome;
2. Mapped mate pairs: total number of paired reads where both mates were mapped;
3. Partially mapped mate pairs: total number of paired reads where only one mate in the pair was mapped;
4. Unmapped mate pairs: total number of paired reads which failed to map to the reference genome;
5. Improperly mapped mate pairs: total number of paired reads where one of the mates was mapped with an
unexpected orientation;
6. Properly mapped mate pairs: total number of paired reads where both mates were mapped with the expected
orientation.
Coverage by chromosome plot is reported for both read types.

This plot shows, for each chromosome, the percentage of bases covered by at least x reads. To clarify, imagine a
plot showing coverage for only one chromosome, i.e. a single line. If a point has x = 100 reads and y = 10%, then 10%
of that chromosome's bases are covered by at least 100 reads.
The amount of coverage you expect varies with the experimental technique. Normally you want similar coverage
patterns across all chromosomes, but this may not be the case if, for example, you are dealing with advanced-stage
cancer samples.
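In other words, each point of the curve is the fraction of a chromosome's bases whose read depth is at least x. A toy Python sketch of that computation (the function name is invented; in practice per-base depths come from the mapped reads themselves):

```python
def coverage_curve(depths, max_depth):
    """Fraction of positions covered by at least x reads, for x = 1..max_depth."""
    n = len(depths)
    return {x: sum(1 for d in depths if d >= x) / n for x in range(1, max_depth + 1)}

depths = [0, 2, 2, 3, 5]          # per-base read depth of a toy 5-base "chromosome"
print(coverage_curve(depths, 3))  # {1: 0.8, 2: 0.8, 3: 0.4}
```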
Insert Size statistics will be calculated for paired-end reads only.

Note: What is the difference between fragment size, insert size and mate inner distance?
Mate inner distance is the length between the two sequence reads. Insert size is normally the distance between paired-
end adaptors (paired-end reads + mate inner distance). Fragment size is the insert plus both adaptors.

Insert size statistics are useful to validate library construction and include:
1. Median insert size - the middle value of a sorted list of insert sizes;
2. Median absolute deviation - the median of the absolute deviations from the median insert size;
3. Mean insert size (trimmed) - the average of the insert sizes;
4. Standard deviation of insert size - the variation of insert sizes around the mean.
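These four statistics are standard summary measures; they can be sketched with Python's statistics module (the report itself is produced by Picard, not by this code, and the function name is illustrative):

```python
import statistics

def insert_size_stats(sizes):
    """Median, median absolute deviation, mean and standard deviation of insert sizes."""
    med = statistics.median(sizes)
    mad = statistics.median(abs(s - med) for s in sizes)   # deviations from the median
    return {
        "median": med,
        "median_absolute_deviation": mad,
        "mean": statistics.mean(sizes),
        "stdev": statistics.stdev(sizes),
    }

stats = insert_size_stats([290, 300, 300, 310, 400])
assert stats["median"] == 300                    # middle of the sorted list
assert stats["median_absolute_deviation"] == 10  # median of |size - 300|
```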

Insert size distribution graph is displayed for paired-end reads:

This graph shows the distribution of insert sizes.


Of course, the expected values of these metrics vary depending on the type of library preparation used, owing to
technical differences between paired-end libraries and mate-pair libraries.
The Mapped Reads QC Report application is based on BEDtools and the Picard tools.
You can analyse the output for several Mapped Reads QC reports at once using our Multiple QC Report application.

This is helpful because it allows you to compare how many reads in each sample are unmapped, partially mapped or
improperly mapped.

Targeted sequencing QC report

This application is good to use when analysing Targeted Sequencing data, e.g. Whole Exome Sequencing assays.
Action: to assess whether the target capture has been successful, i.e. if most of the reads actually fell on the target, if
the targeted bases reached sufficient coverage, etc.
Options:
1. The Compute enrichment statistics based on option lets you compute enrichment statistics for reads mapped
only on the exome, only on a target file, or on both the exome and a target file. (default: Exome)
The following enrichment statistics are computed:
• Number and proportion of mapped reads on target;
• Mean coverage on target with at least 2X coverage;
• Target bases with at least 2X, 10X, 20X, 30X, 40X, and 50X coverage.
You can generate these reports directly by selecting Mapped Reads datasets, right-clicking on them and choosing the
appropriate application (in the “Explore” section), or by using the Targeted Sequencing Quality Control public data flow.
You can analyse the output for multiple reports at once using the Multiple QC Report application.

This application is based on BEDtools, Picard and SAMtools.


Apart from quality control applications, Genestack offers a number of applications to preprocess mapped reads.

Mark duplicated mapped reads

Duplicated reads are reads of identical sequence composition and length, mapped to the same genomic position.
Marking duplicated reads can help speed up processing for specific applications, e.g. the variant calling step, where
processing additional identical reads would let early PCR amplification effects (jackpotting) contribute noise to
the signal. You can read more about duplicated mapped reads in this excellent SeqAnswers thread.
Action: to go through all reads in a Mapped Reads file, marking as “duplicates” paired or single reads where the
orientation and the 5’ mapping coordinate are the same.
3’ coordinates are not considered due to two reasons:
1. The quality of bases generated by sequencers tends to drop toward the 3’ end of a read, so its alignment
is less reliable compared to the 5’ bases.
2. If reads are trimmed at 3’ low-quality bases before alignment, they will have different read lengths resulting in
different 3’ mapping coordinates.
The application also takes interchromosomal read pairs into account: when the distance between two mapped mates
differs from the internally estimated fragment length, including mates mapping to different chromosomes, the
application will not mark them, but it will not fail due to its inability to find the mate pair for the reads.
This tool is based on MarkDuplicates, part of the Picard toolkit.
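The grouping key described above (chromosome, 5' mapping coordinate, orientation) can be sketched as follows. The read representation and field names are invented for the example, and Picard's real logic additionally chooses the “best” copy of each group by quality:

```python
def mark_duplicates(mapped_reads):
    """Mark every read after the first that shares a (chrom, 5' coord, strand) key."""
    seen = set()
    for read in mapped_reads:
        key = (read["chrom"], read["five_prime"], read["strand"])
        read["duplicate"] = key in seen     # the 3' coordinate is deliberately ignored
        seen.add(key)
    return mapped_reads

reads = [
    {"chrom": "chr1", "five_prime": 100, "strand": "+"},
    {"chrom": "chr1", "five_prime": 100, "strand": "+"},  # same key -> duplicate
    {"chrom": "chr1", "five_prime": 100, "strand": "-"},  # other strand -> kept
]
assert [r["duplicate"] for r in mark_duplicates(reads)] == [False, True, False]
```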

Remove duplicated mapped reads

The point of removing duplicated mapped reads is to try to limit the influence of early PCR selection (jackpotting).
Whether or not you should remove duplicate mapped reads depends on the type of data you have. If you are dealing
with whole-genome sequencing data where expected coverage is low and sequences are expected to be present in
similar amounts, removing duplicated reads will reduce processing time and have little deleterious effect on analysis.

If however you are processing RNA-seq data, where the fold-variation in expression can be up to 10^7, reads are
relatively short, and your main point of interest is the variation in expression levels, this probably is not the tool for
you. You can read more about duplicated mapped reads in this excellent SeqAnswers thread.
Action: to go through all reads in a Mapped Reads file, marking as “duplicates” paired or single reads where the
orientation and the 5’ mapping coordinate are the same and discarding all except the “best” copy.
3’ coordinates are not considered due to two reasons:
1. The quality of bases generated by sequencers tends to drop toward the 3’ end of a read, so its alignment
is less reliable compared to the 5’ bases.
2. If reads are trimmed at 3’ low-quality bases before alignment, they will have different read lengths resulting in
different 3’ mapping coordinates.
The application also takes interchromosomal read pairs into account: when the distance between two mapped mates
differs from the internally estimated fragment length, including mates mapping to different chromosomes, the
application cannot identify them, but it will not fail due to its inability to find the mate pair for the reads.
This application is based on MarkDuplicates, part of the Picard toolkit.

Subsample reads

You can use this application, for example, if you want to take a look at what your final experimental results will look
like, but do not want to spend time processing all your data right away.
Action: to create a random subset of mapped reads.
Command line options
1. The Subsampling ratio (percentage) option is used to set a fraction of mapped reads you would like to extract
(default: 50).
2. The Random seed option will let you produce different subsets with the same number of mapped reads. (default:
0)
Using the same random seed and the same subsampling ratio will result in identical subsets.
This application is based on SAMtools.

Merge mapped reads

The application is useful when you have multiple replicates of the same experiment and want to combine them before
producing your final result.
Action: to merge multiple Mapped Reads files, producing one single output Mapped Reads file.
The application is based on SAMtools.

Convert to unaligned reads

This application is useful when you are interested in the fraction of reads that map exactly to the genome, or when you would like to remap the reads with another aligner.
Action: to convert a Mapped Reads file to raw reads.
This application is based on the Picard toolkit.

226 Chapter 7. Applications review


Genestack User Tutorials, Release 1.0

7.1.3 Variants preprocessing

While analysing variants, you can also preprocess them. Select a Genetic Variations file and open the “Preprocess” section to see which applications are available.

Merge variants

Merging variants can be useful when, for example, you have one Genetic Variations file for SNPs and another one for indels. After merging, the resulting Genetic Variations file will contain both the SNP and the indel records.
Action: to merge two or more Genetic Variations files into a single file.
Make sure that the same reference genome is specified in the metainfo of the selected Genetic Variations files.
This application is based on BCFtools.

Concatenate variants

Concatenation is appropriate if, for example, you have separate Genetic Variations files for each chromosome and simply want to join them “end-to-end” into a single Genetic Variations file.
Action: to join two or more Genetic Variations files by concatenating them into a larger, single file.
Make sure that the same reference genome is specified in the metainfo of the Genetic Variations files you wish to concatenate.
The application always allows overlaps: the first position of the second input file may come before the last position of the first input file.
Command line options:
1. The Remove duplicated variants option checks for duplicated variants and removes redundant records. (default: unchecked)
The application is based on BCFtools.
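A minimal Python sketch of this end-to-end joining with optional duplicate removal (toy tuples stand in for VCF records; this illustrates the behaviour described above, not how BCFtools is implemented):

```python
# Toy variant records as (chromosome, position, ref, alt) tuples.
chr1_part = [("chr1", 100, "A", "G"), ("chr1", 250, "T", "C")]
chr1_more = [("chr1", 250, "T", "C"), ("chr1", 240, "G", "A")]  # overlaps the first part

def concatenate(parts, remove_duplicates=False):
    """Join variant lists end-to-end; optionally drop exact duplicate records,
    mimicking the 'Remove duplicated variants' option."""
    result, seen = [], set()
    for part in parts:
        for record in part:
            if remove_duplicates:
                if record in seen:
                    continue
                seen.add(record)
            result.append(record)
    return result

merged = concatenate([chr1_part, chr1_more], remove_duplicates=True)
# Position 240 follows position 250 in the output: overlaps are allowed.
```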

7.1.4 RNA-seq data analysis

Mapping (also called alignment) refers to the process of aligning sequencing reads to a reference sequence, whether
the reference is a complete genome, transcriptome, or de novo assembly.

Note: What is the difference between genome, exome and transcriptome?


Genome includes both coding (genes) and noncoding DNA in a given cell type.
Exome is the part of the genome formed by exons, i.e. it includes all DNA that is transcribed into mRNA.
Transcriptome is the collection of all mRNAs present in a given cell type. In comparison to the genome, the transcriptome is dynamic in time (within the same cell type) in response to both internal and external stimuli. Thus, the transcriptome derived from any one cell type will not represent the entire exome: all cells may have essentially the same genome/exome, but not all genes are expressed in a specific cell type.

There are at least two types of mapping strategies — spliced mapping and unspliced mapping. In the case of RNA-seq data, reads are derived from mature mRNA, so there are typically no introns in the sequence. A read may therefore span two exons that, in the reference genome, are separated by an intron.

7.1. Sequencing data 227



Note: What is the difference between exons and introns?


Exons and introns are both parts of genes. However, exons code for proteins, whereas introns do not. In RNA splicing, introns are removed and exons are joined to produce mature messenger RNA (mRNA), which is further used to synthesize proteins.

In this case, an unspliced mapper would find a matching sequence for only one of the exons, while the rest of the read would not match the intron in the reference, so the read cannot be properly aligned. When analysing RNA-seq data with an unspliced aligner, reads may be mapped to potentially novel exons, but reads spanning splice junctions are likely to remain unmapped.
In contrast, spliced mappers do not try to align RNA-seq reads to introns: they identify possible downstream exons and align to those instead, ignoring introns altogether. Taking this into account, we recommend using spliced mapping applications to analyse RNA-seq data.
On Genestack, you will find two spliced aligners - “Spliced mapping with Tophat2” and “Spliced mapping to transcriptome with STAR”.

Spliced mapping with Tophat2

Action: to map raw reads with transcriptomic data like RNA-seq to a reference genome, taking or not taking into
account splice junctions.

Note: What is a splice junction?


Splice junctions are exon-intron boundaries, at which RNA splicing takes place. For example, to cut out an intron (between two exons) you need to splice in two places so that the two exons can be joined.

Details on various settings:


1. Strand-specificity protocol. If you are using strand-specific RNA-seq data, this option will let you choose
between the “dUTP” and “ligation” method. If you are not sure whether your RNA-seq data is strand-specific or
not, you can try using Subsample Reads application to make a small subsample, map it with Spliced Mapping
with Tophat2 and check the coverage in Genome Browser for genes on both strands. (default: None)
2. Rule for mapping over known annotations. This option allows you to use annotated transcripts from the ref-
erence genome to distinguish between novel and known junctions (“Yes, and discover novel splice junctions”).
Also, you can restrict mappings only across known junctions (“Yes, without novel splice junctions discovery”)
or infer splice junctions without any reference annotation (“Do not use known annotations”). (default: “Yes,
and discover novel splice junctions”)
3. Rule for filtering multiple mappings. If you set “Unique mappings only”, the application will report only
unique hits for one mappable read. If you are interested in reads mapped to multiple positions in the genome,
choose “Multiple mappings only”. Select “None”, if you would like to get both unique and multiple mappings.
(default: None)
4. The Number of best mappings to report option lets you increase the number of reported mappings. This can
be used together with “Rule for filtering mappings” to choose whether to keep reads mapping to uniquely or to
multiple positions, e.g. report up to 5 possible mappings, and only for multi-hit reads. (default: 1)
5. The Number of allowed mismatches option lets you set the maximum number of allowed mismatches per read.
(default: 2)
6. The Disallow unique mappings of one mate option allows you to discard pairs of reads where one mate maps
uniquely and the other to multiple positions. (default: unchecked)


7. The Disallow discordant mappings option will discard all mappings where the two mates map uniquely but with unexpected orientation, or where the distance between the two mapped mates differs from an internally estimated fragment length, including mates mapping to different chromosomes. (default: unchecked)
The application is based on the Tophat2 aligner and used in the Testing Differential Gene Expression tutorial.

Spliced mapping to transcriptome with STAR

Action: to perform gapped read alignment of transcriptomic data to a Reference Genome taking into account splice
junctions.
In comparison to Tophat2, STAR is fast while remaining accurate and precise. Moreover, in contrast to all our other mappers, it maps reads onto the reference transcriptome, not the genome. Another advantage of the application is that it can analyse both short and long reads, making it compatible with various sequencing platforms. What’s more, this spliced mapper supports a two-pass alignment strategy: it runs a second alignment pass to align reads across the splice junctions found in the first pass, which improves quantification of novel splice junctions. Taking all these features into account, the “Spliced Mapping to Transcriptome with STAR” application can be a very good alternative to other RNA-seq aligners.
Now, let’s look through the application parameters:
1. The Enable two pass mapping mode option is recommended for sensitive novel junction discovery. The idea is to collect the junctions found in the first pass and use them as “annotated” junctions for the second-pass mapping. (default: unchecked)
2. Maximum number of multiple alignments allowed for a read: if exceeded, the read is considered unmapped. This option allows you to set how many mappings you expect for one read if it maps to multiple positions in the genome. (default: 10)
3. The Minimum overhang for unannotated junctions option prohibits alignments with very small splice overhangs for unannotated junctions (an overhang is the piece of the read which is spliced apart). (default: 5)
4. The Minimum overhang for annotated junctions option does the same job as “Minimum overhang for unannotated junctions” but for annotated junctions. (default: 3)
5. The Maximum number of mismatches per pair parameter sets how many mismatches you allow per pair.
(default: 10)
6. Minimum intron length is the minimum intron size for spliced alignments. Read the paper in case you are not sure about the value. (default: 21)
7. Maximum intron length is the maximum intron size you consider for spliced alignments. For example, set 1,000 and the application will take into account introns of at most 1,000 bp in size. Note that the default 0 here means a maximum intron size of about 590,000 bp. If you are not sure about the intron size value, the paper may help you make a decision. (default: 0)
8. Maximum genomic distance between mates is the maximum gap between reads from a pair when mapped to the genome. If the reads map to the genome farther apart than this, the fragment is considered to be chimeric. (default: 0)
The application is based on the STAR aligner.

Gene quantification with RSEM

Action: to use STAR mapper to align reads against reference transcripts and apply the Expectation-Maximization
algorithm to estimate gene and isoform expression levels from RNA-seq data.
Command line options:
1. The RNA-seq protocol used to generate the reads is strand specific. If the reads are strand-specific, check
this option. (default: unchecked)


2. Estimated average fragment length (for single-end reads only) option. It is important to know the fragment length distribution to accurately estimate expression levels for single-end data. Typical Illumina libraries produce fragment lengths ranging from 180 to 200 bp. For paired-end reads, the average fragment length can be directly estimated from the reads. (default: 190)
3. Estimated standard deviation of fragment length (for single-end reads only) option. If you do not know the standard deviation of the fragment library, you can probably assume that the standard deviation is 10% of the average fragment length. For paired-end reads, this value will be estimated from the input data. (default: 20)
When the task is complete, click View report in Explore section to get gene and isoform level expression estimates.

The output report represents a table with the following main columns:
• transcript_id — name of the transcript;
• gene_id — name of the gene which the transcript belongs to. If no gene information is provided, gene_id and transcript_id are the same;
• length — transcript’s sequence length (poly(A) tail is not counted);
• effective_length — counts only the positions that can generate a valid fragment. If no poly(A) tail is added, the effective length is equal to transcript length - mean fragment length + 1. If a transcript’s effective length is less than 1, both its effective length and abundance estimates are set to 0;
• expected_count — the sum over all reads of the posterior probability that the read comes from this transcript;
• TPM — transcripts per million normalized by total transcript count in addition to average transcript length;
• FPKM — fragments per kilobase of exon per million fragments mapped;
• IsoPct — the percentage of the transcript’s abundance over its parent gene’s abundance. If the parent gene has
only one isoform or the gene information is not provided, this field will be set to 100.
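The relationship between these columns can be made concrete with a hedged Python sketch (toy transcript lengths and counts; the 190 bp mean fragment length mirrors the application's default, and the exact normalization is a simplification of RSEM's definitions):

```python
# Toy input: transcript_id -> (length, expected_count). Mean fragment length
# is assumed to be 190 bp, the application's default estimate.
mean_frag_len = 190
transcripts = {
    "tx1": (1000, 300.0),
    "tx2": (2000, 700.0),
}

def effective_length(length, mean_frag_len):
    # effective length = transcript length - mean fragment length + 1 (floored at 0)
    return max(length - mean_frag_len + 1, 0)

def tpm_and_fpkm(transcripts, mean_frag_len):
    eff = {t: effective_length(L, mean_frag_len) for t, (L, _) in transcripts.items()}
    total_count = sum(c for _, c in transcripts.values())
    # FPKM: fragments per kilobase of (effective) transcript per million mapped fragments.
    fpkm = {t: transcripts[t][1] / (eff[t] / 1e3) / (total_count / 1e6) for t in transcripts}
    # TPM: per-base read rate rescaled so that all TPM values sum to one million.
    rate = {t: transcripts[t][1] / eff[t] for t in transcripts}
    total_rate = sum(rate.values())
    tpm = {t: rate[t] / total_rate * 1e6 for t in transcripts}
    return fpkm, tpm

fpkm, tpm = tpm_and_fpkm(transcripts, mean_frag_len)
```

Because TPM is rescaled to a fixed total, TPM values are directly comparable across samples in a way FPKM values are not.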
The application is based on the RSEM program and the STAR mapper.

Quantify raw coverage in genes

Action: to compute gene counts from mapped reads. The application takes as input a mapped reads file, and uses a
reference genome to produce a mapped reads counts file, indicating how many reads overlap each gene specified in
the genome’s annotation.


Let’s go through the application parameters:


1. Feature type option. Depending on your task, you should specify the feature type for which overlaps are counted, choosing from “exon”, “CDS” (coding DNA sequence), “3’UTR” (the 3’ untranslated region) or “5’UTR” (the 5’ untranslated region). For example, you may consider each exon as a feature in order to check for alternative splicing. By default, the “gene-id” will be used as a feature identifier. (default: exon)
2. The Rule for overlaps option dictates how mapped reads that overlap genomic features will be treated. There are three overlap resolution modes: union, strict-intersection, and non-empty intersection. (default: union)
The recommended mode is “union”: it counts a read (or read pair) whenever it at least partly overlaps the feature. The “strict-intersection” mode requires the whole read to lie within the feature. If you are interested in counting reads that fully or partly intersect the feature, use the last mode. Note that a read is counted for a feature only if it overlaps precisely one feature; if the read overlaps more than one feature, it will not be counted.

3. Strand-specific reads. The application takes into account the direction of the read and the reference, so that a read from the wrong direction, even if it is mapped to the right place, will not be counted. This option is useful if your data is strand-specific and you want to count reads overlapping a feature according to whether they are mapped to the same or the opposite strand as the feature. Choose “Yes” if the reads were mapped to the same strand as the feature and “Reverse” if the reads were mapped to the opposite strand. Specify “No” if you do not consider strand-specificity. (default: Yes)
This application is based on the HTSeq tool and is used in the Differential Gene Expression Analysis pipeline. After calculating read abundance on the gene level, you will be able to run the Test Differential Gene Expression application.
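The difference between the overlap modes can be sketched for a single read and feature (half-open genomic intervals; a simplification that ignores HTSeq's handling of read pairs and ambiguous reads):

```python
# Intervals are half-open (start, end) pairs on a chromosome.
def overlaps_union(read, feature):
    """Union mode: count if the read at least partly overlaps the feature."""
    r_start, r_end = read
    f_start, f_end = feature
    return r_start < f_end and f_start < r_end

def overlaps_strict(read, feature):
    """Intersection-strict mode: count only if the read lies entirely inside the feature."""
    r_start, r_end = read
    f_start, f_end = feature
    return f_start <= r_start and r_end <= f_end

exon = (100, 200)
inside = (120, 180)   # fully contained: counted in both modes
partial = (180, 250)  # hangs over the exon edge: counted only in union mode

assert overlaps_union(inside, exon) and overlaps_strict(inside, exon)
assert overlaps_union(partial, exon) and not overlaps_strict(partial, exon)
```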

Isoform quantification with Salmon

Specific genes can produce a range of different transcripts encoding various isoforms, i.e. proteins of varying lengths
containing different segments of the basic gene sequence. Such isoforms can be generated, for example, in the process
of alternative splicing.
Action: to quantify abundances of transcripts from RNA-seq data. The application requires a set of reference transcripts and uses the concept of quasi-mapping to provide accurate expression estimates very quickly.
The application is based on the Salmon tool.

Isoform quantification with Kallisto

Specific genes can produce a range of different transcripts encoding various isoforms, i.e. proteins of varying lengths
containing different segments of the basic gene sequence. Such isoforms can be generated, for example, in the process
of alternative splicing.
Action: to quantify abundances of genes and isoforms from RNA-seq data without the need for alignment. It uses the Expectation-Maximization algorithm on “pseudoalignments” to find the set of potential transcripts a read could have originated from. Note that the application accepts a reference transcriptome (cDNA), not a genome (DNA).
Let’s inspect the application options:
1. The Strand-specificity protocol parameter specifies how to process the pseudoalignments. If “None”, the application does not take strand specificity into account. To run the application in strand-specific mode, change this value to “Forward” if you are interested only in fragments where the first read in the pair is pseudomapped to the forward strand of a transcript. If a fragment is pseudomapped to multiple transcripts, only the transcripts that are consistent with the first read are kept. “Reverse” is the same as “Forward”, but the first read must be pseudomapped to the reverse strand of the transcript. (default: None)
2. The Enable sequence based bias correction option corrects the transcript abundances according to a model of sequence-specific bias. (default: checked)
3. The Estimated average fragment length (for single-end reads only) option must be specified in the case of single-end reads. Typical Illumina libraries produce fragment lengths ranging from 180 to 200 bp. For paired-end reads, the average fragment length can be directly estimated from the reads. (default: 190)
4. Estimated standard deviation of fragment length (for single-end reads only) option. If you do not know the
standard deviation of the fragment library, you can probably assume that the standard deviation is 10% of the
average fragment length. For paired-end reads, this value will be estimated from the input data. (default: 20)
Use the View report application in the Explore section to review the Kallisto output report.


It contains a table with the following main columns:


• target_id — feature name, e.g. a transcript or gene;
• length — feature length;
• eff_length — effective feature length, i.e. a scaling of feature length by the fragment length distribution;
• est_counts — estimated feature counts;
• tpm — transcripts per million normalized by total transcript count in addition to average transcript length.
The application is based on the Kallisto tool.

Quantify FPKM coverage in isoforms

Specific genes can produce a range of different transcripts encoding various isoforms, i.e. proteins of varying lengths
containing different segments of the basic gene sequence. Such isoforms can be generated, for example, in the process
of alternative splicing.
Action: to quantify read abundance at the isoform level. It accepts mapped reads (corresponding to isoform alignment) and a reference genome as inputs. The output is a file containing isoform counts. Several such files corresponding to samples with different biological conditions and isoforms can be further used in the Test Differential Isoforms Expression application.
Before running the application, you can choose the following parameters:
1. The Strand-specificity protocol used for generating your reads. If “None”, the application will consider your data as non-strand-specific, but this value can be changed to “dUTP” or “RNA-ligation”. (default: None)
2. The No correction by effective length option is used if you would like not to apply effective length normalization to transcript FPKM (fragments per kilobase of exon per million mapped reads). (default: unchecked)
The application always performs an initial estimation procedure to more accurately weight reads mapping to multiple places in the genome.
This application is based on cuffquant (a part of the Cufflinks tool) and is used in the Differential Isoform Expression Analysis public data flow.

Test differential gene expression

Action: to perform differential gene expression analysis between groups of samples. The application accepts Mapped Read Counts (from the “Quantify Raw Coverage in Genes” application) and generates a Differential Expression Statistics file which you can view with the Expression Navigator application.
Options:


1. The “Group samples by” option allows you to apply autogrouping, i.e. the application helps you group your samples according to an experimental factor indicated in the samples’ metainfo (e.g. disease, tissue, sex, cell type, cell line, treatment, etc.). (default: None)
2. Method for differential expression. The application supports two methods — the “DESeq2” and “edgeR” statistical R packages — to perform normalization across libraries, fit a negative binomial distribution and run a likelihood ratio test (LRT) using a generalized linear model (GLM). (default: DESeq2)
With edgeR, one of the following types of dispersion estimate is used, in order of priority and depending on the availability of biological replicates: Tagwise, Trended, or Common. edgeR is much faster than DESeq2 at fitting the GLM, but it takes slightly longer to estimate the dispersion. Note also that edgeR gives moderated fold changes for the extremely lowly expressed Differentially Expressed (DE) genes which DESeq2 discards, reflecting that the likelihood of a gene being significantly differentially expressed is related to how strongly it is expressed. Choose one of the packages according to your needs and run the analysis.
For each group, a GLM LRT is carried out to find DE genes in this group compared to the average of the other groups. In the case of two groups, this reduces to the standard analysis of finding genes that are differentially expressed between the two groups. Thus, for N groups, the application produces N tables of Top DE genes. Each table shows the corresponding Log2(Fold Change), Log2(Counts per Million), p-value, and False Discovery Rate for each gene. Look at all result tables and plots in the Expression Navigator application.
• log-fold change: the fold-change in expression of a gene between two groups A and B is the average expression
of the gene in group A divided by the average expression of the gene in group B. The log-fold change is obtained
by taking the logarithm of the fold change in base 2.
• log-counts per million: dividing each read count by the total read counts in the sample, and multiplying by
10^6 gives counts per million (CPM). log-counts per million are obtained by taking the logarithm of this value
in base 2.
• p-value. The application also computes a p-value for each gene. A low p-value (typically, < 0.005) is viewed as
evidence that the null hypothesis can be rejected (i.e. the gene is differentially expressed). However, due to the
fact that we perform multiple testing, the value that should be looked at to safely assess significance is the false
discovery rate.
• False discovery rate. The FDR is a corrected version of the p-value, which accounts for multiple testing
correction. Typically, an FDR < 0.05 is good evidence that the gene is differentially expressed. You can read
more about it here.
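These statistics can be computed by hand for a toy dataset (a minimal sketch with made-up counts; real DESeq2/edgeR additionally model dispersion and library normalization factors):

```python
import math

# Toy per-gene read counts for two groups of samples.
group_a = {"geneX": [500, 520], "geneY": [10, 12]}
group_b = {"geneX": [250, 240], "geneY": [11, 9]}

def log2_fold_change(counts_a, counts_b):
    """log2 of (mean expression in group A / mean expression in group B)."""
    mean_a = sum(counts_a) / len(counts_a)
    mean_b = sum(counts_b) / len(counts_b)
    return math.log2(mean_a / mean_b)

def log2_cpm(count, library_size):
    """log2 counts per million for one gene in one sample."""
    return math.log2(count / library_size * 1e6)

def bh_fdr(pvalues):
    """Benjamini-Hochberg adjusted p-values (the False Discovery Rate column)."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, keeping the running minimum of p * n / rank.
    for offset, i in enumerate(reversed(order)):
        rank = n - offset
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

assert round(log2_fold_change(group_a["geneX"], group_b["geneX"]), 2) == 1.06
```

Note how the BH adjustment can only raise a p-value: the FDR column is always at least as conservative as the raw p-value.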
This application is based on DESeq2 and edgeR R packages.

Test differential isoform expression

Action: to perform differential isoform expression analysis between groups of samples. The application accepts FPKM Read Counts (from the Quantify FPKM Coverage in Isoforms application) and generates a Differential Expression Statistics file which you can view in the Expression Navigator application.
The application has the following options:
1. The “Group samples by” option allows you to apply autogrouping, i.e. the application helps you group your samples according to an experimental factor indicated in the samples’ metainfo (e.g. disease, tissue, sex, cell type, cell line, treatment, etc.). (default: None)
2. Apply fragment bias correction option - if checked, the application will run the bias detection and correction
algorithm which can significantly improve accuracy of transcript abundance estimates. (default: checked)
3. The Apply multiple reads correction option is useful if you would like to apply the multiple reads correction.
(default: checked)


The application finds isoforms that are differentially expressed (DE) between several groups of samples and produces
tables of Top DE transcripts. Each table shows the corresponding Log2(Fold Change), Log2(Counts per Million),
p-value, and False Discovery Rate for each isoform. Use the Expression Navigator to visualize the results.
• log-fold change: the fold-change in expression of a gene between two groups A and B is the average expression
of the gene in group A divided by the average expression of the gene in group B. The log-fold change is obtained
by taking the logarithm of the fold-change in base 2.
• log-counts per million: dividing each read count by the total read counts in the sample, and multiplying by
10^6 gives counts per million (CPM). log-counts per million are obtained by taking the logarithm of this value
in base 2.
• p-value. The application also computes a p-value for each isoform. A low p-value (typically, < 0.005) is viewed
as evidence that the null hypothesis can be rejected (i.e. the isoform is differentially expressed). However, due
to the fact that we perform multiple testing, the value that should be looked at to safely assess significance is the
false discovery rate.
• False discovery rate. The FDR is a corrected version of the p-value, which accounts for multiple testing
correction. Typically, an FDR < 0.05 is good evidence that the isoform is differentially expressed. You can read
more about it here.
This application is based on cuffdiff, which is a part of the Cufflinks tool.

Expression navigator

Action: to view and filter the results of differential gene and isoform expression analyses.

The Expression Navigator page contains four sections:


1. Groups Information section. It is a summary of the groups available for comparison. Size refers to the number
of samples used to generate each group.


2. The Top Differentially Expressed Genes section allows you to choose which groups to compare and how to
filter and sort identified differentially expressed (DE) genes.


You can filter DE genes by maximum acceptable false discovery rate (FDR), up or down regulation, minimum log fold
change (LogFC), and minimum log counts per million (LogCPM).


Let’s look through these statistics:


• log-fold change: the fold-change in expression of a gene between two groups A and B is the average expression
of the gene in group A divided by the average expression of the gene in group B. The log-fold change is obtained
by taking the logarithm of the fold change in base 2. Log-transformed values contain the same information as the fold change but are easier to interpret because they are symmetric around zero. Genes with a positive log FC are considered up-regulated in the selected group; ones with a negative log FC are down-regulated.
• log-counts per million: dividing each read count by the total read counts in the sample, and multiplying by
10^6 gives counts per million (CPM). Log-counts per million are obtained by taking the logarithm of this value
in base 2.
• p-value. The application also computes a p-value for each gene. A low p-value (typically, < 0.005) is viewed as
evidence that the null hypothesis can be rejected (i.e. the gene is differentially expressed). However, due to the
fact that we perform multiple testing, the value that should be looked at to safely assess significance is the false
discovery rate.
• False discovery rate. The FDR is a corrected version of the p-value, which accounts for multiple testing
correction. Typically, an FDR < 0.05 is good evidence that the gene is differentially expressed. You can read
more about it here.
Moreover, you can sort the DE genes by these statistics by clicking the arrows next to the name of the metrics in the
table headers.

The buttons at the bottom of the section allow you to update the list based on your filtering criteria or clear your
selection.
3. The top-right section contains a boxplot of expression levels. Each colour corresponds to a gene. Each boxplot
corresponds to the distribution of a gene’s expression levels in a group, and coloured circles represent the
expression value of a specific gene in a specific sample.


4. The bottom-right section contains a search box that allows you to look for specific genes of interest. You can
look up genes by gene symbol, with autocomplete. You can search for any gene (not only those that are visible
with the current filters).


You can read more about this application in the corresponding tutorials.

Single-cell RNA-seq analysis

Action: to identify heterogeneously-expressed (HE) genes across cells, while accounting for technical noise. The
application analyses single-cell RNA-seq data and accepts several Mapped Read Counts as inputs. The output report
can be opened in Single-cell RNA-seq Visualiser.
The application supports two algorithms for heterogeneity analysis. The first uses spike-in data (artificially introduced
RNAs of known abundance) to calibrate a noise model. The second method is a non-parametric algorithm based on
smoothing splines and does not require the presence of spike-in data.
To identify highly variable genes you can try different options:
1. The Use spike-ins to calibrate noise option determines whether or not spike-in data should be taken into account. If you select only one folder before running the application, the spike-free algorithm will be used and this option will be switched off by default. But if you select two folders, one for biological and the other for spike-in data, you can use the Brennecke algorithm, which requires this option.
2. The Exclude samples with low coverage option allows you to exclude or include for analysis samples with low
read counts. (default: checked)
3. Significance level for the p-value (-10log10 (p)). If you set it equal to 1, the application will select the genes for
which the p-value is smaller than 0.1. (default: 1)


The next three options will be available if spike-ins are included in the experiment and the “Use spike-ins to calibrate noise” option is switched on:
4. The Expected biological CV is the minimum threshold chosen for quantifying the level of biological variability
(CV — coefficient of variation) expected in the null hypothesis of the model. (default: 0.5)
5. The Noise fit - proportion of genes with high CV2 to remove option allows you to exclude spike-in genes with
high CV2 to fit the noise model. (default: 0)
6. The Noise fit - proportion of genes with low mean expression to remove option enables you to exclude a
fraction of spike-in genes with low mean expression to fit the noise model, because extreme outliers tend to
skew the fit. (default: 0.85)
To look at the HE analysis results, open the created Single-cell RNA-seq Analysis page in Single-cell RNA-seq
Visualiser.
This application is based on the following R packages: DESeq, statmod, ape, flashClust and RJSONIO.
Read more about single-cell RNA-seq analysis on Genestack.

Single-cell RNA-seq visualiser

Action: to explore cell-to-cell variability in gene expression in even seemingly homogeneous cell populations based
on scRNA-seq datasets.
The application shows basic statistics such as the number of identified highly variable genes across the analysed
samples.

It also provides several quality control (QC) plots allowing you to check the quality of raw sequencing data, estimate and fit technical noise for the Brennecke algorithm, and detect genes with significantly high variability in expression.

7.1. Sequencing data 241



QC plots are adapted from the original paper by Brennecke et al. In all the plots described below, gene expression
levels are normalized using the DESeq normalization procedure.
The first plot describing the quality of raw data is the Scatter Plot of Normalised Read Counts, which shows the cell-
to-cell correlation of normalized gene expression levels. Each dot represents a gene, its x-coordinate is the normalized
gene count in the first cell, and its y-coordinate is the normalized gene count in the second cell. If spike-ins were used
during the analysis, separate plots will be rendered for spike-in genes and for sample genes.


The Technical Noise Fit and Highly Variable Genes plots provide a visual summary of the gene expression noise
profile in your dataset across all cells.

They graph the squared coefficient of variation (CV2) against the average normalized read count across samples. The
Gene Expression Variability QC plot allows you to visualize the genes whose expression varies significantly across
cells. A gene is considered highly variable if the biological part of its coefficient of variation is significantly higher
than a user-defined threshold (default 50%, i.e. CV2 > 0.25; this can be modified in the Single-cell Analyser). The
coefficient of variation is defined as the standard deviation divided by the mean, and is thus a standardized measure
of variance.
If spike-ins were used to calibrate technical noise, a separate Technical Noise Fit plot is displayed. Each dot on this
plot corresponds to a “technical gene” (spike-in gene): its x-coordinate is the mean normalized count across all
samples, and its y-coordinate is the squared coefficient of variation (CV2) of the normalized counts across all
samples. The plot also shows the fitted noise model as a solid red line (with 95% confidence intervals as dotted red
lines), which allows you to check whether the noise model fits the data reasonably well. If it does not, you should
change the noise fitting parameters in the Single-cell Analysis application.
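The fit and the highly-variable-gene test described above can be sketched as follows: compute the mean and squared coefficient of variation (CV2) of each spike-in across cells, fit CV2 = a1/mean + a0, then flag sample genes whose CV2 exceeds the fitted technical noise plus the expected biological CV2. This is a simplified illustration using ordinary least squares and a hard threshold; the original Brennecke et al. method uses a gamma-family GLM and a formal significance test.

```python
import numpy as np

def fit_technical_noise(spike_counts):
    """Fit CV^2 ~= a1/mean + a0 to spike-in genes.

    Simplified: ordinary least squares instead of the gamma-family GLM
    used by Brennecke et al.
    """
    mu = spike_counts.mean(axis=1)
    cv2 = spike_counts.var(axis=1, ddof=1) / mu**2
    design = np.column_stack([1.0 / mu, np.ones_like(mu)])
    (a1, a0), *_ = np.linalg.lstsq(design, cv2, rcond=None)
    return a1, a0

def highly_variable(gene_counts, a1, a0, min_biol_cv2=0.25):
    """Flag genes whose CV^2 exceeds technical noise plus the expected
    biological CV^2 (default CV = 0.5, i.e. CV^2 = 0.25).

    Simplified: a hard threshold instead of a chi-squared test.
    """
    mu = gene_counts.mean(axis=1)
    cv2 = gene_counts.var(axis=1, ddof=1) / mu**2
    return cv2 > a1 / mu + a0 + min_biol_cv2
```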


Expression of the highly variable genes across all cell samples is represented by an interactive clustered heatmap.

The interactive heatmap depicts the log-normalised read count of each significant highly variable gene (rows) in each
cell sample (columns). Hierarchical clustering of molecular profiles from cell samples is based on similarity in the
expression of highly variable genes and allows identification of molecularly distinct cell populations. The heatmap is
clustered both by columns and by rows, to identify clusters of samples with similar gene expression profiles and
clusters of potentially co-expressed genes. The bi-clustered heatmap is provided by the open-source interactive
Javascript library InCHlib (Interactive Cluster Heatmap library).


Finally, several plots in the Samples Visualisation section can be used to detect cell subpopulations and identify novel
cell populations based on gene expression heterogeneity in the single-cell transcriptomes.

The Samples Visualisation section provides interactive plots used to cluster cell samples based on expression of highly
variable genes. Currently, two alternative methods are supported for visualisation and clustering of samples: the first
one is based on the t-distributed Stochastic Neighbour Embedding (t-SNE) algorithm and the second one uses Principal
Component Analysis (PCA).
For automatic cluster identification, the k-means clustering algorithm can be used in combination with either t-SNE
or PCA. K-means clustering requires you to supply a number of clusters to look for (“k”). You can either enter it
manually using the dropdown menu or use the suggested value estimated using the “elbow” method (choosing a value
of k such that increasing the number of clusters does not significantly reduce the average “spread” within each cluster).
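The “elbow” heuristic described above can be sketched as follows: given the within-cluster sum of squares (WCSS) for k = 1, 2, 3, …, stop at the first k where adding another cluster no longer reduces the spread by much. The 10% cut-off below is an arbitrary illustration, not Genestack's actual criterion:

```python
def elbow_k(wcss, min_reduction=0.10):
    """Pick k at the elbow of a within-cluster sum-of-squares curve.

    wcss[i] is the WCSS obtained with i + 1 clusters. Returns the first k
    where moving from k to k + 1 clusters reduces WCSS by less than
    `min_reduction` (as a fraction); the threshold is illustrative.
    """
    for k in range(1, len(wcss)):
        if (wcss[k - 1] - wcss[k]) / wcss[k - 1] < min_reduction:
            return k
    return len(wcss)

# WCSS drops sharply up to k = 3, then flattens: the elbow is at 3.
suggested = elbow_k([100.0, 40.0, 15.0, 14.0, 13.5])
```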
The Interactive Principal Component Analysis (PCA) scatter plot is rendered using the NVD3 Javascript library. The
PCA features and the k-means results are computed using R’s built-in functions prcomp and kmeans. The t-SNE
transformation is computed using the Rtsne package.
Read our blog post about the application and single-cell RNA-seq analysis.


7.1.5 Genome/exome sequencing data analysis

Mapping (also called alignment) refers to the process of aligning sequencing reads to a reference sequence, whether
the reference is a complete genome, transcriptome, or de novo assembly.
There are two main mapping strategies — spliced mapping and unspliced mapping. In contrast to spliced aligners,
unspliced read aligners map reads to a reference without allowing large gaps, such as those arising from reads
spanning exon boundaries (splice junctions). When analysing whole-genome sequencing (WGS) or whole-exome
sequencing (WES) data, there is no need to detect splice sites, which is why we recommend using unspliced mapping
applications in such cases.
On Genestack, you will find two unspliced aligners — Unspliced Mapping with BWA and Unspliced Mapping with
Bowtie2.

Unspliced mapping with BWA

Action: to map WES or WGS data to a reference genome without allowing splice junctions. The application generates
Mapped Reads which can be used further with our Variant Calling application which is based on samtools mpileup.
BWA’s MEM algorithm is used to map paired-end or single-end reads from 70 bp up to 1 Mbp (the “mem” command).
For reads up to 70 bp, the BWA-backtrack algorithm is applied instead. It is implemented with the “aln” command,
which produces the suffix array (SA) coordinates of the input reads; the application then converts these SA
coordinates to chromosome coordinates using the “samse” command (if your reads are single-end) or “sampe” (for
paired-end reads).
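The algorithm choice above can be sketched as the sequence of commands the platform would run. File names such as `reads.sai` are hypothetical placeholders for illustration, not what Genestack actually uses internally:

```python
def bwa_commands(read_length, reads, reference, mate=None):
    """Return the BWA command(s) for the strategy described above.

    Reads of 70 bp and longer go through "bwa mem"; shorter reads go
    through BWA-backtrack ("aln" + "samse"/"sampe"). The .sai file
    names are hypothetical placeholders.
    """
    if read_length >= 70:
        cmd = ["bwa", "mem", reference, reads]
        if mate:
            cmd.append(mate)
        return [cmd]
    # "aln" produces suffix array (SA) coordinates for each reads file...
    cmds = [["bwa", "aln", reference, reads]]
    if mate:
        cmds.append(["bwa", "aln", reference, mate])
        # ..."sampe" converts them to chromosome coordinates (paired-end).
        cmds.append(["bwa", "sampe", reference, "reads.sai", "mate.sai", reads, mate])
    else:
        # ..."samse" does the same for single-end reads.
        cmds.append(["bwa", "samse", reference, "reads.sai", reads])
    return cmds
```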
Command line options:
1. Perform targeted mapping option. If this parameter is selected, a BED file (the “Target region” source file) is
used to restrict mapping of the reads to specific locations in the genome. The reference genome is reduced to
only contain those locations, using the bedtools “getfasta” command, and the reads are then mapped to the
reduced genome. The resulting SAM file contains local genome coordinates, which are converted back to the
global coordinates of the reference genome. (default: unchecked)
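The coordinate conversion in that last step comes down to simple addition: each extracted region keeps its reference chromosome and start offset, so a local position maps back by adding the offset. The region-naming scheme below is an assumption for illustration:

```python
def to_global(regions, local_chrom, local_pos):
    """Convert a position on the target-restricted genome back to
    reference coordinates.

    regions maps each extracted region's name (a "chrom:start-end" style
    name is assumed here, similar to what bedtools getfasta produces)
    to its (reference_chrom, start_offset).
    """
    ref_chrom, start = regions[local_chrom]
    return ref_chrom, start + local_pos

# A read mapped at position 5 of the extracted region chr1:100-200
# sits at position 105 of the reference chromosome.
regions = {"chr1:100-200": ("chr1", 100)}
```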
The application is based on the BWA aligner and BEDtools. The application is used in Whole Exome Sequencing
Data Analysis and Whole Genome Sequencing Data Analysis tutorials.

Unspliced mapping with Bowtie2

Action: to map WES or WGS data to a reference genome without allowing splice junctions. The application generates
Mapped Reads which can be used further with our Variant Calling application which is based on samtools mpileup.
Let’s look at the parameters available for mapping:
1. Report the best mapping option. The application will consider only the best mapping for one mappable read.
(default: checked)
2. Limit the number of mappings to search option. If you are interested in reads mapping to multiple positions,
switch off “Report the best mapping” option and set N mappable positions for one read in the text box for “Limit
the number of mappings to search”. (default: 1)
3. Rule for filtering mappings. You can apply a rule for filtering mappings to choose whether to keep reads
mapping uniquely or to multiple positions. (default: None)
4. Number of allowed mismatches option. If you want to be stricter, you can change the maximum number of
allowed mismatches, e.g. if you set it to 1, any mapping with 2 or more mismatches will not be reported (default:
0)
For paired-end reads, two more options appear:


5. The Disallow unique mappings of one mate option allows you to discard pairs of reads where one mate maps
uniquely and the other to multiple positions. (default: unchecked)
6. The Disallow discordant mappings parameter will discard all mappings where the two mates map uniquely
but with unexpected orientation, or where the distance between the two mapped mates differs from an internally
estimated fragment length, including mates mapping to different chromosomes. (default: unchecked)
The application is based on the Bowtie2 aligner.

Variant calling with SAMtools and BCFtools

Action: to identify genomic variants. The application accepts Mapped Reads files to call variants. You can perform
variant calling on each Mapped Reads file separately or run the Variant Calling application on multiple mapped reads
samples at once. The latter option can be helpful because taking the reads from several samples into consideration
increases the accuracy of the analysis and reduces the probability of calling sequencing errors. After the variants are
detected, you can annotate them with the Effect Prediction application and/or explore the results in Genome Browser
and Variant Explorer.
The application uses samtools mpileup, which automatically scans every position supported by an aligned read,
computes all the possible genotypes supported by these reads, and then calculates the probability that each of these
genotypes is truly present in your sample.
As an example, let’s consider the first 1000 bases in a Reference Genome file. Suppose position 35 (G in the
reference) is covered by 27 reads with a G base and two reads with a T nucleotide, for a total read depth of 29. In this
case, the application concludes with high probability that the sample has genotype G, and that the T reads are likely
due to sequencing errors. In contrast, if position 400 in the reference genome is T, but it is covered by 2 reads with a
C base and 66 reads with a G (total read depth of 68), the sample most likely has genotype G.
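The reasoning in this example can be made concrete with a toy binomial read model: compute how likely the observed base counts are under each diploid genotype. This is only a sketch with an assumed sequencing error rate; mpileup's actual model is considerably more sophisticated:

```python
from math import comb, log10

def genotype_log10_likelihoods(ref_reads, alt_reads, error_rate=0.01):
    """log10 P(observed reads | genotype) under a toy binomial model.

    For each genotype, the expected fraction of alternate-base reads is:
    hom-ref ~ error_rate, het ~ 0.5, hom-alt ~ 1 - error_rate.
    """
    n = ref_reads + alt_reads
    likelihoods = {}
    for genotype, p_alt in (("hom_ref", error_rate),
                            ("het", 0.5),
                            ("hom_alt", 1 - error_rate)):
        likelihoods[genotype] = (log10(comb(n, alt_reads))
                                 + alt_reads * log10(p_alt)
                                 + ref_reads * log10(1 - p_alt))
    return likelihoods

# Position 35 from the example: 27 G (reference) reads, 2 T reads.
position_35 = genotype_log10_likelihoods(27, 2)
```

For these counts the homozygous-reference genotype is by far the most likely, matching the conclusion above that the two T reads are probably sequencing errors.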
Then the application executes bcftools call, which uses the genotype likelihoods generated in the previous step to
call and filter genetic variants, and outputs all identified variants in a Genetic Variations file.
Let’s now look at the command line options more closely:
1. Variants to report option. The application can call “SNPs and INDELs”, “SNPs only” or “INDELs only”.
(default: “SNPs and INDELs”)
2. Call only multi-allelic variants option. Multiallelic calling is recommended for most tasks. (default:
checked)

Note: What is a multiallelic variant?


A multiallelic variant is a genomic variant with two or more observed alternate alleles at the variant locus. In contrast,
a consensus (or biallelic) variant has a single non-reference allele (there are only two possible alleles at the variant
site - the reference allele and the alternate one).

3. Only report variant sites option. In some cases, you may be interested in reporting only potential variant sites
and excluding monomorphic ones (sites without alternate alleles). For this purpose, switch on the “Only report
variant sites” option. (default: checked)
4. The Discard anomalous read pairs option is used to skip anomalous read pairs in variant calling. (default:
checked)
5. The Maximum per-sample read depth to consider per position option sets the maximum number of reads at
the position to consider. (default: 250)
6. Minimum number of gapped reads for an INDEL candidate option. Typically, gapped alignments (like the
ones from Unspliced Mapping with Bowtie2) can be used to identify indels (about 1-10 bases in length). The
greatest indel sensitivity can be achieved by generating indel candidates from mapped reads. (default: 1)


7. Minimum per-sample depth to call non-variant block option. A non-variant block is a single record
describing a segment of non-variant calls. Specify the minimum read depth you expect to observe among all
sites encompassed by the non-variant block. (default: 1)
8. Minimum variant quality option. The application will ignore variants with a quality score below this value.
(default: 20)
9. The Minimum average mapping quality for a variant parameter is used to discard all variants with average
mapping quality value less than specified. (default: 20)
10. The Minimum all-samples read depth for a variant is a minimum number of reads covering position. (default:
1)
11. The Chromosome to analyse option allows you to choose specific chromosomes to analyse. (default: All)
12. Key to merge samples is a metainfo key you can specify if you would like to merge samples. This option
can be useful for merging technical replicates.
Moreover, base alignment quality (BAQ) recalculation is turned on by default. It helps to rule out false positive SNP
calls due to alignment artefacts near small indels.
Also, the application will always write DP (number of reads covering position), DV (number of high-quality variant
reads), DP4 (number of forward reference, reverse reference, forward non-reference and reverse non-reference alleles
used in variant calling) and SP (phred-scaled strand bias p-value) tags in the output file.
The resulting Genetic Variations file can be explored in Genome Browser as a separate variation track, further
annotated using the Effect Prediction application, or viewed immediately in the Variant Explorer application.
This application is based on the SAMtools and BCFtools utilities and best used when performing Whole Exome
Sequencing Analysis or Whole Genome Sequencing Analysis.

Effect prediction with SnpEff

Action: to annotate variants based on their genomic locations and calculate the effects they produce on known genes.
The application accepts Genetic Variations and adds annotations for them.
The annotated variants can be further explored in Genome Browser, Variant Explorer or View Report applications.
In Genome Browser, the Variation track shows the genetic variants (SNPs, insertions, etc.), their exact position on
the genome, average mapping quality and raw read depth.


If you would like to see the whole list of effects and annotations for variants, as well as some general statistics (for
example, the number of variants per chromosome, how many variants correspond to SNPs or insertions, and the
number of effects by type and region), open the annotated Genetic Variations file in the View Report application.
Read about the variant annotations and report statistics in the Effect annotation section of the Whole Exome
Sequencing tutorial.
Use the Variant Explorer application to see the effect produced by each individual variant, and to sort and filter the
variants by various fields, such as mutation type, quality, locus, etc.

This application is based on the open-source SnpEff tool and best used in Whole Exome Sequencing and Whole
Genome Sequencing analyses.

Variant explorer

Action: to interactively explore genetic variations such as SNPs, MNPs, and indels at specific genomic positions. The
application not only displays the information about variants but also allows you to sort and filter by various fields, such
as mutation type, quality, locus, etc.


Variant Explorer takes as input a Genetic Variations file which can be imported or generated with the Variant Calling
application. If you open it in the application, you will see default DP (Raw read depth) and MQ (Average mapping
quality) columns (“Other” tab in “Columns” section).

Variants can be annotated with the Effect Prediction application that analyses genomic position of the variants and
reveals the effects they produce on known genes (such as amino acid changes, synonymous and nonsynonymous
mutations, etc.). For such variants the following information will be shown (find it in “Effect prediction” tab).


• Effect — effect predicted by SnpEff tool;


• Impact — impact predicted by SnpEff tool;
• Functional class — functional class of a region, annotated by SnpEff tool.
Moreover, the application calculates “Additional metrics” such as genotype frequencies for samples homozygous for
the reference or alternate allele (the GF HOM REF and GF HOM ALT columns, respectively), read depth for samples
homozygous for the alternate allele (DP HOM ALT), and read depth for heterozygous samples (DP HET).

Note: How many raw reads match to the reference and alternative alleles?


DP and DP4 fields may help.


DP is the raw read depth. DP4 gives the numbers of reads covering the reference-forward, reference-reverse,
alternate-forward and alternate-reverse bases. For example, DP4=0,0,1,2 means 1 read supports the alternate base on
the forward strand, 2 reads support the alternate base on the reverse strand, and no covering reads have the reference
base at that position. The sum of DP4 will not always equal the DP value, because some reads are of too low quality.

Note: How can I find out the allele frequency for a variant?


Have a look at the allele frequency (RAF column), which is the fraction of reads supporting the alternate allele (that
information is provided in the DP4 field). Our Variant Calling application fits a model of categorical allele
frequencies: 0 (homozygous reference), ~0.5 (heterozygote, carrying one copy of each of the reference and alternate
alleles) or 1 (homozygous alternate).
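Putting the two notes together, the alternate-allele fraction can be computed directly from DP4. A small sketch (the INFO string follows the standard VCF `key=value;...` layout):

```python
def parse_dp4(info):
    """Extract the DP4 counts from a VCF INFO string."""
    fields = dict(item.split("=", 1) for item in info.split(";") if "=" in item)
    return tuple(int(x) for x in fields["DP4"].split(","))

def alt_allele_fraction(dp4):
    """Fraction of high-quality reads supporting the alternate allele.

    dp4 = (ref forward, ref reverse, alt forward, alt reverse).
    """
    ref_fwd, ref_rev, alt_fwd, alt_rev = dp4
    return (alt_fwd + alt_rev) / (ref_fwd + ref_rev + alt_fwd + alt_rev)

# The example from the note above: DP4=0,0,1,2 means every covering
# read supports the alternate base, i.e. an allele fraction of 1.
dp4 = parse_dp4("DP=31;DP4=0,0,1,2;MQ=50")
```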

To change the default columns or add more columns, choose them in the corresponding tabs in “Columns” section and
“Save” your changes. After that all selected columns will be displayed in Table viewer.
You can “download filtered data as .tsv” or create a new file with filtered variants.
Read more about this application in our tutorials on Whole Exome Sequencing and Whole Genome Sequencing
analyses.

Intersect genomic features

Action: to perform an intersection between several feature files, such as Mapped Reads files or Genetic Variations
files. Depending on the input files, the application generates different outputs, either Mapped Reads or Genetic
Variations files.
Let’s look at the command line options:
1. Rule for filtering option. The application can “Report overlapping features”. For example, you could isolate
single nucleotide polymorphisms (SNPs) that overlap with SNPs from another file. For this, intersect two
Genetic Variations files. But there are cases when you would like to know which features do not overlap with
other ones (use “Report non-overlapping features” filter). (default: Report overlapping features)
2. The Minimum overlapping fraction option allows you to check whether a feature of interest has a specified
fraction of its length overlapping another feature. (default: 10)
3. The Rule for overlap strandedness option allows you to ignore overlaps on the same strand (“Discard overlaps
on the same strand”), on the other strand (“Discard overlaps on the other strand”) or expect overlapping without
respect to the strandedness (“None”). (default: None)
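The overlap-fraction test can be sketched as plain interval arithmetic (half-open coordinates; the 10% threshold below assumes the option's default value of 10 is a percentage):

```python
def overlap_fraction(a, b):
    """Fraction of interval a's length covered by interval b.

    Intervals are (start, end) in half-open coordinates.
    """
    overlap = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    return overlap / (a[1] - a[0])

def keep_feature(a, b, min_fraction=0.10, report_overlapping=True):
    """Apply the filtering rule: report overlapping features, or (with
    the non-overlapping filter) features that do not overlap."""
    hit = overlap_fraction(a, b) >= min_fraction
    return hit if report_overlapping else not hit
```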
This application is based on the BEDtools.

7.1.6 Bisulfite sequencing data analysis

Bisulfite sequencing mapping with BSMAP

Action: to map high-throughput bisulfite sequencing reads at the level of the whole genome.
Let’s talk a bit about various settings:
1. The Number of mismatches option lets you set the maximum number of allowed mismatches per read.
Increasing this value increases both the percentage of mapped reads and the application runtime. For example,
by default a read can be mapped to the genome with no more than 5 mismatches. (default: 5)


2. Rule for multiple mappings option. The application can either “only report unique hits” for a mappable read
or, if reads map to multiple positions in the genome, “report 1 random ‘best’ mapping”. The latter prevents
duplicated genome regions from being omitted altogether. (default: Report 1 random “best” mapping)
3. The BS data generation protocol option enables you to specify what library preparation method was used to
construct the bisulfite converted library. (default: Lister)
If the “Lister” protocol was used, your reads will be mapped to two forward strands. You can read more about this
protocol in Lister et al. If you choose the “Cokus” protocol the application will align your reads to all four strands.
You can find more details about this protocol in the original study by Cokus et al.
The application is based on the BSMAP aligner and it is used in the Whole-Genome Bisulfite Sequencing Analysis
tutorial.

Reduced representation bisulfite sequencing mapping with BSMAP

Action: to map reduced representation bisulfite sequencing (RRBS) reads to the specific digestion sites on the genome.
Let’s talk a bit about various settings:
1. Enzyme sequence option is important: it specifies the sequence recognised by the restriction enzyme used to
digest genomic DNA during library preparation. By default, the application uses the C-CGG sequence, which
is recognised by the MspI restriction enzyme. (default: “C-CGG”)
2. The Number of mismatches option lets you set the maximum number of allowed mismatches per read.
Decreasing this number reduces both the application runtime and the percentage of mapped reads. (default: 5)
3. Rule for multiple mappings option. The application can either “only report unique hits” for a mappable read
or, if reads map to multiple positions in the genome, “report 1 random ‘best’ mapping”. The latter prevents
duplicated genome regions from being omitted altogether. (default: Report 1 random “best” mapping)
4. The BS data generation protocol option enables you to specify what library preparation method was used to
construct the bisulfite converted library. (default: Lister)
If the “Lister” protocol was used, your reads will be mapped to two forward strands. You can read more about this
protocol in Lister et al. If you choose the “Cokus” protocol the application will align your reads to all four strands.
You can find more details about this protocol in the original study by Cokus et al.
The application is based on the BSMAP aligner.

Methylation ratio analysis

Action: to determine the percent methylation at each ‘C’ base in mapped reads. Next, you can view methylation ratios
in Genome Browser.
Command line options are the following:
1. The Minimum coverage option filters the results by depth of coverage. For example, raising it to 5 requires
that at least five reads cover a position. (default: not set)
2. Trim N end-repairing fill-in bases option. For paired-end mappings, you can trim from 1 to 240 fill-in
nucleotides introduced during DNA fragment end-repair. For RRBS mappings, the number of fill-in bases can
be determined by the distance between the cutting sites on the forward and reverse strands. For WGBS
mappings, it is recommended to set this number between 0 and 3. (default: not set)
3. The Report loci with zero methylation ratios option is used to report positions with zero methylation. (default:
unchecked)
4. The Combine ratios on both strands option allows you to combine CpG methylation ratio from both strands.
(default: unchecked)


5. The Only unique mappings parameter is checked if you would like to process only uniquely mapped reads.
(default: checked)
6. The Discard duplicated reads option is used to remove duplicates from mapped reads. (default: checked)
7. C/T SNPs filtering option. To ignore positions where a possible C/T SNP is detected, choose the “skip”
value. If you want to correct the methylation ratio according to the C/T SNP information estimated from the
G/A counts on the reverse strand, set the “correct” value. Otherwise, the application can ignore C/T SNPs
entirely (the “no-action” value). (default: no-action)
If you analyse paired-end reads, one more option appears:
8. The Discard discordant mappings parameter is used to discard all mappings where the two mates map
uniquely but with unexpected orientation, or where the distance between the two mapped mates differs from
an internally estimated fragment length, including mates mapping to different chromosomes.
The outputs from Methylation Analysis application can be represented in the Genome Browser as Methylation ratios
track.

Note: What does the 0-1000 side bar represent?


These bars represent the final methylation frequency. To understand this, take a simple example. Imagine we
investigate position 30 on chromosome X, and this position has 10 reads contributing to the methylation frequency.
7 of these 10 reads report a C at this position (methylated Cs: no bisulfite conversion, so the Cs were not transformed
into Ts) and 3 reads show a T (unmethylated Cs: bisulfite conversion took place). The final methylation frequency is
then calculated as 7/10 = 0.7. This value is multiplied by 1000 to give 700 (the bar height you see in Genome
Browser). So bars with a value of 0 represent unmethylated positions, and bars with a value of 1000 show maximum
methylation (all reads have a methylated C).
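The arithmetic in this note is easy to express directly; the same formula reproduces the 0-1000 bar heights described above:

```python
def methylation_bar(methylated_reads, unmethylated_reads):
    """Genome Browser bar height: methylation ratio scaled to 0-1000.

    Reads showing C at the position count as methylated; reads showing
    T (bisulfite-converted) count as unmethylated.
    """
    total = methylated_reads + unmethylated_reads
    return round(1000 * methylated_reads / total)

# The example above: 7 methylated + 3 unmethylated reads -> bar of 700.
bar = methylation_bar(7, 3)
```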

The Methylation Analysis application is based on the methratio.py script and it is used in the Whole-Genome Bisulfite
Sequencing Analysis tutorial.

7.1.7 Microbiome data analysis


Microbiome analysis with QIIME

Action: to identify microbial species and their abundances in microbiome samples. The application accepts microbial
sequencing reads and outputs Clinical or Research reports with abundance plots and microbiological diversity metrics.
The microbiome analysis can use either Greengenes (for bacteria) or UNITE (for fungi) reference databases to estimate
the taxonomic composition of the microbial communities.
Let’s review the application options:
1. OTU picking option. To pick OTUs (Operational Taxonomic Units), the application provides two protocols
(default: open-reference):
• closed-reference: reads are clustered against a reference sequence collection and any reads which do not hit a
sequence in the reference sequence collection are excluded from downstream analyses.
• open-reference: reads are clustered against a reference sequence collection and any reads which do not hit the
reference sequence collection are subsequently clustered de novo (i.e. against one another without any external
reference).
2. Algorithm used for clustering. For the open-reference protocol, the application suggests the uclust (default)
or sortmerna_sumaclust algorithms. For the closed-reference protocol, choose between the blast (default),
uclust_ref and sortmerna algorithms.
3. The Quality filter for pre-clustering step option will remove any low quality or ambiguous reads before clus-
tering. (default: 0)
4. The Join paired-end reads (for paired reads only) option will join paired-end reads before the clustering.
(default: unchecked)
The next two options are available only for the open-reference protocol:
5. Taxonomy assignment will be performed using the blast, rdp, rtax, mothur, uclust or sortmerna algorithm.
With the closed-reference method, taxonomy assignment is always performed by the uclust algorithm. (default:
blast)
6. The Percent of reads to cluster de novo option applies to reads that do not hit the reference database and
will be clustered de novo. (default: 0.001)
Output reports include the following metrics:
– counts for every taxonomic unit (how many reads match a given group), in the form of an interactive plot:


And table:

– alpha diversity (richness within each sample, e.g. the number of taxa identified):

– beta diversity (dissimilarity between pairs of samples, i.e. heterogeneity across samples):


The application is based on the open-source tool QIIME.

7.1.8 Additional visualisation applications

This section includes applications that can be used in various pipelines to view the content of the data (e.g.
Sequencing Assay Viewer) or to display multiple data types at different steps of an analysis (e.g. Genome Browser).

Sequencing assay viewer

Action: to show the content of a Raw Reads file and look for specific nucleotide sequences, which can be exact,
reverse, complement or reverse complement matches to the sequence of interest.
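The four search forms mentioned above are related by simple string operations, as this small sketch shows:

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def search_variants(sequence):
    """The four forms of a query the viewer can match: exact, reverse,
    complement and reverse complement."""
    return {
        "exact": sequence,
        "reverse": sequence[::-1],
        "complement": sequence.translate(COMPLEMENT),
        "reverse complement": sequence.translate(COMPLEMENT)[::-1],
    }

variants = search_variants("AACG")
```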


To access this application, select the assay you are interested in, right click on it and from the “Explore” section select
the application.

Genome browser

Action: to visualize different types of genomic data: mapped reads, genetic variants, methylation ratios and others.

There are several tracks that can be visualized in Genome Browser:


• Reference track displays reference genome, its genes (green boxes), transcripts, and their coordinates;


• Coverage track represents the sequencing read coverage for mapped reads;

• Variation track shows genetic variants (SNPs, insertions etc.), their exact position on the genome, average
mapping quality and raw read depth;

• Methylation ratios track reflects the proportion of methylated and unmethylated cytosine residues.


You can also manage tracks: add new ones, hide or delete them. When working with multiple tracks, you can
combine the tracks mentioned above into a Combined track or a Formula track. On a combined track, several tracks
are superimposed and shown together, which makes it easy to compare coverage across samples.

Alternatively, you can apply basic mathematical operations to your genomic data and create formulas, for example
to compute the average of the values from different samples. The results of these computations are shown on the
formula track.
Moreover, each track can be customised by changing its properties (track color, normalised values, showing only
SNPs, the maximum and minimum values to be shown on a track, etc.). Use the “Edit” button to change properties
for multiple tracks at once.
Genome Browser allows you to browse either a specific genomic position (search by coordinates) or a specific feature
(search by feature name). You can navigate through the data to find a feature of interest or explore the regions
surrounding it, and zoom in to nucleotide resolution. A found feature can be marked with sticky notes (Shift + click
on the position on the track). When you share the Genome Browser page with your collaborators, sticky notes help
to focus their attention on your findings.
You can see the Genome browser in action in this blog post.

7.1.9 Reference genomes

One way or another, many bioinformatics analysis pipelines rely on the use of a reference genome. For instance, we
use reference genomes in DNA methylation analysis, in differential gene expression analysis, and in the analysis of
transcriptomic heterogeneity within populations of cells. The choice of reference genome can increase the quality
and accuracy of the downstream analysis or have a harmful effect on it. For example, it has been shown that the
choice of gene annotation has a big impact not only on RNA-seq data analysis but also on variant effect prediction.
On Genestack, you can find several reference genomes for some of the most common model organisms. We are adding
more and more reference genomes of model organisms to this list regularly.
For some organisms we provide several genomes, e.g. there are a couple of reference genomes for Homo sapiens.

What are the differences between these reference genomes? And how do you choose the correct one? The answer is not so straightforward and depends on several factors – let’s discuss each of them:
1. Reference genome assembly and release version
For instance: “Homo sapiens / GRCh37 release 75” vs “Homo sapiens / GRCh38 release 86”.


The numbers correspond to versions (or “builds”) of the reference genome – the higher the number, the more recent the version. We generally recommend you use the latest version possible. One thing to remember is that for the newest genome builds, resources such as genome annotations and functional information are likely to be limited, as it takes time for Ensembl/UCSC to integrate additional genomic data with a new build. You can read more about this in a blog post from the Genome Spot blog and in this article from Bio-IT.
2. One organism – many strains
K12 and O103 are two different strains of E. coli. K12 is an innocuous strain commonly used in labs around the world, while O103 is a pathogenic strain commonly isolated from human cases in Europe. Depending on your experiment, you should choose the matching reference genome.
3. Toplevel sequence or primary assembly
• Toplevel reference genomes contain all chromosomes, sequence regions not assembled into chromosomes, and padded haplotype/patch regions.
• Primary assembly genomes contain all toplevel sequence regions, excluding haplotypes and patches.
We strongly recommend using primary assembly reference genomes, since they are best suited for sequence similarity searches, while patches and haplotypes would confound the analysis.
4. DNA or cDNA
• DNA: the reference genome contains the genomic DNA sequence;
• cDNA: the reference genome consists of the sequences of all known and predicted transcripts, including pseudogenes.
5. Masked, soft-masked and unmasked genomes
There are three types of Ensembl reference genomes: unmasked, soft-masked and masked.
Masking is used to detect and conceal interspersed repeats and low-complexity DNA regions so that they can be processed properly by alignment tools. Masking can be performed by dedicated tools, such as RepeatMasker, which scans the DNA sequence for repeats and low-complexity regions.
There are two types of masked reference genomes: masked and soft-masked.
• Masked reference genomes are also known as hard-masked DNA sequences. Repetitive and low-complexity DNA regions are detected and replaced with ‘N’s. The use of a masked genome may adversely affect the analysis results, leading to wrong read mapping and incorrect variant calls.

Note: When should you use a masked genome?


We generally do not recommend using a masked genome, as masking entails a loss of information (after mapping, some “unique” sequences may not be truly unique) and cannot guarantee 100% accuracy and sensitivity (masking is never perfect). Moreover, it can increase the number of falsely mapped reads.

• In soft-masked reference genomes, repeats and low-complexity regions are also detected, but in this case they are masked by converting the bases to lowercase (e.g. acgt).

Note: When should you use a soft-masked genome?


A soft-masked sequence retains the repeat sequence, indicated by lowercase letters, so using a soft-masked reference could in principle improve mapping quality without loss of sensitivity. However, most alignment tools (e.g. BWA, TopHat, Bowtie 2) do not take soft-masked regions into account and always use all bases in the alignment, whether they are lowercase or not. For this reason, there is no real benefit to using a soft-masked genome over an unmasked one.
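The difference between the two masking styles can be illustrated with a short sketch (plain Python with an invented repeat region; this is not part of any Genestack tool):

```python
# Illustration only: how hard- and soft-masking transform a sequence
# once a repeat region has been located. The repeat coordinates here
# are made up for the example.

def hard_mask(seq, start, end):
    """Replace a repeat region with 'N's (hard-masked genome)."""
    return seq[:start] + "N" * (end - start) + seq[end:]

def soft_mask(seq, start, end):
    """Lowercase a repeat region (soft-masked genome): the sequence
    content is kept, only the case changes."""
    return seq[:start] + seq[start:end].lower() + seq[end:]

seq = "ACGTACGTATATATATACGT"   # suppose positions 8-16 are a low-complexity repeat
print(hard_mask(seq, 8, 16))   # ACGTACGTNNNNNNNNACGT
print(soft_mask(seq, 8, 16))   # ACGTACGTatatatatACGT
```

Note that soft-masking is reversible (upper-casing the sequence restores the original), while hard-masking destroys the repeat sequence entirely.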


• We recommend you use unmasked genomes when you do not want to lose any information. If you want to
perform some sort of filtering, it is better to do so after the mapping step.
Usually, a reference genome’s name includes information about all these factors: organism, genome assembly, release, primary assembly/toplevel, masking procedure and molecule.
Example:
To perform Whole Exome Sequencing analysis, we recommend you use an unmasked reference genome of the latest
releases and assemblies (e.g. Homo sapiens / GRCh38 release 85 (primary assembly, unmasked, DNA) for human
samples).
The bioinformatics community is divided on this topic. It is our personal opinion that it is best to always use an unmasked genome and perform filtering after the mapping step. However, if you would like to read more on the topic, we suggest taking a look at the following papers:
1. McCarthy DJ, Humburg P, Kanapin A, Rivas MA, Gaulton K, Cazier JB, Donnelly P. Choice of transcripts and
software has a large effect on variant annotation. Genome Med. 2014;6(3):26. DOI: 10.1186/gm543;
2. Frankish A, Uszczynska B, Ritchie GR, Gonzalez JM, Pervouchine D, Petryszak R, et al. Comparison of
GENCODE and RefSeq gene annotation and the impact of reference geneset on variant effect prediction. BMC
Genomics. 2015;16 (Suppl 8):S2. DOI: 10.1186/1471-2164-16-S8-S2.

7.2 Microarray data

Scientists are using DNA microarrays to quantify gene expression levels on a large scale or to genotype multiple
regions of a genome.

Note: What is a DNA microarray?


It is a collection of microscopic DNA spots attached to a solid surface. Each spot contains multiple identical DNA sequences (known as probes or oligos) and represents a gene or other DNA element; the probes are used to hybridize a cDNA or cRNA sample (called the target) under high-stringency conditions. Probe–target hybridization is measured by detecting targets labeled with a radioactive or fluorescent molecular marker.

7.2.1 Expression arrays

Microarrays are useful in a wide variety of studies with a wide variety of objectives. In this section we will look at
expression microarrays.

Microarray Explorer

Action: to collaboratively produce and explore complete microarray reports, from quality control checks through dose response analysis to differential expression analysis.
The application is based on the following Genestack applications, which you can also use separately:

Microarrays normalisation

When investigating differential gene expression using microarrays, it is often the case that the expression levels of
genes that should not change given different conditions (e.g. housekeeping genes) report an expression ratio other
than 1. This can be caused by a variety of reasons, for instance: variation caused by differential labelling efficiency of
the two fluorescent dyes used or different amounts of starting mRNA. You can read more about this in this document.


Normalisation is a process that eliminates such variations in order to allow users to observe the actual biological
differences in gene expression levels. On Genestack, we have four different Microarray Normalisation applications -
one for each of the four commonly used chips: Affymetrix, Agilent, L1000 and GenePix.

Affymetrix microarrays normalisation

Action: to perform normalisation of Affymetrix microarray data.


To normalize Affymetrix microarrays, the application uses the Robust Multi-array Average (RMA) method. First, the raw intensity values are background corrected, log2 transformed and then quantile normalized. Next, a linear model is fitted to the normalized data to obtain an expression measure for each probe set on each array. For more on RMA, see this paper.
The normalised data can be assessed with the Microarray quality control application, enabling you to detect potential outliers and, if necessary, remove them from the downstream analysis. Good-quality data can be further processed with the Dose response analysis or Test differential expression for microarrays applications.
The application is based on the affy R package.
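As a rough illustration of the quantile step of RMA (a simplified sketch, not the affy implementation; the intensity matrix is invented and background correction and probe-set summarisation are omitted):

```python
import numpy as np

def quantile_normalize(x):
    """Quantile-normalise a (probes x arrays) matrix: every array is
    forced onto the same empirical intensity distribution."""
    ranks = np.argsort(np.argsort(x, axis=0), axis=0)   # rank of each value within its array
    mean_quantiles = np.sort(x, axis=0).mean(axis=1)    # average distribution across arrays
    return mean_quantiles[ranks]

raw = np.array([[5.0, 4.0, 3.0],
                [2.0, 1.0, 4.0],
                [3.0, 4.0, 6.0],
                [4.0, 2.0, 8.0]])
norm = quantile_normalize(np.log2(raw))   # RMA works on log2 intensities
```

After this step every column (array) has exactly the same set of sorted values, so between-array intensity differences are removed while the within-array ordering of probes is preserved.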

Agilent microarrays normalisation

Action: to perform normalisation of Agilent microarray assays.


For 1-channel Agilent microarrays, various procedures for background correction (e.g. “subtract”, “half”, “minimum”,
“normexp”), and between-array normalisation (e.g. “quantile”, “scale”), can be applied.
For 2-channel Agilent microarrays, procedures for within-array normalisation (e.g. “loess”, “median”) can also be
applied.

Note: What is the difference between 1-channel and 2-channel microarrays?


Two-channel (or two-color) microarrays are typically hybridized with cDNA prepared from two samples (or two experimental conditions) that the scientists want to compare, e.g. diseased tissue vs. healthy tissue. The two samples are labeled with different fluorophores, say Cy5 and Cy3 dyes, and will emit signals with different intensities. The relative intensities of each fluorophore may then be used in ratio-based analysis to identify up-regulated and down-regulated genes.
In single-channel arrays, also called one-color microarrays, each experimental condition must be applied to a separate chip. They give an estimate of the absolute levels of gene expression, and only a single dye is used.

The normalised data can be assessed with the Microarray quality control application, enabling you to detect potential outliers and, if necessary, remove them from the downstream analysis. Good-quality data can be further processed with the Dose response analysis or Test differential expression for microarrays applications.
The application is based on the limma R package.

GenePix microarrays normalisation

Action: to perform normalisation of GenePix microarray assays.


For GenePix microarrays, quantile between-array normalisation is performed and various procedures for background
correction (e.g. “subtract”, “half”, “minimum”, “normexp”) can be applied.


The normalised data can be assessed with the Microarray quality control application, enabling you to detect potential outliers and, if necessary, remove them from the downstream analysis. Good-quality data can be further processed with the Dose response analysis or Test differential expression for microarrays applications.

L1000 microarrays normalisation

Action: to perform normalisation of L1000 microarray assays.


To normalize L1000 microarrays, the application uses the “quantile” method for between-array normalisation.
The normalised data can be assessed with the Microarray quality control application, enabling you to detect potential outliers and, if necessary, remove them from the downstream analysis. Good-quality data can be further processed with the Dose response analysis or Test differential expression for microarrays applications.

Microarray quality control

As in any statistical analysis, the quality of the data must be checked. The goal of this step is to determine if the whole
process has worked well enough so that the data can be considered reliable.
Action: to perform quality assessment of normalised microarrays and detect potential outliers.
The application generates a report containing quality metrics based on between-array comparisons, array intensity, variance–mean dependence and individual array quality. Some metrics have their own labels, which helps you understand according to which metric(s) a particular microarray is considered an outlier.

QC metrics are computed for both the unnormalised and normalised microarrays and include:
1. Between array comparison metrics.
• Principal Component Analysis (PCA) is a dimension reduction and visualisation technique that is used to project
the multivariate data vector of each array into a two-dimensional plot, such that the spatial arrangement of the
points in the plot reflects the overall data (dis)similarity between the arrays.
For example, in the picture below, PCA identifies variance in datasets, which can come from real differences
between samples, or, as in our case, from the failed “CD4 T lymphocytes, blood draw (1)” array.


• Distances between arrays. The application computes the distances between arrays. The distance between two
arrays is computed as the mean absolute difference (L1-distance) between the data of the arrays (using the data
from all probes without filtering).
The array will be detected as an outlier if for this array the sum of the distances to all other arrays is extremely
large.
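The distance metric can be sketched as follows (illustrative Python, not the arrayQualityMetrics implementation; the intensities and the outlier shift are simulated):

```python
import numpy as np

def array_distances(x):
    """x: (probes x arrays) matrix. Returns the (arrays x arrays)
    matrix of mean absolute differences (L1-distances), using the
    data from all probes without filtering."""
    n = x.shape[1]
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            d[i, j] = np.mean(np.abs(x[:, i] - x[:, j]))
    return d

rng = np.random.default_rng(0)
arrays = rng.normal(8.0, 0.5, size=(1000, 5))  # 1000 probes, 5 comparable arrays
arrays[:, 4] += 3.0                            # simulate one intensity-shifted array

d = array_distances(arrays)
sums = d.sum(axis=1)                # an array is suspicious if this sum is extreme
outlier = int(np.argmax(sums))      # here: the shifted array, index 4
```

The shifted array sits roughly 3 log-units away from every other array, so the sum of its distances dominates and it is flagged.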


2. Array intensity statistics.


• Boxplots of signal intensities represent the signal intensity distributions of the microarrays. Typically, we expect to see boxes similar in position and width. If they differ, it may indicate an experimental problem.

• Density plots of signal intensities show the density distributions for the microarrays. In a typical experiment, we expect these distributions to have similar shapes and ranges. Differences in density distributions can indicate quality-related problems.


3. Variance mean dependence metric.


• The “Standard deviation versus mean rank” plot is a density plot of the standard deviation of the intensities across arrays on the y-axis versus the rank of their mean on the x-axis. The red dots, connected by lines, show the running median of the standard deviation.
After the normalisation procedure, we typically expect the red line to be almost horizontal. A hump on the right-hand side of the line may indicate saturation of the intensities.


4. Individual array quality.


• MA plots allow pairwise comparison of the log-intensity of each array to a “pseudo”-array (which consists of the median across arrays) and identification of intensity-dependent biases. The y-axis of the plot shows the log-ratio intensity of one array to the median array, called “M”, while the x-axis shows the average log-intensity of both arrays, called “A”. Typically, probe levels are not likely to differ much, so we expect an MA plot centred on the M = 0 line from low to high intensities.
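The M and A values behind such a plot can be computed directly (a sketch with invented log2 intensities):

```python
import numpy as np

# Each array is compared to a "pseudo"-array built from the
# per-probe median across all arrays.
log_intensities = np.array([[8.0, 8.2, 7.9],
                            [5.1, 5.0, 5.3],
                            [9.4, 9.6, 9.5]])      # probes x arrays, log2 scale
pseudo = np.median(log_intensities, axis=1)        # median array, one value per probe

array0 = log_intensities[:, 0]
M = array0 - pseudo        # log-ratio of this array to the median array
A = (array0 + pseudo) / 2  # average log-intensity of the two
```

For a well-behaved array, the M values scatter narrowly around 0 across the whole range of A; a curved or shifted cloud points to an intensity-dependent bias.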


Additional Affymetrix-specific metrics are also computed for Affymetrix microarrays.


Overall, if you click on “Outlier detection overview”, the application will detect apparent outlier arrays and suggest that you either remove them and re-normalise your data, or continue with differential expression or dose response analyses.

The application is based on the ArrayQualityMetrics R package.


Differential gene expression for microarrays

Expression microarrays can simultaneously measure the expression level of thousands of genes between sample
groups. For example, to understand the effect of a drug we may ask which genes are up-regulated or down-regulated
between treatment and control groups, i.e. to perform differential expression analysis.
Once your microarray samples have been normalised, you can use them as inputs for differential expression analysis.

Test differential expression for microarrays

Action: to perform differential expression analysis between groups of microarray samples.


The application requires normalized microarrays to calculate differential expression statistics (such as log-expression, log-fold change, p-value and FDR) and a microarray annotation to map probe identifiers to gene symbols.
Let’s look at the options:
1. Group samples by an experimental factor or condition that was specified in the metainfo of the samples. For example, if you have six samples — three treated with compound X and three untreated — the grouping factor will be the treatment procedure. If no grouping factor is available here, you should open your microarray assays in the Metainfo editor and specify a grouping factor in a new column.
2. Control group option. If you specify a control group, each group will be compared separately to that control
group. If you do not specify a control group, each group will be compared against the average of all the other
groups.
If you specify one or more confounding factors, you can identify differentially expressed genes between the tested groups of samples while taking into account potential confounders, such as sex, age or laboratory. In this case, the detected gene expression changes are attributed only to the factor of interest, for example treatment, while possible effects of confounding factors are excluded. Specify confounding factors by choosing the corresponding metainfo keys.
Currently, only single-factor comparisons are supported. More complex experimental designs (batch effects, multi-
factor analysis, etc.) will be supported in later versions of the application.
When the analysis is done, you can explore the results in Expression navigator.

Expression navigator

Action: to view and filter the results of differential gene expression analysis.


The Expression Navigator page contains four sections:


1. Groups Information section. It is a summary of the groups available for comparison. Size refers to the number
of samples used to generate each group.
2. The Top Differentially Expressed Genes section allows you to choose which groups to compare and how to
filter and sort identified differentially expressed genes.


You can filter differentially expressed genes by maximum acceptable false discovery rate (FDR), up or down regula-
tion, minimum log fold change (LogFC), and minimum log counts per million (LogCPM).

Let’s look through these statistics:


• log-fold change: the fold-change in expression of a gene between two groups A and B is the average expression
of the gene in group A divided by the average expression of the gene in group B. The log-fold change is obtained
by taking the logarithm of the fold-change in base 2.
• log-expression: log-transformed and normalised measure of gene expression.


• p-value. The application also computes a p-value for each gene. A low p-value (typically < 0.05) is viewed as evidence that the null hypothesis can be rejected (i.e. the gene is differentially expressed). However, because we perform multiple testing, the value that should be examined to safely assess significance is the false discovery rate.
• False discovery rate. The FDR is a corrected version of the p-value that accounts for multiple testing. Typically, an FDR < 0.05 is good evidence that the gene is differentially expressed.
Moreover, you can sort the differentially expressed genes by these statistics by clicking the small arrows next to the name of the metric in the table.
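As a sketch of how these statistics are computed (plain Python with invented expression values; limma's moderated statistics are more sophisticated than this):

```python
import numpy as np

def log_fold_change(group_a, group_b):
    """log2 of the ratio of average expression between two groups."""
    return np.log2(np.mean(group_a) / np.mean(group_b))

def benjamini_hochberg(pvals):
    """Adjust p-values for multiple testing (false discovery rate)."""
    p = np.asarray(pvals, dtype=float)
    n = len(p)
    order = np.argsort(p)
    adj = np.empty(n)
    running_min = 1.0
    for rank in range(n - 1, -1, -1):       # walk from the largest p-value down
        i = order[rank]
        running_min = min(running_min, p[i] * n / (rank + 1))
        adj[i] = running_min
    return adj

lfc = log_fold_change([8.0, 9.0, 7.0], [2.0, 2.5, 1.5])   # log2(8/2) = 2.0
fdr = benjamini_hochberg([0.001, 0.01, 0.04, 0.5])
```

The adjusted values are always at least as large as the raw p-values, which is why a gene can have a small p-value but still fail the FDR threshold.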

The buttons at the bottom of the section allow you to refresh the list based on your filtering criteria or clear your
selection.
3. The top right section contains a boxplot of expression levels. Each colour corresponds to a gene. Each boxplot
corresponds to the distribution of a gene’s expression levels in a group, and coloured circles represent the
expression value of a specific gene in a specific sample.

4. The bottom-right section contains a search box that allows you to look for specific genes of interest. You can look up genes by gene symbol, with autocomplete. You can search for any gene (not only those that are visible
with the current filters).

Differential expression similarity search

Action: to find existing differential expression results with similar differential expression patterns. This is useful, for
example, to infer the effects of new experiments (e.g. investigations of compound toxicity or drug effectiveness) using
existing knowledge of other experiments.
The application accepts two kinds of gene signatures as input:
• A gene list. This can be entered manually or by supplying an imported Gene List file in .grp, .tsv, .txt or .xls format.
• A gene expression signature. This can be supplied by selecting genes from a Differential Expression File
(differential expression analysis results) in the platform or by supplying an imported Gene Expression Signature
file in .tsv, .txt or .xls formats. Unlike gene lists, gene expression signatures store quantitative data about
expression changes specific to a phenotype such as cell type or disease and represented by LogFC values. Gene
expression signatures can be used for more accurate similarity comparisons.
If you import a source file, note that genes in imported files can be represented by any standard identifier, such as Ensembl or Entrez, or by gene names. Make sure that the organism is specified in the metadata so that the right species dictionary is applied, because the same gene names can be used for different species. Annotations can also be imported together with a Gene List or Gene Expression Signature. Besides, a Gene List or Gene Expression Signature may include other annotation columns (in the .tsv/.txt/.xls files) on top of the columns containing the gene names and LogFC values.
The selected input gene signature will be compared against other imported gene signatures and sets of differentially expressed genes derived in the platform (a gene is categorised as differentially expressed if its FDR from the differential expression analysis is lower than the user-specified Max FDR input).


Depending on the input, different similarity metrics are calculated:


• If the input is a gene list, Fisher’s hypergeometric test is performed comparing the input list against each imported gene signature and against each set of differentially expressed genes. The p-values derived from these tests are then adjusted using the Benjamini–Hochberg correction.
• If the input is a gene expression signature, more quantitative similarity metrics are computed by comparing
the Log FCs: equivalence test (also known as Two One-Sided T-tests – TOST) and Pearson’s correlation.
• Furthermore, if compound metainfo is available for both the input and target files, similarity based on the Tanimoto coefficient is computed. It measures how many structural features two chemical structures have in common based on their chemical fingerprints. A Tanimoto score of 1.0 indicates that the two structures are very similar. However, because the fingerprints are calculated with a chemical structure path depth of eight, many structures will have similar fingerprints and very high similarity scores even though they might not be very structurally similar.
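The gene-list comparison can be sketched with a hypergeometric upper-tail test (illustrative counts; the platform's exact implementation may differ):

```python
from math import comb

def overlap_pvalue(universe, n_query, n_target, k_overlap):
    """One-sided probability of observing at least k_overlap shared
    genes between a query list of n_query genes and a target list of
    n_target genes, both drawn from `universe` genes (the upper tail
    of the hypergeometric distribution, as in Fisher's exact test)."""
    total = comb(universe, n_target)
    p = 0.0
    for k in range(k_overlap, min(n_query, n_target) + 1):
        p += comb(n_query, k) * comb(universe - n_query, n_target - k) / total
    return p

# Expected overlap by chance is 50 * 60 / 1000 = 3 genes, so seeing
# 10 or more shared genes is highly unlikely under the null.
p = overlap_pvalue(universe=1000, n_query=50, n_target=60, k_overlap=10)
```

With many target signatures tested, the resulting p-values would then be adjusted (e.g. with the Benjamini–Hochberg procedure mentioned above) before filtering.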
Furthermore, the application performs compound search by similarity of chemical structures. If a ChEBI structure of a compound is available in the metainfo of both the input and target files, the Tanimoto coefficient between the structures is computed. This coefficient shows how many structural features two chemical structures have in common based on their chemical fingerprints. A fingerprint is a conversion of the 2D chemical structure that indicates the presence or absence of particular structural features. A Tanimoto score ranges from 0 to 1, where a value of 1 indicates that the two structures are very similar.
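The Tanimoto coefficient itself is simple to compute on binary fingerprints (the bit sets below are invented for illustration):

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient |A ∩ B| / |A ∪ B| for two sets of
    'on' fingerprint bits."""
    a, b = set(fp_a), set(fp_b)
    return len(a & b) / len(a | b)

# Two hypothetical fingerprints sharing 4 of 6 distinct 'on' bits:
score = tanimoto({1, 4, 7, 9, 12}, {1, 4, 7, 12, 20})   # 4/6 ≈ 0.667
```

Identical fingerprints give a score of 1.0 and disjoint fingerprints give 0.0; real implementations derive the bits from the chemical structure itself.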
Use the Get genes from file button to change the input file (Gene List, Gene Expression Signature or Differential Expression Statistics), or enter gene names or IDs manually in the search box. Then, for a differential expression file, choose the genes that you want to search for; for a Gene List or Gene Expression Signature, the entire list will be used.
The results are represented by an interactive table including the following information:
• File links to either the imported gene signature or the dataset that the gene signature is derived from. In the latter
case, it also includes the derived differential expression report and the dose response report (if available);
• Compound shows the compound name and its ChEBI chemical structure (if available);
• Assays column shows how many case and control samples participated in the differential expression analysis;
• Group provides the name of the differential expression analysis contrast (e.g. dose 30µM vs dose 0µM). Hover
over to learn more about each group;
• Genes shows the number of genes of the input signature that overlap with the found gene signature. Click the number to explore a sortable table with statistics for each found gene, such as LogFC, Log Expr and FDR;
• FDR (TOST or Fisher) columns include adjusted p-values representing the significance of similarity between the contrasts being tested;
• Corr columns reflect the correlation of the signatures (if log-fold change values are available);
• ChemSim shows the chemical similarity score comparing the chemical structures of the target and query compounds, if applicable (0 means no similarity, and 1 means identical chemical structure);
• A bar chart represents the log-fold change values for the found signatures, where blue indicates up-regulation and red indicates down-regulation.
The results can then be sorted and filtered by Max FDR (maximum FDR), Min Abs Corr (minimum absolute corre-
lation), Min ChemSim (minimum chemical similarity), and Experiment name or compound.
You can export the results using the Download as .tsv link at the bottom of the page.

Compound dose response analysis


Dose response analyser

Action: to find compound dosages at which genes start to show significant expression changes, i.e. the Benchmark Doses (BMDs). Genes that are differentially expressed between doses are then subjected to pathway enrichment analysis; the application reports the affected pathways and the average BMD of the genes involved in each pathway.
This application takes normalised microarray data as input and performs dose response analysis. It requires a microarray annotation file to map probe identifiers to gene symbols (you can upload your own or use a publicly available one). It also requires a pathway annotation file to perform pathway enrichment analysis. Pathway files from WikiPathways are pre-loaded in the system.
The first step of the analysis is to identify genes that are significantly differentially expressed across doses. Once these are detected, multiple dose response models are fitted to each significant gene, and statistics about the fits are recorded.
The following options can be configured in the application:
1. The FDR filter for differentially expressed genes specifies the False Discovery Rate threshold above which
genes should be discarded from the differential expression analysis (default: FDR < 0.1)
2. Metainfo key for dose value. This option specifies the metainfo key storing the dose levels corresponding to
each sample, as a numeric value. If no such attribute is present in your data, you need to open your microarray
assays in the Metainfo editor and add it there.
For each gene, the application fits different regression models (linear, quadratic and Emax) to describe the expression profiles of the significant genes as a function of the dose. An optimal model is then suggested based on the Akaike Information Criterion (AIC), which reflects the compromise between the complexity of a model and its goodness of fit. The model with the lowest AIC is considered “the best”; however, a difference in AIC of less than 2 is not significant, so models within 2 AIC units of the best one should also be considered good.
Mathematically, the AIC is defined as:
AIC = 2(k - ln(L)),
where k is the number of parameters of the model and ln(L) is the log-likelihood of the model.
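Model selection by AIC can be sketched as follows (the parameter counts and log-likelihood values below are invented for illustration):

```python
def aic(k, log_likelihood):
    """AIC = 2(k - ln(L)): penalises parameters, rewards likelihood."""
    return 2 * (k - log_likelihood)

models = {
    "linear":    aic(2, -10.0),   # k parameters, ln(L)
    "quadratic": aic(3, -9.5),
    "Emax":      aic(3, -6.0),
}
best = min(models, key=models.get)   # smallest AIC wins
# Models within 2 AIC units of the best remain plausible alternatives:
plausible = [m for m, a in models.items() if a - models[best] < 2]
```

Here the Emax model (AIC = 18) beats linear (24) and quadratic (25) decisively, so it is the only plausible model; had two models been within 2 units of each other, both would be kept.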
Besides, for each gene, based on the optimal model, the benchmark dose (BMD) is computed.
Following the method described in the Benchmark Dose Software (BMDS) user manual, the benchmark dose can be estimated from:
|m(d) - m(0)| = 1.349σ0,
where m(d) is the expected gene expression at dose d, and σ0 is the standard deviation of the response at dose 0, which we approximate by the sample standard deviation of the model residuals.
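For a linear model m(d) = a + b·d, this equation can be solved for the BMD directly (a worked sketch with invented coefficients):

```python
# |m(BMD) - m(0)| = |b| * BMD = 1.349 * sigma0
#   =>  BMD = 1.349 * sigma0 / |b|

b = 0.8          # fitted slope: expression change per unit dose (invented)
sigma0 = 0.5     # sd of the response at dose 0, approximated by residual sd
bmd = 1.349 * sigma0 / abs(b)
```

Intuitively, a steeper dose response (larger |b|) or a quieter baseline (smaller σ0) pushes the benchmark dose lower; for the quadratic and Emax models the same equation is solved numerically rather than in closed form.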
Moreover, for each gene, a differential expression p-value is computed using a family-wise F-test across all pairs of contrasts of the form “dose[n+1] - dose[n]” for n from 0 to <number of doses - 1>. Multiple testing correction is then performed, and genes that pass the user-defined FDR threshold are categorised as differentially expressed.
Differentially expressed genes are then processed by gene set enrichment analysis. For each enriched pathway, the
average and standard deviation of the BMDs of its genes are computed.
The application is based on the limma R package.

Dose response analysis viewer

Action: to display dose response curves and benchmark doses for differentially expressed genes and enriched path-
ways. Note that if no gene passed the FDR threshold specified in the dose response analysis application, the application
will report the 1,000 genes with the smallest unadjusted p-values.


Various regression models (linear, quadratic and Emax) are fitted for each identified differentially expressed gene to
describe its expression profile as a function of the dose. These results are presented in an interactive table.

The table includes information about:


• PROBE ID – chip-specific identifier of the microarray probe;
• GENE – the gene symbol corresponding to that probe (according to the microarray annotation file). Clicking on
the gene name will show you a list of associated gene ontology (GO) terms;


• BMD – the benchmark dose, i.e. the dose above which the corresponding gene shows a significant change in expression, according to the best-fitting of the 3 models used. Let m(d) be the expected gene expression at dose d. The BMD then satisfies the equation |m(BMD) - m(0)| = 1.349σ0, where σ0 is the standard deviation of the response at dose 0, which we approximate by the sample standard deviation of the model residuals.
• BEST MODEL – the model with the optimal Akaike Information Criterion (AIC) among the 3 models that were
fitted for the gene; the AIC rewards models with small residuals and penalizes models with many coefficients,
to avoid overfitting;
• MEAN EXPR – average expression of the gene across all doses;
• T – the moderated t-statistic computed by limma to test for differential expression of the gene;
• P – unadjusted p-value testing for differential expression of the gene across doses;
• FDR – false discovery rate (p-value, adjusted for multiple testing);
• B – B statistic computed by limma to test for differential expression of the gene. Mathematically, this can be
interpreted as the log-odds that the gene is differentially expressed.
Here are examples of dose response curves as they are displayed in the application:


In the “Pathways” tab, you can see a list of significantly enriched pathways, based on the detected differentially
expressed genes and the pathway annotation file supplied to the analysis application.


The table includes:


• PATHWAY – pathway name, e.g. “Iron metabolism in placenta”;
• SIZE – pathway size, i.e. how many genes are involved in the given pathway;
• DE GENES – how many pathway genes are found to be differentially expressed in our data. Clicking on a specific pathway takes you to the “Genes” tab, where you can see expression profiles and regression curves for its differentially expressed genes;
• P – p-value;
• FDR – false discovery rate value;
• BMD – the pathway BMD is computed as the average of the BMDs of the significant genes involved in this
pathway, computed with the model yielding the best AIC;
• BMD SD – BMD standard deviation.

7.2.2 Methylation arrays

DNA methylation arrays are a widely-used tool to assess genome-wide DNA methylation.

Microarray normalisation

Action: to perform normalisation of methylation microarray assays.


For methylation microarrays, normalisation can be performed with either the “subsetQuantileWithinArray” or the “quantile” method; in addition, “genomeStudio” background correction may be applied.

Further, the quality of normalised microarrays can be checked using the Microarray QC Report application to detect
and remove potential outliers. Normalised microarrays that are of good quality may then be used in Differential
methylation analysis and Differential regions methylation analysis.
The application is based on the minfi Bioconductor package.

Methylation array QC

Quality assessment of microarray data is a crucial step in the microarray analysis pipeline, as it allows us to detect and exclude low-quality assays from further analysis.
A single array with both red and green channels is used to estimate methylation for each individual sample. Then, for each CpG locus, both the methylated and unmethylated signal intensities are measured.


Currently, there are two main measures used to estimate the DNA methylation level: the Beta-value and the M-value. The Beta-value is the ratio of the methylated probe intensity to the overall intensity (the sum of the methylated and unmethylated probe intensities plus a constant) (Du P. et al., 2010). The Beta-value therefore roughly reflects the percentage of methylated sites.

The M-value is the log2 ratio of the intensities of the methylated probe versus the unmethylated probe (Du P. et al.,
2010):
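Both measures can be computed directly from the per-probe channel intensities. Below is a minimal sketch; the offsets are the conventions commonly cited for these formulas (100 for the Beta-value, 1 for the M-value) and may differ from the exact constants used by the platform.

```python
import math

def beta_value(meth, unmeth, offset=100):
    # Beta = methylated / (methylated + unmethylated + offset);
    # the offset stabilises the ratio for low-intensity probes.
    return meth / (meth + unmeth + offset)

def m_value(meth, unmeth, offset=1):
    # M = log2 ratio of methylated to unmethylated intensity.
    return math.log2((meth + offset) / (unmeth + offset))

# A heavily methylated locus: Beta close to 1, M strongly positive.
b = beta_value(9000, 500)
m = m_value(9000, 500)
```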

Action: to assess quality of methylation microarray assays.


The Methylation array QC application allows the user to export files containing the methylated and unmethylated signal values, as well as the Beta-values, the M-values and the log median intensity values.
Additionally, you can download and explore the “Copy number values” file, which contains the sum of the methylated and unmethylated signals.
The Methylation array QC application provides various types of quality control plots. Let’s explore a QC report for the Infinium 450K microarrays:
1) Log median intensity plot
The scatterplot shows the log median of the signal intensities in the methylated and unmethylated channels for each array. In general, samples of good quality cluster together, while “bad” samples tend to separate, typically with lower median intensities.
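Each point in that scatterplot is simply the pair of log2 medians of an array's two channels. A toy sketch with invented intensities shows how a dim, failed array separates from a bright, good-quality one:

```python
import math
from statistics import median

def log_median_intensity(meth, unmeth):
    # The two coordinates of one array in the QC scatterplot:
    # log2 median intensity of each channel.
    return math.log2(median(meth)), math.log2(median(unmeth))

good = log_median_intensity([8000, 9000, 8500], [7000, 7500, 7200])
bad = log_median_intensity([300, 250, 280], [260, 240, 255])
# `bad` lands far below `good` on both axes, flagging it as an outlier.
```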


2) Beta-values of the assays are represented by two plots:


• Beta density plot represents the Beta-value densities of the samples;

• Beta density bean plot also shows the per-sample distribution of Beta-values.


3) Control probes plots:


The Infinium 450K arrays include several internal control probes that help track quality at different stages of assay preparation (based on Illumina’s Infinium® HD Assay Methylation Protocol Guide):
Sample-independent controls
Several sample-independent controls allow monitoring of different steps of microarray assay preparation and include:
• Staining control strip, which estimates the efficiency of the staining step in both the red and green channels. These controls are independent of the hybridization and extension steps.


• Extension control strip, which tests the efficiency of the single-base extension of the probes that incorporates labeled nucleotides. Both the red (A and T, labeled with dinitrophenyl) and green (C and G, labeled with biotin) channels are considered.

• Hybridization control strip, which estimates the overall performance of the microarray assay.


These controls use synthetic targets that are complementary to the array probe sequences; extension of the target provides the signal. The higher the target concentration used, the higher the registered signal intensity.

• Target removal control strip, which tests whether all targets are removed after the extension step. During the extension reaction, the sequence on the array is used as a template to extend the control oligos; the probe sequences themselves, however, are not extendable. The signal is expected to be low in comparison to the signal of the hybridization controls.


Sample-dependent controls
A number of sample-dependent controls are provided to assess quality across samples.
• Bisulfite-conversion controls
To estimate DNA methylation, the 450K assay probe preparation involves bisulfite conversion of the DNA, during which all unmethylated cytosines are converted to uracils, while methylated cytosines remain unchanged.
Bisulphite conversion I control strip
This control uses the Infinium I assay chemistry. There are two types of probes in this control: bisulphite-converted and bisulphite-unconverted ones. If the bisulphite conversion was successful, the converted probes match the converted DNA and are extended. If the sample contains some unconverted DNA, the unconverted probes are extended instead.


Bisulphite conversion II control strip


This control uses the Infinium II chemistry. If the bisulphite conversion went well, an adenine base is added, generating signal in the red channel. If there is some unconverted DNA, a guanine base is incorporated instead, resulting in signal in the green channel.

• Specificity controls, which monitor potential non-specific primer extension.


Specificity I control strip is used to assess allele-specific extension for the Infinium I chemistry assays.

Specificity II control strip estimates the specificity of extension for the Infinium II assays and tests whether any nonspecific methylation signal is detected over the unmethylated background.

All the QC-plots shown on the application page may be downloaded in PDF format (see Minfi PDF Report).


Finally, based on the QC results you can identify particular samples as outliers, remove them, and re-normalise the rest of the assays together. To do so, click Sample list and select the samples that pass the QC check, then click the Remove outliers and re-normalise button.

Then, if you are happy with the quality of the re-normalised arrays, you can proceed to the next step - Differential Methylation Analysis.
The “Methylation array QC” application is based on the minfi and the shinyMethyl Bioconductor packages.

Test differential methylation

Action: to identify differential methylation at single CpG sites (differentially methylated positions, DMPs) across groups of normalised microarray assays using linear models. Currently, Illumina’s 450K and EPIC Methylation arrays are supported.
The input data for this application is an Infinium Methylation Normalization file obtained with the “Infinium Methylation Normalization” application.
The analysis includes an annotation step, in which the application determines the genomic position of each methylated locus and its location relative to various genomic features. The Differential methylation analysis application supports custom Methylation Array Annotations that you can upload with the Import Data application.
The application computes differential methylation statistics for each CpG site for the selected group compared to the average of the other groups. Alternatively, you can assess differential methylation for each group compared to a control one.
The application has the following options:
1. The “Group samples by” option groups assays for comparison automatically: the application helps you group your samples according to an experimental factor indicated in the metainfo of the microarray assays, such as disease, tissue or treatment. (default: None)
2. The “Control group” option lets you treat one of the created groups as a control. In this case the application performs differential methylation analysis for each CpG site in each group against the control one. (default: No control group)


If you specify one or more confounding factors, you can identify differentially methylated sites between the tested groups of samples while taking into account potential confounders, such as sex, age or laboratory. In this case the detected methylation changes are attributed to the factor of interest, for example treatment, while possible effects of the confounding factors are excluded. As confounding factors must be chosen from metainfo keys common to all samples, remember to specify the relevant information for every sample.
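Conceptually, adjusting for a confounder means including it as a covariate in the per-site linear model, so that the group effect is estimated with the confounder held fixed. The sketch below uses a plain ordinary-least-squares fit on invented data purely to illustrate the idea; the application itself relies on limma's moderated statistics rather than this naive fit.

```python
import numpy as np

def group_effect_adjusted(beta, group, confounder):
    """Estimate the group effect on Beta values at one CpG site,
    adjusting for a confounding covariate.

    Design: beta ~ intercept + group + confounder; the returned
    coefficient is the group effect with the confounder held fixed.
    """
    X = np.column_stack([np.ones(len(beta)), group, confounder])
    coef, *_ = np.linalg.lstsq(X, beta, rcond=None)
    return coef[1]

# Invented data: treated samples (group=1) are also older, so age
# is a potential confounder of the treatment effect.
group = np.array([0, 0, 0, 1, 1, 1])
age = np.array([30.0, 35.0, 40.0, 50.0, 55.0, 60.0])
beta = np.array([0.20, 0.22, 0.24, 0.45, 0.47, 0.49])
effect = group_effect_adjusted(beta, group, age)
```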
Explore the output with interactive Methylation Navigator.
The application is based on the minfi and limma Bioconductor packages.

Test differential regions methylation

Action: to determine and analyse contiguous regions that are differentially methylated across groups of normalised microarray assays. Currently, Illumina’s 450K and EPIC Methylation arrays are supported.
As input the application takes an “Infinium Methylation Normalization” file with normalised microarray assays and returns a Differential Methylation Statistics file that you can further explore with the Methylation Navigator. The application supports custom methylation chip annotations that you can upload with the Import Data application.
The application has the following options:
1. The “Group samples by” option automatically groups assays according to an experimental factor indicated in the metainfo of the selected microarray assays, such as disease, tissue or treatment. (default: None)
2. The “Control group” option lets you treat one of the created groups as a control. In this case the application performs differential methylation analysis for each region in each group against the control one. (default: No control group)

If you specify one or more confounding factors, you can identify differentially methylated regions between the tested groups of samples while taking into account potential confounders, such as sex, age or laboratory. In this case the detected methylation changes are attributed to the factor of interest, for example treatment, while possible effects of the confounding factors are excluded. As confounding factors must be chosen from metainfo keys common to all samples, remember to specify the relevant information for every sample.
The Test Differential Regions Methylation application is based on the minfi and DMRcate packages.
Explore the output with interactive Methylation Navigator.


Methylation navigator for sites

Action: to view, sort and filter the results of differentially methylated position (DMP) analysis.

The Methylation Navigator page contains four sections:


1. The Groups Information section summarises the information on the created groups of samples to be tested.
2. The Top Differentially Methylated Sites table lists all the detected sites that are differentially methylated in the
selected group compared to either the average of the other groups or a control group (if it is set).


For each DMP (differentially methylated position) or DMR (differentially methylated region), its Delta Beta, Average Beta, P-value and FDR are shown.
Click a probe ID to get more information about the probe:


You can reduce the list of DMPs by filtering the data in the table based on the following criteria:
• Max FDR (maximum acceptable false discovery rate) — only shows sites with FDR below the set threshold.
• Methylation All/ Down/ Up — to show all sites or just those that are hypo- or hypermethylated.
• Min Delta Beta — delta Beta represents the difference between the Beta values in the groups being compared;
this filter can be used to get only sites with absolute Delta Beta value of at least this threshold.
• Min Average Beta — only shows sites with average Beta value of at least this threshold.
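The four filters act as a simple conjunction over the table rows. The sketch below mimics that logic on a few invented probe records (the field names are hypothetical and do not correspond to the application's export format):

```python
def filter_dmps(sites, max_fdr=0.05, direction="all",
                min_delta_beta=0.2, min_avg_beta=0.0):
    """Keep only sites passing Max FDR, Methylation direction
    (all/up/down), Min Delta Beta and Min Average Beta."""
    kept = []
    for s in sites:
        if s["fdr"] > max_fdr:
            continue  # fails Max FDR
        if direction == "up" and s["delta_beta"] <= 0:
            continue  # not hypermethylated
        if direction == "down" and s["delta_beta"] >= 0:
            continue  # not hypomethylated
        if abs(s["delta_beta"]) < min_delta_beta:
            continue  # change too small
        if s["avg_beta"] < min_avg_beta:
            continue  # fails Min Average Beta
        kept.append(s["probe"])
    return kept

sites = [
    {"probe": "cg00001", "fdr": 0.01, "delta_beta": 0.35, "avg_beta": 0.6},
    {"probe": "cg00002", "fdr": 0.20, "delta_beta": 0.40, "avg_beta": 0.5},
    {"probe": "cg00003", "fdr": 0.02, "delta_beta": -0.30, "avg_beta": 0.4},
    {"probe": "cg00004", "fdr": 0.03, "delta_beta": 0.05, "avg_beta": 0.7},
]
hyper = filter_dmps(sites, direction="up")  # only cg00001 passes
```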

Sort the list of probes by clicking the arrows next to the name of the statistical metrics in the table headers.


3. A boxplot of methylation levels


Each color corresponds to an individual probe you selected; each circle represents an assay belonging to the tested group. Each boxplot shows the distribution of methylation in a given group. The y-axis shows Beta values, while the x-axis shows probe IDs.


4. The bottom-right section contains a search box that lets you explore the results for a particular probe. Start typing a probe ID and select the probe of interest in the drop-down list that appears.


You can further export either the complete table of differential methylation results for all the groups or the list of values for a specific comparison, both in TSV format. See the Export Data (for all comparisons, as .tsv) and Download filtered data for current comparison as .tsv options, respectively.

Methylation navigator for regions

Action: to view, sort and filter the results of differentially methylated region (DMR) analysis.


The Methylation Navigator page contains the following sections:


1. The Groups Information section summarises the information on the created groups of samples to be tested.


2. The Top Differentially Methylated Regions table shows all the detected regions that are differentially methylated
in the selected group compared to either the average of the other groups or a control group (if it is set).

You can further reduce the list of identified DMRs by excluding regions that do not meet the set filtering criteria. The following filters can be applied:
• Max FDR (maximum acceptable Stouffer-transformed false discovery rate) — the FDR reflects the statistical certainty

that the given region is differentially methylated. This filter only shows regions with FDR values below the set threshold. Learn more about the Stouffer test from the paper by Kim S.C. (2013).
• Methylation (Down/All/Up) — shows all regions or only hypo- or hypermethylated ones.
• Min BetaFC (minimum mean Beta fold change within the region) — every probe within a DNA region has its own Beta value describing the relative methylation at that position (B1, B2 etc.), and BetaFC is the mean Beta fold change across the region; apply the filter to show only regions with an absolute BetaFC of at least this threshold.
• Min significant CPG sites count — minimum number of CpG sites inside the genomic region.
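Stouffer's method combines the per-site p-values within a region into one region-level value by averaging the corresponding z-scores. A generic, unweighted sketch using only the standard library is shown below; DMRcate's actual implementation weights and transforms the per-site statistics differently.

```python
from statistics import NormalDist

def stouffer(p_values):
    """Combine per-site p-values into a single region-level p-value
    with the (unweighted) Stouffer Z method."""
    nd = NormalDist()
    z = [nd.inv_cdf(1.0 - p) for p in p_values]  # p-value -> z-score
    z_combined = sum(z) / len(z) ** 0.5
    return 1.0 - nd.cdf(z_combined)  # back to a p-value

# Three modest per-site p-values give much stronger combined evidence.
p = stouffer([0.04, 0.03, 0.05])
```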

You can also sort the list of identified DMRs by clicking the arrows next to the name of the statistical metrics in the
table.

Finally, you can export both the complete table of top differentially methylated regions for all the groups (Export Data (for all comparisons, as .tsv)) and the list of regions with associated statistics for a single comparison in TSV format (Download filtered data for current comparison as .tsv).



CHAPTER 8

FAQ

8.1 Where do I find data that was shared with me?

If the files were linked into the group folder, you will find them there. You can access the group folder from the file
browser, under the “Shared with me” section. Otherwise, the files can be found via search, if you know their name or
accession.

8.2 How do I reuse a data flow?

Open a data flow you would like to run in the “Run Dataflow” application. On the application page you can set input
files and additional files (e.g. reference genome) that are required for analysis.

8.3 How can I initialize files?

Generally speaking, there are several ways to initialize files in Genestack. Firstly, you can use the File Initializer application, which can accept multiple files: select the files of interest, right-click on them and go to the Manage section. Secondly, you can start initialization when running a data flow: on the “Data Flow Runner” page, once all inputs are set, all files are created and you are happy with the parameters of each application, you will be offered to start computations right away or to delay them. Finally, you can right-click a file name anywhere in the platform (in Data Flow Runner, File Manager, Task Manager, any bioinformatic application, etc.) and select the “Start initialization” option in the context menu.

8.4 How do I create a data flow?

To create a data flow, select the data you wish to analyse and choose the first application you wish to use in your analysis. On the application page, add the rest of the desired steps using the “add step” button. Once you are done, click on the name of the file (or files) at the top of the page, go to Manage, and click on Create New Data Flow. Your new data flow can be found in the Created Files folder.


If you do not want to create a data flow from scratch, but rather re-use the same analysis pipeline used to create a file,
click on the name of that file, go to Manage, and select Create New Data Flow.
Selecting File Provenance instead of Create New Data Flow will show you the pipeline (in the form of a data flow)
that was used to create this file. Read more about data flows in this tutorial.

8.5 What is the difference between BWA and Bowtie2?

The biggest differences between the two aligners are:


• the way of accepting or rejecting an alignment: BWA counts the number of mismatches between the read and the corresponding genomic position, while Bowtie2 uses a quality threshold based on the probability of the occurrence of the read sequence given an alignment location;
• colorspace support: BWA does not support colorspace data, while Bowtie2 is able to align such files.

8.6 How does Genestack process paired-end reads?

There are three types of raw sequencing reads that our platform supports:
• single-end (1 file locally, 1 file in Genestack);
• paired-end (2 files locally, 1 file in Genestack);
• paired-with-unpaired (3 or 4 files locally, 2 files in Genestack).
During import, Genestack recognises them and imports them in their respective format-free form. If the platform cannot recognise the files automatically, you can match the files up manually.

8.7 What is the difference between a Dataset and a folder?

Datasets are a special kind of folder that can only contain assays, i.e. “raw” experimental data.

8.8 What is the difference between masked and unmasked reference genomes?

In general, when a genome is “masked”, all repeats and low-complexity regions of the reference genome (detected by the RepeatMasker tool) are hidden away and replaced with “N”s, so that reads will not be aligned to them.
We do not recommend using a masked genome, as it always results in a loss of information. Masking can never
be 100% accurate, and can lead to an increase in the number of falsely mapped reads. If you would like to perform
filtering, it is better to do it after the mapping step.
In soft-masked genomes, repeated and low complexity regions are still present, but they have been replaced with
lowercased versions of their nucleic base.
Unmasked genomes contain all repeats and low complexity regions without any changes.


8.9 What is the difference between Data Flow Runner and Data Flow
Editor?

Data Flow Editor is used to create data flow templates, e.g. by selecting source files.
When you want to use a data flow to run your analysis, click the “Run Data Flow” button on the Data Flow Editor page, which will take you to the Data Flow Runner. There you can not only edit source files and parameters, but also start the initialization of your files.



CHAPTER 9

References

• Anders S., et al. HTSeq — A Python framework to work with high-throughput sequencing data. bioRxiv
preprint. 2014
• Aronesty E. ea-utils: Command-line tools for processing biological sequencing data. Expression Analysis,
Durham, NC. 2011
• Bao R., et al. Review of Current Methods, Applications, and Data Management for the Bioinformatics Analysis
of Whole Exome Sequencing. Cancer Informatics. 2014; 13:67–82
• Bray N.L., et al. Near-optimal probabilistic RNA-seq quantification. Nature Biotechnology. 2016; 34:525–527
• Brennecke P., et al. Accounting for technical noise in single-cell RNA-seq experiments. Nature Methods. 2013;
10(11): 1093–1095
• Caporaso J.G. QIIME allows analysis of high-throughput community sequencing data. Nature Methods. 2010;
7(5): 335-336
• Cell Ontology
• Cellosaurus
• The NCBI Taxonomy database
• ChEBI
• Cingolani P. et al. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly. 2012; 6(2): 80-92
• Cokus S.J. Shotgun bisulphite sequencing of the Arabidopsis genome reveals DNA methylation patterning.
Nature. 2008; 452(7184):215–219
• CRAN
• Davis M.P., et al. Kraken: a set of tools for quality control and analysis of high-throughput sequence data.
Methods. 2013; 63(1):41-9
• Dobin A., et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2012; 29(1): 15-21
• Do C.B. and Batzoglou S. What is the expectation maximization algorithm? Nature Biotechnology. 2008;
26(8):897-899


• Esposito D. Cutting Edge - Programming CSS: Bundling and Minification. MSDN Magazine Blog. 2013
• Federhen S. The NCBI Taxonomy database. Nucleic Acids Res. 2012
• Frankish A., et al. Comparison of GENCODE and RefSeq gene annotation and the impact of reference geneset
on variant effect prediction. BMC Genomics. 2015; 16(8), S2
• Gautier L. et al. affy—analysis of Affymetrix GeneChip data at the probe level. Bioinformatics. 2004;
20(3):307–315
• Greengenes
• Hastings J., et al. The ChEBI reference database and ontology for biologically relevant chemistry: enhancements
for 2013.. Nucleic Acids Research. 2013
• QIIME
• Quinlan A.R. and Hall I.M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010; 26(6):841-2
• Kim D., et al. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene
fusions. Genome Biology. 2013; 14(4), R36
• Langmead B. and Salzberg S.L. Fast gapped-read alignment with Bowtie 2. Nature Methods. 2012; 9(4):357-9
• Li B. and Dewey C.N. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference
genome. BMC Bioinformatics. 2011; 12(1), 323
• Li H. and Durbin R., Fast and accurate short read alignment with Burrows-Wheeler Transform. Bioinformatics.
2009; 25:1754-60
• Li H., et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009; 25(16): 2078-9
• Lister R., et al. Human DNA methylomes at base resolution show widespread epigenomic differences. Nature.
2009; 462(7271):315-22
• Love M.I., et al. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome
Biology. 2014; 15(12), 550
• Luscombe N.M., et al. What is bioinformatics? An introduction and overview. Yearbook of Medical Informatics. 2001
• Madan Babu M.M. An Introduction to Microarray Data Analysis. Computational Genomics (Ed: R. Grant), Horizon Press, UK. 2004
• McCarthy D.J., et al. Choice of transcripts and software has a large effect on variant annotation. Genome
Medicine. 2014; 6(3), 26
• Pearson W.R., et al. Comparison of DNA sequences with protein sequences. Genomics. 1997; 46(1):24-36
• Removing duplicates - is it really necessary? (2010, September 14). SEQanswers.
• Ritchie M.E., et al. limma powers differential expression analyses for RNA-sequencing and microarray studies.
Nucleic Acids Research. 2015; 43(7)
• Robinson M.D., et al. edgeR: a Bioconductor package for differential expression analysis of digital gene ex-
pression data. Bioinformatics. 2010; 26(1):139-140
• Trapnell C., et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and
isoform switching during cell differentiation. Nature Biotechnology. 2010; 28:511-515
• UNITE
• Xi Y. and Li W. BSMAP: whole genome bisulfite sequence MAPping program. BMC Bioinformatics. 2009;
10(1), 232

