
Differential expression analysis with DESeq2

Dr. Kathi Zarnack

Buchmann Institute for Molecular Life Sciences (BMLS), Frankfurt

28 June 2019

Contents

1 Short introduction to R and Bioconductor

2 Data analysis
2.1 Data import
2.2 DESeq2 analysis
2.3 Test results
2.4 Saving the results
2.5 Visualisation

3 Using Ensembl Biomart



The aim of this practical is to introduce the basic steps of gene expression analysis with
DESeq2. The analysis starts from pre-calculated count tables which contain the number of
RNA-seq reads that fell into each gene in the mouse genome.

1 Short introduction to R and Bioconductor


R is a freely available programming language for statistical data analyses and their visualisation
(see http://www.r-project.org). Based on R, Bioconductor is an open source software project
which offers a broad range of packages to analyse genomic data from molecular biology
experiments (see https://www.bioconductor.org/). Each Bioconductor package has its own
webpage which specifies the commands for its installation and provides a so-called vignette
introducing the most important concepts and commands. You can find all information on the package
DESeq2 here: https://www.bioconductor.org/packages/release/bioc/html/DESeq2.html.

The following code snippets recapitulate some central R commands:


###### Package installation
install.packages("pheatmap")

# Packages from Bioconductor are installed with BiocManager
# (check their respective webpages for details)
install.packages("BiocManager")
BiocManager::install("DESeq2")

###### Loading a package
library(DESeq2)

###### Help
?plotMA # gives information about the function plotMA()

2 Data analysis
2.1 Data import
The numbers of RNA-seq reads from the six samples that fall into the different genes
(gene counts) are provided as tab-separated tables generated by htseq-count. The first column
reports the gene ID and the second column the read count associated with the respective gene.
DESeq2 provides the function DESeqDataSetFromHTSeqCount() to import htseq-count output
tables into a single count table.
We save the htseq-count outputs into a single directory (mydir). We specify which files
to read in using list.files().
# Path to the directory with the htseq-count output
mydir <- "~/Documents/DAcourse/htseq_counts/"
myfiles <- list.files(path = mydir)
myfiles

# Set the working directory to the path with the files
setwd(mydir)

We create a metadata table with additional information about the files. Be careful with the
order of files.


sampleFiles <- myfiles
sampleCondition <- c(rep("TKO", 3), rep("WT", 3))
sampleReplicate <- rep(paste("Rep", 1:3, sep = "_"), 2)
sampleTable <- data.frame(sampleName = sampleFiles,
                          fileName = sampleFiles,
                          condition = sampleCondition,
                          replicate = sampleReplicate)
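
Because the conditions above are assigned purely by position, a mismatch between the file order and sampleCondition would go unnoticed. The following is a minimal sketch of one way to guard against this, assuming that the condition and replicate number are encoded in the file names. The file names below are invented for illustration and will differ from the course files.

```r
# Hypothetical file names; replace with the output of list.files(mydir)
exampleFiles <- c("TKO_1.txt", "TKO_2.txt", "TKO_3.txt",
                  "WT_1.txt", "WT_2.txt", "WT_3.txt")

# Derive condition and replicate from the file name instead of the position
exampleCondition <- sub("_.*$", "", exampleFiles)
exampleReplicate <- paste0("Rep_", sub("^.*_([0-9]+)\\.txt$", "\\1", exampleFiles))

exampleTable <- data.frame(sampleName = exampleFiles,
                           fileName = exampleFiles,
                           condition = exampleCondition,
                           replicate = exampleReplicate)

# Cross-tabulate to confirm one file per condition and replicate
table(exampleTable$condition, exampleTable$replicate)
```

If the file names do not follow such a pattern, printing sampleTable and checking it by eye serves the same purpose.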

Then we build the DESeqDataSet using the following function:


library("DESeq2")
dds <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable,
                                  directory = mydir,
                                  design = ~ condition)

dds
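
To illustrate what the import step does conceptually, the following base R sketch merges two toy htseq-count style files into one count matrix. It is not the DESeq2 implementation. The file names and counts are invented, and the sketch assumes that all files list the same genes in the same order and that the summary rows appended by htseq-count start with "__".

```r
# Write two toy htseq-count style files (gene ID <tab> count, no header)
tmpdir <- tempfile("htseq_")
dir.create(tmpdir)
writeLines(c("geneA\t10", "geneB\t0", "geneC\t25", "__no_feature\t7"),
           file.path(tmpdir, "sample1.txt"))
writeLines(c("geneA\t12", "geneB\t3", "geneC\t40", "__no_feature\t9"),
           file.path(tmpdir, "sample2.txt"))

toyFiles <- list.files(tmpdir)

# Read every file and keep the count column, named by gene ID
countList <- lapply(toyFiles, function(f) {
  tab <- read.table(file.path(tmpdir, f), sep = "\t",
                    col.names = c("gene", "count"),
                    stringsAsFactors = FALSE)
  setNames(tab$count, tab$gene)
})

# Bind the columns into one matrix: genes in rows, samples in columns
countMatrix <- do.call(cbind, countList)
colnames(countMatrix) <- toyFiles

# Drop the summary rows (names starting with "__")
countMatrix <- countMatrix[!grepl("^__", rownames(countMatrix)), , drop = FALSE]
countMatrix
```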

2.2 DESeq2 analysis


In the next step, we run the differential expression analysis on the DESeqDataSet created above.
The formula design = ~ condition, which we passed when building the data set, specifies that the
effect to be tested in the analysis is given in the column condition of the colData table.
The function DESeq() is a convenient wrapper function that runs all core steps of the DESeq2
analysis. These include the estimation of the sizeFactors to correct for differences in library
size, the dispersion estimation, the model fitting and the final test for differential expression.
The function sizeFactors() returns the calculated sizeFactors. In order to normalise the read
counts in the columns of the count table, the function DESeq() internally divides each column by
the sizeFactor of the respective sample. The function counts() returns the read counts before or
after sizeFactor normalisation, depending on the parameter normalized=FALSE/TRUE.
# Run DESeq2 analysis
dds <- DESeq(dds)
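
The idea behind the sizeFactors can be sketched in base R with the median-of-ratios approach on a toy count matrix. This is a simplified illustration with invented counts, not the DESeq2 code itself. Each sample is compared with a per-gene geometric mean, and the median ratio becomes the sizeFactor of that sample.

```r
# Toy count matrix: 4 genes (rows) x 3 samples (columns), values invented
toyCounts <- matrix(c(10, 20, 40,
                      100, 200, 400,
                      50, 100, 200,
                      8, 16, 32),
                    nrow = 4, byrow = TRUE,
                    dimnames = list(paste0("gene", 1:4),
                                    paste0("sample", 1:3)))

# Per-gene geometric mean on the log scale (reference "pseudo-sample")
logGeoMeans <- rowMeans(log(toyCounts))

# Size factor = median ratio of a sample to the reference,
# ignoring genes with zero counts
toySizeFactors <- apply(toyCounts, 2, function(cnts) {
  keep <- is.finite(logGeoMeans) & cnts > 0
  exp(median((log(cnts) - logGeoMeans)[keep]))
})
toySizeFactors

# Normalised counts: divide each column by its size factor
toyNormalised <- sweep(toyCounts, 2, toySizeFactors, FUN = "/")
toyNormalised
```

In this toy example the three samples differ only in sequencing depth, so all genes have identical counts after normalisation. On the real data, sizeFactors(dds) and counts(dds, normalized=TRUE) return the corresponding values.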

Exercise: Which sample had the largest / smallest library size?

2.3 Test results


The function results() retrieves the results of the statistical test. The parameters alpha and
lfcThreshold set cut-offs on the significance level (i.e. the adjusted P value) and the fold
change (on log2 scale). The defaults are alpha=0.1 and lfcThreshold=0. In addition, the parameter
contrast defines which two levels of condition are compared and which of them serves as the
reference. In this case, we use the wild type as the reference, i.e. the change in expression is
calculated as the ratio knock out / wild type.
# Generate a results table with default settings
res <- results(dds, contrast=c("condition", "TKO", "WT"))
head(res)

# Summarise results
summary(res)
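
Two columns of the results table are worth a short illustration: padj and log2FoldChange. The sketch below uses invented P values and assumes the Benjamini-Hochberg procedure, which is the default adjustment method of results(). Note that DESeq2 additionally filters genes before the adjustment, so the values in res cannot be reproduced by simply calling p.adjust() on res$pvalue.

```r
# Toy P values for five genes (invented for illustration)
toyP <- c(0.001, 0.01, 0.02, 0.04, 0.5)

# Benjamini-Hochberg adjustment for multiple testing
toyPadj <- p.adjust(toyP, method = "BH")
toyPadj

# A log2 fold change of 1 is a 2-fold increase, -1 is a 2-fold decrease
toyLFC <- c(2, 1, 0, -1, -2)
toyFoldChange <- 2^toyLFC
toyFoldChange
```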


The function order() can be used to sort the regulated genes by different criteria. In addition, the
gene IDs in the row names make it possible to directly retrieve individual genes of interest from the
results table.
# The 10 most upregulated genes
res <- res[order(res$log2FoldChange, decreasing = TRUE), ]
TopUP <- res[1:10, ]
TopUP

# Retrieving individual genes
res["ENSMUSG00000058440", ]

Exercise:
- How many genes are differentially regulated at a significance level of 0.05 and with an at least
2-fold change?
- Which are the 10 genes with the most significant regulation?
- How efficiently are the DNA (cytosine-5)-methyltransferase genes deactivated in the knock out?

2.4 Saving the results


The function subset() extracts defined subsets from the results table, which can then e.g. be
saved as a table with write.table(). Within the function subset(), the columns of the table can be
directly referenced by column name. As an alternative to write.table(), you can also generate a list
of just the gene names with writeLines(). This list can be used to retrieve additional information
on the target genes from Ensembl Biomart (see below). It can also serve as input for a Gene
Ontology analysis.
# Extract list of upregulated genes with padj < 0.01 and log2 fold change > 1
SigGenesUP <- subset(res, padj < 0.01 & log2FoldChange > 1)

# Export upregulated genes
write.table(SigGenesUP, file = "DESeq2_TKO_up.tab", quote=FALSE, sep="\t")

# List of gene names of the 200 most downregulated genes
res <- res[order(res$log2FoldChange, decreasing=FALSE), ]
TopDown200 <- res[1:200, ]
head(TopDown200)

writeLines(rownames(TopDown200), "TopDown200.txt")
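
A table saved this way can be read back into R later. The following sketch shows the round trip on a toy table written to a temporary file. The gene names and values are invented, and toyRes merely stands in for a results table such as SigGenesUP.

```r
# Toy results table (values invented); gene IDs are in the row names
toyRes <- data.frame(log2FoldChange = c(2.5, 1.2),
                     padj = c(0.001, 0.004),
                     row.names = c("geneA", "geneB"))

# Write to a temporary file with the same settings as above
toyFile <- tempfile(fileext = ".tab")
write.table(toyRes, file = toyFile, quote = FALSE, sep = "\t")

# The header line has one field fewer than the data rows, so read.table()
# uses the first column as row names
toyBack <- read.table(toyFile, header = TRUE, sep = "\t")
toyBack
```

Because the header line lacks a name for the gene ID column, spreadsheet programs may shift the column names by one. Setting col.names = NA in write.table() adds an empty header field for the row names and avoids this.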

Exercise: Generate and save a list of gene names of the 200 most up-regulated genes in the
knock out (TKO).

2.5 Visualisation
DESeq2 offers a number of functions for the visualisation of the data and analysis results. On the
one hand, the plots serve for quality control. On the other hand, they offer valuable insights into the
transcriptional response present in the analysed samples.
A commonly used option to visualise the transcriptional response is the so-called MA plot. In this plot,
each individual gene is represented by one dot showing the log2 fold change in relation to the average
read count level across all samples. Genes that are significant according to the chosen significance
level alpha (default alpha=0.1) are coloured in red. All other genes are in black.


# MA Plot (with default log2 fold change cutoff)
plotMA(res, main="MA plot")

An alternative and complementary way to visualise the transcriptional response is the so-called volcano
plot. In this plot, each individual gene is represented by one dot showing the P value (or adjusted
P value) in relation to the log2 fold change. Genes that are significant according to the chosen
significance level alpha (default alpha=0.1) are coloured differently. This graphic can be created with
ggplot2.
library(ggplot2)
res_table <- as.data.frame(res)
# Add a column labelling significant genes. Note that padj contains NA (not available)
# values, which are set to FALSE here
res_table$sig_genes <- res_table$padj < 0.1
res_table$sig_genes[is.na(res_table$padj)] <- FALSE
gg1 <- ggplot(data = res_table,
              mapping = aes(x = log2FoldChange,
                            y = -log10(padj),
                            color = sig_genes)) +
  geom_point()
gg1

# Colour significant genes in red and all other genes in black
gg1 + scale_color_manual(values = c(`TRUE` = "red", `FALSE` = "black"))
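
The explicit handling of NA values in the code above matters because a comparison with NA yields NA rather than FALSE. A minimal sketch with invented adjusted P values shows the behaviour and an equivalent one-line alternative:

```r
# Toy adjusted P values including one missing value
toyPadj <- c(0.001, 0.2, NA, 0.05)

# Comparing NA yields NA rather than FALSE
toySigRaw <- toyPadj < 0.1
toySigRaw

# Option 1: overwrite the NA entries afterwards (as in the code above)
toySig <- toySigRaw
toySig[is.na(toyPadj)] <- FALSE

# Option 2: test for NA within the same expression
toySig2 <- !is.na(toyPadj) & toyPadj < 0.1

identical(toySig, toySig2)
```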

The function plotCounts() displays the sizeFactor-normalised counts of an individual gene
across the different samples.
# Normalised read counts for an individual gene
plotCounts(dds, gene="ENSMUSG00000004099.16", intgroup="condition")

A useful approach for general quality control is the pairwise comparison of similarity between the
different samples. For this, the gene counts of the DESeqDataSet are first transformed to the log2 scale
with the regularised log transformation rlog(), and then passed to the function dist() to calculate
pairwise Euclidean distances. The pairwise similarities can be displayed as a heatmap, which helps to
assess the general quality and the success of the treatment for each sample.
# Heatmap of sample similarities
library("pheatmap")
library("RColorBrewer")

# Regularised log transformation of the counts
rld <- rlog(dds, blind=FALSE)

# Euclidean distances between samples (dist() compares rows, hence t())
sampleDists <- dist(t(assay(rld)))
sampleDistMatrix <- as.matrix(sampleDists)

# Label the rows by condition and replicate (columns of the sample table)
rownames(sampleDistMatrix) <- paste(rld$condition, rld$replicate, sep="-")
colnames(sampleDistMatrix) <- NULL
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
pheatmap(sampleDistMatrix, clustering_distance_rows=sampleDists,
         clustering_distance_cols=sampleDists, color=colors)
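
The distance calculation itself can be followed on a small toy matrix. The values below are invented. The sketch shows why the matrix is transposed before calling dist() and checks one distance by hand.

```r
# Toy log-scale expression matrix: 3 genes (rows) x 3 samples (columns)
toyExpr <- matrix(c(1, 1, 4,
                    2, 2, 6,
                    3, 3, 3),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(paste0("gene", 1:3),
                                  c("s1", "s2", "s3")))

# dist() compares rows, so transpose to compare samples
toyDists <- dist(t(toyExpr))
toyDists

# The same Euclidean distance for s1 versus s3, computed by hand
manualDist <- sqrt(sum((toyExpr[, "s1"] - toyExpr[, "s3"])^2))
manualDist
```

Samples s1 and s2 have identical profiles and therefore a distance of 0, whereas s3 differs in two genes.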

The principal component analysis (PCA) follows a similar approach. By default, the function plotPCA()
from the DESeq2 package does not analyse all genes, but only takes into account the 500 genes which
show the highest variability across all samples.
# PCA (500 most variable genes)
plotPCA(rld, intgroup=c("condition"), ntop = 500)
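
The selection of the most variable genes followed by a PCA can be sketched in base R with prcomp(). This is a simplified illustration on simulated data, not the plotPCA() code itself. The matrix, the simulated condition effect and the value of ntop are assumptions chosen for the example.

```r
set.seed(1)
# Toy log-scale matrix: 100 genes x 6 samples (values simulated)
toyMat <- matrix(rnorm(600), nrow = 100,
                 dimnames = list(paste0("gene", 1:100),
                                 paste0("s", 1:6)))
# Shift the first 10 genes in samples 4-6 to mimic a condition effect
toyMat[1:10, 4:6] <- toyMat[1:10, 4:6] + 5

# Keep the ntop genes with the highest variance across samples
ntop <- 20
geneVars <- apply(toyMat, 1, var)
topGenes <- order(geneVars, decreasing = TRUE)[seq_len(min(ntop, length(geneVars)))]

# PCA with samples as rows and the selected genes as columns
toyPca <- prcomp(t(toyMat[topGenes, ]))

# Fraction of the variance explained by each principal component
toyPercentVar <- toyPca$sdev^2 / sum(toyPca$sdev^2)
round(100 * toyPercentVar[1:2])

# PC1 coordinates per sample
toyPca$x[, "PC1"]
```

In this simulation, PC1 separates samples s1 to s3 from s4 to s6, because the shifted genes dominate the variance among the selected genes.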


It is also possible to customize the PCA plot using the ggplot function.

pcaData <- plotPCA(rld, intgroup=c("condition","replicate"), returnData=TRUE)
percentVar <- round(100 * attr(pcaData, "percentVar"))
ggplot(pcaData, aes(PC1, PC2, color=condition, shape=replicate)) +
  geom_point(size=3) +
  xlab(paste0("PC1: ",percentVar[1],"% variance")) + # change label x axis
  ylab(paste0("PC2: ",percentVar[2],"% variance")) + # change label y axis
  coord_fixed() +
  scale_colour_viridis_d(option = "inferno") # change color scale

Another diagnostic plot is a histogram of the distribution of P values.


# Histogram of P values
hist(res$pvalue)
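
As a rough guide to reading this histogram, the following sketch simulates P values under stated assumptions: 900 genes without an effect, whose P values are uniformly distributed between 0 and 1, and 100 genes with an effect, whose P values are concentrated near 0. The numbers and the beta distribution are arbitrary choices for illustration.

```r
set.seed(42)
# Genes without an effect: P values uniform between 0 and 1
nullP <- runif(900)
# Genes with an effect: P values concentrated near 0 (arbitrary choice)
effectP <- rbeta(100, shape1 = 0.5, shape2 = 20)
toyP <- c(nullP, effectP)

# Typical shape: flat background plus a peak in the leftmost bin
toyHist <- hist(toyP, breaks = seq(0, 1, by = 0.05),
                main = "Simulated P values", xlab = "P value")
toyHist$counts
```

A histogram of this shape is what one would typically hope to see. Strong deviations from a flat background can hint at problems with the data or the model.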

Exercise:
- Generate an MA plot applying a more stringent significance level of adjusted P value < 0.01.
- Visualise the read counts of the DNA (cytosine-5)-methyltransferase 1 gene.


3 Using Ensembl Biomart


The DESeq2 analysis results in a list of gene IDs of the differentially regulated genes. In order to learn
more about these genes, we will use the Ensembl database (http://www.ensembl.org/index.html).
This database collects, annotates and compares the genomes of various vertebrate species. Data from
this database can be retrieved with the web-based tool Biomart.
Steps of an example Biomart request:

1. Since Ensembl contains a multitude of organisms and a broad range of information, we first
need to choose the database and dataset to query. Please choose "Ensembl Genes 96" as
database and "Mouse genes (GRCm38.p6)" as dataset.
2. Under the menu item Filters, you can specify the genes for which information is requested.
In order to query the list of differentially regulated genes, go to Filters, then "Gene".
Select "Input external references ID list" and choose "Ensembl Gene ID(s)" as type of ID.
3. Under the menu item Attributes, you can then choose which information you would like to receive
for the gene list. For instance, to get for each gene the gene name and a short description in
addition to the gene ID, go to "Gene" in the section "Features" and select "Ensembl
Gene ID", "Associated Gene Name" and "Description". Make sure that the gene ID that
you provided as input is also present in your output table. This is important to connect the
results of the Ensembl query back to your original gene list from DESeq2.
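
Back in R, the exported Biomart table can be connected to the DESeq2 results with merge(). The sketch below uses invented gene IDs, gene names and values. The column names of the Biomart export are assumptions and depend on the attributes selected in the web interface.

```r
# Toy DESeq2 results (values invented); gene IDs are in the row names
toyRes <- data.frame(log2FoldChange = c(2.1, -1.5, 0.3),
                     padj = c(0.001, 0.02, 0.8),
                     row.names = c("ENSMUSG_A", "ENSMUSG_B", "ENSMUSG_C"))

# Toy Biomart export (hypothetical column names and entries)
toyBiomart <- data.frame(Gene.stable.ID = c("ENSMUSG_B", "ENSMUSG_A"),
                         Gene.name = c("GeneB", "GeneA"),
                         stringsAsFactors = FALSE)

# Move the row names into a column so that merge() can match on it
toyRes$Gene.stable.ID <- rownames(toyRes)

# all.x = TRUE keeps genes without a Biomart entry (annotation becomes NA)
toyAnnotated <- merge(toyRes, toyBiomart, by = "Gene.stable.ID", all.x = TRUE)
toyAnnotated

# If the gene IDs carry a version suffix, it can be removed before the query
strippedID <- sub("\\.[0-9]+$", "", "ENSMUSG00000004099.16")
strippedID
```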


References:
• "Differential analysis of count data - the DESeq2 package", Bioconductor.
https://www.bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.pdf
• Anders S, Huber W. (2010) Differential expression analysis for sequence count data.
Genome Biol 11:R106.
• Anders S, McCarthy DJ, Chen Y, Okoniewski M, Smyth GK, Huber W, Robinson MD. (2013) Count-based
differential expression analysis of RNA sequencing data using R and Bioconductor.
Nat Protoc 8:1765-86.
• Love MI, Huber W, Anders S. (2014) Moderated estimation of fold change and dispersion for RNA-seq
data with DESeq2.
Genome Biol 15:550.
