28 June 2019
Contents
2 Data analysis
2.1 Data import
2.2 DESeq2 analysis
2.3 Test results
2.4 Saving the results
2.5 Visualisation
The aim of this practical is to introduce the basic steps of gene expression analysis with
DESeq2. The analysis starts from pre-calculated count tables, which contain the number of
RNA-seq reads that fell into each gene in the mouse genome.
2 Data analysis
2.1 Data import
The numbers of RNA-seq reads from the six samples that fall into each gene
(gene counts) are provided as tab-separated tables produced by htseq-count. The first column reports
the gene ID and the second column the read count associated with that gene.
DESeq2 provides the function DESeqDataSetFromHTSeqCount() to import htseq-count output
tables into a single count table.
We save the htseq-count outputs into a single directory (mydir in the code below) and specify which files
to read in using list.files().
mydir <- "~/Documents/DAcourse/htseq_counts/" # path to the directory with htseq-count output
myfiles <- list.files(path = mydir)
myfiles
We create a metadata table with additional information about the files. Be careful with the
order of the files: the rows of the metadata table must match the order returned by list.files().
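The import step described above can be sketched as follows. This is a minimal, hedged example: the six file names, the condition labels ctrl and ko, and the toy counts are hypothetical stand-ins written to a temporary directory, not the course data. Deriving the condition from each file name, rather than typing it by hand, keeps the metadata in step with the order returned by list.files().

```r
# Hypothetical stand-in data: six tiny htseq-count-style files (gene ID <TAB> count)
demo_dir <- file.path(tempdir(), "htseq_counts_demo")
dir.create(demo_dir, showWarnings = FALSE)
set.seed(1)
gene_ids <- paste0("gene", 1:20)
for (f in paste0(c("ctrl1", "ctrl2", "ctrl3", "ko1", "ko2", "ko3"), ".txt")) {
  write.table(data.frame(gene_ids, rpois(20, lambda = 50)),
              file = file.path(demo_dir, f), sep = "\t",
              quote = FALSE, row.names = FALSE, col.names = FALSE)
}

# list.files() returns the file names in alphabetical order
demo_files <- list.files(path = demo_dir, pattern = "\\.txt$")

# Metadata table: sample name, file name, then the experimental condition
sample_table <- data.frame(
  sampleName = sub("\\.txt$", "", demo_files),
  fileName   = demo_files,
  condition  = factor(sub("[0-9]+\\.txt$", "", demo_files),
                      levels = c("ctrl", "ko"))
)
sample_table

# Import into one DESeqDataSet (this part only runs if DESeq2 is installed)
if (requireNamespace("DESeq2", quietly = TRUE)) {
  dds_demo <- DESeq2::DESeqDataSetFromHTSeqCount(sampleTable = sample_table,
                                                 directory = demo_dir,
                                                 design = ~ condition)
}
```

DESeqDataSetFromHTSeqCount() expects the sample names in the first column of the table and the file names in the second; any further columns are treated as sample metadata.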
Differential expression analysis with DESeq2
dds
# Summarise results
summary(res)
The function order() allows sorting the regulated genes by different criteria. In addition, the gene IDs
in the row names make it possible to retrieve individual genes of interest directly from the results table.
# The 10 most upregulated genes
res <- res[order(res$log2FoldChange, decreasing = TRUE), ]
TopUP <- res[1:10, ]
TopUP
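As a small illustration of sorting by other criteria and of row-name lookup, the following sketch uses a made-up five-gene table (toy_res and its gene names are hypothetical, not DESeq2 output). Note that order() places NA values last by default, which matters because padj can be NA.

```r
# Made-up results table with gene IDs as row names
toy_res <- data.frame(
  log2FoldChange = c(2.5, -3.1, 0.2, 1.1, -0.4),
  padj           = c(0.001, 0.0005, 0.8, NA, 0.04),
  row.names      = paste0("gene", LETTERS[1:5])
)

# Sort by adjusted P value, smallest first; the NA entry ends up at the bottom
by_padj <- toy_res[order(toy_res$padj), ]
by_padj

# Sort by fold change in increasing order, i.e. most downregulated first
most_down <- toy_res[order(toy_res$log2FoldChange), ]
most_down

# Retrieve a single gene by its row name
toy_res["geneC", ]
```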
Exercise:
- How many genes are differentially regulated with a significance level of 0.05 and an at least 2-fold
change?
- Which are the 10 genes with the most significant regulation?
- How efficiently are the DNA (cytosine-5)-methyltransferase genes deactivated in the knockout?
writeLines(rownames(TopDown200),"TopDown200.txt")
Exercise: Generate and save a list of gene names of the 200 most up-regulated genes in activated
HSCs.
2.5 Visualisation
DESeq2 offers a number of functions for visualising the data and the analysis results. The plots serve
for quality control and also offer valuable insights into the transcriptional response in the analysed
samples.
A commonly used option to visualise the transcriptional response is the so-called MA plot. In this plot,
each individual gene is represented by one dot showing the log2 fold change in relation to the average
read count level across all samples. Genes that are significant according to the chosen significance
level alpha (default alpha=0.1) are coloured in red. All other genes are in black.
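To make the construction of the plot concrete, here is a hand-made MA plot in base R on simulated values. The table toy_ma and all numbers in it are invented for illustration; the column names baseMean, log2FoldChange and padj mirror those of a DESeq2 results table. With a real results object, DESeq2's own plotMA() function, which takes the significance level as its alpha argument, would normally be used instead.

```r
set.seed(42)
n <- 1000

# Invented values: mean normalised count and log2 fold change for each gene
toy_ma <- data.frame(
  baseMean       = 2^runif(n, min = 0, max = 14),
  log2FoldChange = rnorm(n, mean = 0, sd = 1)
)
# Invented adjusted P values: genes with a large change are made "significant"
toy_ma$padj <- ifelse(abs(toy_ma$log2FoldChange) > 1.5, 0.01, 0.5)

# Flag the genes below the chosen significance level; NA counts as not significant
alpha  <- 0.1
is_sig <- !is.na(toy_ma$padj) & toy_ma$padj < alpha

# x axis: average count (log scale); y axis: log2 fold change
plot(toy_ma$baseMean, toy_ma$log2FoldChange,
     log = "x", pch = 20, cex = 0.5,
     col = ifelse(is_sig, "red", "black"),
     xlab = "mean of normalised counts", ylab = "log2 fold change")
abline(h = 0, col = "grey50")
```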
An alternative and complementary way to visualise the transcriptional response is the so-called volcano
plot. In this plot, each gene is represented by one dot showing the P value (or adjusted
P value), usually as -log10, in relation to the log2 fold change. Genes that are significant at the chosen
significance level (adjusted P value < 0.1 in the code below) are coloured differently. This graphic can
be created with ggplot2.
library(ggplot2)
res_table <- as.data.frame(res)
# Add a column labelling significant genes. padj can be NA (not available);
# such genes are treated as not significant.
res_table$sig_genes <- res_table$padj < 0.1
res_table$sig_genes[is.na(res_table$padj)] <- FALSE
gg1 <- ggplot(data = res_table,
              mapping = aes(x = log2FoldChange,
                            y = -log10(padj),
                            color = sig_genes)) +
  geom_point()
gg1
The function plotCounts() displays the size-factor-normalised counts of an individual gene
across the different samples.
# Normalised read counts for an individual gene
plotCounts(dds, gene="ENSMUSG00000004099.16", intgroup="condition")
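What size-factor normalisation means can be shown with a short worked example. The sketch below computes size factors with the median-of-ratios method, which is the default used by DESeq2's estimateSizeFactors(), on an invented 4 x 2 count matrix (toy_counts is hypothetical), and then divides each sample's counts by its size factor.

```r
# Invented counts: sample s2 was sequenced exactly twice as deeply as s1
toy_counts <- cbind(s1 = c(10, 20, 30, 40),
                    s2 = c(20, 40, 60, 80))
rownames(toy_counts) <- paste0("gene", 1:4)

# Log of the geometric mean of each gene across samples (the reference sample)
log_geo_means <- rowMeans(log(toy_counts))

# Size factor of a sample = median over genes of (count / geometric mean)
size_factors <- apply(toy_counts, 2, function(cnts) {
  keep <- is.finite(log_geo_means) & cnts > 0   # skip genes with a zero count
  exp(median((log(cnts) - log_geo_means)[keep]))
})
size_factors

# Normalised counts: each column divided by its size factor
norm_counts <- sweep(toy_counts, 2, size_factors, FUN = "/")
norm_counts
```

After normalisation the two samples have identical values, because they differ only in sequencing depth. For plotting on a log scale, plotCounts() by default also adds a pseudocount of 0.5 to the normalised counts.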
A useful approach for general quality control is the pairwise comparison of the similarity between the
different samples. For this, the gene counts of the DESeq DataSet are first log-transformed and then
passed to the function dist() to calculate pairwise Euclidean distances. The pairwise distances can
be displayed as a heatmap, which makes it possible to assess the overall data quality and whether the
treatment was successful in each sample.
# Heatmap of sample similarities
library("pheatmap")
library("RColorBrewer")
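A minimal sketch of the approach described above, run on simulated counts, might look as follows. The matrix toy_counts, the sample names and the log2(count + 1) transformation are assumptions made for this illustration. In a DESeq2 analysis one would typically start from a transformed object instead, such as the rld object used for the PCA, and extract its matrix with assay().

```r
set.seed(7)
# Simulated counts: 200 genes x 6 samples
toy_counts <- matrix(rnbinom(200 * 6, mu = 100, size = 5), nrow = 200,
                     dimnames = list(paste0("gene", 1:200),
                                     c("ctrl1", "ctrl2", "ctrl3",
                                       "ko1", "ko2", "ko3")))
# Make the first 50 genes respond in the hypothetical "ko" samples
toy_counts[1:50, 4:6] <- toy_counts[1:50, 4:6] * 4

# Log-transform, then compute Euclidean distances between samples.
# dist() works on rows, so the matrix is transposed to put samples in rows.
log_counts         <- log2(toy_counts + 1)
sample_dists       <- dist(t(log_counts))
sample_dist_matrix <- as.matrix(sample_dists)

if (requireNamespace("pheatmap", quietly = TRUE) &&
    requireNamespace("RColorBrewer", quietly = TRUE)) {
  colours <- colorRampPalette(rev(RColorBrewer::brewer.pal(9, "Blues")))(255)
  pheatmap::pheatmap(sample_dist_matrix,
                     clustering_distance_rows = sample_dists,
                     clustering_distance_cols = sample_dists,
                     color = colours)
} else {
  # Fallback with base R if the two packages are not installed
  heatmap(sample_dist_matrix, symm = TRUE)
}
```

Small distances (dark cells with this palette) indicate similar samples, so replicates of the same condition are expected to cluster together.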
Principal component analysis (PCA) follows a similar approach. The respective plot can be generated
with the function plotPCA() from the DESeq2 package. By default, this function does not analyse all
genes, but only takes into account the 500 genes that show the highest variance across all
samples.
# PCA (500 most variable genes)
plotPCA(rld, intgroup=c("condition"), ntop = 500)
It is also possible to customise the PCA plot using ggplot2.
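One way to do this is to compute the principal components yourself and pass them to ggplot(). The sketch below mimics the steps in base R on simulated data: select the most variable genes, then run prcomp() on the transposed matrix. The matrix toy_log and the sample annotation are invented. With DESeq2, calling plotPCA() with returnData = TRUE returns a comparable data frame that can be plotted in the same way.

```r
set.seed(3)
# Simulated log-scale expression: 2000 genes x 6 samples
toy_log <- matrix(rnorm(2000 * 6, mean = 8, sd = 0.3), nrow = 2000,
                  dimnames = list(paste0("gene", 1:2000),
                                  c("ctrl1", "ctrl2", "ctrl3",
                                    "ko1", "ko2", "ko3")))
# Make the first 100 genes respond in the hypothetical "ko" samples
toy_log[1:100, 4:6] <- toy_log[1:100, 4:6] + 2

# Keep the 500 genes with the highest variance across samples
ntop      <- 500
gene_var  <- apply(toy_log, 1, var)
top_genes <- order(gene_var, decreasing = TRUE)[seq_len(min(ntop, length(gene_var)))]

# PCA on the samples (rows of the transposed matrix)
pca         <- prcomp(t(toy_log[top_genes, ]))
percent_var <- round(100 * pca$sdev^2 / sum(pca$sdev^2), 1)

pca_data <- data.frame(PC1 = pca$x[, 1], PC2 = pca$x[, 2],
                       condition = rep(c("ctrl", "ko"), each = 3),
                       name = colnames(toy_log))

# Customised plot (only drawn if ggplot2 is installed)
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  gg_pca <- ggplot(pca_data, aes(x = PC1, y = PC2,
                                 color = condition, label = name)) +
    geom_point(size = 3) +
    geom_text(vjust = -1, show.legend = FALSE) +
    xlab(paste0("PC1: ", percent_var[1], "% variance")) +
    ylab(paste0("PC2: ", percent_var[2], "% variance")) +
    theme_bw()
  print(gg_pca)
}
```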
Exercise:
- Generate an MA plot applying a stricter significance level of adjusted P value < 0.01.
- Visualise the read counts of the DNA (cytosine-5)-methyltransferase 1 gene.
1. Since Ensembl contains a multitude of organisms and a broad range of information, we first
need to choose the database and dataset to query. Please choose "Ensembl Genes 96" as
database and "Mouse genes (GRCm38.p6)" as dataset.
2. Under the menu item Filters, you can specify the genes for which information is requested.
To query the list of differentially regulated genes, go to Filters, then "Gene".
Select "Input external references ID list" and choose "Ensembl Gene ID(s)" as type of ID.
3. Under the menu item Attributes, you can then choose what information you would like to receive
for the gene list. For instance, to get the gene name and a short description for each gene in
addition to the gene ID, go to "Gene" in the section "Features" and select "Ensembl
Gene ID", "Associated Gene Name" and "Description". Make sure that the gene ID that
you provided as input is also present in your output table. This is important to connect the
results of the Ensembl query back to your original gene list from DESeq2.
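Connecting the two tables can be sketched in R as follows. The small tables deseq_table and biomart_table are invented, as are all gene IDs except the one quoted from this practical; the column names of a real BioMart export depend on the attributes chosen. One assumption worth checking in your own data: the gene IDs used in this practical carry a version suffix (for example ENSMUSG00000004099.16), whereas the Ensembl Gene ID attribute is normally unversioned, so the suffix is stripped before merging.

```r
# Invented DESeq2-style results with versioned gene IDs as row names
deseq_table <- data.frame(
  log2FoldChange = c(-4.2, 1.3, 0.1),
  padj           = c(1e-10, 0.03, 0.9),
  row.names      = c("ENSMUSG00000004099.16",
                     "ENSMUSG99999999991.1",   # made-up ID
                     "ENSMUSG99999999992.3")   # made-up ID
)

# Invented BioMart export with placeholder names and descriptions
biomart_table <- data.frame(
  ensembl_gene_id = c("ENSMUSG99999999991", "ENSMUSG00000004099"),
  gene_name       = c("placeholder_name_1", "placeholder_name_2"),
  description     = c("placeholder description 1", "placeholder description 2")
)

# Move the row names into a column and strip the version suffix
deseq_table$ensembl_gene_id <- sub("\\.[0-9]+$", "", rownames(deseq_table))

# Left join: keep every DESeq2 gene and add the annotation where available
annotated <- merge(deseq_table, biomart_table,
                   by = "ensembl_gene_id", all.x = TRUE)
annotated
```

Genes without a match in the BioMart table keep their DESeq2 values and get NA in the annotation columns.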
References:
• "Differential analysis of count data - the DESeq2 package", Bioconductor.
https://www.bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.pdf
• Anders S, Huber W. (2010) Differential expression analysis for sequence count data.
Genome Biol 11:R106.
• Anders S, McCarthy DJ, Chen Y, Okoniewski M, Smyth GK, Huber W, Robinson MD. (2013) Count-based
differential expression analysis of RNA sequencing data using R and Bioconductor.
Nat Protoc 8:1765-86.
• Love MI, Huber W, Anders S. (2014) Moderated estimation of fold change and dispersion for RNA-seq
data with DESeq2.
Genome Biol 15:550.