DAVID Tutorial

Pathway and Network Analysis Hands-On Session
Rossella Melchiotti
22/01/2015
Contents
1. Mount the local drive on the Windows machine
2. Take a look at the datasets that will be used for the analysis
3. Select genes of interest
4. Perform gene enrichment using DAVID (on the Windows machine)
5. Perform KEGG enrichment using GSEA (on the Windows machine)
6. Visualize a pathway using PathVisio (on the Windows machine)
7. Perform gene enrichment using topGO (on the cluster, will be presented only if time permits)
8. Help Links
In this practical you will learn how to perform pathway analysis on a real microarray expression dataset
using overrepresentation analysis, rank-based methods and conditional enrichment. You will also learn how
to overlay expression datasets on pathways for further understanding the effects of a perturbation.
1. Mount the local drive on the Windows machine
This technical step is required to access the datasets and tools used in this practical. It is specific to this
tutorial and not required for pathway analysis in other contexts.
1. Go to the Desktop
2. Double click on Computer
3. Select the tab Computer
4. Click on Map Network Drive
5. Select F: as Drive
6. Type \\kclad\groups\BioinformaticsWorkshop\ in Folder
7. Click on Finish
2. Take a look at the datasets that will be used for the analysis
As previously mentioned by Dr. Filipe Gracio in his presentation on RNA-Seq, the analysis of expression
datasets often produces files containing gene names, their p-value for differential expression across two or
more phenotypic groups and, in some cases, associated fold changes. For this practical we will analyse a
similar file. The file to use for the analysis can be found at:
\\kclad\groups\BioinformaticsWorkshop\practicals\Thursday\Melchiotti\Data\
deg_GSE2706_8h.txt (on the Windows shared drive)
This dataset was generated by comparing cells isolated from the same individuals before (3 samples) and
after (3 samples) stimulation with a receptor activator. Differential expression was estimated using the tool
Limma as implemented in GEO2R. Expression was measured using the Affymetrix Human Genome U133
Plus 2.0 Array.
Open this file and inspect it. It should contain the following fields:
• Affymetrix Probe ID (ID)

• Benjamini Hochberg corrected p-value (adj.P.Val)
• Uncorrected p-value (P.Value)
• Log Fold Change of stimulated versus unstimulated cells (logFC)
• HUGO Gene Symbol (Gene.symbol)
• Gene full name (Gene.title)
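To see how the adj.P.Val column relates to P.Value, the Benjamini-Hochberg correction can be reproduced with the base R function p.adjust. This is only a sketch: the p-values below are invented for illustration and are not taken from the dataset.

```r
# Toy p-values (hypothetical, not from deg_GSE2706_8h.txt)
raw_p <- c(0.001, 0.008, 0.039, 0.041, 0.6)

# Benjamini-Hochberg correction, the method behind the adj.P.Val column
adj_p <- p.adjust(raw_p, method = "BH")

# Compare how many tests pass a 0.05 threshold before and after correction
n_raw_significant <- sum(raw_p < 0.05)
n_adj_significant <- sum(adj_p < 0.05)

print(data.frame(P.Value = raw_p, adj.P.Val = adj_p))
print(c(raw = n_raw_significant, adjusted = n_adj_significant))
```

Corrected p-values are never smaller than the uncorrected ones, so fewer probes pass the same threshold after correction.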
The same directory contains the corresponding expression dataset (GSE2706_series_matrix.txt). Each
row represents a probe and each column represents a sample. Samples identified by the prefix PBS are baseline
samples. Samples identified by the prefix LPS are samples measured after a perturbation. This dataset will
be described in more detail later on in the practical. Familiarize yourself with the structure of both files.
IMPORTANT!!!
Please do not save any changes made to the files or all the other students will be affected as
well
3. Select genes of interest
The first step in a pathway enrichment analysis is the selection of genes of interest to use for overrepresentation
analysis. One option is to look only at genes which are significantly different across two or more phenotypic
groups after multiple testing correction (adj.P.Val<0.05). This can be easily done in Excel. Because of time
constraints this step has already been performed and the corresponding list of differentially expressed genes
filtered by p-value can be found at:
deg_GSE2706_8h_filtered_by_pval.txt (on the Windows shared drive)
Q: How many genes are significantly differentially expressed?

When the list of significant genes is too long, additional filters can be introduced (for example on fold change)
or the significance threshold can be made more stringent.
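The same selection can be done in R instead of Excel. The sketch below uses a small invented table with the column names of deg_GSE2706_8h.txt; the probe names, values and the fold-change cut-off are hypothetical. For the real file, the table would first be loaded with read.delim.

```r
# Toy table with the same column names as deg_GSE2706_8h.txt (values are hypothetical)
deg <- data.frame(
  ID = c("probe_1", "probe_2", "probe_3", "probe_4"),
  adj.P.Val = c(0.001, 0.030, 0.200, 0.049),
  logFC = c(3.2, -0.4, 1.5, -2.1),
  Gene.symbol = c("GENE_A", "GENE_B", "GENE_C", "GENE_D"),
  stringsAsFactors = FALSE
)

# Filter 1: significance after multiple testing correction
significant <- deg[deg$adj.P.Val < 0.05, ]

# Filter 2 (optional): additionally require an absolute log fold change of at least 1
min_abs_logfc <- 1  # hypothetical cut-off
significant_strong <- significant[abs(significant$logFC) >= min_abs_logfc, ]

print(nrow(significant))
print(significant_strong$ID)
```

Using abs() keeps both strongly upregulated and strongly downregulated probes.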
Here is a heatmap representing the expression of differentially expressed genes across samples. Each row
represents a probe while each column represents a sample. Samples colored in green are before treatment
samples while samples colored in brown are after treatment samples.
How do we make biological sense out of this? Due to the large number of genes significantly perturbed we
cannot analyse them one by one.
4. Perform gene enrichment using DAVID (on the Windows machine)
We will start by running a simple overrepresentation analysis as implemented by the tool DAVID (Database
for Annotation, Visualization and Integrated Discovery). Go to the following website:
http://david.abcc.ncifcrf.gov/
1. Click on Functional Annotation
2. Upload the list deg_GSE2706_8h_filtered_by_pval.txt choosing Affymetrix_3PRIME_IVT_ID as
identifier and gene list as List Type
3. Click on Submit List
4. Move to the tab Background and under Affymetrix 3’ IVT Backgrounds choose Human Genome U133
Plus 2 Array as a background since the experiment was run using this chip
Another option would be to use only expressed genes as a background.
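The choice of background matters because overrepresentation analysis compares the gene list against it in a 2x2 table. The sketch below uses a plain one-sided Fisher's exact test with invented counts. DAVID reports a modified Fisher exact p-value (the EASE score), so its numbers would not match this sketch exactly. For simplicity the number of category genes is kept the same in both backgrounds, which would not generally hold in practice.

```r
# One-sided Fisher's exact test for overrepresentation of a category in a gene list.
# All counts below are hypothetical.
enrichment_p <- function(list_in_cat, list_size, bg_in_cat, bg_size) {
  contingency <- matrix(
    c(list_in_cat,                                       # in list, in category
      list_size - list_in_cat,                           # in list, not in category
      bg_in_cat - list_in_cat,                           # not in list, in category
      bg_size - list_size - (bg_in_cat - list_in_cat)),  # in neither
    nrow = 2
  )
  fisher.test(contingency, alternative = "greater")$p.value
}

# Same gene list (30 of 300 genes in the category), two different backgrounds
p_whole_chip <- enrichment_p(30, 300, bg_in_cat = 400, bg_size = 20000)
p_expressed_only <- enrichment_p(30, 300, bg_in_cat = 400, bg_size = 8000)

print(c(whole_chip = p_whole_chip, expressed_only = p_expressed_only))
```

With the smaller background the category is expected more often by chance in the list, so the same overlap yields a less extreme p-value.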
Ontologies and pathway databases of interest can be selected on the right-hand side of the webpage.
As mentioned in the theoretical session, DAVID provides three distinct tools to perform pathway enrichment:
• Functional Annotation Clustering (similar enriched categories are clustered together)

• Functional Annotation Chart (enriched categories are independently reported)
• Functional Annotation Table (each probe is independently annotated)
Choose the right analysis and the right annotation to answer the following questions:
Q: What are the most enriched GO terms for Biological Process (GOTERM_BP_FAT)?
Q: Are the results redundant? (suggestion, use the Functional Annotation Clustering option)
Q: What are the most enriched KEGG pathways?
Q: Take a look at the description of this dataset (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE2706).
Do the results make biological sense? (Only untreated samples and samples treated with LPS at 8h
were used for the analysis.)
Most of the enriched functions and pathways seem to revolve around inflammation. For this experiment
unstimulated human dendritic cells (DCs) were compared with DCs stimulated with lipopolysaccharides (LPS)
to induce TLR4-pathway activation. Expression after 8h stimulation was compared to baseline. It therefore
makes biological sense that most enriched pathways and functions are linked to the immune system. LPS, the
molecule used to activate dendritic cells, is in fact normally found in the outer membrane of Gram-negative
bacteria and should therefore be recognized as a threat by the immune system.
OPTIONAL
Q: What are the GO terms for Molecular Function (GOTERM_MF_FAT) most enriched in upregulated genes?
What about downregulated genes? (You can use the files deg_GSE2706_8h_filtered_by_pval_upregulated.txt
and deg_GSE2706_8h_filtered_by_pval_downregulated.txt which contain only genes upregulated and
downregulated by the perturbation respectively)
5. Close the browser.
5. Perform KEGG enrichment using GSEA (on the Windows machine)
To avoid the arbitrary choice of a threshold for selecting genes of interest to test for enrichment, we can
use a rank-based approach. In this practical we will focus on GSEA (Gene Set Enrichment Analysis).
The version of GSEA we will use today is the standalone Java application that can be found at
http://www.broadinstitute.org/gsea/index.jsp. The software can be launched by double clicking on gsea2-2.1.0.jar
located in the directory:
\\kclad\groups\BioinformaticsWorkshop\practicals\Thursday\Melchiotti\Software\GSEA\
(on the Windows shared drive)
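The idea behind a rank-based method can be sketched with a running sum over a ranked gene list. The code below is a simplified, unweighted version for illustration only: GSEA by default weights each step by the ranking metric and assesses significance by permutation. The gene names and gene sets are hypothetical.

```r
# Unweighted running-sum enrichment score for one gene set on a ranked gene list.
# Simplification: GSEA by default weights each hit by its ranking metric.
running_enrichment_score <- function(ranked_genes, gene_set) {
  is_hit <- ranked_genes %in% gene_set
  n_hit <- sum(is_hit)
  n_miss <- length(ranked_genes) - n_hit
  # Step up by 1/n_hit at each gene in the set, down by 1/n_miss otherwise
  steps <- ifelse(is_hit, 1 / n_hit, -1 / n_miss)
  running_sum <- cumsum(steps)
  # The enrichment score is the running-sum value furthest from zero
  running_sum[which.max(abs(running_sum))]
}

ranked_genes <- paste0("G", 1:10)    # hypothetical genes, most upregulated first
top_set <- c("G1", "G2", "G3")       # set concentrated at the top of the ranking
spread_set <- c("G1", "G5", "G10")   # set spread across the ranking

es_top <- running_enrichment_score(ranked_genes, top_set)
es_spread <- running_enrichment_score(ranked_genes, spread_set)
print(c(top = es_top, spread = es_spread))
```

A set whose genes cluster at the top of the ranking reaches a large score, while a set spread evenly along the list stays close to zero. No significance threshold on individual genes is needed.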
In order to run a GSEA analysis we need to download the expression dataset (instead of simply using a list
of genes). This file can be accessed at:
GSE2706_series_matrix.txt (on the Windows shared drive)
The file, which can be opened using Excel (tab delimited), contains an identifier (Affymetrix probe ID)
followed by the expression of six samples: 3 controls (PBS.n) and 3 stimulated samples (LPS.8h.n).
To run the GSEA software:
1. Click on Load data

2. Browse to the directory containing the expression file (GSE2706_series_matrix.txt)
3. Click on Run GSEA
4. Choose the loaded file as expression dataset
5. Choose c2.cp.kegg.v4.0.symbols.gmt as Gene sets database
6. Choose Create an on-the-fly phenotype as Phenotype labels
7. Write PBS.1, PBS.2, PBS.3 under Samples for class A (one per line, these names correspond to the
sample names in the header of the expression file)
8. Write LPS.8h.1, LPS.8h.2, LPS.8h.3 under Samples for class B (one per line, these names correspond
to the sample names in the header of the expression file)
9. Click on Apply to dataset
10. Choose gene_set as Permutation type (phenotype is usually preferred but since our dataset contains
only 6 samples, the number of possible permutations is not enough to estimate a reliable FDR q-val)
11. Choose HG_U133_Plus_2.chip as Chip platform(s)
12. Expand Basic Fields and choose a meaningful name as Analysis name
13. Use default values for the other parameters
14. Click on Run
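The remark in step 10 about the number of permutations can be made concrete with a short calculation. This is a sketch of the counting argument only, not of how GSEA computes its FDR.

```r
# Number of distinct ways to assign 3 of the 6 samples to class A
n_samples <- 6
n_class_a <- 3
n_labelings <- choose(n_samples, n_class_a)

# With so few labelings, a permutation p-value cannot go below 1 / n_labelings
min_p_value <- 1 / n_labelings

print(n_labelings)   # 20
print(min_p_value)   # 0.05
```

With at most 20 distinct phenotype labelings, the null distribution is too coarse to resolve small p-values, which is why gene_set permutation is used here.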
Pre-computed results can be found at:
\\kclad\groups\BioinformaticsWorkshop\practicals\Thursday\Melchiotti\Results\GSEA\
effect_of_stimulation_on_DCs.Gsea.1418825504824\ (on the Windows shared drive)
Load the index.html file containing the summary of the results using a browser.
Results computed by GSEA can also be accessed directly from the software window by clicking on Success 5.
Enriched pathways for upregulated and downregulated genes can be found by clicking on Detailed enrichment
results in html format. Clicking on a pathway name leads to a description of the pathway. Clicking
on Details . . . leads to the list of genes in the pathway and the corresponding heatmap. Plots showing
how strongly each pathway is enriched at the top of the ranked list can be found by clicking on Snapshot of
enrichment results.
Q: What are the most enriched KEGG pathways for upregulated genes in class B?
Q: What are the most enriched KEGG pathways for downregulated genes in class B?
15. Close GSEA.
6. Visualize a pathway using PathVisio (on the Windows machine)

One of the most enriched pathways in our dataset according to GSEA is, as expected, the
KEGG_TOLL_LIKE_RECEPTOR_SIGNALING_PATHWAY. It would be interesting to overlay
gene expression fold changes on this enriched pathway to better understand how the pathway is perturbed
after activation.
PathVisio is a free open-source tool for the visualization of biological pathways. Open PathVisio by double
clicking on the executable file which can be found at:
\\kclad\groups\BioinformaticsWorkshop\practicals\Thursday\Melchiotti\Software\
pathvisio-3.1.3\ (on the Windows shared drive)
Here are the steps required to plot the Toll-like receptor signaling pathway and to colour it according to the
fold changes in our dataset for the genes in the pathway.
1. Click on File > Open and select the Hs_Toll-like_receptor_signaling_pathway_WP75_72133.gpml
pathway stored at:
\\kclad\groups\BioinformaticsWorkshop\practicals\Thursday\Melchiotti\Pathways\
2. Click on Data > Select Gene Database and select the file Hs_Derby_20130701.bridge stored at:
\\kclad\groups\BioinformaticsWorkshop\practicals\Thursday\Melchiotti\Pathways\
This file is an annotation file to map gene IDs to pathway components.
3. Click on Data > Import expression data and select the expression matrix deg_GSE2706_8h_filtered_by_pval.txt
as Input file. This file is stored at:
\\kclad\groups\BioinformaticsWorkshop\practicals\Thursday\Melchiotti\Data\Pathvisio\
4. To choose the Output file click on Browse and select your home directory (the one with your username
as a name, e.g. a1102248). You can give the file any name you like. Click on Choose filename for
database. Click on Next
IMPORTANT!!!
Do not save the output file in the default directory given by the software
5. Choose tab as a data delimiter. Click on Next
6. Choose Gene.symbol as primary identifier column. Select Use the same system code for all rows and
choose HGNC (Hugo Gene Symbols). Click on Next
7. Click on Finish
8. Choose Data > Visualization options
9. Tick Expression as Color, Tick Basic and select only logFC as the metric to use to color nodes
10. Click on Modify and change the scale of the color set so that the gradient goes from -10 to 10
Q: Are genes mostly upregulated or downregulated?
Q: Are perturbed genes concentrated upstream or downstream in the pathway?
11. Close PathVisio.
IMPORTANT!!!
Please do not save the changes made to the pathway or all the other students will be affected
as well
7. Perform gene enrichment using topGO (on the cluster, will be presented only if time permits)
As mentioned in the theoretical lecture, GO has a hierarchical nature which can sometimes lead to redundant
enriched functions (see DAVID results in Section 4). It is therefore interesting to compare results obtained
with traditional overrepresentation and rank-based analyses with results obtained by conditional enrichment.
This can be performed programmatically for the GO ontology using the R package topGO. A description of
the package and of all functions contained in the package can be found at http://www.bioconductor.org/
packages/release/bioc/html/topGO.html.
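The idea behind conditional enrichment can be illustrated on a toy hierarchy with one specific term nested inside a broader one. This is only a sketch of the principle, with hypothetical gene sets and base R functions; topGO's elim algorithm applies it bottom-up across the whole GO graph.

```r
# Toy illustration of the idea behind conditional ("elim"-style) enrichment.
# All gene sets are hypothetical.
universe <- paste0("g", 1:1000)
child_term <- universe[1:40]          # specific term
parent_term <- universe[1:100]        # broader term containing the child term
significant <- c(universe[1:20],      # 20 significant genes inside the child term
                 universe[41:43],     # 3 significant genes in the parent only
                 universe[101:127])   # 27 significant genes elsewhere

# One-sided Fisher's exact test of a term against the significant gene list
fisher_p <- function(term_genes) {
  in_term <- factor(universe %in% term_genes, levels = c(TRUE, FALSE))
  is_sig <- factor(universe %in% significant, levels = c(TRUE, FALSE))
  fisher.test(table(in_term, is_sig), alternative = "greater")$p.value
}

p_child <- fisher_p(child_term)
p_parent_classic <- fisher_p(parent_term)
# Conditional step: the child is significant, so its genes are removed from the parent
p_parent_conditional <- fisher_p(setdiff(parent_term, child_term))

print(c(child = p_child, parent_classic = p_parent_classic,
        parent_conditional = p_parent_conditional))
```

In the classic test the parent term appears enriched only because it inherits the genes of its child. Once those genes are removed, the parent is no longer significant, which reduces redundancy in the list of enriched terms.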
Please open MobaXterm, log in to the cluster, add the required modules and open RStudio as explained in
the previous practical (see handout).
You can follow this section by copying and pasting the code from this PDF or by running, in RStudio, the
script topGOAnalysis.R stored at:
~/practicals/Thursday/Melchiotti/Scripts/topGOAnalysis.R (on the cluster)
rm(list = ls())
# Load packages
library(topGO)
library(org.Hs.eg.db)
library(biomaRt)
library(reshape2)
# Set analysis parameters

input_file_dir <- "~/practicals/Thursday/Melchiotti/Data/"
working_dir <- "~/practicals/Thursday/Melchiotti/Results/topGO/"
IMPORTANT!!!
working_dir should be changed to the directory in which to store results
prefix<-"GSE2706_8h"
deg_filename <- paste(input_file_dir,"deg_GSE2706_8h.txt",sep="")
significance_threshold_pvalue <- 0.05
# Load the list of genes with their corresponding p-value for differential expression
# (all genes regardless of p-value)
genes <- read.delim(deg_filename,sep="\t",header=TRUE,na.strings="")
print(head(genes))
## ID adj.P.Val P.Value logFC Gene.symbol

## 1 204698_at 0.000489 8.94e-09 12.70 ISG20
## 2 1405_i_at 0.002017 9.94e-08 9.37 CCL5
## 3 33304_at 0.002017 1.20e-07 10.90 ISG20
## 4 204655_at 0.002017 1.48e-07 8.85 CCL5
## 5 210163_at 0.002811 2.57e-07 12.40 CXCL11
## 6 207901_at 0.002991 3.35e-07 9.26 IL12B
## Gene.title
## 1 interferon stimulated exonuclease gene 20kDa
## 2 chemokine (C-C motif) ligand 5
## 3 interferon stimulated exonuclease gene 20kDa
## 4 chemokine (C-C motif) ligand 5
## 5 chemokine (C-X-C motif) ligand 11
## 6 interleukin 12B
# Collapse probes with the same Gene Symbol
genes_collapsed<-dcast(genes[,c("Gene.symbol","adj.P.Val")],Gene.symbol~.,
median,value.var="adj.P.Val")
colnames(genes_collapsed)<-c("Gene.symbol","adj.P.Val")
# Create a vector containing the scores that will be used to rank the list, each vector
# element should be named with its gene symbol
all_genes <- genes_collapsed$adj.P.Val
names(all_genes) <- genes_collapsed$Gene.symbol
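The dcast call above keeps one value per gene by taking the median adjusted p-value across its probes. As a sketch, the same collapsing can be reproduced with base R's tapply on the rows printed by head(genes) earlier; other probes for these genes in the full file are ignored here, so the medians are only illustrative.

```r
# Rows copied from the head(genes) output; the full file may contain more probes per gene
toy <- data.frame(
  Gene.symbol = c("ISG20", "CCL5", "ISG20", "CCL5", "CXCL11"),
  adj.P.Val = c(0.000489, 0.002017, 0.002017, 0.002017, 0.002811),
  stringsAsFactors = FALSE
)

# Median adjusted p-value per gene symbol, returned as a named vector
collapsed <- tapply(toy$adj.P.Val, toy$Gene.symbol, median)
print(collapsed)
```

The result is already named by gene symbol, which is the form topGO expects for allGenes.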
# Define function to select significant genes for Fisher's test

top_diff_genes_function <- function (scores)
{return(scores < significance_threshold_pvalue)}
# Enrichment for biological processes (the package org.Hs.eg.db contains the mapping
# between gene symbols and GO terms)
GO_data_BP<-new("topGOdata",
ontology = "BP",
allGenes = all_genes,
geneSel = top_diff_genes_function,
nodeSize = 10,
annot = annFUN.org,
mapping = "org.Hs.eg.db",
ID = "symbol"
)
##
## Building most specific GOs ..... ( 9342 GO terms found. )
##
## Build GO DAG topology .......... ( 12580 GO terms and 28930 relations. )
##
## Annotating nodes ............... ( 14149 genes annotated to the GO terms. )
# Run enrichment using both Fisher and Kolmogorov-Smirnov tests and both the classic and
# the elim methods provided by topGO
results_Fisher_BP <- runTest(GO_data_BP, algorithm = "classic", statistic = "fisher")
##
## -- Classic Algorithm --
##
## the algorithm is scoring 3620 nontrivial nodes
## parameters:
## test statistic: fisher
results_Fisher_elim_BP <- runTest(GO_data_BP, algorithm = "elim", statistic = "fisher")
##
## -- Elim Algorithm --
##
## parameters:
## test statistic: fisher
## cutOff: 0.01
##
## Level 18: 2 nodes to be scored (0 eliminated genes)
## (output for the remaining levels not shown)
results_KS_BP <- runTest(GO_data_BP, algorithm = "classic", statistic = "ks")
##
## -- Classic Algorithm --
##
## parameters:
## test statistic: ks
## score order: increasing
results_KS_elim_BP <- runTest(GO_data_BP, algorithm = "elim", statistic = "ks")
##
## -- Elim Algorithm --
##
## parameters:
## test statistic: ks
## cutOff: 0.01
## score order: increasing
## (per-level output not shown)
# Substitute zero p-values with a very small number so that it is possible to compute the
# log10
results_Fisher_BP@score[which(results_Fisher_BP@score==0)]=1e-300
results_Fisher_elim_BP@score[which(results_Fisher_elim_BP@score==0)]=1e-300
results_KS_BP@score[which(results_KS_BP@score==0)]=1e-300
results_KS_elim_BP@score[which(results_KS_elim_BP@score==0)]=1e-300
# Convert results into a table

all_res_BP_just_pval <- GenTable(GO_data_BP,
classicFisher = results_Fisher_BP,
elimFisher = results_Fisher_elim_BP,
classicKS = results_KS_BP,
elimKS = results_KS_elim_BP,
ranksOf = "classicFisher",
topNodes = length(score(results_Fisher_BP)))
# Write results to a table

write.csv(all_res_BP_just_pval,paste(working_dir,prefix,"_topGO_results.csv",sep=""),quote=FALSE)
Open the file you just created in your working directory (GSE2706_8h_topGO_results.csv). If the
analysis is taking too long, pre-computed results can be found at:
~/practicals/Thursday/Melchiotti/Results/topGO/GSE2706_8h_topGO_results.csv
(on the cluster computer)
OR
\\kclad\groups\BioinformaticsWorkshop\practicals\Thursday\Melchiotti\Results\topGO\
GSE2706_8h_topGO_results.csv
(on the Windows shared drive)
Q: How many functions are significantly enriched according to classic Fisher? How many according to elim
Fisher?
Q: How many functions are significantly enriched according to classic KS? How many according to elim KS?
This can be done directly in R by examining the data frame all_res_BP_just_pval.
print("Number of significant functions according to classic Fisher")
## [1] "Number of significant functions according to classic Fisher"
classicFisher <- which(all_res_BP_just_pval$classicFisher<0.05)

print(length(classicFisher))
## [1] 783
print("Number of significant functions according to elim Fisher")
## [1] "Number of significant functions according to elim Fisher"
elimFisher <- which(all_res_BP_just_pval$elimFisher<0.05)

print(length(elimFisher))
## [1] 540
print("Overlap between classic Fisher and elim Fisher")
## [1] "Overlap between classic Fisher and elim Fisher"
print(length(intersect(classicFisher,elimFisher)))
## [1] 437
print("Number of significant functions according to classic KS")
## [1] "Number of significant functions according to classic KS"
classicKS <- which(all_res_BP_just_pval$classicKS<0.05)

print(length(classicKS))
## [1] 837
print("Number of significant functions according to elim KS")
## [1] "Number of significant functions according to elim KS"
elimKS <- which(all_res_BP_just_pval$elimKS<0.05)

print(length(elimKS))
## [1] 536
print("Overlap between classic KS and elim KS")
## [1] "Overlap between classic KS and elim KS"
print(length(intersect(classicKS,elimKS)))
## [1] 494
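A caveat on the counts above, stated with some caution: GenTable appears to return its p-value columns as formatted text (the coercion warning printed further below is consistent with this, and very small values are typically shown as strings such as "< 1e-30"). When text is compared with a number, R compares strings, so the counts may differ from those obtained with numeric p-values. The sketch below uses hypothetical values to show the effect.

```r
# Hypothetical p-values formatted as text, in the style of a results table
p_text <- c("< 1e-30", "2.1e-05", "0.00068", "0.04885", "0.30000", "1.00000")

# Comparing text with a number converts the number to text and compares strings,
# so "2.1e-05" is not counted although it is far below 0.05
string_comparison <- p_text < 0.05

# Safer: strip the "<" marker, convert to numeric, then compare
p_numeric <- as.numeric(sub("^< *", "", p_text))
numeric_comparison <- p_numeric < 0.05

# Values that are below 0.05 but missed by the string comparison
missed <- p_text[numeric_comparison & !string_comparison]
print(sum(numeric_comparison))
print(missed)
```

Numeric p-values can also be taken directly from the test results with score(), as done in the plotting calls later in this section.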
Q: Can you find an example of a function that is enriched when running KS but not when running Fisher?
results<-all_res_BP_just_pval[which(as.numeric(all_res_BP_just_pval$classicKS)<0.05
& as.numeric(all_res_BP_just_pval$classicFisher)
>0.05),]
## Warning in which(as.numeric(all_res_BP_just_pval$classicKS) < 0.05 &

## as.numeric(all_res_BP_just_pval$classicFisher) > : NAs introduced by
## coercion
print(tail(results))
## GO.ID Term Annotated

## 4960 GO:0072182 regulation of nephron tubule epithelial ... 11
## 5024 GO:0097031 mitochondrial respiratory chain complex ... 15
## 5080 GO:1902186 regulation of viral release from host ce... 15
## 5084 GO:1902253 regulation of intrinsic apoptotic signal... 16
## 5116 GO:2000696 regulation of epithelial cell differenti... 13
## 5129 GO:2001251 negative regulation of chromosome organi... 41
## Significant Expected Rank in classicFisher classicFisher elimFisher
## 4960 0 0.28 4960 1.00000 1.00000
## 5024 0 0.38 5024 1.00000 1.00000
## 5080 0 0.38 5080 1.00000 1.00000
## 5084 0 0.41 5084 1.00000 1.00000
## 5116 0 0.33 5116 1.00000 1.00000
## 5129 0 1.05 5129 1.00000 1.00000
## classicKS elimKS
## 4960 0.01102 0.01102
## 5024 0.00068 1.00000
## 5080 0.03267 0.03267
## 5084 0.01886 0.01886
## 5116 0.04885 0.04885
## 5129 0.04448 0.04448
Plot the results on the Gene Ontology tree
# Plot and save results

showSigOfNodes(GO_data_BP, score(results_Fisher_BP), firstSigNodes = 10, useInfo = "all")
[Figure: GO graph of the 10 most significant Biological Process terms according to the classic Fisher test, with their ancestors up to the root GO:0008150 (biological_process). Each node shows the GO ID, the truncated term name, the p-value and the number of significant / annotated genes.]
## $dag
## A graphNEL graph with directed edges
## Number of Nodes = 15
## Number of Edges = 20
##
## $complete.dag
## [1] "A graph with 15 nodes."
pdf(paste(working_dir,prefix,"_Fisher_BP_just_pval.pdf",sep=""))
showSigOfNodes(GO_data_BP, score(results_Fisher_BP), firstSigNodes = 10, useInfo = "all")
## $dag
##
## $complete.dag
dev.off()
## pdf
## 2
showSigOfNodes(GO_data_BP, score(results_Fisher_elim_BP), firstSigNodes = 5,
useInfo = "all")
[Figure: GO graph of the 5 most significant Biological Process terms according to the elim Fisher test, with their ancestors up to the root GO:0008150 (biological_process). Each node shows the GO ID, the truncated term name, the p-value and the number of significant / annotated genes.]
## $dag
##
## $complete.dag
pdf(paste(working_dir,prefix,"_Fisher_BP_elim_just_pval.pdf",sep=""))
showSigOfNodes(GO_data_BP, score(results_Fisher_elim_BP), firstSigNodes = 5,
useInfo = "all")
## $dag
##
## $complete.dag
dev.off()
## pdf
## 2
showSigOfNodes(GO_data_BP, score(results_KS_BP), firstSigNodes = 10, useInfo = "all")
[Figure: GO graph of the 10 most significant Biological Process terms according to the classic KS test, with their ancestors up to the root GO:0008150 (biological_process). Each node shows the GO ID, the truncated term name, the p-value and the number of significant / annotated genes.]
## $dag
##
## $complete.dag
pdf(paste(working_dir,prefix,"_KS_BP_just_pval.pdf",sep=""))
showSigOfNodes(GO_data_BP, score(results_KS_BP), firstSigNodes = 10, useInfo = "all")
## $dag
##
## $complete.dag
dev.off()
## pdf
## 2
showSigOfNodes(GO_data_BP, score(results_KS_elim_BP), firstSigNodes = 5, useInfo = "all")
[Figure: GO graph of the 5 most significant Biological Process terms according to the elim KS test, with their ancestors up to the root GO:0008150 (biological_process). Each node shows the GO ID, the truncated term name, the p-value and the number of significant / annotated genes.]
## $dag
##
## $complete.dag
pdf(paste(working_dir,prefix,"_KS_BP_elim_just_pval.pdf",sep=""))
showSigOfNodes(GO_data_BP, score(results_KS_elim_BP), firstSigNodes = 5, useInfo = "all")
## $dag
##
## $complete.dag
dev.off()
## pdf
## 2
The following code contains two useful functions to retrieve the full name of a GO term and all genes belonging
to a particular GO category since topGO tends to truncate GO term full names in the tables and graphs.
# Retrieve full name of a GO term and all genes belonging to that category
go_id<-"GO:0045087"
print(paste("Full name of term GO:0045087:",Term("GO:0045087"),sep=""))
## [1] "Full name of term GO:0045087:innate immune response"
ensembl <- useMart("ensembl",dataset="hsapiens_gene_ensembl")

gene.data <- getBM(attributes=c('hgnc_symbol', 'go_id'),
filters = 'go_id', values = go_id, mart = ensembl)
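# nrow() counts the returned gene rows; length() on a data frame would count its columns,
# and a second argument to print() is not appended to the message
print(paste("Number of genes contained in GO term GO:0045087:",nrow(gene.data)))
# (output not shown: the count depends on the Ensembl release queried)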
print(head(gene.data))
## hgnc_symbol go_id
## 1 IPO7 GO:0045087
## 2 TRIM27 GO:0045087
## 3 IFNW1 GO:0045087
## 4 PIK3C3 GO:0045087
## 5 TOLLIP GO:0045087
## 6 DUSP7 GO:0045087
Q: Compare the results of the two Fisher’s tests (classic and elim) using the PDF files just generated. Which
nodes lose significance using the elim version?
Q: Compare the results of the classic Fisher’s test with the results of the classic KS test using the PDF files
just generated. Are there any differences?
Close RStudio and MobaXterm.
8. Help Links
1. DAVID: http://david.abcc.ncifcrf.gov/helps/functional_annotation.html
2. GSEA: http://www.broadinstitute.org/gsea/doc/desktop_tutorial.jsp
3. PathVisio: http://www.pathvisio.org/documentation/
4. topGO: http://www.bioconductor.org/packages/release/bioc/vignettes/topGO/inst/doc/topGO.pdf