
Pathway analysis from transcriptomics data
Main methods for pathway analysis
• Over-representation analysis
• Gene set enrichment analysis (GSEA)
• Topology-based pathway analysis
• Perturbation signature-based pathway analysis
Over-representation analysis
• Involves comparing a list of differentially expressed genes with a reference database of
pathways to determine if certain pathways are over-represented in the dataset.

• It aims to identify biological pathways that are statistically enriched with a higher number
of genes or proteins of interest than would be expected by chance.

• It provides binary results (significant or not) for each gene set without a continuous
measure of enrichment.
Over-representation analysis

KEGG/Reactome/GO
Over-representation analysis
Ulcerative colitis (UC) patients who have failed anti-TNF therapy vs UC patients who are anti-TNF naïve

Number of DEGs = 11
Background genes (i.e. genes detected in microarray) = 19917
Gene sets from GO database

Over-representation analysis
Antimicrobial humoral response (GO:0019730)

                          Differentially     Not differentially   Total
                          expressed genes    expressed genes
Present in gene set       2 (k)              104                  106 (K)
Absent from gene set      6 (n-k)            14064                14070 (N-K)
Total                     8 (n)              14168 (N-n)          14176 (N)

p = 0.0015

Gene ratio = k/n = 2/8; Background ratio = K/N = 106/14176

What is the probability of seeing at least 2 out of the 8 DEGs annotated to this particular
GO term, given the proportion of background genes that are annotated to that term?
What is the probability of getting at least 2 red balls when drawing a sample containing 8 balls (without
replacing the balls back into the bag), when there are 106 red balls in the total population of 14176 balls?
Use hypergeometric probability distribution.
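
The question above can be checked numerically. The following is a minimal sketch in base R, assuming the counts from the contingency table (2 of 8 DEGs and 106 of 14176 background genes annotated to GO:0019730); the variable names are illustrative only.

```r
# Counts for GO:0019730, taken from the contingency table
k <- 2       # DEGs annotated to the term
n <- 8       # DEGs in the test (the "sample" of balls drawn)
K <- 106     # background genes annotated to the term (the "red balls")
N <- 14176   # all background genes (all balls in the bag)

# P(X >= k) = 1 - P(X <= k - 1) under the hypergeometric distribution
p_hyper <- phyper(k - 1, K, N - K, n, lower.tail = FALSE)

# The same question posed as a one-sided Fisher's exact test on the 2 x 2 table
tab <- matrix(c(k, n - k, K - k, N - K - (n - k)), nrow = 2,
              dimnames = list(c("in set", "not in set"), c("DE", "not DE")))
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

print(signif(p_hyper, 4))
```

Both calls answer the same "at least 2 red balls" question, so the two p values should agree with each other and with the unadjusted p value reported for this term by the enrichment tool.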
Over-representation analysis using clusterProfiler
> genelist_sig_DE_antiTNF
[1] "REG4" "MUC17" "TFPI2" "PYY" "DEFB4A" "TM4SF20" "REG1B" "C10orf99"
[9] "MS4A12" "LOC100288985" "C4orf7"

GO_ORA_antiTNF <- enrichGO(gene = genelist_sig_DE_antiTNF, universe = rownames(UNIFI_counts),
                           OrgDb = "org.Hs.eg.db", keyType = "SYMBOL", ont = "BP", pAdjustMethod = "fdr")

ID          Description                                                              GeneRatio  BgRatio    pvalue    p.adjust  qvalue    geneID        Count
GO:0061844  antimicrobial humoral immune response mediated by antimicrobial peptide  2/8        68/14176   0.000623  0.01994   0.011806  DEFB4A/REG1B  2
GO:0019730  antimicrobial humoral response                                           2/8        106/14176  0.001506  0.024096  0.014267  DEFB4A/REG1B  2

There are ~30,000 biological process terms in the GO database, so the previous statistical test is performed ~30,000 times.

With a p value threshold of 0.05, about 1 in 20 of these tests is expected to be significant due to chance alone, i.e. ~1,500 pathways could be false positives.

So, we need to correct for multiple testing!


Bonferroni correction => use a p value threshold of 0.05 / n, where n = number of tests, i.e. 0.05 / 30,000 ≈ 0.0000017. The problem is that it is VERY conservative.

False Discovery Rate (FDR) => expected proportion of false positives amongst all significant results. Typically, one
accepts 5% of the significant results being false positives.

The Benjamini-Hochberg procedure adjusts the p values to achieve a particular FDR:

• First sort all p values and rank them, with the smallest p value being given rank 1
• Calculate p(k) x m/k, where k = rank of each p value and m = total number of p values; then, working down from the
largest rank, replace each value by the smallest value found at its own or any larger rank, so that adjusted p values never
decrease as rank increases
• Use the adjusted p values to select the positive results at the desired FDR (e.g. adjusted p < 0.05)
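
The corrections above can be sketched in base R as follows. The four raw p values are made up for illustration; `p.adjust()` is the built-in function, and its "fdr" method is an alias for "BH".

```r
# Hypothetical raw p values from four tests
p <- c(0.01, 0.04, 0.03, 0.005)
m <- length(p)

# Bonferroni: compare each raw p value with alpha / m
bonferroni_threshold <- 0.05 / m

# Benjamini-Hochberg by hand
ord        <- order(p)                         # rank 1 = smallest p value
scaled     <- p[ord] * m / seq_len(m)          # p(k) * m / k
adj_sorted <- pmin(1, rev(cummin(rev(scaled))))  # keep values monotone, cap at 1
bh_manual  <- numeric(m)
bh_manual[ord] <- adj_sorted                   # back to the original order

# Built-in equivalent
bh_builtin <- p.adjust(p, method = "BH")

print(rbind(raw = p, BH = bh_manual))
```

With only four tests the Bonferroni threshold is 0.0125; with ~30,000 tests it shrinks to roughly 0.0000017, which is why the FDR approach is usually preferred for pathway analysis.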
Over-representation analysis
• Advantages:
➢ Simple to compute
➢ Requires little input data

• Disadvantages:
➢ Need to carefully select the background
➢ Disregards the vast majority of the data
➢ Assumes all genes act independently, i.e. ignores interactions between genes
➢ Only considers the number of DEGs represented in each pathway rather than their positions within the
pathway
➢ Can lead to many false positives
➢ Does not provide any information on pathway activity
Functional class scoring methods e.g. Gene Set Enrichment Analysis (GSEA)
• GSEA assesses whether predefined reference sets of genes associated with each
pathway from a reference database are collectively upregulated or downregulated in an
experimentally-derived ranked gene list.
• It uses an enrichment score and permutation-based statistics to determine if gene sets
are significantly enriched, providing a continuous measure of enrichment.
• GSEA is suitable for identifying subtle, coordinated changes in gene expression and
doesn't rely on arbitrary cutoffs, which are required for ORA.
• It is particularly useful when studying complex biological conditions with multifaceted
gene regulation.
GSEA
[Figure: genes from three gene sets (T cell differentiation, Cellular division, Heart contraction) marked along the ranked gene list]

Which of these pathways show a non-random distribution across this sorted list?
GSEA answers this with a Kolmogorov-Smirnov-like running-sum statistic; the p value is obtained by permutation.
GSEA: Enrichment score plot

Maximum deviation from 0 = Enrichment Score


[Figure panels: T cell differentiation (positive ES); Cellular division (negative ES)]

Positive ES = Enrichment at the top of the ranked list
Negative ES = Enrichment at the bottom of the ranked list

Leading edge = Subset of members within the gene set that contribute most to the enrichment score.
It tells you which genes of a particular gene set are most important in your data.
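
The enrichment score and leading edge can be illustrated with a simplified sketch in base R. It assumes an unweighted running sum (each gene-set member adds 1/number of hits, every other gene subtracts 1/number of misses), whereas GSEA itself weights hits by the ranking metric. The function names and toy genes are hypothetical.

```r
# Running sum along the ranked list: up at gene-set members, down otherwise
running_sum <- function(ranked_genes, gene_set) {
  in_set <- ranked_genes %in% gene_set
  n_hit  <- sum(in_set)
  n_miss <- length(ranked_genes) - n_hit
  stopifnot(n_hit > 0, n_miss > 0)
  cumsum(ifelse(in_set, 1 / n_hit, -1 / n_miss))
}

# Enrichment score = maximum deviation of the running sum from 0
enrichment_score <- function(ranked_genes, gene_set) {
  rs <- running_sum(ranked_genes, gene_set)
  rs[which.max(abs(rs))]
}

# Leading edge = gene-set members up to the peak (positive ES)
# or after the peak (negative ES)
leading_edge <- function(ranked_genes, gene_set) {
  rs     <- running_sum(ranked_genes, gene_set)
  peak   <- which.max(abs(rs))
  in_set <- ranked_genes %in% gene_set
  pos    <- seq_along(rs)
  if (rs[peak] >= 0) {
    ranked_genes[in_set & pos <= peak]
  } else {
    ranked_genes[in_set & pos > peak]
  }
}

ranked     <- paste0("g", 1:10)          # g1 = most upregulated
top_set    <- c("g1", "g2", "g4")
bottom_set <- c("g9", "g10")

es_top    <- enrichment_score(ranked, top_set)
es_bottom <- enrichment_score(ranked, bottom_set)
le_top    <- leading_edge(ranked, top_set)
```

A set clustered at the top of the list gives a positive score and a set clustered at the bottom gives a negative one, matching the two panels described above.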
GSEA: normalised enrichment score (NES)

The NES accounts for differences in gene set size, allowing enrichment scores to be compared
between pathways
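
One common way to account for gene set size is to divide the observed ES by the mean of a null distribution of ES values for random gene sets of the same size, giving a normalised enrichment score (NES). The sketch below assumes an unweighted running sum and gene-set permutation; the function names, the toy ranked list and the number of permutations are illustrative and not taken from the slides.

```r
set.seed(1)

# Unweighted enrichment score (maximum deviation of the running sum from 0)
enrichment_score <- function(ranked_genes, gene_set) {
  in_set <- ranked_genes %in% gene_set
  n_hit  <- sum(in_set)
  rs <- cumsum(ifelse(in_set, 1 / n_hit, -1 / (length(ranked_genes) - n_hit)))
  rs[which.max(abs(rs))]
}

# NES = observed ES / mean |ES| of same-sign scores from random sets of equal size
normalised_es <- function(ranked_genes, gene_set, n_perm = 1000) {
  es       <- enrichment_score(ranked_genes, gene_set)
  set_size <- sum(ranked_genes %in% gene_set)
  null_es  <- replicate(n_perm,
                        enrichment_score(ranked_genes, sample(ranked_genes, set_size)))
  same_sign <- null_es[sign(null_es) == sign(es)]
  es / mean(abs(same_sign))
}

ranked    <- paste0("gene", 1:200)
small_set <- ranked[1:5]                     # 5 genes at the very top
large_set <- ranked[seq(1, 100, by = 2)]     # 50 genes spread over the top half

nes_small <- normalised_es(ranked, small_set)
nes_large <- normalised_es(ranked, large_set)
```

Small random sets reach large ES values by chance more easily than large ones, so dividing by a size-matched null puts the two sets on a comparable scale.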
GSEA
• Advantages:
➢ More accurate than ORA
➢ Uses the entire list of measured genes from transcriptomics data (i.e. no need to filter a DEG list first)

• Disadvantages:
➢ Assumes all genes act independently, i.e. ignores interactions between genes
➢ Analyses each pathway independently
➢ Can also result in false positives
➢ No information on pathway activity
Topology-based pathway analysis

• These methods go beyond simply considering the presence or absence of genes in pathways and instead take
into account the network and interaction characteristics of the genes and proteins within pathways.
• Network Structure: Topology-based methods consider the topology, or the network structure, of pathways.
They examine how genes or proteins interact within pathways, including the type and strength of interactions.
• Node Centrality: These methods often incorporate measures of node centrality, such as degree centrality (the
number of connections a node has), betweenness centrality (the importance of a node in connecting other
nodes), or other network centrality metrics to assess the significance of individual genes or proteins within
pathways.
• Pathway Cross-Talk: Topology-based analysis can identify cross-talk between pathways, showing how genes
or proteins in one pathway may also be involved in other related pathways, revealing the interconnected nature
of biological processes.
• Functional Impact: These methods aim to determine the functional impact of specific genes or proteins within
pathways, considering their position and interactions within the network.
• Pathway Rewiring: Topology-based analysis can identify instances where pathways are rewired in response
to genetic mutations or experimental conditions, helping to understand the dynamics of pathway regulation.
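
As a small illustration of the node centrality idea, the sketch below computes degree centrality for a made-up signalling pathway in base R. The node names and edges are hypothetical and do not come from any pathway database.

```r
# Hypothetical directed pathway: each row is an interaction from -> to
edges <- data.frame(
  from = c("Receptor", "Receptor", "KinaseA", "KinaseA", "KinaseB"),
  to   = c("KinaseA",  "KinaseB",  "KinaseB", "TF",      "TF"),
  stringsAsFactors = FALSE
)

nodes <- sort(unique(c(edges$from, edges$to)))

# Count outgoing and incoming edges per node
out_degree <- table(factor(edges$from, levels = nodes))
in_degree  <- table(factor(edges$to,   levels = nodes))

# Degree centrality = total number of connections of each node
degree <- setNames(as.vector(out_degree + in_degree), nodes)
most_connected <- names(degree)[degree == max(degree)]

print(degree)
```

A topology-based method could use such a measure to give more weight to a differentially expressed gene that sits at a highly connected position in the pathway.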
Topology-based pathway analysis: Signalling Pathway Impact Analysis (SPIA)
• The most well-known topology-based pathway analysis method is SPIA.
• Two lines of evidence for the differential expression of a pathway are combined:
o ORA
o Pathway topology, reflected in the perturbation factor
▪ The authors assume that a differentially expressed gene at the beginning of a pathway topology (e.g. a receptor in a
signaling pathway) has a stronger effect on the functionality of a pathway than a differentially expressed gene at the
end of a pathway (e.g. a transcription factor in a signaling pathway).
▪ The perturbation factors of all genes are calculated from a system of linear equations and then combined within a
pathway.

• The two lines of evidence, in the form of p-values, are combined into a global p-value,
which is used to rank the pathways.
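
The combination step can be sketched in base R. This assumes the rule described in the original SPIA publication, where the product c of the two p-values is turned into a global p-value as c - c * ln(c); the function name and input values are illustrative.

```r
# Combine the ORA p-value and the perturbation p-value into a global p-value.
# For two independent p-values with product c, P(product <= c) = c - c * log(c).
combine_spia_p <- function(p_ora, p_pert) {
  c_prod <- p_ora * p_pert
  c_prod - c_prod * log(c_prod)
}

# Hypothetical pathway with modest evidence from both sources
p_global <- combine_spia_p(p_ora = 0.05, p_pert = 0.05)

print(signif(p_global, 4))
```

Two individually borderline p-values of 0.05 combine into a global p-value of about 0.017, so agreement between the two lines of evidence moves a pathway up the ranking.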
Topology-based pathway analysis: ROntoTools [1]

• We hypothesize that better results could be achieved if one distinguishes
between genes that are true sources of perturbation, e.g., due to mutations,
copy number variations, epigenetic changes, etc., and genes that merely
respond to perturbation signals coming from upstream.
• Intuitively, a pathway should be more significantly impacted if it hosts more
genes that are such true sources of perturbation. The method proposed here is
an attempt at capturing these differences by calculating a “primary
dysregulation” for every gene and using these values to compute a total pathway
perturbation and its significance.
• Another issue with traditional topology-based analysis approaches
is the need to select differentially expressed (DE) genes. The authors
provide both an approach based on DE genes (cut-off-based approach) and
another approach that includes all genes (threshold-free approach).

[1] Ansari et al. 2016: A Novel Pathway Analysis Approach Based on the Unexplained Disregulation of Genes
Topology-based pathway analysis: ROntoTools

• The measured expression change of a gene in a given phenotype can be seen as the result of influences from
upstream genes superimposed on the dis-regulation incurred by that particular gene itself. We will refer to this
latter quantity as the primary dis-regulation (pDis).
• The diffusion of signals between genes in regulatory networks, called “network propagation,” can be used to
find the active genes and subnetworks as well as the function of the genes in different conditions.
• Here, we are using a similar approach that uses propagation between genes to calculate pDis in order to find
the most impacted pathways.
• We propose a pathway analysis method that focuses on this primary dysregulation
Topology-based pathway analysis: ROntoTools

KEGG pathway (pathNames)                  pPert      pPert.fdr
PPAR signaling pathway                    0.004975   0.014627
MAPK signaling pathway                    0.004975   0.014627
Calcium signaling pathway                 0.004975   0.014627
Cytokine-cytokine receptor interaction    0.004975   0.014627
Chemokine signaling pathway               0.004975   0.014627
NF-kappa B signaling pathway              0.004975   0.014627
HIF-1 signaling pathway                   0.004975   0.014627
Neuroactive ligand-receptor interaction   0.004975   0.014627
Cell cycle                                0.004975   0.014627
p53 signaling pathway                     0.004975   0.014627
Endocytosis                               0.004975   0.014627
Phagosome                                 0.004975   0.014627
PI3K-Akt signaling pathway                0.004975   0.014627
Wnt signaling pathway                     0.004975   0.014627
Axon guidance                             0.004975   0.014627
Osteoclast differentiation                0.004975   0.014627
Focal adhesion                            0.004975   0.014627
ECM-receptor interaction                  0.004975   0.014627
Tight junction                            0.004975   0.014627
Complement and coagulation cascades       0.004975   0.014627
Toll-like receptor signaling pathway      0.004975   0.014627
Topology-based pathway analysis: ROntoTools
Perturbation propagation of the TGFb signaling pathway
Topology-based pathway analysis

• Over 30 tools and methods fall in this category including Pathway-Express, SPIA,
NetGSA, TopoGSA, TopologyGSA, PWEA, PathOlogist, GGEA, cepaORA, cepaGSA,
PathNet, ROntoTools, BLMA etc
Perturbation signature-based pathway analysis (e.g. PROGENy)

• Most pathway approaches either use the set of signaling molecules in a pathway (e.g. ORA/GSEA) or
infer or incorporate their structure (topology-based methods) to make
statements about possible activation of a pathway, while signature-based approaches
such as PROGENy consider the genes affected by actually perturbing the pathway.
• Aims to infer the activity of signalling pathways based on gene expression data in the
context of genetic or molecular perturbations.
• PROGENy leverages a large compendium of pathway-responsive gene signatures
derived from a wide range of different conditions in order to identify genes that are
consistently deregulated when perturbing a particular pathway, i.e. to identify a common
core of Pathway RespOnsive GENes to a specified set of stimuli.
• While this approach has been taken before, previous studies either focused less on
integrating responses from many different cell lines or derived their scores from a much
smaller collection of perturbation experiments.
Perturbation signature-based pathway analysis (e.g. PROGENy)
• The authors curated a total of 208 different submissions to ArrayExpress/GEO, spanning perturbations of the
11 pathways EGFR, MAPK, PI3K, VEGF, JAK-STAT, TGFb, TNFa, NFkB, Hypoxia, p53-mediated DNA
damage response, and Trail (apoptosis). This consisted of 568 experiments and 2652 microarrays.
• They calculated z-scores of gene expression changes for each experiment.
• For each pathway, they identified the 100 responsive genes that are most consistently deregulated across
experiments. Interestingly, these responsive genes are specific to the perturbed pathway and have little
overlap with genes encoding its signaling proteins → highlighting that pathway expression and
activation are distinct processes.
• They used the z-scores of those 100 pathway-responsive genes in a simple, yet effective, linear model
to infer pathway activity from gene expression, called PROGENy, i.e. it assesses how closely the
gene expression in a sample matches the expected response patterns for genes within each pathway.
• A scoring algorithm then computes pathway activity scores for each sample. This score reflects
the likelihood that a particular pathway is active in a given sample based on the gene expression data.
A high score indicates a high likelihood of pathway activity, while a low score suggests pathway
inactivity.
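
The linear model can be pictured as a weighted sum of gene expression values. The sketch below is a toy illustration in base R, not the PROGENy implementation: the genes, pathways, expression values and weights are all invented, and scaling of the scores is omitted.

```r
# Hypothetical z-scored expression: 4 genes x 2 samples
expr <- matrix(c( 2.0, -1.0,
                  1.0,  0.0,
                 -1.5,  0.5,
                  0.0,  2.0),
               nrow = 4, byrow = TRUE,
               dimnames = list(c("geneA", "geneB", "geneC", "geneD"),
                               c("sample1", "sample2")))

# Hypothetical weights: genes x pathways (0 = gene not in that signature)
weights <- matrix(c( 1.0,  0.0,
                     0.5,  0.0,
                    -1.0,  0.0,
                     0.0,  2.0),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(rownames(expr), c("pathwayX", "pathwayY")))

# Pathway score per sample = sum over genes of (expression x weight)
scores <- t(expr) %*% weights

print(scores)
```

In sample1, geneC has a negative weight and is downregulated, so it adds to the pathwayX score rather than subtracting from it.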
PROGENy advantages

• Advantages:
➢ Infer pathway activity: Able to more accurately infer the activity of signalling pathways by providing
continuous pathway activity scores, allowing for a more granular assessment of pathway
involvement. In contrast, GSEA and ORA only tell you whether a pathway is enriched or not
enriched.
➢ Capturing Complex Changes: Perturbation-based analysis can capture subtle and complex
changes in gene expression patterns, which may not be easily detected by binary methods like
ORA. It provides a more nuanced understanding of how pathways are altered.

• Disadvantages:
➢ Computationally more intensive
➢ Coverage is low (n = 14 pathways currently in PROGENy)
Perturbation-based pathway analysis (PROGENy)
To access it we can use decoupleR. To run decoupleR methods, we need an input matrix (mat), an input prior knowledge
network/resource (net), and the names of the columns of net that we want to use.

library(decoupleR)

# PROGENy model: top 500 responsive genes per pathway
net <- get_progeny(organism = 'human', top = 500)

# Expression matrix, with rows containing missing values removed
counts <- na.omit(UNIFI_counts)
counts <- as.matrix(counts)

# Column 3 of the differential expression results (DE_TNF), kept as a one-column matrix
deg <- DE_TNF[, 3, drop = FALSE]
deg <- as.matrix(deg)

# To generate heatmap of pathway activity per sample
sample_acts <- run_wmean(mat = counts, net = net, .source = 'source', .target = 'target',
                         .mor = 'weight', times = 100, minsize = 5)

# To contrast pathway activities between conditions
contrast_acts <- run_wmean(mat = deg, net = net, .source = 'source', .target = 'target',
                           .mor = 'weight', times = 100, minsize = 5)
Perturbation-based pathway analysis (PROGENy)
[Figure: heatmap of pathway activity per sample; rows = Patient 1 to Patient 6]

PROGENy pathway analysis in anti-TNF treated vs anti-TNF untreated UC patients
[Figure: dot plot; colour = activity score, size of dot = 1 / p-value]
Pathway responsive genes and their strength of differential expression in anti-TNF treated vs untreated

• The t value is a measure of the strength and direction of the differential expression of a gene between groups
• PROGENy assigns weight values to each gene within a pathway. These weight values are determined based on various factors, including the biological
significance of the gene within the pathway and its expression pattern in the experimental data. The weight represents the contribution of the gene to the
overall activity of the pathway.
• A negative weight for a gene within a pathway suggests that the expression of that gene is negatively correlated with the activity of the pathway. In other
words, when this gene's expression increases, it tends to suppress or inhibit the activity of the pathway → thus, such genes negatively regulate the pathway
Which pathway analysis method is the best?
• Here, for the first time, we present a comparison of the performances of 13 representative
pathway analysis methods on 86 real data sets from two species: human and mouse.
• Aimed to answer the following questions:
i. Is there any difference in performance between non-topology-based (non-TB) and topology-based (TB) methods?
ii. Is there a method that is consistently better than the others in terms of its ability to
identify target pathways, accuracy, sensitivity, specificity, and the area under the
receiver operating characteristic curve (AUC)?
iii. Are there any specific pathways that are biased (in the sense of being more likely
or less likely to be significant across all methods)?
iv. Do specific methods have a bias toward specific pathways (e.g., is pathway X likely
to be always reported as significant by method Y)?
Which pathway analysis method is the best?

Used 75 human data sets related to 15 different diseases, with each disease being
represented by five different data sets, to evaluate the ability of methods to identify
(rank) the target pathways.

In this example, a data set of Alzheimer’s disease is examined,
and thus, the target pathway is “Alzheimer’s disease.” Each
method produces lists of ranks and p values of the target
pathways, which are then used to assess its performance.
Which pathway analysis method is the best?
[Figure: the ranks and p values of target pathways derived by 13 methods (lower is better for both ranks and p values)]


Which pathway analysis method is the best?

• But this approach focuses solely on one true positive, the target pathway. We
do not know what other pathways are also truly impacted and therefore cannot
evaluate other criteria such as the accuracy, specificity, sensitivity, and the
AUC of a method. To address this, we use data sets from knockout (KO)
experiments, where the source of the perturbation is known,
i.e., the KO gene.
• We consider pathways containing the KO gene as true positives and the
others as true negatives.
• Subsequently, we calculate the accuracy, sensitivity, specificity, and AUC of
the methods studied using 11 KO data sets.
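
The evaluation criteria above can be made concrete with a small base R sketch. The ten pathways, their p values and the 0.05 cut-off are invented for illustration; pathways containing the KO gene are treated as the true positives, as described above.

```r
# Hypothetical result for one KO data set: 10 pathways, 3 contain the KO gene
contains_ko <- c(TRUE, TRUE, TRUE, rep(FALSE, 7))
p_values    <- c(0.01, 0.20, 0.03, 0.04, 0.50, 0.70, 0.30, 0.90, 0.60, 0.80)
called      <- p_values < 0.05          # pathways reported as significant

tp <- sum(called & contains_ko)         # true positives
fp <- sum(called & !contains_ko)        # false positives
tn <- sum(!called & !contains_ko)       # true negatives
fn <- sum(!called & contains_ko)        # false negatives

sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
accuracy    <- (tp + tn) / length(called)

# AUC via the rank-sum identity: the probability that a randomly chosen
# true-positive pathway gets a smaller p value than a true-negative one
n_pos <- sum(contains_ko)
n_neg <- sum(!contains_ko)
ranks <- rank(-p_values)                # smaller p value = higher rank
auc   <- (sum(ranks[contains_ko]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

print(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, auc = auc))
```

Unlike the other three measures, the AUC does not depend on the 0.05 cut-off, because it summarises performance across all possible thresholds.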
Which pathway analysis method is the best?
• ROntoTools and PADOG have the highest median value of accuracy (0.91).
• ROntoTools also has the highest median value of specificity (0.94).
• All methods show rather low sensitivity. Among them, KS is the best one, with a median
sensitivity of 0.2.
• AUC is the most comprehensive and important measure because it combines both the
sensitivity and specificity across all possible thresholds.
• In conclusion, TB methods outperform non-TB methods in all aspects, namely ranks and
p values of target pathways, and the AUC. Moreover, the results suggest that there is still
room for improvement since the ranks of target pathways are still far from optimal in both groups.
Are some pathways particularly biased during pathway analysis?
• They created a true null hypothesis by using simulated data sets constructed by randomly
selecting healthy samples from the 75 aforementioned data sets.
• They then applied each method more than 2000 times, each time on a different simulated data set.
• Each pathway for each method then has an empirical null distribution of p values resulting
from those 2000 runs.
• When the null hypothesis is true, p values obtained from any sound statistical test should be
uniformly distributed between 0 and 1.
• A null distribution of p values of a pathway generated by a method skewed to the right
(biased toward 0) shows that this method has a tendency to yield low p values and
therefore report the pathway as significantly impacted even when it is not (false positive).
• A null distribution of p values of a pathway skewed to the left (biased toward 1) indicates
that the given method tends to produce consistently higher p values and thus possibly report
this pathway as insignificant when it is indeed impacted (false negative).
[Figure: frequency histograms of null p values]
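
The uniformity property can be checked with a small simulation in base R. This is an assumption-laden stand-in for the study design: instead of running a pathway method on resampled healthy samples, it repeats a two-sample t test 2000 times on random data with no true difference.

```r
set.seed(42)
n_runs <- 2000

# Each run compares two groups drawn from the same distribution (true null)
null_p <- replicate(n_runs, {
  group_a <- rnorm(10)
  group_b <- rnorm(10)
  t.test(group_a, group_b, var.equal = TRUE)$p.value
})

# Under a sound test, about 5% of null p values fall below 0.05
frac_sig <- mean(null_p < 0.05)

# Counts per bin of width 0.1; roughly equal counts indicate uniformity
bin_counts <- table(cut(null_p, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE))

print(frac_sig)
print(bin_counts)
```

A method whose null p values piled up in the first bins would be biased toward 0 (false positives), and one whose values piled up in the last bins would be biased toward 1 (false negatives).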
Which pathway analysis method is the best?
• The number of biased pathways is at least 66 for all the methods compared in this work, except
GSEA, which has no biased pathway.
• The figure shows that performing pathway analysis using the FE (Fisher's exact) test produces the highest
number (137 out of 150 pathways) of false positives (biased toward 0); this is followed by
the WRS (Wilcoxon rank sum) test (114 out of 150 pathways) and CePaGSA (112 out of 186 pathways). On the
other hand, GSEA and PathNet produce no false positive pathways.
• Similarly, the numbers of pathways biased toward 1 (false negatives) produced by different methods are
shown in Fig. 6c. PathNet produces the highest number (129 out of 130 pathways) of false
negative pathways. No false negative pathways are identified while performing pathway analysis
using GSEA, CePaGSA, WRS test, and FE test.
[Figure panels: the numbers of pathways biased toward 0 (false positives); the numbers of pathways biased toward 1 (false negatives)]
Which pathway analysis method is the best?
• The resulting graph indicates that there is no such “ideal" unbiased pathway. Each
pathway is biased by at least 2 out of 13 investigated methods.
• Some pathways are biased by as many as 12 methods (out of 13 methods). The
common characteristic of these most biased pathways is that they are small in
size (less than 50 genes), except for “PPAR signaling pathway” (259 genes) and
“Complement and coagulation cascades” (102 genes). In contrast, all pathways in the
top 10 least biased have more than 200 genes and up to 2806 genes.
• In essence, small pathways are generally more likely to be biased than larger ones.
