Evaluating Sources of Variability in Pathway Profiling

BIOINFORMATICS, Vol. 00 no. 00 2011, Pages 1–9
Annalisa Barla(1), Samantha Riccadonna(2), Salvatore Masecchia(1), Margherita Squillario(1), Michele Filosi(2,3), Giuseppe Jurman(2) and Cesare Furlanello(2)†
arXiv:1201.3216v1 [q-bio.MN] 16 Jan 2012
(1) DISI, University of Genoa, via Dodecaneso 35, I-16146 Genova, Italy.
(2) FBK, via Sommarive 18, I-38123 Povo (Trento), Italy.
(3) CIBIO, University of Trento, via Delle Regole 101, I-38123 Mattarello (Trento), Italy.
∗ These authors contributed equally to the work.
† To whom correspondence should be addressed.
© Oxford University Press 2011.

ABSTRACT

Motivation: A bioinformatics platform is introduced aimed at identifying models of disease-specific pathways, as well as a set of network measures that can quantify changes in terms of global structure or single link disruptions. The approach integrates a network comparison framework with machine learning molecular profiling. The platform includes different tools combined in one Open Source pipeline, supporting reproducibility of the analysis. We describe here the computational pipeline and explore the main sources of variability that can affect the results, namely the classifier, the feature ranking/selection algorithm, the enrichment procedure, the inference method and the network comparison function.
Results: The proposed pipeline is tested on a microarray dataset of late stage Parkinson's Disease patients together with healthy controls. Choosing different machine learning approaches, we get low pathway profiling overlap in terms of common enriched elements. Nevertheless, they identify different but equally meaningful biological aspects of the same process, suggesting the integration of information across different methods as the best overall strategy.
Availability: All the elements of the proposed pipeline are available as Open Source Software: availability details are provided in the main text.
Contact: annalisa.barla@unige.it

1 INTRODUCTION

We present a computational framework for the study of reproducibility in network medicine studies (Barabasi et al., 2011). Networks, molecular pathways in particular, are increasingly looked at as a better organized and richer version of gene signatures. However, high variability can be injected by the different methods that are typically used in systems biology to define a cellular wiring diagram at diverse levels of organization, from transcriptomics to signalling, of the functional design. For example, to identify the link between changes in graph structures and disease, we choose and combine in a workflow a classifier, the feature ranking/selection algorithm, the enrichment procedure, the inference method and the network comparison function. Each of these components is a potential source of variability, as shown in the case of biomarkers from microarrays (The MAQC Consortium, 2010). Considerable efforts have been directed to tackle the problem of poor reproducibility of biomarker signatures derived from high-throughput -omics data (The MAQC Consortium, 2010), addressing the issues of selection bias (Ambroise and McLachlan, 2002; Furlanello et al., 2003) and more recently of pervasive batch effects (Leek et al., 2010). We argue that it is now urgent to adopt a similar approach for network medicine studies. Stability (and thus reproducibility) in this class of studies is still an open problem (Baralla et al., 2009). Underdeterminacy is a major issue (De Smet and Marchal, 2010), as the ratio between network dimension (number of nodes) and the number of available measurements to infer interactions plays a key role for the stability of the reconstructed structure. Furthermore, the most interesting applications are based on inferring network topology and wiring from high-throughput noisy measurements (He et al., 2009).

Despite its common use even in biological contexts (Sharan and Ideker, 2006), the problem of quantitatively comparing networks (e.g., using a metric instead of evaluating network properties) is still an open issue in many scientific disciplines. The central problem is of course which network metrics should be used to evaluate stability, whether focusing on local changes or global structural changes. As discussed in Jurman et al. (2011), the classic distances in the edit family focus only on the portions of the network affected by the differences in the presence/absence of matching links. Spectral distances - based on the list of eigenvalues of the Laplacian matrix of the underlying graph - are instead particularly effective for studying global structures. In particular, the Ipsen-Mikhailov (Ipsen and Mikhailov, 2002) distance was found robust in a wide range of situations (Jurman et al., 2011). However, global distances can be tricked by isomorphic or nearly isomorphic graphs. In Jurman et al. (2012), both approaches are improved by proposing a glocal measure which combines a spectral distance with a typical Hamming local editing component. In this paper we use this new tool to quantify how the stability of network reconstruction is modified in practice by the different inference and enrichment methods.
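The two distance families contrasted above can be made concrete with a small sketch. It is not the authors' script (which is available from them on request): it assumes the usual formulation of the Ipsen-Mikhailov distance (a Lorentzian-smoothed density of the square roots of the Laplacian eigenvalues), an illustrative kernel width gamma = 0.08, a truncated numerical integral, and a Euclidean product of H and ε scaled by 1/√2 as one possible reading of "product metric"; the exact definition is in Jurman et al. (2012). The hand-written Jacobi eigenvalue routine is only there to keep the example free of external libraries.

```python
import math

def laplacian(adj):
    """L = D - A for a symmetric adjacency matrix with zero diagonal."""
    n = len(adj)
    return [[(sum(adj[i]) if i == j else -adj[i][j]) for j in range(n)]
            for i in range(n)]

def symmetric_eigenvalues(mat, sweeps=100, tol=1e-18):
    """Eigenvalues of a symmetric matrix by cyclic Jacobi rotations."""
    a = [list(row) for row in mat]
    n = len(a)
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                if a[p][p] == a[q][q]:
                    theta = math.pi / 4
                else:
                    theta = 0.5 * math.atan(2 * a[p][q] / (a[q][q] - a[p][p]))
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return sorted(a[i][i] for i in range(n))

def hamming(adj_a, adj_b):
    """Normalized Hamming (edit-like) distance between two networks on the same nodes."""
    n = len(adj_a)
    diff = sum(abs(adj_a[i][j] - adj_b[i][j])
               for i in range(n) for j in range(n) if i != j)
    return diff / (n * (n - 1))

def spectral_density(adj, gamma):
    """Lorentzian-smoothed density of the Laplacian 'frequencies' sqrt(lambda_i)."""
    eig = symmetric_eigenvalues(laplacian(adj))[1:]  # drop the smallest (zero) eigenvalue
    omegas = [math.sqrt(max(lam, 0.0)) for lam in eig]
    # Analytic normalization so that the density integrates to 1 on [0, inf)
    norm = sum(math.pi / 2 + math.atan(w / gamma) for w in omegas)
    def rho(x):
        return sum(gamma / ((x - w) ** 2 + gamma ** 2) for w in omegas) / norm
    return rho, max(omegas)

def ipsen_mikhailov(adj_a, adj_b, gamma=0.08, steps=4000):
    """Spectral distance: L2 distance between the two densities (trapezoid rule, truncated tail)."""
    rho_a, top_a = spectral_density(adj_a, gamma)
    rho_b, top_b = spectral_density(adj_b, gamma)
    h = (max(top_a, top_b) + 50 * gamma) / steps
    vals = [(rho_a(i * h) - rho_b(i * h)) ** 2 for i in range(steps + 1)]
    return math.sqrt(h * (sum(vals) - 0.5 * (vals[0] + vals[-1])))

def glocal(adj_a, adj_b, gamma=0.08):
    """Assumed combination: Euclidean product of H and epsilon, scaled by 1/sqrt(2)."""
    return math.hypot(hamming(adj_a, adj_b),
                      ipsen_mikhailov(adj_a, adj_b, gamma)) / math.sqrt(2)

# Toy example: a 3-node path versus a triangle
path3 = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
print(hamming(path3, triangle), ipsen_mikhailov(path3, triangle), glocal(path3, triangle))
```

The Hamming term reacts to the single link that differs, while the spectral term reacts to the change in the Laplacian spectrum as a whole; combining the two is what lets a glocal measure separate graphs that one component alone would judge identical.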

Pathway enrichment methods are widely used in bioinformatics analysis, for example to assess the relevance of biomarker lists or as a first step in network analysis. The enrichment step is performed by using the functional information stored in the Kyoto Encyclopedia of Genes and Genomes (KEGG) database (Kanehisa and Goto, 2000) and in the Gene Ontology (GO) database (Ashburner et al., 2000). The reconstruction of molecular pathways from high-throughput data is then based on the theory of complex networks (e.g. Strogatz (2001); Newman (2003); Boccaletti et al. (2006); Newman (2010); Buchanan et al. (2010)) and, in particular, on the reconstruction algorithms for inferring network topology and wiring from data (He et al., 2009).

The stability analysis is applied to networks identifying common and specific network substructures that could be associated to disease. The overall goal of the pipeline is to identify and rank the pathways reflecting major changes between two conditions, or during a disease evolution. We start from a profiling phase based on classifiers and feature selection modules organized in terms of experimental procedure by a Data Analysis Protocol (The MAQC Consortium, 2010), obtaining a ranked list of genes with the highest discriminative power. Classification tasks as well as quantitative phenotype targets can be considered by using different machine learning methods in this first phase. Alternative methods are made available as components of one Open Source pipeline system. The problem of underdeterminacy in the inference procedure is avoided by focusing only on subnetworks, and the relevance of the studied pathways for the disease is judged in terms of discriminative relevance for the underlying classification problem.

As a testbed for the glocal stability analysis, we compare network structures on a collection of microarray data for patients affected by Parkinson's disease (PD), together with a cohort of controls (Zhang et al., 2005b). PD is a neurodegenerative disorder that impairs the motor skills at the onset and the cognitive and speech functions subsequently. The goal of this task is to identify the most relevant disrupted pathways and genes in late stage PD. On this dataset, we show that choosing different profiling approaches we get low overlap in terms of common enriched pathways found. Despite this variability, if we consider the most disrupted pathways, their glocal distances (between case and control networks) share a common distribution, assessing equal meaningfulness to pathways found starting from different approaches.

2 MATERIALS AND METHODS

The machine learning pipeline adopted in this paper was originally introduced in Barla et al. (2011). As shown in Figure 1, it handles case/control transcription data through four main steps, from a profiling task to the identification of discriminant pathways. The pipeline is independent from the algorithms used: here we describe each step and the implementation adopted for the following experiments evaluating the impact of different enrichment methods in pathway profiling.

Formally, we are given a collection of n subjects, each described by a p-dimensional vector x of measurements. Each sample is also associated with a phenotypical label, e.g. y ∈ {1, −1}, assigning it to a class (in a classification task). The dataset is therefore represented by an n × p expression data matrix X, where p ≫ n, and a corresponding labels vector Y.

Feature Selection Step
The matrix X is used to feed the profiling part of the pipeline within a proper Data Analysis Protocol, which will ensure accurate and reproducible results (The MAQC Consortium, 2010). The prediction model M is built using two different algorithms for classification and feature ranking. The more recent one is the ℓ1ℓ2 regularization with double optimization, capable of selecting subsets of discriminative genes. The algorithm can be tuned to give a minimal set of discriminative genes or larger sets including correlated genes and it is based on the optimization principle presented in Zou and Hastie (2005). The implementation used consists of two stages (De Mol et al., 2009) and it is cast in nested loops of 10-fold cross-validation (Barla et al., 2008). The first stage identifies the minimal set of relevant variables (in terms of prediction error), while, starting from the minimal list, the second one selects the family of (almost completely) nested lists of relevant variables for increasing values of linear correlation. As an alternative choice we consider Liblinear, a linear Support Vector Machine (SVM) classifier specifically designed for large datasets (millions of instances and features) (Fan et al., 2008). In particular, the classical dual optimization problem with L2-SVM loss function is solved with a coordinate descent method. For our experiment we adopt the ℓ2-regularized penalty term and the modulus of the weights for ranking purposes within a 100 × 3-fold cross validation schema. We build a model for increasing feature sublists where the feature ranking is defined according to the importance for the classifier. We choose the model, and thus the top ranked features, providing a balance between the accuracy of the classifier and the stability of the signature (The MAQC Consortium, 2010). Thus, the output of this first step is a gene signature g_1, ..., g_k (one for each model M) containing the k most discriminant features, ranked according to their frequency score.

Pathway Enrichment
The successive enrichment phase derives a list of relevant pathways from the discriminant features, moving the focus of the analysis from single genes to functionally related pathways. As outlined in the review by Huang et al. (2009), in the last 10 years the gene-annotation enrichment analysis field has been growing rapidly and several bioinformatics tools have been designed for this task. Huang et al. (2009) provide a unique categorization of these enrichment tools in three major categories based on the underlying algorithm: singular enrichment analysis (SEA), gene set enrichment analysis (GSEA), and modular enrichment analysis (MEA). We choose one representative E for each class for our comparison, referring as sources of information D both to KEGG, to explore known information on molecular interaction networks, and to GO, to explore functional characterization and biological annotation. In the first category we choose WebGestalt (WG), an online gene set analysis toolkit (Zhang et al., 2005a) taking as input a list of relevant genes/probesets. The enrichment analysis is performed in KEGG and GO, identifying the most relevant pathways and ontologies in the signatures. WG adopts the hypergeometric test to evaluate functional category enrichment and performs a multiple test adjustment (the default method is the one from Benjamini and Hochberg (1995)). The user may choose different significance levels and the minimum number of genes belonging to the selected functional groups. GSEA (Subramanian et al., 2005) is our representative of the second class. It first performs a correlation analysis between the features and the phenotype, obtaining a ranked list of features. Secondly, it determines whether the members of given gene sets are randomly distributed in the ranked list of features obtained above, or primarily found at the top or bottom. We use the preranked analysis tool, feeding the ranked lists of genes produced by the profiling phase directly to the enrichment step of GSEA. To avoid a miscalculation of the enrichment score ES, we provide as input the complete list of variables (not just the selected ones), assigning a zero score to the not-selected ones. Note that GSEA calculates enrichment scores that reflect the degree to which a pathway is overrepresented at the top or the bottom of the ranked list. In our analysis we considered only pathways enriched with the top of the list. Finally, the tool in the MEA class is the Pathways and Literature Strainer (PaLS) (Alibés et al., 2008), which takes a list or a set of lists of genes (or protein identifiers) and shows which ones share the same GO terms or KEGG pathways, following a criterion based on a
Fig. 1. General scheme of the analysis pipeline, with the indication of the algorithms and tools used in the PD dataset application (lower boxes).
threshold t set by the user. The tool provides as output those functional groups that are shared by at least t% of the selected genes. PaLS is aimed at easing the biological interpretation of results from studies of differential expression and gene selection, without assigning any statistical significance to the final output. Applying the above mentioned pathway enrichment techniques, we retrieve for each gene g_i the corresponding whole pathway p_i = {h_1, ..., h_t}, where the genes h_j ≠ g_i do not necessarily belong to the original signature g_1, ..., g_k. Extending the analysis to all the h_j genes of the pathway allows us to explore functional interactions that would otherwise get lost.

Subnetwork Inference
For each pathway, networks are inferred separately on data from the different classes. The subnetwork inference phase requires reconstructing a network N_{p_i,y} on the pathway p_i by using the steady state expression data of the samples of each class y. The network inference procedure is limited to the sole genes belonging to the pathway p_i in order to avoid the problem of intrinsic underdeterminacy of the task. As an additional caution against this problem, in the following experiments we limit the analysis to pathways having more than 4 nodes and less than 1000 nodes. The pipeline allows running analyses in parallel with different methods and thus evaluating the variability along the whole pipeline. We adopt four different subnetwork reconstruction algorithms N: the Weighted Gene Co-Expression Networks (WGCN) algorithm (Horvath, 2011), the Algorithm for the Reconstruction of Accurate Cellular Networks (ARACNE) (Margolin et al., 2006), the Context Likelihood of Relatedness (CLR) approach (Faith et al., 2007), and the Reverse Engineering Gene Networks using Artificial Neural Networks (RegnANN) (Grimaldi et al., 2011). In this work, we applied WGCNA, CLR and ARACNE to analyze the pathways identified in the Pathway Enrichment step, while RegnANN was used, as an alternative algorithm, to reconstruct interesting disrupted pathways and to compare its results with results from the methods mentioned above. WGCNA is based on the idea of using (a function of) the absolute correlation between the expression of a couple of genes across the samples to define a link between them. ARACNE is a method for inferring networks from the transcription level (Margolin et al., 2006) to the metabolic level (Nemenman et al., 2007). Although it was originally designed for handling the complexity of regulatory networks in mammalian cells, it is able to address a wider range of network deconvolution problems. This information-theoretic algorithm removes the vast majority of indirect candidate interactions inferred by co-expression methods by using the data processing inequality property (Cover and Thomas, 1991). CLR belongs to the relevance networks class of algorithms and is employed for the identification of transcriptional regulatory interactions (Faith et al., 2007). In particular, interactions between transcription factors and gene targets are scored by using the mutual information between the corresponding gene expression levels coupled with an adaptive background correction step. Indeed, the most probable regulator-target interactions are chosen comparing the mutual information score versus the "background" distribution of mutual information scores for all possible pairs within the corresponding network context (i.e. all the pairs including either the regulator or the target). RegnANN is a newly defined method for inferring gene regulatory networks based on an ensemble of feed-forward multilayer perceptrons. Correlation is used to define gene interactions. For each gene a one-to-many regressor is trained using the transcription data to learn the relationship between the gene and all the other genes of the network. The interactions among genes are estimated independently and the overall network is obtained by joining all the neighborhoods. Summarizing, we obtain a real-valued adjacency matrix as output of the subnetwork inference step for each dataset X, for each class y, for each model M, for each enrichment tool E, for each source of information D, for each pathway p_i, and for each subnetwork inference algorithm N. We thus need to quantitatively evaluate network differences, i.e. using a metric instead of evaluating network properties.

Subnetwork Distance and Stability
Among the possible choices already available in literature, we focus on two of the most common distance families: the set of edit-like distances and the spectral distances. The functions in the former family quantitatively evaluate the differences between two networks (with the same number of nodes) in terms of the minimum number of edit operations (with possibly different costs) transforming one network into the other, i.e. deletion and insertion of links, while spectral measures rely on functions of the eigenvalues of one of the connectivity matrices of the underlying graph. As discussed in Jurman et al. (2011), the drawback of many classical distances (such as those of the edit family) is locality, that is focusing only on the portions of the network affected by the differences in the presence/absence of matching links. Spectral distances can overcome this problem by considering the global structure of the compared topologies. Within them, we consider the Ipsen-Mikhailov ε distance: originally introduced in Ipsen and Mikhailov (2002) as a tool for network reconstruction from its Laplacian spectrum, it has been proven to be the most robust in a wide range of situations by Jurman et al. (2011). We are also aware that spectral measures are not flawless: they cannot distinguish isomorphic or isospectral graphs, which can correspond to quite different conditions within the biological context. We thus introduce the glocal distance φ as a possible solution against both issues: φ is defined as the product metric of the Hamming distance H (as representative of the edit family) and the ε distance. Full mathematical details are available in Jurman et al. (2012).

Relying on the distances ε and φ, we evaluate networks corresponding to the same pathway for different classes, i.e. all the pairs (N_{p_i,+1}, N_{p_i,−1}), and rank the pathways themselves from the most to the least changing across classes.

Moreover, we attached to each network a quantitative measure of stability with respect to data subsampling, in order to evaluate the reliability of inferred topologies. In particular, for each N_{p_i,y}, we extracted a random subsampling (of a fraction r of X labelled as y) on which the corresponding N_{p_i,y} will be reconstructed. Repeating m times the subsampling/inferring
procedure, a set of m nets will be generated for each N_{p_i,y}. Then all $\binom{m}{2} = \frac{m(m-1)}{2}$ mutual distances are computed, and for each set of m graphs we build the corresponding distance histogram. In particular, for our experiments we set m = 20 and r = 2/3. Mean and variance of the constructed histograms will quantitatively assess the stability of the subnetwork inferred from the whole dataset: the lower the values, the higher the stability in terms of robustness to data perturbation (subsampling).

Data description and preprocessing
The presented approach is applied to PD data originally introduced in Zhang et al. (2005b) and publicly available at Gene Expression Omnibus (GEO), with accession number GSE20292. The biological samples consist of whole substantia nigra tissue in 11 PD patients and 18 healthy controls. Expressions were measured on the Affymetrix HG-U133A platform. We perform the data normalization on the raw data with the rma algorithm of the R Bioconductor affy package with a custom CDF (downloaded from BrainArray: http://brainarray.mbni.med.umich.edu) adopting Entrez identifiers.

Software Availability
The Python implementation of ℓ1ℓ2 regularization with double optimization is available at http://slipguru.disi.unige.it/Software/L1L2Py. Liblinear was originally developed by the Machine Learning Group at the National Taiwan University and it is now available within the Python mlpy library (http://mlpy.fbk.eu). We adopt the l2r_l2loss_svc_dual solver, with C = 10e−4. WG is available as a web application at http://bioinfo.vanderbilt.edu/webgestalt. GSEA is available either as a web application or a Java stand-alone tool at http://www.broadinstitute.org/gsea. PaLS is available online at http://pals.bioinfo.cnio.es as a web application. For three of the network reconstruction algorithms, we adopted the R Bioconductor implementation: the WGCNA package for WGCN, and MiNET (Mutual Information NETworks package) for ARACNE and CLR. In particular, we set the WGCNA soft thresholding exponent to 5, while we keep the default value for the ARACNE data processing inequality tolerance parameter (Meyer et al., 2008). Moreover, the ARACNE implementation requires all the features to have non-zero variance on each class and for consistency purposes we applied this in all experiments. RegnANN is instead available from http://sourceforge.net/projects/regnann. It is implemented in C and relies on the GPGPU programming paradigm for improving efficiency. The glocal distance φ is available upon request to the authors either as an R script or as a Python script. The computation of the Ipsen-Mikhailov distance ε is included as a component of the glocal script.

3 RESULTS AND DISCUSSION

The feature selection results varied according to the chosen method: ℓ1ℓ2 identified 458 discriminant genes associated to an average prediction performance of 80.8%, while with Liblinear we selected the top-500 genes associated to an accuracy of 80% (95% bootstrap Confidence Interval: (0.78;0.83)) coupled with a stability of 0.70. The lists have 119 common genes.

The number of enriched pathways greatly varied depending on the selection and enrichment tools. With ℓ1ℓ2, we found globally for GO and KEGG, 157, 452 and 9 pathways as significantly enriched, for WG, PaLS and GSEA respectively. Similarly, for Liblinear, the identified pathways were: 139, 481 and 1. Table 1 reports the detailed results for model M, enrichment E and database D.

Table 1. Summary of pathways retrieved in the pathway enrichment step. The numbers in brackets refer to the pathways considered for the network inference step.

M          D      E: WG      GSEA    PaLS
ℓ1ℓ2       GO     114 (92)   7 (7)   381 (331)
ℓ1ℓ2       KEGG   43 (43)    2 (2)   71 (71)
Liblinear  GO     83 (45)    0 (0)   404 (356)
Liblinear  KEGG   56 (55)    1 (1)   77 (77)

If we consider the ℓ1ℓ2 selection method and the enrichment performed within GO, we may note that no common GO terms were selected across enrichment methods. A significant overlap of results was found only between WG and PaLS, with 30 common GO terms. Similar considerations may be drawn with the results from the Liblinear feature selection method. Within the GO enrichment we did not identify any common GO term among the three enrichment tools. Considering only WG and PaLS, we were able to select 12 common GO terms.

If we consider the ℓ1ℓ2 selection method and the enrichment performed within KEGG, two common pathways are identified across enrichment methods. A significant overlap of results was found between WG and PaLS, with 43 common pathways. For Liblinear, only one common pathway was selected among the three enrichment tools. A significant overlap of results was found between WG and PaLS, with 55 common pathways.

Following the pipeline, we also performed a comparison of the three network reconstruction methods. We considered the most disrupted networks, keeping for the analysis those pathways that had a glocal distance greater than or equal to the chosen threshold τ = 0.05. The choice of such threshold was made considering the distribution of glocal distances φ for the methods M. For instance, if we consider the Liblinear selection method and the KEGG database, we have a cumulative distribution as depicted in Figure 2(a). The threshold τ is set to 0.05 and allows retaining at least 50% of pathways. The plot in Figure 2(b) represents the glocal distances distribution for all enrichment methods E with respect to the two components of the glocal distance: the Ipsen distance ε and the Hamming distance H. The red curved line represents the threshold τ in this space. The plot in Figure 2(c) is detailed for subnetwork inference method N.

Table 2. Summary of common most disrupted pathways (φ ≥ 0.05).

M          D      |∩(All E)|   |∩(E_WG, E_PaLS)|
ℓ1ℓ2       GO     0            17
ℓ1ℓ2       KEGG   1            22
Liblinear  GO     0            5
Liblinear  KEGG   0            21

After retaining the most distant pathways, we performed a comparison of common terms for fixed selection method M and database D. The results are reported in Table 2. In Tables 3 and 4 we report the most disrupted GO terms and KEGG pathways that have a glocal distance φ greater than or equal to the chosen threshold τ. As an example of a selected pathway within KEGG, the networks (thresholded at edge weight 0.1 for graphic purposes) inferred
Table 3. Summary of most disrupted GO terms common between WG and PaLS, for different models M. Each GO term is associated to a glocal distance φ ≥ 0.05 for all subnetwork reconstruction algorithms N. GO terms are sorted according to decreasing average φ. Bold fonts represent the GO terms shared by model M.

ℓ1ℓ2
ID          Term name
GO:0005739  Mitochondrion
GO:0031966  Mitochondrial membrane
GO:0005743  Mitochondrial inner membrane
GO:0042802  Identical protein binding
GO:0007018  Microtubule-based movement
GO:0046961  Proton-transporting ATPase activity, rotational mechanism
GO:0005753  Mitochondrial proton-transporting ATP synthase complex
GO:0000502  Proteasome complex
GO:0015986  ATP synthesis coupled proton transport
GO:0045202  Synapse
GO:0048487  Beta-tubulin binding
GO:0042734  Presynaptic membrane
GO:0005747  Mitochondrial respiratory chain complex I
GO:0006120  Mitochondrial electron transport, NADH to ubiquinone
GO:0015078  Hydrogen ion transmembrane transporter activity
GO:0015992  Proton transport
GO:0005874  Microtubule

Liblinear
ID          Term name
GO:0042127  Regulation of cell proliferation
GO:0005783  Endoplasmic reticulum
GO:0015629  Actin cytoskeleton
GO:0006469  Negative regulation of protein kinase activity
GO:0005747  Mitochondrial respiratory chain complex I
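The selection rule stated in the caption of Table 3 (terms common to WG and PaLS, φ ≥ 0.05 for every reconstruction algorithm, sorted by decreasing average φ) can be sketched as follows. The function name, the data layout and all φ values are hypothetical and chosen only to illustrate the filtering and ranking; they are not results from the paper.

```python
from statistics import mean

def rank_common_disrupted(enriched_by_tool, phi, tau=0.05):
    """Terms enriched by every tool, with phi >= tau for every inference
    algorithm, sorted by decreasing average phi."""
    common = set.intersection(*(set(terms) for terms in enriched_by_tool.values()))
    kept = {term: mean(phi[term].values())
            for term in common
            if all(value >= tau for value in phi[term].values())}
    return sorted(kept, key=kept.get, reverse=True)

# Hypothetical enrichment results for two tools (placeholder term names)
enriched_by_tool = {
    "WG": {"term_A", "term_B", "term_C", "term_E"},
    "PaLS": {"term_B", "term_C", "term_D", "term_E"},
}

# Hypothetical glocal distances between case and control networks
phi = {
    "term_A": {"WGCNA": 0.20, "ARACNE": 0.15, "CLR": 0.18},
    "term_B": {"WGCNA": 0.07, "ARACNE": 0.03, "CLR": 0.09},  # fails tau for ARACNE
    "term_C": {"WGCNA": 0.12, "ARACNE": 0.06, "CLR": 0.10},
    "term_D": {"WGCNA": 0.30, "ARACNE": 0.25, "CLR": 0.28},
    "term_E": {"WGCNA": 0.25, "ARACNE": 0.20, "CLR": 0.22},
}

ranked = rank_common_disrupted(enriched_by_tool, phi)
print(ranked)
```

Requiring the threshold for all algorithms, rather than on the average, keeps a term from entering the list on the strength of a single inference method.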
[Figure 3. Panel (a): networks titled "Parkinson 1 − 05014 − WGCNA (th=0.5)" and "Parkinson −1 − 05014 − WGCNA (th=0.5)". Panel (b): φ (axis from 0.00 to 0.30) for Aracne, CLR and WGCNA, on Control and PD. Panel (c): networks titled "05014 − GSE20292 class +1 − RegnANN" and "05014 − GSE20292 class −1 − RegnANN".]

Fig. 3. (a) Networks inferred by WGCNA algorithm for the ALS KEGG pathway for PD patients (above) and controls (below), on the same pathway for different inference algorithms. (b) WGCNA is the method showing the highest stability on the two classes. (c) Same pathway reconstructed with RegnANN.
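The stability reported in panel (b) comes from the subsampling procedure of Materials and Methods (m = 20 subsamples, each a fraction r = 2/3 of one class, followed by all m(m−1)/2 pairwise distances). The sketch below illustrates that loop under simplifying assumptions: the inference step is a plain soft-thresholded absolute correlation with exponent 5 (a stand-in for the WGCNA package, not its implementation), the distance is the normalized Hamming distance rather than ε or φ, and the expression data are random toy values.

```python
import itertools
import math
import random
import statistics

def pearson(u, v):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    mu, mv = statistics.fmean(u), statistics.fmean(v)
    du = [a - mu for a in u]
    dv = [b - mv for b in v]
    denom = math.sqrt(sum(a * a for a in du) * sum(b * b for b in dv))
    return sum(a * b for a, b in zip(du, dv)) / denom if denom > 0 else 0.0

def coexpression_network(samples, beta=5):
    """Adjacency |cor|^beta between genes; `samples` has one row per subject."""
    genes = list(zip(*samples))
    n = len(genes)
    return [[0.0 if i == j else abs(pearson(genes[i], genes[j])) ** beta
             for j in range(n)] for i in range(n)]

def hamming(adj_a, adj_b):
    """Normalized Hamming distance between two weighted adjacency matrices."""
    n = len(adj_a)
    diff = sum(abs(adj_a[i][j] - adj_b[i][j])
               for i in range(n) for j in range(n) if i != j)
    return diff / (n * (n - 1))

def stability(samples, infer, distance, m=20, r=2 / 3, seed=0):
    """Mean and variance of all pairwise distances among m subsampled networks."""
    rng = random.Random(seed)
    size = max(2, round(r * len(samples)))
    nets = [infer(rng.sample(samples, size)) for _ in range(m)]
    dists = [distance(a, b) for a, b in itertools.combinations(nets, 2)]
    return statistics.fmean(dists), statistics.pvariance(dists), dists

# Hypothetical toy data: 18 subjects of one class, 6 genes of one pathway
toy_rng = random.Random(1)
toy = [[toy_rng.gauss(0, 1) for _ in range(6)] for _ in range(18)]
mean_d, var_d, dists = stability(toy, coexpression_network, hamming)
print(len(dists), mean_d, var_d)
```

Lower mean and variance indicate a network that changes little when subjects are removed, which is the sense in which one inference method is called more stable than another.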
by WGCNA (together with the corresponding stability) on the Amyotrophic Lateral Sclerosis KEGG pathway (ALS - 05014) are displayed in Figure 3. We also plot the network inferred by the RegnANN algorithm. Similarly, in Figure 4 we plot the Pathogenic E. coli infection KEGG pathway, reconstructed by WGCNA, its stability plot, and the corresponding networks inferred by the RegnANN algorithm.

Discussion
The variability in the results, as expected, strongly depends on the method of choice. For feature selection, the nature of the method is key. In the proposed pipeline we limited the impact of this step by choosing two approaches within the regularization methods family. Both classifiers adopt an ℓ2-regularization penalty term, combined with different loss functions and, for ℓ1ℓ2, with another regularization term. We used similar but not equal model selection protocols. Both guarantee that the results are not affected by selection bias. In this work, the main source of variability was the choice of the gene enrichment module. Therefore, the experimenter must be careful in choosing one method or another and in using it in compliance with the experimental design. For instance, GSEA was designed for estimating the significance levels by considering separately the positively and negatively scoring gene sets within a
Table 4. Summary of most disrupted KEGG pathways common between WG and PaLS, for different models M. Each pathway is associated to a glocal distance φ ≥ 0.05 for all subnetwork reconstruction algorithms N. KEGG pathways are sorted according to decreasing average φ. Bold fonts represent the KEGG pathways shared by model M.

ℓ1ℓ2
ID     Pathway name
01100  Metabolic pathway
05130  Pathogenic Escherichia coli infection
04910  Insulin signaling pathway
00310  Lysine degradation
04140  Regulation of autophagy
03050  Proteasome
00230  Purine metabolism
05014  Amyotrophic lateral sclerosis*
00980  Metabolism of xenobiotics by cytochrome P450
00620  Pyruvate metabolism
05213  Endometrial cancer
00270  Cysteine and methionine metabolism
00240  Pyrimidine metabolism
05120  Epithelial cell signaling in Helicobacter pylori infection
05110  Vibrio cholerae infection
00020  Citrate cycle (TCA cycle)
00562  Inositol phosphate metabolism
00600  Sphingolipid metabolism
05218  Melanoma
00010  Glycolysis / Gluconeogenesis
00051  Fructose and mannose metabolism
04722  Neurotrophin signaling pathway

Liblinear
ID     Pathway name
04630  Jak-STAT signaling pathway
01100  Metabolic pathway
05130  Pathogenic Escherichia coli infection
04623  Cytosolic DNA-sensing pathway
00330  Arginine and proline metabolism
04910  Insulin signaling pathway
05212  Pancreatic cancer
03030  DNA replication
05213  Endometrial cancer
04660  T cell receptor signaling pathway
04310  Wnt signaling pathway
05210  Colorectal cancer
04912  GnRH signaling pathway
05332  Graft-versus-host disease
04520  Adherens junction
04621  NOD-like receptor signaling pathway
04370  VEGF signaling pathway
04662  B cell receptor signaling pathway
04722  Neurotrophin signaling pathway
05214  Glioma
04330  Notch signaling pathway

*Note: This is the only selected pathway shared across all enrichment methods E.
list of genes selected with filter methods based on classical statistical tests. It is
worth noting that, if one uses the preranked option, as we did, negatively regulated
groups might not be significant at all (we indeed discarded them). WG uses the
hypergeometric test to assess the functional groups but, differently from GSEA, does not
use any significance assessment based on permutation of phenotype labels. PaLS is the
simplest method, being just a measure of occurrences of a given descriptor in the list of
selected genes. However, enrichment methods from different categories are complementary
and can identify different but equally meaningful biological aspects of the same process.
Thus, the integration of information across different methods is the best strategy.
Moreover, the assessment of the reconstruction distance between the case and control
versions of the same pathway helps in providing a quantitative focus on the key pathways
involved in the process. The use of a distance mixing the effects of structural changes
with those due to the differences in rewiring moreover warrants a more informative view
on the difference assessment itself. The limited effect of different feature selection
methods is confirmed by the plots in Figure 5.
For ℓ1ℓ2, the only highly disrupted pathway shared by the three enrichment tools E and
the three reconstruction methods N is ALS. This pathway is relevant in this context
because, like PD, ALS is another neurodegenerative disease; therefore they share
significant biological features, in particular at the mitochondrial level. Moreover, at
the phenotypic level the skeletal muscles of the patients are severely affected,
impairing movement. In Figure 3 it is evident that a high number of interactions are
established among the genes going from the control (below) to the affected (above)
pathways. It is also interesting to underline that CYCS (Entrez ID: 54205), one of the
hub genes (represented by a red dot in the graph) within the pathway, was identified by
ℓ1ℓ2 as discriminant. This gene is highly involved in several neurodegenerative diseases
(e.g., PD, Alzheimer's, Huntington's) and in pathways related to cancer. Furthermore, its
protein is known to function as a central component of the electron transport chain in
mitochondria and to be involved in the initiation of apoptosis, a known cause of neuron
loss in PD. Across variable selection algorithms M, five highly disrupted pathways were
found as shared between two of the three enrichment methods (see Table 4, bold items). In
particular, we represented in Figure 4 the corresponding inferred networks. To further
highlight the different outcomes occurring from the same dataset when diverse inference
methods are employed, we reconstructed the ALS and Pathogenic E. coli infection pathways
by the RegnANN algorithm, which tends to spot also second-order correlations among the
network nodes, see Figures 3 and 4.
Two genes in the E. coli infection pathway were selected both by ℓ1ℓ2 and Liblinear,
namely ABL1 (Entrez ID:71) and TUBB6
[Figure 4: network plots for KEGG pathway 05130 on GSE20292. (a) WGCNA (th=0.5), Parkinson class +1 and class −1. (b) φ (axis range 0.00 to 0.30) for Aracne, CLR and WGCNA, Control vs. PD. (c) RegnANN, class +1 and class −1.]
Fig. 4. (a) Networks inferred by the WGCNA algorithm for the Pathogenic E. coli infection KEGG pathway for PD patients (above) and controls (below), on the
same pathway for different inference algorithms. (b) WGCNA is the method showing the highest stability on the two classes. (c) Same pathway reconstructed
with RegnANN.
(Entrez ID: 84617). ABL1 seems to play a relevant role as hub both in the WGCNA and in
the RegnANN networks. ABL1 is a proto-oncogene that encodes a protein tyrosine kinase
that has been implicated in processes of cell differentiation, cell division, cell
adhesion, and stress response. It was also found to be responsible for apoptosis in human
brain microvascular endothelial cells. In Figure 6 we note that pathways with a high
number of genes are similar in terms of local distance, whereas a wider range of
variability is found looking at the spectral distance. The red line in Figure 6(b)
separates the two clusters. Pathway targets beyond and within the red line are
represented in the cumulative histogram in Figure 6(a). Pathways beyond the threshold are
equally distributed and represent a wider range of targets, whereas pathways within the
threshold show a smaller number of targets (Figure 6(a), on the right).

4 CONCLUSION
Moving from gene profiling towards pathway profiling can be an effective solution to
overcome the problem of the poor overlapping in -omics signatures. Nonetheless, the path
that translates a discriminant gene panel into a coherent set of functionally related
gene sets includes a number of steps, each injecting variability into the process. To
reduce the overall impact of such variability, it is thus critical that, whenever
possible, the correct tool for each single step is adopted, accurately focussing on the
desired target to be investigated. This mainly holds for the choice of the most suitable
enrichment tool and biological knowledge database and, to a lower extent, for the
inference method used for the network reconstruction: all these ingredients are planned
for different objectives, and their use in other situations may be misleading. As a
final observation and a possible future development to explore, the emerging instability
can be tackled by obtaining the functional groups identification as the result of a prior
knowledge injection in the learning phase, rather than as an a posteriori procedure
(Zycinski et al., 2011, 2012).

ACKNOWLEDGEMENT
The authors at FBK want to thank Shamar Droghetti for his help with the enrichment web
interfaces.

Funding: The authors at DISI acknowledge funding by the Compagnia di San Paolo funded
Project Modelli e metodi computazionali nello studio della fisiologia e patologia di reti
molecolari di controllo in ambito oncologico. The authors at FBK acknowledge funding by
the European Union FP7 Project HiperDART and by the PAT funded Project ENVIROCHANGE.

REFERENCES
Alibés, A., Cañada, A., and Díaz-Uriarte, R. (2008). PaLS: filtering common literature, biological terms and pathway information. Nucleic Acids Res, 36(Web Server issue), W364–W367.
Ambroise, C. and McLachlan, G. (2002). Selection bias in gene extraction on the basis of microarray gene-expression data. PNAS, 99(10), 6562–6566.
Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A. P., Dolinski, K., Dwight, S. S., Eppig, J. T., Harris, M. A., Hill, D. P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J. C., Richardson, J. E., Ringwald, M., Rubin, G. M., and Sherlock, G. (2000). Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature Genetics, 25(1), 25–29.
Barabasi, A. L., Gulbahce, N., and Loscalzo, J. (2011). Network medicine: a network-based approach to human disease. Nature Reviews Genetics, 12, 56–68.
Baralla, A., Mentzen, W., and de la Fuente, A. (2009). Inferring Gene Networks: Dream or Nightmare? Ann. N.Y. Acad. Sci., 1158, 246–256.
Barla, A., Mosci, S., Rosasco, L., and Verri, A. (2008). A method for robust variable selection with significance assessment. Proceedings of ESANN 2008.
Barla, A., Jurman, G., Visintainer, R., Filosi, M., Riccadonna, S., and Furlanello, C. (2011). A machine learning pipeline for discriminant pathways identification. In CIBB 2011, pages 1–10. ISBN:9788890643705.
Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B.
Boccaletti, S., Latora, V., Moreno, Y., Chavez, M., and Hwang, D.-U. (2006). Complex networks: Structure and dynamics. Physics Reports, 424(4–5), 175–308.
Buchanan, M., Caldarelli, G., De Los Rios, P., Rao, F., and Vendruscolo, M., editors (2010). Networks in Cell Biology. Cambridge University Press.
Cover, T. and Thomas, J. (1991). Elements of Information Theory. Wiley.
De Mol, C., Mosci, S., Traskine, M., and Verri, A. (2009). A regularized method for selecting nested groups of relevant genes from microarray data. Journal of Computational Biology, page 8.
[Figure 5: four panels, (a) ℓ1ℓ2 and KEGG, (b) Liblinear and KEGG, (c) ℓ1ℓ2 and GO, (d) Liblinear and GO. Each panel is a grid of plots with columns GSEA, PALS, WebGestalt and rows WGCNA, CLR, Aracne; x-axis H and y-axis ε, both ranging from 0.0 to 0.8.]
Fig. 5. Plots of Hamming vs. Ipsen distances (H vs. ε) for all possible combinations of M, D, E and N. In our analysis we considered the glocal distance φ,
defined as the normalized product of H and ε.
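As a rough illustration of the quantities named in the caption above, the following Python sketch computes a normalized Hamming distance H between two adjacency matrices defined on the same nodes and combines it with an Ipsen distance ε supplied by the caller. The function names, the division by N(N−1), and the `norm` parameter are assumptions made for illustration only: the exact normalization of φ is not stated in this section (see Jurman et al., 2012), and ε is not computed here.

```python
def hamming_distance(a, b):
    """Normalized Hamming distance between two networks on the same N nodes.

    `a` and `b` are square adjacency matrices given as lists of lists.
    Assumption: the sum of absolute off-diagonal differences is divided
    by N * (N - 1), so that binary networks give a value in [0, 1].
    """
    n = len(a)
    if n != len(b) or any(len(row) != n for row in a) or any(len(row) != n for row in b):
        raise ValueError("both adjacency matrices must be square and of the same size")
    if n < 2:
        return 0.0
    total = sum(
        abs(a[i][j] - b[i][j])
        for i in range(n)
        for j in range(n)
        if i != j  # self-loops are ignored
    )
    return total / (n * (n - 1))


def glocal_product(h, eps, norm=1.0):
    """Product of the local (H) and spectral (eps) distances.

    `norm` is a hypothetical normalization constant; it defaults to 1.0
    because the constant used by the authors is not given in this section.
    """
    if norm <= 0:
        raise ValueError("norm must be positive")
    return h * eps / norm


# Toy example: two 3-node undirected networks differing by one edge (1-2).
control = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
case = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
h = hamming_distance(control, case)   # 2 differing entries out of 6
phi = glocal_product(h, eps=0.4)      # eps here is a made-up value
```

Because the product vanishes whenever either factor is zero, a pathway scores high only when both the link-level and the spectral comparison detect a change.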
De Smet, R. and Marchal, K. (2010). Advantages and limitations of current network inference methods. Nature Reviews Microbiology, 8, 717–729.
Faith, J., Hayete, B., Thaden, J., Mogno, I., Wierzbowski, J., Cottarel, G., Kasif, S., Collins, J., and Gardner, T. (2007). Large-Scale Mapping and Validation of Escherichia coli Transcriptional Regulation from a Compendium of Expression Profiles. PLoS Biol., 5(1), e8.
Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R., and Lin, C.-J. (2008). Liblinear: A library for large linear classification. Journal of Machine Learning Research, 9, 1871–1874.
Furlanello, C., Serafini, M., Merler, S., and Jurman, G. (2003). Entropy-based gene ranking without selection bias for the predictive classification of microarray data. BMC Bioinformatics.
Grimaldi, M., Visintainer, R., and Jurman, G. (2011). Regnann: Reverse engineering gene networks using artificial neural networks. PLoS ONE, 6(12), e28646.
He, F., Balling, R., and Zeng, A.-P. (2009). Reverse engineering and verification of gene networks: Principles, assumptions, and limitations of present methods and future perspectives. J. Biotechnol., 144(3), 190–203.
Horvath, S. (2011). Weighted Network Analysis: Applications in Genomics and Systems Biology. Springer.
Huang, D., Sherman, B., and Lempicki, R. (2009). Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res, 37(1), 1–13.
Ipsen, M. and Mikhailov, A. (2002). Evolutionary reconstruction of networks. Phys. Rev. E, 66(4), 046109.
Jurman, G., Visintainer, R., and Furlanello, C. (2011). An introduction to spectral distances in networks. In Proc. WIRN 2010, pages 227–234.
Jurman, G., Visintainer, R., Riccadonna, S., Filosi, M., and Furlanello, C. (2012). A glocal distance for network comparison. arXiv:submit/0397475 [math.CO].
Kanehisa, M. and Goto, S. (2000). KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res, 28(1), 27–30.
Leek, J., Scharpf, R., Corrada Bravo, H., Simcha, D., Langmead, B., Johnson, W., Geman, D., Baggerly, K., and Irizarry, R. (2010). Tackling the widespread and critical impact of batch effects in high-throughput data. Nat Rev Genet, 11, 733–739.
Margolin, A., Nemenman, I., Basso, K., Wiggins, C., Stolovitzky, G., Dalla-Favera, R., and Califano, A. (2006). ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinform., 7(7), S7.
Meyer, P., Lafitte, F., and Bontempi, G. (2008). minet: A R/Bioconductor Package for Inferring Large Transcriptional Networks Using Mutual Information. BMC Bioinform., 9(1), 461.
Nemenman, I., Escola, G., Hlavacek, W., Unkefer, P., Unkefer, C., and Wall, M. (2007). Reconstruction of Metabolic Networks from High-Throughput Metabolite Profiling Data. Ann. N.Y. Acad. Sci., 1115, 102–115.
Newman, M. (2003). The Structure and Function of Complex Networks. SIAM Review, 45, 167–256.
Newman, M. (2010). Networks: An Introduction. Oxford University Press.
Sharan, R. and Ideker, T. (2006). Modeling cellular machinery through biological network comparison. Nature Biotechnology, 24(4), 427–433.
Strogatz, S. (2001). Exploring complex networks. Nature, 410, 268–276.
Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., Paulovich, A., Pomeroy, S. L., Golub, T. R., Lander, E. S., and Mesirov, J. P. (2005). Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. PNAS, 102(43), 15545–15550.
The MAQC Consortium (2010). The MAQC-II Project: A comprehensive study of common practices for the development and validation of microarray-based predictive models. Nature Biotechnology, 28(8), 827–838.
Zhang, B., Kirov, S., and Snoddy, J. (2005a). WebGestalt: an integrated system for exploring gene sets in various biological contexts. Nucleic Acids Res, 33(Web Server issue), W741–8.
Zhang, Y., James, M., Middleton, F. A., and Davis, R. L. (2005b). Transcriptional analysis of multiple brain regions in Parkinson's disease supports the involvement of specific protein processing, energy metabolism, and signaling pathways, and suggests novel disease mechanisms. Am J Med Genet B Neuropsychiatr Genet, 137B(1), 5–16.
Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. J.R. Statist. Soc. B.
Zycinski, G., Barla, A., and Verri, A. (2011). Svs: Data and knowledge integration in computational biology. Engineering in Medicine and Biology Society, EMBC, 2011 Annual International Conference of the IEEE, pages 6474–6478.
Zycinski, G., Squillario, M., Barla, A., Sanavia, T., Verri, A., and Di Camillo, B. (2012). Discriminant functional gene groups identification with machine learning and prior knowledge. Submitted.
[Figure 2, panels (a) and (b): (a) histogram of φ (x-axis, 0.0 to 0.4) against percentage of total (0 to 20); (b) plots of H (x-axis, 0.05 to 0.15) vs. ε (y-axis, 0.1 to 0.6) for GSEA, PALS and WebGestalt, with legend Aracne, CLR, WGCNA.]
[Figure 2, panel (c): grid of plots with columns GSEA, PALS, WebGestalt and rows WGCNA, CLR, Aracne; x-axis H (0.05 to 0.15), y-axis ε (0.1 to 0.6).]
Fig. 2. Detailed distance plot for Liblinear and KEGG (see Figure 5b). The
histogram plot in (a) represents the cumulative histogram for all distances
across enrichment methods E and subnetwork inference algorithms N. The
threshold τ is set to retain at least 50% of pathways. In (b) a plot of
Hamming vs. Ipsen distances (H vs. ε) for all possible combinations of E
and N, which is detailed in (c).
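The thresholding rule in the caption above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes that "retaining" a pathway means keeping it when φ ≥ τ (consistent with the φ ≥ 0.05 criterion of Table 4) and that τ is the largest value for which the retained fraction is at least 50%. The function names and the φ values in the example are hypothetical.

```python
import math


def retention_threshold(phi_values, min_fraction=0.5):
    """Largest tau such that at least `min_fraction` of the values satisfy phi >= tau."""
    if not phi_values:
        raise ValueError("phi_values must not be empty")
    if not 0.0 < min_fraction <= 1.0:
        raise ValueError("min_fraction must be in (0, 1]")
    ranked = sorted(phi_values, reverse=True)
    # Number of pathways that must be kept to reach the requested fraction.
    k = math.ceil(min_fraction * len(ranked))
    return ranked[k - 1]


def retained_pathways(phi_by_pathway, tau):
    """Keep the pathways whose glocal distance is at least tau."""
    return {pid: phi for pid, phi in phi_by_pathway.items() if phi >= tau}


# Made-up phi values attached to KEGG IDs, for illustration only.
phi_by_pathway = {"01100": 0.30, "05130": 0.21, "04910": 0.08, "00310": 0.05, "04140": 0.02}
tau = retention_threshold(list(phi_by_pathway.values()))  # 0.08 for these values
kept = retained_pathways(phi_by_pathway, tau)             # three of the five pathways
```

With ties at the threshold, more than the requested fraction is kept, which still satisfies the "at least 50%" condition.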
[Figure 6: (a) histograms of the number of genes (x-axis, 0 to 1400) against percent of total, in two panels labelled "beyond" and "within"; (b) plots of H (x-axis, 0.00 to 0.20) vs. ε (y-axis, 0.0 to 0.4) for GSEA, PALS and WebGestalt.]
Fig. 6. (a) Pathway target cumulative histogram. (b) Hamming versus Ipsen
(H vs. ε) distance, and thresholding of highly populated pathways.
[Figure panels: Aracne (th=0.7) networks for pathway 05130, Parkinson class −1 and class +1.]
