
BRIEFINGS IN FUNCTIONAL GENOMICS. VOL 12. NO 5. 457–467 doi:10.1093/bfgp/elt003

Biostatistical approaches for the reconstruction of gene co-expression networks based on transcriptomic data
Liliana López-Kleine, Luis Leal and Camilo López
Advance Access publication date 12 February 2013

Abstract

Downloaded from https://academic.oup.com/bfg/article/12/5/457/206992 by guest on 05 February 2021


Techniques in molecular biology have permitted the gathering of an extremely large amount of information relating
to organisms and their genes. The current challenge is assigning a putative function to thousands of genes that
have been detected in different organisms. One of the most informative types of genomic data to achieve a better
knowledge of protein function is gene expression data. Based on gene expression data and assuming that genes
involved in the same function should have a similar or correlated expression pattern, a function can be attributed
to those genes with unknown functions when they appear to be linked in a gene co-expression network (GCN). Several
tools for the construction of GCNs have been proposed and applied to plant gene expression data. Here, we
review recent methodologies used for plant gene expression data and compare the results, advantages and
disadvantages in order to help researchers in their choice of a method for the construction of GCNs.
Keywords: gene co-expression networks; microarray datasets; transcriptomics; plants; gene functional prediction

INTRODUCTION
In the past 15 years, molecular biology has experienced a new way to approach the understanding of
biological processes, primarily due to the sequencing of complete genomes and the generation of
high-throughput data derived from transcriptomics, proteomics, interactomics and metabolomics.
These ‘-omics’ areas of research represent a major challenge in the integration of information and
the generation of a real understanding of living beings. The current challenge is assigning a
putative function to thousands of genes that have been detected in different organisms. Data mining
from a holistic perspective can contribute significantly to this objective. The construction of
biological networks based on ‘-omics’ data can describe the various functional interactions between
genes and consequently, potential functions can be assigned to genes with an unknown function based
on their interactions with genes of known functions. Based on gene expression data and assuming
that genes involved in the same function should have similar or correlated expression patterns, a
function can be attributed to those genes with unknown functions. The quickly accumulating
information on gene expression based on microarray and RNA-sequencing (RNA-seq) experiments is a
valuable tool that can be exploited to create gene co-expression networks (GCNs). Although GCNs
have so far been constructed based on microarray experiments, the new sequencing technologies
provide a new opportunity to gain information on gene expression.

A graph or network is a mathematical representation of a system of elements. It consists of a set of

Corresponding author. Liliana López-Kleine, Statistical Department, Universidad Nacional de Colombia, Ciudad Universitaria. Cra 30
No 45-03, Colombia. Tel: +57 1 3165000, ext. 13175; Fax: +57 1 3165000, ext. 13210; E-mail: llopezk@unal.edu.co
Liliana López is a biologist with a PhD in applied statistics. Her main research interests are systems biology and statistical analysis for
genomic data.
Luis Leal has a BS in Chemical Engineering and is a student in the Master of Science in Statistics program at the Universidad Nacional
de Colombia. Together with Liliana López-Kleine and Camilo López, he is working on a common project regarding the construction
and comparison of GCNs in plants.
Camilo López is a biologist with a PhD in Life Sciences. He is the leader of the Manihot-biotec group in the Biology Department
(Universidad Nacional). His main interest is to elucidate the immune responses activated in cassava in response to cassava bacterial blight
disease.

© The Author 2013. Published by Oxford University Press. All rights reserved. For permissions, please email: journals.permissions@oup.com
458 López-Kleine et al.

nodes or vertices of elements and links or edges between nodes. The edges can signify different
types of relationships, for example, the coordinated action of elements or their direct
interaction. Depending on the system, these edges can be directed or form loops, or be undirected
or acyclic. In the case of plants or any biological system at the cellular level, the definition of
the elements and the significance of the edges depend on the available biological data.

A very clear partition of different biological networks is provided by Christensen et al. [1], who
separated these networks into five main categories as follows:

(i) Metabolic networks represent metabolite dynamics to uncover the biochemical properties of a
    cell and the phenotypic results of gene expression [2, 3]. In this type of network,
    metabolites, reactions or enzymes are the nodes, and their regulation and mass flow are
    depicted by the edges.
(ii) Signal transduction networks are models that show how cells convert signals retrieved from
    the external environment to mediate biological responses [4]. Here, the nodes are proteins and
    the edges represent the signal propagation.
(iii) Transcriptional regulatory networks describe how transcription factors (TFs) control the
    expression of target genes [5, 6]. The nodes are TFs or mRNAs, and the edges illustrate the
    regulation between them.
(iv) Protein–protein interaction networks integrate knowledge regarding protein function derived
    from experimental or computational methods [7]. The nodes are proteins linked by the edges,
    which represent functional relationships.
(v) Functional gene networks (FGNs) and GCNs present information based on relationships between
    genes (and/or the proteins they code for) that indicate a coordinated participation in a
    common biological process or pathway [8, 9]. In these types of networks, genes are the nodes
    and their functional associations (which are not necessarily physical) are represented by the
    edges. For FGNs, the edges are obtained from integration of diverse compendiums of data, such
    as gene expression data (microarrays and RNA-seq gene expression experiments), biological
    process annotations, protein interaction databases and similarity of phylogenetic profiles,
    etc. [10–12]. For GCNs, the edges are obtained from gene expression data that combine several
    experiments [13–15].

GCNs allow for the summarizing of systemic knowledge regarding all the cellular processes of an
organism, summarizing the appropriate molecular information. Moreover, the mere construction of
GCNs can contribute to the discovery of novel functional relationships, leading to biological
hypotheses. The reliability of GCNs is, of course, dependent on the quality of the available
information and on the method of construction. GCNs have been constructed for almost all organisms
of interest with sequenced genomes and are often stored in specialized databases [16–18]; the more
complex the organism is, the more complex the network will be. Nevertheless, several attempts to
construct GCNs have been made in plants [19–21].

In this review, we will focus on the efforts that have been made toward the construction of GCNs
in plants. As Arabidopsis thaliana has been, until recently, the plant with the most available
microarray experiments, it is not surprising to find that much of the network construction done
for plants so far has been in this species [22]. We will emphasize the GCN construction techniques
that have been used and the major results obtained. Moreover, we will highlight the advantages and
disadvantages of these techniques to obtain a global picture. Finally, we will summarize the global
and systemic knowledge that has been gathered through GCNs in plants.

METHODS TO RECONSTRUCT GCNs IN PLANTS
Pearson correlation
Currently, the most widely used computational method involves calculating the standard Pearson
correlation coefficient (PCC) between the expression values of pairs of genes. The PCC is a metric
that scores the tendency of two genes to show similar expression levels across samples [18, 23].
As this tendency could be in either the opposite or the same direction, the PCC ranges from -1 to
1 [24]. Consequently, the absolute value of the PCC is commonly used as a similarity measure
ranging from 0 to 1 [18]. The expression levels from pairs of genes with a larger correlation value
than a preselected threshold are considered to reveal a potential interaction, influence,
dependence or coordinated
Reconstruction of gene co-expression networks 459

participation in the same function (i.e. a functional relationship based on co-expression).

Numerous studies have constructed GCNs based on the PCC in plants. For Arabidopsis, Mao et al. [13]
used 1094 ATH1 arrays to calculate PCCs. This network was constructed and partitioned into modules
that were analyzed for biological processes [gene ontology (GO) enrichment terms in modules].
Atias et al. [9] followed a similar methodology; they assessed the PCCs for pairs of genes within
each microarray experiment separately. Thus, they proposed a scoring system to merge the PCCs from
different experiments. Another work using the PCC as a similarity measure was conducted by
Mutwil et al. [14], who used the PCC as a first step to test a novel cluster algorithm (Heuristic
Cluster Chiseling Algorithm), which allowed them to divide a network of weighted edges. Similarly,
in rice, Childs et al. [21] constructed a weighted gene co-expression network from PCCs. The
modules were identified and used to provide functional annotations for genes. Moreover, small
networks such as those modeled by Ouyan et al. [25] are also important applications of the PCC.
Here, the modeled GCNs were useful in inferring a biological role of the OsWD40 family of genes,
which are involved in various important cellular pathways, such as chromatin modification,
reproduction and developmental processes. In other plants, including maize [26], tomato [27] and
barley [28], the PCC has been applied to gene expression data.

Graphical Gaussian models
The graphical Gaussian model (GGM) is an alternative method to the Pearson correlation. GGM-based
methods are undirected probabilistic graphical models that describe the conditional independence
relationship among nodes (here, genes) under the assumption of a multivariate Gaussian distribution
of the data [29]. In the GGM networks, each node represents a gene and an edge connects two genes
if they are partially correlated [30]. Different from the Pearson correlation, which records
correlation between gene pairs without taking into account any information regarding other genes,
in the GGM, a partial correlation between two genes measures the degree of correlation remaining
after removing the effects of other genes. In statistical terms, the GGM calculates the empirical
covariance matrix from a dataset that is then inverted, after which the partial correlations are
computed. Toh and Horimoto [31] were the first to apply the GGM to infer relationships between
genes based on microarray data from Saccharomyces cerevisiae. Since then, several articles have
been published on the application of these methods for microorganisms, animals and plants.

The GGM is one of the most popular methods to model genetic networks employing data from
microarrays. Under certain circumstances, this technique works better than other methods [29].
Although the GGM has several advantages (e.g. it considers the effect of other genes and is
simple), the main disadvantage is that it is best suited for cases in which the number of samples
(N) is relatively large compared with the number of variables (p). This situation is not common for
microarray data, where the number of different experiments (N) is much smaller than the number of
genes (p), resulting in the correlation matrix not having a full rank and not being able to be
inverted [30]. However, several alternatives have been proposed to tackle this problem. Schäfer
and Strimmer [32] proposed as alternatives (i) the use of bootstrap re-sampling and the application
of the Moore–Penrose pseudoinverse and false discovery rate multiple testing or (ii) the inference
of the GGM from regularization and moderation [33]. Other options have been based on the
estimation of the GGM with a limited-order partial correlation function, which estimates
correlations conditional on one or two, but not all, other genes. This technique has been
developed to infer gene networks from Arabidopsis and yeast transcript profiles [34, 35]. More
recently, new strategies based on the application of regularized high dimensional regression for
covariance selection have been implemented [36].

Ma et al. [37] adapted a previously proposed method [32] and applied it to Arabidopsis
transcriptome data comprising 2045 chips, including a large spectrum of conditions (biotic and
abiotic stress, tissues, development stages, etc.). The authors classified the global network into
several subnetworks of functional groups. In a more recent work, as mentioned above, Mao et al.
[13] reconstructed an Arabidopsis network based on 1094 microarrays, employed a standard PCC to
measure the degree of co-expression between two genes and connected them in the network if their
PCC was above a cutoff value of 0.75. This network was compared with the GGM-based GCN [37], and
it was found that the photosynthesis, protein biosynthesis and cell cycle modules identified in
the GGM network had smaller module
sizes [13]. Another Arabidopsis network based on gene co-expression employing the GGM was
constructed for more than 7000 genes, which permitted the identification of the regulation of the
biochemical and stress response pathways, including in particular the trehalose-6-phosphate
phosphatase [22]. Another GGM-based GCN was also constructed in Arabidopsis with a focus on the
starch metabolism in leaves [38]. The grouping of genes according to their oscillating day/night
expression was carried out to identify particular patterns of co-regulation with starch
biosynthesis and degradation, indicating a relationship between TFs targeting starch metabolic
genes [38].

The impact and potential future use of the GGM approach in plants is high, as has been demonstrated
by previous results. The expansion to other applications is possible, for example, to develop new
strategies for inferring a gene–SNP (single-nucleotide polymorphism) network in an integrative
genomic setting [39] or for the reconstruction of metabolomic networks from metabolomics data [40].

Bayesian approaches to construct networks
A different approach for GCN construction is based on Bayesian theory to represent the
probabilistic relationships between all genes. This methodology leads to static or dynamic
Bayesian networks (BNs) [41]. Due to their probabilistic nature, these networks can identify
relevant indirect interactions between genes. These methods are better suited to tackle the noise
present in expression data than other deterministic methods, such as Boolean networks [42].

A BN is structured by a directed acyclic graph whose nodes represent random variables. These
random variables are linked to the gene expression values [43]. Consequently, the directed edges
indicate the conditional dependence relationship among genes [44]. Additionally, a BN is defined
by a family of conditional probabilistic distributions and their parameters. BNs can be viewed as
probabilistic models for the joint distribution of a set of random variables [29].

Although in static BNs the edges are strictly acyclic [41], dynamic BNs (DBNs) extend the model to
allow for the inclusion of self-regulated genes and time-dependent expression data [44]. Given
that a gene affects the expression level of a second gene after a period of time, here, the edges
represent the interaction through time [45]. In this way, the DBNs are a sequence of temporal
steps where genes only interact from step to step. These attributes make DBNs especially useful to
model regulatory networks from microarray time series data [46, 47]. The use of BNs has allowed
the integration of prior biological knowledge, such as Kyoto Encyclopedia of Genes and Genomes
(KEGG) pathways [48, 49].

Although Bayesian approaches were first used for the construction of gene networks by Friedman et
al. [43], these methods have only been applied to a few sets of plant gene expression data for
gene network reconstruction. The first of these applications was a novel algorithm used to
construct a gene regulatory network (GRN) for Arabidopsis from microarray experiments [50]. The
purpose of this work was to place an initial set of genes into a larger network. In this way, a
small graph was inferred with the initial key genes, and the network was grown by the iterative
addition of new genes. The most accurate GRN was identified using the Bayesian Information
Criterion score. This strategy also reduced the number of sets to be evaluated, making the problem
tractable.

A recent study using DBNs with plant gene expression data was led by Dondelinger et al. [51], who
presented a model to infer the structure of a GRN associated with nine circadian genes from the
Arabidopsis circadian clock. The authors proposed a solution to infer the DBN from microarray
datasets belonging to different experimental conditions. In this approach, an individual network
was developed for each dataset, and a novel information sharing method was applied to merge the
networks. A simulation was implemented to produce a time series of data points preceding the
application to real Arabidopsis data. Under the assumption that network structures do not change
over time, it is remarkable to observe how this method yielded less dense and highly accurate
networks. The main conclusion of this work was that the methodology works well for data from
heterogeneous experimental conditions that share some similarity in gene expression behavior,
although it is not suitable for very long time series.

Methods based on similarity (mutual information, clustering)
Coulibaly and Page [52] present some tools that implement approaches to reverse-engineer GRNs
based on the detection of similarity between
expression profiles. The first approach reviewed by the authors is based on ‘mutual information’,
defined as a measure of correlation between gene expression patterns [53]. A higher mutual
information between a pair of genes means that they are nonrandomly associated [54]. In this
method, the regulatory interaction between two genes is established if the mutual information on
their expression patterns is significantly larger than a P-threshold value calculated from the
mutual information between random permutations of the same patterns. Unlike Bayesian theory, which
analyzes whole networks and selects the one that best explains the observed data, the mutual
information method constructs a network by selecting or rejecting regulatory interactions between
pairs of genes. This method does not provide the direction of regulatory interactions, but only
indicates if a possible relationship exists between genes.

In addition to methods that search by estimating the similarity between genes, bioinformatics
approaches based on similarity detection have also been applied to plants. One of these approaches
is the gene coordination approach devoted to the identification of specific or multiple microarray
data segments (called cues), in which there is a positive or negative coordination between pairs
of genes [55]. This approach combines t-testing of differentially expressed genes in different
biological perturbations as a first step toward detecting co-expression for each pair of genes
with significantly different expression. Significant positive coordination and non-significant
negative coordination in biological perturbations are compared to construct the final network [55].

Another alternative approach is clustering. These algorithms identify groups of genes that are
similar and group them together in a cluster. The genes of each cluster are similar to each other
and different from genes of other clusters. These methods have also been used to find co-regulated
genes based on gene co-expression properties. An example of a general cluster method is ‘Cluster
cutting’, an analytical tool available on the PRIME website (Platform for RIKEN Metabolomics,
http://prime.psc.riken.jp/). This web-based tool gives the results of hierarchical clustering for
all Arabidopsis genes calculated for all of the available transcriptome data at AtGenExpress.

The results of applying these methods that measure co-expression through linear and nonlinear
correlation or distances have been stored in a database for A. thaliana (ATTED-II,
http://atted.jp). These stored networks are enriched with additional information regarding
position and metabolic pathway, among others [56].

STRATEGIES TO INCLUDE ADDITIONAL GENOMIC DATA
Several studies have attempted to integrate various types of genomic data in order to reconstruct
gene networks in plants in specialized databases. A successful example is the CoP database, which
has been constructed using a large dataset (10 022 assays) obtained from public plant microarray
databases. It associates co-expressed gene modules with biological information, such as GO terms
and, if available, metabolic pathway names. Following the analysis of 21 000 Arabidopsis genes in
43 datasets and 2 × 10^8 gene pairs, Mutwil et al. [57] identified a globally co-expressed gene
network enriched with GO terms. This method allows for the identification of gene clusters and the
integration of diverse microarray experiments from many sources. The analysis reveals that part of
the Arabidopsis transcriptome is globally co-expressed and can be further divided into known as
well as novel functional gene modules. The proposed methodology is sufficiently general to apply
to any set of microarray experiments, using any scoring function [9]. Going beyond the integration
of genomic data, some authors have proposed the integration of metabolomic data in the
construction of gene networks [58, 59]. Moreover, several studies [12, 26] have demonstrated that
including heterogeneous genomic information is useful in obtaining global and reliable functional
relationships between genes and proteins. Nevertheless, in the Arabidopsis plant model, the types
of heterogeneous data are limited, making it difficult to apply a systematic approach integrating
additional genomic data to construct gene networks [60].

In addition to GCNs, other strategies have been applied with the aim of gaining general
information about physical interactions between the products of genes: proteins. Experimentally
and employing the yeast two-hybrid technique, a protein–protein interaction map for the
interactome network of Arabidopsis was recently established [61]. This also allowed novel
hypothetical links to be assigned between proteins and pathways [61]. Additionally, a detailed
interactome protein network was constructed between Arabidopsis proteins and effector proteins
from two different pathogens, Pseudomonas syringae
and Hyaloperonospora arabidopsidis [62]. This study found that pathogen proteins preferentially
interact with highly connected plant proteins, which in most cases are related to immune
responses [62]. The information based on protein–protein interaction obtained employing yeast
two-hybrid, chromatin immunoprecipitation or RNA, DNA–protein interactions will be valuable to
obtain a more realistic picture of cell functioning.

STRATEGIES COMPARING NETWORKS IN SEVERAL SPECIES
Even if the co-expression networks provide a way to infer gene function and connect genes in a
related way, the question remains as to whether these relationships can be extrapolated to other
organisms. This question is particularly interesting because the networks are, in most cases,
constructed for model organisms. Some authors suggest that to be extrapolated, the gene
co-expression observed in one species should be confirmed in other evolutionarily close species
[63]. Tools have been developed that make use of the large sample size available in these
databases to identify more reliable concerted changes in transcript levels as well as to examine
the coordinated change of gene expression levels [52].

The cross-species comparison of relevant co-expressed gene groups is also useful, as shown in the
database GeneCAT [57], which provides a comparative analysis of Arabidopsis and barley
co-expressed genes, including information regarding sequence similarity. The Confeito algorithm is
used to calculate the interconnectivity between genes in co-expressed gene networks. The database
includes the gene modules for Arabidopsis and seven crops: Glycine max (soybean), Hordeum vulgare
(barley), Oryza sativa (rice), Populus trichocarpa (poplar), Triticum aestivum (wheat), Vitis
vinifera (grape) and Zea mays (maize).

Recently, the comparison between GCNs in different plant species has been presented [64, 65],
which is an important aspect toward transferring information and functional annotation for plants
with scarce genomic information. Through the use of the IsoRankN tool [66], a global network
alignment was constructed to compare a newly constructed maize network and a previously reported
rice co-expression network [64]. This strategy allowed the identification of conserved subgraphs
and preserved co-expression edges, which is the first step to translate functional gene
information between these two species [64]. In a second work, the GCNs for Arabidopsis and six
crop species (barley, poplar, rice, soybean and wheat) were compared. A novel algorithm was
implemented to combine gene sequences with the co-expression network structure. In this way,
similar network vicinities within and across species were inferred, which allowed functional
homologs between these species to be predicted [65]. Both works open new possibilities to
attribute a putative function to unknown genes and to transfer the knowledge from model plant
species to crop species with limited genetic resources.

BEYOND THE PREDICTIVE FUNCTION OF UNKNOWN GENES
GCNs have been useful to provide functional annotations for genes whose function was previously
unknown. An Arabidopsis GCN was partitioned into different modules employing a graph-clustering
algorithm. The deep analysis according to the GO terms, pathway information and gene expression
data of one particular module allowed it to be classified as a photosynthesis module [13]. Based on
its module-level annotation, 173 of the 1381 genes belonging to this module and lacking annotation
were hypothesized to be linked to photosynthesis [13]. In maize, Ficklin and Feltus [64] showed
that a function could be inferred for 193 of the 391 genes of unknown function based on
co-functional gene clusters. A recent study in rice identified gene modules in a GCN based on
condition-dependent and condition-independent data. Gene modules identified by condition-dependent
experiments were more useful to assign functional annotation to rice genes, allowing additional
expression-based annotation for 13 537 genes, 2980 of which lack a functional annotation
description [21].

Although the assignment of a function for a particular gene based on the reconstruction of
networks is merely predictive, some studies have demonstrated, through experimental validation,
that in some cases this prediction is correct. Although not based on GCN, two rice networks were
constructed: one at genome-scale [10] and the other related specifically to abiotic and biotic
stress responses [67]. In both cases, several genes with unknown function were predicted to be
involved in rice stress responses. The function of some of these proteins was validated employing
mutants and it was found that some
positively or negatively regulate resistance to bacterial disease and tolerance to submergence
[10, 67]. Similarly, the function of the unknown proteins that were predicted to be involved in
immunity based on their interactions with immunity-related proteins in the protein–protein
Arabidopsis interactome were validated employing plants mutated in the corresponding genes. Some
of these mutant plants were more susceptible to infection by the pathogens, confirming the
prediction [62].

CONCLUSIONS
The correlation between genes based on their gene expression is a useful strategy to assign a
putative function to unknown genes. In this way, microarray experiments provide an excellent
source of data to identify this type of relationship. In addition, based on patterns of gene
co-expression, networks that allow for greater comprehension of cell behavior are constructed.
Biostatistical analyses should be carried out to correlate with these complex representations of
relationships. In recent years, different statistical methods have been developed to accomplish
this goal. To our knowledge, no in-depth comparative studies have been performed regarding the
methods employed to construct these types of networks for plants. A strict comparison is difficult
when each method provides different information and is even more so when thousands of interactions
remain hypothetical and experimental validation is not available for most of the interactions.

The GCNs obtained using the methods reviewed here differ primarily in the detailed representation
of the gene network viewed as a system; however, the resulting GCNs are very similar.
Nevertheless, the most significant difference is the method used to construct the network. These
methods can be very simple, such as those based on the PCC and mutual information, or they can be
more elaborate and involve more statistical and mathematical complexity, such as the GGM, BNs and
DBNs. For DBNs, high-level computational resources and precise algorithms are mandatory, which
limits their application. The tradeoff between accurate results and computational cost/complexity
depends on the application [13, 51].

Consequently, the choice of method is different for each problem. In many studies, the
construction of GCNs is not the main goal, but only an intermediate step. In this case, a simple
GCN construction methodology based on the PCC should be used [14, 64, 65]. Nonetheless, if a GCN
is to be studied in depth and the functional prediction of a molecular mechanism needs to be
examined in detail, the direction of regulation may be desirable and a Bayesian approach should be
selected.

Based on applications to microarray data from several species, some advantages and disadvantages
can be considered for each method. As Table 1 summarizes, one disadvantage of the PCC is that the
similarity between genes relies only on linear relationships, whereas nonlinear relationships
remain undetected. However, it is known that complex nonlinear relationships are less frequent
than linear relationships. Therefore, the PCC remains a general and useful method [68].

Another problem with the PCC is its sensitivity to outlying observations, which can make pairs of
genes appear erroneously co-expressed [24]. More problems

Table 1: Summary of the advantages and disadvantages of the methods reviewed

Method Summary References

Pearson correlation Only linear relationships are detected [13, 30, 64, 65, 69, 70]
No need for large datasets
Makes assumptions based on distribution
Very sensitive to outliers
High number of false-positive relationships
Confusion between indirect and direct relationships
No loops or feedbacks as in BNs
Graphical Gaussian models More complex and computationally costlier than PCC-based methods [22, 23, 30, 71]
Eliminates the effect of other genes when similarity is calculated
Bayesian networks Directionality of relationships and loops is depicted [16, 44, 72]
Computationally costly
Other similarity metrics No assumptions based on distribution [3, 24, 71, 73]
More samples needed than for PCC
No loops or feedbacks as in BNs
464 Lo¤pez-Kleine et al.

appear when the expression levels of genes are very observations. An important difficulty appears when
low; again, the method cannot produce reliable the number of included genes is high, although
results [69]. Finally, with the PCC, many pairs of several algorithms have emerged to overcome this
genes have the tendency to show similar behavior problem [50, 71]. Large-scale networks are still
in expression profiles by chance even though there more difficult to study because the number of pos-
is not a biological relationship. This random similar- sible networks grows exponentially as the number of
ity makes it difficult to calculate the significance of genes increases [72]. Accordingly, BNs are better
the results [30]. suited for small networks because of low computa-
GCNs based on the GGM compared with tional costs [16]. As this situation is not usually the
networks based on the PCC have allowed the de- case in plants, which have very large genomes,
termination that correlation alone by PCC is not BNs are not a good choice for these types of
strong enough and cannot distinguish between organisms when global networks need to be con-

Downloaded from https://academic.oup.com/bfg/article/12/5/457/206992 by guest on 05 February 2021


direct and indirect interactions [23]. The GGM- structed, but they could be very useful for
based approaches are more complex than PCC subnetworks.
approaches, but they still require fewer assumptions In addition to the previously mentioned advan-
and parameters than BNs [71]. It is also evident that tages and disadvantages, it is known that the results of
GGM-based approaches do not undergo loss of in- GCNs are stored in databases gathering relationships
formation by discretization, as occurs in some meth- obtained via only one method. As databases are very
ods based on similarity (i.e. mutual information) [74]. useful for the preliminary investigation of genes of
Perhaps, the most attractive feature of the GGM is its interest, they would be more informative if they re-
ability to eliminate the effects of other genes over the ported a consensus GCN based on several methods.
pairs analyzed. This method can also enhance the This shortcoming remains unexplored and could
detection of genes with poor direct correlation but reveal solid arguments for the choice of an appropri-
with highly partial correlations through the neigh- ate method or combination of methods for specific
boring genes [30]. expression data.
In contrast with PCC methods, the use of GGM Sequences obtained directly from RNA (its copy
requires a larger number of samples compared with cDNA) by Illumina or 454 have allowed the gathering
the number of genes [13]. Otherwise, this method of expression information in a fast, deep and less tech-
could produce an indefinite sample covariance nically variable manner than classical microarray tech-
matrix that cannot be inverted [32]. Some disadvan- niques. Consequently, in upcoming years, an
tages of the PCC method are also present in the explosion of information regarding gene expression
GGM, such as the assumption of linear relationships data in specialized databases will be available for experi-
among the expression levels and the limitation to mentation. RNA-seq data analysis can be treated the
evaluate only pairwise interactions [71]. same way as microarray data analysis once it has been
In terms of modeling GCNs, the BNs and transformed using the reads per kilobase of exon per
especially the DBNs are the more suitable million mapped reads (RPKM) transformation [75]. It
approaches. These methods are the only ones that is expected that data quality and information gathered
allow the investigators to identify the direction of a from these type of data will be higher than for micro-
regulatory relationship. They also better handle bio- array data (depending on sequence depth and technical
logical and experimental noise [72]. However, very variability). Nevertheless, still not enough RNA-seq
high noise levels reduce the performance of BNs data are available in order to make conclusions on
[29]. this subject and detailed comparison of available
In addition to extracting interactions between models applied on microarray and RNA-seq data
more than two genes, DBNs are unique in the mod- needs to be undertaken. As no GCN construction
eling of self-loops and regulatory effects in time, method is ‘the best’, algorithms that allow a simple,
which are always present in co-expression [44]. It fast and accurate construction of GCNs based on all
has also been demonstrated that BNs tend to outper- of these data and that allow a choice of GCN construc-
form GGM- and PCC-based approaches regarding tion method depending on the researcher’s interest
interventional data (i.e. gene knockouts) [29], need to be developed. Moreover, databases for GCN
whereas the high computational cost of BNs is not storage allowing for comparison and consensus
justified when data originate from passive searches in plants are needed.
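
The contrast drawn in this section between PCC-based and GGM-based networks can be illustrated with a small numerical sketch. It is not taken from any of the reviewed studies: the three-gene chain A → B → C, the noise level, the sample size and the function names below are assumptions chosen for illustration, and NumPy is assumed to be available. In such a chain, genes A and C interact only through B, so their Pearson correlation is expected to be high while their partial correlation, obtained from the inverse of the sample covariance matrix, should be close to zero. The sketch also refuses to run when the number of samples does not exceed the number of genes, because the sample covariance matrix is then singular.

```python
import numpy as np


def pearson_matrix(expr):
    """Pearson correlation between genes; expr has shape (samples, genes)."""
    return np.corrcoef(expr, rowvar=False)


def partial_correlation_matrix(expr):
    """Partial correlations from the inverse sample covariance (GGM-style).

    Each entry measures the association of two genes after removing the
    linear effect of all other genes in the matrix.
    """
    n_samples, n_genes = expr.shape
    if n_samples <= n_genes:
        # With too few samples the sample covariance matrix is singular
        # and cannot be inverted without regularization.
        raise ValueError("need more samples than genes to invert the covariance")
    precision = np.linalg.inv(np.cov(expr, rowvar=False))
    scale = np.sqrt(np.diag(precision))
    pcor = -precision / np.outer(scale, scale)
    np.fill_diagonal(pcor, 1.0)
    return pcor


def simulate_chain(n_samples=500, seed=0):
    """Hypothetical expression data for a regulatory chain A -> B -> C."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=n_samples)
    b = a + 0.5 * rng.normal(size=n_samples)  # B depends directly on A
    c = b + 0.5 * rng.normal(size=n_samples)  # C depends directly on B only
    return np.column_stack([a, b, c])


expr = simulate_chain()
pcc = pearson_matrix(expr)
pcor = partial_correlation_matrix(expr)

print("Pearson correlation A-C:", round(float(pcc[0, 2]), 3))
print("Partial correlation A-C:", round(float(pcor[0, 2]), 3))
print("Partial correlation A-B:", round(float(pcor[0, 1]), 3))
```

In this toy setting, thresholding the Pearson matrix would link A and C as if they were directly co-regulated, whereas the partial correlation keeps only the A-B and B-C edges. For real plant datasets, where genes usually outnumber samples, the plain matrix inverse used here would not apply, and regularized estimators such as the shrinkage approach of [32] would be needed instead.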
Key Points
• Network theory has been successfully employed to study genetic co-expression in plants through GCNs, allowing for predictions and the discovery of functionally related genes from the underlying data.
• Statistical methods for the construction of GCNs in plants are mainly centered on the PCC approach, but other approaches, such as GGM, BNs, DBNs and mutual information, although harder to implement, have shown better accuracy in extracting detailed networks.
• The integration of heterogeneous genomic data and the transfer of genetic knowledge between plants are two significant applications derived from GCNs, leading comparative analyses to uncover functionally conserved genes.

FUNDING
The present work has been partially financed by the Dirección de Investigación Bogotá (DIB) of the Universidad Nacional de Colombia - Sede Bogotá through a grant for the alliance of two research groups: the research groups "Methods in Biostatistics" and "Stress physiology and biodiversity in plants and microorganisms".

References
1. Christensen C, Thakar J, Albert R. Systems-level insights into cellular regulation: inferring, analysing, and modelling intracellular networks. IET Syst Biol 2007;1:61–77.
2. Morgenthal K, Weckwerth W, Steuer R. Metabolomic networks in plants: Transitions from pattern recognition to biological interpretation. Biosystems 2006;83:108–17.
3. Numata J, Ebenhöh O, Knapp EW. Measuring correlations in metabolomic networks with mutual information. Genome Inform 2008;20:112–22.
4. Colcombet J, Hirt H. Arabidopsis MAPKs: a complex signalling network involved in multiple biological processes. Biochem J 2008;413:217–26.
5. Yilmaz A, Mejia-Guerra MK, Kurz K, et al. AGRIS: the Arabidopsis Gene Regulatory Information Server, an update. Nucleic Acids Res 2011;39:D1118–22.
6. Nakashima K, Ito Y, Yamaguchi-Shinozaki K. Transcriptional regulatory networks in response to abiotic stresses in Arabidopsis and grasses. Plant Physiol 2009;149:88–95.
7. Raman K. Construction and analysis of protein–protein interaction networks. Autom Exp 2010;2:2.
8. Hwang S, Rhee SY, Marcotte EM, et al. Systematic prediction of gene function in Arabidopsis thaliana using a probabilistic functional gene network. Nat Protoc 2011;6:1429–42.
9. Atias O, Chor B, Chamovitz DA. Large-scale analysis of Arabidopsis transcription reveals a basal co-regulation network. BMC Syst Biol 2009;3:86.
10. Lee I, Seo YS, Coltrane D, et al. Genetic dissection of the biotic stress response using a genome-scale gene network for rice. Proc Natl Acad Sci USA 2011;108:18548–53.
11. Pop A, Huttenhower C, Iyer-Pascuzzi A, et al. Integrated functional networks of process, tissue, and developmental stage specific interactions in Arabidopsis thaliana. BMC Syst Biol 2010;4:180.
12. Lee I, Ambaru B, Thakkar P, et al. Rational association of genes with traits using a genome-scale gene network for Arabidopsis thaliana. Nat Biotechnol 2010;28:149–56.
13. Mao L, Van Hemert JL, Dash S, et al. Arabidopsis gene co-expression network and its functional modules. BMC Bioinformatics 2009;10:346.
14. Mutwil M, Usadel B, Schütte M, et al. Assembly of an interactive correlation network for the Arabidopsis genome using a novel heuristic clustering algorithm. Plant Physiol 2010;152:29–43.
15. Mentzen WI, Peng J, Ransom N, et al. Articulation of three core metabolic processes in Arabidopsis: Fatty acid biosynthesis, leucine catabolism and starch metabolism. BMC Plant Biol 2008;8:76.
16. Allen JD, Xie Y, Chen M, et al. Comparing statistical methods for constructing large scale gene networks. PLoS One 2012;7:e29348.
17. Gupta A, Maranas CD, Albert R. Elucidation of directionality for co-expressed genes: predicting intra-operon termination sites. Bioinformatics 2006;22:7.
18. Zhang B, Horvath S. A general framework for weighted gene co-expression network analysis. Stat Appl Genet Mol Biol 2005;4:Article17.
19. Faccioli P, Provero P, Herrmann C, et al. From single genes to co-expression networks: extracting knowledge from barley functional genomics. Plant Mol Biol 2005;58:739–50.
20. Edwards KD, Bombarely A, Story GW, et al. TobEA: an atlas of tobacco gene expression from seed to senescence. BMC Genomics 2010;11:142.
21. Childs KL, Davidson RM, Buell CR. Gene coexpression network analysis as a source of functional annotation for rice genes. PLoS One 2011;6:e22196.
22. Ma S, Bohnert HJ. Gene networks in Arabidopsis thaliana for metabolic and environmental functions. Mol Biosyst 2008;4:199–204.
23. Soranzo N, Bianconi G, Altafini C. Comparing association network algorithms for reverse engineering of large-scale gene regulatory networks: synthetic versus real data. Bioinformatics 2007;23:1640–7.
24. Mutwil M. Integrative Transcriptomic Approaches to Analyzing Plant Co-expression Networks. http://opus.kobv.de/ubp/volltexte/2011/5075/pdf/mutwil_diss.pdf (28 December 2012, date last accessed).
25. Ouyang Y, Huang X, Lu Z, et al. Genomic survey, expression profile and co-expression network analysis of OsWD40 family in rice. BMC Genomics 2012;13:100.
26. De Bodt S, Carvajal D, Hollunder J, et al. CORNET: A user-friendly tool for data mining and integration. Plant Physiol 2010;152:1167–79.
27. Fukushima A, Nishizawa T, Hayakumo M, et al. Exploring tomato gene functions based on coexpression modules using graph clustering and differential coexpression approaches. Plant Physiol 2012;158:1487–502.
28. Mochida K, Uehara-Yamaguchi Y, Yoshida T, et al. Global landscape of a co-expressed gene network in barley and its application to gene discovery in Triticeae crops. Plant Cell Physiol 2011;52:785–803.
29. Werhli AV, Grzegorczyk M, Husmeier D. Comparative evaluation of reverse engineering gene regulatory networks with relevance networks, graphical gaussian models and Bayesian networks. Bioinformatics 2006;22:2523–31.
30. Markowetz F, Spang R. Inferring cellular networks – a review. BMC Bioinformatics 2007;8:S5.
31. Toh H, Horimoto K. Inference of a genetic network by a combined approach of cluster analysis and graphical Gaussian modeling. Bioinformatics 2002;18:287–97.
32. Schäfer J, Strimmer K. A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Stat Appl Genet Mol Biol 2005;4:Article32.
33. Schäfer J, Strimmer K. An empirical Bayes approach to inferring large-scale gene association networks. Bioinformatics 2005;21:754–64.
34. Magwene PM, Kim J. Estimating genomic coexpression networks using first-order conditional independence. Genome Biol 2004;5:R100.
35. Wille A, Zimmermann P, Vranová E, et al. Sparse graphical Gaussian modeling of the isoprenoid gene network in Arabidopsis thaliana. Genome Biol 2004;5:R92.
36. Krämer N, Schäfer J, Boulesteix AL. Regularized estimation of large-scale gene association networks using graphical Gaussian models. BMC Bioinformatics 2009;10:384.
37. Ma S, Gong Q, Bohnert HJ. An Arabidopsis gene network based on the graphical Gaussian model. Genome Res 2007;17:1614–25.
38. Ingkasuwan P, Netrphan S, Prasitwattanaseree S, et al. Inferring transcriptional gene regulation network of starch metabolism in Arabidopsis thaliana leaves using graphical gaussian model. BMC Syst Biol 2012;6:100.
39. Chu J, Weiss ST, Carey VJ, et al. A graphical model approach for inferring large-scale networks integrating gene expression and genetic polymorphism. BMC Syst Biol 2009;3:55.
40. Krumsiek J, Suhre K, Illig T, et al. Gaussian graphical modeling reconstructs pathway reactions from high-throughput metabolomics data. BMC Syst Biol 2011;5:21.
41. Murphy K, Mian S. Modelling Gene Expression Data using Dynamic Bayesian Networks. http://www-devel.cs.ubc.ca/murphyk/Papers/ismb99.pdf (28 December 2012, date last accessed).
42. Husmeier D. Sensitivity and specificity of inferring genetic regulatory interactions from microarray experiments with dynamic Bayesian networks. Bioinformatics 2003;19:2271–82.
43. Friedman N, Linial M, Nachman I, et al. Using Bayesian networks to analyze expression data. J Comput Biol 2000;7:601–20.
44. Werhli AV, Husmeier D. Reconstructing gene regulatory networks with Bayesian networks by combining expression data with multiple sources of prior knowledge. Stat Appl Genet Mol Biol 2007;6:Article15.
45. Grzegorczyk M, Husmeier D, Edwards KD, et al. Modelling non-stationary gene regulatory processes with a non-homogeneous Bayesian network and the allocation sampler. Bioinformatics 2008;24:2071–8.
46. Zou M, Conzen SD. A new dynamic Bayesian network (DBN) approach for identifying gene regulatory networks from time course microarray data. Bioinformatics 2005;21:71–9.
47. Geier F, Timmer J, Fleck C. Reconstructing gene-regulatory networks from time series, knock-out data, and prior knowledge. BMC Syst Biol 2007;1:11.
48. Imoto S, Higuchi T, Goto T, et al. Combining microarrays and biological knowledge for estimating gene networks via Bayesian networks. Proc IEEE Comput Soc Bioinform Conf 2003;2:104–13.
49. Werhli AV, Husmeier D. Gene regulatory network reconstruction by Bayesian integration of prior knowledge and/or different experimental conditions. J Bioinform Comput Biol 2008;6:543–72.
50. Needham CJ, Manfield IW, Bulpitt AJ, et al. From gene expression to gene regulatory networks in Arabidopsis thaliana. BMC Syst Biol 2009;3:85.
51. Dondelinger F, Husmeier D, Lèbre S. Dynamic Bayesian networks in molecular plant science: inferring gene regulatory networks from multiple gene expression time series. Euphytica 2011;183:361–77.
52. Coulibaly I, Page GP. Bioinformatic tools for inferring functional information from plant microarray data II: Analysis beyond single gene. Int J Plant Genomics 2008;2008:893941.
53. Steuer R, Kurths J, Daub CO, et al. The mutual information: detecting and evaluating dependencies between variables. Bioinformatics 2002;18(Suppl. 2):S231–40.
54. Butte AJ, Kohane IS. Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements. Pac Symp Biocomput 2000;418–29.
55. Less H, Galili G. Coordinations between gene modules control the operation of plant amino acid metabolic networks. BMC Syst Biol 2009;3:14.
56. Obayashi T, Hayashi S, Saeki M, et al. ATTED-II provides coexpressed gene networks for Arabidopsis. Nucleic Acids Res 2009;37:D987–91.
57. Mutwil M, Øbro J, Willats WGT, et al. GeneCAT - novel webtools that combine BLAST and co-expression analyses. Nucleic Acids Res 2008;36:W320–6.
58. Hirai MY, Klein M, Fujikawa Y, et al. Elucidation of gene-to-gene and metabolite-to-gene networks in arabidopsis by integration of metabolomics and transcriptomics. J Biol Chem 2005;280:25590–5.
59. Saito K, Hirai MY, Yonekura-Sakakibara K. Decoding genes with coexpression networks and metabolomics – ‘majority report by precogs’. Trends Plant Sci 2008;13:36–43.
60. Yao CW, Hsu BD, Chen BS. Constructing gene regulatory networks for long term photosynthetic light acclimation in Arabidopsis thaliana. BMC Bioinformatics 2011;12:335.
61. Arabidopsis Interactome Mapping Consortium. Evidence for network evolution in an Arabidopsis interactome map. Science 2011;333:601–7.
62. Mukhtar MS, Carvunis AR, Dreze M, et al. Independently evolved virulence effectors converge onto hubs in a plant immune system network. Science 2011;333:596–601.
63. Stuart JM, Segal E, Koller D, et al. A gene-coexpression network for global discovery of conserved genetic modules. Science 2003;302:249–55.
64. Ficklin SP, Feltus FA. Gene coexpression network alignment and conservation of gene modules between two grass species: Maize and rice. Plant Physiol 2011;156:1244–56.
65. Mutwil M, Klie S, Tohge T, et al. PlaNet: Combined sequence and expression comparisons across plant networks derived from seven species. Plant Cell 2011;23:895–910.
66. Liao CS, Lu K, Baym M, et al. IsoRankN: spectral methods for global alignment of multiple protein networks. Bioinformatics 2009;25:i253–8.
67. Seo YS, Chern M, Bartley LE, et al. Towards establishment of a rice stress response interactome. PLoS Genetics 2011;7:e1002020.
68. Daub CO, Steuer R, Selbig J, et al. Estimating mutual information using B-spline functions – an improved similarity measure for analysing gene expression data. BMC Bioinformatics 2004;5:118.
69. Bandyopadhyay S, Bhattacharyya M. A biologically inspired measure for coexpression analysis. IEEE/ACM Trans Comput Biol Bioinform 2011;8:929–42.
70. Ruan J, Dean AK, Zhang W. A general co-expression network-based approach to gene expression analysis: comparison and applications. BMC Syst Biol 2010;4:8.
71. Ponnamblam S. Construction of Gene Networks from Expression Profiles. http://opus.kobv.de/ubp/volltexte/2011/5075/pdf/mutwil_diss.pdf (28 December 2012, date last accessed).
72. Watanabe Y, Seno S, Takenaka Y, et al. An estimation method for inference of gene regulatory net-work using Bayesian network with uniting of partial problems. BMC Genomics 2012;13(Suppl. 1):S12.
73. Yu J, Smith VA, Wang PP, et al. Advances to Bayesian network inference for generating causal networks from observational biological data. Bioinformatics 2004;20:3594–603.
74. Wu X, Ye Y, Subramanian RK. Interactive analysis of gene interactions using graphical Gaussian model. In: BIOKDD 3rd ACM SIGKDD Workshop on Data Mining in Bioinformatics, Washington, DC, USA, 2003. pp. 63–69. Rensselaer Polytechnic Institute, Troy, NY, USA.
75. Mortazavi A, Williams BA, McCue K, et al. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 2008;5:621–8.
