Review Gene Coexpression Networks Lopez 2013
1093/bfgp/elt003
Abstract
Corresponding author. Liliana López-Kleine, Statistical Department, Universidad Nacional de Colombia, Ciudad Universitaria. Cra 30 No 45-03, Colombia. Tel: +57 1 3165000, ext. 13175; Fax: +57 1 3165000, ext. 13210; E-mail: llopezk@unal.edu.co
Liliana López is a biologist with a PhD in applied statistics. Her main research interests are systems biology and statistical analysis for
genomic data.
Luis Leal has a BS in Chemical Engineering and is a student in the Master of Science in Statistics program at the Universidad Nacional
de Colombia. Together with Liliana López-Kleine and Camilo López, he is working on a common project regarding the construction
and comparison of GCNs in plants.
Camilo López is a biologist with a PhD in Life Sciences. He is the leader of the Manihot-biotec group in the Biology Department
(Universidad Nacional). His main interest is to elucidate the immune responses activated in cassava in response to cassava bacterial blight
disease.
© The Author 2013. Published by Oxford University Press. All rights reserved. For permissions, please email: journals.permissions@oup.com
458 López-Kleine et al.
nodes or vertices of elements and links or edges between nodes. The edges can signify different types of relationships, for example, the coordinated action of elements or their direct interaction. Depending on the system, these edges can be directed or form loops, or be undirected or acyclic. In the case of plants or any biological system at the cellular level, the definition of the elements and the significance of the edges depend on the available biological data.
A very clear partition of different biological networks is provided by Christensen et al. [1], who separated these networks into five main categories as

from gene expression data that combine several experiments [13–15].
GCNs allow for the summarizing of systemic knowledge regarding all the cellular processes of an organism, summarizing the appropriate molecular information. Moreover, the mere construction of GCNs can contribute to the discovery of novel functional relationships, leading to biological hypotheses. The reliability of GCNs is, of course, dependent on the quality of the available information and on the method of construction. GCNs have been con-
participation in the same function (i.e. a functional relationship based on co-expression).
Numerous studies have constructed GCNs based on the PCC in plants. For Arabidopsis, Mao et al. [13] used 1094 ATH1 arrays to calculate PCCs. This network was constructed and partitioned into modules that were analyzed for biological processes [gene ontology (GO) enrichment terms in modules]. Atias et al. [9] followed a similar methodology; they assessed the PCCs for pairs of genes within each microarray experiment separately. Thus, they proposed a scoring system to merge the PCCs

were the first to apply the GGM to infer relationships between genes based on microarray data from Saccharomyces cerevisiae. Since then, several articles have been published on the application of these methods for microorganisms, animals and plants. The GGM is one of the most popular methods to model genetic networks employing data from microarrays. Under certain circumstances, this technique works better than other methods [29]. Although the GGM has several advantages (e.g. it considers the effect of other genes and is simple), the main disadvantage is that it is best suited for cases in
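As a rough illustration of the PCC-based construction described above, the following minimal sketch builds an unweighted co-expression network by thresholding pairwise Pearson correlations. The function name `pcc_network`, the cutoff of 0.8 and the simulated expression matrix are assumptions made for the example; they are not taken from the studies cited here.

```python
import numpy as np

def pcc_network(expr, threshold=0.8):
    """Boolean adjacency matrix from a genes x samples expression matrix.

    The threshold is a hypothetical cutoff chosen for illustration only.
    """
    corr = np.corrcoef(expr)          # Pearson correlation between rows (genes)
    adj = np.abs(corr) >= threshold   # keep strong positive or negative co-expression
    np.fill_diagonal(adj, False)      # drop self-edges
    return adj

# Simulated data: gene 1 follows gene 0, gene 2 is independent noise.
rng = np.random.default_rng(0)
base = rng.normal(size=50)
expr = np.vstack([
    base,
    2.0 * base + 0.05 * rng.normal(size=50),
    rng.normal(size=50),
])
adj = pcc_network(expr)
```

In this toy case only the pair (gene 0, gene 1) is linked. On real microarray compendia the choice of cutoff strongly affects the number of edges, so it would normally be tuned or replaced by a significance test.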
sizes [13]. Another Arabidopsis network based on gene co-expression employing the GGM was constructed for more than 7000 genes, which permitted the identification of the regulation of the biochemical and stress response pathways, including in particular the trehalose-6-phosphate phosphatase [22]. Another GGM-based GCN was also constructed in Arabidopsis with a focus on the starch metabolism in leaves [38]. The grouping of genes according to their oscillating day/night expression was carried out to identify particular patterns of co-regulation with starch biosynthesis and degradation, indicating a re-

are a sequence of temporal steps where genes only interact from step to step. These attributes make DBNs especially useful to model regulatory networks from microarray time series data [46, 47]. The use of BNs has allowed the integration of prior biological knowledge, such as Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways [48, 49].
Although Bayesian approaches were first used for the construction of gene networks by Friedman et al. [43], these methods have been applied to only a couple of sets of plant gene ex-
expression profiles. The first approach reviewed by the authors is based on ‘mutual information’, defined as a measure of correlation between gene expression patterns [53]. A higher mutual information between a pair of genes means that they are nonrandomly associated [54]. In this method, the regulatory interaction between two genes is established if the mutual information on their expression patterns is significantly larger than a P-threshold value calculated from the mutual information between random permutations of the same patterns. Unlike Bayesian theory, which analyzes whole networks and selects

information regarding position and metabolic pathway, among others [56].

STRATEGIES TO INCLUDE ADDITIONAL GENOMIC DATA
Several studies have attempted to integrate various types of genomic data in order to reconstruct gene networks in plants in specialized databases. A successful example is the CoP database, which has been constructed using a large dataset (10 022 assays) obtained from public plant microarray databases. It as-
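The permutation-based mutual information test described above can be sketched as follows. This is only an illustration under stated assumptions: mutual information is estimated with a simple equal-width histogram (the cited works may use other estimators), and the names `mutual_information` and `mi_edge`, the number of bins, the number of permutations and the significance level `alpha` are all hypothetical choices.

```python
import numpy as np

def mutual_information(x, y, bins=8):
    """Histogram-based estimate of mutual information (in nats)."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()                 # joint probabilities
    px = pxy.sum(axis=1, keepdims=True)       # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)       # marginal of y, shape (1, bins)
    nz = pxy > 0                              # skip empty cells to avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

def mi_edge(x, y, n_perm=200, alpha=0.05, seed=0):
    """Declare an edge if the observed MI exceeds a permutation threshold."""
    rng = np.random.default_rng(seed)
    observed = mutual_information(x, y)
    # Null distribution: MI after shuffling one profile breaks any association.
    null = np.array([mutual_information(x, rng.permutation(y))
                     for _ in range(n_perm)])
    threshold = np.quantile(null, 1.0 - alpha)
    return observed > threshold, observed, threshold

# Simulated nonlinear dependence between two expression profiles.
rng = np.random.default_rng(1)
x = rng.normal(size=300)
y = x ** 2 + 0.1 * rng.normal(size=300)
is_edge, observed, threshold = mi_edge(x, y)
```

The simulated pair is related through a quadratic function, a nonlinear dependence that a linear measure such as the PCC is not designed to capture but that mutual information can detect.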
and Hyaloperonospora arabidopsidis [62]. This study found that pathogen proteins preferentially interact with highly connected plant proteins, which in most cases are related to immune responses [62]. The information based on protein–protein interaction obtained employing yeast two-hybrid, chromatin immunoprecipitation or RNA, DNA–protein interactions will be valuable to get a more realistic picture of cell functioning.

STRATEGIES COMPARING

to translate functional gene information between these two species [64]. In a second work, the GCNs for Arabidopsis and six crop species (barley, Medicago, poplar, rice, soybean and wheat) were compared. A novel algorithm was implemented to combine gene sequences with the co-expression network structure. In this way, similar network vicinities within and across species were inferred, which allowed the prediction of functional homologs between these species [65]. Both works open new possibilities to attribute a putative function to unknown genes and to transfer the knowledge from model plant spe-
regulate positively or negatively the resistance to bacterial disease and tolerance to submergence [10, 67]. Similarly, the functions of the unknown proteins that were predicted to be involved in immunity based on their interactions with immunity-related proteins in the protein–protein Arabidopsis interactome were validated employing plants mutated in the corresponding genes. Some of these mutant plants were more susceptible to infection by the pathogens, confirming the prediction [62].

resulting GCNs are very similar. Nevertheless, the most significant difference is the method used to construct the network. These methods can be very simple, such as those based on the PCC and mutual information, or they can be more elaborate and involve more statistical and mathematical complexity, such as the GGM, BNs and DBNs. For DBNs, high-level computational resources and precise algorithms are mandatory, which limits their application. The tradeoff between accurate results and computational cost/complexity depends on the application [13, 51].
Pearson correlation [13, 30, 64, 65, 69, 70]
  Only linear relationships are detected
  No need for large datasets
  Makes assumptions based on distribution
  Very sensitive to outliers
  High number of false-positive relationships
  Confusion between indirect and direct relationships
  No loops or feedbacks as in BNs
Graphical Gaussian models [22, 23, 30, 71]
  More complex and computationally costlier than PCC-based methods
  Eliminates the effect of other genes when similarity is calculated
Bayesian networks [16, 44, 72]
  Directionality of relationships and loops is depicted
  Computationally costly
Other similarity metrics [3, 24, 71, 73]
  No assumptions based on distribution
  More samples needed than for PCC
  No loops or feedbacks as in BNs
appear when the expression levels of genes are very low; again, the method cannot produce reliable results [69]. Finally, with the PCC, many pairs of genes have the tendency to show similar behavior in expression profiles by chance even though there is not a biological relationship. This random similarity makes it difficult to calculate the significance of the results [30].
GCNs based on the GGM compared with networks based on the PCC have allowed the determination that correlation alone by PCC is not strong enough and cannot distinguish between

observations. An important difficulty appears when the number of included genes is high, although several algorithms have emerged to overcome this problem [50, 71]. Large-scale networks are still more difficult to study because the number of possible networks grows exponentially as the number of genes increases [72]. Accordingly, BNs are better suited for small networks because of low computational costs [16]. As this situation is not usually the case in plants, which have very large genomes, BNs are not a good choice for these types of organisms when global networks need to be con-
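The contrast drawn above between PCC-based and GGM-based networks can be illustrated with a small simulation. The sketch below assumes far more samples than genes, so that the sample covariance matrix can be inverted directly; for genome-scale data a regularized estimator, such as the shrinkage approach listed in the references, would be needed instead. The function name `partial_correlations` and the simulated chain of three genes are hypothetical.

```python
import numpy as np

def partial_correlations(expr):
    """Partial correlation matrix from a genes x samples expression matrix.

    Assumes the sample covariance is invertible (more samples than genes).
    """
    precision = np.linalg.inv(np.cov(expr))   # inverse covariance matrix
    d = np.sqrt(np.diag(precision))
    pcor = -precision / np.outer(d, d)        # scale and flip sign of off-diagonals
    np.fill_diagonal(pcor, 1.0)
    return pcor

# Simulated chain a -> b -> c: a and c are related only through b.
rng = np.random.default_rng(0)
n = 2000
a = rng.normal(size=n)
b = a + rng.normal(size=n)
c = b + rng.normal(size=n)
expr = np.vstack([a, b, c])

corr = np.corrcoef(expr)            # marginal (Pearson) correlations
pcor = partial_correlations(expr)   # correlations conditioned on the other gene
```

Here the Pearson correlation between a and c is clearly nonzero, whereas their partial correlation given b is close to zero, so a GGM would keep the edges a–b and b–c and drop the indirect a–c link.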
with relevance networks, graphical gaussian models and Bayesian networks. Bioinformatics 2006;22:2523–31.
30. Markowetz F, Spang R. Inferring cellular networks – a review. BMC Bioinformatics 2007;8:S5.
31. Toh H, Horimoto K. Inference of a genetic network by a combined approach of cluster analysis and graphical Gaussian modeling. Bioinformatics 2002;18:287–97.
32. Schäfer J, Strimmer K. A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Stat Appl Genet Mol Biol 2005;4:Article32.
33. Schäfer J, Strimmer K. An empirical Bayes approach to inferring large-scale gene association networks. Bioinformatics 2005;21:754–64.
34. Magwene PM, Kim J. Estimating genomic coexpression
47. Geier F, Timmer J, Fleck C. Reconstructing gene-regulatory networks from time series, knock-out data, and prior knowledge. BMC Syst Biol 2007;1:11.
48. Imoto S, Higuchi T, Goto T, et al. Combining microarrays and biological knowledge for estimating gene networks via Bayesian networks. Proc IEEE Comput Soc Bioinform Conf 2003;2:104–13.
49. Werhli AV, Husmeier D. Gene regulatory network reconstruction by Bayesian integration of prior knowledge and/or different experimental conditions. J Bioinform Comput Biol 2008;6:543–72.
50. Needham CJ, Manfield IW, Bulpitt AJ, et al. From gene expression to gene regulatory networks in Arabidopsis thaliana. BMC Syst Biol 2009;3:85.
51. Dondelinger F, Husmeier D, Lèbre S. Dynamic Bayesian
65. Mutwil M, Klie S, Tohge T, et al. PlaNet: Combined sequence and expression comparisons across plant networks derived from seven species. Plant Cell 2011;23:895–910.
66. Liao CS, Lu K, Baym M, et al. IsoRankN: spectral methods for global alignment of multiple protein networks. Bioinformatics 2009;25:i253–8.
67. Seo YS, Chern M, Bartley LE, et al. Towards establishment of a rice stress response interactome. PLoS Genetics 2011;7:e1002020.
68. Daub CO, Steuer R, Selbig J, et al. Estimating mutual information using B-spline functions–an improved similarity measure for analysing gene expression data. BMC Bioinformatics 2004;5:118.
69. Bandyopadhyay S, Bhattacharyya M. A biologically inspired
71. Ponnamblam S. Construction of Gene Networks from Expression Profiles. http://opus.kobv.de/ubp/volltexte/2011/5075/pdf/mutwil_diss.pdf (28 December 2012, date last accessed).
72. Watanabe Y, Seno S, Takenaka Y, et al. An estimation method for inference of gene regulatory network using Bayesian network with uniting of partial problems. BMC Genomics 2012;13(Suppl. 1):S12.
73. Yu J, Smith VA, Wang PP, et al. Advances to Bayesian network inference for generating causal networks from observational biological data. Bioinformatics 2004;20:3594–603.
74. Wu X, Ye Y, Subramanian RK. Interactive analysis of gene interactions using graphical Gaussian model. In: BIOKDD 3rd ACM SIGKDD Workshop on Data Mining in Bioinformatics,