
In: Information Theory: New Research

Editors: P. Deloumeaux et al., pp. 137-184

ISBN: 978-1-62100-325-0
© 2011 Nova Science Publishers, Inc.

Chapter 4

THE ROLE OF INFORMATION THEORY IN GENE REGULATORY NETWORK INFERENCE
Enrique Hernandez-Lemus and Claudia Rangel-Escareno

Computational Genomics Department,
National Institute of Genomic Medicine, Mexico
E-mail address: ehernandez@inmegen.gob.mx

Abstract
One important problem in contemporary computational biology is that of reconstructing the best possible set of regulatory interactions between genes (a so-called gene regulatory network, GRN) from partial knowledge, as given for example by gene expression analysis experiments. Since only highly noisy data are available, this represents a challenge to common probabilistic modeling approaches. However, a variety of algorithms rooted in information theory and maximum entropy methods have been developed, and they have coped with the problem successfully (to a certain degree). Mutual information maximization, Markov random fields, use of the data processing inequality, minimum description length, Kullback-Leibler divergence and information-based similarity are some of these. Another approach to modeling gene regulatory networks combines information theory and machine learning techniques. Monte Carlo methods and variational methods can also be used to measure data information content. Hidden Markov models (HMMs) and stochastic linear dynamical systems use time series data to represent information about the past of a state sequence: an HMM does so through a discrete random variable called the hidden state, while a stochastic linear dynamical system uses a real-valued hidden state vector. Common to these models is the fact that, conditioned on the hidden state vector, the past, present and future observations are statistically independent. State-space models, also known as Linear Dynamical Systems (LDS) or Kalman filter models, are a subclass of dynamic Bayesian networks used for modeling time series data. Expressing time series models in state-space form allows for unobserved components, an important factor when
modeling gene expression data. Unobserved variables can model biological effects that are not taken into account by the observables. They could
model the effects of genes that have not been included in the experiment,
levels of regulatory proteins or possible effects of mRNA degradation.
Work presented here shows the use of these models to reverse engineer
regulatory networks from high-throughput data sources such as microarray gene expression profiling. In this review we will also describe the
basic theoretical foundations common to such methods and will briefly
outline their virtues and limitations.

Keywords: Information theory, Network inference, probabilistic modeling

1. Introduction

A common situation in several emerging fields of science and technology, such as bioinformatics and computational biology, high energy physics and astronomy, to name a few, is that researchers are confronted with datasets having thousands of variables, large noise levels, non-linear statistical dependencies and a very reduced sampling universe. The detection of functional and structural relationships in the data under such conditions is always a major challenge. In particular, the construction of dynamic maps of gene interactions (also called genetic regulatory networks) relies on understanding the
interplay between thousands of genes. Several issues arise in the analysis of
data related to gene function: the measurement processes generate highly noisy
signals; there are far more variables involved (number of genes and interactions
among them) than experimental samples. Another source of complexity is the
highly nonlinear character of the underlying biochemical dynamics.
Hence two important milestones in the analysis of genomic regulation are
variable selection (also called feature selection) and network inference. The former is a machine learning topic whose goal is to select, from amongst thousands
of input variables, those that lead to the best predictive model. Feature selection methods applied to genomic data allow, for instance, improving molecular diagnosis and prognosis in complex diseases (such as cancer) by identifying a set (called a molecular signature) of features or variables that best represent the phenomenon. Network inference, in turn, consists in representing the (in general non-linear) set of statistical dependencies between the variables of a set (that can be the whole input dataset or a feature-selected subset of it) by means of a graph. When applied to genomic expression data (e.g. from microarray experiments), network inference is able to reverse-engineer the transcriptional gene regulatory network (GRN) of the related cell. Knowledge of this GRN would allow, for instance, the discovery of new drug targets to cure diseases.
Information theory (IT) has proved to be a powerful theoretical foundation for developing algorithms and computational techniques to deal both with feature selection and with network inference problems applied to real data. There are, however, goals and challenges involved in the application of IT to genomic analysis. The applied algorithms should return intelligible models (i.e. they must be understandable); they must also rely on little a priori knowledge, deal with thousands of variables, detect non-linear dependencies, and do all of this starting from tens (or at most a few hundred) highly noisy samples. As we will show in this chapter, IT has provided approaches to deal with these problems.
Some of these approaches are based on machine learning techniques, basically
by modeling a target function connecting the variables of a system. Here, the
output or target variable is the one to be predicted and the input variables are the
predictors.
As a means to produce intelligible models we perform feature-selection procedures. The goal of these procedures is to select inputs among a set of variables
which lead to the best predictive model. In the vast majority of cases, feature
selection is a preprocessing step prior to the actual machine learning stage. This
is a somewhat critical part of the whole inference process. On the one hand, variable or feature elimination can lead to information losses. On the other, feature selection is a means to improve the accuracy of a model, to improve its generalizability, as well as its intelligibility, and at the same time to
decrease the computational burden for the training and inference stages. Computational methods for feature selection usually consist in a search algorithm
that explores different combinations of variables, supplemented with a measure
of performance (or score) for these combinations. There are several ways to accomplish this task; in our opinion, the best benchmarking options for the GRN inference scenario are the use of sequential search algorithms (as opposed to stochastic search) and performance measures based on IT, since these make feature selection fast and efficient, and also provide an easy means to communicate
the results to non-specialists (e.g. molecular biologists, geneticists and physicians).
GRNs are graph-theoretical constructs that describe the integrated state of
a cell (or a small population of similar cells to be more precise) under certain
biological conditions at a given time. GRNs are means for identifying gene
interactions from experimental data through the use of theoretical models and
computational analysis. The inference of such an interaction connectivity network involves the solution of an inverse problem (a deconvolution) that aims to
uncover the interactions from the properties and dynamics of observable behavior in the form of, for example, RNA transcription levels in a characteristic gene
expression profile. A growing number of deconvolution methods (also called
reverse engineering methods) have been proposed in the past [6, 62]. Their
goal is to provide a well-defined representation of the cellular network topology from the transcriptional interactions as revealed by gene expression measurements that are then treated as samples from a joint probability distribution.
The goal of deconvolution methods is the discovery of GRNs based on statistical dependencies within this joint distribution [13]. One major shortcoming is
that, surprisingly, there is still no conceptual agreement as to what the dependencies are within these multivariate settings and about the role of noise and
stochastic dynamics in the problem. The special case of conditional statistical
dependence has gained, however, a certain place as a somehow useful criterion
in most biomedical applications. The central aim is to find a way to decompose the Statistical Dependency Matrix (SDM), that is, the deviation of a joint probability distribution from the product of its marginals, into a series of well-defined contributions coming from interactions of several orders of complexity. IT is therefore the right setting to do so. Typical means to reach this goal consist in quantifying the new information content that arises when we
look at the full joint probability distribution compared to a series of successive
independence approximations.
In GRNs each variable of the dataset is represented by a node (or vertex) in
the graph. There is a link joining two variable-nodes if these variables exhibit
a particular form of dependency (the particular form of dependency depends
explicitly on the inference method chosen). Some genes can produce a protein


(or other biomolecules such as a microRNA) that is able to activate or repress


the production of another gene's protein. There are thus circuits coded in the DNA of a cell. A useful way to represent these circuits is a graph
where the nodes represent the genes and the links or arcs are the interactions
between them. Here we will be dealing with reverse engineering methods for
GRNs using whole-genome gene expression data as input data. This problem
is very general and useful in contemporary research in computational molecular
biology; however, it is a question that remains open to date due to its combinatorial nature and the poor information content of the data. Validation of networks against available real-life data will thus be an important stage in the discovery
of reliable GRNs.
As we have seen there are two major shortcomings related to the feature
selection and network inference procedures: i) non-linearity and ii) large number of variables. IT methods are often efficient techniques to deal with issues
i) and ii) [52, 22, 21, 38, 26]. It can be seen that most of these methods rely
on some form of mutual information metric. Mutual information (MI) is an
information-theoretic measure of dependency which is model independent and
has been used to define (and quantify) relevance, redundancy and interaction in
such large noisy datasets. MI has the enormous advantage that captures nonlinear dependencies [38, 26]. Finally MI it is rather fast to compute, hence it
can be calculated a high number of times in a still reasonable amount of time,
an explicit requirement in whole-genome transcription analysis.

2. Information Theoretical Measures and Probability Measures

We will introduce here the essential notions of IT that will be used, like
entropy, mutual information and other measures. In order to do so, let X and Y
denote two discrete random variables having the following features:
- Finite alphabets 𝒳 and 𝒴, respectively
- Joint probability mass distribution p(X, Y)
- Marginal probability mass distributions p(X) and p(Y)

Let also X̃ and Ỹ denote two additional discrete random variables defined on 𝒳 and 𝒴 respectively; the associated probability mass distributions will be p̃(X) and p̃(Y), with joint probability mass distribution p̃(X, Y), defined on 𝒥, the joint probability sampling space, 𝒥 = 𝒳 × 𝒴. For particular realizations,
we have p(x) = P (X = x) and p(y) = P (Y = y).
Following Shannon [58], for every discrete probability distribution X it is
possible to define the information theoretical entropy H of such distribution as
follows
H = −K_s Σ_{x∈𝒳} p(x) log p(x)    (1)

here H is called the Shannon-Weaver entropy, K_s is a constant useful to determine the units in which entropy is measured, and p(x) is the probability mass for the state of the random variable given by X = x. Entropy was originally developed to serve as a measure of the amount of uncertainty associated with the value of X, hence relating the predictability of an outcome
with the probability distribution.
The Kullback-Leibler divergence, KL(· ; ·), is a non-commutative measure of
the difference between two discrete probability distributions [33].
KL[ p(Y); p̃(Y) ] = Σ_{y∈𝒴} p(y) log [ p(y) / p̃(y) ]    (2)

The joint Kullback-Leibler divergence between two probability mass distributions p(X, Y) and p̃(X, Y) is given by:
KL[ p(X, Y); p̃(X, Y) ] = Σ_{x∈𝒳} p(x) Σ_{y∈𝒴} p(y|x) log [ p(x, y) / p̃(x, y) ]    (3)

In a similar way, it is possible to define the Conditional Kullback-Leibler


divergence between p(Y|X) and p̃(Y|X) as follows:
KL[ p(Y|X); p̃(Y|X) ] = Σ_{x∈𝒳} p(x) Σ_{y∈𝒴} p(y|x) log [ p(y|x) / p̃(y|x) ]    (4)

Equation 4 means that a conditional Kullback-Leibler divergence can also


be defined as the expected value of the Kullback-Leibler divergence of the conditional probability mass functions averaged over the conditioning random variables.
Recalling equation 2 we notice that it can be rephrased as follows:

KL[ p(Y); p̃(Y) ] = Σ_{y∈𝒴} p(y) log p(y) − Σ_{y∈𝒴} p(y) log p̃(y)    (5)

We can see that the first term on the right-hand side of equation 5 is precisely the negative of the entropy H(Y) as given by equation 1. Shannon's entropy depends on the distribution p(Y) and, as Shannon himself showed [58], it is maximal for a uniform distribution u(Y), with H[u(Y)] = log |𝒴|. If we replace p̃(y) by u(Y) in equation 5 we get:
H[p(Y)] = log |𝒴| − KL[ p(Y); u(Y) ]    (6)

As we can see, equation 6 states that the entropy of a random variable Y


is the logarithm of the size of the support set minus the Kullback-Leibler divergence between the probability distribution of Y and the uniform distribution
over the same domain 𝒴. Thus, the closer the probability distribution is to a uniform distribution, the higher the entropy. Hence, entropy measures the randomness and unpredictability of a distribution.
Now, let us consider a pair of discrete random variables (Y, X) with a Joint
Probability Distribution (JPD) p(Y, X). For these random variables the joint
entropy H(Y, X) is given in terms of the JPD as:
H(Y, X) = −Σ_{y∈𝒴} Σ_{x∈𝒳} p(y, x) log p(y, x)    (7)

We may notice that the maximal joint entropy is attained under independence of the random variables Y and X, that is, when the JPD factorizes as p(Y, X) = p(Y)p(X); in this case the entropy of the JPD is just the sum of their respective entropies. An inequality theorem can be stated as an upper bound for the joint entropy:
H(Y, X) ≤ H(Y) + H(X)    (8)

Equality only holds if X and Y are statistically independent.


Also, given a Conditional Probability Distribution (CPD), the corresponding
conditional entropy of Y given X can be defined as:
H(Y|X) = −Σ_{y∈𝒴} Σ_{x∈𝒳} p(y, x) log p(y|x)    (9)


Conditional entropies are useful to measure the uncertainty of a random


variable once another one (the conditioner) is known. It can be proved [12] that:
H(Y, X) = H(X) + H(Y|X) ≤ H(Y) + H(X)    (10)

Or, in other words:


H(Y|X) ≤ H(Y)    (11)

Equality only holds when X and Y are statistically independent. Expression


11 is extremely useful in the inference/prediction scenario: if Y is a target variable and X is a predictor, adding variables can only decrease the uncertainty on
the target Y. This will prove almost essential for IT methods of GRN inference.
Entropy reduction by conditioning can be accounted for in a formal way
if we consider a measure called the mutual information, I(Y, X) which is a
symmetrical measure (i.e. I(Y, X) = I(X, Y )) that is written as:
I(Y, X) = H(Y) − H(Y|X)   or   I(X, Y) = H(X) − H(X|Y)    (12)

If we resort to Shannon's definition of entropy (equation 1) [58] and substitute it into equation 12 we get:
I(Y, X) = Σ_{y∈𝒴} Σ_{x∈𝒳} p(x, y) log [ p(x, y) / (p(x)p(y)) ]    (13)

Mutual information can be written as the Kullback-Leibler


divergence between the JPD and the product distribution:
I(Y, X) = KL[ p(X, Y); p(X)p(Y) ]    (14)

Mutual information is also given by the Kullback-Leibler divergence between the conditional distribution p(X|Y) and the marginal distribution p(X), averaged over Y:
I(Y, X) = KL[ p(X|Y); p(X) ]    (15)

Mutual information and Kullback-Leibler divergences are two of the most


widely used IT measures to solve the GRN inference problem.
A comprehensive catalogue of algorithms to calculate diverse information
theoretical measures has been developed for the [R] statistical scientific computing environment [27].
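As a simple illustration of these quantities (our own sketch in Python, not part of the [R] catalogue cited above), entropy, Kullback-Leibler divergence and mutual information for discrete distributions can be computed directly from probability arrays:

```python
import numpy as np

def entropy(p):
    """Shannon entropy of equation (1) with K_s = 1 and natural logarithms."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL[p; q] of equation (2)."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def mutual_information(pxy):
    """Mutual information of equations (13)-(14): the KL divergence between
    the joint distribution pxy (a 2-D array) and the product of its marginals."""
    px = pxy.sum(axis=1, keepdims=True)   # marginal p(x)
    py = pxy.sum(axis=0, keepdims=True)   # marginal p(y)
    return kl_divergence(pxy.ravel(), (px * py).ravel())

# example: two weakly dependent binary variables
pxy = np.array([[0.4, 0.1],
                [0.2, 0.3]])
print(mutual_information(pxy))
```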

3. Methods in Regulatory Network Inference

The deconvolution of a GRN could be based on a maximum entropy optimization of the JPD of gene-gene interactions, as given by gene expression experimental data, and could be implemented as follows [26]. The JPD for the stationary expression of all genes, P({g_i}), i = 1, . . . , N, may be written as follows [38]:
P({g_i}) = (1/Z) exp(H_gen)    (16)

H_gen = −[ Σ_i^N Φ_i(g_i) + Σ_{i,j}^N Φ_{i,j}(g_i, g_j) + Σ_{i,j,k}^N Φ_{i,j,k}(g_i, g_j, g_k) + . . . ]    (17)

Here N is the number of genes, Z is a normalization factor (the partition


function), and the Φ's are interaction potentials. A truncation procedure in equation 17 is used to define an approximate Hamiltonian H_p that aims to describe the statistical properties of the system. A set of variables (genes) α interacts with each other if and only if the potential Φ_α between such a set of variables is non-zero. The relative contribution of Φ_α is taken as proportional to the strength of the interaction within this set. Equation 17 does not define the potentials uniquely; thus, additional constraints should be provided in order to avoid ambiguity. A usual approach to do so is to specify the Φ's using maximum entropy (MaxEnt) approximations consistent with the available information on the system in the form
of marginals. Information theory provides a set of useful criteria for setting up
probability distribution functions (PDFs) on the basis of partial knowledge.
The MaxEnt estimate of a PDF is the least biased estimate possible, given
the information, i.e. the PDF that is maximally non-committal with regard to
missing information [28]. It is not possible to constrain the system via the
specification of all possible N-way potentials when N is large, hence one has
to approximate the interaction structure. According to the current genomics
literature, sample sizes of order 10² (the usual maximum size available in
most present-day studies) are generally sufficient to estimate 2-way marginals,
whereas 3-way marginals (e.g. triplet interactions Φ_{i,j,k}(g_i, g_j, g_k)) require
about an order of magnitude more samples, a sample size unattainable under
present circumstances. This being the case, one is usually confronted with a 2-way Hamiltonian of the form:


Figure 1. A set of genes i interacts with another set of genes k by means of a potential Φ ≠ 0 and is non-interacting with another set of genes j, since the corresponding potential functional is equal to zero.

H_approx = −[ Σ_i^N Φ_i(g_i) + Σ_{i,j}^N Φ_{i,j}(g_i, g_j) ]    (18)

Under that approximation, the reconstruction (or deconvolution) of the associated GRN consists in the inverse-problem of determining the complete set
of relevant 2-way interactions Φ_{i,j}(g_i, g_j) consistent with the JPD (equations 16 and 17) that encodes all known constraints, e.g. the values of the stationary expression of genes g_i as given by the set of Φ_i(g_i)'s, and non-committal with
every other restriction in the form of a marginal. The modeling of a GRN depends on the description of the interactions in the form of several correlation
functions. A great deal of work has been done within the framework of the
Bayesian Network (BN) approach [51, 23]. BN models, both static and dynamic, have provided a better understanding of the problem in terms of solvability, noise reduction and algorithmic complexity. Since BNs are a form of the
Directed Acyclic Graph (DAG) problem, there are several instances (e.g. feedforward loops, feed-back cycles, etc.) in which the DAG formalism of BNs


falls short. It has been noted [6] that BNs require a larger number of data points
(samples) to infer the probability density distributions whereas information theoretical approaches perform well for steady-state data and can be applied even
when few experiments (compared to the number of genes) are available. A recently developed approach is the use of statistical and information theoretical
models to describe the interactions [36].
If we consider a 2-way interaction Hamiltonian, all gene pairs (i, j) for which Φ_{i,j} = 0 are said to be non-interacting. This is true for genes that are statistically independent, P(g_i, g_j) ≈ P(g_i) P(g_j), but it is also valid for genes that do not have a direct interaction but are connected via other genes, i.e. Φ_{i,j} = 0 but P(g_i, g_j) ≠ P(g_i) P(g_j). Several metrics, such as Pearson correlation, squared correlation and Spearman rank coefficients over the sampling universe, have been used, but the performance of these methods is usually poor, as it suffers from a large number of false positive predictions.

3.1. Information Theoretical Methods

3.1.1. Mutual Information


An information theoretical measure that has been used successfully to infer
2-way interactions in GRNs is mutual information (MI) [38, 37, 3, 4]. MI for
a pair of random variables ξ and η is defined as I(ξ, η) = H(ξ) + H(η) − H(ξ, η). Here H is the information theoretical entropy (Shannon's entropy), H(x) = −⟨log p(x_i)⟩ = −Σ_i p(x_i) log p(x_i). MI measures the degree of statistical dependency between two random variables. From the definition one can see that I(ξ, η) = 0 if and only if ξ and η are statistically independent.
Estimating MI between gene expression profiles under high-throughput experimental setups typical of today's research in the field is a computational and
theoretical challenge of considerable magnitude. One possible approximation
is the use of estimators. Under a Gaussian kernel approximation [60], the JPD

of a 2-way measurement X_i = (x_i, y_i), i = 1, 2, . . . , M is given as [38]:

f(X) = (1/M) Σ_{i=1}^{M} (1/h²) G( |X − X_i| / h )    (19)

G is the bivariate standard normal density and h is the associated kernel


width [38]. The mutual information could be evaluated as follows:


I({x_i}, {y_i}) = (1/M) Σ_i log [ f(x_i, y_i) / ( f(x_i) f(y_i) ) ]    (20)

hence, two genes with expression profiles g_i and g_j for which I(g_i, g_j) ≠ 0 are said to interact with each other with a strength I(g_i, g_j) ∝ Φ(g_i, g_j), whereas two genes for which I(g_i, g_j) is zero are declared non-directly interacting to within the given approximations. Since MI is reparametrization invariant, one usually calculates the normalized mutual information. In this case I(g_i, g_j) ∈ [0, 1], ∀ i, j.
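A minimal Python sketch of this kernel-based estimator (our own illustration; the bandwidth h and the use of standardized profiles are assumptions, not prescriptions from [38]) is:

```python
import numpy as np

def gaussian_kernel_mi(x, y, h=0.3):
    """Estimate I(x; y) with a Gaussian kernel density estimator in the
    spirit of equations (19)-(20); x and y are 1-D arrays of length M
    (e.g. expression profiles of two genes)."""
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    dx = (x[:, None] - x[None, :]) / h
    dy = (y[:, None] - y[None, :]) / h
    # univariate Gaussian kernels; the bivariate kernel factorizes as kx * ky
    kx = np.exp(-0.5 * dx ** 2) / (np.sqrt(2.0 * np.pi) * h)
    ky = np.exp(-0.5 * dy ** 2) / (np.sqrt(2.0 * np.pi) * h)
    fx = kx.mean(axis=1)             # f(x_i)
    fy = ky.mean(axis=1)             # f(y_i)
    fxy = (kx * ky).mean(axis=1)     # f(x_i, y_i)
    return np.mean(np.log(fxy / (fx * fy)))   # equation (20)
```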

Figure 2. Panel i shows a bivariate interaction between gene A and genes B and C, panel ii shows an indirect interaction of gene A on gene C mediated by gene B, panel iii depicts two independent interactions between gene A and B and gene A and C.

A highly customizable set of algorithms for mutual information inference


of gene regulatory networks has been implemented in the [R]/BioConductor
scheme [43, 42] and is called minet. The inference proceeds in two steps.
First, the Mutual Information Matrix (MIM) is computed, a square matrix whose
MIM_{ij} term is the mutual information between genes x_i and x_j. Secondly, an
inference algorithm takes the MIM matrix as input and attributes a score to
each edge connecting a pair of nodes. Different entropy estimators are implemented in this package, as well as different inference methods, namely aracne, clr and mrnet; finally, the package integrates accuracy assessment tools, like PR-curves and ROC-curves, to compare the inferred network with a reference one [41]. The approach used there is also based on techniques from information theory; it is called the maximum relevance/minimum redundancy algorithm (MRMR) [17] and is a highly effective information-theoretic technique for


feature (or variable) selection in supervised learning. The MRMR principle


consists in selecting among the least redundant variables the ones that have the
highest mutual information with the target.
MRNET [41] extends this feature selection principle to networks in order to
infer gene-dependence relationships from microarray data. The MRMR method
[17] used in conjunction with a best-first search strategy for performing filter
selection in supervised learning problems can be described as follows. Consider a supervised learning task where the output is denoted by Y and V
is the set of input variables. MRMR ranks the set V of inputs according to
a score that is the difference between the mutual information with the output
variable Y (maximum relevance) and the average mutual information with the
previously ranked variables (minimum redundancy). Hence direct interactions
should be well ranked, whereas indirect interactions should be badly ranked
by the method. Then a greedy search algorithm starts by selecting the variable Xi that shows the highest mutual information to the target Y. The following selected variable Xj will be the one with a high information I(Xj ; Y ) to
the target and at the same time a low information I(Xj ; Xi ) to the previously
selected variable. In the following steps, given a set S of selected variables,
the criterion updates S by choosing the variable that maximizes the score. At
each step of the algorithm, the selected variable is expected to allow an efficient
trade-off between relevance and redundancy. The MRMR criterion is therefore
an optimal pairwise approximation (a proxy) of the conditional mutual information between any two genes Xj and Y given the set S of selected variables
I(X_j; Y|S). MRNET (and also minet) proceeds by repeating this MRMR algorithm for every target gene (or, in any case, for every gene for which de novo transcriptional interactions are sought).
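A schematic Python implementation of this greedy MRMR ranking (a didactic sketch with hypothetical function names, not the minet implementation) is:

```python
import numpy as np

def mrmr_rank(mim, target, n_select):
    """Greedy MRMR ranking from a precomputed mutual information matrix
    `mim` (genes x genes); `target` is the index of the output gene Y.
    Each step adds the variable maximizing relevance minus redundancy."""
    candidates = [i for i in range(mim.shape[0]) if i != target]
    selected = []
    while candidates and len(selected) < n_select:
        def score(i):
            relevance = mim[i, target]
            redundancy = np.mean([mim[i, j] for j in selected]) if selected else 0.0
            return relevance - redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

Repeating such a ranking with every gene as the target, and scoring each candidate edge accordingly, yields an MRNET-like network.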
MRNET reverse engineers networks by means of a forward selection strategy that aims to identify a maximally-independent set of neighbors for every
variable. A known limitation of algorithms based on forward selection, however, is that the quality of the selected subset strongly depends on the first variable selected (dependence on initial conditions). A modified version, called mrnetb [43], has been presented; it is an improved version of MRNET that overcomes this shortcoming by using backward selection followed by a sequential replacement, and it can be implemented with about the same computational burden as
the original forward selection strategy. The optimization problem of MRNET is
a form of binary quadratic optimization for which backward elimination combined with a sequential search is known to perform well. Backward elimination


starts with a set containing all the variables and then selects the variable Xi
whose removal induces the highest increase of the objective function. The procedure is enhanced by an iterative sequential replacement which, at each step,
swaps the status of a selected and a non selected variable such that the largest
increase in the objective function is achieved. The sequential replacement is
stopped when no further improvement is met [43]. Forward selection, backward
elimination, and sequential replacement all have an algorithmic complexity of
O(n2 ) so that the network built by backward elimination followed by sequential
replacement has the same asymptotic computational cost as the one based on a
forward selection strategy alone.
As one may further notice, the inference of GRNs by means of such high-performance IT methods is hampered by a large computational complexity. The limiting condition of these approaches is the time-consuming step of computing the MI matrix. A method has been proposed by Qiu and colleagues [53] to reduce this computation time. It is based on the application of spectral analysis to re-order the genes, so that genes that share regulatory relationships are more likely to be placed close to each other. Then, using a sliding window approach with appropriate window size and step size, the MI for the genes within the sliding window is computed, and the remainder is assumed to be zero. Qiu's method does not incur a performance loss in regions of high precision and low recall, while the computational time is significantly lowered. The essence of Qiu's method is as follows: to determine the new gene ordering, a Laplacian matrix is derived from the correlation matrix of the gene expression data, assuming the correlation matrix provides an adequate approximation to the adjacency matrix for this purpose; then the Fiedler vector [11] is computed, which
is the eigenvector associated with the second smallest eigenvalue of the Laplacian matrix. Since the Fiedler vector is smooth with respect to the connectivity
described by the Laplacian matrix, the elements of the Fiedler vector are then
sorted to obtain the desired gene ordering. The computational complexity of obtaining the gene ordering is negligible compared to the computation of the MI
matrix. The reduction in computational complexity is the result of computing
only the diagonal part of the reshuffled MI matrix. Because the remaining entries of the MI matrix are set to zero, there is a potential loss of reconstruction accuracy, although due to the Fiedler minimization [53] this effect is not expected to be significant. In fact, according to a benchmark of the method [53], in the high-precision low-recall regime applying the sliding window does not cause a performance loss. In some cases, applying the sliding window yields slightly better


performance. In the low-precision regime, however, the windowed version has lower recall, but this regime is dismissed because one is not able to distinguish biologically meaningful links from false positive ones.
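The gene-reordering step can be prototyped in a few lines of Python (our own illustration of the spectral idea, not Qiu's reference implementation):

```python
import numpy as np

def fiedler_gene_order(expr):
    """Spectral gene reordering: build a graph Laplacian from the absolute
    correlation matrix of the expression data (genes x samples), take the
    Fiedler vector (eigenvector of the second smallest eigenvalue) and
    sort the genes by its entries."""
    corr = np.abs(np.corrcoef(expr))      # rough stand-in for the adjacency matrix
    np.fill_diagonal(corr, 0.0)
    laplacian = np.diag(corr.sum(axis=1)) - corr
    eigvals, eigvecs = np.linalg.eigh(laplacian)   # eigenvalues in ascending order
    fiedler = eigvecs[:, 1]
    return np.argsort(fiedler)
```

The MI matrix is then evaluated only within a sliding window over this ordering, with the remaining entries set to zero.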
3.1.2. Markov Random Fields
A Markov random field is an n-dimensional random process defined on a
discrete lattice. Usually the lattice is a regular 2-dimensional grid in the plane,
either finite or infinite. Assuming that Xn is a Markov Chain taking values in a
finite set,
P(X_n = x_n | X_k = x_k, k ≠ n) = P(X_n = x_n | X_{n−1} = x_{n−1}, X_{n+1} = x_{n+1})    (21)

Hence, the full conditional distribution of X_n depends only on the neighbors X_{n−1} and X_{n+1}. In the 2-D setting, if S = {1, 2, . . . , N} × {1, 2, . . . , N} is the set of N² points, called sites or states, the aforementioned morphism defines a conditional Markov random field [32].
Markov random field (MRF) models have been applied in several scenarios
within the computational molecular biology setting, for instance with regard
to functional prediction of proteins in protein-protein interaction networks [14,
15, 35], in the discovery of molecular pathways for protein interaction and gene
expression data [56] and in general network-based analysis for genomic data
[63, 64]. In the case of reverse engineering methods for network inference, a
MRF model could be stated as follows [63]:
An arbitrary state assignment for a gene set will be denoted by x =
(x_1, x_2, . . . , x_p); here x_i is the expression state (either equally or differentially expressed, 0 or 1 respectively) of gene i. Let x̄ be the true but unknown
gene expression state. We can interpret this as a particular realization of a
random vector X = (X1 , X2 , . . . , Xp ) where Xi assigns expression state to
gene i. Let y_i stand for the experimentally observed mRNA expression level of gene i and y for the corresponding vector, which here is interpreted as a particular realization of a random vector Y = (Y_1, Y_2, . . . , Y_n). Y_i itself is a vector y_i = (y_{i,1}, y_{i,2}, . . . , y_{i,m}, y_{i,m+1}, y_{i,m+2}, . . . , y_{i,m+n}). This vector contains m replicates under one condition and n replicates under the other condition. The joint distribution of Y could be given in terms of a MRF; to write down this joint probability we need to know the conditional dependence/independence structure. Information theory could then be useful to determine, from the distributions, such


conditional dependencies. One way to do that is by means of the so-called Iterative Conditional Mode (ICM) algorithm [63] but other IT-based alternatives
could also be used.
Conditional dependencies are not the only application of IT and MRFs in
transcriptional network inference. To study functional robustness in GRNs,
Emmert-Streib and Dehmer [20] modeled the information processing within the
network as a first-order Markov chain and studied the influence of single gene perturbations on the global, asymptotic communication among genes. Differences were accounted for by an information theoretic measure that allows one to predict genes that are fragile with respect to single gene knockouts. The information theoretic measure used to capture the asymptotic behavior of information
processing evaluates the deviation of the unperturbed (or normal (n)) state from
the perturbed (p) state caused by the perturbation of gene k. The relative entropy
or Kullback-Leibler (KL) divergence was used to quantify this deviation:
KL_{i,k} = KL[ p^{p,∞}_{i,k} ; p^{n,∞}_i ] = Σ_m p^{p,∞}_{i,k}(m) log [ p^{p,∞}_{i,k}(m) / p^{n,∞}_i(m) ]    (22)

In equation 22 the stationary distributions p^{p,∞}_{i,k} and p^{n,∞}_i are given by:

p^{p,∞}_{i,k} = lim_{t→∞} T_k^t p^0_i    (23)

p^{n,∞}_i = lim_{t→∞} T^t p^0_i    (24)

The Markov chain given by Tk corresponds to the process obtained by perturbing gene k in the network. By means of this Markov chain model supplemented with an information theoretical KL measure, Emmert-Streib and
Dehmer [20] were able to study the asymptotic behavior of the transcriptional
regulatory network of yeast regarding information propagation under the influence of single gene perturbations. Hence not only static network properties
(such as structure) of the transcriptional regulation networks but also dynamic
features (such as robustness) could be analyzed from the standpoint of IT. The
study concludes that knocked-out genes destroy some communication paths and, hence, can still have a strong impact on the information processing within the cell. It seems reasonable to assume that the further away the knocked-out gene is from the starting gene (say, in Dijkstra distance [16]), the smaller the impact will be. This is strong evidence that information processing on a systems level


depends crucially on the information processing in a local environment of the


gene that sends the information.
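A compact Python sketch of this comparison (our own illustration; the transition matrices T and T_k, the initial distribution p0 and the number of iteration steps are assumed inputs, with column-stochastic matrices) might read:

```python
import numpy as np

def knockout_divergence(T, T_k, p0, n_steps=200):
    """KL divergence of equation (22) between the asymptotic distributions
    reached by the unperturbed chain T and by the chain T_k obtained after
    knocking out gene k (equations (23)-(24))."""
    p_normal = np.linalg.matrix_power(T, n_steps) @ p0        # approximates p^{n,infinity}
    p_perturbed = np.linalg.matrix_power(T_k, n_steps) @ p0   # approximates p^{p,infinity}
    mask = p_perturbed > 0
    return float(np.sum(p_perturbed[mask] *
                        np.log(p_perturbed[mask] / p_normal[mask])))
```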
From the perspective of information processing, the connection between asymptotic information change and local network structure, represented by node degrees, is interesting because it indicates that a local subgraph may be sufficient to study information processing in the overall network. This finding seems truly interesting because it would allow one to reduce the computational complexity (and also the computational burden) that arises when studying large genomes on a systems scale.
3.1.3. Data Processing Inequality
In engineering and information theory, the data processing inequality (DPI)
is a simple but useful theorem that states that no matter what processing you do
on some data, you cannot get more information (in the sense of Shannon [58])
out of a set of data than was there to begin with. In a sense, it provides a bound
on how much can be accomplished with signal processing [12]. More quantitatively, consider two random variables, X and Y, whose mutual information is
I(X, Y). Now consider a third random variable, Z, that is a (probabilistic) function of Y only. The "only" qualifier means P_{Z|XY}(z|x, y) = P_{Z|Y}(z|y), which in turn implies that P_{X|YZ}(x|y, z) = P_{X|Y}(x|y), as is easy to show using Bayes' theorem. The DPI states that Z cannot have more information about X than Y has about X; that is, I(X; Z) ≤ I(X; Y). This inequality, which again is a property that Shannon's information should have, can be proved thus: I(X; Z) = H(X) − H(X|Z) ≤ H(X) − H(X|Y, Z) = H(X) − H(X|Y) = I(X; Y).
The inequality follows because conditioning on an extra variable (in this case Y
as well as Z) can only decrease entropy, and the second to last equality follows
because P_{X|YZ}(x|y, z) = P_{X|Y}(x|y). This same principle is applicable both to engineering control systems and to biological signal processing, such as that present in GRNs [38, 57].
In reference [38] the DPI states that if genes g1 and g3 interact only through
a third gene, g_2, we have that I(g_1, g_3) ≤ min[ I(g_1, g_2); I(g_2, g_3) ]. Hence,
the least of the three MIs can come from indirect interactions only so that the
proposed algorithm (ARACNe) examines each gene triplet for which all three


MIs are greater than I0 and removes the edge with the smallest value. DPI is
thus useful to quantify efficiently the dependencies among a large number of
genes. The ARACNe algorithm eliminates those statistical dependencies that
might be of an indirect nature, such as between two genes that are separated by
intermediate steps in a transcriptional cascade. Such genes will very likely have
non-linear correlated expression profiles which may result in high MI, and
otherwise would be selected as candidate interacting genes. Given a transcription factor, application of the DPI will generate predictions about other genes
that may be its direct transcriptional targets or its upstream transcriptional regulators [39, 25].
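A toy Python version of this DPI pruning step (a didactic sketch, not the actual ARACNe implementation; the threshold i0 is an assumed parameter) is shown below:

```python
import numpy as np

def dpi_prune(mim, i0=0.0):
    """Prune a mutual information matrix with the data processing inequality:
    for every gene triplet whose three MI values all exceed i0, the edge with
    the smallest MI is flagged as indirect and removed."""
    n = mim.shape[0]
    keep = mim > i0
    for i in range(n):
        for j in range(i + 1, n):
            for k in range(j + 1, n):
                if keep[i, j] and keep[i, k] and keep[j, k]:
                    edges = {(i, j): mim[i, j], (i, k): mim[i, k], (j, k): mim[j, k]}
                    a, b = min(edges, key=edges.get)     # weakest edge of the triplet
                    keep[a, b] = keep[b, a] = False
    return keep
```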
The use of the DPI may result not only in a better assessment of the results but also in a significant reduction of the computational burden associated with network inference. Zola et al. [67] presented a parallel method integrating
mutual information, data processing inequality, and statistical testing to detect
significant dependencies between genes, and efficiently exploit parallelism inherent in such computations. They developed a method to carry out permutation testing for assessing statistical significance of interactions, while reducing
its computational complexity by a factor of O(n²), where n is the number of genes. They addressed the problem of inference (usually consuming thousands of computation hours) at the whole-genome network level by constructing a 15,222-gene network of the plant Arabidopsis thaliana from 3,137 microarray experiments in 30 minutes on a 2,048-CPU IBM Blue Gene/L, and in 2 hours and 25 minutes on an 8-node Cell blade cluster [67].
3.1.4. Minimum Description Length
One of the major drawbacks of information theoretic models used to infer
GRNs is that of setting up a threshold which defines the regulatory relationships
between genes. The minimum description length (MDL) principle has been
implemented to overcome this problem [10, 19]. The description length used
by the MDL principle is the sum of model length and data encoding length.
A user-specified fine tuning parameter is used as control mechanism between
model and data encoding, but it is difficult to find the optimal parameter. A new
inference algorithm has been proposed, which incorporates mutual information
(MI), conditional mutual information (CMI) [defined in terms of the associated conditional entropies] and predictive minimum description length (PMDL)
principle to infer gene regulatory networks from DNA microarray data. In this


algorithm, the information theoretic quantities MI and CMI determine the regulatory relationships between genes, and the PMDL principle attempts to
determine the best MI threshold without the need of a user-specified fine tuning
parameter.
Given three random variables X, Y and Z, the conditional mutual information is a measure of the reduction in the uncertainty of X due to knowledge of
Y when Z is given. In other words,
I(X; Y|Z) = Σ_{y∈𝒴} Σ_{x∈𝒳} Σ_{z∈𝒵} p(x, y, z) log [ p(x, y|z) / ( p(x|z) p(y|z) ) ]    (25)
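For discrete variables, equation (25) can be evaluated directly; the following Python fragment (our own illustration, not part of the cited algorithm) does so for a joint distribution stored as a 3-dimensional array:

```python
import numpy as np

def conditional_mutual_information(pxyz):
    """Conditional mutual information I(X; Y | Z) of equation (25) from a
    discrete joint probability array pxyz indexed as [x, y, z]."""
    pz = pxyz.sum(axis=(0, 1))      # p(z)
    pxz = pxyz.sum(axis=1)          # p(x, z)
    pyz = pxyz.sum(axis=0)          # p(y, z)
    cmi = 0.0
    for x in range(pxyz.shape[0]):
        for y in range(pxyz.shape[1]):
            for z in range(pxyz.shape[2]):
                p = pxyz[x, y, z]
                if p > 0:
                    cmi += p * np.log(p * pz[z] / (pxz[x, z] * pyz[y, z]))
    return cmi
```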

The description length of the MDL principle involves the calculation of the
model length and the data length. As the length can vary for various models,
the method could give biased results towards the length of the model. A model
based on universal code length is the PMDL principle. The description length
for a model in PMDL is given as:
L_D = −Σ_{t=0}^{m−1} log[ p(X_{t+1} | X_t) ]    (26)

Here p(X_{t+1}|X_t) is the conditional probability mass or density. A gene


can take any value when transformed from one time point to another due to
the probabilistic nature of the network. The network is associated with a Markov chain which is used to model state transitions; in this sense MDL is related to the log-likelihood of a Markov random field. This transition probability
p(Xt+1 |Xt ) is connected with entropy as follows: Each state transition brings
new information that is measured by the conditional entropy H(Xt+1 |Xt ) in
such a way that:
L_D = H(X_1) + Σ_{j=1}^{m−1} H(X_{j+1} | X_j)    (27)

Since H(X1 ) is common to all models it can be removed from the description length to give [10]:
L_D = Σ_{j=1}^{m−1} H(X_{j+1} | X_j)    (28)
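As an illustration (our own sketch, under the assumption of a stationary transition distribution), the per-step conditional entropy H(X_{t+1}|X_t) that builds up the sum in equation (28) can be estimated from empirical transition counts of a discretized expression time series:

```python
import numpy as np
from collections import Counter

def stepwise_conditional_entropy(states):
    """Empirical H(X_{t+1} | X_t) from a discretized time series `states`
    (e.g. 0/1 expression states); under stationarity, equation (28) amounts
    to roughly (m - 1) times this value for an m-point series."""
    pairs = Counter(zip(states[:-1], states[1:]))
    prev_counts = Counter(states[:-1])
    n_pairs = len(states) - 1
    h = 0.0
    for (a, b), c in pairs.items():
        p_ab = c / n_pairs                 # joint p(x_t = a, x_{t+1} = b)
        p_b_given_a = c / prev_counts[a]   # conditional p(x_{t+1} = b | x_t = a)
        h -= p_ab * np.log2(p_b_given_a)
    return h
```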


It is also noticeable that the MDL principle helps to achieve a good
trade-off between the network model complexity and the accuracy of data fitting, since given a network and a dataset, the MDL principle evaluates simultaneously the goodness of fit of the network and the data. Intuitively, the more
complicated the network is, the better the data would be fitted. However, very
often models which are over-fitted relative to the actual systems are selected,
which give rise to numerous errors. MDL aims to achieve a good trade-off between model complexity and fitness of the data. A general criterion is thus obtained for constructing the network so as to contain only direct interactions. The
convergence of the proposed MDL-based network inference algorithms can be
assessed by the recovery of the topology of some artificial networks and through
the error rate plots obtained through extensive simulations on datasets produced
by synthetic networks [66].
3.1.5. Kullback-Leibler Divergence
The Kullback-Leibler divergence [33] (as well as its symmetrized version, the Jensen-Shannon measure) is, as it turns out, a very commonly used information density in GRN inference and other problems in computational molecular biology, either as the sole measure [45, 44] or used in conjunction with other
indicators, such as spectral metrics [29], Markov fields [20], minimum description lengths [19], Bayesian networks [50, 31, 46, 48] and multivariate analysis
[40].
However, by far the most general use of the KL divergence within the GRN inference setting is in playing the role of the multi-information: it is known [40] that for two variables, X_1 and X_2, independence is well defined via decomposition of the bivariate JPD, P(X_1, X_2) = P(X_1)P(X_2), and mutual information I(X_1; X_2) = ⟨log_2 P(X_1, X_2)/[P(X_1)P(X_2)]⟩, which is the only
measure of dependence [58]. Along the same lines, the total interaction (i.e. the
deviation from independence) in a multivariate JPD, P (Xi ), i = 1, ..., N , can
be measured by the multi-information as follows:

I(X_1; X_2; . . . ; X_N) = KL[ P(X_1; X_2; . . . ; X_N), P̃ ] = KL[ P(X_1; X_2; . . . ; X_N), Π_i P(X_i) ]    (29)

Here P(X_1; X_2; . . . ; X_N) is the full JPD and P̃ = Π_i P(X_i) is the probability distribution approximated under the independence assumption. Since P̃ is


the maximum entropy (MaxEnt) distribution [28] that has the same univariate
marginals as P but without statistical dependencies among the variables, the
multi-information is given by the KL divergence between the JPD and its MaxEnt approximation with univariate marginal constraints. This KL-divergence
measures the gain in information by knowing the complete JPD against assuming total independence. In a similar fashion, thus, MaxEnt distributions consistent with various multivariate marginals of the JPD introduce no statistical
interactions apart from the corresponding marginals. By comparing the JPD to
its MaxEnt approximations under various marginal constraints, we are expecting to separate dependencies included in the low-order statistics from those not
present in them [40].
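For a small discrete system, the multi-information of equation (29) can be computed exactly; the following Python sketch (our own illustration) takes the full JPD as an N-dimensional array and compares it with the product of its univariate marginals:

```python
import numpy as np

def multi_information(joint):
    """Multi-information of equation (29): KL divergence between the joint
    distribution `joint` (an N-dimensional probability array) and the
    independence approximation given by its univariate marginals."""
    ndim = joint.ndim
    indep = np.ones_like(joint)
    for k in range(ndim):
        other_axes = tuple(a for a in range(ndim) if a != k)
        marginal = joint.sum(axis=other_axes)
        shape = [1] * ndim
        shape[k] = -1
        indep = indep * marginal.reshape(shape)   # outer product of marginals
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / indep[mask])))
```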
Assuming that we have an N-variable GRN and we know a set of marginal distributions of all variable subsets (of size k ≥ 1), one can ask what is the JPD P^k that captures all multivariate interactions prescribed by these marginals, but introduces no additional dependencies. This is of course equivalent to searching for the minimum of I(X_1; X_2; . . . ; X_N) or, conversely, the maximum of its entropy H(X_1; X_2; . . . ; X_N), turning our inference problem into a MaxEnt problem:
P^k = arg max_{P̃, {P̃_M = P_M}} H(P̃)    (30)

where M is the set of constrained variables.


3.1.6. Information Based Similarity
A promising approach consists in considering that the interactivity of the
system is based on communication channels (either real or abstract) for the biosignals. Thus, Information Theory (IT) could play a useful role in identifying
entropic measures between pairs {gi , gj } of genes within the sampling universe
as potential interactions Φ_{i,j}. IT can also provide means to test for the MaxEnt distribution by considering, for example, the Kullback-Leibler (KL)
divergence (in the sense of multi-information) or the Connected Information as
criteria of iterative convergence to the MaxEnt PDF in the same sense that the
cumulative distribution leads to the specification of usual PDFs [61].
One possible approach that we propose below is based on the quantification
of the so-called Information-Based Similarity index (IBS) [65], initially developed to work out the complex structure generated by the human heartbeat time


series. Nevertheless, IBS has proved to be a very powerful tool in the comparison of the dynamics of highly nonlinear processes. Within the present context
[26], the symbolic sequence represents the expression values of a single gene (say the k-th gene) along the whole sampling universe (of size M), as given by a vector

g_k = (g_k^1, g_k^2, . . . , g_k^M). Let us consider a series Λ = {λ_n} that could well represent a gene expression vector. It is possible to classify each pair of successive points into one of the following binary states B_n: if (λ_{n+1} − λ_n) < 0 then B_n = 0; otherwise B_n = 1. This procedure maps the M-step real-valued time series Λ(i) into an (M − 1)-step binary-valued series B(i). It is now possible to define
a binary sequence of length m (called an m-bit word). Each of the m-bit words
w_k represents a unique pattern in a given time series. For every unitary time-shift, the algorithm builds a different collection W of m-bit words over the
whole time series, W = {w1 , w2 , . . . , wn } . It is expected that the frequency
of occurrence of these m-bit words will reflect the underlying dynamics of the
original (real-valued) time series. We are then looking to write down a probability distribution function in the rank-frequency representation (RF-PDF). This
RF-PDF represents the statistical hierarchy of symbolic words of the original
series [65]. Two given symbolic sequences are said to have similarity if they
give rise to similar probability distribution functions.
Following the very same order of ideas, Yang and collaborators [65] defined
a measure of similarity (akin to statistical equivalence) between two series by
plotting the rank number of every m-bit word in the first series with the rank
for the same m-bit word in the second series. Of course since the series are
supposed to be finite, the m-bit words are not equally likely to appear. The
method introduces the likelihood of each word by defining a weighted distance
Θ_m between two given symbolic sequences S_1 and S_2 as follows:
Θ_m(S_1, S_2) = [ 1/(2^m − 1) ] Σ_{k=1}^{2^m} |R_1(w_k) − R_2(w_k)| F(w_k)    (31)

F (wk ) is the normalized likelihood of the m-bit word k, weighted by its


given Shannon entropy, i.e.:
F(w_k) = (1/Z) [ −p_1(w_k) log p_1(w_k) − p_2(w_k) log p_2(w_k) ]    (32)

p_i(w_k) and R_i(w_k) represent the probability and rank of a given word w_k in the i-th series. The normalization factor in equation 32 is the total Shannon entropy of the ensemble and is calculated as Z = −Σ_k [ p_1(w_k) log p_1(w_k) + p_2(w_k) log p_2(w_k) ]. Θ_m(S_1, S_2) is called the Information-Based Similarity index (IBS) between series S_1 and S_2 (e.g. expression vectors g_1 and g_2 for genes 1 and 2, respectively). One notices that Θ_m(S_1, S_2) ∈ [0, 1], ∀ S_1, S_2, ∀ m. In fact, one is able to consider Θ_m(S_1, S_2) as a probability measure. If Θ_m(S_1, S_2) → 1 the series are absolutely dissimilar, whereas in the opposite case (Θ_m(S_1, S_2) → 0) the two series become equivalent (in the statistical sense). One can then approximate the value of the interaction potentials Φ(g_i, g_j) as follows. If one is to consider interaction as given by correlation or information flow, one can notice that high values of Θ_m imply stronger dissimilarity, hence lower correlation; and since Θ_m is a probability measure, one can define the complementary measure Θ̄_m = 1 − Θ_m and then approximate Φ(g_i, g_j) ≈ Θ̄_m(g_i, g_j).
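The whole IBS computation can be prototyped in a few lines of Python (our own simplified sketch; the word length m and the restriction of the sums to the words actually observed are implementation choices, not part of [65]):

```python
import numpy as np
from collections import Counter

def ibs_distance(s1, s2, m=4):
    """Information-based similarity of equations (31)-(32) between two
    real-valued series: map each series to a binary increment sequence,
    count m-bit words, and compare their rank orderings weighted by the
    Shannon entropy of each word."""
    def word_probs(series):
        b = (np.diff(series) >= 0).astype(int)   # binary states B_n
        words = [tuple(b[i:i + m]) for i in range(len(b) - m + 1)]
        counts = Counter(words)
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}
    p1, p2 = word_probs(s1), word_probs(s2)
    vocab = set(p1) | set(p2)
    rank1 = {w: r for r, w in enumerate(sorted(vocab, key=lambda w: -p1.get(w, 0.0)))}
    rank2 = {w: r for r, w in enumerate(sorted(vocab, key=lambda w: -p2.get(w, 0.0)))}
    def h(p):
        return -p * np.log(p) if p > 0 else 0.0
    weight = {w: h(p1.get(w, 0.0)) + h(p2.get(w, 0.0)) for w in vocab}
    z = sum(weight.values())                     # normalization factor Z
    dist = sum(abs(rank1[w] - rank2[w]) * weight[w] for w in vocab)
    return dist / (z * (2 ** m - 1)) if z > 0 else 0.0
```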

4. Bayesian and Machine Learning Methods

Systems biology aims to understand biological processes in living systems


by developing mathematical models which are capable of integrating both experimental and theoretical knowledge, and it works both ways: given a pre-specified mathematical framework, the behavior of a set of genes in a specific
GRN can be simulated under a variety of biological conditions and used to test
hypotheses. But also, given a particular pre-specified mathematical framework,
the observation of gene behavior under specific conditions may be used to infer
the underlying GRN. Generally speaking, the reconstruction of a GRN based on
experimental data is known as a reverse engineering approach.
In the context of information theory combined with systems biology, there
are two well-known information extraction approaches, characterized as top-down and bottom-up; both have been used to infer GRNs from high-throughput
data sources such as microarray gene expression measurements. A top-down
approach mainly breaks down a system in order to gain insights into it. On the other hand, bottom-up approaches seek to construct synthetic gene
networks.
The simplest network in an information theory approach is the correlation
network. This is an undirected graph with edges that are weighted by correlation coefficients. It is simple, computationally manageable and has small data requirements. The drawback is that these models are static and they do not
infer the causality of gene regulation.

4.1. Bayesian Networks

A Bayesian network (BN) is a probabilistic graphical network model, described by a directed acyclic graph (DAG). In the model each node represents a
random variable and edges define conditional independence relations between
these random variables. These relationships, e.g. gene-gene interactions, can be seen in a directed graph without cycles. "Without cycles" means that a gene may have
no direct or indirect interaction with itself. In order to reverse engineer a gene
network using this approach, one would need to find the directed acyclic graph
that best describes the gene expression data. This particular limitation of a directed acyclic graph can be overcome by using a dynamic Bayesian network.

4.2. Dynamic Bayesian Networks

Bayesian networks that model sequences of variables are called dynamic


Bayesian networks (DBNs). Murphy and Mian [47] first introduced the use of
DBNs to model gene expression time series data. The benefits of DBNs include
the ability to handle latent variables and missing data (such as transcription
factor protein concentrations, which may have an effect on mRNA steady state
levels) and to model stochasticity. Friedman et al. [23] explored experimental
applications to microarray data analysis. Dynamic Bayesian networks may also
use continuous measurements rather than discrete. Feedback loops can also
be unfolded with respect to time, by explicitly modeling the influence of gene
g1 at time t1 on another gene g2 at time t2 , where t2 > t1 . An appropriate
model for gene expression microarray data belongs to the class of linear state
space models, widely used in estimation and control problems arising in system
modeling. These models consist of a state variable that is either unobserved
or partially observed, an observable that evolves in a linear relation to the state
variable, and a structural specification which is a set of parameters in the linear
and distributional relationships between state variables, observables, and noise
terms.

4.3. State-Space Models

State-Space models, also known as Linear Dynamical Systems (LDS), are


a subclass of dynamic Bayesian networks. A state space model is a mathematical model for a process that accepts inputs which are the drivers of the process
and generates outputs that are interpreted as observable manifestations of what

The Role of Information Theory...

161

is going on inside the process and how this internal behavior is affected by the
inputs. These models are suitable for modeling time series data where we have
a series of observations related to a series of unobserved variables changing
over time. Time series models in state-space representation can be thought of as
unobserved component models. The state vector represents those unobserved
or hidden or missing variables and their dynamics over time are governed by
a state transition equation. In the very general setting of a state-space model,
the state vector determines the future evolution of the dynamic system, given
future time paths of all of the variables affecting the system. The variables are
not restricted, they can be either discrete with a countable number of possible
values or continuous with an associated density curve. For example, modeling
gene expression data assumes continuous variables and requires the inclusion
of hidden states. Hidden variables could model the effects of genes that have
not been included in the experiment, they could also model levels of regulatory
proteins as well as possible effects of mRNA or protein degradation. One goal
is to infer the characteristics and properties of the unobserved variables based
on the observations. In linear state-space models, a sequence of p-dimensional
real-valued observation vectors {y_1, . . . , y_T} is modeled by assuming that at each time step y_t was generated from a K-dimensional real-valued hidden (i.e. unobserved) state variable x_t, and that the sequence of x's is governed by a first-order
Markov process. This type of model is shown pictorially in Figure (3).
A linear-Gaussian state space model of the time series {yt } is specified by
the matrices A and C called system matrices and is described by a pair of equations:
x_{t+1} = A x_t + w_t    (33)

y_t = C x_t + v_t    (34)

These two equations represent the most basic form of a state-space model.
The vector x_t ∈ R^K is called the state vector at time t. The state equation
(33) shows how this vector evolves with time. A is the dynamic or transition
state matrix, and its eigenvalues are important in determining the way the data
behave. The observation equation (34) specifies the relationship between the
observed data and this newly introduced vector xt . C describes the relation between state and observation, and wt and vt are zero-mean random noise vectors.
Figure 3. State-Space model.

For the most general case the noise vectors could be mutually correlated, although serially uncorrelated. In the particular linear-Gaussian case they are mutually independent and independent of the initial state value x_0. Assuming that the initial state x_0 is fixed or Gaussian distributed, and that the noise vectors are jointly Gaussian, the state and output of the system are also Gaussian. That is, all future hidden states x_t and observations y_t generated from those hidden states will be Gaussian distributed.
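To make the generative reading of equations (33) and (34) concrete, the short numpy sketch below simulates a small linear-Gaussian state-space model; the dimensions, system matrices and noise covariances are arbitrary illustrative choices, not values estimated from any expression data set.

```python
import numpy as np

rng = np.random.default_rng(0)

K, p, T = 2, 4, 50          # hidden dimension, observed dimension, time points (illustrative)
A = np.array([[0.9, 0.1],   # state transition matrix (eigenvalues inside the unit circle)
              [0.0, 0.8]])
C = rng.normal(size=(p, K)) # state-to-observation matrix
Q = 0.1 * np.eye(K)         # state noise covariance
R = 0.2 * np.eye(p)         # observation noise covariance

x = np.zeros((T, K))
y = np.zeros((T, p))
x[0] = rng.multivariate_normal(np.zeros(K), np.eye(K))  # Gaussian initial state
for t in range(T):
    y[t] = C @ x[t] + rng.multivariate_normal(np.zeros(p), R)           # observation eq. (34)
    if t + 1 < T:
        x[t + 1] = A @ x[t] + rng.multivariate_normal(np.zeros(K), Q)   # state eq. (33)
```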
This model has been extensively used in state-space modeling. Brockwell
and Davis [7] develop the state-space model described by (33) and (34) as
well as the associated Kalman filter recursions and apply these in representing
ARMA (autoregressive moving average) and ARIMA (autoregressive integrated
moving average) processes. The Kalman filter recursions define recursive estimators for the state vector xt , given observations up to the present time t. Stoffer
and Shumway [59] present a similar development and apply it to representing
ARMAX (autoregressive-moving average with exogenous terms) models. Stoffer and Shumway also develop the recursive smoother, which gives estimators
of the state variable xt given observations prior to and after time t, and develop
state space models that include exogenous inputs in the state equation, observation equation, or both. State-space models can be written in different ways. The
structure of the model used in this chapter includes exogenous variables in both
equations and its derivation is detailed in the next section.

4.4. LDS Model for Gene Expression

Fluorescent intensities are measures of gene expression levels. Values of
some of these variables influence the values of others through the regulatory
proteins they express, including the possibility that the expression of a gene at
one time point may, in various circumstances, influence the expression of that
same gene at a later time point.
To model the influence of the expression of one gene at a previous time point on another gene and on its associated hidden variables, we modify the structure of the LDS model to include inputs, as follows.
We let the observations y_t = g_t^{(i)}, the expression level of gene i at time point t, and the inputs h_t = g_t and u_t = g_{t-1}, to give the model shown in Figure 4.

Figure 4. Bayesian network representation of the model for gene expression.

This model is described by the following equations:

x_{t+1} = A x_t + B g_t + w_t    (35)

g_t = C x_t + D g_{t-1} + v_t    (36)

Model Assumptions

The vector u_t ∈ R^{p_u} is the exogenous input observation vector, and h_t ∈ R^{p_h} represents the exogenous influence on the hidden states. As before, the state and observation vectors x_t and y_t have dimensions K and p, respectively.
A is the state transition matrix,
B is the input to state matrix in the state transition equation,
C is the state to observation matrix and
D is the input to observation matrix.
The state and observation noise vectors, wt and vt respectively, are random
vectors serially independent and identically distributed, and also independent
of the initial values of x and y and independent of one another.

Remarks
These system matrices A, B, C, D are taken to be constant in this work, but they may also vary over time, in which case it is appropriate to add a subscript indicating this.
When the sequence {x_1, w_1, ..., w_T} is independent, the distribution of x_{t+1}|x_t, ..., x_1 is the same as the distribution of x_{t+1}|x_t; hence the state vector x_t evolves with a first-order Markov property, with A as the transition matrix.
The noise vectors can also be viewed as hidden variables. Here the matrix
D in the observation equation captures gene-gene expression level influences at
consecutive time points whilst the matrix C captures the influence of the hidden
variables on gene expression level at each time point. Matrix B models the
influence of gene expression values from previous time points on the hidden
states, and A is the state transition matrix. However, our interest focuses on CB + D, which captures not only the direct gene-to-gene interactions but also the gene-to-gene interactions mediated by the hidden states over time. This is the matrix on which we will concentrate the analysis, since it captures all of the information related to gene-gene interaction over time.
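To see why CB + D collects this information, note that substituting the state equation (35) into the observation equation (36) one step ahead gives g_{t+1} = CA x_t + (CB + D) g_t + C w_t + v_{t+1}, so CB + D is the total one-step gene-to-gene influence (direct plus via the hidden states). The sketch below checks this algebra numerically; all matrices are random placeholders rather than estimates from data.

```python
import numpy as np

rng = np.random.default_rng(1)
K, n_genes = 3, 5                       # hidden dimension and number of genes (placeholders)

A = 0.5 * rng.normal(size=(K, K))       # state transition
B = rng.normal(size=(K, n_genes))       # gene expression -> hidden states
C = rng.normal(size=(n_genes, K))       # hidden states -> gene expression
D = rng.normal(size=(n_genes, n_genes)) # direct gene -> gene influence

interaction = C @ B + D                 # total one-step gene-to-gene influence

x_t = rng.normal(size=K)
g_t = rng.normal(size=n_genes)
x_next = A @ x_t + B @ g_t              # noise-free state update, eq. (35)
g_next = C @ x_next + D @ g_t           # noise-free observation update, eq. (36)
assert np.allclose(g_next, C @ A @ x_t + interaction @ g_t)
```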

5. Constrained LDS

Mathematically speaking, the idea of adding constraints to the model is basically to reduce the number of parameters to estimate. Narrowing down the
range of parameters to estimate by adding extra constraints reduces dimensionality, which can considerably simplify the search for the parameters that best describe the model. At all times during modeling with constraints, diagnostics should be performed to make sure the model still fits well after the constraints are taken into account. How precisely to include these forms of information in the inference process is not a straightforward task; however, this is the true art of modeling.
From the biological point of view, the current application to gene expression
data is already complex. Data generation, low-level analyses and classification
are known to be crucial in getting gene expression levels. Different algorithms
can lead to different sets of genes. Hence, biological mining should be present in
any machine learning approach. In this sense, any knowledge about gene behavior and regulatory interactions is helpful. Now, if this additional information can be included and modeled, estimation becomes more realistic, not only because of the reduction of parameters but also because of a more biology-based approach.
Given either a-priori or new hypothesized information leading to a set of
plausible models, the LDS model is re-trained based on this knowledge about
the parameters. The a-priori information would be supplied by past experiments
or biological knowledge, while the new hypothesized information is obtained
from the bootstrap analysis.

5.1. Model Definition

Two competing motivations must be kept in mind when defining a model: fidelity and tractability. The model's fidelity describes how closely it corresponds to reality. On the other hand, the model's tractability focuses on the ease
with which it can be mathematically described as well as analyzed and validated
statistically based on observations and measurements. It is understandable that
increasing one (either fidelity or tractability) is usually done at the expense of
the other. Consequently, the ideal model should be developed in close cooperation between the science governing the application and feasible mathematical
and statistical methods. One common assumption that aids tractability is that
model errors are normally (or Gaussian) distributed. Indeed, a large number
of existing algorithms and methods of statistical inference are based on jointly
Gaussian observables. Though rarely satisfied exactly in practice, this assumption is often justified because it makes the analysis of the model tractable and
the resulting statistical inferences are robust in the sense of being insensitive to


small departures from normality. The model used in this work adopts the Gaussian assumption only insofar as it makes the analysis of the models more straightforward and tractable. However, for statistical inference and validation of the model, no essential use of the Gaussian assumption is made. Instead, more general methods such as bootstrapping are employed.

5.2. Structural Specification

We will concentrate here on incorporating a-priori information, and for this,
the emphasis is on constraining elements in the matrix D. The reason for this is
simple: D describes the direct gene-to-gene interactions over time, and therefore seems the most suitable place to incorporate a-priori information. Recall that the gene regulatory network is constructed from the estimate of CB + D, and thus also incorporates the influence of hidden variables (e.g., the influence of missing genes/proteins, etc.). Thus, the hypothesized form of this DAG entails that some elements of the matrix CB + D are zero. The idea now
would be to impose those constraints on CB + D and re-estimate the model
structural parameters under these constraints and verify that the model still fits
the data well. Imposing constraints reduces the dimensionality of the unknown-parameter space, and thus creates a new estimation problem (one for which the
remaining unconstrained parameters can be estimated more precisely). Because
of this, solving this new estimation problem (and performing diagnostics) could
expose shortcomings in how well the constrained model describes the data, or
could expose other parts of the model structure that were obscured because of
the larger number of parameters to estimate in the unconstrained model.

5.3. Estimation

With the structural specification known, the objective is to estimate, in a
least-squares sense, the unknown or unobserved state variables from the available observations. The so-called Kalman filter solves this problem, and variations of the filter give interpolation, extrapolation, and smoothing estimators
of the state variables (see the book by Aoki [5], for example). The resulting
estimators are optimal in the sense of least-squares, given that one is restricted
to consideration of estimators that are linear functions of the observables. Their
derivation can be accomplished in generality by casting the problem in the context of approximation in a Hilbert space of random variables possessing finite


second order moments. This reduces the problem to one of computing projections onto the subspaces spanned by the observables, but the derivations and
machinery of that theoretical approach are tedious. However, in the special
case when the states and observables are jointly Gaussian, the least squares estimators of state are given by conditional expectations (conditioned on the observables) which are in turn linear functions of the observables. Moreover, the
conditional expectation operator has all the essential properties of the subspace
projection operator in the Hilbert space context. As a consequence, the shorter
and more elegant analysis of the problem in the Gaussian context leads to exactly the same estimators of the state variables as the more general Hilbert space
context. Thus, in terms of formulating the state estimators, there is no loss of
generality in assuming Gaussian joint distributions.
Regarding the estimation of the structural parameters, in the absence of assumptions regarding the joint distributions of the state variables and observables
or any other pertinent information, a weighted least-squares approach would be
reasonable and justified. If the assumption is made that the state variables and
observables are jointly Gaussian, then the method of maximum likelihood leads
to parameter estimators that are essentially equivalent to those yielded by the
weighted least-squares approach. Thus, again there is no loss of generality in
making the Gaussian assumption for constructing estimators of structural parameters.
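As an illustration of the state-estimation step referred to above, here is a minimal sketch of the standard Kalman filter recursions for the basic model (33)-(34), assuming the system matrices are known (in practice they are themselves estimated, e.g. within the EM procedure described later); it is not the exact filter-with-inputs used in this work.

```python
import numpy as np

def kalman_filter(y, A, C, Q, R, x0, P0):
    """Standard Kalman filter recursions for the basic model (33)-(34).

    y: (T, p) array of observations; returns filtered state means and covariances.
    A, C, Q, R, x0, P0 are assumed known here.
    """
    T, _ = y.shape
    K = A.shape[0]
    x_filt = np.zeros((T, K))
    P_filt = np.zeros((T, K, K))
    x_pred, P_pred = x0, P0
    for t in range(T):
        # Update step: incorporate the observation y[t]
        S = C @ P_pred @ C.T + R                    # innovation covariance
        K_gain = P_pred @ C.T @ np.linalg.inv(S)    # Kalman gain
        x_filt[t] = x_pred + K_gain @ (y[t] - C @ x_pred)
        P_filt[t] = P_pred - K_gain @ C @ P_pred
        # Prediction step: propagate through the state equation
        x_pred = A @ x_filt[t]
        P_pred = A @ P_filt[t] @ A.T + Q
    return x_filt, P_filt
```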

5.4. Derivation

To model the effects of the influence of the expression of one gene at a previous time point on another gene and its associated hidden variables, we consider the state-space model

x_{t+1} = A x_t + B y_t + w_t    (37)

y_t = C x_t + D y_{t-1} + v_t    (38)

The column vector x is the state vector of hidden variables for the system,
u is the input observation vector, C is the state to observation matrix which
captures the influence of the hidden variables on gene expression level at each
time point.
The matrix D describes the gene-to-gene interaction at consecutive time
points. From this matrix we obtain the Bayesian network representation of the


causal relationship between the genes. After the model parameters are estimated using the EM algorithm, with the Kalman filter and smoother in the E-step, we proceed to analyze the matrix D. The values in this matrix determine the conditional probabilities of the relationships between genes. In order to test the robustness of the model, a bootstrap experiment is performed by randomly resampling the data. Using 300 bootstrap samples of the data, we average the values of the D_i's and threshold them, identifying entries that are not significantly different from zero. Those entries are set to zero, leading to a new D matrix.
The model with this filtered matrix D is put back into the LDS to be trained again and find better estimates of the parameters. The gain is that this time the number of parameters to be estimated has been reduced.
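A minimal sketch of the bootstrap filtering step just described; `fit_D` is a hypothetical stand-in for whatever routine re-estimates D on a resampled data set (e.g. the EM procedure), and the 300 replicates and the percentile-based significance rule are illustrative choices.

```python
import numpy as np

def bootstrap_filter_D(data, fit_D, n_boot=300, alpha=0.05, rng=None):
    """Average bootstrap estimates of D and zero out entries whose bootstrap
    distribution is not significantly different from zero.

    `fit_D(resampled_data)` is a hypothetical user-supplied function that
    re-estimates the matrix D on a resampled data set.
    """
    rng = rng or np.random.default_rng()
    n = len(data)
    boots = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)           # resample replicates with replacement
        boots.append(fit_D([data[i] for i in idx]))
    boots = np.stack(boots)                        # shape: (n_boot, p, p)
    lo = np.percentile(boots, 100 * alpha / 2, axis=0)
    hi = np.percentile(boots, 100 * (1 - alpha / 2), axis=0)
    keep = (lo > 0) | (hi < 0)                     # bootstrap interval excludes zero -> keep
    return np.where(keep, boots.mean(axis=0), 0.0)
```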
The matrix D can then be constrained to have elements equal to
zero. To do so, one approach would be to constrain the elements of D under
the restriction DF = G as is suggested in [59], with F and G known and
in this particular case specified in such a way that we can zero out some elements in D. Under this restriction the constrained estimators for C, D and R
are determined by a constrained minimization problem applying the technique
of Lagrange multipliers.
DIGRESSION:
Recall that in the state transition equation (37), A is the transition state matrix
and B is the input to state matrix. The state and observation noise vectors, wt
and vt respectively are random variables assumed to be Gaussian distributed,
mutually independent and independent of the initial values of x and y. Since
the constraints will be applied to the observation equation we are interested in
the terms involving C, D and R so we are able to obtain a reduced likelihood
function:
-2L(C, D, R) ∝ NT log|R| + Σ_{j=1}^{N} Σ_{t=1}^{T} tr( R^{-1} E[ (y_t^{(j)} - C x_t^{(j)} - D u_t^{(j)}) (y_t^{(j)} - C x_t^{(j)} - D u_t^{(j)})' | y_1, ..., y_T ] )    (39)

where tr denotes the trace. To facilitate the algebraic manipulation and make the process clearer, this expression can be rewritten as

-2L(C, D, R) = NT log|R| + tr( R^{-1} ( S_yy - S_yx C' - S_yu D' - C S_yx' + C P C' + C S_xu D' - D S_yu' + D S_xu' C' + D S_uu D' ) )

where

S_yy = Σ_{j=1}^{N} Σ_{t=1}^{T} y_t^{(j)} y_t^{(j)'} ,    S_yx = Σ_{j=1}^{N} Σ_{t=1}^{T} y_t^{(j)} x̂_t^{(j)'} ,    S_yu = Σ_{j=1}^{N} Σ_{t=1}^{T} y_t^{(j)} u_t^{(j)'} ,

S_xu = Σ_{j=1}^{N} Σ_{t=1}^{T} x̂_t^{(j)} u_t^{(j)'} ,    S_uu = Σ_{j=1}^{N} Σ_{t=1}^{T} u_t^{(j)} u_t^{(j)'} ,    P = Σ_{j=1}^{N} Σ_{t=1}^{T} E[ x_t x_t' | y_1, ..., y_T ] ,

with x̂_t = E[ x_t | y_1, ..., y_T ] the smoothed state estimate.

Taking partial derivatives of (39) and setting them equal to zero, we solve for C, D and R. In other words, we find the unconstrained estimators that minimize the likelihood function (39):

D = ( S_yu - S_yx P^{-1} S_xu ) ( S_uu - S_xu' P^{-1} S_xu )^{-1}    (40)

C = ( S_yx - D S_xu' ) P^{-1}    (41)

R = (1/NT) ( S_yy - C S_yx' - D S_yu' )    (42)
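Assuming the sufficient statistics S_yy, S_yx, S_yu, S_xu, S_uu and P have been accumulated in the E-step, the unconstrained M-step estimators (40)-(42) can be computed directly; the sketch below is one possible numpy implementation (using linear solves rather than explicit inverses).

```python
import numpy as np

def unconstrained_m_step(Syy, Syx, Syu, Sxu, Suu, P, NT):
    """Unconstrained estimators (40)-(42) from the E-step sufficient statistics."""
    P_inv_Sxu = np.linalg.solve(P, Sxu)                # P^{-1} S_xu
    D = np.linalg.solve((Suu - Sxu.T @ P_inv_Sxu).T,
                        (Syu - Syx @ P_inv_Sxu).T).T   # eq. (40)
    C = np.linalg.solve(P.T, (Syx - D @ Sxu.T).T).T    # eq. (41): (Syx - D Sxu') P^{-1}
    R = (Syy - C @ Syx.T - D @ Syu.T) / NT             # eq. (42)
    return C, D, R
```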

To obtain the constrained estimators (D_cons, C_cons, and R_cons) we need to solve the following

Constrained Minimization Problem

Minimize

-2L(C, D, R) = NT log|R| + tr( R^{-1} ( S_yy - S_yx C' - S_yu D' - C S_yx' + C P C' + C S_xu D' - D S_yu' + D S_xu' C' + D S_uu D' ) )

subject to the constraint DF - G = 0


Solution: We introduce the Lagrange multipliers method to minimize the likelihood function (39) subject to the constraint DF - G = 0. Let us define the real-valued column vector of Lagrange multipliers λ = (λ_1, λ_2, ..., λ_n)'. The likelihood function and the constraints associated with it define our objective function as:
M(C, D, R) = NT log|R| + tr( R^{-1} ( S_yy - S_yx C' - S_yu D' - C S_yx' + C P C' + C S_xu D' - D S_yu' + D S_xu' C' + D S_uu D' ) ) + tr[ λ' (DF - G) ]    (43)

Necessary conditions for a minimum of M(C, D, R) are that the elements in C, D, R, and λ be chosen to give

∂M/∂C = 0 ,    ∂M/∂D = 0 ,    and    ∂M/∂λ = constraints = 0

The third expression implies that a minimum for M is also a minimum for the
likelihood function (39).
∂M/∂C ∝ ∂/∂C tr( R^{-1} ( -S_yx C' - C S_yx' + C P C' + C S_xu D' + D S_xu' C' ) ) = 2 R^{-1} ( C_cons P + D_cons S_xu' - S_yx ) = 0    (44)

∂M/∂D ∝ ∂/∂D tr( R^{-1} ( -S_yu D' + C S_xu D' - D S_yu' + D S_xu' C' + D S_uu D' ) ) + λ F' = 2 R^{-1} ( C_cons S_xu + D_cons S_uu - S_yu ) + λ F' = 0    (45)

D_cons F - G = 0    (46)

From (44) and (45) we get the constrained estimators for C and D:

C_cons = ( S_yx - D_cons S_xu' ) P^{-1}    (47)

D_cons = ( S_yu - C_cons S_xu - (1/2) R_cons λ F' ) S_uu^{-1}

Using the expressions (40) and (41) for the unconstrained estimators we get the constrained D matrix

D_cons = D - (1/2) R_cons λ F' ( S_uu - S_xu' P^{-1} S_xu )^{-1}

Substituting this back into (46) and solving for λ gives:

(1/2) R_cons λ = ( DF - G ) ( F' ( S_uu - S_xu' P^{-1} S_xu )^{-1} F )^{-1}

Putting the expression above back into the equation for D_cons and solving, we finally obtain the constrained estimators for C and D in terms of the unconstrained ones:

D_cons = D - ( DF - G ) ( F' ( S_uu - S_xu' P^{-1} S_xu )^{-1} F )^{-1} F' ( S_uu - S_xu' P^{-1} S_xu )^{-1}

C_cons = C + ( DF - G ) ( F' ( S_uu - S_xu' P^{-1} S_xu )^{-1} F )^{-1} F' ( S_uu - S_xu' P^{-1} S_xu )^{-1} S_xu' P^{-1}

Similarly, the constrained covariance matrix R_cons is obtained by differentiating with respect to R and solving:

∂M/∂R ∝ NT R_cons - ( S_yy - S_yx C_cons' - S_yu D_cons' - C_cons S_yx' + C_cons P C_cons' + C_cons S_xu D_cons' - D_cons S_yu' + D_cons S_xu' C_cons' + D_cons S_uu D_cons' )    (48)

which leads to

R_cons = R + (1/NT) ( -S_yu + C_cons S_xu + D_cons S_uu ) D_cons'
       = R - (1/(2NT)) R_cons λ F' D_cons'    (49)
       = R - (1/NT) ( DF - G ) ( F' ( S_uu - S_xu' P^{-1} S_xu )^{-1} F )^{-1} G'    (50)
Unfortunately, this constraint formulation cannot be implemented directly in the model used for this research. The selection of the matrices F and G that zero out some elements in D becomes difficult as the size of the matrix increases. However, by re-writing the constrained problem using the vec operator we can easily handle any matrix size.

5.5. Vec Formulation

The vec operator vectorizes a matrix by piling up its columns. That is, suppose we want to vectorize a 2x2 matrix M:

M = [ m_11  m_12 ; m_21  m_22 ] ,    vec(M) = ( m_11, m_21, m_12, m_22 )'
The Kronecker product of two matrices plays an important role when using the
vec operator. There are important relationships that will be used in the development of the constrained minimization problem in vec formulation.
Definition: The Kronecker product of two matrices, A and B, where A is m x n and B is p x q, is defined as

A ⊗ B = [ A_11 B   A_12 B   ...   A_1n B ;
          A_21 B   A_22 B   ...   A_2n B ;
          ...                            ;
          A_m1 B   A_m2 B   ...   A_mn B ]

which is an mp x nq matrix.

Important Operator Relationships

vec(AXB) = ( B' ⊗ A ) vec(X)    (51)

( AC ⊗ BD ) = ( A ⊗ B )( C ⊗ D )    (52)

( A ⊗ B )^{-1} = A^{-1} ⊗ B^{-1}    (53)

d( x' A x ) / dx = x' ( A + A' )    (54)
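The vec/Kronecker identities above (as reconstructed here) are easy to check numerically; the following numpy lines verify (51)-(53) on random matrices, with column-major flattening playing the role of the column-stacking vec operator.

```python
import numpy as np

rng = np.random.default_rng(2)
vec = lambda M: M.flatten(order="F")     # stack columns, as in the definition above

# Identity (51): vec(A X B) = (B' kron A) vec(X)
A, X, B = rng.normal(size=(3, 4)), rng.normal(size=(4, 5)), rng.normal(size=(5, 2))
assert np.allclose(vec(A @ X @ B), np.kron(B.T, A) @ vec(X))

# Identity (52): (AC kron BD) = (A kron B)(C kron D)
C2, B2, D2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 5)), rng.normal(size=(5, 6))
assert np.allclose(np.kron(A @ C2, B2 @ D2), np.kron(A, B2) @ np.kron(C2, D2))

# Identity (53): (A kron B)^{-1} = A^{-1} kron B^{-1}, for square invertible A and B
As, Bs = rng.normal(size=(3, 3)), rng.normal(size=(2, 2))
assert np.allclose(np.linalg.inv(np.kron(As, Bs)),
                   np.kron(np.linalg.inv(As), np.linalg.inv(Bs)))
```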

To show the application of the vec operator in the constraint setting, let us look at the following example.
EXAMPLE:
Let us consider a 2x2 matrix D and suppose we want to constrain it to be diagonal. Select the matrices F and G to be

D = [ d_11  d_12 ; d_21  d_22 ] ,    F = [ 0 1 0 0 ; 0 0 1 0 ] ,    G = [ 0 ; 0 ]

Then, applying the constraint F vec(D) = G, we get that the elements d_21 and d_12 are zero and the matrix D becomes:

D = [ d_11  0 ; 0  d_22 ]
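As a concrete check of the example above, the following sketch builds a selection matrix F (and the corresponding zero vector G) for an arbitrary set of entries of a p x q matrix D, and verifies that F vec(D) = G pins exactly those entries; the helper function and its name are illustrative, not part of the original formulation.

```python
import numpy as np

def zero_constraints(p, q, zero_positions):
    """Build F and G such that F vec(D) = G forces D[i, j] = 0 for the given
    (i, j) positions, with vec stacking the columns of the p x q matrix D."""
    F = np.zeros((len(zero_positions), p * q))
    for row, (i, j) in enumerate(zero_positions):
        F[row, j * p + i] = 1.0              # index of D[i, j] inside vec(D)
    G = np.zeros(len(zero_positions))
    return F, G

# The 2x2 example above: constrain d21 and d12 (0-based indices (1, 0) and (0, 1))
F, G = zero_constraints(2, 2, [(1, 0), (0, 1)])
D = np.array([[1.0, 2.0], [3.0, 4.0]])
print(F)                                     # reproduces the F shown in the example
print(F @ D.flatten(order="F") - G)          # the entries the constraint forces to zero
```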

In general, for any n x n matrix D we can find matrices F and G and solve the
constrained minimization problem using vec formulation as follows:
Constrained Minimization Problem 2

Minimize

-2L(C, D, R) = NT log|R| + tr( R^{-1} ( S_yy - S_yx C' - S_yu D' - C S_yx' + C P C' + C S_xu D' - D S_yu' + D S_xu' C' + D S_uu D' ) )

subject to the constraint F vec(D) - G = 0


Solution: We introduce the Lagrange multipliers method to minimize the objective function

M(C, D, R) = NT log|R| + tr( R^{-1} ( S_yy - S_yx C' - S_yu D' - C S_yx' + C P C' + C S_xu D' - D S_yu' + D S_xu' C' + D S_uu D' ) ) + λ' ( F vec(D) - G )    (55)

subject to the constraint F vec(D) - G = 0.


∂M/∂C ∝ ∂/∂C tr( R^{-1} ( -S_yx C' - C S_yx' + C P C' + C S_xu D' + D S_xu' C' ) ) = 2 R^{-1} ( C_cons P + D_cons S_xu' - S_yx ) = 0    (56)

∂M/∂vec(D) = -2 vec( R_cons^{-1} S_yu ) + 2 vec( R_cons^{-1} C_cons S_xu ) + 2 vec( R_cons^{-1} D_cons S_uu ) + vec( λ' F ) = 0    (57)

∂M/∂λ = F vec(D_cons) - G = 0    (58)

∂M/∂R ∝ NT R_cons - ( S_yy - S_yx C_cons' - S_yu D_cons' - C_cons S_yx' + C_cons P C_cons' + C_cons S_xu D_cons' - D_cons S_yu' + D_cons S_xu' C_cons' + D_cons S_uu D_cons' ) = 0    (59)

From (57) and the following expressions

vec( R_cons^{-1} D_cons S_uu ) = ( S_uu ⊗ R_cons^{-1} ) vec(D_cons)

vec( R_cons^{-1} C_cons S_xu ) = ( S_xu' ⊗ R_cons^{-1} ) vec(C_cons)

vec( λ' F ) = F' λ

vec(C_cons) = vec( S_yx P^{-1} ) - ( P^{-1} S_xu ⊗ I ) vec(D_cons)

we have that

vec(D_cons) = vec(D) - (1/2) ( ( S_uu - S_xu' P^{-1} S_xu )^{-1} ⊗ R_cons ) F' λ

We still need to work out the value for λ. Hence, substituting (57) into (58) and solving for λ gives:

λ = 2 ( F ( ( S_uu - S_xu' P^{-1} S_xu )^{-1} ⊗ R_cons ) F' )^{-1} ( F vec(D) - G )    (60)

Now, putting this expression for λ back into the equation for vec(D_cons) above, we obtain

vec(D_cons) = vec(D) - V^{-1} F' [ F V^{-1} F' ]^{-1} ( F vec(D) - G )    (61)

where V^{-1} = ( S_uu - S_xu' P^{-1} S_xu )^{-1} ⊗ R_cons .

Finally, from (59) we obtain the expression for R_cons implicitly, in the form R_cons = R + f(R_cons), for which we will need to iterate and reshape the matrix D_cons at each iteration:

R_cons = R - (1/(2NT)) R_cons Λ D_cons' ,    where vec(Λ) = F' λ    (62)

5.6. Constraints Implementation - EM Procedure

In order to apply the EM algorithm, we require initial values of the state and covariance as well as the parameters, which are initialized using linear regression. Then the EM procedure operates as follows:

E-step
Given the initial estimators x_0, P_0 and initial estimators of A, B, C, D, Q and R, use the Kalman filter equations to compute the estimates for x̂_t^+ and P_t.

M-step
Re-estimate the unconstrained A, B, C, D, Q, and R using the values for x̂_t^+ and P_t in the formulas for a, b, c, d, e, and P. (*Here is where we add the constraints F vec(D) - G = 0.*)
ALGORITHM:

1. Start with the unconstrained estimates of C, D, and R, equations (40)-(42).

2. The vec expressions for the constrained C_cons and D_cons are in fact functions of R_cons, which is in turn a function of the unconstrained R and the previous R_cons, and has to be calculated by iteration. That is,
vec(D_cons) = vec(D) - V^{-1} F' [ F V^{-1} F' ]^{-1} ( F vec(D) - G )    (63)

vec(C_cons) = vec(C) + ( P^{-1} S_xu ⊗ I ) V^{-1} F' [ F V^{-1} F' ]^{-1} ( F vec(D) - G )    (64)

where V(R_cons(r)) is as defined after (61), and

R_cons = R + f(R_cons), with R_cons(0) = R and R_cons(r+1) = R + f(R_cons(r)), for r = 0, 1, 2, ... until ||Rc(r+1) - Rc(r)|| < tol.

Hence,
Rc = Rc(r + 1),
Cc = Cc(Rc(r + 1)), and
Dc = Dc(Rc(r + 1))

3. Now, in the iteration process,

Rc(r+1) = (1/NT) [ a - Cc(Rc(r)) b' - Dc(Rc(r)) c' + ( Cc(Rc(r)) d + Dc(Rc(r)) e - c ) Dc(Rc(r+1))' ]


So, for each iteration r we need to reshape vec(Dc) and vec(Cc), put them back into matrix form, and compute a new Rc(r+1). Continue this until convergence, and once we have the final Rc, put it back one more time to find vec(Dc) and vec(Cc) and reshape them.

4. Then Dc and Cc are the matrices that go back to the E-step to be used (along with the other parameters) to find an updated and more accurate estimate of x̂_t^+ and P_t.
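The core of the constrained M-step is the projection of the unconstrained estimate of D onto the subspace F vec(D) = G, together with the corresponding adjustment of C, as in (63)-(64) reconstructed above. The sketch below implements that projection for a given current value of R_cons (for instance R on the first pass); the full fixed-point iteration over R_cons described in steps 2-3 is omitted for brevity.

```python
import numpy as np

def constrained_CD(C, D, R_cons, Sxu, Suu, P, F, G):
    """Project the unconstrained D (and adjust C accordingly) so that
    F vec(D_cons) = G holds, following (61)-(64) as reconstructed above.
    R_cons is the current constrained covariance estimate."""
    p, q = D.shape
    vec = lambda M: M.flatten(order="F")
    unvec = lambda v: v.reshape((p, q), order="F")

    W = Suu - Sxu.T @ np.linalg.solve(P, Sxu)            # S_uu - S_xu' P^{-1} S_xu
    V_inv = np.kron(np.linalg.inv(W), R_cons)            # V^{-1} = W^{-1} (kron) R_cons

    resid = F @ vec(D) - G
    correction = V_inv @ F.T @ np.linalg.solve(F @ V_inv @ F.T, resid)
    D_cons = unvec(vec(D) - correction)                  # eq. (63); F vec(D_cons) = G
    C_cons = C + (D - D_cons) @ Sxu.T @ np.linalg.inv(P)  # from C = (Syx - D Sxu') P^{-1}
    return C_cons, D_cons
```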

6. Conclusions

Information theory as such is concerned with the quantification, analysis
and forecasting of information processing in systems under incomplete and/or
noisy data acquisition. As we discussed in this chapter, the problem of the inference and analysis of gene regulatory networks from experimental data on
gene expression at a genome-wide scale is closely related to the foundational tenets of information theory. In fact, given the current biological understanding of gene regulation as an extremely complex signal processing phenomenon, information theoretical tools and concepts are a natural choice for the task
of inference/analysis of such GRNs. We presented several instances in which
information theory, either on its own or combined with probabilistic graphical models, Bayesian statistics and machine-learning techniques, has been used in the inference and assessment of GRNs.
Purely information theoretical approaches are based on complex graph renderings (i.e., both cyclic and acyclic probabilistic models are allowed) and are able to describe the system using either continuous or discrete probability density functions. Incomplete or noisy data are dealt with by quantifying interactions, usually valued by means of statistical dependence measures such as mutual information and Kullback-Leibler divergences, in either a marginal or a conditional setting. The use of minimum description length as a measure of algorithmic complexity, of the data processing inequality to discriminate between direct and indirect interactions, and of Shannon's signal processing theorems to establish thresholds or bounds of confidence, is usually supplemented with optimization based on maximum entropy (MaxEnt) techniques.
On the other hand, Bayesian/machine-learning implementations of information theoretical models are usually based on directed acyclic graphs (DAGs); these also allow either discrete or continuous probability distribution functions. Bayesian methods deal with incomplete or noisy data by means of linear models with error/noise terms, known as linear dynamical systems (LDS). Error prediction/correction is modeled by means of hidden variables, often treated as constraints via Lagrange multiplier formalisms. By definition, Bayesian net-
work models are based on conditional probability and expectation, a fact that
made possible, in some instances, the inference of directed causal networks.
Incompleteness of the data is usually overcome by the use of the Expectation-Maximization (E-M) algorithm and other machine learning techniques. E-M is also used to establish bounds of confidence, either on its own or supplemented by bootstrapping and cross-validation, while optimization is based on maximum likelihood estimates by means of objective-function-oriented constraint minimization.
Each and every one of the current implementations mentioned here possesses its particular set of achievements and shortcomings. The current state of the art points to a combination of methods, either as a means of assessment or in the form of hybrid methods, as the best option to tackle these incredibly complex, yet highly interesting and important, problems. It is our hope that the ideas considered here will stimulate further development in the area of information theoretical / machine learning-based computational biology.

References
[1] Albert, R. and Barabási, A.-L.; Statistical mechanics of complex networks,
Reviews of Modern Physics 74, 47 (2002)
[2] Albert, R., Scale-free networks in cell biology, Journal of Cell Science
118, 4947-4957 (2005)
[3] Andrecut, M., Kauffman S.A., Mean-field model of genetic regulatory networks, New Journal of Physics 8, 148, (2006)
[4] Andrecut, M., Kauffman, S.A., A simple method for reverse engineering causal networks, Journal of Physics A Mathematical and General 39,
L647-L655, (2006)
[5] Aoki, M., State Space Modeling of Time Series, Springer-Verlag, (1987)
[6] Bansal, M., Belcastro, V., Ambesi-Impiombato, A., di Bernardo, D.; How
to infer gene networks from expression profiles, Molecular Systems Biology 3:78 (2007)
[7] Brockwell, P.J., Davis, R.A., Introduction to Time Series and Forecasting,
Springer-Verlag second edition, New York, (2002)


[8] Butte, A.J., Kohane, I.S.,Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements, Pacific
Symposium on Biocomputing 418-429, (2000)
[9] Cercignani, C., Illner R., Pulvirenti, M. ; The Mathematical Theory of
Dilute Gases, Applied Mathematical Sciences 106, Springer-Verlag (1994)
[10] Chaitankar, V., Ghosh, P., Perkins, E.J., Gong, P., Deng, Y., Zhang, C.,
A novel gene network inference algorithm using predictive minimum description length approach, BMC Systems Biology 4(Suppl 1):S7, (2010)
[11] Chung, F. R. K., Spectral Graph Theory, Amer. Math. Soc., Providence,
R.I., (1997)
[12] Cover T. M., Thomas J.A., Elements of Information Theory, New York:
John Wiley & Sons; (1991)
[13] de Jong, H., Modelling and simulation of genetic regulatory systems: a
literature review, J. Comp. Biol., 9, 1, 67-103 (2002)
[14] Deng, M.H. et al. Integrated probabilistic model for functional prediction
of proteins, J. Comput. Biol. 11, 463-476, (2004)
[15] Deng, M.H. et al., Prediction of protein function using protein-protein interaction data, In The First IEEE Computer Society Bioinformatics Conference, CSB2002, pp. 117-126, (2002)
[16] Dijkstra, E., A note on two problems in connection with graphs, Numerische Math 1, 269-271, (1959)
[17] Ding, C., Peng, H., Minimum redundancy feature selection from microarray gene expression data, Journal of Bioinformatics and Computational
Biology 3, 2, 185-205, (2005)
[18] Dong, J. and Horvath, S., Understanding network concepts in modules,
BMC Systems Biology, 1, 24, (2007). See especially table 1.
[19] Dougherty,J., Tabus, I., Astola, J., Inference of Gene Regulatory Networks
Based on a Universal Minimum Description Length, EURASIP Journal on
Bioinformatics and Systems Biology Volume 2008, Article ID 482090, 11
pages, doi:10.1155/2008/482090, (2008)


[20] Emmert-Streib, F., Dehmer, M., Information processing in the transcriptional regulatory network of yeast: Functional robustness, BMC Systems
Biology, 3, 35, (2009) doi:10.1186/1752-0509-3-35
[21] Faith, J., Hayete, B., Thaden, J., Mogno, I, Wierzboski, J., Cottarel, G.,
Kasif, S., Collins, J., Garner, T., Large scale mapping and validation of escherichia coli transcriptional regulation from a compendium of expression
profiles, PLoS Biology 5, xii, (2007)
[22] Fleuret, F., Fast binary feature selection with conditional mutual information, Journal of Machine Learning Research 5, 1531-1555, (2004)
[23] Friedman, N., Linial, M., Nachman, I., Peer, D., Using Bayesian networks
to analyze expression data. J. Comput. Biol. 7, 601 (2000)
[24] Fruchterman, T. M. J., Reingold, E. M., Graph Drawing by Force-Directed
Placement. Software: Practice and Experience, 21(11), (1991)
[25] He, F., Balling, R., Zeng, A-P; Reverse engineering and verification of
gene networks: Principles, assumptions, and limitations of present methods and future perspectives, Journal of Biotechnology 144, 3, 190-203,
(2009)
[26] Hernandez-Lemus, E., Velazquez-Fernandez, D., Estrada-Gil, J.K., SilvaZolezzi, I., Herrera-Hernandez, M.F., Jimenez-Sanchez, G., Information
Theoretical Methods to Deconvolute Genetic Regulatory Networks applied to Thyroid Neoplasms, Physica A 388, 5057-5069, (2009)
[27] INFOTHEO http://cran.r-project.org/web/packages/infotheo/index.html
[28] Jaynes, E.T., Information Theory and Statistical Mechanics, Phys. Rev.,
106, 4, 620-639, (1957)
[29] Jurman, G., Visintainer, R., Furlanello, C., An introduction to spectral
distances in networks (extended version), paper presented at the Workshop on Networks Across Disciplines: Theory and Applications within
the 24th Annual Conference on Neural Information Processing Systems
(NIPS 2010), manuscript at http://arxiv.org/abs/1005.0103
[30] Kamada, T., Kawai, S., An Algorithm for Drawing General Undirected
Graphs, Information Processing Letters, 31:7-15, (1988)


[31] Kasza, J., Solomon, P., Kullback Leibler Divergence for Bayesian Networks with Complex Mean Structure, http://arxiv.org/abs/1009.1463
[32] Kindermann, Ross; Snell, J. Laurie, Markov Random Fields and Their
Applications. American Mathematical Society. ISBN 0-8218-5001-6.
MR0620955, (1980)
[33] Kullback, S. Leibler, R. A. On information and sufficiency, The Annals of
Mathematical Statistics, 22, 79-86, (1951)
[34] Lai, D., Lu, H., Lauria, M., Di Bernardo; D., Nardini, C., MANIA: A
Gene Network Reverse Algorithm for Compounds Mode-Of-Action and
Genes Interactions Inference, Advances in Complex Systems 13, 1, 83-94,
(2010)
[35] Letovsky,S. and Kasif,S., Predicting protein function from protein/protein
interaction data: a probabilistic approach, Bioinformatics 19 (Suppl. 1),
i197-i204, (2003)
[36] Li, H. and Zhan, M., Analysis of Gene Coexpression by B-spline Based
CoD Estimation, EURASIP Journal on Bioinformatics and Systems Biology, doi:10.1155/2007/49478, (2007)
[37] Madni, A.M., Andrecut, M., Design And Implementation Of A Gene Network Reverse Engineering Method Based On Mutual Information, Journal
of Integrated Design & Process Science 11, 3, 55-68, (2007)
[38] Margolin, A.A., Nemenman, I., Basso, K., Wiggins, C., Stolovitzky, G.,
Dalla Favera, R., Califano, A., ARACNe: An Algorithm for the Reconstruction of Gene Regulatory Networks in a Mammalian Cellular Context, BMC Bioinformatics, 7 (Suppl I):S7, (2006) doi:10.1186/1471-21057-S1-S7
[39] Margolin, A.A., Wang, K., Lim, W.K., Kustagi, M., Nemenman, I., Califano, A., Reverse engineering cellular networks, Nat Protoc., 1, 2, 662-71,
(2006)
[40] Margolin, A.A., Wang, K., Califano, A., Nemenman, I., Multivariate dependence and genetic networks inference, IET Syst. Biol. 4, 6, 428440,
(2010)


[41] Meyer, P. E., Kontos, K., Lafitte, F., Bontempi, G., Information-theoretic
inference of large transcriptional regulatory networks, EURASIP Journal on Bioinformatics and Systems Biology, Article ID 79879, (2007),
doi:10.1155/2007/79879 2007
[42] Meyer, P. E., Lafitte, F., Bontempi, G., minet: A R/Bioconductor Package
for Inferring Large Transcriptional Networks Using Mutual Information,
BMC Bioinformatics, 9, 461, (2008)
[43] MINET http://bioconductor.org/packages/2.6/bioc/vignettes/minet/inst/
doc/minet.pdf
[44] Mohapatra, A., Mishra, P.M., Padhy, S., Modeling Biological Signals using Information-Entropy with Kullback-Leibler-Divergence, IJCSNS International Journal of Computer Science and Network Security 9, 1, 147154, (2009)
[45] Morganella, S., Zoppoli, P., Ceccarelli, M., IRIS: a method for reverse
engineering of regulatory relations in gene networks, BMC Bioinformatics
10, 444, (2009); doi:10.1186/1471-2105-10-444
[46] Morrissey, E.R., Juarez, M.A., Denby, K.J., Burroughs, N.J., On reverse
engineering of gene interaction networks using time course data with repeated measurements, Bioinformatics 26, 18, 2305-12, (2010)
[47] Murphy, K., Mian, S. Modelling Gene Expression Data using Dynamic
Bayesian Networks. Technical Report, University of California: Berkeley,
(1999)
[48] Nemenman, I., Information theory, multivariate dependence, and genetic
network inference, eprint arXiv:q-bio/0406015
[49] Newman, M.E.J., A measure of betweenness centrality based on random
walks. arXiv cond-mat/0309045, (2003)
[50] Palacios, R., Goni, J., Martinez-Forero, I., Iranzo, J., Sepulcre, J.,
Melero, I., Villoslada, P., A Network Analysis of the Human TCell Activation Gene Network Identifies Jagged1 as a Therapeutic
Target for Autoimmune Diseases, PLoS ONE 2, 11, e1222, (2007)
doi:10.1371/journal.pone.0001222


[51] Peer, D., Bayesian network analysis of signaling networks: a primer, Science STKE 281: p14, (2005)
[52] Peng, H., Long F., Ding, C., Feature selection based on mutual information: criteria for max-dependency, max-relevance and min-redundancy,
IEEE Trans. Pattern Analysis and Machine Intelligence 27, 8, 1226-1238,
(2005)
[53] Qiu, P., Gentles, A.J., Plevritis, S.K., Reducing the Computational Complexity of Information Theoretic Approaches for Reconstructing Gene
Regulatory Networks, Journal of Computational Biology 17, 2, 1-8 (2010)
[54] [R] http://cran.r-project.org/
[55] Ravasz, E., Somera, A. L., Mongru, D. A., Oltvai, Z. N., Barabási, A.-L.,
Hierarchical Organization of Modularity in Metabolic Networks, Science,
297, 1551-1555, (2002)
[56] Segal,E. et al., Discovering Molecular Pathways from Protein Interaction
and Gene Expression Data, Bioinformatics 19 (Suppl. 1), 264-272, (2003)
[57] Sehgal, M.S.B., Gondal, I., Dooley, L., Coppel, R., Mok, G.K., Transcriptional Gene Regulatory Network Reconstruction Through Cross Platform
Gene Network Fusion, in Pattern Recognition in Bioinformatics, Lecture
Notes in Computer Science,4774/2007, 274-285, (2007) doi: 10.1007/9783-540-75286-8 27
[58] Shannon, C.E., Weaver, W., The Mathematical Theory of Communication,
The University of Illinois Press, Urbana, Illinois, (1949)
[59] Shumway, R.H., Stoffer, D.S., Time Series Analysis and Its Applications:
With R Examples, Springer Texts in Statistics, Third Edition, (2010)
[60] Steuer R, Kurths J, Daub CO, Weise J, Selbig J, The mutual information:
detecting and evaluating dependencies between variables, Bioinformatics
18 (Suppl 2), 231-240, (2002)
[61] van Kampen, N., Stochastic Processes in Physics and Chemistry, North
Holland, Elsevier, The Netherlands,(1997)
[62] van Someren, E.P., Wessels, L.F.A., Backer, E., Reinders, M.T.J., Genetic
Network Modelling, Pharmacogenomics, 3, 4, 507-525, (2002)


[63] Wei, Z. and Li, H., A Markov random field model for network-based analysis of genomic data, Bioinformatics 23, 12, 1537-1544, (2007)
[64] Wei, Z. and Li, H., A hidden spatial-temporal Markov random field model
for network-based analysis of time course gene expression data, Ann. Appl.
Stat. 2, 408-429, (2008)
[65] Yang AC, Hseu SS, Yien HW, Goldberger AL, Peng CK, Linguistic analysis of the human heartbeat using frequency and rank order statistics, Phys
Rev Lett 90: 108103, (2003)
[66] Zhao, W., Serpedin, E., Dougherty, E.R., Inferring gene regulatory networks from time series data using the minimum description length principle, Bioinformatics 22, 17, 2129-35, (2006)
[67] Zola, J.; Aluru, M.; Sarje, A.; Aluru, S., Parallel Information-TheoryBased Construction of Genome-Wide Gene Regulatory Networks, IEEE
Transactions on Parallel and Distributed Systems 21, 12, 1721-1733,
(2010)
