ISBN: 978-1-62100-325-0
© 2011 Nova Science Publishers, Inc.
Chapter 4
Abstract
One important problem in contemporary computational biology is that of reconstructing the best possible set of regulatory interactions between genes (a so-called gene regulatory network, GRN) from partial knowledge, as given for example by gene expression analysis experiments. Since only highly noisy data is available, this represents a challenge to common probabilistic modeling approaches. However, a variety of algorithms rooted in information theory and maximum entropy methods have been developed, and they have coped with the problem successfully (to a certain degree). Mutual information maximization, Markov random fields, use of the data processing inequality, minimum description length, Kullback-Leibler divergence and information-based similarity are some of these. Another approach to modeling gene regulatory networks combines information theory and machine learning techniques. Monte Carlo methods and variational methods can also be used to measure data information content. Hidden Markov models (HMM) or stochastic linear
1. Introduction
mer is a machine learning topic whose goal is to select, from amongst thousands of input variables, those that lead to the best predictive model. Feature selection methods applied to genomic data allow, for instance, improving molecular diagnosis and prognosis in complex diseases (such as cancer) by identifying a set (called a molecular signature) of features or variables that best represent the phenomenon. Network inference, in turn, consists in representing the (in general non-linear) set of statistical dependencies between variables on a set (that can be the whole input dataset or a feature-selected subset of it) by means of a graph. When applied to genomic expression data (e.g. from microarray experiments), network inference is able to reverse-engineer the transcriptional gene regulatory network (GRN) of the related cell. Knowledge of this GRN would allow, for instance, the discovery of new drug targets to cure diseases.
Information theory (IT) has proved to be a powerful theoretical foundation for developing algorithms and computational techniques that deal both with feature selection and with network inference problems applied to real data. There are, however, goals and challenges involved in the application of IT to genomic analysis. The applied algorithms should return intelligible models (i.e. they must be understandable), they must rely on little a priori knowledge, deal with thousands of variables and detect non-linear dependencies, all of this starting from tens (or at most a few hundred) of highly noisy samples. As we will show in this chapter, IT has provided approaches to deal with these problems. Some of these approaches are based on machine learning techniques, basically by modeling a target function connecting the variables of a system. Here, the output or target variable is the one to be predicted and the input variables are the predictors.
As a means to produce intelligible models we perform feature-selection procedures. The goal of these procedures is to select, among a set of variables, the inputs which lead to the best predictive model. In the vast majority of cases, feature selection is a preprocessing step prior to the actual machine learning stage. This is a somewhat critical part of the whole inference process. On the one hand, variable or feature elimination can lead to information losses. On the other, feature selection is a means to improve the accuracy of a model and its generalizability, as well as its intelligibility, while at the same time decreasing the computational burden of the training and inference stages. Computational methods for feature selection usually consist of a search algorithm that explores different combinations of variables, supplemented with a measure of performance (or score) for these combinations. There are several ways to accomplish this task; in our opinion, the best benchmarking options for the GRN inference scenario are the use of sequential search algorithms (as opposed to stochastic search) and performance measures based on IT, since these make feature selection fast and efficient, and also provide an easy means to communicate the results to non-specialists (e.g. molecular biologists, geneticists and physicians).
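A minimal sketch of such an IT-scored sequential search (an mRMR-style greedy forward selection; the gene names and toy data below are hypothetical illustrations, not the chapter's own algorithm):

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Empirical mutual information (bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def forward_select(features, target, k):
    """Greedy forward selection: at each step add the feature whose MI with
    the target, minus its mean redundancy with already-selected features,
    is largest (an mRMR-style score)."""
    selected = []
    candidates = list(features)
    while candidates and len(selected) < k:
        def score(name):
            relevance = mutual_information(features[name], target)
            if not selected:
                return relevance
            redundancy = sum(mutual_information(features[name], features[s])
                             for s in selected) / len(selected)
            return relevance - redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# toy dataset: g1 tracks the target perfectly, g2 is unrelated
features = {"g1": [0, 0, 1, 1], "g2": [0, 1, 0, 1], "g3": [1, 1, 0, 0]}
target = [0, 0, 1, 1]
```

The redundancy term is what distinguishes a sequential IT search from simply ranking variables by relevance: a candidate highly informative about the target but redundant with an already-selected gene is penalized.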
GRNs are graph-theoretical constructs that describe the integrated state of
a cell (or a small population of similar cells to be more precise) under certain
biological conditions at a given time. GRNs are means for identifying gene
interactions from experimental data through the use of theoretical models and
computational analysis. The inference of such an interaction connectivity network involves the solution of an inverse problem (a deconvolution) that aims to
uncover the interactions from the properties and dynamics of observable behavior in the form of, for example, RNA transcription levels in a characteristic gene
expression profile. A growing number of deconvolution methods (also called
reverse engineering methods) have been proposed in the past [6, 62]. Their
goal is to provide a well-defined representation of the cellular network topology from the transcriptional interactions as revealed by gene expression measurements that are then treated as samples from a joint probability distribution.
The goal of deconvolution methods is the discovery of GRNs based on statistical dependencies within this joint distribution [13]. One major shortcoming is
that, surprisingly, there is still no conceptual agreement as to what the dependencies are within these multivariate settings and about the role of noise and
stochastic dynamics in the problem. The special case of conditional statistical dependence has gained, however, a certain place as a somewhat useful criterion in most biomedical applications. The central aim is to find a way to decompose the Statistical Dependency Matrix (SDM), that is, the deviation of a joint probability distribution from the product of its marginals, into a series of well-defined contributions coming from interactions of several orders of complexity. IT is therefore the right setting to do so. Typical means to reach this goal consist in the quantification of the new information content that arises when we look at the full joint probability distribution compared to a series of successive independence approximations.
In GRNs, each variable of the dataset is represented by a node (or vertex) in the graph. There is a link joining two variable-nodes if these variables exhibit a particular form of dependency (which form depends explicitly on the inference method chosen). Some genes can produce a protein
2.
We will introduce here the essential notions of IT that will be used, like entropy, mutual information and other measures. In order to do so, let X and Y denote two discrete random variables having the following features:
finite alphabets $\mathcal{X}$ and $\mathcal{Y}$, respectively;
joint probability mass distribution $p(X, Y)$;
marginal probability mass distributions $p(X)$ and $p(Y)$.
Let also $\tilde{X}$ and $\tilde{Y}$ denote two additional discrete random variables defined on $\mathcal{X}$ and $\mathcal{Y}$ respectively; their associated probability mass distributions will be $\tilde{p}(X)$ and $\tilde{p}(Y)$, and their joint probability mass distribution $\tilde{p}(X, Y)$, defined on $\mathcal{J}$, the joint probability sampling space, $\mathcal{J} = \mathcal{X} \times \mathcal{Y}$. For particular realizations, we have $p(x) = P(X = x)$ and $p(y) = P(Y = y)$.
Following Shannon [58], for every discrete probability distribution $p(Y)$ it is possible to define the information-theoretical entropy H of such distribution as follows:

H[p(Y)] = -K \sum_{y \in \mathcal{Y}} p(y) \log p(y)   (1)

with K a positive constant. The Kullback-Leibler divergence between the two probability mass distributions $p(Y)$ and $\tilde{p}(Y)$ is given by:

KL[ p(Y); \tilde{p}(Y) ] = \sum_{y \in \mathcal{Y}} p(y) \log \frac{p(y)}{\tilde{p}(y)}   (2)
The joint Kullback-Leibler divergence between two probability mass distributions $p(X, Y)$ and $\tilde{p}(X, Y)$ is given by:

KL[ p(X, Y); \tilde{p}(X, Y) ] = \sum_{x \in \mathcal{X}} p(x) \sum_{y \in \mathcal{Y}} p(y|x) \log \frac{p(x, y)}{\tilde{p}(x, y)}   (3)

= KL[ p(X); \tilde{p}(X) ] + \sum_{x \in \mathcal{X}} p(x) \sum_{y \in \mathcal{Y}} p(y|x) \log \frac{p(y|x)}{\tilde{p}(y|x)}   (4)

Expanding the marginal divergence of equation 2,

KL[ p(Y); \tilde{p}(Y) ] = \sum_{y \in \mathcal{Y}} p(y) \log p(y) - \sum_{y \in \mathcal{Y}} p(y) \log \tilde{p}(y)   (5)
We can see that the first term on the right-hand side of equation 5 is precisely the negative of the entropy H(Y) as given by equation 1 (taking K = 1). Shannon's entropy depends on the distribution $p(Y)$ and, as Shannon himself showed [58], it is maximal for a uniform distribution $u(Y)$, for which $H[u(Y)] = \log |\mathcal{Y}|$. If we replace $\tilde{p}(y)$ by $u(Y)$ in equation 5 we get:

H[p(Y)] = \log |\mathcal{Y}| - KL[ p(Y); u(Y) ]   (6)
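Equations (1), (2) and (6) are easy to check numerically; a small Python sketch (the three-letter alphabet and the distribution values are illustrative):

```python
from math import log

def entropy(p):
    """Shannon entropy (nats, K = 1) of a discrete distribution."""
    return -sum(pi * log(pi) for pi in p if pi > 0)

def kl(p, q):
    """Kullback-Leibler divergence KL[p; q] in nats, as in eq. (2)."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]
u = [1 / 3, 1 / 3, 1 / 3]   # uniform distribution on the same alphabet
# Equation (6): H[p] = log|Y| - KL[p; u], so H is maximal at the uniform u
assert abs(entropy(p) - (log(3) - kl(p, u))) < 1e-12
```

Since KL is non-negative, equation (6) immediately gives the bound $H[p(Y)] \le \log |\mathcal{Y}|$.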
The joint entropy of X and Y is:

H(Y, X) = -\sum_{y \in \mathcal{Y}} \sum_{x \in \mathcal{X}} p(x, y) \log p(x, y)   (7)

We can notice that the maximal joint entropy is attained under independence of the random variables Y and X, that is, when the JPD is factorized, $p(Y, X) = p(Y)p(X)$; in this case the entropy of the JPD is just the sum of the respective entropies. An inequality theorem can be stated as an upper bound for the joint entropy:

H(Y, X) \le H(Y) + H(X)   (8)
The mutual information between Y and X can be defined in terms of these entropies as:

I(Y, X) = H(Y) + H(X) - H(Y, X)   (12)

If we resort to Shannon's definition of entropy (equation 1) [58] and substitute it into equation 12 we get:
H(Y, X) =
X X
yY xX
p(x, y) log
p(x, y)
p(x)p(y)
(13)
(14)
Mutual information is also given by the Kullback-Leibler divergence between the conditional distribution $p(X|Y)$ and the marginal distribution $p(X)$:

I(Y, X) = KL[ p(X|Y); p(X) ]   (15)
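A quick numerical check that equations (12) and (13) agree for any finite joint distribution (the 2x2 joint table is illustrative):

```python
from math import log

def entropy_joint(pxy):
    """Joint entropy of eq. (7); pxy is a 2-D list of joint probabilities."""
    return -sum(p * log(p) for row in pxy for p in row if p > 0)

def mutual_information(pxy):
    """I(Y, X) computed directly from equation (13)."""
    px = [sum(row) for row in pxy]        # marginal p(x) over rows
    py = [sum(col) for col in zip(*pxy)]  # marginal p(y) over columns
    return sum(p * log(p / (px[i] * py[j]))
               for i, row in enumerate(pxy)
               for j, p in enumerate(row) if p > 0)

pxy = [[0.4, 0.1],
       [0.1, 0.4]]
hx = -sum(p * log(p) for p in (0.5, 0.5))   # both marginals are (0.5, 0.5)
# Equation (12): I(Y, X) = H(Y) + H(X) - H(Y, X)
assert abs(mutual_information(pxy) - (hx + hx - entropy_joint(pxy))) < 1e-12
```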
3.
The deconvolution of a GRN can be based on a maximum entropy optimization of the JPD of gene-gene interactions as given by gene expression experimental data, and can be implemented as follows [26]. The JPD for the stationary expression of all genes, $P(\{g_i\})$, $i = 1, \ldots, N$, may be written as follows [38]:

P(\{g_i\}) = \frac{1}{Z} e^{-H_{gen}}   (16)

H_{gen} = -\Big[ \sum_i^N \phi_i(g_i) + \sum_{i,j}^N \phi_{i,j}(g_i, g_j) + \sum_{i,j,k}^N \phi_{i,j,k}(g_i, g_j, g_k) + \ldots \Big]   (17)

where Z is the partition function and the $\phi$'s are interaction potentials of increasing order.
H_{approx} = -\Big[ \sum_i^N \phi_i(g_i) + \sum_{i,j}^N \phi_{i,j}(g_i, g_j) \Big]   (18)
Under that approximation, the reconstruction (or deconvolution) of the associated GRN consists in the inverse problem of determining the complete set of relevant 2-way interactions $\phi_{i,j}(g_i, g_j)$ consistent with the JPD (equations 16 and 17) that defines all known constraints, e.g. the values of the stationary expression of genes $g_i$ as given by the set of $\phi_i(g_i)$'s, while being non-committal with respect to every other restriction in the form of a marginal. The modeling of a GRN depends on the description of the interactions in the form of several correlation functions. A great deal of work has been done within the framework of the Bayesian Network (BN) approach [51, 23]. BN models, both static and dynamic, have provided a better understanding of the problem in terms of solvability, noise reduction and algorithmic complexity. Since BNs are a form of the Directed Acyclic Graph (DAG) problem, there are several instances (e.g. feed-forward loops, feed-back cycles, etc.) in which the DAG formalism of BNs falls short. It has been noted [6] that BNs require a larger number of data points (samples) to infer the probability density distributions, whereas information-theoretical approaches perform well for steady-state data and can be applied even when few experiments (compared to the number of genes) are available. A recently developed approach is the use of statistical and information-theoretical models to describe the interactions [36].
If we consider a 2-way interaction Hamiltonian, all gene pairs i, j for which $\phi_{i,j} = 0$ are said to be non-interacting. This is true for genes that are statistically independent, $P(g_i, g_j) = P(g_i)P(g_j)$, but it is also valid for genes that do not have a direct interaction but are connected via other genes, i.e. $\phi_{i,j} = 0$ but $P(g_i, g_j) \neq P(g_i)P(g_j)$. Several metrics such as the Pearson correlation, squared correlation and Spearman rank coefficients over the sampling universe have been used, but the performance of these methods is usually poor, as they suffer from a large number of false-positive predictions.
3.1.
The probability densities involved can be estimated by means of a Gaussian kernel estimator; for a sample $\{X_i\}$ of size M this takes the form:

f(X) = \frac{1}{M} \sum_i \frac{1}{h^2} \, G\!\left( \frac{|X - X_i|}{h} \right)   (19)

with G a Gaussian kernel and h the kernel width.
The mutual information between two expression profiles $\{x_i\}$ and $\{y_i\}$ is then estimated as:

I(\{x_i\}, \{y_i\}) = \frac{1}{M} \sum_i \log \frac{f(x_i, y_i)}{f(x_i) f(y_i)}   (20)
hence, two genes with expression profiles $g_i$ and $g_j$ for which $I(g_i, g_j) \neq 0$ are said to interact with each other with a strength $I(g_i, g_j) \propto \phi_{i,j}(g_i, g_j)$, whereas two genes for which $I(g_i, g_j)$ is zero are declared non-directly interacting to within the given approximations. Since MI is reparametrization invariant, one usually calculates the normalized mutual information. In this case $I(g_i, g_j) \in [0, 1]$, $\forall\, i, j$.
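A histogram-based stand-in for this MI estimate (simpler than the kernel estimator of equation (19); the binning scheme and the normalization by the larger marginal entropy are illustrative choices, not the chapter's):

```python
from collections import Counter
from math import log

def normalized_mi(x, y, bins=4):
    """Histogram estimate of normalized mutual information between two
    real-valued expression profiles; the result lies in [0, 1]."""
    def discretize(v):
        lo, hi = min(v), max(v)
        width = (hi - lo) / bins or 1.0
        return [min(int((vi - lo) / width), bins - 1) for vi in v]
    xs, ys = discretize(x), discretize(y)
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = sum((c / n) * log((c / n) / ((px[a] / n) * (py[b] / n)))
             for (a, b), c in pxy.items())
    hx = -sum((c / n) * log(c / n) for c in px.values())
    hy = -sum((c / n) * log(c / n) for c in py.values())
    hmax = max(hx, hy)
    return mi / hmax if hmax > 0 else 0.0   # I <= min(H) <= max(H)

g1 = [0.0, 1.0, 2.0, 3.0]        # hypothetical expression profile
nmi_self = normalized_mi(g1, g1)  # identical profiles give maximal NMI
```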
starts with a set containing all the variables and then selects the variable Xi
whose removal induces the highest increase of the objective function. The procedure is enhanced by an iterative sequential replacement which, at each step,
swaps the status of a selected and a non selected variable such that the largest
increase in the objective function is achieved. The sequential replacement is
stopped when no further improvement is obtained [43]. Forward selection, backward
elimination, and sequential replacement all have an algorithmic complexity of
O(n2 ) so that the network built by backward elimination followed by sequential
replacement has the same asymptotic computational cost as the one based on a
forward selection strategy alone.
As one can further notice, the inference of GRNs by means of such high-performance IT methods is hampered by a large computational complexity. The limiting condition for these approaches is the time-consuming step of computing the MI matrix. A method has been proposed by Qiu and colleagues [53] to reduce this computation time. It is based on the application of spectral analysis to re-order the genes, so that genes that share regulatory relationships are more likely to be placed close to each other. Then, using a sliding-window approach with appropriate window size and step size, the MI for the genes within the sliding window is computed, and the remainder is assumed to be zero. Qiu's method does not incur a performance loss in regions of high precision and low recall, while the computational time is significantly lowered. The essence of Qiu's method is as follows: to determine the new gene ordering, a Laplacian matrix is derived from the correlation matrix of the gene expression data, assuming the correlation matrix provides an adequate approximation to the adjacency matrix for this purpose; then the Fiedler vector [11] is computed, which is the eigenvector associated with the second smallest eigenvalue of the Laplacian matrix. Since the Fiedler vector is smooth with respect to the connectivity described by the Laplacian matrix, the elements of the Fiedler vector are sorted to obtain the desired gene ordering. The computational complexity of obtaining the gene ordering is negligible compared to that of computing the MI matrix. The reduction in computational complexity is the result of computing only the diagonal part of the reshuffled MI matrix. Because the remaining entries of the MI matrix are set to zero, there is a potential loss of reconstruction accuracy, although due to the Fiedler minimization [53] this effect is not expected to be significant. In fact, according to a benchmark of the method [53], in the high-precision low-recall regime applying the sliding window does not cause a performance loss. In some cases, applying the sliding window yields slightly better
conditional dependencies. One way to do that is by means of the so-called Iterated Conditional Modes (ICM) algorithm [63], but other IT-based alternatives could also be used.
Conditional dependencies are not the only application of IT and MRFs in transcriptional network inference. To study functional robustness in GRNs, Emmert-Streib and Dehmer [20] modeled the information processing within the network as a first-order Markov chain and studied the influence of single-gene perturbations on the global, asymptotic communication among genes. Differences were accounted for by an information-theoretic measure that allowed the prediction of genes that are fragile with respect to single-gene knockouts. The information-theoretic measure used to capture the asymptotic behavior of information processing evaluates the deviation of the unperturbed (or normal, (n)) state from the perturbed (p) state caused by the perturbation of gene k. The relative entropy or Kullback-Leibler (KL) divergence was used to quantify this deviation:
KL_{i,k} = KL\big[ p^{p,\infty}_{i,k} ; p^{n,\infty}_i \big] = \sum_m p^{p,\infty}_{i,k}(m) \log \frac{p^{p,\infty}_{i,k}(m)}{p^{n,\infty}_i(m)}   (22)

In equation 22 the stationary distributions $p^{p,\infty}_{i,k}$ and $p^{n,\infty}_i$ are given by:

p^{p,\infty}_{i,k} = \lim_{t \to \infty} T_k^t \, p^0_i   (23)

p^{n,\infty}_i = \lim_{t \to \infty} T^t \, p^0_i   (24)
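A toy version of this perturbation analysis, with hypothetical 3-state transition matrices standing in for the normal and perturbed chains:

```python
import numpy as np

def stationary(T, p0, steps=200):
    """Iterate p <- T p toward the stationary distribution, eqs. (23)-(24)."""
    p = np.asarray(p0, dtype=float)
    for _ in range(steps):
        p = T @ p
    return p

def kl(p, q):
    """KL divergence of eq. (22) between two discrete distributions."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# column-stochastic matrices: normal chain T, chain Tk with gene k perturbed
T = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])
Tk = np.array([[0.6, 0.3, 0.2],
               [0.2, 0.5, 0.2],
               [0.2, 0.2, 0.6]])
p0 = np.array([1.0, 0.0, 0.0])
deviation = kl(stationary(Tk, p0), stationary(T, p0))  # KL_{i,k} of eq. (22)
```

A large `deviation` flags gene k as fragile: perturbing it measurably shifts the asymptotic communication pattern.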
The Markov chain given by $T_k$ corresponds to the process obtained by perturbing gene k in the network. By means of this Markov chain model, supplemented with an information-theoretical KL measure, Emmert-Streib and Dehmer [20] were able to study the asymptotic behavior of the transcriptional regulatory network of yeast regarding information propagation under the influence of single-gene perturbations. Hence not only static network properties (such as structure) of transcriptional regulation networks but also dynamic features (such as robustness) can be analyzed from the standpoint of IT. The study concludes that knocked-out genes destroy some communication paths and, hence, can still have a strong impact on the information processing within the cell. It seems reasonable to assume that the further away the knocked-out gene is from the starting gene (say, in Dijkstra distance [16]) the smaller the impact will be. This is strong evidence that information processing on a systems level
MIs are greater than $I_0$ and removes the edge with the smallest value. The DPI is thus useful to quantify efficiently the dependencies among a large number of genes. The ARACNe algorithm eliminates those statistical dependencies that might be of an indirect nature, such as between two genes that are separated by intermediate steps in a transcriptional cascade. Such genes will very likely have non-linearly correlated expression profiles which may result in high MI, and would otherwise be selected as candidate interacting genes. Given a transcription factor, application of the DPI will generate predictions about other genes that may be its direct transcriptional targets or its upstream transcriptional regulators [39, 25].
The use of the DPI may result not only in a better assessment of the results but also in a significant reduction of the computational burden associated with network inference. Zola et al. [67] presented a parallel method integrating mutual information, the data processing inequality, and statistical testing to detect significant dependencies between genes, and to efficiently exploit the parallelism inherent in such computations. They developed a method to carry out permutation testing for assessing the statistical significance of interactions, while reducing its computational complexity by a factor of O(n^2), where n is the number of genes. They solved the problem of inference (usually consuming thousands of computation hours) at the whole-genome network level by constructing a 15,222-gene network of the plant Arabidopsis thaliana from 3,137 microarray experiments in 30 minutes on a 2,048-CPU IBM Blue Gene/L, and in 2 hours and 25 minutes on an 8-node Cell blade cluster [67].
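The permutation-testing idea can be sketched serially (shuffling one profile destroys any dependency, yielding a null distribution for the MI; the profile values below are illustrative, not Zola et al.'s implementation):

```python
import random
from collections import Counter
from math import log

def mi(xs, ys):
    """Empirical mutual information between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log((c / n) / (px[a] * py[b] / n ** 2))
               for (a, b), c in pxy.items())

def mi_pvalue(xs, ys, permutations=200, seed=1):
    """Fraction of shuffled profiles whose MI reaches the observed value."""
    rng = random.Random(seed)
    observed = mi(xs, ys)
    hits = 0
    for _ in range(permutations):
        shuffled = ys[:]
        rng.shuffle(shuffled)
        if mi(xs, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (permutations + 1)

profile = [0, 1] * 20
p_linked = mi_pvalue(profile, profile)   # strong dependency: small p-value
p_flat = mi_pvalue(profile, [0] * 40)    # no dependency: large p-value
```

Each gene pair requires its own set of permutations, which is what makes the serial version so expensive and the parallel formulation attractive.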
3.1.4. Minimum Description Length
One of the major drawbacks of information-theoretic models for inferring GRNs is that of setting a threshold which defines the regulatory relationships between genes. The minimum description length (MDL) principle has been implemented to overcome this problem [10, 19]. The description length used by the MDL principle is the sum of the model length and the data encoding length. A user-specified fine-tuning parameter is used as a control mechanism between model and data encoding, but it is difficult to find the optimal parameter. A new inference algorithm has been proposed which incorporates mutual information (MI), conditional mutual information (CMI) [defined in terms of the associated conditional entropies] and the predictive minimum description length (PMDL) principle to infer gene regulatory networks from DNA microarray data. In this
algorithm, the information theoretic quantities MI and CMI determine the regulatory relationships between genes and the PMDL principle method attempts to
determine the best MI threshold without the need of a user-specified fine tuning
parameter.
Given three random variables X, Y and Z, the conditional mutual information is a measure of the reduction in the uncertainty of X due to knowledge of Y when Z is given. In other words,

I(X, Y | Z) = \sum_{y \in \mathcal{Y}} \sum_{x \in \mathcal{X}} \sum_{z \in \mathcal{Z}} p(x, y, z) \log \frac{p(x, y | z)}{p(x | z) p(y | z)}   (25)
The description length of the MDL principle involves the calculation of the model length and the data length. As the length can vary for various models, the method could give results biased towards the length of the model. A model based on universal code length is the PMDL principle. The description length for a model in PMDL is given as:

L_D = -\sum_{t=0}^{m-1} \log[ p(X_{t+1} | X_t) ]   (26)

whose expected value can be written in terms of conditional entropies:

L_D = H(X_1) + \sum_{j=1}^{m-1} H(X_{j+1} | X_j)   (27)

Since $H(X_1)$ is common to all models it can be removed from the description length to give [10]:

L_D = \sum_{j=1}^{m-1} H(X_{j+1} | X_j)   (28)
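A small illustration of equation (28), scoring an ordered chain of discrete variables by summed conditional entropies (the binary samples are hypothetical):

```python
from collections import Counter
from math import log

def conditional_entropy(child, parent):
    """H(child | parent) in nats, estimated from paired discrete samples."""
    n = len(child)
    joint = Counter(zip(parent, child))
    marg = Counter(parent)
    # -sum P(p, c) log P(c | p), with P(c | p) = count(p, c) / count(p)
    return -sum((c / n) * log(c / marg[p]) for (p, _), c in joint.items())

def description_length(series):
    """Equation (28): sum of H(X_{j+1} | X_j) along an ordered chain of
    variables, each given as a list of samples."""
    return sum(conditional_entropy(series[j + 1], series[j])
               for j in range(len(series) - 1))

x1 = [0, 0, 1, 1]
x2 = [0, 0, 1, 1]   # fully determined by x1: contributes 0
x3 = [0, 1, 0, 1]   # independent of x2: contributes H(x3) = log 2
ld = description_length([x1, x2, x3])
```

A tightly coupled chain yields a short description length, which is the quantity the PMDL threshold search operates on.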
It is also noticeable that the MDL principle helps to achieve a good trade-off between the network model complexity and the accuracy of data fitting, since, given a network and a dataset, the MDL principle evaluates simultaneously the goodness of fit of the network and the data. Intuitively, the more complicated the network is, the better the data will be fitted. However, very often models which are over-fitted relative to the actual systems are selected, which gives rise to numerous errors. MDL aims to achieve a good trade-off between model complexity and fitness of the data. A general criterion is thus obtained for constructing the network so as to contain only direct interactions. The convergence of the proposed MDL-based network inference algorithms can be assessed by the recovery of the topology of artificial networks and through the error-rate plots obtained through extensive simulations on datasets produced by synthetic networks [66].
3.1.5. Kullback-Leibler Divergence
The Kullback-Leibler divergence [33] (as well as its symmetrized version, the Jensen-Shannon measure) is, as it turns out, a very commonly used information density in GRN inference and other problems in computational molecular biology, either as a unique measure [45, 44] or used in conjunction with other indicators, such as spectral metrics [29], Markov fields [20], minimum description lengths [19], Bayesian networks [50, 31, 46, 48] and multivariate analysis [40].
However, by far the most general use of the KL divergence within the GRN information setting is in playing the role of the multi-information: it is known [40] that for two variables, $X_1$ and $X_2$, independence is well defined via the decomposition of the bivariate JPD, $P(X_1, X_2) = P(X_1)P(X_2)$, and mutual information $I(X_1; X_2) = \langle \log_2 P(X_1, X_2)/[P(X_1)P(X_2)] \rangle$, which is the only measure of dependence [58]. Along the same lines, the total interaction (i.e. the deviation from independence) in a multivariate JPD, $P(X_i)$, $i = 1, \ldots, N$, can be measured by the multi-information as follows:

I[P] = KL\Big[ P(X_1, \ldots, X_N) ; \prod_i P(X_i) \Big]   (29)
The maximum entropy distribution consistent with the set of measured marginals can then be obtained from a constrained optimization:

P^{*} = \arg\max_{P, \{\lambda\}} \Big[ H(P) + \sum_M \lambda_M \big( P_M - \tilde{P}_M \big) \Big]   (30)
series. Nevertheless, IBS has proved to be a very powerful tool in the comparison of the dynamics of highly non-linear processes. Within the present context [26], the symbolic sequence represents the expression values of a single gene (say the k-th gene) along the sampling universe (of size M), as given by a vector $g_k = (g_k^1, g_k^2, \ldots, g_k^M)$. Let us consider a series $\{\tau_n\}$ that could well represent a gene expression vector. It is possible to classify each pair of successive points into one of the following binary states $B_n$: if $(\tau_{n+1} - \tau_n) < 0$ then $B_n = 0$; otherwise $B_n = 1$. This procedure maps the M-step real-valued time series $\tau(i)$ into an (M-1)-step binary-valued series $B(i)$. It is now possible to define a binary sequence of length m (called an m-bit word). Each of the m-bit words $w_k$ represents a unique pattern in a given time series. For every unitary time-shift, the algorithm makes a different collection W of m-bit words over the whole time series, $W = \{w_1, w_2, \ldots, w_n\}$. It is expected that the frequency of occurrence of these m-bit words will reflect the underlying dynamics of the original (real-valued) time series. We then seek to write down a probability distribution function in the rank-frequency representation (RF-PDF). This RF-PDF represents the statistical hierarchy of symbolic words of the original series [65]. Two given symbolic sequences are said to have similarity if they give rise to similar probability distribution functions.
Following the very same order of ideas, Yang and collaborators [65] defined a measure of similarity (akin to statistical equivalence) between two series by plotting the rank number of every m-bit word in the first series against the rank of the same m-bit word in the second series. Of course, since the series are supposed to be finite, the m-bit words are not all equally likely to appear. The method introduces the likelihood of each word by defining a weighted distance $\Delta_m$ between two given symbolic sequences $\Omega_1$ and $\Omega_2$ as follows:

\Delta_m(\Omega_1, \Omega_2) = \frac{1}{2^m - 1} \sum_{k=1}^{2^m} | R_1(w_k) - R_2(w_k) | \, F(w_k)   (31)

where $R_1(w_k)$ and $R_2(w_k)$ are the ranks of word $w_k$ in the two sequences and $F(w_k)$ is a normalized entropy-based weight:

F(w_k) = \frac{1}{Z} \big[ -p_1(w_k) \log p_1(w_k) - p_2(w_k) \log p_2(w_k) \big]   (32)
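The IBS pipeline (binary symbolization, m-bit word counting, rank comparison with the entropy weight of equation (32)) can be sketched as follows; the series values are illustrative:

```python
from collections import Counter
from math import log

def mbit_words(series, m=3):
    """Symbolize a real-valued series into bits (B_n = 0 on a decrease,
    1 otherwise) and count the overlapping m-bit words."""
    bits = [0 if b - a < 0 else 1 for a, b in zip(series, series[1:])]
    words = [tuple(bits[i:i + m]) for i in range(len(bits) - m + 1)]
    return Counter(words)

def ibs_distance(s1, s2, m=3):
    """Weighted rank distance of eq. (31) with the weight of eq. (32)."""
    c1, c2 = mbit_words(s1, m), mbit_words(s2, m)
    vocab = sorted(set(c1) | set(c2))
    n1, n2 = sum(c1.values()), sum(c2.values())
    def ranks(c):  # rank 1 = most frequent word; ties broken by vocab order
        ordered = sorted(vocab, key=lambda w: -c[w])
        return {w: r for r, w in enumerate(ordered, start=1)}
    r1, r2 = ranks(c1), ranks(c2)
    def h(p):
        return -p * log(p) if p > 0 else 0.0
    weights = {w: h(c1[w] / n1) + h(c2[w] / n2) for w in vocab}
    z = sum(weights.values()) or 1.0
    return sum(abs(r1[w] - r2[w]) * weights[w] / z
               for w in vocab) / (2 ** m - 1)

s = [0.1, 0.5, 0.2, 0.8, 0.3, 0.9, 0.4, 0.7, 0.2, 0.6]
d = ibs_distance(s, list(reversed(s)))
```

Identical series share all word ranks and thus have distance zero; dynamically dissimilar series accumulate weighted rank differences.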
4.
4.1. Bayesian Networks
A Bayesian network (BN) is a probabilistic graphical network model described by a directed acyclic graph (DAG). In the model, each node represents a random variable and the edges define conditional independence relations between these random variables. These relationships, e.g. gene-gene interactions, can be seen in a directed graph without cycles. "Without cycles" means a gene may have no direct or indirect interaction with itself. In order to reverse-engineer a gene network using this approach, one needs to find the directed acyclic graph that best describes the gene expression data. This particular limitation of a directed acyclic graph can be overcome by using a dynamic Bayesian network.
4.2.
4.3. State-Space Models
is going on inside the process and how this internal behavior is affected by the
inputs. These models are suitable for modeling time series data where we have
a series of observations related to a series of unobserved variables changing
over time. Time series models in state-space representation can be thought of as
unobserved component models. The state vector represents those unobserved
or hidden or missing variables and their dynamics over time are governed by
a state transition equation. In the very general setting of a state-space model,
the state vector determines the future evolution of the dynamic system, given
future time paths of all of the variables affecting the system. The variables are
not restricted, they can be either discrete with a countable number of possible
values or continuous with an associated density curve. For example, modeling
gene expression data assumes continuous variables and requires the inclusion
of hidden states. Hidden variables could model the effects of genes that have
not been included in the experiment, they could also model levels of regulatory
proteins as well as possible effects of mRNA or protein degradation. One goal
is to infer the characteristics and properties of the unobserved variables based
on the observations. In linear state-space models, a sequence of p-dimensional real-valued observation vectors $\{y_1, \ldots, y_T\}$ is modeled by assuming that at each time step $y_t$ was generated from a K-dimensional real-valued hidden (i.e. unobserved) state variable $x_t$, and that the sequence of $x_t$'s is governed by a first-order Markov process. This type of model is shown pictorially in Figure (3).
A linear-Gaussian state space model of the time series {yt } is specified by
the matrices A and C called system matrices and is described by a pair of equations:
xt+1 = Axt + wt
(33)
yt = Cxt + vt
(34)
These two equations represent the most basic form of a state-space model.
The vector xt RK is called the state vector at time t. The state equation
(33) shows how this vector evolves with time. A is the dynamic or transition
state matrix, and its eigenvalues are important in determining the way the data
behave. The observation equation (34) specifies the relationship between the
observed data and this newly introduced vector xt . C describes the relation between state and observation, and wt and vt are zero-mean random noise vectors.
For the most general case the noise vectors could be mutually correlated,
although serially uncorrelated. In the particular Linear Gaussian case they are
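Equations (33)-(34) can be simulated directly; a minimal sketch with hypothetical two-state dynamics driving three observed genes:

```python
import numpy as np

def simulate_lds(A, C, x0, steps, noise=0.01, seed=0):
    """Simulate x_{t+1} = A x_t + w_t, y_t = C x_t + v_t (eqs. 33-34)
    with zero-mean Gaussian noise vectors w_t and v_t."""
    rng = np.random.default_rng(seed)
    xs, ys = [], []
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        ys.append(C @ x + noise * rng.standard_normal(C.shape[0]))
        xs.append(x)
        x = A @ x + noise * rng.standard_normal(x.shape[0])
    return np.array(xs), np.array(ys)

A = np.array([[0.9, 0.1],
              [0.0, 0.8]])   # stable: eigenvalues inside the unit circle
C = np.array([[1.0, 0.0],
              [0.5, 0.5],
              [0.0, 1.0]])   # 3 observed "genes" driven by 2 hidden states
xs, ys = simulate_lds(A, C, x0=[1.0, 1.0], steps=50)
```

Because the eigenvalues of this A lie inside the unit circle, the hidden trajectory decays toward the noise floor, which illustrates why the eigenvalues of A determine how the data behave.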
4.4.
Remarks
The system matrices A, B, C, D are taken to be constant in this research, but they may also vary over time, in which case it is appropriate to add a subscript indicating this.
When the sequence $\{x_1, w_1, \ldots, w_T\}$ is independent, the distribution of $x_{t+1} | x_t, \ldots, x_1$ is the same as the distribution of $x_{t+1} | x_t$; hence the state vector $x_t$ evolves with a first-order Markov property, with A as the transition matrix.
The noise vectors can also be viewed as hidden variables. Here the matrix D in the observation equation captures gene expression-level influences at consecutive time points, whilst the matrix C captures the influence of the hidden variables on gene expression levels at each time point. Matrix B models the influence of gene expression values from previous time points on the hidden states, and A is the state transition matrix. However, our interest focuses on CB + D, which captures not only the direct gene-to-gene interactions but also the gene-to-gene interactions mediated through the hidden states over time. This is the matrix we will concentrate the analysis on, since it captures all of the information related to gene-gene interaction over time.
5. Constrained LDS
Mathematically speaking, the idea of adding constraints to the model is basically to reduce the number of parameters to estimate. Narrowing down the range of parameters to estimate by adding extra constraints reduces dimensionality, which can considerably simplify the search for the parameters that best describe the model. At all times during modeling with constraints, diagnostics should be made to make sure the model still fits well after taking account of the constraints. How precisely to include these forms of information into the inference process is not a straightforward task. However, this is the true art of modeling.
From the biological point of view, the current application to gene expression data is already complex. Data generation, low-level analyses and classification are known to be crucial in obtaining gene expression levels. Different algorithms can lead to different sets of genes. Hence, biological mining should be present in any machine learning approach. In this sense, any knowledge about gene behavior and regulatory interactions is helpful. Now, if this additional information can be included and modeled, estimation becomes more realistic, not only due to the reduction of parameters but also due to a more biologically based approach.
Given either a priori or newly hypothesized information leading to a set of plausible models, the LDS model is re-trained based on this knowledge about the parameters. The a priori information would be supplied by past experiments or biological knowledge, while the newly hypothesized information is obtained from the bootstrap analysis.
5.1. Model Definition
small departures from normality. The model used in this work is defined with the Gaussian assumption only insofar as it makes the analysis of the models more straightforward and tractable. However, for statistical inference and validation of the model, no essential use of the Gaussian assumption is made. Instead, more general methods such as bootstrapping are employed.
5.2. Structural Specification
5.3. Estimation
second order moments. This reduces the problem to one of computing projections onto the subspaces spanned by the observables, but the derivations and
machinery of that theoretical approach are tedious. However, in the special
case when the states and observables are jointly Gaussian, the least squares estimators of state are given by conditional expectations (conditioned on the observables) which are in turn linear functions of the observables. Moreover, the
conditional expectation operator has all the essential properties of the subspace
projection operator in the Hilbert space context. As a consequence, the shorter
and more elegant analysis of the problem in the Gaussian context leads to exactly the same estimators of the state variables as the more general Hilbert space
context. Thus, in terms of formulating the state estimators, there is no loss of
generality in assuming Gaussian joint distributions.
Regarding the estimation of the structural parameters, in the absence of assumptions regarding the joint distributions of the state variables and observables
or any other pertinent information, a weighted least-squares approach would be
reasonable and justified. If the assumption is made that the state variables and
observables are jointly Gaussian, then the method of maximum likelihood leads
to parameter estimators that are essentially equivalent to those yielded by the
weighted least-squares approach. Thus, again there is no loss of generality in
making the Gaussian assumption for constructing estimators of structural parameters.
5.4.
Derivation
x_t = A x_{t-1} + B u_t + w_t    (37)

y_t = C x_t + D y_{t-1} + v_t    (38)
The column vector x is the state vector of hidden variables for the system, u is the input observation vector, and C is the state-to-observation matrix, which captures the influence of the hidden variables on the gene expression level at each time point.
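For concreteness, the observation equation can be simulated directly. In this sketch all dimensions, matrices and noise levels are arbitrary illustrative choices, and the hidden states are simply drawn at random rather than produced by the state equation.

```python
import numpy as np

rng = np.random.default_rng(1)
p, k, T = 3, 2, 50            # genes, hidden states, time points (illustrative)

C = rng.normal(size=(p, k))   # state-to-observation matrix
D = 0.3 * np.eye(p)           # gene-to-gene interaction at consecutive times
x = rng.normal(size=(T, k))   # hidden state sequence (given, for illustration)
R = 0.1 * np.eye(p)           # observation noise covariance

y = np.zeros((T, p))
for t in range(1, T):
    v = rng.multivariate_normal(np.zeros(p), R)
    y[t] = C @ x[t] + D @ y[t - 1] + v   # y_t = C x_t + D y_{t-1} + v_t
```

Because the spectral radius of D is below one here, the simulated expression trajectories remain bounded.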
The matrix D describes the gene-to-gene interaction at consecutive time
points. From this matrix we obtain the Bayesian network representation of the
causal relationships between the genes. After the model parameters are estimated using the EM algorithm, with the Kalman filter and smoother in the E-step, we proceed to analyze the matrix D. The values in this matrix determine the conditional probabilities of the relationships between genes. In order to test the robustness of the model, a bootstrap experiment is performed by randomly resampling the data. Using 300 bootstrap samples of the data, we average the values of the D's and threshold them to find entries that are not significantly different from zero. Those entries are zeroed out, leading to a new D matrix.
The model with this filtered matrix D is put back into the LDS to be trained again and find better estimates of the parameters. The gain is that this time the number of parameters to be estimated has been reduced.
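The bootstrap filtering of D can be sketched as follows. As a hypothetical stand-in for the full EM/Kalman fit, D is re-estimated here by a plain lagged least-squares regression; the resampling, quantile thresholding and zeroing logic is the part being illustrated, and all numbers are toy choices.

```python
import numpy as np

rng = np.random.default_rng(2)

def bootstrap_filter_D(y, n_boot=300, alpha=0.05):
    """Resample (y_{t-1}, y_t) pairs, refit D by least squares, and zero
    out entries whose bootstrap interval contains zero (i.e., entries not
    significantly different from zero)."""
    Ylag, Y = y[:-1], y[1:]
    T, p = Y.shape
    Ds = np.empty((n_boot, p, p))
    for b in range(n_boot):
        idx = rng.integers(0, T, size=T)             # resample transition pairs
        Db, *_ = np.linalg.lstsq(Ylag[idx], Y[idx], rcond=None)
        Ds[b] = Db.T                                 # so that y_t ~ D y_{t-1}
    lo = np.quantile(Ds, alpha / 2, axis=0)
    hi = np.quantile(Ds, 1 - alpha / 2, axis=0)
    keep = (lo > 0) | (hi < 0)                       # interval excludes zero
    return np.where(keep, Ds.mean(axis=0), 0.0)

# Toy data: gene 0 drives itself and gene 1; gene 2 is pure noise.
T, p = 200, 3
y = np.zeros((T, p))
for t in range(1, T):
    y[t, 0] = 0.8 * y[t - 1, 0] + rng.normal(scale=0.5)
    y[t, 1] = 0.7 * y[t - 1, 0] + rng.normal(scale=0.5)
    y[t, 2] = rng.normal(scale=0.5)

D_filtered = bootstrap_filter_D(y)
```

The surviving non-zero entries of the filtered matrix are the candidate gene-to-gene interactions carried forward to the retraining step.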
The transition matrix can then be constrained to have elements equal to zero. To do so, one approach is to constrain the elements of D under the restriction DF = G, as suggested in [59], with F and G known and, in this particular case, specified in such a way that some elements of D are zeroed out. Under this restriction, the constrained estimators for C, D and R are determined by a constrained minimization problem using the technique of Lagrange multipliers.
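The technique of Lagrange multipliers for an equality-constrained quadratic objective can be illustrated on a generic least-squares problem (not the LDS likelihood itself): the stationarity conditions form a linear KKT system that is solved in one shot. All data and the constraint below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

def constrained_lstsq(X, y, A, b):
    """Minimize ||y - X beta||^2 subject to A beta = b via Lagrange
    multipliers, i.e. solve the KKT system
        [2 X'X  A'] [beta]   [2 X'y]
        [  A    0 ] [lam ] = [  b  ]
    """
    n = X.shape[1]
    m = A.shape[0]
    K = np.block([[2 * X.T @ X, A.T],
                  [A, np.zeros((m, m))]])
    rhs = np.concatenate([2 * X.T @ y, b])
    sol = np.linalg.solve(K, rhs)
    return sol[:n], sol[n:]            # estimator and multipliers

X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 2.0, 0.0]) + rng.normal(scale=0.1, size=100)
A = np.array([[0.0, 0.0, 1.0]])        # constraint: force beta_2 = 0
b = np.array([0.0])
beta, lam = constrained_lstsq(X, y, A, b)
```

The constraint is satisfied exactly (up to machine precision), while the remaining coefficients stay close to their unconstrained least-squares values.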
DIGRESSION:
Recall that in the state transition equation (37), A is the state transition matrix and B is the input-to-state matrix. The state and observation noise vectors, w_t and v_t respectively, are random variables assumed to be Gaussian distributed, mutually independent, and independent of the initial values of x and y. Since the constraints will be applied to the observation equation, we are interested in the terms involving C, D and R, so we are able to obtain a reduced likelihood function:
-2L(C, D, R) = NT log|R| + Σ_{j=1}^{N} Σ_{t=1}^{T} (y_t^(j) - C x_t^(j) - D u_t^(j))' R^{-1} (y_t^(j) - C x_t^(j) - D u_t^(j))    (39)

where ' denotes transposition and tr below denotes the trace. To facilitate the algebraic manipulation and make the process clearer, this expression can be rewritten as

-2L(C, D, R) = NT log|R| + tr{R^{-1}(Syy - Syx C' - C Syx' + C P C' - Syu D' - D Syu' + C Sxu D' + D Sxu' C' + D Suu D')}

where

Syy = Σ_{j=1}^{N} Σ_{t=1}^{T} y_t^(j) y_t^(j)'
Syx = Σ_{j=1}^{N} Σ_{t=1}^{T} y_t^(j) x_t^(j)'
Syu = Σ_{j=1}^{N} Σ_{t=1}^{T} y_t^(j) u_t^(j)'
Sxu = Σ_{j=1}^{N} Σ_{t=1}^{T} x_t^(j) u_t^(j)'
Suu = Σ_{j=1}^{N} Σ_{t=1}^{T} u_t^(j) u_t^(j)'
P = Σ_{j=1}^{N} Σ_{t=1}^{T} x_t^(j) x_t^(j)'
Taking partial derivatives of (39) and setting them equal to zero, we solve for C, D and R. In other words, we find the unconstrained estimators that minimize the likelihood function (39).
C = (Syx - D Sxu') P^{-1}    (40)

D = (Syu - C Sxu) Suu^{-1}    (41)

R = (1/NT)(Syy - C Syx' - D Syu')    (42)

Constrained Minimization Problem 1
Minimize

-2L(C, D, R) = NT log|R| + tr{R^{-1}(Syy - Syx C' - C Syx' + C P C' - Syu D' - D Syu' + C Sxu D' + D Sxu' C' + D Suu D')}

subject to DF = G. Introducing a matrix Λ of Lagrange multipliers, define

M(C, D, R, Λ) = NT log|R| + tr{R^{-1}(Syy - Syx C' - C Syx' + C P C' - Syu D' - D Syu' + C Sxu D' + D Sxu' C' + D Suu D')} - tr{Λ'(DF - G)}    (43)

Since the constraint term vanishes whenever DF = G, a minimum for M is also a minimum for the likelihood function (39).
Setting the partial derivatives of M with respect to C, D and Λ equal to zero gives

∂M/∂C: 2 R^{-1}(Ccons P + Dcons Sxu' - Syx) = 0    (44)

∂M/∂D: 2 R^{-1}(Ccons Sxu + Dcons Suu - Syu) - Λ F' = 0    (45)

∂M/∂Λ: Dcons F - G = 0    (46)

From (44) and (45) we get the constrained estimators for C and D

Ccons = (Syx - Dcons Sxu') P^{-1}    (47)

Dcons = (Syu - Ccons Sxu + (1/2) Rcons Λ F') Suu^{-1}

Using the expressions (40) and (41) for the unconstrained estimators we get the constrained D matrix

Dcons = D + (1/2) Rcons Λ F' (Suu - Sxu' P^{-1} Sxu)^{-1}

Substituting this back into (46) and solving for Λ gives:

Dcons = D - (DF - G)(F'(Suu - Sxu' P^{-1} Sxu)^{-1} F)^{-1} F'(Suu - Sxu' P^{-1} Sxu)^{-1}

Setting the partial derivative with respect to R equal to zero gives

NT Rcons = Syy - Syx Ccons' - Ccons Syx' + Ccons P Ccons' - Syu Dcons' - Dcons Syu' + Ccons Sxu Dcons' + Dcons Sxu' Ccons' + Dcons Suu Dcons'    (48)
172
leads to
1
(Syu + Ccons Sxu + Dcons Suu )Dcons
NT
1
1
= R+
Rcons F Dcons
(49)
NT
2
1
Rcons = R +
5.5.
Vec Formulation
The vec operator vectorizes a matrix by stacking its columns. That is, for a 2x2 matrix M,

M = | m11  m12 |
    | m21  m22 |,   vec(M) = (m11, m21, m12, m22)'
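Numerically, this column-stacking convention corresponds to column-major ("Fortran") flattening; a minimal sketch (the helper name `vec` is just illustrative):

```python
import numpy as np

def vec(M):
    """Stack the columns of M into a single vector (column-major order)."""
    return M.reshape(-1, order="F")

M = np.array([[1, 2],
              [3, 4]])
vec(M)   # array([1, 3, 2, 4])
```

Note that the default NumPy flatten is row-major, so `order="F"` is essential for the vec convention used here.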
The Kronecker product of two matrices plays an important role when using the
vec operator. There are important relationships that will be used in the development of the constrained minimization problem in vec formulation.
Definition: The Kronecker product of two matrices A and B, where A is m x n and B is p x q, is defined as

A ⊗ B = | A11 B  A12 B  ...  A1n B |
        | A21 B  A22 B  ...  A2n B |
        | ...    ...    ...  ...   |
        | Am1 B  Am2 B  ...  Amn B |

which is an mp x nq matrix.
vec(ABC) = (C' ⊗ A) vec(B)    (51)

(A ⊗ B)' = A' ⊗ B'    (52)

(A ⊗ B)(C ⊗ D) = AC ⊗ BD    (53)

d(x'Ax)/dx = x'(A + A')    (54)
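These objects are easy to experiment with numerically. The sketch below forms a Kronecker product with `np.kron` and checks the standard identity vec(ABC) = (C' ⊗ A) vec(B) on random matrices; all shapes are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)

def vec(M):
    # Column-stacking vec operator (column-major flatten).
    return M.reshape(-1, order="F")

A = rng.normal(size=(2, 3))
B = rng.normal(size=(3, 4))
Cm = rng.normal(size=(4, 2))

K = np.kron(A, B)            # an (2*3) x (3*4) block matrix [a_ij * B]
assert K.shape == (6, 12)

# Check vec(ABC) = (C' kron A) vec(B) on this random triple.
lhs = vec(A @ B @ Cm)
rhs = np.kron(Cm.T, A) @ vec(B)
np.allclose(lhs, rhs)        # True
```

This identity is precisely what turns matrix equations in D into linear equations in vec(D) below.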
To show the application of the vec operator in the constraint setting, let us look at the following example.

EXAMPLE:
Consider a 2x2 matrix D and suppose we want to constrain it to be diagonal. Select the matrices F and G to be

D = | d11  d12 |     F = | 0 1 0 0 |     G = | 0 |
    | d21  d22 |,        | 0 0 1 0 |,        | 0 |

Then, applying the constraint F vec(D) = G we get that the elements d21 and d12 are zero, and the matrix D becomes:

D = | d11   0  |
    |  0   d22 |
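Constraint matrices of this form can be generated mechanically for any set of entries to be zeroed. The helper below (its name and interface are illustrative, not from the text) reproduces the F and G of the example.

```python
import numpy as np

def vec(M):
    # Column-stacking vec operator (column-major flatten).
    return M.reshape(-1, order="F")

def zero_constraints(n, entries):
    """Build F, G such that F vec(D) = G forces D[i, j] = 0 for each
    (i, j) in `entries`, using the column-major vec ordering."""
    F = np.zeros((len(entries), n * n))
    for r, (i, j) in enumerate(entries):
        F[r, j * n + i] = 1.0          # position of d_ij inside vec(D)
    G = np.zeros(len(entries))
    return F, G

# Constrain a 2x2 D to be diagonal: zero out d21 and d12.
F, G = zero_constraints(2, [(1, 0), (0, 1)])
D = np.array([[5.0, 7.0],
              [8.0, 9.0]])
F @ vec(D)    # picks out d21 = 8 and d12 = 7
```

Applying F to vec(D) simply reads off the entries being constrained, so F vec(D) = G = 0 states exactly that those entries vanish.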
In general, for any n x n matrix D we can find matrices F and G and solve the
constrained minimization problem using vec formulation as follows:
Constrained Minimization Problem 2
Minimize

-2L(C, D, R) = NT log|R| + tr{R^{-1}(Syy - Syx C' - C Syx' + C P C' - Syu D' - D Syu' + C Sxu D' + D Sxu' C' + D Suu D')}

subject to F vec(D) = G. Introducing a vector λ of Lagrange multipliers, define

M = NT log|R| + tr{R^{-1}(Syy - Syx C' - C Syx' + C P C' - Syu D' - D Syu' + C Sxu D' + D Sxu' C' + D Suu D')} - λ'(F vec(D) - G)    (55)

Setting the partial derivatives of M equal to zero gives

∂M/∂vec(D): -2 vec(Rcons^{-1} Syu) + 2 vec(Rcons^{-1} Ccons Sxu) + 2 vec(Rcons^{-1} Dcons Suu) - F'λ = 0    (56)

∂M/∂λ: F vec(Dcons) - G = 0    (57)

∂M/∂R: NT Rcons = Syy - Syx Ccons' - Ccons Syx' + Ccons P Ccons' - Syu Dcons' - Dcons Syu' + Ccons Sxu Dcons' + Dcons Sxu' Ccons' + Dcons Suu Dcons'    (58)

Using the identity vec(Rcons^{-1} Ccons Sxu) = (Sxu' ⊗ Rcons^{-1}) vec(Ccons), together with

vec(Ccons) = vec(Syx P^{-1}) - (P^{-1} Sxu ⊗ I) vec(Dcons)    (59)
we have that

vec(Dcons) = vec(D) + (1/2)((Suu - Sxu' P^{-1} Sxu)^{-1} ⊗ Rcons) F'λ    (60)

We still need to work out the value of λ. Hence, substituting (60) into the constraint (57) and solving for λ gives:

λ = 2 [F V F']^{-1} (G - F vec(D))    (61)

where

V = (Suu - Sxu' P^{-1} Sxu)^{-1} ⊗ Rcons

Finally, from (58) we obtain the expression for Rcons implicitly, in the form Rcons = R + f(Rcons), for which we will need to iterate, reshaping vec(Dcons) into the matrix Dcons at each iteration:

Rcons = R + (1/(2NT)) Rcons mat(F'λ) Dcons'    (62)

where mat(F'λ) denotes the vector F'λ reshaped into a matrix with the dimensions of D. Substituting (61) back into (60), the quantities iterated over are

vec(Dcons) = vec(D) - V F' [F V F']^{-1} (F vec(D) - G)    (63)

vec(Ccons) = vec(Syx P^{-1}) - (P^{-1} Sxu ⊗ I) vec(Dcons)    (64)
The fixed-point iteration runs over r = 0, 1, 2, ... until convergence of Rc:

1. Given Rc(r), compute vec(Dc(Rc(r))) from (63) and reshape it into the matrix Dc(Rc(r)).
2. Compute vec(Cc(Rc(r))) from (64) and reshape it into Cc(Rc(r)).
3. Update Rc(r + 1) = (1/NT)[Syy - Cc(Rc(r)) Syx' - Dc(Rc(r)) Syu'] + f(Rc(r)), with f the correction term in (62). Hence, at convergence,
Rc = Rc(r + 1), Cc = Cc(Rc(r + 1)), and Dc = Dc(Rc(r + 1)).
4. Then Dc and Cc are the matrices that go back to the E-step to be used (along with the other parameters) to find an updated and more accurate estimate of x_t and P_t.
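An implicit equation of the form Rcons = R + f(Rcons) can be solved by exactly this kind of fixed-point iteration. The sketch below substitutes a toy contraction f for the actual correction term (an illustrative assumption); the iteration logic and stopping rule are the point.

```python
import numpy as np

def solve_implicit(R, f, tol=1e-10, max_iter=200):
    """Fixed-point iteration for an implicit matrix equation X = R + f(X),
    as needed for R_cons = R + f(R_cons). Assumes f is a contraction."""
    X = R.copy()                      # start from the unconstrained estimate
    for r in range(max_iter):
        X_new = R + f(X)
        if np.linalg.norm(X_new - X) < tol:
            return X_new, r + 1       # converged solution, iteration count
        X = X_new
    raise RuntimeError("fixed-point iteration did not converge")

# Toy contraction standing in for the correction term in (62).
R = np.eye(2)
f = lambda X: 0.1 * X @ X
X, iters = solve_implicit(R, f)
```

When f is a contraction near the solution, convergence is geometric, so only a handful of iterations are needed per M-step.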
6.
Conclusions
work models are based on conditional probability and expectation, a fact that made possible, in some instances, the inference of directed causal networks. Incompleteness of the data is usually overcome by the use of the Expectation-Maximization (EM) algorithm and other machine learning techniques. EM is also used to establish confidence bounds, either on its own or supplemented by bootstrapping and cross-validation, and optimization is based on maximum likelihood estimates by means of objective-function-oriented constrained minimization.
Each and every one of the implementations mentioned here possesses its particular set of achievements and shortcomings. The current state of the art points to a combination of methods, either as a means of assessment or in the form of hybrid methods, as the best option to tackle these incredibly complex, yet highly interesting and important problems. It is our hope that the ideas considered here will stimulate further development in the area of information-theoretical / machine-learning-based computational biology.
References
[1] Albert, R. and Barabási, A.-L., Statistical mechanics of complex networks, Reviews of Modern Physics 74, 47, (2002)
[2] Albert, R., Scale-free networks in cell biology, Journal of Cell Science
118, 4947-4957 (2005)
[3] Andrecut, M., Kauffman S.A., Mean-field model of genetic regulatory networks, New Journal of Physics 8, 148, (2006)
[4] Andrecut, M., Kauffman, S.A., A simple method for reverse engineering causal networks, Journal of Physics A Mathematical and General 39,
L647-L655, (2006)
[5] Aoki, M., State Space Modeling of Time Series, Springer-Verlag, (1987)
[6] Bansal, M., Belcastro, V., Ambesi-Impiombato, A., di Bernardo, D.; How
to infer gene networks from expression profiles, Molecular Systems Biology 3:78 (2007)
[7] Brockwell, P.J., Davis, R.A., Introduction to Time Series and Forecasting,
Springer-Verlag second edition, New York, (2002)
[8] Butte, A.J., Kohane, I.S.,Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements, Pacific
Symposium on Biocomputing 418-429, (2000)
[9] Cercignani, C., Illner R., Pulvirenti, M. ; The Mathematical Theory of
Dilute Gases, Applied Mathematical Sciences 106, Springer-Verlag (1994)
[10] Chaitankar, V., Ghosh, P., Perkins, E.J., Gong, P., Deng, Y., Zhang, C.,
A novel gene network inference algorithm using predictive minimum description length approach, BMC Systems Biology 4(Suppl 1):S7, (2010)
[11] Chung, F. R. K., Spectral Graph Theory, Amer. Math. Soc., Providence,
R.I., (1997)
[12] Cover T. M., Thomas J.A., Elements of Information Theory, New York:
John Wiley & Sons; (1991)
[13] de Jong, H., Modelling and simulation of genetic regulatory systems: a
literature review, J. Comp. Biol., 9, 1, 67-103 (2002)
[14] Deng, M.H. et al., Integrated probabilistic model for functional prediction of proteins, J. Comput. Biol. 11, 463-476, (2004)
[15] Deng, M.H. et al., Prediction of protein function using protein-protein interaction data, In The First IEEE Computer Society Bioinformatics Conference, CSB2002, pp. 117-126, (2002)
[16] Dijkstra, E., A note on two problems in connection with graphs, Numerische Math 1, 269-271, (1959)
[17] Ding, C., Peng, H., Minimum redundancy feature selection from microarray gene expression data, Journal of Bioinformatics and Computational
Biology 3, 2, 185-205, (2005)
[18] Dong, J. and Horvath, S., Understanding network concepts in modules,
BMC Systems Biology, 1, 24, (2007). See especially table 1.
[19] Dougherty,J., Tabus, I., Astola, J., Inference of Gene Regulatory Networks
Based on a Universal Minimum Description Length, EURASIP Journal on
Bioinformatics and Systems Biology Volume 2008, Article ID 482090, 11
pages, doi:10.1155/2008/482090, (2008)
[20] Emmert-Streib, F., Dehmer, M., Information processing in the transcriptional regulatory network of yeast: Functional robustness, BMC Systems
Biology, 3, 35, (2009) doi:10.1186/1752-0509-3-35
[21] Faith, J., Hayete, B., Thaden, J., Mogno, I., Wierzbowski, J., Cottarel, G., Kasif, S., Collins, J., Gardner, T., Large scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles, PLoS Biology 5, e8, (2007)
[22] Fleuret, F., Fast binary feature selection with conditional mutual information, Journal of Machine Learning Research 5, 1531-1555, (2004)
[23] Friedman, N., Linial, M., Nachman, I., Pe'er, D., Using Bayesian networks to analyze expression data, J. Comput. Biol. 7, 601, (2000)
[24] Fruchterman, T. M. J., Reingold, E. M., Graph Drawing by Force-Directed
Placement. Software: Practice and Experience, 21(11), (1991)
[25] He, F., Balling, R., Zeng, A-P; Reverse engineering and verification of
gene networks: Principles, assumptions, and limitations of present methods and future perspectives, Journal of Biotechnology 144, 3, 190-203,
(2009)
[26] Hernandez-Lemus, E., Velazquez-Fernandez, D., Estrada-Gil, J.K., Silva-Zolezzi, I., Herrera-Hernandez, M.F., Jimenez-Sanchez, G., Information Theoretical Methods to Deconvolute Genetic Regulatory Networks applied to Thyroid Neoplasms, Physica A 388, 5057-5069, (2009)
[27] INFOTHEO http://cran.r-project.org/web/packages/infotheo/index.html
[28] Jaynes, E.T., Information Theory and Statistical Mechanics, Phys. Rev.,
106, 4, 620-639, (1957)
[29] Jurman, G., Visintainer, R., Furlanello, C., An introduction to spectral
distances in networks (extended version), paper presented at the Workshop on Networks Across Disciplines: Theory and Applications within
the 24th Annual Conference on Neural Information Processing Systems
(NIPS 2010), manuscript at http://arxiv.org/abs/1005.0103
[30] Kamada, T., Kawai, S., An Algorithm for Drawing General Undirected
Graphs, Information Processing Letters, 31:7-15, (1988)
[31] Kasza, J., Solomon, P., Kullback Leibler Divergence for Bayesian Networks with Complex Mean Structure, http://arxiv.org/abs/1009.1463
[32] Kindermann, Ross; Snell, J. Laurie, Markov Random Fields and Their
Applications. American Mathematical Society. ISBN 0-8218-5001-6.
MR0620955, (1980)
[33] Kullback, S. Leibler, R. A. On information and sufficiency, The Annals of
Mathematical Statistics, 22, 79-86, (1951)
[34] Lai, D., Lu, H., Lauria, M., di Bernardo, D., Nardini, C., MANIA: A Gene Network Reverse Algorithm for Compounds Mode-Of-Action and Genes Interactions Inference, Advances in Complex Systems 13, 1, 83-94, (2010)
[35] Letovsky, S. and Kasif, S., Predicting protein function from protein/protein interaction data: a probabilistic approach, Bioinformatics 19 (Suppl. 1), i197-i204, (2003)
[36] Li, H. and Zhan, M., Analysis of Gene Coexpression by B-spline Based
CoD Estimation, EURASIP Journal on Bioinformatics and Systems Biology, doi:10.1155/2007/49478, (2007)
[37] Madni, A.M., Andrecut, M., Design And Implementation Of A Gene Network Reverse Engineering Method Based On Mutual Information, Journal
of Integrated Design & Process Science 11, 3, 55-68, (2007)
[38] Margolin, A.A., Nemenman, I., Basso, K., Wiggins, C., Stolovitzky, G., Dalla Favera, R., Califano, A., ARACNe: An Algorithm for the Reconstruction of Gene Regulatory Networks in a Mammalian Cellular Context, BMC Bioinformatics, 7 (Suppl 1):S7, (2006) doi:10.1186/1471-2105-7-S1-S7
[39] Margolin, A.A., Wang, K., Lim, W.K., Kustagi, M., Nemenman, I., Califano, A., Reverse engineering cellular networks, Nat Protoc., 1, 2, 662-71,
(2006)
[40] Margolin, A.A., Wang, K., Califano, A., Nemenman, I., Multivariate dependence and genetic networks inference, IET Syst. Biol. 4, 6, 428-440, (2010)
[41] Meyer, P. E., Kontos, K., Lafitte, F., Bontempi, G., Information-theoretic
inference of large transcriptional regulatory networks, EURASIP Journal on Bioinformatics and Systems Biology, Article ID 79879, (2007),
doi:10.1155/2007/79879 2007
[42] Meyer, P. E., Lafitte, F., Bontempi, G., minet: A R/Bioconductor Package
for Inferring Large Transcriptional Networks Using Mutual Information,
BMC Bioinformatics, 9, 461, (2008)
[43] MINET http://bioconductor.org/packages/2.6/bioc/vignettes/minet/inst/
doc/minet.pdf
[44] Mohapatra, A., Mishra, P.M., Padhy, S., Modeling Biological Signals using Information-Entropy with Kullback-Leibler-Divergence, IJCSNS International Journal of Computer Science and Network Security 9, 1, 147154, (2009)
[45] Morganella, S., Zoppoli, P., Ceccarelli, M., IRIS: a method for reverse
engineering of regulatory relations in gene networks, BMC Bioinformatics
10, 444, (2009); doi:10.1186/1471-2105-10-444
[46] Morrissey, E.R., Juarez, M.A., Denby, K.J., Burroughs, N.J., On reverse
engineering of gene interaction networks using time course data with repeated measurements, Bioinformatics 26, 18, 2305-12, (2010)
[47] Murphy, K., Mian, S. Modelling Gene Expression Data using Dynamic
Bayesian Networks. Technical Report, University of California: Berkeley,
(1999)
[48] Nemenman, I., Information theory, multivariate dependence, and genetic
network inference, eprint arXiv:q-bio/0406015
[49] Newman, M.E.J., A measure of betweenness centrality based on random
walks. arXiv cond-mat/0309045, (2003)
[50] Palacios, R., Goni, J., Martinez-Forero, I., Iranzo, J., Sepulcre, J.,
Melero, I., Villoslada, P., A Network Analysis of the Human TCell Activation Gene Network Identifies Jagged1 as a Therapeutic
Target for Autoimmune Diseases, PLoS ONE 2, 11, e1222, (2007)
doi:10.1371/journal.pone.0001222
[51] Pe'er, D., Bayesian network analysis of signaling networks: a primer, Science STKE 281: pl4, (2005)
[52] Peng, H., Long F., Ding, C., Feature selection based on mutual information: criteria for max-dependency, max-relevance and min-redundancy,
IEEE Trans. Pattern Analysis and Machine Intelligence 27, 8, 1226-1238,
(2005)
[53] Qiu, P., Gentles, A.J., Plevritis, S.K., Reducing the Computational Complexity of Information Theoretic Approaches for Reconstructing Gene
Regulatory Networks, Journal of Computational Biology 17, 2, 1-8 (2010)
[54] [R] http://cran.r-project.org/
[55] Ravasz, E., Somera, A.L., Mongru, D.A., Oltvai, Z.N., Barabási, A.-L., Hierarchical Organization of Modularity in Metabolic Networks, Science, 297, 1551-1555, (2002)
[56] Segal, E. et al., Discovering Molecular Pathways from Protein Interaction and Gene Expression Data, Bioinformatics 19 (Suppl. 1), 264-272, (2003)
[57] Sehgal, M.S.B., Gondal, I., Dooley, L., Coppel, R., Mok, G.K., Transcriptional Gene Regulatory Network Reconstruction Through Cross Platform Gene Network Fusion, in Pattern Recognition in Bioinformatics, Lecture Notes in Computer Science, 4774/2007, 274-285, (2007) doi:10.1007/978-3-540-75286-8_27
[58] Shannon, C.E., Weaver, W., The Mathematical Theory of Communication,
The University of Illinois Press, Urbana, Illinois, (1949)
[59] Shumway, R.H., Stoffer, D.S., Time Series Analysis and Its Applications:
With R Examples, Springer Texts in Statistics, Third Edition, (2010)
[60] Steuer, R., Kurths, J., Daub, C.O., Weise, J., Selbig, J., The mutual information: detecting and evaluating dependencies between variables, Bioinformatics 18 (Suppl 2), 231-240, (2002)
[61] van Kampen, N., Stochastic Processes in Physics and Chemistry, North
Holland, Elsevier, The Netherlands,(1997)
[62] van Someren, E.P., Wessels, L.F.A., Backer, E., Reinders, M.T.J., Genetic
Network Modelling, Pharmacogenomics, 3, 4, 507-525, (2002)
[63] Wei, Z. and Li, H., A Markov random field model for network-based analysis of genomic data, Bioinformatics 23, 12, 1537-1544, (2007)
[64] Wei, Z. and Li, H., A hidden spatial-temporal Markov random field model
for network-based analysis of time course gene expression data, Ann. Appl.
Stat. 2, 408-429, (2008)
[65] Yang AC, Hseu SS, Yien HW, Goldberger AL, Peng CK, Linguistic analysis of the human heartbeat using frequency and rank order statistics, Phys
Rev Lett 90: 108103, (2003)
[66] Zhao, W., Serpedin, E., Dougherty, E.R., Inferring gene regulatory networks from time series data using the minimum description length principle, Bioinformatics 22, 17, 2129-35, (2006)
[67] Zola, J.; Aluru, M.; Sarje, A.; Aluru, S., Parallel Information-TheoryBased Construction of Genome-Wide Gene Regulatory Networks, IEEE
Transactions on Parallel and Distributed Systems 21, 12, 1721-1733,
(2010)