Professional Documents
Culture Documents
SunYong Kim (1) , Seiya Imoto (1) , and Satoru Miyano (1)
(1)
Human Genome Center, Institute of Medical Science, The University of Tokyo,
4-6-1 Shirokane-dai, Minato-ku, Tokyo 108-8639 - Japan
{sunk, imoto, miyano}@ims.u-tokyo.ac.jp
Introduction
A Bayesian network is a promising method for modeling relations among a large number of random variables and has
received considerable attention from the studies of gene network estimation using microarray gene expression data. Imoto et
al. [2, 3] proposed a Bayesian network and nonparametric regression model for capturing nonlinear relations between genes
without discretizing microarray data. However, a problem of Bayesian networks is that it cannot construct cyclic regulations,
while real gene networks may have many feedback loops. For a solution to this problem, we propose a dynamic Bayesian
network and nonparametric regression model [4] for estimating a gene network with cyclic regulations from time series
microarray data. We also derive a criterion for selecting a network from a Bayesian statistical viewpoint. The effectiveness
of our method is displayed though a real data application of Saccharomyces cerevisiae gene expression data.
Let X be an n×p time series microarray data matrix, where n and p are the number of microarrays and genes, respectively.
Assuming the first order Markov relation between time points, the joint density can be decomposed as f (x11 , · · · , xnp ) =
f1 (x1 )f2 (x2 |x1 ) × · · · × fn (xn |xn−1 ), where xi = (xi1 , · · · , xip )T is an observation vector at time i. The conditional
density fi (xi |xi−1 ) can be decomposed as fi (xi |xi−1 ) = g1 (xi1 |pi−1,1 ) × · · · × gp (xip |pi−1,p ), where pi−1,j denotes the
expression values of the parents of jth gene at time i − 1. For modeling each gj (·), we assume a nonparametric additive
regression model based on B-splines, and f (X) can then be formulated as
( )
Yp n
Y 1 (xij − µ(pi−1,j ))2
f (x11 , · · · , xnp ; θ G ) = f1 (x1 ) q exp − ,
2σ 2
2
j=1 i=2 2πσj j
PMj1 (j) (j) (j) PMjqj (j) (j) (j) (j) (j)
where µ(pi−1,j ) = m=1 γm1 bm1 (pi−1,1 ) + ... + m=1 γmqj bmqj (pi−1,qj ), γ1k , ..., γMjk k are coefficient parameters and
(j) (j)
{b1k (·), ..., bMjk k (·)} is a prescribed set of B-splines.
When the network structure is given, we can construct a gene network by using the proposed model. However, the true
gene network is usually unknown, and we need to infer the optimal network structure from data. We derive a criterion for
evaluating a network structure from a Bayesian statistical viewpoint. By using the Laplace approximation for integrals, the
criterion, named BNRCdynamic can be expressed as
½ Z ¾
BNRCdynamic (G) = −2 log πprior (G) f (x11 , · · · , xnp ; θG )π(θ G |λ)dθ G
≈ −2 log πprior (G) − r log(2π/n) + log |Jλ (θ̂ G )| − 2nlλ (θ̂ G |X),
FUS3 FAR1 FUS3 FAR1 FUS3 FAR1
(a) Target pathway (b) Result of the BN model (c) Result of the DBN model
where π(θ G |λ) and πprior (G) are the prior distribution of the parameter θ G and the prior probability of the network
G, respectively, λ is the hyper parameter vector, r is the dimension of θ G , lλ (θ G |X) = log f (x11 , · · · , xnp ; θ G )/n +
log π(θ G |λ)/n, Jλ (θ G ) = −∂ 2 {lλ (θ G |X)}/∂θ G ∂θ TG and θ̂ G is the mode of lλ (θ G |X). We select a network which
minimizes the BNRCdynamic score.
We apply the proposed method to the Saccharomyces cerevisiae cell cycle data collected by Spellman et al. [5]. The target
network is a part of cell cycle pathway compiled in the KEGG database [1], around CDC28 (YBR160w; cyclin-dependent
protein kinase). This network contains 45 genes and the partial pathway registered in KEGG is shown in Figure 1 (a). Figure
1 (b) and (c) are the resulting networks of the Bayesian network (BN) model [2, 3] and the dynamic Bayesian network
(DBN) model [4] respectively. A shaded region represents the genes which compose a complex. We consider the edges
inside these regions as correct edges since the genes inside the same region will co-express with some delay. We indicate a
correct estimation by an edge attached with a circle. A triangle represents either a misdirected edge or an edge skipping at
most one gene. A cross represents a wrong estimation. We observed that our dynamic Bayesian network model succeeded in
decreasing the number of false positives compared with the Bayesian network model.
References
[1] http://www.genome.ad.jp/kegg/.
[2] S. Imoto, T. Goto, and S. Miyano. Estimation of genetic networks and functional structures between genes by using bayesian network
and nonparametric regression. Pacific Symposium on Biocomputing, 7:175–186, 2002.
[3] S. Imoto, S. Kim, T. Goto, S. Aburatani, K. Tashiro, S. Kuhara, and S. Miyano. Bayesian network and nonparametric heteroscedastic
regression for nonlinear modeling of genetic network. Proc. 1st IEEE Computer Society Bioinformatics Conference, pages 219–227,
2002.
[4] S. Kim, S. Imoto, and S. Miyano. Dynamic bayesian network and nonparametric regression for nonlinear modeling of gene networks
from time series gene expression data. Proc. Computational Methods in Systems Biology, Lecture Note in Computer Science, Springer-
Verlag, 2602:104–113, 2003.
[5] P. Spellman, G. Sherlock, M. Zhang, V. Iyer, K. Anders, M. Eisen, P. Brown, D. Botstein, and B. Futcher. Comprehensive identification
of cell cycle-regulated genes of the yeast saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell,
9:3273–3297, 1998.