
Computational Biology and Chemistry 34 (2010) 328–333


Brief communication

Computation of mutual information from Hidden Markov Models


Daniel Reker a,b, Stefan Katzenbeisser b, Kay Hamacher a,*

a Theoretical Biology and Bioinformatics, Institute of Microbiology and Genetics, Department of Biology, TU Darmstadt, Schnittspahnstr. 10, 64287 Darmstadt, Germany
b Security Engineering Group, Department of Computer Science, Technische Universität Darmstadt, 64287 Darmstadt, Germany

Abstract

Understanding evolution at the sequence level is one of the major research visions of bioinformatics. To this end, several abstract models such as Hidden Markov Models and several quantitative measures such as the mutual information have been introduced, thoroughly investigated, and applied to several concrete studies in molecular biology. With this contribution we want to undertake a first step to merge these approaches (models and measures) for easy and immediate computation, e.g. for a database of a large number of externally fitted models (such as PFAM). Being able to compute such measures is of paramount importance in data mining, model development, and model comparison. Here we describe how one can efficiently compute the mutual information of a homogeneous Hidden Markov Model orders of magnitude faster than with a naive, straightforward approach. In addition, our algorithm avoids sampling issues of real-world sequences, thus allowing for a direct comparison of various models. We applied the method to genomic sequences and discuss its properties as well as convergence issues. © 2010 Elsevier Ltd. All rights reserved.

Article history: Received 29 June 2010; received in revised form 30 August 2010; accepted 30 August 2010.

Keywords: Hidden Markov Model; Mutual information; Dynamic programming; Co-evolutionary signals

1. Introduction

Evolutionary pressure enforces correlations in biological sequences. A particularly promising method to reveal the presence of such (co-)evolutionary signals, and to investigate the general information contained within biological data sets, is the computation of the mutual information (MI) between different positions, e.g. in a set of strings of biological codes. The co-evolution of amino acids in a protein, for example, reveals itself by high MI content in a set of homologous sequences from various taxa. MI-based studies have become an important tool to understand evolutionary processes in such gene products (Boba et al., 2010; Hamacher, 2008, 2010; Weil et al., 2009).

At the same time, biological sequences are routinely modeled in bioinformatics by Hidden Markov Models (HMMs) (Durbin et al., 1998), and large databases of (manually) curated models exist (Finn et al., 2006). Besides these applications in evolutionary and computational biology, signal-creating processes in neurobiology, speech synthesis (Dines and Sridharan, 2001; Zen et al., 2007), or biochemistry (Grundy et al., 1997) are frequently modeled by HMMs, too. Such HMMs capture the essentials of the consensus sequence as well as additional fluctuations in the individual sequences. Due to their widespread usage and the importance of evolutionary signals, understanding the ability of HMMs to model the underlying correlations in sequences is of great importance. One has

* Corresponding author. Tel.: +49 6151 16 5318; fax: +49 6151 16 2956. E-mail address: kay.hamacher@gmail.com (K. Hamacher).
1476-9271/$ - see front matter © 2010 Elsevier Ltd. All rights reserved. doi:10.1016/j.compbiolchem.2010.08.005

also to concede that HMMs themselves (in particular as a plain collection of probability values) are not instructive at all: they do not provide immediate insight into the non-local effects in the sequences under investigation. The MI, on the other hand, offers a direct, intuitive, and transmissible interpretation; in particular, one can easily visualize it (Bremm et al., 2010).

A generic framework based on an analytical approach to compute the MI from HMMs directly is therefore desirable. Such a framework avoids the problems of empirical data sets of finite size: the naive approach of computing the MI from sequences emitted by an HMM would typically be subject to statistical fluctuations (Weil et al., 2009). In particular, such a framework offers the possibility to use existing biological knowledge and machine-readable information in the form of HMMs in an automated fashion (Stiller and Radons, 1999). From the algorithmic point of view, the MI computation from HMMs poses an interesting problem, as one needs to find an algorithm to efficiently account for the combinatorially many paths through the state space of an HMM.

For all of the above-mentioned practical and theoretical arguments, we want to show in this paper how to compute the MI for (homogeneous) HMMs efficiently. The investigation of homogeneous HMMs is a first step towards our final goal of mining general HMM databases for co-evolutionary signals. PFAM models (Finn et al., 2006), for example, model C- and N-termini of proteins by homogeneous sites in the respective HMM. Therefore we present here a first step towards eventually computing the MI of HMMER models. In particular, the finding of functional motifs by HMMs can be facilitated by filtering for information-rich regions (Horan et al., 2010). Our results provide for alternative insight into the


information contained within HMMs and might therefore open an alternative route to such guided protocols.

Fig. 1. The structure of a general Hidden Markov Model (HMM) with the emission probabilities P(σ_τ|x_τ) and the transition probabilities P(x_{τ+1}|x_τ).

2. Approach

2.1. Mutual information in sequences

Let Σ := {σ_1, ..., σ_m} be the set of the m observed symbols in a biological sequence data set. We consider our data to be strings of these symbols with length ℓ_end ∈ N. A position in the string can therefore be referenced by any number τ ∈ {1, 2, ..., ℓ_end} =: T ⊂ N. The MI between two positions τ, t ∈ T in symbol sequences can be calculated by

    MI_{τ,t} = ∑_{σ,σ′ ∈ Σ} P(σ_τ, σ′_t) log₂ [ P(σ_τ, σ′_t) / ( P(σ_τ) P(σ′_t) ) ]    (1)

where P(σ_τ) is the probability to observe the symbol σ at position τ ∈ T and P(σ_τ, σ′_t) is the joint probability to observe symbol σ at position τ ∈ T and symbol σ′ at position t ∈ T. Typically, these probabilities are estimated using frequencies of symbol occurrence in an empirical data set. We will show how to compute these probabilities directly from a given HMM.

2.2. Definition of Hidden Markov Models

We consider a time-discrete, time-homogeneous HMM as shown in Fig. 1, up to a string length of ℓ_end. For such HMMs we define Ω := {x_1, ..., x_n} to be the set of the n hidden states of our HMM. It is described by three probability functions (Eddy, 1995):

- P(σ_τ|x_τ): Σ × Ω → R, the emission probability for a certain symbol σ conditioned on an internal state x at position τ. As we use homogeneous HMMs, the emission probability for any hidden state does not depend on its position, rendering P(σ_τ|x_τ) τ-independent and a universal table.
- P(x_{τ+1}|x_τ): Ω × Ω → R, which reflects the transition probability into a hidden state x_{τ+1} from a state x_τ at position τ. Again, because of the time homogeneity of our model, these probabilities do not change with τ and need only be defined for t = τ + 1, due to the Markov property employed in HMMs.
- π(x): Ω → R, which represents the starting probabilities at the (virtual) starting position τ = 0.

To calculate the MI from the HMM we need to find a way to calculate the values of P(σ_τ, σ′_t) and P(σ_τ) for any σ, σ′ ∈ Σ and for any τ, t ∈ T with the help of the given probability functions.

2.3. An analytic solution

In the following we will only consider the case τ < t. The case τ > t is symmetric to τ < t in Eq. (1) due to the symmetry of joint probabilities. The case τ = t leads to the entropy of position τ and is not relevant for our investigation.

To calculate P(σ_τ, σ′_t) and P(σ_τ), we use the probability P(x_τ) to be in a certain hidden state x_τ for a given τ ∈ T and the joint probability P(x_τ, x_t) to be in hidden state x_τ for a given τ ∈ T and in hidden state x_t for a given t ∈ T:

    P(σ_τ) = ∑_{x_τ} P(σ_τ|x_τ) P(x_τ)    (2)

    P(σ_τ, σ′_t) = ∑_{x_τ, x_t} P(σ_τ|x_τ) P(σ′_t|x_t) P(x_τ, x_t).    (3)

Therefore the computation of the MI reduces to the determination of P(x_τ, x_t) for every x_τ, x_t ∈ Ω and τ, t ∈ T with τ < t, i.e. the probabilities in Eqs. (2) and (3). The joint probability can be obtained using

    P(x_τ, x_t) = P(x_t|x_τ) P(x_τ).    (4)

We use the Markov property and the law of alternatives to evaluate the conditional probability P(x_t|x_τ) as

    P(x_t|x_τ) = ∑_{x_{t-1}} P(x_t|x_{t-1}) P(x_{t-1}|x_τ).    (5)

This equation can be used to build a recursive formula for the joint probability (4) for τ < t:

    P(x_τ, x_t) = ∑_{x_{t-1}} P(x_t|x_{t-1}) P(x_τ, x_{t-1}).    (6)
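To make Eqs. (2)-(6) concrete, the following sketch evaluates them directly with NumPy for a minimal two-state, two-symbol HMM. All parameter values below are made-up assumptions for illustration only, not one of the models fitted in this study:

```python
import numpy as np

# Toy homogeneous HMM (all numbers are illustrative assumptions).
# A[i, j] = P(x_{tau+1} = i | x_tau = j)   (column-stochastic transitions)
# E[s, i] = P(sigma = s | x = i)           (emission probabilities)
# pi[i]   = starting probability at the virtual position tau = 0
A = np.array([[0.9, 0.2],
              [0.1, 0.8]])
E = np.array([[0.7, 0.1],
              [0.3, 0.9]])
pi = np.array([0.5, 0.5])

def hidden_marginal(tau):
    """P(x_tau): propagate the start distribution tau steps."""
    return np.linalg.matrix_power(A, tau) @ pi

def hidden_joint(tau, t):
    """P(x_tau, x_t) for tau < t via the recursion of Eqs. (5) and (6):
    P(x_t | x_tau) = sum_{x_{t-1}} P(x_t | x_{t-1}) P(x_{t-1} | x_tau)."""
    cond = np.linalg.matrix_power(A, t - tau)  # cond[i, j] = P(x_t=i | x_tau=j)
    return cond * hidden_marginal(tau)         # joint[i, j] = P(x_t=i, x_tau=j)

def symbol_marginal(tau):
    """Eq. (2): P(sigma_tau) = sum_x P(sigma | x) P(x_tau)."""
    return E @ hidden_marginal(tau)

def symbol_joint(tau, t):
    """Eq. (3): sum over both hidden states; result[s, s'] = P(sigma_tau=s, sigma'_t=s')."""
    joint_x = hidden_joint(tau, t)             # indexed [x_t, x_tau]
    return E @ joint_x.T @ E.T

# Sanity checks: distributions sum to one and marginals are consistent.
P2 = symbol_joint(2, 5)
assert np.isclose(P2.sum(), 1.0)
assert np.allclose(P2.sum(axis=1), symbol_marginal(2))
```

Note that this naive version recomputes matrix powers for every pair of positions; the dynamic program of Section 2.4 reuses them instead.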

Implementing this formula, we can calculate the required values of P(x_τ, x_t) iteratively: we use the values of P(x_τ, x_t) for every x_t to calculate P(x_τ, x_{t+1}) for any x_{t+1}.

2.4. Dynamic programming

The computation of P(x_τ, x_t) can be performed efficiently following a dynamic programming approach (Bellman, 1952). To this end we express the sum in (6) as a matrix multiplication. Let A_τ := (P(x_{τ+1}|x_τ))_{x_{τ+1}, x_τ} be the matrix containing all transition probabilities for position τ. As we use a homogeneous HMM, these matrices are the same for all τ ∈ T, and we will simply call them A. With the matrix A we compute matrices (P(x_τ, x_t))_{x_τ, x_t} =: X_{τ,t} ∈ R^{|Ω|×|Ω|} for all τ, t ∈ T with τ ≤ t. These matrices contain all required values and can simply be calculated by

    X_{τ,t} = A · X_{τ,t-1}  for τ < t.    (7)

With the help of this equation we can calculate all X_{τ,t} iteratively as long as we know X_{τ,τ}. X_{1,1} is simply initialized with

    X_{1,1} := diag(π(x_1), π(x_2), ..., π(x_n)).    (8)

To compute X_{τ,τ} for τ ∈ T, 1 < τ, we can use the values of X_{τ-1,τ} and the law of alternatives:

    X_{τ,τ} = diag( ∑_{i ∈ {1,...,n}} X_{τ-1,τ}[i, 1], ..., ∑_{i ∈ {1,...,n}} X_{τ-1,τ}[i, n] )    (9)

with X_{τ,t}[i, j] being the element in the i-th row and the j-th column of the matrix X_{τ,t}. We can therefore simply collapse the columns of X_{τ-1,τ} to calculate X_{τ,τ}.

Using these equations we can calculate the required values of P(x_τ, x_t) by first initializing X_{1,1} and then successively calculating X_{1,t} for all t ∈ {2, ..., ℓ_end}. After one step of this calculation, we obtain X_{1,2}, from which we can calculate X_{2,2}. Using X_{2,2} we can then successively calculate X_{2,t} for t ∈ {3, ..., ℓ_end}, and so on. After having calculated all necessary X_{τ,t}, P(σ_τ, σ′_t) and P(σ_τ) can be calculated with the help of (2) and (3). Again, these sums are expressible as matrix-vector multiplications, which enables fast

Table 1. We used four different genomic sequences and extracted different training data from them. Due to memory restrictions we were not able to fit HMMs in RHmm (Taramasco, 2009) for all sequence set-ups up to the limit of |Ω| = 20; the maximum number of states we were able to fit is shown in the last column.

Organism            No. of sequences   Length of sequences   Max. no. of hidden states
A. thaliana         18,585             1000                  1
A. thaliana         500                1000                  16
A. thaliana         500                100                   20
B. subtilis         4215               1000                  8
B. subtilis         500                1000                  14
B. subtilis         500                100                   20
HIV1                8982               200                   12
HIV1                229                40                    20
HIV1                500                100                   20
Tobacco M. Virus    6169               200                   12
Tobacco M. Virus    159                40                    20
Tobacco M. Virus    500                100                   20

calculation. Finally, we can calculate the MI using (1). The computation of the overall transition probabilities constitutes the use of the forward and backward algorithms for HMMs.

Due to the homogeneity of our HMM we can reduce the problem to a more restricted notion of the MI that does not distinguish between the specific positions, but focuses solely on the separation of the positions, L := t - τ. For all pairs (τ, t) with the same distance L, we will therefore consider the value

    MI(L) = 1/(ℓ_end - L) · ∑_{τ=1}^{ℓ_end - L} MI_{τ,τ+L},  with t = τ + L.    (10)
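As an illustrative sketch, the dynamic program of Eqs. (7)-(9) and the distance-averaged MI of Eqs. (1)-(3) and (10) might be combined as follows; the toy parameters are made-up assumptions, and the index convention is the one stated in the comments, not necessarily the one of the typeset paper:

```python
import numpy as np

# Toy homogeneous HMM; all numbers are illustrative assumptions.
# A[i, j] = P(x_{t+1} = i | x_t = j)  (transition matrix of Eq. (7))
# E[s, i] = P(sigma = s | x = i)      (emission table)
# pi[i]   = starting probability pi(x_i) of Eq. (8)
A = np.array([[0.9, 0.2],
              [0.1, 0.8]])
E = np.array([[0.7, 0.1],
              [0.3, 0.9]])
pi = np.array([0.5, 0.5])
l_end = 25

# Dynamic program of Eqs. (7)-(9).  Convention used in this sketch:
# X[(tau, t)][i, j] = P(x_t = i, x_tau = j).
X = {(1, 1): np.diag(pi)}                      # Eq. (8)
for tau in range(1, l_end):
    for t in range(tau + 1, l_end + 1):
        X[(tau, t)] = A @ X[(tau, t - 1)]      # Eq. (7): one more transition
    # Eq. (9): collapse X_{tau, tau+1} to the marginal at position tau + 1.
    X[(tau + 1, tau + 1)] = np.diag(X[(tau, tau + 1)].sum(axis=1))

def mi_pair(tau, t):
    """MI_{tau,t} of Eq. (1), with P(sigma) and P(sigma, sigma')
    obtained from the X tables via Eqs. (2) and (3)."""
    joint = E @ X[(tau, t)].T @ E.T            # joint[s, s'] = P(sigma_tau=s, sigma'_t=s')
    p_tau, p_t = joint.sum(axis=1), joint.sum(axis=0)
    mask = joint > 0
    return np.sum(joint[mask] * np.log2(joint[mask] / np.outer(p_tau, p_t)[mask]))

def mi_of_L(L):
    """Eq. (10): average MI over all pairs with separation L."""
    return sum(mi_pair(tau, tau + L) for tau in range(1, l_end - L + 1)) / (l_end - L)

# The MI decays with the separation L, as expected for a Markov model.
assert mi_of_L(1) > mi_of_L(4) > mi_of_L(8) >= 0.0
```

Each X table is reused for every larger separation, which is the source of the speed-up over recomputing all state paths per pair of positions.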

This is the quantity we will consider in Section 3 below.

2.5. Algorithmic complexity

Using this algorithm one can efficiently compute the MI from HMMs. We proposed to calculate the probabilities P(x_τ, x_t) beforehand, which is of O(|T|² |Ω|²) complexity. We were able to express this calculation in |T|²/2 efficient matrix calculations, so that the determination of these probabilities can be done using libraries for fast matrix operations. Additionally, the algorithm for the calculation of P(x_τ, x_t) is highly parallelizable: the computation of X_{τ,t} can be done separately for different starting positions τ, as long as X_{τ,τ} is given through prior calculations. Having calculated these probabilities, we can calculate all MI values in O(|T|² |Σ|² |Ω|²) operations. A naive approach that calculates the MI values independently, however, would need O(|T|² |Σ|² |Ω|² |Ω|^|T|) = O(|T|² |Ω|^(2+|T|) |Σ|²) operations. On the other hand, our precalculation increases the memory requirements: to provide P(x_τ, x_t), we need O(|T|² |Ω|²) units of memory to save all the auxiliary values. In real applications, however, we did not encounter any limitations through this increased memory demand. For very long sequences and very complex HMMs with many hidden states this could nevertheless pose a problem. As only a few of these values are needed at the same time, the memory requirements might be reduced through sophisticated memory management.

3. Experimental results

3.1. Sequence data

The notion of hidden states was found to be of biological relevance in studies of HMMs early on, e.g. for detecting CpG islands (Dasgupta et al., 2002). We therefore used whole genome sequences (GenBank identifiers, 2010) of Bacillus subtilis, Arabidopsis thaliana chromosome 4, the Tobacco Mosaic Virus, and the Human Immunodeficiency Virus Type 1 (HIV1). This collection of sequence data covers a wide range of genomic organizations, from those in viral capsids, over chromatin structures in eukaryotes, to prokaryotic DNA. We sampled sets of varying size from the different genomic sequences. These sets contained subsequences of a certain length. Varying these lengths, we generated different training sets for the fitting of HMMs. We show details of the training data and the corresponding fitted HMMs in Table 1. We used only those cases with two or more hidden states, as an HMM with just one hidden state is in fact a trivial Markov chain. In a first step we convinced ourselves that the fitting of any HMM and the subsequent computation of the resulting MI by the algorithm described above is invariant under the construction of the sequence set used for training. Therefore, we use only the smallest sequence sets from Table 1 for each organism.

3.2. Software

We trained homogeneous HMMs with increasing counts of hidden states. We used the RHmm library (Taramasco, 2009) for the statistical software package R (R Development Core Team, 2009). Our algorithm was implemented in Python (van Rossum, 1995), which we also used to emit artificial sequences for the trained HMMs (Schliep et al., 2004).

3.3. Analytic result as infinite-size limit of data

We investigated whether our analytical result for the MI is the limit of a randomized sequence creation process driven by a trained HMM. We used trained HMMs, emitted varying numbers of sequences, and computed the MI for these artificial sequence sets (Hoffgaard et al., 2010). In Fig. 2 we show the results for B. subtilis. Indeed, our analytical result is the limit for an infinitely sized sequence set emitted by the HMMs. Although these results are very promising (in particular, they show that our approach does not suffer from finite-size effects of the artificially created sequence sets), the restrictions imposed by the Markov property of the underlying model need to be considered as well.

3.4. Restrictions induced by the Markov property

We investigated how well HMMs can capture all correlations found in real-world sequences. To this end we computed the MI for the full genomic sequences (Hoffgaard et al., 2010) and compared the results to the analytically determined MI from fitted HMMs. In Fig. 3 we observe a significant correlation between the data sets. The correlation appears to be stable against increases in model complexity above 5-7 hidden states. A potential source for this effect might be the fact that the used genomic sequences are (nearly) Markov chains at the investigated length scales. Overall genomic structure was found to obey long-range correlations (Dehnert et al., 2006; Holste et al., 2003), while our results suggest that a few hidden states are probably sufficient to model this observation.

3.5. Scaling behavior of mutual information in Hidden Markov Models

The curves in Fig. 2 suggest an exponential decay of the MI along a sequence with increasing L. Due to the Markov property of the modeling approach chosen here, this does not come as a surprise. In Fig. 4 we show the fitted decay exponents for varying numbers of states. The data suggest a faster decay of the MI for larger numbers of hidden states. Clearly, for a larger number of states the state path can follow a larger number of routes through state


Fig. 2. The computed MI(L) for varying separation of positions L, for various numbers of hidden states (panels: 2, 8, 16, and 20 states), and for N sequences emitted by the respective HMM (N = 1000 up to 50,000, together with the analytic result). We only show a selection for brevity; the results hold for all of our data, including all numbers of hidden states in [1; 20].

space and potentially destroy more correlation. In this case we would have over-fitted the HMMs.

3.6. General remarks

We found the empirical MI to be larger than the analytical MI from Eq. (1) for all of our data. Due to Stein's Lemma (Cover and Thomas, 2006) we can rationalize this finding as follows: the MI is a special case of a Kullback-Leibler divergence D_KL(p‖q) (MacKay, 2004), which can be used to quantify several interesting features in molecular biophysics (Hamacher, 2007; Pape et al., 2010). The Kullback-Leibler divergence is the infinite-sample limit of the probability for the identification of false positives in the empirical distribution p with respect to a reference distribution q, which in our case is the null model of neutral evolution between two sites, q := P(σ_τ) · P(σ′_t), see Eq. (1). Therefore the MI computed here measures how likely it is to falsely detect a co-evolutionary signal in p while referring to neutral evolution in q.
Fig. 3. The correlation coefficient r (expressed as Spearman and Pearson correlation coefficients (Press et al., 1995)) between the empirically computed MI (Hoffgaard et al., 2010) and the MI contained within the fitted HMMs, for various numbers of hidden states and all genomic sequences of this study. (MI values correlated for L ∈ [1; 20].) Note that the small Spearman correlation for Tobacco is insignificant, as its p-value is larger than 0.1, while for the other organisms the Spearman ranking coefficient is always significant with p < 0.001.
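The Kullback-Leibler view of Section 3.6 can be checked numerically: the MI of Eq. (1) equals D_KL of the joint distribution against the product of its marginals. The joint distribution below is a made-up toy example:

```python
import numpy as np

# Toy joint distribution p[s, s'] = P(sigma_tau = s, sigma'_t = s');
# q is the independence null model (product of the two marginals).
# All numbers are illustrative assumptions.
p = np.array([[0.30, 0.10],
              [0.05, 0.55]])
q = np.outer(p.sum(axis=1), p.sum(axis=0))

kl = np.sum(p * np.log2(p / q))          # D_KL(p || q) in bits
mi = sum(p[i, j] * np.log2(p[i, j] / (p.sum(axis=1)[i] * p.sum(axis=0)[j]))
         for i in range(2) for j in range(2))

assert np.isclose(kl, mi)                # MI is exactly this KL divergence
assert kl > 0.0                          # dependent joint => positive MI
```

A vanishing value would indicate that the two positions evolve independently under the null model q.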

Fig. 4. The fitted prefactor a in the scaling law MI ∝ exp(-aL) for the analytically derived MIs. For brevity we show only two organisms (B. subtilis and Tobacco M. Virus). Error bars for a are negligible.

3.7. Convergence behavior

We note that our results are invariant with respect to the starting probability π(x). To this end, we repeated the above procedures with a converged starting probability π_c(x) := lim_{n→∞} A^n (1, 0, 0, ...)^T and found deviations only within numerical accuracy [data not shown].

4. Discussion and conclusion

In this work we have developed and implemented an efficient procedure to compute the MI from trained HMMs. We discussed the speed-up in computing time of order |Ω|^|T|, where |Ω| is the number of hidden states in the HMM and |T| is the number of positions. This dramatic speed-up is bought, however, with a higher demand in temporary memory. We showed that our analytical result avoids a major problem of empirical studies: finite-size effects. We obtained better accuracy by avoiding the emission of symbols and the subsequent computation of pseudo-empirical MIs, which would only converge slowly (Weil et al., 2009). We were furthermore able to demonstrate that the training of homogeneous HMMs is, for practical purposes, invariant with respect to the sampling, suggesting that accurate training can be achieved also with sparse data.

Finally, we would like to point out that our work is not meant to replace the immediate computation of the MI for a given sequence set. Obviously, the direct computation maintains long-range correlations and avoids the exponential loss of information implied by the Markov property. Nevertheless, our contribution can be very useful in the future, as it allows one, for a given HMM, to investigate the correlated information between positions, thus (a) enabling users who do not have access to the training sequences (for example because the sequences are inaccessible due to intellectual property issues) to still compute the MI, and (b) making it possible to harvest existing databases of already trained HMMs for (co-)evolutionary signals (see Section 5 below).

5. Future work

Despite promising results, we could show that the simplicity of homogeneous HMMs restricts our capability to model involved correlations. Although appropriate for the investigated genomic sequences, homogeneous HMMs could be insufficient to describe more complex dependencies. Silent states are a powerful tool to overcome such restrictions and are key to the success of the HMMs contained in the well-known SUPFAM database (Gough et al., 2001; Wilson et al., 2007). We therefore intend to extend our algorithm to enable the processing of more complex models, such as inhomogeneous HMMs, and to automatically harvest the evolutionary information contained in SUPFAM.

Acknowledgment

KH is grateful to the Fonds der Chemischen Industrie, which supported this study through a grant for junior faculty.

References

Bellman, R., 1952. The theory of dynamic programming. Proceedings of the National Academy of Sciences of the United States of America 38 (8), 716–719.
Boba, P., Weil, P., Hoffgaard, F., Hamacher, K., 2010. Co-evolution in HIV enzymes. In: Fred, A., Filipe, J., Gamboa, H. (Eds.), BIOINFORMATICS 2010, pp. 39–47.
Bremm, S., Schreck, T., Boba, P., Held, S., Hamacher, K., 2010. Computing and visually analyzing mutual information in molecular co-evolution. BMC Bioinformatics 11, 330.
Cover, T.M., Thomas, J.A., 2006. Elements of Information Theory, second ed. Wiley, Hoboken.
Dasgupta, N., Lin, S., Carin, L., 2002. Sequential modeling for identifying CpG island locations in human genome. IEEE Signal Processing Letters 9 (12).
Dehnert, M., Helm, W.E., Hütt, M.-T., 2006. Informational structure of two closely related eukaryotic genomes. Physical Review E 74 (2), 021913.
Dines, J., Sridharan, S., 2001. Trainable speech synthesis with trended hidden Markov models. In: ICASSP '01: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing. IEEE Computer Society, Washington, DC, USA, pp. 833–836.
Durbin, R., Eddy, S., Krogh, A., Mitchison, G., 1998. Biological Sequence Analysis. Cambridge University Press, Cambridge.
Eddy, S., 1995. Multiple alignment using hidden Markov models. In: Third International Conference on Intelligent Systems for Molecular Biology. AAAI Press, Cambridge, pp. 114–120.
Finn, R., Mistry, J., Schuster-Böckler, B., Griffiths-Jones, S., Hollich, V., Lassmann, T., Moxon, S., Marshall, M., Khanna, A., Durbin, R., Eddy, S., Sonnhammer, E., Bateman, A., 2006. Pfam: clans, web tools and services. Nucleic Acids Research 34, D247–D251.
GenBank identifiers: Bacillus subtilis gi225184640; HIV-1 gi4558520; Tobacco mosaic virus genome gi62124; Arabidopsis thaliana chromosome 4 gi240256243.
Gough, J., Karplus, K., Hughey, R., Chothia, C., 2001. Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure. Journal of Molecular Biology 313 (4), 903–919.
Grundy, W.N., Bailey, T.L., Elkan, C.P., Baker, M.E., 1997. Hidden Markov model analysis of motifs in steroid dehydrogenases and their homologs. Biochemical and Biophysical Research Communications 231 (3), 760–766.
Hamacher, K., 2007. Information theoretical measures to analyze trajectories in rational molecular design. Journal of Computational Chemistry 28 (16), 2576–2580.
Hamacher, K., 2008. Relating sequence evolution of HIV1-protease to its underlying molecular mechanics. Gene 422, 30–36.
Hamacher, K., 2010. Protein domain phylogenies: information theory and evolutionary dynamics. In: Fred, A., Filipe, J., Gamboa, H. (Eds.), BIOINFORMATICS 2010, pp. 114–122.
Hoffgaard, F., Weil, P., Hamacher, K., 2010. BioPhysConnectoR: connecting sequence information and biophysical models. BMC Bioinformatics 11 (1), 199.
Holste, D., Grosse, I., Beirer, S., Schieg, P., Herzel, H., 2003. Repeats and correlations in human DNA sequences. Physical Review E 67 (6), 061913.
Horan, K., Shelton, C.R., Girke, T., 2010. Predicting conserved protein motifs with sub-HMMs. BMC Bioinformatics 11 (1), 205.
MacKay, D., 2004. Information Theory, Inference, and Learning Algorithms, second ed. Cambridge University Press, Cambridge.
Pape, S., Hoffgaard, F., Hamacher, K., 2010. Distance-dependent classification of amino acids by information theory. Proteins: Structure, Function, and Bioinformatics 78 (10), 2322–2328.
Press, W.H., Flannery, B.P., Teukolsky, S.A., Vetterling, W.T., 1995. Numerical Recipes in C. Cambridge University Press, Cambridge.
R Development Core Team, 2009. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
Schliep, A., Georgi, B., Rungsarityotin, W., Costa, I.G., Schönhuth, A., 2004. The general hidden Markov model library: analyzing systems with unobservable states. In: Proceedings of the Heinz-Billing-Price, pp. 121–136.
Stiller, J., Radons, G., 1999. On-line estimation of hidden Markov models. IEEE Signal Processing Letters 6 (8), 213.
Taramasco, O., 2009. RHmm: Hidden Markov models simulations and estimations. http://r-forge.r-project.org/projects/rhmm/ (accessed July 10, 2010).
van Rossum, G., 1995. Python reference manual. Tech. rep., CWI Report CS-R9525.
Weil, P., Hoffgaard, F., Hamacher, K., 2009. Estimating sufficient statistics in co-evolutionary analysis by mutual information. Computational Biology and Chemistry 33 (6), 440–444.
Wilson, D., Madera, M., Vogel, C., Chothia, C., Gough, J., 2007. The SUPERFAMILY database in 2007: families and functions. Nucleic Acids Research 35 (Database issue), 308–313.
Zen, H., Tokuda, K., Masuko, T., Kobayashi, T., Kitamura, T., 2007. A hidden semi-Markov model-based speech synthesis system. IEICE Transactions on Information and Systems E90-D (5), 825–834.
