You are on page 1of 4

A Fast Mutual Information Calculation Algorithm

Hans-Peter BERNHARD1 Gernot KUBIN2

Institut fur Nachrichtentechnik und Hochfrequenztechnik, Vienna University of Technology, Gusshausstr. 25/389, A-1040 Vienna, Austria, Tel/Fax: +43 1 58801 3524 / +43 1 5870583, 1Email: 2 Email:

Abstract. In modern signal analysis, chaos-theoretic interpretation is becoming more and more popu-

lar. One of the most important parameters of chaotic systems is their information gain or entropy rate. Its measurement provides performance limits for nonlinear signal models in source coding and prediction applications. We have developed a fast algorithm for the estimation of the entropy rate for attractors reconstructed from nite-length signals with the method of delays. With conventional algorithms, the computational load increases almost exponentially with the reconstruction dimension. We exploit dynamical memory organization principles to arrive at a linear increase, i.e. whenever we increment the reconstruction dimension by one, the calculation time of the new algorithm increases only by 1=10 of the calculation time needed in two dimensions. Furthermore, the calculation time is proportional to N log2 N where N is the signal length.

If we want to apply chaos-theoretic tools to the analysis of signals , it is necessary to nd out whether the given signal can be interpreted as the output of a nonlinear dynamical system . Generally, we may have only incomplete (if any) knowledge about the nonlinear equations of the underlying signal generation system. The signal is our primary knowledge source and, therefore, it should be used to estimate several parameters relevant for the description of the underlying system. Most of these (such as dimension, Lyapunov exponents, and entropy rate) can be determined by viewing stationary signal segments as lying on attractors (i.e. steadystate solutions) of the system. For a scalar real-valued signal x(t), an attractor can be reconstructed in the following way: we assemble several delayed samples x(t ? mT ) in an n{dimensional `state' vector ~ T (t) = x(t); x(t ? T ); : : : ; x(t ? (n ? 1)T )] where T is the xn delay time parameter and n is the dimension parameter of the reconstruction. If we follow ~ T (t) as a function of time t, xn we obtain a trajectory in an n{dimensional `state space'. If the observed signal segment is stationary, the trajectory corresponds to an attractor of the underlying system. This can be associated with a stationary probability measure which is de ned on the ensembles of signal samples X (t) and state ~T vectors Xn (t) where capital letters serve to distinguish statistical ensembles from instantaneous values. To calculate the entropy rate of the underlying system, ~T we measure the mutual information I (X (t + T ); Xn (t)) between the ensemble of future signal samples X (t + T ) and the ~T ensemble of state vectors Xn (t). If n is chosen high enough (such that the reconstruction is an embedding of the true attractor) the mutual information saturates at a value which corresponds to the amount of information predictable over a time delay T . This procedure is then applied e.g. for two di erent reconstructions of the same attractor where we use the delay time parameters T1 and T2 , respectively. >From this, the entropy rate is obtained as

1. Introduction


~T I (X (t + T ); Xn (t)) = RT ? RT?1 (2) n n T de ned later on in equation (3). with the redundancy Rn
R. Shaw describes in SR81] the entropy rate properties of nonlinear dynamical systems. If we know the systems entropy rate we are able to decide whether it operates in a chaotic or in a non-chaotic regime. In the former case, the entropy rate is greater than zero and equal to zero in the latter case. Beyond the sign of the entropy rate, its magnitude is also an important quantitative measure. Positive entropy rate can be interpreted as information gain: observing the evolution of the system allows to continuously re ne the information about its (unknown) initial condition. In source coding applications, the entropy rate of a nonlinear deterministic signal model de nes the number of `new bits' produced per time unit and provides a performance limit complementary to the prediction gain for linear stochastic signal models.

~ T1 ? I ~ T2 h = I (X (t + T1 ); Xn (tT)) ? T(X (t + T2 ); Xn (t)) ;

2 1


In the following, we present a fast mutual information (FMI) calculation algorithm as needed for the estimation of the entropy rate of nite-length signals. We use existing algorithms such as the one published by Fraser in Fra89] as a baseline for our performance comparisons1 Computation of the mutual information is based on the (implicit) estimation of an n-dimensional probability density function. Therefore, the computation load usually increases drastically with increasing reconstruction dimension n. This can be seen in gure 1 which shows the calculation time plotted versus the reconstruction dimension for Fraser's algorithm implemented on a 486 PC using the GNU C-Compiler under 386BSD Unix. Both Fraser's and our algorithm organize their computations recursively according to the structure of the given attractor. The algorithms partition the entire state space recursively in smaller and smaller hypercubes, which ultimately serve as histogram classes for the estimation of the
1 The C-code of Fraser's algorithm is available by anonymous ftp from the internet server The FMI algorithm can be obtained by contacting the rst author at the above email address.

2. The Fast Algorithm

0 This work has been supported by the Austrian Science Foundation (FWF) through grant P08779-TEC.

To appear in Proc. EUSIPCO-94

c 1994

Figure 1.

Total computation time versus reconstruction dimension

Calculation time in sec] 500 6 450 Fraser's algorithm 400 350 250 200 150 % % 100 " " ( 50 ( ( FMI algorithm ( ( ( 2 3 4 5 6 7 8 9 10 11 12 13 14 15

the partition by a simple n-dimensional array. An array like this would clearly exceed all memory limits if the partitioning becomes ner. Therefore we store information of the tree as a system of linked vectors. For explanation we refer to gure 2 where nodes connected by pointers can be seen. The nodes denote a hypercube in phase space and the pointers represent the dependency between the nodes or cubes. Within the rectangular boxes of gure 2 the recursion depth is numbered. At the recursion level 0; a single cube is dened that contains all vectors of the complete state space. This initialization is necessary to start the recursive algorithm with equation (5). The recursive algorithm operates in the following way independent of the recursion depth: Dividing level
e0 P
Q PPP Q P ? s q P e0 Q e PP e2 ? A J 1 ? J A ? ? ^ ?U e0 J e e e e A e Z S 1 B Z 0 1 2 C 3 BS Z C BN S Z ?B e S e Z e e CCe ~ W e w

0 1 2 3

multidimensional probability density function. In our considerations we assume every hypercube is uniquely indicated by an m-tupel Km . m denotes the recursion depth of the state space dividing process. To address the subcubes one dividing level deeper than m, we simply extend the m-tupel by one number to an m + 1-tupel Km ; j , where j de nes one subcube contained in the cube determined by Km . In this context, a class will be de ned as a hypercube in state space which contains no further measurable details about the attractor structure. In other words, if the n-dimensional state vectors are uniformly distributed over a hypercube (no further substructure can be found) this cube is not further partitioned but retained as a class. The number of vectors in one class is given by N (Km), so we can evaluate the mutual information with the resulting ndimensional histogram. This calculation is done by considering the redundancy in di erent state space reconstructions (2). The redundancy is given by K ) RT = Px1 xn (Km)ld P (Px1 ) :x:n:(P m(K ) : (3) n xn m x1 K m K

Figure 2.


Global structure: Recursive tree of linked

0 1 2

3 0 1


where Km denotes the sum over all hypercubes in the ndimensional state space considered as a class. In equation (3), xn denotes the n-th component of the reconstructed state space and Px1 xn (Km ) is the probability of state vectors to fall in class Km , given by (4) P (K ) = N (Km) :
x1 x n m

For a more detailed description we refer to Fra89], FS86] and BK91]. With these de nitions we can evaluate the recursive procedure given in FS86] where (5) RT = 1 F (K ) ? ldN;

denotes the starting equation. We proceed with the recursion

n X F (Km; j) F (Km) = nN (Km) +

and terminate if there exists no substructure with the terminating equation in every single class F (Km) = N (Km )ldN (Km): (7) The computation is based on a recursive tree of histogram classes. To deal with it we have to use an e cient description because it is not possible to indicate every element of 2



First, the distribution of the state vectors contained in the current cube is tested, and if we nd a uniform distribution, the entire cube is considered a homogeneous class. The contribution of this cube to the total redundancy result is determined from equation (7). Here N (Km ) is the number of vectors within the considered cube. But, if we can nd that the cube contains substructure we divide it further in subcubes. The distribution structure is tested with the chi-square test where the null-hypothesis is a uniform distribution of the state vectors. If the null-hypothesis fails, we suppose that the considered cube is not a class, and it has to be divided in subcubes. The subcubes are obtained from binary splits in each dimension. Therefore, the number of possible subcubes is 2n . In gure 2, the analysis of one particular attractor structure is schematically plotted. A node with pointers connected to other nodes symbolizes a partition of a cube into subcubes. Additionally, we de ne that one cube has to contain at least one state vector. If it is empty the cube does not contribute to the redundancy result at all and we do not process this cube in any way. Hence the number of pointers originating from a node can be less than 2n . This tree pruning is one of the reasons why our algorithm is very fast. Particularly in high dimensions, we observe that the calculation time of our FMI algorithm increases linearly with the number of dimensions instead of the exponential increase found in Fraser's implementation, see also gure 1. Moreover, to avoid a huge number of calculations every single state vector is processed only one time at each dividing level. To this end, the set of state vectors in one cube is divided up in subsets by the following operation. Initially, no subcube structure is given and we process every vector in the cube sequentially. For each state vector, its subcube location is determined with n threshold comparisons (one per dimension) and it is assigned as a subcube number. If a subcube number appears for the rst time, this subcube is created. In this context, creation means that, in the schematic plot of

gure 2, one new node is inserted and it is connected to the current node by a pointer from cube to subcube. The new node initially consists of only one state vector. Otherwise, if a vector belongs to an already existing subcube, the vector is attached to this subcube. The attaching of vectors requires an internal organization of vectors within one cube which can be seen in gure 3. Every single box symbolizes one state vector. For every vector, a pointer indicates the next vector that belongs to the same cube. This linked list of vectors is cyclically closed with the last vector pointing to the rst vector of the subcube. So for any given state vector in a cube we can access all state vectors belonging to the same cube. Consequently, it is also fast and easy to add a new vector to the subcube. The pointer of the last state vector (initially pointing to the rst one) is linked to the new added vector. The pointer of the new vector links itself to the rst vector. Now the added vector is the last element of the subcube.

1 2 3 4 5 6 7 8 9 10 11 12
6 6 6 6 6 6


Figure 3.

Likewise, just as the state vectors, the subcubes are linked among themselves: the rst (and maybe only) state vector of the subcube has an additional pointer which points to the next subcube. The last subcube points again to the rst. For speci cation of one entire cube, only one vector pointing to the rst element of the rst subcube is used. To evaluate equation (6) for every recursion step, the recursive procedure has to be called for every subcube. Basically only one argument, the pointer to the subcube, is passed as argument to the recursive procedure. Subsequently, the procedure starts with the distribution test again. Reviewing the algorithm, we begin at level 0 with the entire attractor lying in a single cube. To describe the attractor structure, we divide it up in subcubes of equal size. Proceeding recursively, subcubes are analyzed in the same way. The dividing procedure stops if no signi cant substructure is found. Parallel to the dividing operations, the redundancy is calculated.

Cube structure: Cyclic list of linked vectors



n-dimensional array containing n thresholds of the current subcube division; Lnk cub call indicates if a cube contains at least one state vector or not; Lnk cub ptr pointer to the last state vector of a cube; 2. The Recursion F MIND is shown in gure 4. Whenever this function is called new cube borders xg are calculated locally for this cube. Therefore if the function is left the borders are restored to their original values at calling time. this cube represents the current input cube, and indicates the cube to be divided in subcubes by the function NN MIND. The distribution of all state vectors in the cube is tested next by the function SUBSTRUCTURE. If there is no structure the recursive function returns the result of equation (7). Equation (6) is used to proceed with further recursion steps, where the return variable f is initialized with nN (Km ) of equation (6). The return values of the function calls are added to the return value f. 3. The dividing procedure NN MIND expects one pointer as input indicating the cube to be divided. All of the several vectors connected by LNK are processed and for each of them the subcube number elem is de ned. In Lnk cub call the current value of NN CALLS is stored if this cube contains at least one state vector. Hence if NN CALLS and Lnk cub call elem] are not equal the actual NN CALLS is inserted, and the algorithm creates a new cube as explained in the last section. If the subcube already exists the current vector is attached to the set of subcube vectors. We count also the number of vectors in the array N. Accordingly, the element of N which has the same index as the rst vector in the subcube is incremented. For an e ective and fast indexing we use Lnk cub ptr to de ne the last element of a subcube. To determine the redundancy we start with the equation (5) where the recursive procedure F MIND is called the rst time.
xg double { double long F_MIND(LONG nl, LONG this_cube) f; next_n,n;

3. Implementation

The practical implementation in C language consists of three main parts. 1. Initialization of the state vectors, arrays and variables according to the following list: DIM reconstruction dimension; VAL MI n-dimensional array with the reconstructed state vectors (to save memory it can be 1dimensional and the vectors are reconstructed during the calculation); N array of integers containing the number of vectors in the subcube. NN CALLS actual number of calls to the dividing function; LNK array of pointers to link state vectors (see gure 3); LNK CUBE array of pointers with the same structure as LNK to describe the cubes for one dividing step from cube to subcube; 3

DEFINE_NEW_BORDERS(xg); NN_MIND(this_cube); if (SUBSTRUCTUR(this_cube)) { f = (DOUBLE)(nl*DIM); n = this_cube; do { next_n = LNK_CUBE n]; if (N n] > 1) f = f + F_MIND(N n],n); n = next_n; } while (n != this_cube); } else f = (double)nl*ld((double)nl); RESET_OLD_BORDERS(xg); return(f); }
Figure 4.

When progressively partitioning a hypercube in 2n subcubes, many of the subcubes may be empty, i.e. they do not contain a single vector. Still, direct processing of these empty subcubes results in the exponential increase of the calculation time, as can be seen in Fraser's algorithm. Therefore, it

4. Results and Conclusion

C-Code for the recursive procedure

void { long




7 6 5 4 3

NN_CALLS++; last_cube = this_cube; act_T = this_cube; next_act_T = LNK act_T]; do { elem = 0; for (n = 0; n < DIM; n++) elem += ((*(VAL_MI n]+act_T) >= *(xg+n)) << n); if (Lnk_cub_call elem] != NN_CALLS) { Lnk_cub_call elem]=NN_CALLS; LNK_CUBE last_cube] = act_T; LNK_CUBE act_T] = this_cube; act_cub = act_T; last_cube = act_T; N act_T] = 1; } else { ++N LNK Lnk_cub_ptr elem]]]; act_cub = LNK Lnk_cub_ptr elem]]; LNK Lnk_cub_ptr elem]] = act_T; } Lnk_cub_ptr elem] = act_T; next_act_T = LNK act_T]; LNK act_T] = act_cub; act_T = next_act_T; } while (act_T != this_cube); }
Figure 5.

Mutual Inf. Bit] Theoretical Maximum " Q FMI-Algorithm "


Q b `` b `X Q bH XX h X hX XX h HH Fraser's Algorithm X hh h

Figure 6.

4 5 6 7 8 9 Reconstruction Dimension Comparison of the computational accuracy

is more e cient to process only the links of the state vectors to their corresponding hypercubes. If a cube is empty our algorithm does not even take notice of this cube. Because all state vectors obtained from the signal are unique in state space (a pre-processing algorithm guarantees this fact) the maximum number of partitioning steps is log2 N where N is the length of the data record. In the worst case, we have to rede ne the link to the current subcube for all N vectors in each partition level. So the calculation time is proportional to N log2 N . The calculation time also depends on the reconstruction dimension n, but this dependence is only linear for our algorithm rather than exponential: at each partition level, a threshold comparison has to be made to divide the cube. As these comparison operations constitute only a minor part of the total computation load and as no other operations depend explicitly on the reconstruction dimension n, incrementing n to n + 1 increases the total calculation time only by 1=10 of the calculation time needed for n = 2. Figure 6 compares the experimental results for both algorithms obtained for a quasi-periodic signal: sin(!1 t) + 1=9 sin(3!1 t)+ 1=4 sin(!2 t) + 1=36 sin(3!2 t) + noise (SNR p 42dB, !1 = 6=40; !2 = !1 (1 + 5)=10; and t is an integer time index). The attractor is reconstructed with the delay time parameter set equal to T = 20. Even for 2{, 3{, and 4{dimensional reconstructions, there is already a signi cant di erence between the calculation time of the algorithms. For higher dimensions, (n = 9 and up), the new algorithm is over ten times faster than Fraser's implementation. There are three more advantages to our new algorithm: 1. It gets closer to the theoretical saturation value of the mutual information as de ned by the signal-to-noise ratio of the observed data record. This becomes possible due to an improved procedure for a more accurate partitioning of the subcubes. The calculation time for this algorithm version is about 1.5 times longer than our fastest version. We have compared the accuracy of the 4

C-Code to divide a n-dimensional cube

algorithms for the above quasi-periodic signal. While both the new algorithm and Fraser's algorithm exhibit a similar overall behavior in a plot of mutual information versus reconstruction dimension (see gure 6 ), only the new algorithm succeeds in coming close to the theoretical maximum value of 7 Bit as calculated in Fra89]. 2. The memory requirements can be signi cantly reduced for high-dimensional reconstructions. In principle, for an n-dimensional attractor of length N , one needs to store nN samples because all n components of the N vectors have to be stored separately. We circumvent this requirement by reconstructing the attractor concurrently with the mutual information estimation. So memory usage becomes independent of the reconstruction dimension n. To implement this strategy without sacri cing speed, we need a total of 4N memory locations. For simplicity, this feature is not used in the code segments gure 4 and gure 5, but is implemented as an option in our complete program. 3. In Fraser's algorithm, the length of the sampled signal has to be a power of 2. Our algorithm does not have such a restriction. This is an asset for practical experiments, because a signal record of arbitrary length can be processed completely without discarding `overlength' data. In summary, we have developed and implemented a new algorithm for the estimation of the entropy-rate of nitelength signals. Its computational e ciency and high accuracy make this algorithm attractive both for signal analysis and the design of source coding systems. BK91] H.-P. Bernhard and G. Kubin. Detection of


GRETSI Symp. on Signal Image Processing, 1301{1311, Juan-les-Pins (France), Sept. 1991. Fra89] A.M. Fraser. Information and entropy in strange attractors. IEEE Trans. on Information Theory, IT-35(2):245{262, March 1989. FS86] A.M. Fraser and H.L. Swinney. Independent coor-

chaotic behaviour in speech signals using Fraser's mutual information algorithm. Proc. 13th

dinates for strange attractors from mutual information. Physical Rev. A, 33(2):1134{1140, Feb. 1986. SR81] R. Shaw. Strange Attractors, Chaotic Behavior, and Information Flow. Z. Naturforsch, 36a,
80-112, 1981.