Michael Wibral
Raul Vicente
Joseph T. Lizier Editors
Directed Information
Measures in
Neuroscience
Understanding Complex Systems
Founding Editor
Prof. Dr. J.A. Scott Kelso
Center for Complex Systems & Brain Sciences
Florida Atlantic University
Boca Raton FL, USA
E-mail: kelso@walt.ccs.fau.edu
Springer Complexity
Springer Complexity is an interdisciplinary program publishing the best research and academic-level
teaching on both fundamental and applied aspects of complex systems - cutting across all traditional dis-
ciplines of the natural and life sciences, engineering, economics, medicine, neuroscience, social and com-
puter science.
Complex Systems are systems that comprise many interacting parts with the ability to generate a new
quality of macroscopic collective behavior the manifestations of which are the spontaneous formation of
distinctive temporal, spatial or functional structures. Models of such systems can be successfully mapped
onto quite diverse “real-life” situations like the climate, the coherent emission of light from lasers, chem-
ical reaction-diffusion systems, biological cellular networks, the dynamics of stock markets and of the
internet, earthquake statistics and prediction, freeway traffic, the human brain, or the formation of opin-
ions in social systems, to name just some of the popular applications.
Although their scope and methodologies overlap somewhat, one can distinguish the following main
concepts and tools: self-organization, nonlinear dynamics, synergetics, turbulence, dynamical systems,
catastrophes, instabilities, stochastic processes, chaos, graphs and networks, cellular automata, adaptive
systems, genetic algorithms and computational intelligence.
The two major book publication platforms of the Springer Complexity program are the monograph
series “Understanding Complex Systems” focusing on the various applications of complexity, and the
“Springer Series in Synergetics”, which is devoted to the quantitative theoretical and methodological foun-
dations. In addition to the books in these two core series, the program also incorporates individual titles
ranging from textbooks to major reference works.
Michael Wibral · Raul Vicente
Joseph T. Lizier
Editors
Directed Information
Measures in Neuroscience
Editors
Michael Wibral
Brain Imaging Center
Frankfurt am Main, Germany

Joseph T. Lizier
CSIRO Computational Informatics
Marsfield, Sydney, Australia
Raul Vicente
Max-Planck Institute for Brain Research
Frankfurt am Main
Germany
In scientific discourse and the media it is commonplace to state that brains exist to
‘process information’. Curiously enough, however, we only have a certain under-
standing of what is meant by this when we refer to some specific tasks solved by
information processing like perceiving or remembering objects, or making decisions
– just to name a few. Information processing itself is rather general, however, and it
seems much more difficult to quantify it exactly without resorting to specific tasks.
These difficulties arise mostly because only with specific tasks is it easy to restrict
the parts of the neural system to include in the analysis and to define the roles they
assume, e.g. as inputs or outputs (for the task under consideration). In contrast to
these difficulties that arise when trying to treat information processing in brains, we
have no difficulty quantifying information processing in a digital computer, e.g.
in terms of the information stored on its hard disk, or the amount of information
transferred per second from its hard disk to its random access memory, and then on
to the CPU. In the case of the digital computer it seems completely unnecessary to
resort to specific tasks to understand the general principles of information processing
implemented in this multi-purpose machine, and the components of its information pro-
cessing are easily quantified and also understood to some degree by almost every-
one. Why then is it so difficult to perform a similar quantification for biological, and
especially neural information processing?
One answer to this question is the conceptual difference between a digital com-
puter and a neural system: In a digital computer all components are laid out such
that they only perform specific operations on information: a hard disk should store
information, the CPU should quickly modify it, and system buses exist only to trans-
fer information. In contrast, in neural systems it is safe to assume that each agent
of the system (each neuron) simultaneously stores, transfers and modifies informa-
tion in variable amounts, and that these component processes are hard to separate
and quantify. This is because of the recurrent nature of neural circuits that defy the
traditional separation of inputs and outputs, and because the general ’computations’
that are performed may be of a nature that renders the explicit definition or analysis
of a ’code’ exceedingly difficult. Thus, while in digital computers the distinction
between information storage, transfer and modification comes practically for free,
in neuroscience. In the last part, the chapters by Lizier and Chicharro suggest two
new interesting contexts for the study of information transfer in neuroscience. The
chapter by Lizier shows how to quantify the dynamics of information transfer on a
local scale in space and time, thereby opening the possibility to follow information
processing step by step in time and node by node in space. Chicharro then points
out the relation between different measures of information transfer and criteria to
infer causal interactions in complex systems.
The editors and authors gratefully acknowledge the generous funding of the Land
Hessen via the LOEWE grant “Neuronale Koordination Forschungsschwerpunkt
Frankfurt (NeFF)” that sponsored a workshop that gave rise to this book. The editors
also acknowledge the help of Daniel Chicharro in reviewing.
This part of the book provides an introduction to the concepts of directed infor-
mation measures, especially transfer entropy, to the relation of causal interactions
and information transfer, and to practical aspects of estimating information theoretic
quantities from real-world data.
Transfer Entropy in Neuroscience

Michael Wibral
MEG Unit, Brain Imaging Center, Goethe University, Heinrich-Hoffmann Strasse 10,
60528 Frankfurt am Main, Germany
e-mail: wibral@em.uni-frankfurt.de

Raul Vicente
Max-Planck Institute for Brain Research, 60385 Frankfurt am Main, Germany
e-mail: raulvicente@gmail.com

Michael Lindner
School of Psychology and Clinical Language Science, University of Reading
e-mail: m.lindner@reading.ac.uk

1 Introduction

This chapter introduces transfer entropy, which to date is arguably the most widely
used directed information measure, especially in neuroscience. The presentation of
the basic concepts behind transfer entropy and a special section devoted to the cor-
rect interpretation of the measure are meant to prepare the reader for more in depth
treatments in chapters that follow. The chapter should also serve as a frame of refer-
ence for these more specialised treatments and present an overview of the field of
studies in information transfer. In this sense, it may be treated as both an opening
and a closing chapter to this volume.
Since its introduction by Paluš [55] and Schreiber [60] transfer entropy has
proven extremely useful in a wide variety of application scenarios ranging from
neuroscience [69, 73, 55, 66, 67, 68, 8, 1, 4, 6, 7, 17, 19, 40, 47, 52, 59, 62, 34, 22,
36, 5, 63, 27, 35], physiology [11, 13, 12], climatology [57], complex
systems theory [44, 45, 40] and other fields, such as economics [32, 29]. This wide
variety of application fields suggests that transfer entropy measures a useful and
fundamental quantity to understand complex systems, especially those that can be
conceptualized as some kind of network of interacting agents or processes. It is the
purpose of this chapter to organise the available material so as to help the reader in
understanding why transfer entropy is an indispensable tool in understanding com-
plex systems.
In the first section of this chapter we will introduce the fundamental concepts
behind transfer entropy, give a guide to its interpretation and will help to distinguish
it from measures of causal influences based on interventions. In the second section
we will then proceed to show how the concepts of transfer entropy can be cast into
an efficient estimator to obtain reliable transfer entropy values from empirical data,
and also consider aspects of computationally efficient implementations. The third
section deals with common problems and pitfalls encountered in transfer entropy
estimation. We will also briefly discuss two other directed information measures
that have been proposed for the analysis of information transfer – Marko’s directed
information and Pompe’s momentary information transfer. We will show in what
aspects they differ from transfer entropy, and what these differences mean for their
application in neuroscience. In the concluding remarks we will explain how transfer
entropy is much more than just the tool for model-free investigations of directed
interactions that it is often portrayed to be, and point out the important role it may
play even in the analysis of detailed neural models.
2 Concepts
H(X|Y) = \sum_{y \in A_Y} p(y) \sum_{x \in A_X} p(x|y) \log \frac{1}{p(x|y)} = \sum_{x \in A_X,\, y \in A_Y} p(x, y) \log \frac{1}{p(x|y)}   (5)
The conditional entropy H(X|Y ) is the average amount of information that we get
from making an observation of X after having already made an observation of Y . In
terms of uncertainties H(X|Y ) is the average remaining uncertainty in X once Y was
observed. We can also say H(X|Y ) is the information that is unique to X. Conditional
entropy is useful if we want to express the amount of information shared between
the two variables X, Y. This is because the shared information is the total average
information in the one variable H(X) minus the average information that is unique
to this variable, H(X|Y). Hence, we define mutual information as I(X;Y) = H(X) − H(X|Y).
being zero. It is also important to note that Wiener’s principle requires the best self-
prediction possible, as sub-optimal self prediction will lead to erroneously inflated
transfer entropy values [72, 69].
2.4 Interpretation of TE
After introducing the concept of transfer entropy and its relation to other
information-theoretic measures, it is important to now take a broader perspective
and describe the interpretation and the use of TE in the field of complex systems
analysis, including neuroscience, before turning to actual estimation techniques in
section 3, below.
In line with this simple example, several recent studies demonstrate clearly that
this measure should be strictly interpreted as predictive information transfer [45]
for at least four reasons:
1. An investigation of the presence of causal interactions will ultimately require
interventions to come to definite conclusions, as can be seen by a well-known
toy example (see figure 1). In fact, a causal measure that is intimately related to
TE, but employs Pearl’s ’do-formalism’ [56] for interventions, has been proposed
by Ay and Polani [3].
2. TE values are not a measure of ’causal effect size’, as noted by Chicharro and
colleagues [9]. Chicharro and colleagues found that the concept of a causal effect
size even lacked proper definition, and when defined properly did not align with
the quantities determined by measures of predictive information transfer such as
TE or other Wiener-type measures.
3. TE is not a measure of coupling strength and cannot be used to recover an es-
timate of a coupling parameter. This is illustrated by the fact that TE often de-
pends in a non-monotonic way on the coupling strengths between two systems.
For example, increasing the interaction strength between two systems may lead
to their complete synchronization. In this case, the systems’ dynamics are iden-
tical copies of each other, and information can not be transferred. Hence, TE is
zero by definition in this case and thus smaller than in cases with smaller cou-
pling strength and incomplete synchronization (see figure 1 in [28], and figure 1
in [24]).
4. Not all causal interactions in a system serve the purpose of information transfer
from the perspective of distributed computation, because some interactions serve
active information storage, rather than transfer, depending on network topology
[38], and dynamic regime [42].
The last item on this list deserves special attention as it points out a particular
strength of TE: It can differentiate between interactions in service of information
storage and those in service of information transfer. This differentiation is absolutely
crucial to understanding distributed computation in systems composed of many in-
teracting, similar agents that dynamically change their roles in a computation. Im-
portantly, this differentiation is not possible using measures of causal interactions
based on interventions [43], as these ultimately reveal physical interaction struc-
ture rather than computational structure. In neuroscience, this physical interaction
structure can be equated to anatomical connectivity at all spatial scales.
Another advantage of an information theoretic approach as compared to a causal
one arises when we want to understand a specific computation in a neural system
that specifically relies on the absence of interventions (e.g. spontaneous perceptual
switches). In this case the investigation of causal interactions could only be carried
out under certain fortunate circumstances that may be rarely met in neural systems
[9]. In contrast, an analysis of the information transfer underlying the computation
is still well defined in information theoretic terms [10] and fruitful as long as one is
aware of the conceptual difference between information transfer and causal interac-
tions.
Fig. 1 Causality without information transfer. Two example systems that demonstrate the
difference between causal interactions and information transfer. (A) A system of two nodes
where each node has only internal dynamics that make each node's state flip alternately
between the two states of a bit, 1 (black) and 0 (white). There are no causal interactions
between the nodes, and no information transfer (TE=0). (B) Another system with no internal
dynamics in the two nodes, but with mutual causal interactions that always impose the bit
state of the source node onto the target node at each update. In this example there is a causal
interaction, but again no information transfer (TE=0). Note that the states of the full system
of two nodes are identical to the ones in (A). (C) The same systems as in (B), but this time
’programmed’ with a different initial state (0,0). Example simplified from the one given in
[3].
2.4.3 Multivariate TE
So far we have mostly considered the information transfer from one source process
X to another target process Y. In neuroscience, however, we deal with networks com-
posed of many nodes, and therefore have to consider information transfer between
multiple processes. If we look at a target process Y and a set of source processes X(i) ,
3 Practical Application
In this section we now turn to the problem of obtaining transfer entropy values from
experimental data, and of judging the significance of these estimates.
where d denotes the embedding dimension, describing how many past time samples
are used, and τ denotes Takens' embedding delay, describing how far apart these
samples are in time (compare figure 2, A, where the relevant samples are spaced τ
time steps apart). The space containing all delay embedding vectors is the delay-
embedding space (compare figure 2, B). This delay-embedding space is the state
space of the process, if embedding successfully captured all past information in the
process that is relevant to its future.
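As a concrete illustration, the following short Python sketch (an illustrative helper written for this text, not code taken from any of the toolboxes discussed later) constructs such delay-embedding vectors from a scalar time series; d and tau correspond to the embedding dimension and embedding delay introduced above.

import numpy as np

def delay_embed(x, d, tau):
    # Build delay vectors (x[t], x[t-tau], ..., x[t-(d-1)*tau]);
    # row i corresponds to time index i + (d-1)*tau of the original series.
    x = np.asarray(x, dtype=float)
    n_states = len(x) - (d - 1) * tau
    if n_states <= 0:
        raise ValueError("time series too short for the chosen d and tau")
    cols = [x[i * tau: i * tau + n_states] for i in range(d)]
    return np.column_stack(cols)[:, ::-1]

# Example: embed a noisy sine wave with d = 3 and tau = 5.
t = np.arange(1000)
x = np.sin(0.1 * t) + 0.1 * np.random.randn(1000)
states = delay_embed(x, d=3, tau=5)   # shape (990, 3)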
The importance of a proper state-space reconstruction cannot be overstated as
insufficient state-space reconstruction may lead to false positive results and reversed
directions of information transfer (see [69] for toy examples).
2 Hence, these parameters do not feature explicitly in the TE estimation, but can be consid-
ered part of the algorithm itself.
where the parameter u is the assumed time that the information transfer needs to get
from X to Y , and the subscript SPO (for self prediction optimal) is a reminder that
the past state of Y, y_{t-1}^{d_y}, has to be constructed such that self prediction is optimal.
We can rewrite equation 11 using a representation in the form of four Shannon
(differential) entropies H(·), as:

TE_{SPO}(X \to Y, u) = H\!\left(y_{t-1}^{d_y}, x_{t-u}^{d_x}\right) - H\!\left(y_t, y_{t-1}^{d_y}, x_{t-u}^{d_x}\right) + H\!\left(y_t, y_{t-1}^{d_y}\right) - H\!\left(y_{t-1}^{d_y}\right)   (12)
TE(X \to Y, u) = \psi(k) + \left\langle \psi\!\left(n_{y_{t-1}^{d_y}} + 1\right) - \psi\!\left(n_{y_t y_{t-1}^{d_y}} + 1\right) - \psi\!\left(n_{y_{t-1}^{d_y} x_{t-u}^{d_x}} + 1\right) \right\rangle_t   (13)

TE(X \to Y, u) = \psi(k) - \frac{2}{k} + \left\langle \psi\!\left(n_{y_{t-1}^{d_y}}\right) - \psi\!\left(n_{y_t y_{t-1}^{d_y}}\right) + \frac{1}{n_{y_t y_{t-1}^{d_y}}} - \psi\!\left(n_{y_{t-1}^{d_y} x_{t-u}^{d_x}}\right) + \frac{1}{n_{y_{t-1}^{d_y} x_{t-u}^{d_x}}} \right\rangle_t   (14)
where ψ denotes the digamma function, while the angle brackets ⟨·⟩_t indicate an
averaging over different time points. The distances to the k-th nearest neighbour
in the highest dimensional space (spanned by y_t, y_{t-1}^{d_y}, x_{t-u}^{d_x}) define the diameter of
the hypercubes (or rectangles, for eq. 14) for the counting of the number of points
n(·) that are (1) strictly in these hypercubes (equation 13), or (2) inside or on the
borders of the hyper-rectangles (equation 14) around each state vector in all the
marginal spaces (·) involved. Equation 14 yields an estimator that is thought to be
more precise when very large sample sizes are available, whereas equation 13 yields
an estimator that is more robust when only small sample sizes are available, but has
more bias. Since bias problems can be handled based on surrogate data techniques
(see section 4.1), in neuroscience equation 13 seems to be the generally preferred
option.
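To make the counting logic behind equation 13 concrete, the following minimal Python sketch gives a non-optimized nearest-neighbour TE estimate of this type. It assumes NumPy and SciPy, uses the maximum norm, and is intended only as an illustration; the toolboxes discussed below provide far more efficient and carefully validated implementations.

import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def te_kraskov(x, y, dx=1, dy=1, tau=1, u=1, k=4):
    # Nearest-neighbour TE(X -> Y, u) following the structure of equation 13:
    # psi(k) + < psi(n_ypast + 1) - psi(n_yt_ypast + 1) - psi(n_ypast_xpast + 1) >_t
    x, y = np.asarray(x, float), np.asarray(y, float)
    start = max(1 + (dy - 1) * tau, u + (dx - 1) * tau)
    t = np.arange(start, len(y))
    y_t = y[t, None]
    y_past = np.column_stack([y[t - 1 - j * tau] for j in range(dy)])
    x_past = np.column_stack([x[t - u - j * tau] for j in range(dx)])

    joint = np.hstack([y_t, y_past, x_past])
    # Distance to the k-th neighbour in the joint space (max norm, self excluded).
    eps = cKDTree(joint).query(joint, k=k + 1, p=np.inf)[0][:, -1]

    def strict_counts(space):
        # Number of points strictly within eps around each point in a marginal space.
        tree = cKDTree(space)
        return np.array([len(tree.query_ball_point(p, r=e * (1 - 1e-12), p=np.inf)) - 1
                         for p, e in zip(space, eps)])

    n_ypast = strict_counts(y_past)
    n_yt_ypast = strict_counts(np.hstack([y_t, y_past]))
    n_ypast_xpast = strict_counts(np.hstack([y_past, x_past]))

    return digamma(k) + np.mean(digamma(n_ypast + 1)
                                - digamma(n_yt_ypast + 1)
                                - digamma(n_ypast_xpast + 1))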
[Figure 2: schematic illustration of delay embedding and of the distributions entering TE. (A) Source X and target Y time courses with states spaced τ apart, the prediction point Y_t and the assumed delay u; (B) the delay-embedding state space; (C) the joint distribution p(Y_t, Y_{t-1}, X_{t-u}) and, after normalization in the limit Δ→0, the conditional p(Y_t | X_{t-u}, Y_{t-1}); (D) the distribution p(Y_t, Y_{t-1}) obtained by ignoring X_{t-u}; (E) comparison of the conditional distributions p(Y_t | X_{t-u}, Y_{t-1}) and p(Y_t | Y_{t-1}).]
“flattened” by simply ignoring the x-related coordinates of all points (Figure 2, D),
and the distribution p(y_t | y_{t-1}^{d_y}) is obtained (again shown for a binning approach).
Last, the obtained x-dependent conditional distributions (Figure 2,E) are compared
Given enough data, this estimation of the information transfer delay δ works
robustly, and can even separate out differential delays for the two directions of
transfer between two bidirectionally coupled systems (figure 4) – as long as the
Fig. 3 Illustration of the idea behind interaction delay reconstruction using the TESPO
estimator. (A) Scalar time courses of processes X,Y coupled X → Y with delay δ , as indi-
cated by the solid arrow. Light grey boxes with circles indicate data belonging to a certain
state of the respective process. The star on the Y time series indicates the scalar observation
y(t) to be predicted in Wiener’s sense. Three settings for the delay parameter u are depicted:
(1) u < δ – u is chosen such that influences of the state X(t − u1 ) on Y arrive in the future
of the prediction point. Hence, the information in this state is useless and yields no transfer
entropy. (2) u = δ – u is chosen such that influences of the state X(t − u2 ) arrive exactly at
the prediction point, and influence it. Information about this state is useful, and we obtain
non-zero transfer entropy. (3) u > δ – u is chosen such that influences of the state X(t − u3 )
arrive in the far past of the prediction point. This information is already available in the past of
the states of Y that we condition upon in TE_SPO. Information about this state is useless again,
and we obtain zero transfer entropy. (B) Depiction of the same idea in a more detailed view,
depicting states (grey boxes) of X and the samples of the most informative state (black cir-
cles) and non-informative states (white circles). The the curve in the left column indicates
the approximate dependency of T ESPO versus u. The solid black circles on the curves on the
left indicate the TE value obtained with the respective states on the right. Modified from [72].
Creative Commons Attribution (CC-BY) license.
Fig. 5 Interaction delay reconstruction in the turtle brain. (A) Electroretinogram (green),
and LFP recordings (blue), light pulses are marked by yellow boxes. (B) Schematic depiction
of stimulation and recording, including the investigated interactions and the identified delays.
Modified from [72], Creative Commons Attribution (CC-BY) license.
search have been developed recently (see [75], and http://www.trentool.de). Open
source toolboxes that already include these algorithms offer an elegant way to save
on coding work here, and typically provide code that is tested thoroughly.
Toolboxes differ in what type of data they can handle (discrete or continuous
valued), how they deal with multivariate time series in the input to avoid the detec-
tion of spurious information transfer (from approximate algorithms to a fully multivariate treatment),
which estimators are implemented (binned, kernel, Kraskov), how efficient their im-
plementation is (algorithms for nearest neighbour search, parallel computing on GPU
or CPU), what preprocessing tools they offer for state space reconstruction, and how
flexibly the creation of surrogate data and statistical tests is handled.
At the time of writing the most established toolboxes for TE analysis of
neural data seem to be TRENTOOL (www.trentool.de) [36], a MATLAB® tool-
box, the transfer-entropy-toolbox (TET) (http://code.google.com/p/transfer-entropy-
toolbox/) [27], which provides C code callable from MATLAB® (mex-files) and
3 Spiking data can be analysed after convolution with a kernel modelling post-synaptic
potentials.
transfer entropy TE(A → C) and TE(B → C) is zero, while the transfer entropy
TE(A, B → C) from the joint process (A, B) to the target C is non-zero.
While the first two problems are widely recognized, the last problem seems to
be less well known, potentially because synergies and redundancies were defined
in various ways in the past and a satisfactory axiomatic definition of synergies and
redundancies has only emerged recently [77, 39, 21, 25].
To address the first and second problem, it was proposed to reconstruct the timing
of information transfers in a bivariate analysis [72] and to then identify cascade and
common driver effects based on their signature in the graph of delays: for cascade
effects the spurious link has a delay that is equal to the sum of the delays on the true
path; for common driver effects, the difference of the summed delays on the driving
paths is equal to the delay of the spurious link [74]. If a link meets neither of these
two conditions it cannot be due to cascade or common driver effects.
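As a toy illustration of these two criteria (a hypothetical helper, not the published procedure of [74]), each reconstructed link can be checked against the delays along alternative paths:

def classify_link(delays, link, tol=0):
    # delays: dict mapping (source, target) -> reconstructed interaction delay.
    # Cascade: delay(A->C) equals delay(A->B) + delay(B->C) for some relay B.
    # Common driver: delay(A->B) equals delay(D->B) - delay(D->A) for some driver D.
    src, tgt = link
    nodes = {n for pair in delays for n in pair}
    for other in nodes - {src, tgt}:
        if (src, other) in delays and (other, tgt) in delays:
            if abs(delays[link] - (delays[(src, other)] + delays[(other, tgt)])) <= tol:
                return "possible cascade effect"
        if (other, src) in delays and (other, tgt) in delays:
            if abs(delays[link] - (delays[(other, tgt)] - delays[(other, src)])) <= tol:
                return "possible common driver effect"
    return "no cascade/common-driver signature"

# Example: A drives B with delay 3, B drives C with delay 5; a purely bivariate
# analysis would also report a spurious A -> C link with delay 8.
print(classify_link({("A", "B"): 3, ("B", "C"): 5, ("A", "C"): 8}, ("A", "C")))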
To address the third problem, Lizier and colleagues proposed an ap-
proximate greedy approach to a fully multivariate analysis [46]. This approach tries
recursively to find for each target node all source nodes, or combinations thereof,
that have significant information transfer into that target node – conditional on the
information provided by other nodes with significant information transfer that have
already been included. The approach also solves the first and second problem. It is
an approximation to a fully multivariate approach. In a fully multivariate approach,
TE would be evaluated for each pair of source and target, conditioned on the past of
all other processes in the network. In practice, however, this ’approximation’ even
yields more accurate results than the fully multivariate approach, because it is more
data efficient and therefore more robust for small sample sizes.
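The control flow of such a greedy selection can be sketched as follows; te_conditional and is_significant are hypothetical stand-ins for a conditional TE estimator and a surrogate-based significance test, and the sketch only conveys the iteration logic, not the details of the published algorithm [46].

def greedy_source_selection(candidates, target, te_conditional, is_significant):
    # Greedily build the set of sources with significant conditional TE into `target`.
    # te_conditional(source, target, conditioning_set) -> TE estimate (float)
    # is_significant(source, target, conditioning_set) -> bool (e.g. surrogate test)
    selected = []
    remaining = list(candidates)
    while remaining:
        # Pick the candidate with the largest TE conditional on the sources found so far.
        scores = {s: te_conditional(s, target, selected) for s in remaining}
        best = max(scores, key=scores.get)
        if not is_significant(best, target, selected):
            break  # no remaining candidate adds significant information transfer
        selected.append(best)
        remaining.remove(best)
    return selected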
More in-depth treatments of multivariate transfer entropy can be found in the
chapters by Lizier and Faes in this book.
where we assume that N^{(X)} and N^{(Y)} are statistically independent noise processes.
The most important practical problem arising from observation noise is that Markov
processes X and Y are transformed into hidden Markov processes X̃ and Ỹ of which
the states are not easily reconstructed. Without properly reconstructed states, how-
ever, transfer entropy estimation may fail or produce spurious results (e.g. [69]). A
proper analytical treatment of transfer entropy on noisy variables is hampered by the
fact that the Shannon entropy of the sum of two variables H(X + N_t^{(X)}) cannot be
decomposed into terms containing just one or the other variable. In fact, the entropy
H(X + Y ) for two random variables can be infinite or zero, even if both entropies
H(X) and H(Y ) exist and are finite [33].
In the face of lacking analytical approaches to the problem, simulation studies
must demonstrate the applicability of TE estimation. Indeed, it was shown that both
transfer entropy estimation and the reconstruction of information transfer delays are
quite robust under Gaussian, white noise [72, 69, 36]. Nevertheless, simulations for
other typical (neuro-)physiological noise profiles seem warranted.
I(X_{t-u}, Y_t) = \sum_{x_{t-u},\, y_t} p(x_{t-u}, y_t) \log \frac{p(x_{t-u}, y_t)}{p(x_{t-u})\, p(y_t)}   (18)
What exactly is lost if we simply consider this measure of shared information, where
the mutual information is not conditional on the past of the target process as in TE
(equation 8)?
One answer to this question was already given by Schreiber in the initial paper
that introduced TE [60]. Schreiber pointed out that the additional conditioning on
the past of the target that is included in TE, but not in the time-lagged mutual in-
formation, creates a measure of the influence of the past of the source process on
the state-transitions occurring in the target process. This adds a dynamical systems
aspect to the measure [60]. This dynamical systems aspect is also closely related
to the notion of state-dependent influences from control theory as pointed out by
Williams and Beer [78].
In more detail, only conditioning on the past of the target reveals synergistic in-
formation transfer from the past of the target and the source jointly to the future of
the target (state-dependent transfer entropy, [78]) and also removes redundant
information between the past of source and target (see section 2.4.2 for more de-
tails). As a consequence, it is easily possible to construct two processes X, Y , such
that the time-lagged mutual information between them is always zero whereas there
is non-zero TE. For example, we may choose the source process X as being com-
posed of random variables Xt that are independent identically distributed random
bits, and to construct Y such that Y0 is also a random bit, whereas all other Yt are con-
structed such that their outcomes are determined by an exclusive OR-operation:
y_t = XOR(x_{t-1}, y_{t-1}). In this example it can be easily verified that I(X_{t-u}, Y_t) = 0 for
all time-lags u, whereas TE_SPO(X → Y, 1) = 1 bit.
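This example is easy to verify numerically with plug-in (histogram) estimates on simulated binary sequences. The short Python sketch below, written only for this illustration, estimates the lagged mutual information and the history-one transfer entropy and returns values close to 0 bit and 1 bit, respectively.

import numpy as np

def entropy(*cols):
    # Plug-in Shannon entropy (in bits) of the joint distribution of the given columns.
    joint = np.stack(cols, axis=1)
    _, counts = np.unique(joint, axis=0, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

rng = np.random.default_rng(0)
n = 100_000
x = rng.integers(0, 2, n)                    # i.i.d. random bits
y = np.zeros(n, dtype=int)
y[0] = rng.integers(0, 2)
for t in range(1, n):
    y[t] = x[t - 1] ^ y[t - 1]               # y_t = XOR(x_{t-1}, y_{t-1})

# Time-lagged mutual information I(X_{t-1}; Y_t):
mi = entropy(x[:-1]) + entropy(y[1:]) - entropy(x[:-1], y[1:])
# TE(X -> Y, u=1) with one sample of history:
# H(Y_t | Y_{t-1}) - H(Y_t | Y_{t-1}, X_{t-1})
te = (entropy(y[1:], y[:-1]) - entropy(y[:-1])) \
     - (entropy(y[1:], y[:-1], x[:-1]) - entropy(y[:-1], x[:-1]))
print(mi, te)                                # approximately 0 and 1 bit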
Last, conditioning on the past of the target process is necessary to separate in-
formation transfer and information storage in the sense of component processes of
distributed computation (see section 2.4.1, and the chapter by Lizier in this book for
more details).
For this system with a known and highly specific causal structure, Massey wanted
to find a more precise bound for the information that could be transmitted through
this channel when feedback was present, because information theory had not con-
sidered feedback as part of a communication channel correctly before [50]. While
for the channel without feedback the mutual information between input and output
I(X N ;Y N ) is an upper bound for the information that can be transmitted, I(U K ;V K ),
Massey could show that in the presence of feedback a tighter limit on the transmit-
table information holds [50]:
all four contributions on the other hand is also not an option, because information is
accounted for multiple times this way.
Therefore, authors applying DI to systems with unknown dependency structure,
such as neural systems, often modify DI, or rather the interpretation thereof, by in-
terpreting the index n as physical time (’t’) rather than a channel-use index, and by
subsequently stripping the ’instantaneous’ information transfer from Xn to Yn of its
directedness, based on the argument that Xn and Yn ’happen simultaneously’ [1, 2].
While this indeed yields a useful measure for neuroscience, it is a clear violation of
the original ideas by Massey. Therefore, the reinterpreted measure should perhaps
be given another name to highlight the fact that instantaneous causality is seen as
uninterpretable in terms of a direction in the new measure, a problem that Massey
did not face because of indexing by channel-use, and the prespecified causal struc-
ture in his use case. We leave the renaming of Massey’s directed information to
the community and simply refer to this new interpretation by DI’. Using this new
interpretation of directed information one can show that [2]:
I(X^N; Y^N) = DI'(X^{N-1} \to Y^{N-1}) + DI'(Y^{N-1} \to X^{N}) + DI'(X^{N} \leftrightarrow Y^{N-1})   (21)

DI'(X^{N} \leftrightarrow Y^{N}) := \sum_{i=1}^{N} I(x_i; y_i \mid y^{i-1}, x^{i-1})   (22)

which is a useful decomposition of the mutual information into two directed infor-
mation transfers and a contribution of the undirected, instantaneously shared infor-
mation, called instantaneous information exchange, DI'(X^{N-1} ↔ Y^{N-1}). Further-
more it can be shown that in the limit of t → ∞, the rates of the directed parts in
equation 21 are nothing but the transfer entropy rates for X → Y and Y → X [1, 2].
In sum, most of the confusion about the use of transfer entropy and directed
information arises because the use case for directed information is no longer the one
intended by Massey, and indeed the measure has been changed via reinterpretation,
while references are still made to the original use case and claims by Massey.
That is, while MIT retains the conditioning on the immediately previous state
of the target, Y_{t-1}, that is used in TE_SPO, MIT additionally conditions on the state
variable of the source, X_{t-u-1}, immediately preceding the scalar source observation
under consideration, X_{t-u}.
The essence of Pompe and Runge’s argument is that their conditioning on Xt−u−1
seeks to find the delay over which the transferred information is first available in
the source. While the measure does indeed have this property, we note that for a
measure of information transfer delays it is important to identify the point in time
when the information in the source is most relevant to predict the future of the target,
as was shown by mathematical proof in [72]. As shown by example in the same
study, MIT may therefore slightly misidentify information transfer delays, yielding
inflated delay values. The mathematical reason for this is the removal of memory
in the source via the additional conditioning before determining the information
transfer.
the case of neuroscience in an elegant treatment by David Marr in his book Vision
[49]:
• The computational level: What is computed by the neural system, and why is this
computation ecologically relevant to the organism?
• The algorithmic level: What representations of quantities of the outside world
exist (in the neural system) and in what algorithms are they used?
• The implementation level: How are these algorithms implemented in the bio-
physics of the neural system?
As noted already by Marr and later emphasized by Poggio (in the afterword added
to [49]), these levels of understanding only loosely constrain each other as any re-
alization at one level may map to multiple possibilities at the other levels. Poggio
also emphasized the need for analysis approaches that bring the levels closer to-
gether again, after their initial separation brought clarity to neuroscientific study.
If we take into account that transfer entropy quantifies the amount of information
transferred in service of a computation, we see that the analysis of transfer entropy
in a neural system uses data from the implementational level but gives constraints
on the algorithms the system runs. This way, transfer entropy effectively links the
implementation level to the algorithmic level – and does so both for empirical data
and models. As models offer the possibility of virtually unlimited access to data,
and as this is highly beneficial for reliable analyses of information theoretic meth-
ods, we think that the understanding of neural systems will strongly profit from the
application of transfer entropy analysis specifically to data from detailed, large scale
neural simulations that will become available in the near future.
References
1. Amblard, P.O., Michel, O.J.J.: On directed information theory and Granger causality
graphs. J. Comput. Neurosci. 30(1), 7–16 (2011)
2. Amblard, P.O., Michel, O.J.J.: The relation between Granger causality and directed in-
formation theory: A review. Entropy 15(1), 113–143 (2012)
3. Ay, N., Polani, D.: Information flows in causal networks. Adv. Complex Syst. 11, 17
(2008)
4. Barnett, L., Barrett, A.B., Seth, A.K.: Granger causality and transfer entropy are equiva-
lent for Gaussian variables. Phys. Rev. Lett. 103(23), 238701 (2009)
5. Battaglia, D., Witt, A., Wolf, F., Geisel, T.: Dynamic effective connectivity of inter-areal
brain circuits. PLoS Comput. Biol. 8(3), e1002 (2012)
6. Besserve, M., Schölkopf, B., Logothetis, N.K., Panzeri, S.: Causal relationships between
frequency bands of extracellular signals in visual cortex revealed by an information the-
oretic analysis. J. Comput. Neurosci. 29(3), 547–566 (2010)
7. Bühlmann, A., Deco, G.: Optimal information transfer in the cortex through synchro-
nization. PLoS Comput. Biol. 6(9), e1000934 (2010)
8. Chávez, M., Martinerie, J., Le Van Quyen, M.: Statistical assessment of nonlinear causal-
ity: application to epileptic EEG signals. J. Neurosci. Methods 124(2), 113–128 (2003)
9. Chicharro, D., Ledberg, A.: When two become one: the limits of causality analysis of
brain dynamics. PLoS One 7(3), e32466 (2012)
10. Cover, T.M., Thomas, J.A.: Elements of information theory. Wiley-Interscience, New
York (1991)
11. Faes, L., Nollo, G.: Bivariate nonlinear prediction to quantify the strength of complex
dynamical interactions in short-term cardiovascular variability. Med. Biol. Eng. Com-
put. 44(5), 383–392 (2006)
12. Faes, L., Nollo, G., Porta, A.: Information-based detection of nonlinear granger causal-
ity in multivariate processes via a nonuniform embedding technique. Phys. Rev. E Stat.
Nonlin. Soft. Matter Phys. 83(5 Pt. 1), 051112 (2011)
13. Faes, L., Nollo, G., Porta, A.: Non-uniform multivariate embedding to assess the infor-
mation transfer in cardiovascular and cardiorespiratory variability series. Comput. Biol.
Med. 42(3), 290–297 (2012)
14. Faes, L., Nollo, G., Porta, A.: Compensated transfer entropy as a tool for reliably esti-
mating information transfer in physiological time series. Entropy 15(1), 198–219 (2013)
15. Felts, P.A., Baker, T.A., Smith, K.J.: Conduction in segmentally demyelinated mam-
malian central axons. J. Neurosci. 17(19), 7267–7277 (1997)
16. Freiwald, W.A., Valdes, P., Bosch, J., Biscay, R., Jimenez, J.C., Rodriguez, L.M., Ro-
driguez, V., Kreiter, A.K., Singer, W.: Testing non-linearity and directedness of interac-
tions between neural groups in the macaque inferotemporal cortex. J. Neurosci. Meth-
ods 94(1), 105–119 (1999)
17. Garofalo, M., Nieus, T., Massobrio, P., Martinoia, S.: Evaluation of the performance
of information theory-based methods and cross-correlation to estimate the functional
connectivity in cortical networks. PLoS One 4(8), e6482 (2009)
18. Gomez-Herrero, G., Wu, W., Rutanen, K., Soriano, M.C., Pipa, G., Vicente, R.: Assess-
ing coupling dynamics from an ensemble of time series. arXiv preprint arXiv:1008.0539
(2010)
19. Gourevitch, B., Eggermont, J.J.: Evaluating information transfer between auditory corti-
cal neurons. J. Neurophysiol. 97(3), 2533–2543 (2007)
20. Gray, C.M., König, P., Engel, A.K., Singer, W.: Oscillatory responses in cat visual cortex
exhibit inter-columnar synchronization which reflects global stimulus properties. Na-
ture 338(6213), 334–337 (1989)
21. Griffith, V., Koch, C.: Quantifying synergistic mutual information. In: Prokopenko, M.
(ed.) Guided Self-Organization: Inception, pp. 159–190. Springer, Heidelberg (2014)
22. Hadjipapas, A., Hillebrand, A., Holliday, I.E., Singh, K.D., Barnes, G.R.: Assessing in-
teractions of linear and nonlinear neuronal sources using MEG beamformers: a proof of
concept. Clin. Neurophysiol. 116(6), 1300–1313 (2005)
23. Hahs, D.W., Pethel, S.D.: Distinguishing anticipation from causality: anticipatory bias in
the estimation of information flow. Phys. Rev. Lett. 107(12), 128701 (2011)
24. Hahs, D.W., Pethel, S.D.: Transfer entropy for coupled autoregressive processes. En-
tropy 15(3), 767–788 (2013)
25. Harder, M., Salge, C., Polani, D.: Bivariate measure of redundant information. Phys. Rev.
E Stat. Nonlin. Soft Matter Phys. 87(1), 012130 (2013)
26. Hebb, D.O.: The organization of behavior: A neuropsychological theory. Wiley, New
York (1949)
27. Ito, S., Hansen, M.E., Heiland, R., Lumsdaine, A., Litke, A.M., Beggs, J.M.: Extending
transfer entropy improves identification of effective connectivity in a spiking cortical
network model. PLoS One 6(11), e27431 (2011)
28. Kaiser, A., Schreiber, T.: Information transfer in continuous processes. Physica D 166,
43 (2002)
29. Kim, J., Kim, G., An, S., Kwon, Y.K., Yoon, S.: Entropy-based analysis and
bioinformatics-inspired integration of global economic information transfer. PLoS
One 8(1), e51986 (2013)
30. Kozachenko, L., Leonenko, N.: Sample estimate of entropy of a random vector. Probl.
Inform. Transm. 23, 95–100 (1987)
31. Kraskov, A., Stoegbauer, H., Grassberger, P.: Estimating mutual information. Phys. Rev.
E Stat. Nonlin. Soft Matter Phys. 69(6 Pt. 2), 066138 (2004)
32. Kwon, O., Yang, J.S.: Information flow between stock indices. EPL (Europhysics Let-
ters) 82(6), 68003 (2008)
33. Lapidoth, A., Pete, G.: On the entropy of the sum and of the difference of independent
random variables. In: IEEE 25th Convention of Electrical and Electronics Engineers in
Israel, IEEEI 2008, pp. 623–625. IEEE (2008)
34. Leistritz, L., Hesse, W., Arnold, M., Witte, H.: Development of interaction measures
based on adaptive non-linear time series analysis of biomedical signals. Biomed. Tech.
(Berl.) 51(2), 64–69 (2006)
35. Li, X., Ouyang, G.: Estimating coupling direction between neuronal populations with
permutation conditional mutual information. NeuroImage 52(2), 497–507 (2010)
36. Lindner, M., Vicente, R., Priesemann, V., Wibral, M.: Trentool: A Matlab open source
toolbox to analyse information flow in time series data with transfer entropy. BMC Neu-
rosci. 12(119), 1–22 (2011)
37. Lizier, J.: The Local Information Dynamics of Distributed Computation in Complex Sys-
tems. Springer theses. Springer (2013)
38. Lizier, J.T., Atay, F.M., Jost, J.: Information storage, loop motifs, and clustered structure
in complex networks. Phys. Rev. E Stat. Nonlin. Soft Matter Phys. 86(2 Pt. 2), 026110
(2012)
39. Lizier, J.T., Flecker, B., Williams, P.L.: Towards a synergy-based approach to measuring
information modification. In: Proceedings of the 2013 IEEE Symposium on Artificial
Life (ALIFE), pp. 43–51. IEEE (2013)
40. Lizier, J.T., Heinzle, J., Horstmann, A., Haynes, J.D., Prokopenko, M.: Multivariate
information-theoretic measures reveal directed information structure and task relevant
changes in fMRI connectivity. J. Comput. Neurosci. 30(1), 85–107 (2011)
41. Lizier, J.T., Mahoney, J.R.: Moving frames of reference, relativity and invariance in
transfer entropy and information dynamics. Entropy 15(1), 177–197 (2013)
42. Lizier, J.T., Pritam, S., Prokopenko, M.: Information dynamics in small-world Boolean
networks. Artif. Life 17(4), 293–314 (2011)
43. Lizier, J.T., Prokopenko, M.: Differentiating information transfer and causal effect. Eur.
Phys. J. B 73, 605–615 (2010)
44. Lizier, J.T., Prokopenko, M., Zomaya, A.Y.: Local information transfer as a spatiotem-
poral filter for complex systems. Phys. Rev. E 77(2 Pt. 2), 026110 (2008)
45. Lizier, J.T., Prokopenko, M., Zomaya, A.Y.: Information modification and particle colli-
sions in distributed computation. Chaos 20(3), 037109 (2010)
46. Lizier, J.T., Rubinov, M.: Multivariate construction of effective computational networks
from observational data. Max Planck Preprint 25/2012. Max Planck Institute for Mathe-
matics in the Sciences (2012)
47. Lüdtke, N., Logothetis, N.K., Panzeri, S.: Testing methodologies for the nonlinear anal-
ysis of causal relationships in neurovascular coupling. Magn. Reson. Imaging 28(8),
1113–1119 (2010)
48. Marko, H.: The bidirectional communication theory–a generalization of information the-
ory. IEEE Transactions on Communications 21(12), 1345–1351 (1973)
49. Marr, D.: Vision: A Computational Investigation into the Human Representation and
Processing of Visual Information. Henry Holt and Co. Inc., New York (1982)
50. Massey, J.: Causality, feedback and directed information. In: Proc. Int. Symp. Informa-
tion Theory Application (ISITA 1990), pp. 303–305 (1990)
51. Merkwirth, C., Parlitz, U., Lauterborn, W.: Fast nearest-neighbor searching for nonlinear
signal processing. Phys. Rev. E Stat. Phys. Plasmas. Fluids Relat. Interdiscip. Topics 62(2
Pt. A), 2089–2097 (2000)
52. Neymotin, S.A., Jacobs, K.M., Fenton, A.A., Lytton, W.W.: Synaptic information trans-
fer in computer models of neocortical columns. J. Comput. Neurosci. 30(1), 69–84
(2011)
53. Nolte, G., Ziehe, A., Nikulin, V.V., Schlogl, A., Kramer, N., Brismar, T., Muller, K.R.:
Robustly estimating the flow direction of information in complex physical systems. Phys.
Rev. Lett. 100(23), 234101 (2008)
54. Oostenveld, R., Fries, P., Maris, E., Schoffelen, J.M.: Fieldtrip: Open source software for
advanced analysis of MEG, EEG, and invasive electrophysiological data. Comput. Intell.
Neurosci. 2011, 156869 (2011)
55. Paluš, M.: Synchronization as adjustment of information rates: detection from bivariate
time series. Phys. Rev. E 63, 046211 (2001)
56. Pearl, J.: Causality: models, reasoning, and inference. Cambridge University Press
(2000)
57. Pompe, B., Runge, J.: Momentary information transfer as a coupling measure of time
series. Phys. Rev. E Stat. Nonlin. Soft Matter Phys. 83(5 Pt. 1), 051122 (2011)
58. Ragwitz, M., Kantz, H.: Markov models from data by simple nonlinear time series pre-
dictors in delay embedding spaces. Phys. Rev. E Stat. Nonlin. Soft Matter Phys. 65(5 Pt.
2), 056201 (2002)
59. Sabesan, S., Good, L.B., Tsakalis, K.S., Spanias, A., Treiman, D.M., Iasemidis, L.D.:
Information flow and application to epileptogenic focus localization from intracranial
EEG. IEEE Trans. Neural. Syst. Rehabil. Eng. 17(3), 244–253 (2009)
60. Schreiber, T.: Measuring information transfer. Phys. Rev. Lett. 85(2), 461–464 (2000)
61. Small, M., Tse, C.: Optimal embedding parameters: a modelling paradigm. Physica D:
Nonlinear Phenomena 194, 283–296 (2004)
62. Staniek, M., Lehnertz, K.: Symbolic transfer entropy: inferring directionality in biosig-
nals. Biomed. Tech (Berl.) 54(6), 323–328 (2009)
63. Stetter, O., Battaglia, D., Soriano, J., Geisel, T.: Model-free reconstruction of excita-
tory neuronal connectivity from calcium imaging signals. PLoS Comput. Biol. 8(8),
e1002653 (2012)
64. Sun, L., Grützner, C., Bölte, S., Wibral, M., Tozman, T., Schlitt, S., Poustka, F., Singer,
W., Freitag, C.M., Uhlhaas, P.J.: Impaired gamma-band activity during perceptual orga-
nization in adults with autism spectrum disorders: evidence for dysfunctional network
activity in frontal-posterior cortices. J. Neurosci. 32(28), 9563–9573 (2012)
65. Takens, F.: Detecting Strange Attractors in Turbulence. In: Dynamical Systems and
Turbulence, Warwick. Lecture Notes in Mathematics, vol. 898, pp. 366–381. Springer
(1980)
66. Vakorin, V.A., Kovacevic, N., McIntosh, A.R.: Exploring transient transfer entropy based
on a group-wise ICA decomposition of EEG data. Neuroimage 49(2), 1593–1600 (2010)
67. Vakorin, V.A., Krakovska, O.A., McIntosh, A.R.: Confounding effects of indirect con-
nections on causality estimation. J. Neurosci. Methods 184(1), 152–160 (2009)
68. Vakorin, V.A., Mišić, B., Krakovska, O., McIntosh, A.R.: Empirical and theoretical aspects
of generation and transfer of information in a neuromagnetic source network. Front Syst.
Neurosci. 5, 96 (2011)
69. Vicente, R., Wibral, M., Lindner, M., Pipa, G.: Transfer entropy – a model-free measure
of effective connectivity for the neurosciences. J. Comput. Neurosci. 30(1), 45–67 (2011)
70. Victor, J.: Binless strategies for estimation of information from neural data. Phys. Rev.
E 72, 051903 (2005)
71. Whitford, T.J., Ford, J.M., Mathalon, D.H., Kubicki, M., Shenton, M.E.: Schizophrenia,
myelination, and delayed corollary discharges: a hypothesis. Schizophr Bull. 38(3), 486–
494 (2012)
72. Wibral, M., Pampu, N., Priesemann, V., Siebenhühner, F., Seiwert, H., Lindner, M., Lizier,
J.T., Vicente, R.: Measuring information-transfer delays. PLoS One 8(2), e55809 (2013)
73. Wibral, M., Rahm, B., Rieder, M., Lindner, M., Vicente, R., Kaiser, J.: Transfer entropy
in magnetoencephalographic data: Quantifying information flow in cortical and cerebel-
lar networks. Prog. Biophys. Mol. Biol. 105(1-2), 80–97 (2011)
74. Wibral, M., Wollstadt, P., Meyer, U., Pampu, N., Priesemann, V., Vicente, R.: Revisiting
Wiener’s principle of causality – interaction-delay reconstruction using transfer entropy
and multivariate analysis on delay-weighted graphs. Conf. Proc. IEEE Eng. Med. Biol.
Soc. 2012, 3676–3679 (2012)
75. Wollstadt, P., Martínez-Zarzuela, M., Vicente, R., Díaz-Pernas, F., Wibral, M.: Ef-
ficient transfer entropy analysis of non-stationary neural time series. arXiv preprint
arXiv:1401.4068 (2014)
76. Wiener, N.: The theory of prediction. In: Beckenbach, E.F. (ed.) Modern Mathematics
for the Engineer. McGraw-Hill, New York (1956)
77. Williams, P.L., Beer, R.D.: Nonnegative decomposition of multivariate information.
arXiv preprint arXiv:1004.2515 (2010)
78. Williams, P.L., Beer, R.D.: Generalized measures of information transfer. arXiv preprint
arXiv:1102.1507 (2011)
79. Wolfram, S.: A new kind of science. Wolfram Media, Champaign (2002)
Efficient Estimation of Information Transfer

Raul Vicente
Max-Planck Institute for Brain Research, 60385 Frankfurt am Main, Germany
e-mail: raulvicente@gmail.com

Michael Wibral
MEG Unit, Brain Imaging Center, Goethe University, Heinrich-Hoffmann Strasse 10,
60528 Frankfurt am Main, Germany
e-mail: wibral@em.uni-frankfurt.de
Abstract. Any measure of interdependence can lose much of its appeal due to a
poor choice of its numerical estimator. Information theoretic functionals are partic-
ularly sensitive to this problem, especially when applied to noisy signals of only
a few thousand data points or less. Unfortunately, this is a common scenario in
applications to electrophysiology data sets. In this chapter, we will review the state-
of-the-art estimators based on nearest-neighbor statistics for information transfer
measures. Nearest-neighbor techniques are more data-efficient than naive partition
or histogram estimators and rely on milder assumptions than parametric approaches.
However, they also come with limitations and several parameter choices that influ-
ence the numerical estimation of information theoretic functionals. We will describe
step by step the efficient estimation of transfer entropy for a typical electrophysi-
ology data set, and how the multi-trial structure of such data sets can be used to
partially alleviate the problem of non-stationarity.
1 Introduction
Inferring interdependencies between subsystems from empirical data is a com-
mon task across different fields of science. In neuroscience the subsystems from
which we would like to infer an interdependency can consist of a set of stimuli
and a region of the brain [1], two regions of the brain [2], or even two frequency
bands registered at the same brain region [3]. An important characterization of di-
rected dependency is the information transfer between subsystems, especially when
describing the information processing capabilities of a system [4, 5]. The success of
this task crucially depends not only on the quality of the data but on the numerical
estimator of the interdependency measure [6]. In this chapter we will review the dif-
ferent stages in obtaining a numerical estimate of information transfer, as measured
by transfer entropy, from a typical electrophysiology data set. Specifically, in Sec-
tion 2 we answer why transfer entropy is used as a quantifier of information transfer.
Next, we describe different strategies to estimate transfer entropy along with their
advantages and drawbacks. Section 4 explains step by step the procedure to numer-
ically estimate transfer entropy from nearest neighbor statistics. The section covers
from the choice of parameters for the embedding of raw time series to the testing of
statistical significance. In Section 5, we illustrate how to integrate multi-trial infor-
mation to improve the temporal resolution of transfer entropy. Finally, in Section 6
we briefly discuss the current status of the field and some future developments that
will be needed for moving forward the application of information transfer measures
in neuroscience.
Taking into account these conceptual difficulties and severe measurement limita-
tions (experimental recordings only capture a heavily subsampled and coarse-grained
version of the underlying neural processes), it is probably safe to say that many over-
interpretations can be found in results applying information theory to systems
neuroscience and other fields [20]. This justifies in part the skepticism that the
application of information theory to neural data has generated among rigorous
information theoreticians [21, 22]. However, information theory, even in its most
classical and simple framework, can still provide very useful insights and lower
bounds on fundamental quantities characterizing the transmission of information.
The latter aspect is the basis of many analyses that try to determine the flexible routing of information
across brain areas on top of its anatomical architecture. To this end a generalization
of mutual information named transfer entropy (TE) has become the tool of choice.
T E (X → Y ) = MI(Y + ; X − |Y − ) , (3)
where the superscripts + and − denote adequate future and past state reconstruc-
tions of the respective random variables.
The conditioning in the former equation equips transfer entropy with several ad-
vantages over the unconditioned mutual information to describe information trans-
fer. First, it enables transfer entropy to consider transition between states and thus
incorporates the dynamics of the processes. Second, transfer entropy is inherently
asymmetric with respect to the exchange of X and Y and thus can distinguish the two
possible directions of interaction. These two properties allow one to assess the di-
rected information being dynamically transferred between two process as opposed
to the information being merely statically shared [23]. This can also be observed
from rewriting Eq. 3 as
TE(X \to Y) = MI\!\left(Y^{+}; (Y^{-}, X^{-})\right) - MI(Y^{+}; Y^{-}) ,   (4)
which makes explicit that transfer entropy is the reduction of uncertainty in one vari-
able due to another that is not explained by its own past alone. Another arrangement
of transfer entropy, in this case in terms of Shannon entropies, reads
TE(X \to Y) = H(Y^{-}, X^{-}) - H(Y^{+}, Y^{-}, X^{-}) + H(Y^{+}, Y^{-}) - H(Y^{-}) .   (5)
For a detailed review on the concept of transfer entropy and its application to
neuroscience see Chapter 1 by Wibral in this volume.
Note also that we refer to transfer entropy as capturing causal dependencies only
in the sense that there is some value in the past of an observed signal in explaining
the future evolution of another signal beyond its own past. Observational causal-
ity as defined by Wiener differs in general from interventional causality, in which
perturbation of one process, while conditioning on or controlling the state of others,
is necessary to infer the graph of causal interactions. Indeed, transfer entropy actu-
ally captures the notion of information transfer as opposed to quantifying the strength
of causal interactions [26]. The two concepts are different, as reviewed in Chap-
ter 8 by Chicharro in this volume. An easy observation illustrating this difference
is that transfer entropy will be zero for both independent and fully synchronized
processes, possibly due to a null and very strong causal interaction, respectively
[27, 28]. However, information transfer across brain regions is arguably the quantity
of interest to study the flexible information routing in the brain rather than interven-
tional causal connectivity which is directly related to its relatively fixed anatomical
circuitry [29, 25, 30, 31].
Regarding the estimation of transfer entropy, the innocent formulation in Eq. 3
does not make explicit its dependence on several probability densities. For Markov
processes indexed by a discrete valued time-index t this reads
TE(X \to Y) = \sum_{y_{t+1},\, y_t^{d_y},\, x_t^{d_x}} p\!\left(y_{t+1}, y_t^{d_y}, x_t^{d_x}\right) \log \frac{p\!\left(y_{t+1} \mid y_t^{d_y}, x_t^{d_x}\right)}{p\!\left(y_{t+1} \mid y_t^{d_y}\right)} ,   (6)

where x_t^{d_x} = (x_t, ..., x_{t-d_x+1}), y_t^{d_y} = (y_t, ..., y_{t-d_y+1}), while d_x and d_y are the orders
(memory) of the Markov processes X and Y, respectively.
The formula in Eq. 6 does not appear so innocent anymore and the appearance of
several probability densities in possibly high dimensions already hints that the esti-
mation procedure might be difficult. In the next sections we will describe different
types of estimators of transfer entropy from a collection of time series.
3 A Zoo of Estimators
Given a data set and application in mind, which is a good estimator for transfer
entropy? Before addressing this question we shall recall some basic notions about
estimators and their classification.
An estimator is a function or rule that takes observed data as input and out-
puts an estimate of an unknown parameter or variable [32]. Any estimator can be
characterized in terms of the bias and variance of its estimates, that is, its systematic
deviation from the true value and its variability across different realizations of the
sampling. Often one is interested in knowing how the bias of the estimate and its
convergence to the expected value behave as the number of samples grows, i.e., the
asymptotic bias and consistency of the estimator, respectively. From all unbiased
estimators, one estimator is more efficient than another if it needs less samples to
achieve a given level of variance. More generally, one can be interested in control-
ling the balance between bias and variance. For example, if one decides to contrast
the estimate for one data set with that of surrogate data (see Section 5), the analyst
might choose to reduce the variance or statistical error of the estimate at the expense
of increasing its bias. The reason is that if the surrogate data is suspected to have a
similar bias to the observed data, the bias can be canceled out in the comparison. In
another case, the analyst might be interested in a direct interpretation of the value of
an estimate. To attain such a goal a low bias estimator is mandatory.
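As an illustration of this strategy, a surrogate-based permutation test might be organized as in the following sketch; transfer_entropy is a hypothetical estimator operating on trial-structured data, and shuffling the trial assignment of the source destroys the specific source-target relation while largely preserving the estimator bias, so that observed and surrogate values can be compared on an equal footing.

import numpy as np

def te_permutation_test(x_trials, y_trials, transfer_entropy, n_perm=500, seed=0):
    # x_trials, y_trials: arrays of shape (n_trials, n_samples).
    # transfer_entropy:   hypothetical estimator accepting (n_trials, n_samples) data.
    # Returns the observed TE and a one-sided permutation p-value.
    rng = np.random.default_rng(seed)
    observed = transfer_entropy(x_trials, y_trials)
    null = np.empty(n_perm)
    for i in range(n_perm):
        perm = rng.permutation(len(x_trials))      # shuffle source trials across target trials
        null[i] = transfer_entropy(x_trials[perm], y_trials)
    # The shared estimator bias appears in both the observed and the surrogate values.
    p_value = (np.sum(null >= observed) + 1) / (n_perm + 1)
    return observed, p_value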
Thus, selecting the appropriate numerical estimator for a given application is cru-
cial since too large biases or statistical errors can severely hamper the interpretation
of the estimated functionals or their practical use. An optimal selection is indeed a
complex question that depends on several factors including the number of samples
available, the dimensionality of the problem, the levels of quantization of the samples,
the desired bias versus variance balance, and the computational resources. Differ-
ent estimators can be classified according to several criteria and each class will
exhibit different advantages and costs depending on the above-mentioned factors.
Taxonomies of transfer entropy estimators closely follow that of other information
theoretical functionals such as Shannon or mutual information [33]. A usual first di-
vision consists of the separation between parametric and non-parametric estimators.
Here we will focus on those estimators that can readily be applied to high-dimensional spaces, for two reasons. First, the time series from electrophysiology
experiments can typically only be represented faithfully after embedding them in some high-dimensional state space (see Section 4). Second, the evaluation of transfer entropy involves joint probability
densities, compounding the problem. Furthermore, only continuous signals will be
considered. The existence of a reliable estimator of transfer entropy for point processes such as spike trains has yet to be established. For some heuristic approaches see
[34, 35, 36].
Parametric estimators assume that the sampled data follow some known family of distributions and start by inferring the parameters of the family member that best fits the sampled distribution. Note that, due to the need to embed the time series into a high-dimensional
space, the distributions considered here, both the parametric family and the sampled one,
are necessarily multi-dimensional. After the inference of the density parameters, a
direct estimation of transfer entropy or other information-theoretic functionals then
proceeds by applying the corresponding functional to the estimated density.
An advantage of the parametric approach is that in some cases it allows for analytical insight into how an information-theoretic functional depends on relevant parameters. For the Gaussian family and some other distributions, certain functionals of the densities can be computed analytically [27, 37]; see, for example, the
work by Hlaváčková-Schindler for the derivation of differential entropy and transfer entropy
for Gaussian, log-normal, Weinman, and generalized normal distributions [38]. Also,
if the time series are well fitted by some generative model, it is possible to estimate
transfer entropy directly from the parameters of the generative equations. For example, for coupled first-order autoregressive models or second-order linear differential
equations with Gaussian input, transfer entropy is analytically solvable [27, 39, 40].
Furthermore, for linear Gaussian systems the distribution of estimates for a given
data length, as well as that of surrogates of the given data, is analytically known
[41], which simplifies the evaluation of statistical significance for these systems (see
Section 4.3 for a discussion on assessing statistical significance for transfer entropy
in the general case).
The success of the parametric approach depends on the correctness of its assumptions. For example, under certain conditions it might be reasonable to assume
that during resting state the samples from certain electrodes are distributed according
to a Gaussian law (in which case transfer entropy is proportional to Granger causality). Even if the data are not distributed according to any member of a well-known
family of distributions, it may be possible to apply some transform to bring them into one.
This procedure can also be useful to estimate bounds for certain functionals. For
example, since the data processing inequality implies that I(X;Y) ≥ I(f(X); g(Y)),
where f and g are deterministic functions, a lower bound can be obtained if the
distributions of f(X) and g(Y) are easier to estimate.
will most probably saturate the term with highest dimensionality and underestimate
transfer entropy.
Until now we have considered partitions of fixed size, independent of the data,
but it is also possible to generate partitions whose cells or divisions adapt their size
to the observed samples. One possibility to overcome some of the above-mentioned problems is to partition the observation space such that it is guaranteed
that the occupation of the bins satisfies some desired property. For example, for mutual information Paluš proposed that some problems of over-quantization become
less critical when using partitions that ensure an equal occupation of the bins in the
marginal spaces [46]. Such equiquantization ensures a maximization of the entropy
of the marginal probabilities, which makes the mutual information depend only
on the joint entropy term. In general, Paluš suggests that the condition h < N^{1/(\mathrm{dim}+1)}
should be met for the practical estimation of mutual information by equiquantized
marginal partitions [46]. A different adaptive technique is based on the local recursive refinement of a partition to uniformize the distribution of samples in the joint
space [47, 48, 49]. Yet a third type of approach considers fuzzy bins, allowing
a continuous weighting of a sample across multiple bins [50]. While in principle these
strategies can be generalized to estimate other information-theoretic functionals,
no systematic study has tested their convergence properties for transfer entropy.
More generally, the curse of dimensionality is the major impediment to applying
these techniques to data sets living in moderate- or high-dimensional spaces, which
is the usual case in electrophysiology.
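A minimal sketch of the equiquantization idea for mutual information follows (our own illustration; the bin number and the test signals are arbitrary choices, and this is not the procedure of [46] in all its details): each marginal is mapped to quantile-based bin labels so that the marginal bins are equally occupied, and MI is then computed by plug-in on the resulting joint histogram.

```python
import numpy as np

def equiquantize(x, n_bins):
    """Map samples to bin labels such that the marginal bins are (nearly) equally occupied."""
    ranks = np.argsort(np.argsort(x))            # 0 .. N-1 rank of each sample
    return (ranks * n_bins) // len(x)            # quantile-based bin labels

def mi_equiquant(x, y, n_bins):
    """Plug-in mutual information (bits) on equiquantized marginals."""
    xb, yb = equiquantize(x, n_bins), equiquantize(y, n_bins)
    joint = np.zeros((n_bins, n_bins))
    np.add.at(joint, (xb, yb), 1)
    joint /= joint.sum()
    px, py = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0
    return np.sum(joint[nz] * np.log2(joint[nz] / np.outer(px, py)[nz]))

rng = np.random.default_rng(2)
N = 2000
x = rng.standard_normal(N)
y = 0.7 * x + 0.7 * rng.standard_normal(N)       # correlated pair
# keeping n_bins well below N**(1/(dim+1)) follows the rule of thumb quoted above
print(mi_equiquant(x, y, n_bins=8), mi_equiquant(x, rng.permutation(y), n_bins=8))
```

The second call, on a permuted copy of y, gives a rough idea of the estimator bias under independence.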
Note that the histogram technique is not readily applicable to spike trains in practical settings. Although spike trains are usually considered binary processes (two states),
they in fact have a mixed topology due to the continuous nature of the time stamp
at which each spike occurs [51]. Only in the case of very long recordings, and after
the application of some bias corrections, have histogram strategies produced
reliable results for spike trains or signals with some continuous variable [12].
For continuous random variables, a sum of smooth kernels converges faster to the
underlying density than binning-based techniques [54].
However, the evaluation of functionals of a density that is expressed as a sum of
kernels centered at the irregularly distributed sample points is difficult. For example, the estimation of transfer entropy for continuous variables could in principle be
carried out by combining four Shannon entropy terms, or more strictly speaking, differential entropy terms. While the kernel estimation of the densities themselves is
straightforward, evaluating each of the entropy terms requires numerically computing an integral over joint spaces, which for electrophysiology signals can easily
reach a dimensionality of fifteen or higher.
\hat{x} = \frac{1}{N} \sum_{t=1}^{N} x_t , \qquad (7)

which provides a direct estimate of the mean from the samples without computing
the full distribution in the first place.
The derivation of the KL estimator starts by noticing that a differential entropy
term \int p(x) \log(p(x))\,dx can be approximated by the sample average of \log(p(x)) evaluated at the sampled points x = x_t [55, 51]. The next ingredient is the assumption that
the probability density near each point x_t is locally uniform and equal to p(x = x_t),
which is an approximation of p that is as local as possible given the available data. Given
the former assumption and using the trinomial formula, \log(p(x_t)) can be calculated
from the probability that, after N − 1 other samples have been drawn according to
p, the nearest neighbor of x_t lies at least at distance ε. Finally, the sample average of
\log(p(x_t)) can be shown to be, up to a constant, equal to the sample average of the
log of the distance of each sample point to its nearest neighbor. The general form of
the KL estimator for differential entropy finally reads
H(X) = -\psi(k) + \psi(N) + \log(|B_d|) + \frac{d}{N} \sum_{t} \log(2\varepsilon_t) , \qquad (8)
where ψ denotes the digamma function, |B_d| is the volume of the unit ball in the d-dimensional Euclidean space, and ε_t is the distance of x_t to its k-th nearest neighbor.
Note that norms other than the Euclidean one, such as the maximum norm, can
also be used in the former formula to estimate distances to nearest neighbors.
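A minimal sketch of the KL estimator of Eq. 8 follows (our own code, not from [55] or any toolbox). It uses the maximum norm; keeping the factor 2 inside the logarithm, the volume constant can then be dropped, which is the convention of [62]. The neighbor search relies on scipy's k-d tree; the helper name and test case are hypothetical.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def kl_entropy(points, k=4):
    """Kozachenko-Leonenko differential entropy (nats) with the maximum norm.

    With the max norm the volume term of Eq. 8 can be taken as zero, leaving
    H = -psi(k) + psi(N) + (d/N) * sum_t log(2 * eps_t), where eps_t is the
    Chebyshev distance of point t to its k-th nearest neighbour.
    """
    pts = np.asarray(points, float)
    if pts.ndim == 1:
        pts = pts[:, None]
    n, d = pts.shape
    tree = cKDTree(pts)
    # k+1 because the query point itself is returned at distance zero
    eps = tree.query(pts, k=k + 1, p=np.inf)[0][:, -1]
    return -digamma(k) + digamma(n) + (d / n) * np.sum(np.log(2 * eps))

rng = np.random.default_rng(3)
x = rng.standard_normal((2000, 1))
# estimate vs. the exact differential entropy of a standard Gaussian
print(kl_entropy(x, k=4), 0.5 * np.log(2 * np.pi * np.e))
```

As noted below, the parameter k trades statistical error against bias; duplicate samples (zero distances) would have to be handled separately in real data.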
The KL estimator for differential entropy is endowed with several properties that
make it particularly attractive for practical applications. First, Kozachenko and Leonenko demonstrated that under mild assumptions on the continuity of p, the above
estimator is asymptotically unbiased and consistent. Even for finite samples the parameter k (the number of nearest neighbors) still allows for a certain control of the bias
versus variance balance (a larger k reduces statistical errors at the expense of a higher
bias). Second, Victor has reported that the data efficiency of estimating differential
entropy with the KL formula can reach a thousand times that of histogram strategies
for typical electrophysiology data sets [51]. Additionally, compared to histogram
techniques, KL and other nearest-neighbor approaches are centered on each data
sample and thus avoid a certain arbitrariness typical of binning procedures. Third, the KL
estimator effectively implements an adaptive resolution, where the distance scale
used changes according to the underlying density. And fourth, the search for nearest
neighbors in a set of points is a classic problem that has received a lot of attention
and for which several algorithms exist [56, 57, 58, 59].
However, there remain important drawbacks. For example, the application of the KL
estimator might return unreliable results if the assumption of continuity of p is not
appropriate. The validity of this assumption seems natural for most continuous
electrophysiology signals, but it is a condition to check for each application. Also, if
the number of samples is small and the dimensionality very high, the KL estimator will suffer from the curse of dimensionality. In addition, for some applications
with a large data size living in very high-dimensional spaces, the computation of exact nearest-neighbor distances can be computationally expensive. Unfortunately, less
expensive alternatives such as approximate nearest-neighbor searches, where some margin of
error is allowed in finding the k-th nearest neighbor, lead to an amplification of errors in the entropy estimate that renders the advantage gained from the
approximate search less useful. However, hierarchical clustering techniques and
parallel computing, possibly assisted by GPU technology, have paved the way towards high-performance computing of massive exact nearest-neighbor calculations
[60, 61].
Since most relevant information-theoretic functionals can be decomposed in
terms of differential entropies, a naive estimator for such functionals would consist
of summing the individual differential entropy estimates. For example, for transfer
entropy the naive approach would consist of estimating each term of Eq. 5 with
a KL estimator. This is, however, not adequate for many applications. To see why,
it is important to first note that the probability densities involved in computing TE
or MI from individual terms can be of very different dimensionality (from d_y up
to d_y + d_x + 1 for the case of bivariate TE). For a fixed k, this means that different
distance scales are effectively used for spaces of different dimension. The second
important aspect is that the KL estimator is based on the assumption that the
density of the distribution of samples is constant within an ε-ball. The bias of the
final entropy estimate depends on the validity of this assumption, and thus on the
values of ε_t. Since the size of the ε-balls depends directly on the dimensionality of
the random samples, the biases of the estimates for the differential entropies in Eq. 5 will,
in general, not cancel, leading to a poor estimator of the transfer entropy.
The solution to this problem came from Kraskov, Stögbauer, and Grassberger
(KSG), who provided a methodology to adapt the KL estimator to estimate mutual
information [62, 63]. This set the path to estimators for other information-theoretic
functionals such as transfer entropy. Their solution came from the insight that Eq. 8
holds for any k, and thus it is not necessary to use a fixed k. Therefore, we can
vary the value of k at each data point so that the radii of the corresponding ε-balls
are approximately the same in the joint and the marginal spaces. The key
idea is then to use a fixed mass (k) only in the highest-dimensional space and project
the distance scale set by this mass into the lower-dimensional spaces. Thus, the
procedure designed for mutual information first determines the distances
to the k-th nearest neighbors in the joint space. Then, an estimator of MI is obtained
by counting the number of neighbors n that fall within such distances for each point
in the marginal spaces. The MI estimator based on this method displays many good
statistical properties: it inherits the data efficiency of the KL estimator, it greatly
reduces the bias obtained with individual KL estimates, and it appears to become an
exact estimator in the case of independent variables.
The idea can be generalized to estimate other functionals such as conditional mu-
tual information, including its specific formulation for transfer entropy [64]. Finally,
the KSG estimator of transfer entropy for Markov processes indexed by the discrete
time variable t (Eq. 6) is written as
TE(X \to Y; u) = \psi(k) + \frac{1}{N} \sum_{t} \left[ \psi\!\left(n_{y_{t-1}^{d_y}} + 1\right) - \psi\!\left(n_{y_t\, y_{t-1}^{d_y}} + 1\right) - \psi\!\left(n_{y_{t-1}^{d_y}\, x_{t-u}^{d_x}} + 1\right) \right] , \qquad (9)

where the distances to the k-th nearest neighbor in the highest-dimensional space
(spanned by y_t,\, y_{t-1}^{d_y},\, x_{t-u}^{d_x}) define the radius of the balls for the counting of the number
of points n(·) in these balls around each state vector in all the marginal spaces (·)
involved. In the above formulation we have also included a temporal parameter u
which accounts for the time delay for the information transfer to occur between two
processes as explained in [45].
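The sketch below illustrates Eq. 9 under strong simplifications (our own code, not the TRENTOOL implementation): embedding orders d_x = d_y = 1, maximum-norm distances computed by brute force, and strict inequalities in the neighbor counts. The coupled test system and all names are hypothetical.

```python
import numpy as np
from scipy.special import digamma

def ksg_te(x, y, k=4, u=1):
    """KSG-style transfer entropy TE(X -> Y; u) of Eq. 9 (nats), for d_x = d_y = 1."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    t0 = max(1, u)
    y_now = y[t0:]                      # y_t
    y_past = y[t0 - 1:-1]               # y_{t-1}
    x_past = x[t0 - u:len(x) - u]       # x_{t-u}
    joint = np.column_stack([y_now, y_past, x_past])

    def cheb(a):                        # pairwise Chebyshev distance matrix
        return np.max(np.abs(a[:, None, :] - a[None, :, :]), axis=2)

    d_joint = cheb(joint)
    np.fill_diagonal(d_joint, np.inf)
    eps = np.sort(d_joint, axis=1)[:, k - 1]   # distance to the k-th neighbour in the joint space

    def count_within(a):                # neighbours strictly inside the ball of radius eps_t
        d = cheb(a)
        np.fill_diagonal(d, np.inf)
        return np.sum(d < eps[:, None], axis=1)

    n_ypast = count_within(y_past[:, None])
    n_y_ypast = count_within(np.column_stack([y_now, y_past]))
    n_ypast_x = count_within(np.column_stack([y_past, x_past]))

    return digamma(k) + np.mean(digamma(n_ypast + 1)
                                - digamma(n_y_ypast + 1)
                                - digamma(n_ypast_x + 1))

# toy example: Y is driven by X with a one-sample delay
rng = np.random.default_rng(4)
x = rng.standard_normal(1000)
y = np.zeros_like(x)
y[1:] = 0.6 * x[:-1] + 0.5 * rng.standard_normal(999)
print(ksg_te(x, y, u=1), ksg_te(y, x, u=1))   # the first value should clearly exceed the second
```

In a real analysis the raw series would first be embedded (Section 4) and the neighbor search delegated to an efficient spatial data structure instead of the brute-force distance matrices used here.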
In summary, since the KSG estimator is more data efficient and accurate than
other techniques (especially those based on binning), it allows one to analyze shorter
data sets, possibly contaminated by small levels of noise. At the same time, the
method is especially geared to handle the biases in high-dimensional spaces that naturally occur after the embedding of the raw signals. Thus, the use of KSG enhances
the applicability of information-theoretic functionals in practical scenarios with limited data of unknown distribution, such as in neuroscience applications [25]. As
such, in the next section we focus on the use of this estimator on a typical electrophysiological data set. However, even using this improved estimator the curse of
dimensionality and inaccuracies in the estimation are unavoidable, especially under the restrictive conditions of electrophysiology data sets. For these reasons it is suggested
that the raw value of transfer entropy may be less reliable than its use as a statistic (in a statistical significance test against the null hypothesis that the measured time series
are independent) to infer a directed relationship between time series.
This procedure depends on two parameters, the dimension d and the delay τ of the
embedding. The parameters d and τ considerably affect the outcome of the TE estimates. For instance, a low value of d can be insufficient to unfold the state space
of a system and consequently degrade the meaning of transfer entropy. On the other
hand, a too large dimensionality reduces the accuracy of the estimators for a given sample size and can significantly increase the computing time.
While there is an extensive literature on how to choose such parameters, the different methods proposed are far from reaching a consensus. A popular option
is to take the embedding delay as the auto-correlation decay time of the signal
or the first minimum, if any, of the auto mutual information [66]. Once the delay
time of the embedding has been fixed, the Cao criterion offers an algorithm to determine the embedding dimension, based on detecting false neighbors that arise when
points are projected into a too low-dimensional state space [67]. However, for
the purpose of interpreting transfer entropy as an information-theoretic incarnation of Wiener's principle of causality, it is important not only that the embedding
parameters allow one to reconstruct the state space but also that they provide an
optimal self-prediction [45, 24]. Otherwise, if the reconstruction is not optimal in
the self-prediction sense, there might be a trivial reason why the past states of
one system help to predict the future of another system better than its own past
alone (see Chapter 1 in this volume for more details). Fortunately, the Ragwitz criterion yields delay-embedding states that provide optimal self-prediction for a large
class of systems, either deterministic or stochastic in nature. The Ragwitz criterion
is based on scanning the (d, τ) plane to identify the point that minimizes the error of a locally constant predictor [68]. This is how we finally recommend
choosing the embedding parameters for each time series. However, it is
always a good idea to check how the transfer entropy measurements depend on values
of d and τ around those found by any criterion.
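For illustration, the sketch below (our own, hypothetical example) builds a uniform delay embedding and selects τ with the autocorrelation-decay heuristic mentioned above; it does not implement the Ragwitz criterion, which additionally requires scanning (d, τ) with a locally constant predictor.

```python
import numpy as np

def embed(x, d, tau):
    """Uniform delay embedding: rows are (x_t, x_{t-tau}, ..., x_{t-(d-1)tau})."""
    x = np.asarray(x, float)
    start = (d - 1) * tau
    idx = np.arange(start, len(x))
    return np.column_stack([x[idx - j * tau] for j in range(d)])

def acorr_decay_delay(x, max_lag=100):
    """First lag at which the autocorrelation drops below 1/e (a common heuristic for tau)."""
    x = np.asarray(x, float) - np.mean(x)
    denom = np.dot(x, x)
    for lag in range(1, max_lag + 1):
        if np.dot(x[:-lag], x[lag:]) / denom < 1.0 / np.e:
            return lag
    return max_lag

rng = np.random.default_rng(5)
t = np.arange(3000)
# toy oscillatory signal with noise, standing in for an electrophysiological trace
sig = np.sin(2 * np.pi * t / 40) + 0.3 * rng.standard_normal(t.size)
tau = acorr_decay_delay(sig)
states = embed(sig, d=3, tau=tau)
print(tau, states.shape)
```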
where X
denotes the surrogate data. Statistical significance can then be obtained for
the excess transfer entropy by non-parametric methods using permutation testing as
detailed in Vicente et al. [25] or Lindner et al. [71] to minimize the potential effects
of bias introduced by small sample size.
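A minimal sketch of such a significance test follows (our own illustration; the permutation schemes of [25, 71] operate on trials and differ in detail). Here the surrogates are produced by circularly time-shifting the source series, which preserves its autocorrelation while destroying its temporal relation to the target; ksg_te refers to the sketch given earlier and is an assumption of this example.

```python
import numpy as np

def te_permutation_test(x, y, te_func, n_surrogates=200, seed=0):
    """Non-parametric significance test for TE(X -> Y).

    Surrogates are built by circularly shifting X by a random offset.
    Returns the observed TE, the surrogate mean (a rough bias proxy) and a p-value.
    """
    rng = np.random.default_rng(seed)
    te_obs = te_func(x, y)
    te_surr = np.empty(n_surrogates)
    for i in range(n_surrogates):
        shift = int(rng.integers(1, len(x) - 1))
        te_surr[i] = te_func(np.roll(x, shift), y)
    p_value = (1 + np.sum(te_surr >= te_obs)) / (1 + n_surrogates)
    return te_obs, te_surr.mean(), p_value

# usage with the brute-force KSG sketch from above (keep the series short, it is O(N^2)):
# te_obs, surrogate_mean, p = te_permutation_test(x, y, ksg_te)
```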
4.4 Toolboxes
Several toolboxes have been developed to tackle some or all of the three steps
described above. Here we mention three of the toolboxes that handle the complexity of the
KSG estimation of transfer entropy, but we make no claim that this short list is
exhaustive, and we encourage the reader to find the toolbox or software that best fits
their intended application domain.
TRENTOOL is an open-source MATLAB toolbox, co-developed by the authors
and M. Lindner, that is especially geared towards neurophysiological data sets [71, 72]. It
is integrated with the popular FieldTrip toolbox and handles the state-space reconstruction, estimation, and non-parametric statistical significance testing of transfer entropy and mutual
information for multichannel recordings. It also features parallel search of nearest
neighbors and the analysis of non-stationary time series by the ensemble method (see
the next section).
TIM is an open source C++ toolbox by K. Rutanen that estimates a large range
of information functionals for continuous-valued signals including transfer entropy,
mutual information, Kullback-Leibler divergence, Shannon entropy, Renyi entropy,
and Tsallis entropy [73].
The Java Information Dynamics Toolkit is software written by J. Lizier that implements all of the relevant information dynamics measures, including basic measures
such as entropy, joint entropy, mutual information, and conditional mutual information,
as well as advanced measures such as transfer entropy, active information storage, excess entropy, and separable information. It features discrete-valued estimators,
kernel estimators, nearest-neighbor estimators, and Gaussian approximation based
estimators [74].
estimated over all trials. Thus, for each signal and trial we are led to consider a set
of embedded points of the form
x_t^{d_x}(r) = \left( x_t(r),\; x_{t-\tau}(r),\; x_{t-2\tau}(r),\; \ldots,\; x_{t-(d_x-1)\tau}(r) \right) . \qquad (12)
A time-resolved transfer entropy can be formulated by using only the data points
from all trials belonging to a particular time window (t − σ ,t + σ ). This ensem-
ble TE can be decomposed into a sum of four time-resolved individual Shannon
entropies as in Eq. 5
TE(X \to Y, t; u) = H\left(y_{t-1}^{d_y}(r), x_{t-u}^{d_x}(r)\right) - H\left(y_t(r), y_{t-1}^{d_y}(r), x_{t-u}^{d_x}(r)\right) + H\left(y_t(r), y_{t-1}^{d_y}(r)\right) - H\left(y_{t-1}^{d_y}(r)\right) , \qquad (13)
where r denotes the trial index in the full set of trials. We have also taken into account that the propagation delay u between the two processes X and Y affects the timing of
the information transfer. Now it is again possible to adapt the KSG estimator to partially
cancel the errors of the different terms. The main difference is that in the
ensemble variant of transfer entropy the search for nearest neighbors is carried out over
points from all trials, and not only from the same trial as the reference point of the search. If all trials are aligned according to meaningful
events (such as stimulus or response onset), then it is possible to restrict the search
for neighbors around a time stamp t to a temporal window of width σ, in order to control
the temporal resolution of the estimator. Thus, the ensemble estimator of transfer
entropy reads [64]
TE(X \to Y, t; u, \sigma) = \psi(k) + \left\langle \psi\!\left(n_{y_{t-1}^{d_y}} + 1\right) - \psi\!\left(n_{y_t, y_{t-1}^{d_y}} + 1\right) - \psi\!\left(n_{y_{t-1}^{d_y}, x_{t-u}^{d_x}} + 1\right) \right\rangle_r , \qquad (14)
where ψ denotes the digamma function and the angle brackets (<>) denote an av-
eraging over points at different trials at the time index t. Thus, in contrast to time
averaging used in Eq. 9, in the former expression averages are taken over points
across different trials and the nearest neighbor searches are defined within the tem-
poral window (t − σ, t + σ). The distances to the k-th nearest neighbor in the space
spanned by y_t(r),\, y_{t-1}^{d_y}(r),\, x_{t-u}^{d_x}(r) define the radii of the balls for the counting of the
number of points (n(·)) in these balls around each state vector in all the marginal
spaces (·) involved. Such counting is restricted to only points within the interval
(t − σ ,t + σ ) across all trials.
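The sketch below (our own, with an assumed array layout) shows only the bookkeeping step that distinguishes the ensemble estimator: pooling embedded state vectors from all trials whose time index falls inside the window (t − σ, t + σ). Neighbor counting and the digamma terms of Eq. 14 would then proceed as in the single-trial case.

```python
import numpy as np

def pooled_window_points(embedded_trials, t, sigma):
    """Collect state vectors from all trials with time index in (t - sigma, t + sigma).

    embedded_trials: list of arrays, one per trial, of shape (n_times, dim);
    row i of each array is assumed to hold the state vector at time index i.
    Returns the pooled points and their (trial, time) labels, so that the
    neighbour search for Eq. 14 can exclude the reference point itself.
    """
    points, labels = [], []
    for r, trial in enumerate(embedded_trials):
        lo, hi = max(0, t - sigma + 1), min(len(trial), t + sigma)
        points.append(trial[lo:hi])
        labels.extend((r, s) for s in range(lo, hi))
    return np.vstack(points), labels
```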
To facilitate its computation, the ensemble estimator, including the state space
reconstruction and the statistical significance testing, has been recently added to the
TRENTOOL open source toolbox [72].
The ensemble estimator has recently been applied to detect time-dependent couplings between processes [64, 61], and such time-resolved estimates are closely related to the local measures of
information discussed by Lizier in Chapter 7 of this volume.
6 Discussion
The characterization of a system in terms of the information transfer between its
subsystems is a common goal in many fields of science. However, this approach
seems particularly necessary when dealing with systems such as the nervous system, for which implementing a flexible routing of information is a key function.
Throughout this chapter we have described different methods to estimate transfer entropy
from continuous-valued time series typical of electrophysiology recordings. We
have also argued that methods based on nearest-neighbor statistics provide efficient estimators, and we have detailed the KSG estimator as an attractive option for practical
applications.
In this description we have restricted ourselves to a bivariate formulation of transfer entropy. However, to distinguish cascade effects and common-drive interactions
in networks of possibly interacting systems (as measured by multichannel recordings), it is fundamental to go beyond the bivariate limitation. While the mathematical
extension to the multivariate case is straightforward, its numerical estimation is
rather challenging. The curse of dimensionality and the combinatorial explosion
of possibilities make an exhaustive computation of transfer entropies beyond order 3
impractical for applications with tens or hundreds of channels, as occurs in
typical EEG/MEG recordings. Fortunately, recent developments on the optimal sub-selection of channels as well as efficient multivariate embedding reconstructions
have paved the way to practical approximations of higher-order transfer entropies in
multichannel recordings [75, 76].
Future developments are expected to fully exploit the low dimensionality of the
manifolds on which the dynamics of many systems live. Since the manifold dimensionality is typically far lower than that of the Euclidean embedding space, it is possible that non-linear manifold learning techniques might provide a substantial leap
over current standard techniques. Also, a fully rigorous mathematical formulation
of transfer entropy, including an adequate state-space reconstruction, for point processes such as spike trains would be very welcome. On the applications side, the numerical decomposition of transfer entropy into state-dependent and state-independent
contributions seems a very useful tool to better discern the role of a receiving system
in processing information.
Finally, it should be noted that since the seminal 1948 works of Wiener (on cybernetics [77]) and Shannon (on the quantification of information [13]), the idea that unifying informational aspects run deep below the diverse physical descriptions of many
phenomena has been slowly gaining importance [78]. We believe that the characterization of complex systems using transfer entropy, as well as other functionals
describing the dynamics of information [5, 79], is a promising approach towards
Acknowledgements. The authors would like to thank Wei Wu and Joe Lizier for fruitful
discussions and suggestions.
References
1. Hubel, D.H., Wiesel, T.N.: Receptive fields and functional architecture of monkey striate
cortex. The Journal of Physiology 195(1), 215–243 (1968)
2. Gray, C.M., König, P., Engel, A.K., Singer, W.: Oscillatory responses in cat visual cortex
exhibit inter-columnar synchronization which reflects global stimulus properties. Na-
ture 338(6213), 334–337 (1989)
3. Canolty, R.T., Knight, R.T.: The functional role of cross-frequency coupling. Trends in
Cognitive Sciences 14(11), 506–515 (2010)
4. Victor, J.D.: Approaches to information-theoretic analysis of neural activity. Biological
Theory 1(3), 302 (2006)
5. Lizier, J.T.: The Local Information Dynamics of Distributed Computation in Complex
Systems. Springer theses. Springer (2013)
6. Lehmann, E.L., Casella, G.: Theory of point estimation, vol. 31. Springer (1998)
7. Niso, G., Bruña, R., Pereda, E., Gutiérrez, R., Bajo, R., Maestú, F., Del-Pozo, F.: HERMES:
Towards an integrated toolbox to characterize functional and effective brain connectivity.
Neuroinformatics 11, 405–434 (2013)
8. Pereda, E., Quiroga, R.Q., Bhattacharya, J.: Nonlinear multivariate analysis of neuro-
physiological signals. Progress in Neurobiology 77(1), 1–37 (2005)
9. Cover, T.M., Thomas, J.A.: Elements of information theory. Wiley-Interscience, New
York (1991)
10. Latham, P.E., Nirenberg, S.: Synergy, redundancy, and independence in population
codes, revisited. J. Neurosci. 25(21), 5195–5206 (2005)
11. Williams, P.L., Beer, R.D.: Nonnegative decomposition of multivariate information.
arXiv preprint arXiv:1004.2515 (2010)
12. Rieke, F., Warland, D., Deruytervansteveninck, R., Bialek, W.: Spikes: exploring the
neural code (computational neuroscience). MIT Press (1999)
13. Shannon, C.E.: A mathematical theory of communication. Bell System Technical Journal 27, 379–423 (1948)
14. Shannon, C.E., Weaver, W.: The Mathematical Theory of Communication. University of Illinois Press, Urbana, IL (1949)
15. Barlow, H.B.: Possible principles underlying the transformation of sensory messages.
Sensory Communication, 217–234 (1961)
16. de Ruyter van Steveninck, R.R., Laughlin, S.B.: The rate of information transfer at
graded-potential synapses. Nature 379(6566), 642–645 (1996)
17. Lewicki, M.S.: Efficient coding of natural sounds. Nature Neuroscience 5(4), 356–363
(2002)
18. Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current Opinion in Neu-
robiology 14(4), 481–487 (2004)
19. Johnson, D.H.: Information theory and neuroscience: Why is the intersection so small?
In: IEEE Information Theory Workshop, ITW 2008, pp. 104–108 (2008)
20. Shannon, C.E.: The bandwagon. IRE Transactions on Information Theory 2(1), 3 (1956)
21. Nirenberg, S.H., Victor, J.D.: Analyzing the activity of large populations of neurons: how
tractable is the problem? Current Opinion in Neurobiology 17(4), 397–400 (2007)
22. Johnson, D.H.: Information theory and neural information processing. IEEE Transac-
tions on Information Theory 56(2), 653–666 (2010)
23. Schreiber, T.: Measuring information transfer. Phys. Rev. Lett. 85(2), 461–464 (2000)
24. Wiener, N.: The theory of prediction. In: Beckmann, E.F. (ed.) Modern Mathematics for
the Engineer. McGraw-Hill, New York (1956)
25. Vicente, R., Wibral, M., Lindner, M., Pipa, G.: Transfer entropy – a model-free measure
of effective connectivity for the neurosciences. J. Comput. Neurosci. 30(1), 45–67 (2011)
26. Ay, N., Polani, D.: Information flows in causal networks. Adv. Complex Syst. 11, 17
(2008)
27. Kaiser, A., Schreiber, T.: Information transfer in continuous processes. Physica D 166,
43 (2002)
28. Chicharro, D., Ledberg, A.: When two become one: the limits of causality analysis of
brain dynamics. PLoS One 7(3), e32466 (2012)
29. Chávez, M., Martinerie, J., Le Van Quyen, M.: Statistical assessment of nonlinear causal-
ity: application to epileptic EEG signals. J. Neurosci. Methods 124(2), 113–128 (2003)
30. Wibral, M., Rahm, B., Rieder, M., Lindner, M., Vicente, R., Kaiser, J.: Transfer entropy
in magnetoencephalographic data: Quantifying information flow in cortical and cerebel-
lar networks. Prog. Biophys. Mol. Biol. 105(1-2), 80–97 (2011)
31. Vicente, R., Gollo, L.L., Mirasso, C.R., Fischer, I., Pipa, G.: Dynamical relaying can
yield zero time lag neuronal synchrony despite long conduction delays. Proceedings of
the National Academy of Sciences 105(44), 17157–17162 (2008)
32. Kay, S.M.: Fundamentals of statistical signal processing. In: Estimation Theory, vol. 1
(1993)
33. Hlaváčková-Schindler, K., Paluš, M., Vejmelka, M., Bhattacharya, J.: Causality detec-
tion based on information-theoretic approaches in time series analysis. Physics Re-
ports 441(1), 1–46 (2007)
34. Gourevitch, B., Eggermont, J.J.: Evaluating information transfer between auditory corti-
cal neurons. J. Neurophysiol. 97(3), 2533–2543 (2007)
35. Ito, S., Hansen, M.E., Heiland, R., Lumsdaine, A., Litke, A.M., Beggs, J.M.: Extending
transfer entropy improves identification of effective connectivity in a spiking cortical
network model. PLoS One 6(11), e27431 (2011)
36. Li, Z., Li, X.: Estimating temporal causal interaction between spike trains with permuta-
tion and transfer entropy. PLoS One 8(8), e70894 (2013)
37. Barnett, L., Barrett, A.B., Seth, A.K.: Granger causality and transfer entropy are equiva-
lent for Gaussian variables. Phys. Rev. Lett. 103(23), 238701 (2009)
38. Hlavácková-Schindler, K.: Equivalence of Granger causality and transfer entropy: A gen-
eralization. Applied Mathematical Sciences 5(73), 3637–3648 (2011)
39. Nichols, J.M., Seaver, M., Trickey, S.T., Todd, M.D., Olson, C., Overbey, L.: Detecting
nonlinearity in structural systems using the transfer entropy. Phys. Rev. E Stat. Nonlin.
Soft Matter Phys. 72(4 Pt. 2), 046217 (2005)
40. Hahs, D.W., Pethel, S.D.: Transfer entropy for coupled autoregressive processes. En-
tropy 15(3), 767–788 (2013)
41. Barnett, L., Bossomaier, T.: Transfer entropy as a log-likelihood ratio. Physical Review
Letters 109(13), 138105 (2012)
42. Miller, G.A.: Note on the bias of information estimates. Information Theory in Psychol-
ogy: Problems and Methods 2, 95–100 (1955)
43. Efron, B., Stein, C.: The jackknife estimate of variance. The Annals of Statistics, 586–
596 (1981)
44. Pompe, B., Runge, J.: Momentary information transfer as a coupling measure of time
series. Phys. Rev. E Stat. Nonlin. Soft Matter Phys. 83(5 Pt. 1), 051122 (2011)
45. Wibral, M., Pampu, N., Priesemann, V., Siebenhühner, F., Seiwert, H., Lindner, M., Lizier,
J.T., Vicente, R.: Measuring information-transfer delays. PLoS One 8(2), e55809 (2013)
46. Paluš, M.: Testing for nonlinearity using redundancies: Quantitative and qualitative as-
pects. Physica D: Nonlinear Phenomena 80(1), 186–205 (1995)
47. Fraser, A.M., Swinney, H.L.: Independent coordinates for strange attractors from mutual
information. Phys. Rev. A. 33, 1134 (1986)
48. Darbellay, G.A., Vajda, I.: Estimation of the information by an adaptive partitioning
of the observation space. IEEE Transactions on Information Theory 45(4), 1315–1321
(1999)
49. Cellucci, C.J., Albano, A.M., Rapp, P.E.: Statistical validation of mutual information
calculations: Comparison of alternative numerical algorithms. Physical Review E 71(6),
066208 (2005)
50. Daub, C.O., Steuer, R., Selbig, J., Kloska, S.: Estimating mutual information using b-
spline functions–an improved similarity measure for analysing gene expression data.
BMC Bioinformatics 5(1), 118 (2004)
51. Victor, J.: Binless strategies for estimation of information from neural data. Phys. Rev.
E 72, 051903 (2005)
52. Silverman, B.W.: Density estimation for statistics and data analysis, vol. 26. CRC Press
(1986)
53. Young-Il, M., Rajagopalan, B., Lall, U.: Estimation of mutual information using kernel
density estimators. Physical Review E 52(3), 2318 (1995)
54. Steuer, R., Kurths, J., Daub, C.O., Weise, J., Selbig, J.: The mutual information: detecting
and evaluating dependencies between variables. Bioinformatics 18(suppl. 2), S231–S240
(2002)
55. Kozachenko, L.F., Leonenko, N.N.: Sample estimate of entropy of a random vector.
Probl. Inform. Transm. 23, 95–100 (1987)
56. Knuth, D.E.: The art of computer programming. In: Sorting and Searching, vol. 3 (1973)
57. Vaidya, P.M.: An O(n log n) algorithm for the all-nearest-neighbors problem. Discrete &
Computational Geometry 4(1), 101–115 (1989)
58. Zezula, P., Amato, G., Dohnal, V., Batko, M.: Similarity search: The metric space ap-
proach. Advances in Database Systems, vol. 32. Springer, Secaucus (2005)
59. Heineman, G.T., Pollice, G., Selkow, S.: Algorithms in a Nutshell. O’Reilly Media, Inc.
(2009)
60. Merkwirth, C., Parlitz, U., Lauterborn, W.: Fast nearest-neighbor searching for nonlinear signal pro-
cessing. Phys. Rev. E Stat. Phys. Plasmas Fluids Relat. Interdiscip. Topics 62(2 Pt. A),
2089–2097 (2000)
61. Wollstadt, P., Martinez-Zarzuela, M., Vicente, R., Wibral, M.: Efficient transfer entropy
analysis of nonstationary neural time series. arXiv preprint arXiv:1401.4068 (2014)
62. Kraskov, A., Stoegbauer, H., Grassberger, P.: Estimating mutual information. Phys. Rev.
E Stat. Nonlin. Soft Matter Phys. 69(6 Pt. 2), 066138 (2004)
63. Kraskov, A.: Synchronization and Interdependence measures and their application to the
electroencephalogram of epilepsy patients and clustering of data. PhD thesis, University
of Wuppertal (February 2004)
64. Gomez-Herrero, G., Wu, W., Rutanen, K., Soriano, M.C., Pipa, G., Vicente, R.: Assess-
ing coupling dynamics from an ensemble of time series. arXiv preprint arXiv:1008.0539
(2010)
65. Takens, F.: Detecting Strange Attractors in Turbulence. In: Dynamical Systems and Tur-
bulence, Warwick, 1980. Lecture Notes in Mathematics, vol. 898, pp. 366–381. Springer
(1981)
66. Kantz, H., Schreiber, T.: Nonlinear Time Series Analysis, 2nd edn. Cambridge University
Press (November 2003)
67. Cao, L.Y.: Practical method for determining the minimum embedding dimension of a
scalar time series. Physica D 110, 43–50 (1997)
68. Ragwitz, M., Kantz, H.: Markov models from data by simple nonlinear time series pre-
dictors in delay embedding spaces. Phys. Rev. E Stat. Nonlin. Soft Matter Phys. 65(5 Pt.
2), 056201 (2002)
69. Theiler, J.: Spurious dimension from correlation algorithms applied to limited time-series
data. Physical Review A 34(3), 2427 (1986)
70. Vejmelka, M., Hlaváčková-Schindler, K.: Mutual information estimation in higher di-
mensions: A speed-up of a k-nearest neighbor based estimator. In: Beliczynski, B.,
Dzielinski, A., Iwanowski, M., Ribeiro, B. (eds.) ICANNGA 2007, Part I. LNCS,
vol. 4431, pp. 790–797. Springer, Heidelberg (2007)
71. Lindner, M., Vicente, R., Priesemann, V., Wibral, M.: Trentool: A Matlab open source
toolbox to analyse information flow in time series data with transfer entropy. BMC Neu-
rosci. 12(119), 1–22 (2011)
72. Lindner, M., Vicente, R., Wibral, M., Pampu, N., Wollstadt, P., Martinez-Zarzuela, M.:
TRENTOOL, http://www.trentool.de
73. Rutanen, K.: TIM 1.2.0,
http://www.cs.tut.fi/~timhome/tim-1.2.0/tim.htm
74. Lizier, J.: Java Information Dynamics Toolkit,
http://code.google.com/p/information-dynamics-toolkit/
75. Faes, L., Nollo, G., Porta, A.: Non-uniform multivariate embedding to assess the infor-
mation transfer in cardiovascular and cardiorespiratory variability series. Comput. Biol.
Med. 42(3), 290–297 (2012)
76. Lizier, J.T., Rubinov, M.: Inferring effective computational connectivity using incremen-
tally conditioned multivariate transfer entropy. BMC Neuroscience 14(suppl. 1), P337
(2013)
77. Wiener, N.: Cybernetics. Hermann, Paris (1948)
78. Davies, P.C.W., Gregersen, N.H.: Information and the Nature of Reality, vol. 3. Cam-
bridge University Press, Cambridge (2010)
79. Barnett, L., Lizier, J.T., Harré, M., Seth, A.K., Bossomaier, T.: Information flow in a
kinetic Ising model peaks in the disordered phase. Physical Review Letters 111(17),
177203 (2013)
Part II
Information Transfer in Neural and Other
Physiological Systems
1 Introduction
The study of many physical phenomena is often performed according to a reduction-
ist approach whereby the dynamics of the observed complex system are described
as resulting from the activity of less complex subsystems and from the interaction
among these subsystems. For instance, the human brain can be seen as a complex
network characterized by distinct neural ensembles, each represented by a single
oscillator, which are highly interconnected with each other according to specific
patterns of connectivity [1]. With a broader perspective, the whole human organism
can be seen as an integrated network where multiple physiological systems under
neural regulation, such as the cardiac, circulatory, respiratory and muscular sys-
tems, each with its own internal dynamics, continuously interact to preserve the
overall physiological function [2]. The aim of this approach is to describe how com-
plex properties of the observed network arise from the dynamics and the dynamical
interaction of simpler and likely more accessible parts. When the physiology of the
composite system and the way in which the subsystems interact are well known, the
analysis may be performed by constructing suitable generative models and comparing
the dynamics of these models with the available experimental data. If, as often hap-
pens, the available knowledge is insufficient to support the definition of a generative
model, data-driven approaches are needed whereby the properties of the subsystems
and their interactions are estimated from the measured data.
When the data-driven approach is considered, the central need is to identify a
suitable framework for describing the properties of the observed complex network
in terms of meaningful measures of system activity and connectivity. In this chapter,
we focus on the well-posed analysis framework provided by dynamical information
theory [3]. Compared with other frameworks commonly used for the analysis of
physiological networks, for instance the linear parametric representation of multi-
ple time series performed either in time or frequency domains [4], the information-
theoretic approach offers the intriguing possibility of exploring system dynamics
from a nonlinear and model-free perspective. Attracted by this opportunity, several
researchers have defined and developed different measures of information dynam-
ics based on the computation of entropy rates. In particular, single-process condi-
tional entropy measures computed via their various formulations (e.g., approximate
entropy [5], sample entropy [6], corrected conditional entropy [7]) are typical mea-
sures of system complexity, while mutual information and cross-entropies quan-
tify the information shared between coupled systems [8, 9], and measures based
on transfer entropy [10] quantify the directional information flow from one sys-
tem to another. Recent advances in the field of information dynamics have shown
that, if properly defined and contextualized, these measures form the basis of dis-
tributed computation in complex networks [11] and are not independent with each
other when the aim is to characterize the overall behavior of a network of interact-
ing systems [12]. In this contribution, new definitions of the so-called self entropy
(SE) and cross entropy (CE) are integrated, together with the well known trans-
fer entropy (TE), into an unified framework for the study of information dynamics.
These elements are combined together to compute the reduction of the information
associated with the target system due to the knowledge of the dynamics of a network
of interacting dynamical components.
Another aspect of paramount importance in practical analysis is the design of
data-efficient procedures for the estimation of information-theoretic measures in the
challenging conditions of physiological signal analysis. Surveying our past and re-
cent research in the field [7],[9],[13],[14],[15],[16],[17],[18],[19],[20], we present
in the second part of the chapter a general strategy for the estimation of SE, CE
and TE from short realizations of multivariate processes. The strategy is based on
the utilization of a corrected conditional entropy estimator and of appropriate em-
bedding schemes, and aims at dealing with the curse of dimensionality –an issue
unavoidably affecting the estimation of information-theoretic quantities defined in
high-dimensional state spaces from time series of limited length. As suggested by
the reported practical applications, this approach successfully describes, in terms of
SE, CE and TE estimated from multivariate time series, both individual and collec-
tive properties of the systems composing brain and physiological networks.
where p(y) is the probability for the variable Y to take the value y. The conditional
entropy (CondEn) quantifies the average uncertainty that remains about Y when X is
known as:
H(Y|X) = -\sum p(x, y) \log p(y|x) , \qquad (2)
while the mutual information (MI) quantifies the amount of information shared
between X and Y as:
I(X;Y) = \sum p(x, y) \log \frac{p(y|x)}{p(y)} , \qquad (3)
where p(x,y) is the joint probability of observing simultaneously the values x and
y for the variables X and Y, and p(y|x) is the conditional probability of observing y
given that x has been observed. Note that the sums in (1-3) extend over the sets of
all values with nonzero probability. Combining (1), (2) and (3) one can easily see
that the three measures are linked to each other by the relation I(X;Y) = H(Y) − H(Y|X).
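A minimal sketch (our own illustration, with hypothetical data) of the plug-in computation of H(Y), H(Y|X) and I(X;Y) for discrete paired observations, which also makes the identity I(X;Y) = H(Y) − H(Y|X) explicit:

```python
import numpy as np

def entropies(x, y):
    """Plug-in estimates (bits) of H(Y), H(Y|X) and I(X;Y) for discrete paired samples."""
    x, y = np.asarray(x), np.asarray(y)
    xv, xi = np.unique(x, return_inverse=True)
    yv, yi = np.unique(y, return_inverse=True)
    joint = np.zeros((len(xv), len(yv)))
    np.add.at(joint, (xi, yi), 1)
    p_xy = joint / joint.sum()
    p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
    h_y = -np.sum(p_y[p_y > 0] * np.log2(p_y[p_y > 0]))
    nz = p_xy > 0
    h_y_given_x = -np.sum(p_xy[nz] * np.log2((p_xy / p_x[:, None])[nz]))
    mi = h_y - h_y_given_x                 # identity I(X;Y) = H(Y) - H(Y|X)
    return h_y, h_y_given_x, mi

rng = np.random.default_rng(6)
x = rng.integers(0, 4, 10000)
y = (x + rng.integers(0, 2, 10000)) % 4    # Y is a noisy copy of X
print(entropies(x, y))
```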
I(X;Y|Z) = \sum p(x, y, z) \log \frac{p(y|x, z)}{p(y|z)} , \qquad (4)
S_Y = \sum p(y_n, y_n^-) \log \frac{p(y_n \mid y_n^-)}{p(y_n)} \qquad (5)

that quantifies the average reduction in uncertainty about Y_n resulting from the
knowledge of Y_n^-. The SE ranges from S_Y = 0, measured when the past states Y_n^-
do not provide any reduction in the uncertainty about the present state Y_n (i.e., when
p(y_n|y_n^-) = p(y_n)), to its maximum value S_Y = H(Y_n), measured when the whole uncertainty about Y_n is reduced by learning Y_n^- (i.e., when p(y_n|y_n^-) = 1). Moreover, the
influence of the past states of the source process X onto the present state of the target
process Y can be assessed by means of the cross entropy (CE), defined here as:
C_{X \to Y} = \sum p(y_n, x_n^-) \log \frac{p(y_n \mid x_n^-)}{p(y_n)} \qquad (6)
T_{X \to Y} = \sum p(y_n, y_n^-, x_n^-) \log \frac{p(y_n \mid x_n^-, y_n^-)}{p(y_n \mid y_n^-)} \qquad (7)
S_Y = I(Y_n; Y_n^-) = H(Y_n) - H(Y_n \mid Y_n^-), \qquad (8)
C_{X \to Y} = I(Y_n; X_n^-) = H(Y_n) - H(Y_n \mid X_n^-), \qquad (9)
T_{X \to Y} = I(Y_n; X_n^- \mid Y_n^-) = H(Y_n \mid Y_n^-) - H(Y_n \mid X_n^-, Y_n^-). \qquad (10)
From these compact formulations, it is intuitive to see that SE, CE and TE mea-
sure the reduction in the information carried by Y respectively due to the introduc-
tion of its own past, due to the introduction of the past of X when the contribution
of the past of Y is not taken into account, and due to the introduction of the past of
X when the contribution of the past of Y is taken into account.
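To make these formulations concrete, the sketch below (our own simplification) approximates the past states by a single lag, Y_n^- ≈ Y_{n−1} and X_n^- ≈ X_{n−1}, and computes SE, CE and TE for a toy binary process as differences of plug-in joint entropies; the helper H is hypothetical and the one-lag truncation is not part of the framework itself.

```python
import numpy as np

def H(*cols):
    """Plug-in joint Shannon entropy (bits) of one or more discrete columns."""
    keys = np.stack(cols, axis=1)
    _, counts = np.unique(keys, axis=0, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

rng = np.random.default_rng(7)
n = 20000
x = rng.integers(0, 2, n)
y = np.empty(n, dtype=int)
y[0] = 0
for t in range(1, n):
    # Y depends on its own past and on the past of X, with occasional random flips
    y[t] = (y[t - 1] ^ x[t - 1]) if rng.random() < 0.9 else int(rng.integers(0, 2))

yn, yp, xp = y[1:], y[:-1], x[:-1]        # present of Y, one-lag pasts of Y and X
SE = H(yn) - (H(yn, yp) - H(yp))                        # Eq. (8) with Y_n^- ~ Y_{n-1}
CE = H(yn) - (H(yn, xp) - H(xp))                        # Eq. (9) with X_n^- ~ X_{n-1}
TE = (H(yn, yp) - H(yp)) - (H(yn, xp, yp) - H(xp, yp))  # Eq. (10)
print(SE, CE, TE)
```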
processes other than the two considered source and destination processes, in order
to rule out the side information related to these other processes that may possi-
bly confound the analysis of information dynamics. This is achieved by defining
multivariate (conditional) variants of self, cross and transfer entropy measures as
follows. Suppose that we are interested in evaluating the information of the target
process Y in relation to the source process X, collecting the remaining processes in
the set Z={Z(k) }k=1,...,M−2 . Then, the multivariate SE of Y given Z quantifies the
additional reduction of information about Y_n due to the introduction of Y_n^- in addition to Z_n^-,
thus accounting for the contribution of Y_n^- that is not already provided by Z_n^-:
S_{Y|Z} = I(Y_n; Y_n^- \mid Z_n^-) = H(Y_n \mid Z_n^-) - H(Y_n \mid Y_n^-, Z_n^-); \qquad (11)
C_{X \to Y|Z} = I(Y_n; X_n^- \mid Z_n^-) = H(Y_n \mid Z_n^-) - H(Y_n \mid X_n^-, Z_n^-); \qquad (12)
T_{X \to Y|Z} = I(Y_n; X_n^- \mid Y_n^-, Z_n^-) = H(Y_n \mid Y_n^-, Z_n^-) - H(Y_n \mid X_n^-, Y_n^-, Z_n^-). \qquad (13)
When more than two dynamical systems are known to interact with each other, the utilization
of conditional MI measures has proven useful to identify statistical dependencies
between pairs of systems in the context of their multivariate representation in which
the remaining interacting systems are considered. For instance, the multivariate version of the TE has proven useful to address the confounding effects of indirect
connections in the estimation of direct information transfer between the nodes of a
network [29]. In addition, the conditional MI I(X;Y|Z) has been given a novel interpretation in terms of partial information decomposition [30], which has been used to show,
e.g., in cellular automata [3], that the multivariate TE T_{X \to Y|Z} assesses the predictive information transfer by suppressing the redundant information provided by X
and Z about Y, but also incorporating the synergistic information found in X and Z
about Y.
Interestingly, SE, CE and TE do not describe isolated aspects of the dynamics of
information in a composite dynamical system. In accordance with a recently pro-
posed information-theoretic framework for the study of dependencies in networks
of dynamical systems [12], we show here that SE, CE and TE appear naturally as
terms in the decomposition of the system predictive information about the observed
target system. The predictive information is defined for an assigned target process
as the amount of information about the present state of the process that is explained
by its past states and the past states of all other available processes. This can be
measured, for a bivariate process {X,Y} where Y is taken as target process, through
the conditional MI:
P_Y = I(Y_n; X_n^-, Y_n^-) = H(Y_n) - H(Y_n \mid X_n^-, Y_n^-), \qquad (14)
which can be further decomposed into two terms related to the bivariate SE (8) and TE
(10), or alternatively into two terms related to the bivariate CE (9) and the
multivariate SE (11):
P_Y = H(Y_n) - H(Y_n \mid Y_n^-) + H(Y_n \mid Y_n^-) - H(Y_n \mid X_n^-, Y_n^-) = S_Y + T_{X \to Y}, \qquad (15a)
P_Y = H(Y_n) - H(Y_n \mid X_n^-) + H(Y_n \mid X_n^-) - H(Y_n \mid X_n^-, Y_n^-) = C_{X \to Y} + S_{Y|X}. \qquad (15b)
P_Y = I(Y_n; X_n^-, Y_n^-, Z_n^-) = H(Y_n) - H(Y_n \mid X_n^-, Y_n^-, Z_n^-), \qquad (16)
which can be expanded, according to the chain rule for the decomposition of the
conditional MI, in six different ways:
P_Y = I(Y_n; Y_n^-) + I(Y_n; X_n^- \mid Y_n^-) + I(Y_n; Z_n^- \mid X_n^-, Y_n^-) = S_Y + T_{X \to Y} + T_{Z \to Y|X}, \qquad (17a)
P_Y = I(Y_n; Y_n^-) + I(Y_n; Z_n^- \mid Y_n^-) + I(Y_n; X_n^- \mid Y_n^-, Z_n^-) = S_Y + T_{Z \to Y} + T_{X \to Y|Z}, \qquad (17b)
P_Y = I(Y_n; X_n^-) + I(Y_n; Y_n^- \mid X_n^-) + I(Y_n; Z_n^- \mid X_n^-, Y_n^-) = C_{X \to Y} + S_{Y|X} + T_{Z \to Y|X}, \qquad (17c)
P_Y = I(Y_n; Z_n^-) + I(Y_n; Y_n^- \mid Z_n^-) + I(Y_n; X_n^- \mid Y_n^-, Z_n^-) = C_{Z \to Y} + S_{Y|Z} + T_{X \to Y|Z}, \qquad (17d)
P_Y = I(Y_n; X_n^-) + I(Y_n; Z_n^- \mid X_n^-) + I(Y_n; Y_n^- \mid X_n^-, Z_n^-) = C_{X \to Y} + C_{Z \to Y|X} + S_{Y|X,Z}, \qquad (17e)
P_Y = I(Y_n; Z_n^-) + I(Y_n; X_n^- \mid Z_n^-) + I(Y_n; Y_n^- \mid X_n^-, Z_n^-) = C_{Z \to Y} + C_{X \to Y|Z} + S_{Y|X,Z}. \qquad (17f)
The decompositions in (15) and (17) are useful to explain how the uncertainty
about the states visited by the target system is reduced as a result of the state tran-
sitions relevant to the overall bivariate or multivariate system. In particular, they
show that SE, CE and TE are the elements through which this uncertainty reduc-
tion is achieved. It is worth noting that the different decompositions in (15) or in
(17) are equally valid, as they reflect simply different orders through which con-
ditioning to the past of the constituent processes is performed [12]. Therefore, as
none of the decompositions can be considered as preeminent, SE, CE and TE can
be seen as equally important terms of the description of a target system in terms
of predictive information. Which of these decompositions should be chosen to dis-
sect the system predictive information about the target system may depend only on
side information, e.g., based on physiological knowledge. For instance, when the
target process Y is known to be a passive process a decomposition evidencing the
CE might be preferred to another evidencing the TE, to limit the underestimation
of information transfer which may result by conditioning to Y − n ; on the contrary,
when Y exhibits self-sustained oscillatory activity the SE should be highlighted to
rule out the possibility that such an activity is misinterpreted as information transfer.
Moreover, formulations evidencing CX→Y |Z and TX→Y |Z should be preferred when
some of the processes composing Z may potentially affect both X and Y, while the
computation of CX→Y and TX→Y may suffice when X and Z can be considered in-
dependent. In any case, one particular decomposition can be supported a posteriori,
i.e. showing how useful it is to understand how the overall bivariate or multivariate
system behaves when examined under different conditions.
of CondEn, and show how this framework may be used to estimate SE, CE and TE
from short-length realizations of the observed multivariate processes.
The CondEn terms involved in Eqs. (8-10) and (11-13) have to be computed by
conditioning on the past history of one or more observed systems. In practical analy-
sis, this is achieved through the so-called state-space reconstruction of the observed
dynamical systems [31]. State-space reconstruction refers to identifying the finite-dimensional state variables that better approximate the past states of the observed
processes X_n^-, Y_n^-, and Z_n^-. The most commonly followed approach is to perform
uniform time-delay embedding, whereby each scalar process is mapped into trajectories described by delayed coordinates uniformly spaced in time [32]. In this way
the state variable of the target process Y, Y_n^-, is approximated with the delay vector
[Y_{n-u}, Y_{n-u-\tau}, \ldots, Y_{n-u-(d-1)\tau}], with d, τ and u representing respectively the so-
called embedding dimension, embedding time and prediction time. This procedure
suffers from many disadvantages. First, univariate embedding whereby coordinate
selection is performed separately for each process does not guarantee optimality of
the reconstruction for the multivariate state space [33]. Moreover, selection of the
embedding parameters d, τ and u is not straightforward, as many competing criteria
exist which are all heuristic and somewhat mutually exclusive [34]. Most importantly, uniform embedding exposes the state-space reconstruction procedure to the so-called "curse of dimensionality", a problem related to the sparsity of the available
data within state spaces of increasing volume [35]; this problem is exacerbated in
the presence of multivariate time series and when the series are of limited length, as
commonly happens in physiological system analysis due to lack of data or stationarity requirements. In these conditions the estimation of CondEn suffers from serious
limitations, as it is found that, whatever the underlying dynamics, short time series generate estimates of entropy rates that progressively decrease towards zero
as the embedding dimension increases [7], thus rendering the computed measures
completely useless. This issue has forced many authors to fix the embedding dimension
at very small arbitrary values to obtain reliable CondEn estimates (see, e.g., [36]).
To show how these problems can be counteracted in the practical estimation of mea-
sures of information dynamics, we describe in the following an estimation strategy
based on the utilization of a corrected CondEn estimator [7], and an improvement
of this strategy based on a non-uniform embedding technique [18].
where the summation is extended over all states (i.e., hypercubes) in the embedding
space, and the probabilities p(Vξ ) are estimated for each hypercube simply as the
fraction of quantized vectors Vξ falling into the hypercube (i.e., the frequency of
occurrence of Vξ within Ad ). An illustrative example is reported in Fig. 1, showing
estimation of H(yn ), H(yn−1, yn−3 ) and H(yn , yn−1 , yn−3 ) representing respectively
the entropies H(Yn ), H(Y − −
n ) and H(Yn ,Y n ) computed with an embedding vector
(d,u,τ )
yn =[yn−1 , yn−3 ].
A major problem in estimating the CondEn from time series of limited length
is that it always decreases towards zero as the embedding dimension d increases.
This results from the fact that, as d increases, the embedding vectors become
more and more isolated in the state space of increasing dimension, and this isolation results in an increasing number of vectors V_ξ found alone inside a hypercube
of the quantized space. This effect is seen already at low dimensions in Fig. 1c:
using y_{n-1} as embedding vector would have resulted in only one single
point, while the use of [y_{n-1}, y_{n-3}] as in the figure results in four single points.
The problem with single points is that, when a vector y_n^{(d,u,\tau)} is alone inside a
hypercube of the d-dimensional space, the vector [y_n, y_n^{(d,u,\tau)}] is also alone in the
(d+1)-dimensional space. Therefore, single points in the d-dimensional space give
to H(Y_n^-) the same contribution given to H(Y_n, Y_n^-) by the corresponding points in
the (d+1)-dimensional space, bringing a null contribution to H(Y_n \mid Y_n^-). Thus, the
increase of the number of single points with d leads to a progressive decrease of
Fig. 1 Example of state space partitioning of a time series for the computation of en-
tropy and conditional entropy. (a) The values yn of the series descriptive of the process
Y, ranging from ymin to ymax , are uniformly quantized using ξ =6 quantization levels; (b)
the values of yn are binned according to quantization, and the entropy H(Yn ) is estimated
as H(Yn )=-∑p(yn )logp(yn ), where the probabilities p(yn ) are estimated as the relative fre-
quency of visitation of each bin; (c) assuming a prediction time u=1, an embedding time
τ =2 and an embedding dimension d=2, all embedding vectors of the form V=[yn−1 ,yn−3 ]
built from the time series are represented in a bidimensional state space, and are assigned
to square bins resulting from the uniform quantization of the two coordinates (gray grid);
then the entropy H(Y_n^-) is estimated as H(V) = -∑ p(V) log p(V); (d) the analysis is repeated
for all values assumed by the vector [y_n, V] to estimate the entropy H(Y_n, Y_n^-) as H(y_n, V) = -∑ p(y_n, V) log p(y_n, V), where cubic bins now result from the uniform quantization of three coordinates. Then, the CondEn is estimated as H(Y_n|Y_n^-) = H(y_n, V) - H(V), and the CorrCondEn
as H^c(Y_n|Y_n^-) = H(Y_n|Y_n^-) + n(V)·H(Y_n), where n(V) is the fraction of vectors V found alone
inside a hypercube in panel (c) (gray squares). Note that single points in (c) always remain single also in the higher-dimensional space in (d), while other single points may appear (black
squares).
the estimated CondEn. This occurs even for completely unpredictable processes for
which the conditional entropy should stay at high values regardless of the embed-
ding dimension (an example is in Fig. 2a,c). To counteract this bias, a corrected
conditional entropy (CorrCondEn) can be defined as [7, 18]:
H^c(Y_n \mid Y_n^-) = H^c\left(Y_n \mid Y_n^{(d,u,\tau)}\right) = H\left(Y_n \mid Y_n^{(d,u,\tau)}\right) + n\left(Y_n^{(d,u,\tau)}\right) \cdot H(Y_n) \qquad (19)
where, in the context of uniform quantization, n(Y_n^{(d,u,\tau)}) is the fraction of single
points in the quantized space, i.e. the fraction of vectors Y_n^{(d,u,\tau)}, represented in
their quantized form, found only once within A_d (0 ≤ n(Y_n^{(d,u,\tau)}) ≤ 1). The scale
factor H(Y_n) is chosen because it represents the CondEn of a white noise with the
same probability distribution as the considered process; with this choice, the null
contribution of single points is substituted with the maximal information carried
by a white noise, so that the CondEn of the relevant white noise is estimated after
finding 100% of single points.
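For concreteness, the following sketch shows how a CorrCondEn of this form can be computed with the uniform quantization described above. It is a minimal illustration, not the authors' implementation; the function names (quantize, corr_cond_entropy), the use of natural logarithms, and the default parameter values are assumptions of the sketch.

```python
import numpy as np

def quantize(y, xi=6):
    """Uniformly quantize a series into xi levels between its min and max."""
    edges = np.linspace(y.min(), y.max(), xi + 1)
    return np.digitize(y, edges[1:-1])           # integer symbols 0..xi-1

def entropy(symbols):
    """Plug-in Shannon entropy (nats) of the rows of an integer-coded array."""
    _, counts = np.unique(symbols, axis=0, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def corr_cond_entropy(y, d, u=1, tau=1, xi=6):
    """Corrected conditional entropy as in (19): H(Yn|V) + n(V)*H(Yn),
    with V = [y_{n-u}, y_{n-u-tau}, ..., y_{n-u-(d-1)tau}]."""
    q = quantize(y, xi)
    lags = u + tau * np.arange(d)                # delays of the embedding components
    start = lags.max()
    target = q[start:]                           # the present sample y_n
    V = np.column_stack([q[start - l: len(q) - l] for l in lags])
    cond_en = entropy(np.column_stack([target, V])) - entropy(V)   # H(Yn,V) - H(V)
    # fraction of embedding vectors found alone in their hypercube
    _, counts = np.unique(V, axis=0, return_counts=True)
    n_single = counts[counts == 1].size / V.shape[0]
    return cond_en + n_single * entropy(target[:, None])
```

Scanning corr_cond_entropy over increasing d and retaining the minimum then yields the CondEn estimate discussed next.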
The CorrCondEn is the sum of two terms, the first decreasing and the second increasing with the dimension of the explored state space. Hence, Hc(Yn|Yn^−) = Hc(Yn|Yn^(d,u,τ)) exhibits a minimum over d, and this minimum value may be taken as an estimate of the CondEn. Following this idea, CondEn analysis may be performed over short time series without constraining the embedding dimension to low predetermined values. An example is shown in Fig. 2, illustrating the computation of the CorrCondEn Hc(Yn|Yn^(d,u,τ)) as a function of d, with parameters u=τ=1, ξ=6, for a second-order autoregressive process Y defined by two complex conjugate poles with modulus ρ and phase ϕ = π/4: Yn = 2ρ cos(ϕ) Yn−1 − ρ² Yn−2 + wn (w is a white noise innovation process). The regularity of the process is determined by the parameter ρ: with ρ=0 the process reduces to a fully unpredictable white noise (Fig. 2a), while with ρ=0.98 a partially predictable stochastic oscillation is set (Fig. 2b). Accordingly, the entropy of Yn conditioned on Yn^(d,u,τ) = [Yn−1, ..., Yn−d] is expected to be high and constant for varying d when ρ=0, and to show a minimum reflecting the predictability of the process when ρ=0.98. These two situations are well reproduced by the CorrCondEn. For the white noise process, the slow decrease of H(Yn|Yn^(d,u,τ)) with increasing d (dashed line) is fully compensated by the corrective term n(Yn^(d,u,τ))·H(Yn) (dotted line), resulting in a roughly flat profile of Hc(Yn|Yn^(d,u,τ)) (solid line) and in a minimum estimate close to the expected CondEn (Fig. 2c). For the partially predictable process, the decrease of H(Yn|Yn^(d,u,τ)) is substantial already at low values of d, owing to the usefulness of the past samples in describing the present of Y, while the corrective term intervenes at higher values of d, thus producing a well defined minimum at d=5 (Fig. 2d).
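As a companion to the figure, a short sketch of how such an AR(2) realization can be generated; the unit noise variance, seed and discarded transient are assumptions of the sketch, not specifications from the text.

```python
import numpy as np

def ar2_poles(n=300, rho=0.98, phi=np.pi / 4, seed=0):
    """Simulate Y_n = 2*rho*cos(phi)*Y_{n-1} - rho**2*Y_{n-2} + w_n,
    an AR(2) process with complex-conjugate poles of modulus rho and phase phi."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(n + 100)             # unit-variance white noise (assumed)
    y = np.zeros(n + 100)
    a1, a2 = 2 * rho * np.cos(phi), -rho ** 2
    for t in range(2, n + 100):
        y[t] = a1 * y[t - 1] + a2 * y[t - 2] + w[t]
    return y[100:]                               # drop an initial transient

y_noise = ar2_poles(rho=0.0)                     # fully unpredictable white noise
y_osc = ar2_poles(rho=0.98)                      # partially predictable oscillation
```

Feeding either realization to the corrected conditional entropy sketched earlier, and scanning d from 1 to 15, reproduces the qualitative behaviour of Fig. 2c,d.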
In accordance with the above described procedure, an estimate of the SE in (8) results simply by subtracting the estimated CorrCondEn from the Shannon entropy of the series. The same procedure may be easily followed to estimate the CE in (9), simply conditioning on Xn^− instead of on Yn^− (i.e., using Xn^(d,u,τ) in place of Yn^(d,u,τ)) in the computation of the CorrCondEn [9]. However, a possible limitation of this procedure lies in the fact that the terms used for conditioning are included progressively into the embedding vector without checking their effective relevance for describing the dynamics of the target process. While the progressive inclusion based on the time lag of the past terms (i.e. the terms Yn−1, Yn−2, ... are sequentially added to the embedding vector when conditioning on Yn^−) is intuitive and works well under most circumstances, it is exposed to the inclusion of irrelevant terms, which is likely to impair the detection of dependencies. This problem is exacerbated in the presence
Fig. 2 Example of computation of the corrected conditional entropy for short time series (N=300 points) with different levels of predictability [7]. (a) Realization of a fully unpredictable white noise; (b) realization of a partially predictable autoregressive process; (c,d) corresponding estimated profiles of the CondEn (dashed line), the corrective term (dotted line) and the CorrCondEn (solid line) obtained by varying the dimension d of the uniform embedding from 1 to 15.
of short time series, for which the corrective term prevents the exploration of high-
dimensional state spaces. Moreover, the issue may become critical when one aims
at estimating information measures that account for conditioning schemes involv-
ing several different variables, such as the bivariate TE in (10) and the multivariate
extensions of SE, CE and TE defined in (11-13). In these situations, a reliable esti-
mation of the CondEn in the presence of short realizations of multiple conditioning
processes may be performed only through an intelligent embedding strategy that al-
lows to include into the embedding vector only the terms which are relevant to the
dynamics of the target process. This is achieved by the procedure for nonuniform
embedding presented in the next subsection.
of H(Yn|Xn^−, Yn^−, Zn^−) will be the set Ω2 = {Ω1, Xn−1, ..., Xn−L}, where L is the number of time lagged terms to be tested for each process. Given the generic candidate set Ω, the procedure for estimating the CorrCondEn Hc(Yn|Ω) starts with an empty embedding vector V0 = [·], and proceeds as follows:
• at each step k ≥ 1, form the candidate vector [s, Vk−1], where s is an element of Ω not already included in Vk−1, and compute the CorrCondEn of the target process y given the considered candidate vector, Hc(Yn|[s, Vk−1]);
• repeat the previous step for all possible candidates, and then retain the candidate for which the CorrCondEn is minimum, i.e., set Vk = [s*, Vk−1] where s* = arg min_s Hc(Yn|[s, Vk−1]);
• terminate the procedure when a minimum in the CorrCondEn is found, i.e., at the step k* such that Hc(Yn|Vk*) ≥ Hc(Yn|Vk*−1), and set Vd = Vk*−1 as the embedding vector.
This procedure is devised to include into the embedding vector only the components that effectively contribute to resolving the uncertainty of the target process (in terms of CondEn reduction), while leaving out the irrelevant components. This feature, together with the termination criterion that prevents the selection of new terms when they do not bring further resolution of uncertainty for the destination process, helps escape the curse of dimensionality in the multivariate estimation of the CondEn. Moreover, the procedure avoids the nontrivial task of setting the embedding parameters d, τ and u (the only parameter here is the number L of candidates to be tested for each process, which can be as high as allowed by the affordable computational times).
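A minimal sketch of this greedy selection is given below, assuming the target and all candidate terms have already been quantized and aligned (e.g. with the quantize helper of the earlier sketch); the names _H, _cce and nonuniform_embedding are illustrative only.

```python
import numpy as np

def _H(a):
    """Plug-in entropy (nats) of the rows of an integer-coded array."""
    _, c = np.unique(a, axis=0, return_counts=True)
    p = c / c.sum()
    return -np.sum(p * np.log(p))

def _cce(target, V):
    """Corrected conditional entropy of target given the embedding matrix V."""
    if V.shape[1] == 0:
        return _H(target[:, None])               # empty embedding: plain entropy H(Yn)
    _, c = np.unique(V, axis=0, return_counts=True)
    n_single = c[c == 1].size / V.shape[0]       # fraction of single points
    return _H(np.column_stack([target, V])) - _H(V) + n_single * _H(target[:, None])

def nonuniform_embedding(target, candidates):
    """Greedy selection; candidates is a dict {label: lagged series aligned with target}.
    Returns the selected labels and the minimum corrected conditional entropy."""
    selected, V = [], np.empty((len(target), 0), dtype=int)
    best = _cce(target, V)
    while True:
        trials = {lab: _cce(target, np.column_stack([V, s[:, None]]))
                  for lab, s in candidates.items() if lab not in selected}
        if not trials:
            break
        lab = min(trials, key=trials.get)        # candidate minimizing the CorrCondEn
        if trials[lab] >= best:                  # no further uncertainty reduction: stop
            break
        best = trials[lab]
        selected.append(lab)
        V = np.column_stack([V, candidates[lab][:, None]])
    return selected, best
```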
To illustrate the procedure we report an example of computation of the multivariate TE on a short realization (N=300 points) of the M=3 coupled processes described by [19]:

Xn = 1.4 − Xn−1² + 0.3 Xn−2
Yn = 1.4 − 0.5 (Xn−1 + Yn−1) Yn−1 + 0.1 Yn−2      (20)
Zn = |Xn−3| + Yn−1
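The following sketch simulates these maps; the zero initial conditions and the discarded transient are assumptions of the sketch rather than specifications from [19].

```python
import numpy as np

def simulate_system(n=300, transient=1000):
    """Iterate the three coupled maps of Eq. (20) and return n samples of X, Y, Z."""
    T = n + transient
    x, y, z = np.zeros(T), np.zeros(T), np.zeros(T)
    for t in range(3, T):
        x[t] = 1.4 - x[t - 1] ** 2 + 0.3 * x[t - 2]
        y[t] = 1.4 - 0.5 * (x[t - 1] + y[t - 1]) * y[t - 1] + 0.1 * y[t - 2]
        z[t] = abs(x[t - 3]) + y[t - 1]
    return x[transient:], y[transient:], z[transient:]

x, y, z = simulate_system()
```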
For instance, if we consider the analysis from X to Y (Fig. 3a), we see that the embedding of Yn based on the set of candidates Ω1 = {Yn^−, Zn^−} ≈ {Yn−1, ..., Yn−L, Zn−1, ..., Zn−L} terminates at the step d=4, returning the embedding vector V4 = [yn−1, yn−3, zn−3, zn−1] and the corresponding CorrCondEn Hc(Yn|Yn^−, Zn^−) = 0.396; at the second repetition the set of candidates is Ω2 = {Xn^−, Yn^−, Zn^−} ≈ {Ω1, Xn−1, ..., Xn−L}, and we see that the procedure selects a term from the input system, xn−1, at the second step, leading to a decreased CorrCondEn minimum, Hc(Yn|Xn^−, Yn^−, Zn^−) = 0.263, and ultimately to a positive information transfer measured by the TE TX→Y|Z. Note that for this realization the obtained embedding vector is exactly the one expected from the generating equation of y, i.e., V3 = [yn−1, xn−1, yn−2] (see (20)). On the contrary, if we consider the opposite direction of interaction from Y to X (Fig. 3d), we see that the two repetitions of the embedding procedure yield the same embedding vector, V4 = [xn−1, xn−2, xn−5, xn−3]. In this case two terms in excess are selected besides the two terms entering the equation for X in (20), but, though confounding the interpretation of the internal dynamics of X, this does not lead to detection of spurious information transfer, as the embedding vector is unchanged from the first to the second repetition, so that Hc(Xn|Xn^−, Zn^−) = Hc(Xn|Xn^−, Yn^−, Zn^−) = 0.308 and TY→X|Z = 0.
prior knowledge about the propagation times in the overall dynamical system (see,
e.g., [20] for cardiovascular variability or [39] for magnetoencephalography).
The parameter related to the binning procedure for entropy estimation is the number of quantization levels ξ used to coarse-grain the dynamics of the observed time series. Theoretically, increasing ξ would lead to a finer partition of the state space and better estimates of the conditional probabilities. However, this observation holds for time series of infinite length, while in practical applications with series of length N the number of quantization levels should remain low enough that ξ^d ≈ N [7, 18]. In the studies of short-term cardiovascular and cardiorespiratory variability reviewed in Sect. 4, where N≈300 and CorrCondEn estimates were usually obtained from three lagged terms (or at most four in a few cases), the common choice is to use ξ=6 levels. A number of levels such that ξ^d ≈ N may seem too high according to some other prescriptions (e.g., Lungarella et al. [40] recommend working with a number of hypercubes at least three times lower than the series length). However, the suitability of our choice may be explained by the fact that the search for relevant components achieved by non-uniform embedding targets only a restricted "typical set" of hypercubes with higher probability than the other regions of the state space (see [21], chapter 3), thus allowing some extent of over-quantization with respect to traditional embedding.
As seen in Sect. 3.2, the non-uniform embedding approach for computing CorrCondEn allows reliable estimation of information dynamics measures from short realizations of multivariate processes. Nevertheless, it suffers from some limitations that leave room for improving the estimation of information dynamics measures. A main problem of the approach is the selection of some terms in excess during the sequential embedding. This is seen in the reported simulation example, where xn−5 and xn−3 are selected in the embedding of X (Fig. 3d,e) and xn−4 is selected in the embedding of Z (Fig. 3b,c); while this mis-selection is not problematic in terms of TE computation, it may hamper the estimation of the other terms of an information decomposition, or other tasks like delay estimation. A first explanation for the detection of excess terms may be that the contribution of the corrective term is not strong enough to produce the CorrCondEn minimum before the inclusion of irrelevant terms. From this point of view, we tested alternative corrections: e.g., a more strict selection is proposed in [7, 9] using the corrective term n(Yn, Yn^(d,u,τ)) in place of the term n(Yn^(d,u,τ)) used here in (19) and in [18, 20]. Nevertheless, a balance always needs to be found, because a more strict selection decreases the rate of false detections but at the same time increases the number of missed detections.
More generally, factors that may affect the accuracy of component selection are:
(i) the estimator of CondEn; (ii) the empirical nature of the correction; and (iii) the
sub-optimal nature of the exploration of candidates, which, being sequential and not
exhaustive, somehow disregards joint effects that more candidates may have on the
reduction of the CondEn. The binning entropy estimator used here in Eq. (17) may
be inaccurate due to its known bias [37] and to the fact that the associated quantiza-
tion may leave a certain amount of information unexplained even after selection of
the correct causal sources, and thus leave room for excess source selection. While in principle any alternative entropy estimator might be used, we remark that in the context of non-uniform embedding the corrective term serves not only to compensate the bias but also to guarantee the existence of a CondEn minimum, which we use to terminate the sequential procedure and so avoid the inclusion of irrelevant terms. Therefore, integrating any improved entropy measure within the proposed procedure still requires a clear minimum of the CondEn estimated at increasing embedding dimensions.
From this point of view, the utilization of accurate Shannon entropy estimators such
as those based on kernels or nearest neighbors [37] would face the necessity of
counteracting the isolation of the embedding vectors in state spaces of increasing
dimension through a corrective term. An interesting alternative solution might be
that recently proposed in [41], where a k-nearest neighbor approach was pursued to
estimate the CondEn directly in one step (rather than in two steps as the difference
between entropy estimates) yielding an estimate which exhibits a minimum over
the embedding dimension without requiring the addition of a corrective term. An-
other way to avoid the use of a corrective term would be to assess, at each step of the selection procedure, the statistical significance of the contribution brought by the selected candidate to the description of the target process, so that only candidates bringing a significant contribution are selected and the procedure terminates when the contribution of the selected candidate is not significant. We are currently
exploring this alternative criterion, both using the binning entropy estimator [42] and using nearest neighbor estimators [26]. As to point (iii), the problem is that a sequential exploration of the candidate space does not guarantee convergence to the absolute minimum of the CondEn, and thus it does not assure non-negative values for the measures defined as differences between two CondEn terms, like those defined in Eqs. (10-13). Nevertheless, a sequential approach needs to be adopted because an exhaustive exploration of all possible candidate terms, which would lead to the absolute CondEn minimum, becomes computationally intractable already at low embedding dimensions; e.g., in a common practical situation with M=4 conditioning processes and L=5 candidates explored per process, the number of combinations to be tested would be 4845 for k=4 and 15504 for k=5. The possibility of finding negative values for SE, CE and TE computed with this approach suggests the need to assess the statistical significance of each estimated measure, e.g. through the utilization of surrogate data [18, 20]. Of note, the introduction of a significance criterion for candidate selection as mentioned above would implicitly provide a tool to assess the statistical significance of information measures without resorting to surrogate approaches [42, 43].
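As an illustration of such a surrogate assessment, the sketch below builds a null distribution by time-shifting the source series. The estimator transfer_entropy is a hypothetical placeholder for whichever TE estimator is used (e.g. a difference of CorrCondEn terms); the shift-based surrogates and the number of surrogates are assumptions of the sketch.

```python
import numpy as np

def te_significance(x, y, transfer_entropy, n_surr=100, min_shift=20, seed=0):
    """Compare TE(x -> y) against a null distribution built from time-shifted
    surrogates of the source x; returns the estimate and an empirical p-value."""
    rng = np.random.default_rng(seed)
    te_obs = transfer_entropy(x, y)
    null = []
    for _ in range(n_surr):
        shift = rng.integers(min_shift, len(x) - min_shift)
        null.append(transfer_entropy(np.roll(x, shift), y))   # destroys the x -> y coupling
    p = (1 + np.sum(np.array(null) >= te_obs)) / (n_surr + 1)
    return te_obs, p
```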
period variability increased with tilt and paced ventilation at low breathing rates,
likely due to the entrainment of multiple physiological mechanisms into specific fre-
quency bands. In the case of administration of a high dose of atropine the reduction
of complexity is not due to the entrainment of different physiological mechanisms
but, more likely, to the reduction of the complexity of the neural inputs to the sinus
node due to the cholinergic blockade. Moreover, systolic arterial pressure and mus-
cle sympathetic variability series were respectively more regular and more complex
than heart period variability, and their regularity was not markedly affected by the
specified experimental conditions.
The results about SE analysis of the heart period variability during tilt test and
paced breathing protocols were strengthened and further interpreted in a following
study [44]. Moreover, a subsequent study demonstrated the ability of SE measures
to evidence a progressive decrease of the complexity of heart period variability as
a function of the tilt table inclination during graded head-up tilt [16]. This finding
was of great relevance as it established a straightforward link between physiolog-
ical mechanisms and the behavior of an information dynamical quantity like the
SE. Indeed, since graded head-up tilt produces a gradual shift of the sympathovagal
balance toward sympathetic activation and parasympathetic deactivation, the cor-
responding gradual decrease of CorrCondEn observed in the study indicated that
complexity of heart period variability is under the control of the autonomic nervous
system. Another interesting result of the study was that standard measures related
to SE like the approximate entropy [5] were unable to reveal the same gradual de-
crease in complexity during the protocol unless they were corrected according to a
strategy similar to that presented in Sect. 3.1. This pointed out the necessity of ex-
ploiting the CorrCondEn or similar measures devised according to the same strategy
to extract fruitful information from the short data sequences commonly available in
experimental settings.
Another interesting applicative context of SE was the characterization of the neu-
ral control on heart rate variability during sleep [15, 45], a condition which is known
to be associated with important changes of the autonomic cardiovascular regulation.
In [15], the complexity of heart period variability of healthy subjects was found
to follow a circadian pattern characterized by larger CorrCondEn during night-
time than during daytime; this day-night variation was lost in heart failure patients
due to a tendency of complexity to increase during daily activities and decrease at
night, corroborating the association between SE and sympathetic modulation. Interestingly, significant circadian variations were observed only after normalizing the CorrCondEn to the entropy of the heart period series; this suggested the opportunity of reducing, through normalization, the dependence of the estimated SE on the shape of the static distribution of the observed process, so as to magnify the dynamical complexity in the resulting normalized measure. In [45], the short term complexity
of heart period variability was characterized during different sleep stages in young
and elderly healthy persons, observing a significant reduction of CorrCondEn in
older subjects, especially during REM sleep. These results suggested that with ag-
ing REM sleep is associated with a simplification of the mechanisms of cardiac
control, that could lead to an impaired ability of the cardiovascular system to react
to adverse events.
varying d). This synchronization measure was first used to measure the coupling
strength between the beat-to-beat variability of the sympathetic discharge and ven-
tilation in decerebrate artificially ventilated cats [9]. The measure was able to reflect
the coupling between sympathetic discharge and ventilation, being very large in the
presence of periodic dynamics in which the sympathetic discharge is locked to the
respiratory forcing input and close to zero for quasiperiodic or aperiodic dynam-
ics resulting, for instance, after spinalization. The synchronization index was also
utilized in humans to evaluate the coupling degree of bivariate systems compris-
ing cardiac, vascular, pulmonary and muscular systems in response to experimental
maneuvers or in pathologic conditions, leading to important results which were re-
lated to physiological mechanisms in health and disease. Specifically, Porta et al. [9]
observed that the synchronization between the beat-to-beat variability of the heart
period and the ventricular repolarization interval was not changed by experimental
conditions that alter the sympathovagal balance but strongly decreased after myocardial infarction. Nollo et al. [14] also found that after infarction the synchronization index is associated with an impaired cardiovascular response to head-up tilt, observing that the index computed between heart period and arterial pressure variability decreased significantly in post-infarction patients, while it increased in healthy subjects. Moreover, relevant results from [13] were that: the cardiovascular coupling was significant but weak at rest, and increased with head-up tilt and paced breathing; the cardiopulmonary and vasculo-pulmonary couplings were significant and increased with paced breathing at 10 breaths/min; and muscle nerve activity and respiration were uncoupled in the control condition but became coupled after atropine administration.
The CE based on CorrCondEn was also successfully exploited as an asymmetric
measure of coupling quantifying the directed information in bivariate physiolog-
ical systems, with special emphasis on the study of the closed loop interactions
between the spontaneous variability of heart period and arterial pressure in humans.
In this applicative context, the CE has been proven useful in disentangling this in-
tricate closed loop, evidencing information flows directed either through the barore-
flex (i.e., from systolic pressure to heart period) or through circulatory mechanics
(i.e. from heart period to systolic pressure). Nollo et al. [14] pointed out that the
information flow was balanced over the two directions and higher during head-up
tilt than at rest in young healthy subjects, while it was unbalanced (with prevalence
of the information flow from heart period to systolic arterial pressure) and lower
during head-up tilt in post-myocardial infarction patients. Porta et al. [25] demon-
strated the usefulness of CE, compared with the traditional approach based on the
analysis of Fourier phases, in detecting the dominant direction of interaction in the
cardiovascular loop. They showed that: (i) CE is able to detect the lack of informa-
tion transfer through the baroreflex in heart transplant recipients, and the gradual
restoration of this transfer with time after transplantation; (ii) CE quantitatively re-
flects the progressive shift from the prevalence of information transfer through the
circulatory mechanics to the prevalence of information transfer through the barore-
flex with tilt table inclination during graded head-up tilt in healthy subjects. Recent
studies [24, 46] focused on how the information transfer through the baroreflex, monitored by the CorrCondEn of the heart period given the systolic pressure, is modified as the prediction time u varies. In protocols of head-up tilt and pharmacological blockade of receptors, the authors showed that the expected monotonic decrease of the CE (i.e., increase of the CorrCondEn) observed while increasing the prediction time can be further characterized by looking at the rate at which this decrease of information transfer occurs. It was shown that such a rate contains useful information about the baroreflex control of heart rate in different experimental conditions.
deeply in [17, 19]. The studies were aimed at the data-driven investigation of the
modes of cardiovascular, cardiopulmonary and vasculo-pulmonary interactions both
in resting physiological conditions and during experimental maneuvers like head-
up tilt and paced breathing. TE analysis was able to describe well known mecha-
nisms of cardiovascular and cardiorespiratory regulation, as well as to support the
interpretation of other more debated mechanisms. Examples were the shift from
balanced bidirectional exchange of information between heart period and arterial
pressure in the supine position to the prevalence of information transfer through the
baroreflex in the upright position, and the mechanical effects of respiration on both
heart period and arterial pressure variability with their enhancement during paced
breathing and dampening during head-up tilt. Moreover, the utilization of a fully
multivariate approach allowed to disambiguate the role of respiration on the closed
loop interactions between heart period and arterial pressure variability. In particu-
lar, the estimated information flows suggested that short-term heart rate variability
is mainly explained by central mechanisms of respiratory sinus arrhythmia in the
resting supine position during spontaneous and paced breathing, and by baroreflex-
mediated phenomena in the upright position.
In a recent study we have dealt with a common problem in the practical esti-
mation of the multivariate TE from real physiological data, that is, the presence of
instantaneous effects which likely impair or confound the assessment of the infor-
mation transfer between coupled systems [20]. Instantaneous effects are effects oc-
curring between two time series within the same time lag, and may reflect either fast,
within sample physiologically meaningful interactions or be void of physiological
meaning (e.g., may be due to unobserved confounders). While the traditional for-
mulation of the TE does not account for instantaneous effects, we faced this issue
allowing the possible presence of instantaneous effects through proper inclusion of
the zero-lag term in the computation of CorrCondEn based on nonuniform embed-
ding. The approach was devised according to two different strategies for the com-
pensation of instantaneous effects, respectively accounting for causally meaningful
and non-meaningful zero-lag effects. The resulting measure, denoted as compen-
sated TE, was validated on simulations and then evaluated on physiological time
series. In cardiovascular and cardiorespiratory variability, where the construction of
the time series suggests the existence of physiological causal effects occurring at lag
zero, the compensated TE evidenced better than the traditional TE the presence of
expected interaction mechanisms (e.g., the baroreflex). In magnetoencephalography
analysis performed at the sensor level, where instantaneous effects are likely the
result of the simultaneous mapping of single sources of brain activity onto several
recording sensors, utilization of the proposed compensation suggested the activa-
tion of multisensory integration mechanisms in response to a specific stimulation
paradigm.
Finally, we have recently started considering an integrated perspective in which
the TE is an element of the information domain characterization of coupled phys-
iological systems. In [47] we studied the TE and the SE as factors in the decom-
position of the predictive information in bivariate physiological systems, according
to the interpretation suggested here in Sect. 3.2 (Eq. (15a)). The study was aimed
References
1. Bullmore, E., Sporns, O.: Complex brain networks: graph theoretical analysis of struc-
tural and functional systems. Nat. Rev. Neurosci. 10, 186 (2009)
2. Bashan, A., Bartsch, R.P., Kantelhardt, J.W., Havlin, S., Ivanov, P.C.: Network physiol-
ogy reveals relations between network topology and physiological function. Nat. Commun. 3 (2012)
3. Lizier, J.T.: The local information dynamics of distributed computation in complex sys-
tems. Springer, Heidelberg (2013)
4. Faes, L., Nollo, G.: Multivariate frequency domain analysis of causal interactions in
physiological time series. In: Laskovski, A.N. (ed.) Biomedical Engineering, Trends in
Electronics, Communications and Software. InTech, Rijeka (2011)
5. Pincus, S.M.: Approximate Entropy As A Measure of System-Complexity. Proc. Natl.
Acad. Sci. USA 88, 2297–2301 (1991)
6. Richman, J.S., Moorman, J.R.: Physiological time-series analysis using approximate
entropy and sample entropy. Am. J. Physiol. Heart Circ. Physiol. 278, H2039–H2049
(2000)
7. Porta, A., Baselli, G., Liberati, D., Montano, N., Cogliati, C., Gnecchi-Ruscone, T.,
Malliani, A., Cerutti, S.: Measuring regularity by means of a corrected conditional en-
tropy in sympathetic outflow. Biol. Cybern. 78, 71–78 (1998)
8. Paluš, M., Komárek, V., Hrnčı́ř, Z., Štěrbová, K.: Synchronization as adjustment of in-
formation rates: detection from bivariate time series. Phys. Rev. E 63, 046211 (2001)
9. Porta, A., Baselli, G., Lombardi, F., Montano, N., Malliani, A., Cerutti, S.: Conditional
entropy approach for the evaluation of the coupling strength. Biol. Cybern. 81, 119–129
(1999)
10. Schreiber, T.: Measuring information transfer. Phys. Rev. Lett. 85, 461–464 (2000)
11. Lizier, J.T., Pritam, S., Prokopenko, M.: Information Dynamics in Small-World Boolean
Networks. Artificial Life 17, 293–314 (2011)
12. Chicharro, D., Ledberg, A.: Framework to study dynamic dependencies in networks of
interacting processes. Phys. Rev. E 86, 041901 (2012)
13. Porta, A., Guzzetti, S., Montano, N., Pagani, M., Somers, V., Malliani, A., Baselli, G.,
Cerutti, S.: Information domain analysis of cardiovascular variability signals: evaluation
of regularity, synchronisation and co-ordination. Med. Biol. Eng. Comput. 38, 180–188
(2000)
14. Nollo, G., Faes, L., Porta, A., Pellegrini, B., Ravelli, F., Del Greco, M., Disertori, M.,
Antolini, R.: Evidence of unbalanced regulatory mechanism of heart rate and systolic
pressure after acute myocardial infarction. Am. J. Physiol. Heart Circ. Physiol. 283,
H1200–H1207 (2002)
15. Porta, A., Faes, L., Mase, M., D’Addio, G., Pinna, G.D., Maestri, R., Montano, N.,
Furlan, R., Guzzetti, S., Nollo, G., Malliani, A.: An integrated approach based on uni-
form quantization for the evaluation of complexity of short-term heart period variability:
Application to 24 h Holter recordings in healthy and heart failure humans. Chaos 17,
015117 (2007)
16. Porta, A., Gnecchi-Ruscone, T., Tobaldini, E., Guzzetti, S., Furlan, R., Montano, N.:
Progressive decrease of heart period variability entropy-based complexity during graded
head-up tilt. J. Appl. Physiol. 103, 1143–1149 (2007)
17. Faes, L., Nollo, G., Porta, A.: Information domain approach to the investigation of
cardio-vascular, cardio-pulmonary, and vasculo-pulmonary causal couplings. Front.
Physiol. 2, 1–13 (2011)
18. Faes, L., Nollo, G., Porta, A.: Information-based detection of nonlinear Granger causal-
ity in multivariate processes via a nonuniform embedding technique. Phys. Rev. E 83,
051112 (2011)
19. Faes, L., Nollo, G., Porta, A.: Non-uniform multivariate embedding to assess the infor-
mation transfer in cardiovascular and cardiorespiratory variability series. Comput. Biol.
Med. 42, 290–297 (2012)
20. Faes, L., Nollo, G., Porta, A.: Compensated transfer entropy as a tool for reliably esti-
mating information transfer in physiological time series. Entropy 15, 198–219 (2013)
21. Cover, T.M., Thomas, J.A.: Elements of information theory. Wiley, New York (2006)
22. Kaiser, A., Schreiber, T.: Information transfer in continuous processes. Physica D 166,
43–62 (2002)
23. Lizier, J.T., Prokopenko, M., Zomaya, A.Y.: Local measures of information storage in
complex distributed computation. Information Sciences 208, 39–54 (2012)
24. Porta, A., Catai, A.M., Takahashi, A.C.M., Magagnin, V., Bassani, T., Tobaldini, E.,
Montano, N.: Information Transfer through the Spontaneous Baroreflex in Healthy Hu-
mans. Meth. Inf. Med. 49, 506–510 (2010)
25. Porta, A., Catai, A.M., Takahashi, A.C., Magagnin, V., Bassani, T., Tobaldini, E., van de
Borne, P., Montano, N.: Causal relationships between heart period and systolic arterial
pressure during graded head-up tilt. Am. J. Physiol Regul. Integr. Comp. Physiol. 300,
R378–R386 (2011)
26. Vicente, R., Wibral, M., Lindner, M., Pipa, G.: Transfer entropy-a model-free measure of
effective connectivity for the neurosciences. Journal of Computational Neuroscience 30,
45–67 (2011)
27. Barnett, L., Barrett, A.B., Seth, A.K.: Granger causality and transfer entropy are equiva-
lent for Gaussian variables. Phys. Rev. Lett. 103, 238701 (2009)
28. Amblard, P.O., Michel, O.J.: The relation between Granger causality and directed infor-
mation theory: a review. Entropy 15, 113–143 (2013)
29. Vakorin, V.A., Krakovska, O.A., McIntosh, A.R.: Confounding effects of indirect con-
nections on causality estimation. J. Neurosci. Methods 184, 152–160 (2009)
30. Williams, P.L.: Nonnegative decomposition of multivariate information. ArXiv,
1004.2515 (2010)
31. Schreiber, T.: Interdisciplinary application of nonlinear time series methods. Phys.
Rep. 308, 1–64 (1999)
32. Takens, F.: Detecting strange attractors in fluid turbulence. In: Rand, D., Young, S.L.
(eds.) Dynamical Systems and Turbulence. Springer, Berlin (1981)
33. Vlachos, I., Kugiumtzis, D.: Nonuniform state-space reconstruction and coupling detec-
tion. Phys. Rev. E 82, 016207 (2010)
34. Small, M.: Applied nonlinear time series analysis: applications in physics, physiology
and finance. World Scientific (2005)
35. Runge, J., Heitzig, J., Petoukhov, V., Kurths, J.: Escaping the Curse of Dimensionality in
Estimating Multivariate Transfer Entropy. Phys. Rev. Lett. 108, 258701 (2012)
36. Pincus, S.M.: Approximate entropy (ApEn) as a complexity measure. Chaos 5, 110–117
(1995)
37. Hlaváčková-Schindler, K., Paluš, M., Vejmelka, M., Bhattacharya, J.: Causality detection
based on information-theoretic approaches in time series analysis. Phys. Rep. 441, 1–46
(2007)
38. Kugiumtzis, D., Tsimpiris, A.: Measures of Analysis of Time Series (MATS): A MAT-
LAB Toolkit for Computation of Multiple Measures on Time Series Data Bases. J. Stat.
Software 33, 1–30 (2010)
39. Wibral, M., Rahm, B., Rieder, M., Lindner, M., Vicente, R., Kaiser, J.: Transfer entropy
in magnetoencephalographic data: Quantifying information flow in cortical and cerebel-
lar networks. Progr. Biophys. Mol. Biol. 105, 80–97 (2011)
40. Lungarella, M., Pegors, T., Bulwinkle, D., Sporns, O.: Methods for quantifying the in-
formational structure of sensory and motor data. Neuroinformatics 3, 243–262 (2005)
41. Porta, A., Castiglioni, P., Bari, V., Bassani, T., Marchi, A., Cividjian, A., Quintin, L., Di
Rienzo, M.: K-nearest-neighbor conditional entropy approach for the assessment of the
short-term complexity of cardiovascular control. Physiol. Meas. 34, 17–33 (2013)
42. Faes, L., Nollo, G.: Decomposing the transfer entropy to quantify lag-specific Granger
causality in cardiovascular variability. In: Proc. of the 35th Annual Int. Conf. IEEE-
EMBS, pp. 5049–5052 (2013)
43. Kugiumtzis, D.: Direct-coupling information measure from nonuniform embedding.
Phys. Rev. E 87, 062918 (2013)
44. Porta, A., Guzzetti, S., Montano, N., Furlan, R., Pagani, M., Malliani, A., Cerutti, S.:
Entropy, entropy rate, and pattern classification as tools to typify complexity in short
heart period variability series. IEEE Trans. Biomed. Eng. 48, 1282–1291 (2001)
45. Viola, A.U., Tobaldini, E., Chellappa, S.L., Casali, K.R., Porta, A., Montano, N.: Short-
Term Complexity of Cardiac Autonomic Control during Sleep: REM as a Potential Risk
Factor for Cardiovascular System in Aging. PLoS One 6 (2011)
46. Porta, A., Castiglioni, P., Di Rienzo, M., Bari, V., Bassani, T., Marchi, A., Wu, M.A.,
Cividjian, A., Quintin, L.: Information domain analysis of the spontaneous baroreflex
during pharmacological challenges. Auton. Neurosci. 178(1-2), 67–75 (2013)
47. Faes, L., Porta, A., Rossato, G., Adami, A., Tonon, D., Corica, A., Nollo, G.: Investi-
gating the mechanisms of cardiovascular and cerebrovascular regulation in orthostatic
syncope through an information decomposition strategy. Auton. Neurosci. 178(1-2), 76–
82 (2013)
Information Transfer in the Brain: Insights from
a Unified Approach
1.1 Model
We use a simple dynamical model with a threshold in order to quantify and inves-
tigate this phenomenon. Given an undirected network of n nodes and symmetric
connectivity matrix Ai j ∈ {0, 1}, to each node we associate a real variable xi whose
evolution, at discrete times, is given by:
xi(t + 1) = F( ∑_{j=1}^{n} Aij xj(t) ) + σ ξi(t),      (1)
where ξ are unit variance Gaussian noise terms, whose strength is controlled by σ ;
F is a transfer function chosen as follows:
F(α) = aα   if |α| < θ,
F(α) = aθ    if α > θ,        (2)
F(α) = −aθ   if α < −θ,
where θ is a threshold value. This transfer function is chosen to mimic the fact that each unit is capable of handling only a limited amount of information. For large θ our model becomes a linear map. At intermediate values of θ, the nonlinearity connected to the threshold will affect mainly the most connected nodes (hubs): the input ∑ Aij xj to nodes with low connectivity will remain typically sub-threshold in
Fig. 1 Examples of the three network architectures used in this study. Left: Preferential Attachment. Center: Homogeneous. Right: Scale-free.
Fig. 2 Segments of 200 time points from typical time series simulated in the scale-free network for three values of θ (θ = 0.001, 0.012, 0.1)
Fig. 3 The ratio R between the standard deviation of cout and that of cin is plotted versus θ for the three network architectures: preferential attachment (PRE), deterministic scale free (SFN) and homogeneous (HOM). The parameters of the dynamical system are a = 0.1 and σ = 0.1. Networks built by preferential attachment are made of 30 nodes and 30 undirected links, while the deterministic scale free network has 27 nodes. The homogeneous networks have 27 nodes, each connected to two other randomly chosen nodes.
threshold is applied to the connectivity matrix, so that all the information flowing in
the network is accounted for. We then evaluate the standard deviation of the distri-
butions of cin and cout , from all the nodes, varying the realization of the preferential
attachment network and implementing eqs. (1) for 10000 time points.
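A compact sketch of this simulation setup, using networkx to build a preferential-attachment graph and iterating the map of Eqs. (1)-(2); the graph generator, seed and exact number of links are assumptions of the sketch.

```python
import numpy as np
import networkx as nx

def simulate_threshold_network(n_nodes=30, a=0.1, sigma=0.1, theta=0.012, T=10000, seed=0):
    """Simulate Eqs. (1)-(2): a linear map passed through a saturating transfer
    function on an undirected preferential-attachment graph."""
    rng = np.random.default_rng(seed)
    G = nx.barabasi_albert_graph(n_nodes, 1, seed=seed)   # roughly n_nodes undirected links
    A = nx.to_numpy_array(G)
    x = np.zeros((T, n_nodes))
    for t in range(1, T):
        drive = A @ x[t - 1]                               # input sum_j A_ij x_j(t-1)
        F = np.clip(a * drive, -a * theta, a * theta)      # Eq. (2): saturation at +/- a*theta
        x[t] = F + sigma * rng.standard_normal(n_nodes)
    return x
```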
In figure 3 we depict R, the ratio between the standard deviation of cout and that of cin, as a function of θ. As the threshold is varied, we encounter a range of values for which the distribution of cin is much narrower than that of cout. In the same figure we also depict the corresponding curve for deterministic scale free networks [3], which exhibits a similar peak, and for homogeneous random graphs (or Erdos-Renyi networks [17]), for which R is always very close to one. The discrepancy between the distributions of the incoming and outgoing causalities thus arises in hierarchical networks. We remark that, in order to quantify the difference between the distributions of cin and cout, here we use the ratio of standard deviations, but qualitatively similar results would be obtained using other measures of discrepancy.
In figure 4 we report the scatter plot in the plane cin − cout for preferential at-
tachment networks and for some values of the threshold. The distributions of cin
and cout , with θ equal to 0.012 and corresponding to the peak of figure 3, are de-
picted in figure 5: cin appears to be exponentially distributed, whilst cout shows a fat
tail. In other words, the power-law connectivity of the underlying network influences just the distribution of outgoing directed influences. In figure 6 we show the
Fig. 4 Scatter plot in the plane cin − cout for undirected networks of 30 nodes and 30 links
built by means of the preferential attachment mechanism. The parameters of the dynamical
system are a = 0.1 and σ = 0.1. The points correspond to all the nodes pooled from 100
realizations of preferential attachment networks, each with 10 simulations of eqs. (1) for
10000 time points. (Top-left) Scatter plot of the distribution for all nodes at θ = 0.001. (Top-
right) Contour plot of the distribution for all nodes at θ = 0.012. (Bottom-left) Scatter plot of
the distribution for all nodes at θ = 0.1. (Bottom-right) The total Granger causality (directed
influence) (obtained summing over all pairs of nodes) is plotted versus θ ; circles point to the
values of θ in the previous subfigures.
average value of cin and cout versus the connectivity k of the network node: cout
grows uniformly with k, thus confirming that its fat tail is a consequence of the
power law of the connectivity. On the contrary cin appears to be almost constant: on
average the nodes receive the same amount of information, irrespective of k, whilst
the outgoing information from each node depends on the number of neighbors. It
is worth mentioning that, since a precise estimation of the information flow is computationally expensive, our simulations are restricted to rather small networks; in particular the distribution of cout appears to have a fat tail but, due to our limited data, we cannot claim that it corresponds to a simple power law. The same model
was then implemented on an anatomical connectivity matrix obtained via diffusion
spectrum imaging (DSI) and white matter tractography [22]. Also in this case we
observe a modulation of R and some scatter plots (figure 7) qualitatively similar to
the ones depicted in figures 3 and 4. In this case a multimodal distribution emerges
for high values of θ , as we can observe also in the histograms in figure 8. In figure 9
we can clearly identify some nodes in the structural connection matrix in which the
Fig. 5 For the preferential attachment network, at θ = 0.012, the distributions (by smooth-
ing spline estimation) of cin and cout for all the nodes, pooled from all the realizations, are
depicted. Units on the vertical axis are arbitrary.
[Fig. 6: average values of cin and cout versus the node connectivity k]
law of diminishing marginal returns is highly expressed. The value of the threshold also has an influence on the ratio S between interhemispheric and intrahemispheric information transfer (figure 10). Interestingly, the maximum of this ratio occurs at a finite value of θ, different from the one at which R is maximal.
Fig. 7 Top left: the ratio R between the standard deviation of cout and that of cin is plotted versus θ when the threshold model is implemented on the connectome structure. Remaining panels: plots in the plane cin − cout for three values of θ: 0.01 (top right), 0.0345 (bottom left), 0.5 (bottom right).
Fig. 8 The distributions of cin and cout for three values of θ when the threshold model is
implemented on the connectome structure. Units on the vertical axis are arbitrary.
Fig. 9 The ratio R between the standard deviation of cout and that of cin is mapped on the 66 regions of the structural connectivity matrix. In the figure 998 nodes are displayed; those belonging to the same region in the coarser template have the same color and size.
Causality [27] using a linear kernel and a model order of 5, determined by leave-
one-out cross-validation. We then pooled all the values for information flow towards
and from any electrode and analyzed their distribution.
In figure 11 we plot the incoming versus the outgoing values of the information
transfer, as well as the distributions of the two quantities: the incoming information
seems exponentially distributed whilst the outgoing information shows a fat tail.
These results suggest that overall brain effective connectivity networks may also be
considered in the light of the law of diminishing marginal returns.
More interestingly, this pattern is reproduced locally but with a clear modulation: a topographic analysis has also been made considering the distribution of incoming and outgoing causalities at each electrode. In figure 12 we show the distributions of incoming and outgoing connections corresponding to the electrode locations on the scalp, and the corresponding map of the parameter R; the law of diminishing marginal returns seems to affect mostly the temporal regions. This well defined pattern suggests a functional role for the distributions. It is worth noting that this pattern has been reproduced in other EEG data at rest from 9 healthy subjects collected for another study with different equipment.
[Fig. 11: incoming versus outgoing values of the information transfer, and the distributions of the two quantities]
Fig. 12 Left: the distributions for incoming (above, light grey) and outgoing (below, dark
grey) information at each EEG electrode displayed on the scalp map (original binning and
smoothing spline estimation). Right: the distribution on the scalp of R, the ratio between the
standard deviations of the distributions of outgoing and incoming information, for EEG data.
represents the best estimate of x, given X, and corresponds [32] to the regression function f*(X) = ∫ dx p(x|X) x. Now, let {ηn}n=1,...,N+m be another time series of simultaneously acquired quantities, and denote Yi = (ηi, ..., ηi+m−1). The best estimate of x, given X and Y, is now g*(X,Y) = ∫ dx p(x|X,Y) x. If the generalized Markov property holds, i.e. p(x|X) = p(x|X,Y) (4), then f*(X) = g*(X,Y) and the knowledge of Y does not improve the prediction of x. Transfer entropy [38] is a measure of the violation of (4): it follows that Granger causality implies non-zero transfer entropy [27]. Under the Gaussian assumption it can be shown that Granger causality and transfer entropy are entirely equivalent, and just differ by a factor of two [5]. The generalization of Granger causality to the multivariate case, described in the following, allows the analysis of dynamical networks [28] and makes it possible to discern between direct and indirect interactions.
Let us consider n time series {xα(t)}α=1,...,n; the state vectors are denoted Xα(t) = (xα(t − m), ..., xα(t − 1)), m being the window length (the choice of m can be done using the standard cross-validation scheme). Let ε(xα|X) be the mean squared error of the prediction of xα on the basis of all the vectors X (corresponding to linear regression or nonlinear regression by the kernel approach described in [27]). The multivariate Granger causality index c(β → α) is defined as follows: consider the prediction of xα on the basis of all the variables but Xβ and the prediction of xα using all the variables; the causality then measures the variation of the error in the two conditions, i.e.

c(β → α) = log [ ε(xα | X \ Xβ) / ε(xα | X) ].      (5)

When the conditioning is restricted to a subset Z of nd variables (partial conditioning), the index becomes

c(β → α) = log [ ε(xα | Z) / ε(xα | Z ∪ Xβ) ].      (8)
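A minimal linear version of these indices can be computed from least-squares residual variances, as in the following sketch; the function name granger_index and the regression details (e.g. the inclusion of an intercept) are assumptions of the sketch, not the kernel implementation of [27].

```python
import numpy as np

def granger_index(X, beta, alpha, m=2, cond=None):
    """Linear Granger causality c(beta -> alpha), Eq. (5)/(8), from residual variances.
    X: array (n_vars, T). cond: indices of the conditioning set (defaults to all
    variables, i.e. the fully multivariate index of Eq. (5)); the target's own past
    is included among the regressors."""
    n_vars, T = X.shape
    cond = list(range(n_vars)) if cond is None else list(cond)

    def residual_var(predictors):
        # regressors: m lagged values of each predictor, plus an intercept
        cols = [np.ones(T - m)]
        for j in predictors:
            cols += [X[j, m - lag:T - lag] for lag in range(1, m + 1)]
        Z = np.column_stack(cols)
        target = X[alpha, m:]
        coeff, *_ = np.linalg.lstsq(Z, target, rcond=None)
        return np.var(target - Z @ coeff)

    restricted = [j for j in cond if j != beta]        # prediction without the driver
    return np.log(residual_var(restricted) / residual_var(restricted + [beta]))
```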
Under the Gaussian assumption, the mutual information I(Xβ; Z) can be easily evaluated, see [5]. Moreover, instead of searching among all the subsets of nd variables, we adopt the following approximate strategy. Firstly, the mutual information between the driver variable and each of the other variables is estimated, in order to choose the first variable of the subset. The second variable of the subset is selected among the remaining ones as the one that, jointly with the previously chosen variable, maximizes the mutual information with the driver variable. Then, one keeps adding the rest of the variables by iterating this procedure. Calling Zk−1 the selected set of k − 1 variables, the set Zk is obtained by adding to Zk−1 the variable, among the remaining ones, providing the greatest information gain. This is repeated until nd variables are selected. This greedy algorithm for the selection of relevant variables is expected to give good results under the assumption of sparseness of the connectivity.
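A sketch of this greedy selection under the Gaussian assumption is given below, working directly on the scalar series stacked as rows of data (extending it to lagged state vectors is straightforward); gaussian_mi and select_conditioning_set are illustrative names.

```python
import numpy as np

def gaussian_mi(data, idx_a, idx_b):
    """I(A;B) in nats for jointly Gaussian rows of `data` (variables x samples)."""
    C = np.cov(data)
    a, b, ab = list(idx_a), list(idx_b), list(idx_a) + list(idx_b)
    det = lambda ix: np.linalg.det(C[np.ix_(ix, ix)])
    return 0.5 * np.log(det(a) * det(b) / det(ab))

def select_conditioning_set(data, beta, nd):
    """Greedy choice of the nd variables most informative about the driver `beta`."""
    n_vars = data.shape[0]
    remaining = [i for i in range(n_vars) if i != beta]
    Z = []
    for _ in range(nd):
        gains = {i: gaussian_mi(data, [beta], Z + [i]) for i in remaining}
        best = max(gains, key=gains.get)       # largest joint MI with the driver
        Z.append(best)
        remaining.remove(best)
    return Z
```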
where the a's are the couplings, s is the strength of the noise and the τ's are unit variance i.i.d. Gaussian noise terms. The level of noise determines the minimal number of samples needed to assess that the structures recovered by the proposed approach are genuine and not due to randomness, as happens for the standard Granger causality (see discussions in [27] and [28]); in particular the noise should not be so high as to obscure the deterministic effects.
As an example, we fix n = 34 and construct couplings in terms of the well known
Zachary data set [44], an undirected network of 34 nodes. We assign a direction
to each link, with equal probability, and set ai j equal to 0.015, for each link of the
directed graph thus obtained, and zero otherwise. The noise level is set s = 0.5. The
goal is again to estimate this directed network from the measurements of time series
on nodes.
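The following sketch reproduces this setup; since Eq. (9) is not reproduced in this excerpt, the dynamics are assumed to be the simple linear stochastic map written in the comments, and the direction convention of the coupling matrix is likewise an assumption.

```python
import numpy as np
import networkx as nx

def zachary_dynamics(T=1000, a=0.015, s=0.5, seed=0):
    """Linear stochastic dynamics on a randomly directed Zachary network.
    Assumed form: x_i(t+1) = sum_j a_ij x_j(t) + s*tau_i(t), a_ij = a on directed links."""
    rng = np.random.default_rng(seed)
    G = nx.karate_club_graph()                  # Zachary's 34-node network [44]
    A = np.zeros((34, 34))
    for i, j in G.edges():
        if rng.random() < 0.5:                  # assign a random direction to each link
            A[j, i] = a                         # link i -> j: x_j driven by x_i
        else:
            A[i, j] = a
    x = np.zeros((T, 34))
    for t in range(1, T):
        x[t] = A @ x[t - 1] + s * rng.standard_normal(34)
    return x, A
```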
In figure (13) we show the application of the proposed methodology to data sets generated by eqs. (9), in terms of sensitivity and specificity, for different numbers of samples. The bivariate analysis detects several false interactions; however, conditioning on a few variables is sufficient to put in evidence just the direct causalities. Due to the sparseness of the underlying graph, we get a result which is very close to the one given by the full multivariate analysis; indeed, the number of samples is sufficiently high for the multivariate analysis to recover the true network. In figure (14), concerning the stage of selection of the variables upon which to condition, we plot the mutual information gain Δy as a function of the number of included variables nd: it decreases as nd increases.
Fig. 13 Sensitivity and specificity for the recovery of the Zachary network structure from the dynamics at its nodes, plotted versus nd, the number of variables selected for conditioning, for two values of the number of samples N, 500 (left) and 1000 (right). The order is m = 2; similar results are obtained varying m. The results are averaged over 100 realizations of the linear dynamical system described in the text. The empty square, in correspondence to nd = 0, is the result from the bivariate analysis. The horizontal line is the outcome from multivariate analysis, where all variables are used for conditioning.
Fig. 14 The mutual information gain Δy when a new variable is included, plotted versus nd for two values of the number of samples N, 500 (top) and 1000 (bottom). The order is m = 2. The results are averaged over all the variables.
3 Informative Clustering
In this last section we propose a formal expansion of the transfer entropy to put in evidence irreducible sets of variables which provide information about the future state of each assigned target. Multiplets characterized by a high value will be associated with informational circuits present in the system, with an informational character (synergetic or redundant) which can be associated with the sign of the contribution. We also present results on fMRI and EEG data sets.
1 http://www.nitrc.org/projects/fcon_1000/
Fig. 16 Variables chosen among the 10 most informative when the target is the left posterior cingulate gyrus (in blue). The diameter of the red spheres is proportional to the number of times a region is selected across different subjects.
be superior to the standard time delayed mutual information, which fails to distin-
guish information that is actually exchanged from shared information due to com-
mon history and input signals. On the other hand, Granger causality formalized the
notion that, if the prediction of one time series could be improved by incorporating
the knowledge of past values of a second one, then the latter is said to have a causal
influence on the former. Initially developed for econometric applications, Granger
causality has gained popularity also in neuroscience (see, e.g., [9, 39, 16, 27]). A
discussion about the practical estimation of information theoretic indexes for sig-
nals of limited length can be found in [33].
Here we present a formal expansion of the transfer entropy to put in evidence irreducible sets of variables which provide information about the future state of the target. Multiplets characterized by a high value, unjustifiable by chance, will be associated with informational circuits present in the system, with an informational character (synergetic or redundant) which can be associated with the sign of the contribution.
ΔS(X)/ΔYi = S(X|Yi) − S(X) = −I(X; Yi),      (11)

Δ²S(X)/(ΔYi ΔYj) = −ΔI(X; Yi)/ΔYj = I(X; Yi) − I(X; Yi|Yj),      (12)

and so on.
Now, let us consider n + 1 time series {xα (t)}α =0,...,n . The lagged state vectors
are denoted
Yα (t) = (xα (t − m), . . . , xα (t − 1)),
m being the window length.
Firstly we may use the expansion (10) to model the statistical dependencies
among the x variables at equal times. We take x0 as the target time series, and the
first terms of the expansion are
which measures to what extent the remaining variables contribute to specifying the
future state of x0. This quantity can be expanded according to (10):

S(x0 | {Yk}k=1,...,n) − S(x0) = ∑i ΔS(x0)/ΔYi + ∑i>j Δ²S(x0)/(ΔYi ΔYj) + ··· + Δ^nS(x0)/(ΔY1···ΔYn).      (16)
A drawback of the expansion above is that it does not remove shared information due to common history and input signals; therefore we propose to condition on the past of x0, i.e. on Y0. To this aim we introduce the conditioning operator CY0, which adds Y0 to the conditioning set of each entropy term, and observe that CY0 and the variational operators (11) commute. It follows that we can condition the expansion (16) term by term, thus obtaining

S(x0 | {Yk}k=1,...,n, Y0) − S(x0|Y0) = −I(x0; {Yk}k=1,...,n | Y0)
= ∑i ΔS(x0|Y0)/ΔYi + ∑i>j Δ²S(x0|Y0)/(ΔYi ΔYj) + ··· + Δ^nS(x0|Y0)/(ΔY1···ΔYn).      (17)
We note that variations at every order in (17) are symmetrical under permutations
of the Yi . Moreover statistical independence among any of the Yi results in vanishing
contribution to that order: each nonvanishing term in this expansion accounts for an
irreducible set of variables providing information for the specification of the target.
The first order terms in the expansion are given by

A0i = ΔS(x0|Y0)/ΔYi = −I(x0; Yi | Y0),      (18)

and coincide with the bivariate transfer entropies i → 0 (times −1). The second order terms are

B0ij = I(x0; Yi | Y0) − I(x0; Yi | Yj, Y0),      (19)

whilst the third order terms are

C0ijk = I(x0; Yi | Yj, Y0) + I(x0; Yi | Yk, Y0) − I(x0; Yi | Y0) − I(x0; Yi | Yj, Yk, Y0).      (20)
An important property of (17) is that the sign of the nonvanishing terms reveals the informational character of the corresponding set of variables: a negative sign indicates that the group of variables contributes more information to the state of the target than the sum of its subgroups (synergy), while positive contributions correspond to redundancy.
Another important point that we address here is how to get a reliable estimate of the conditional mutual information from data. In this work we adopt the assumption of Gaussianity and use the exact expression that holds in this case [5], which reads as follows. Given multivariate Gaussian random variables X, W and Z, the conditioned mutual information is

I(X; W | Z) = (1/2) ln [ |Σ(X|Z)| / |Σ(X|W ⊕ Z)| ],      (21)

where |·| denotes the determinant, and the partial covariance matrix Σ(X|Z) = Σ(X) − Σ(X,Z) Σ(Z)⁻¹ Σ(Z,X) is defined in terms of the covariance matrix Σ(X) and the cross covariance matrix Σ(X,Z); the definition of Σ(X|W ⊕ Z) is analogous.
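A direct transcription of Eq. (21) into code might look as follows; this is a sketch under the Gaussian assumption, and the function names partial_cov and cond_mi are illustrative.

```python
import numpy as np

def partial_cov(C, ix, iz):
    """Sigma(X|Z) = Sigma(X) - Sigma(X,Z) Sigma(Z)^{-1} Sigma(Z,X) from a covariance C."""
    ix, iz = list(ix), list(iz)
    if not iz:
        return C[np.ix_(ix, ix)]
    Sxz = C[np.ix_(ix, iz)]
    return C[np.ix_(ix, ix)] - Sxz @ np.linalg.solve(C[np.ix_(iz, iz)], Sxz.T)

def cond_mi(data, ix, iw, iz):
    """I(X;W|Z) of Eq. (21) for jointly Gaussian variables; data is (variables x samples)."""
    C = np.cov(data)
    num = np.linalg.det(partial_cov(C, ix, iz))
    den = np.linalg.det(partial_cov(C, ix, list(iw) + list(iz)))
    return 0.5 * np.log(num / den)
```

With such a routine the terms of the expansion follow directly, e.g. A0i = −cond_mi(data, ix, iyi, iy0) and B0ij = cond_mi(data, ix, iyi, iy0) − cond_mi(data, ix, iyi, iyj + iy0) for suitable column index sets.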
obtained after a random reshuffling of the target time series: the surrogate test at 5%
confidence shows that a relevant fraction of bivariate interactions is statistically sig-
nificant. In figure (19) we report the distributions of the second order terms, both for
information flow and for instantaneous correlations: negative and positive terms are
present, i.e. both synergetic and redundant circuits of three variables are evidenced
by the proposed approach. Some of these interactions are statistically significant,
see figure (20).
In figure (21) we report the distribution of the third order terms for the infor-
mation flow which correspond to the target Posterior cingulate gyrus, a major node
within the default mode network (DMN) with high metabolic activity and dense
structural connectivity to widespread brain regions, which suggests it has a role as
a cortical hub. The region appears to be involved in internally directed thought, for
example, memory recollection. We compare the distribution with the corresponding
one for shuffled target; it appears that there are significant circuits of four variables,
involving Posterior cingulate gyrus, and most of them are redundant.
As another example, we consider electroencephalogram (EEG) data obtained at rest from 10 healthy subjects and described in the first section. In figure (22) we compare the distributions of A_i^0 and W_i^0. This figure shows that EEG data are also characterized by nontrivial causal connections. In figure (23) the distribution of the bivariate transfer entropies is compared with those obtained after a random reshuffling of the target time series: it shows that a remarkable amount of bivariate
thus constituting a realization of the network motif (a) in figure 1 of [25]. In figure 25, left, we depict, as a function of the coupling c, both the information storage at the node corresponding to the variable x and the information flow term {y, z} → x. In this case the three variables are redundant and a relation between information storage and information flow can be established. Figure 25, center and right, refers to similar dynamical systems of 3 and 4 variables, corresponding to the motifs (c) and (d), respectively, of figure 1 of [25]. These two cases correspond to synergy: still, the presence of these informational terms is connected to information storage in the small network. Summarizing, we have shown that the expansion of the transfer entropy is deeply connected with the expansion of the information storage developed in [25]; hence the search for redundant and synergetic multiplets of variables, sending information to each given target, will also reveal the mechanisms for information storage at that node.
Fig. 25 Information storage (squares) and information flow term {y, z} → x (crosses) for
three motifs described in [25], figure 1. Left: motif (a), redundant variables. Center: motif
(c), synergetic variables. Right: motif (d), synergetic variables.
5 Conclusions
The transfer entropy analysis describes the information flow pattern in complex systems in terms of an N × N matrix, N being the number of subcomponents, with each element quantifying the information flowing from one subsystem to another. The approaches described in the present chapter represent our attempts to deal with physical constraints (e.g., the limited capacity of nodes and the limited number of data samples) within this picture, and to go beyond the N × N description when the actual senders of information are network motifs rather than single nodes.
Concerning the physical constraints, we have shown that information flow patterns show a signature of the law of diminishing marginal returns, and we addressed the problem of partial conditioning on a limited subset of variables.
As far as the search for multiplets of correlated variables is concerned, we have proposed a formal expansion of the transfer entropy that puts in evidence irreducible sets of variables which provide information about the future state of each assigned target. The applications to real data sets show the effectiveness of the proposed methodology.
References
1. Angelini, L., de Tommaso, M., Marinazzo, D., Nitti, L., Pellicoro, M., Stramaglia, S.:
Redundant variables and Granger causality. Physical Review E 81(3), 037201 (2010)
2. Barabási, A., Albert, R.: Emergence of scaling in random networks. Science 286, 509–
512 (1999)
3. Barabási, A., Ravasz, E., Vicsek, T.: Deterministic scale-free networks. Physica A: Sta-
tistical Mechanics and its Applications 299, 559–564 (2001)
4. Barabási, A.-L.: Linked: The New Science of Networks. Perseus Books, New York (2002)
5. Barnett, L., Barrett, A., Seth, A.: Granger causality and transfer entropy are equivalent
for gaussian variables. Physical Review Letters 103, 238701 (2009)
6. Barrett, A., Barnett, L., Seth, A.K.: Multivariate Granger causality and generalized vari-
ance. Physical Review E 81(4), 041907 (2010)
7. Bettencourt, L.M.A., Stephens, G.J., Ham, M.I., Gross, G.W.: Functional structure of
cortical neuronal networks grown in vitro. Phys. Rev. E 75(2), 21915–21924 (2007)
29. Marinazzo, D., Liao, W., Pellicoro, M., Stramaglia, S.: Grouping time series by pairwise
measures of redundancy. Physics Letters A 374(39), 4040–4044 (2010)
30. Milo, R., Shen-Orr, S., Itzkovitz, S., Kashtan, N., Chklovskii, D., Alon, U.: Network
Motifs: Simple Building Blocks of Complex Networks. Science 298, 824–827 (2002)
31. Nolte, G., Ziehe, A., Nikulin, V., Schlögl, A., Krämer, N., Brismar, T., Müller, K.: Ro-
bustly estimating the flow direction of information in complex physical systems. Physical
Review Letters 100, 234101 (2008)
32. Papoulis, A.: Probability, Random Variables, and Stochastic Processes. McGraw-Hill,
New York (1985)
33. Porta, A., Catai, A.M., Takahashi, A.C.M., Magagnin, V., Bassani, T., Tobaldini, E.,
Montano, N.: Information Transfer through the Spontaneous Baroreflex in Healthy Hu-
mans. Methods of Information in Medicine 49, 506–510 (2010)
34. Roebroeck, A., Formisano, E., Goebel, R.: Mapping directed influence over the brain
using Granger causality and fMRI. NeuroImage 25(1), 230–242 (2005)
35. Salvador, R., Suckling, J., Coleman, M.R., Pickard, J.D., Menon, D., Bullmore, E.:
Neurophysiological Architecture of Functional Magnetic Resonance Images of Human
Brain. Cerebral cortex 15(9), 1332–1342 (2005)
36. Samuelson, P., Nordhaus, W.: Microeconomics. McGraw-Hill, Oklahoma City (2001)
37. Schneidman, E., Bialek, W., Berry II, M.J.: Synergy, redundancy, and independence in
population codes. J. Neuroscience 23, 11539–11553 (2003)
38. Schreiber, T.: Measuring information transfer. Physical Review Letters 85(2), 461 (2000)
39. Smirnov, D.A., Bezruchko, B.P.: Estimation of interaction strength and direction from
short and noisy time series. Phys. Rev. E 68, 046209–046218 (2003)
40. Tzourio-Mazoyer, N., Landeau, B., Papathanassiou, D., Crivello, F., Etard, O., Delcroix,
N., Mazoyer, B., Joliot, M.: Automated anatomical labeling of activations in SPM using
a macroscopic anatomical parcellation of the MNI MRI single-subject brain. NeuroIm-
age 15(1), 273–289 (2002)
41. Wiener, N.: The theory of prediction. In: Beckenbach, E. (ed.) Modern Mathematics for Engineers, vol. 1. McGraw-Hill, New York (1956)
42. Yeger-Lotem, E., Sattath, S., Kashtan, N., Itzkovitz, S., Milo, R., Pinter, R.J., Alon, U.,
Margalit, H.: Network motifs in integrated cellular networks of transcription regulation
and protein–protein interaction. Proc. Natl. Acad. Sci. U.S.A. 101, 5934–5939 (2004)
43. Yu, D., Righero, M., Kocarev, L.: Estimating topology of networks. Physical Review
Letters 97(18), 188701 (2006)
44. Zachary, W.: An information flow model for conflict and fission in small groups. J. An-
thropol. Res. 33(2), 452–473 (1977)
45. Zhou, Z., Chen, Y., Ding, M., Wright, P., Lu, Z., Liu, Y.: Analyzing brain networks
with PCA and conditional Granger causality. Human Brain Mapping 30(7), 2197–2206
(2009)
46. http://clopinet.com/causality/data/nolte/ (accessed July 6, 2012)
Function Follows Dynamics: State-Dependency
of Directed Functional Influences
Demian Battaglia
1 Introduction
Even before unveiling how neuronal activity represents information, it is crucial to
understand how this information, independently of the encoding used, is routed across the complex multi-scale circuits of the brain. Flexible exchange of informa-
tion lies at the core of brain function. A daunting amount of computations must be
performed in a way dependent on external context and internal brain states. But how
Demian Battaglia
Aix-Marseille University, Institute for Systems Neuroscience, INSERM UMR 1106,
27, Boulevard Jean Moulin, F-13005 Marseille and
Max Planck Institute for Dynamics and Selforganization and
Bernstein Center for Computational Neuroscience, Am Faßberg 17, D-37077 Göttingen
e-mail: demian.battaglia@univ-amu.fr
can information be rerouted “on demand”, given that anatomic inter-areal connec-
tions can be considered as fixed, on timescales relevant for behavior?
In systems neuroscience, a distinction is made between structural and directed
functional connectivities [32, 33]. Structural connectivity describes actual synaptic
connections. On the other hand, directed functional connectivity is estimated from
time-series of simultaneous neural recordings using causal analysis [20, 36, 41],
to quantify, beyond correlation, directed influences between brain areas. If the anatomic structure of brain circuits unavoidably constrains to some extent the functional interactions that these circuits can support (see e.g. [42]), it is not, however, sufficient to specify them fully. Indeed, a given structural network might give rise to
multiple possible collective dynamical states, and such different states could lead to
different information flow patterns. It has been suggested, for instance, that multi-
stability of neural circuits underlies switching between different perceptions or be-
haviors [21, 40, 48]. In this view, transitions between alternative attractors of the
neural dynamics would occur under the combined influence of structured “brain
noise” [47] and of the bias exerted by sensory or cognitive driving [16, 17, 18].
where the lag τ is an arbitrary temporal scale on which causal interactions are
probed. The causal influence TEx→y (τ ) of circuit element x on circuit element y
is then operatively defined as the functional:
TE_{x→y}(τ) = ∑ P_{Y|XY}(τ) log2 [ P_{Y|XY}(τ) / P_{Y|Y}(τ) ]   (1)
where the sum runs over all the three indices i, j and k of the transition matrices.
Higher Markov order descriptions of the time-series evolution can also be adopted for the modeling of the source and target time-series [52]. In general, the conditioning on the single past values X(t − τ) and Y(t − τ) appearing in the definition of the matrices P_{Y|XY}(τ) and P_{Y|Y}(τ) is replaced by conditioning on vectors of several past values Y_r^p = [Y(t − rτ), Y(t − (r+1)τ), . . . , Y(t − (p−1)τ), Y(t − pτ)] and X_s^q = [X(t − sτ), X(t − (s+1)τ), . . . , X(t − (q−1)τ), X(t − qτ)]. Here p and q correspond to the Markov orders taken for the target and source time-series Y(t) and X(t), respectively. The parameters r, s < p, q are typically set to r, s = 1, but might assume different values for specific applications (see later). A general Markov order transfer entropy TE_{x→y}(τ; r, s, p, q) can then be written straightforwardly.
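As a concrete illustration (our own sketch, not code from the studies discussed here), the past-history vectors Y_r^p and X_s^q can be assembled from sampled time series as follows; the function name and toy signals are assumptions for the example only.

import numpy as np

def delay_vectors(signal, lag_tau, first, last):
    # Stack the past values v(t - first*tau), ..., v(t - last*tau) for every time t
    # that has a complete past; rows are time points, columns are lags.
    n = len(signal)
    t0 = last * lag_tau                      # earliest t with a complete past
    return np.column_stack([signal[t0 - k * lag_tau : n - k * lag_tau]
                            for k in range(first, last + 1)])

# Example: Y_r^p with r = 1, p = 3 and X_s^q with s = 1, q = 2, lag tau = 2 samples.
y = np.sin(np.linspace(0, 20, 500))
x = np.cos(np.linspace(0, 20, 500))
Y_past = delay_vectors(y, lag_tau=2, first=1, last=3)
X_past = delay_vectors(x, lag_tau=2, first=1, last=2)
# Before building the transition matrices, the rows of Y_past and X_past (and the
# present values Y(t)) must still be aligned to a common set of time points t.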
More importantly, to characterize the dependency of directed functional interac-
tions on dynamical states, a further state conditioning is introduced. Let S(t) be a
vector describing the history of the entire system —i.e. not only the two considered
circuit elements x and y but the whole neural circuit to which they belong— over the
time-interval [t − T,t]. We define then a “state selection filter”, i.e. a set of time in-
stants C for which the system history S(t) satisfies some arbitrary set of constraints.
The definition of C is left on purpose very general and will have to be instantiated
depending on the specific concrete application. It is then possible to introduce an
(arbitrary Markov orders) state-conditioned Transfer Entropy:
TE^C_{x→y}(τ; r, s, p, q) = ∑ P_{Y|XY;C}(τ; r, s, p, q) log2 [ P_{Y|XY;C}(τ; r, s, p, q) / P_{Y|Y;C}(τ; r, s) ]   (2)
where the sum runs over all the possible values of Y, Y_r^p and X_s^q, and the transition probability matrices P_{Y|XY;C}(τ; r, s, p, q) = P[Y(t) | Y_r^p(t), X_s^q(t); t ∈ C] and P_{Y|Y;C}(τ; r, s) = P[Y(t) | Y_r^p(t); t ∈ C] are restrictedly sampled over time epochs in which the ongoing collective dynamics is compliant with the imposed constraints.
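In practice, the definition (2) can be turned into a plug-in estimator once the signals are discretized and the admissible time points are known. The sketch below is our own illustration (not the estimator used in the cited studies), restricted to r = s = p = q = 1, with the state-selection filter entering simply as a boolean mask over time points; variable names and toy data are illustrative assumptions.

import numpy as np
from collections import Counter

def state_conditioned_te(x, y, tau, mask, base=2):
    # Plug-in estimate of a state-conditioned TE from x to y for integer-valued
    # series, sampling only the time points t for which mask[t] is True (t in C).
    t = np.arange(tau, len(y))
    t = t[mask[t]]
    n = len(t)
    trip = Counter(zip(y[t], y[t - tau], x[t - tau]))   # (Y(t), Y(t-tau), X(t-tau))
    pair = Counter(zip(y[t], y[t - tau]))               # (Y(t), Y(t-tau))
    past = Counter(zip(y[t - tau], x[t - tau]))         # (Y(t-tau), X(t-tau))
    ypast = Counter(y[t - tau])                         # Y(t-tau)
    te = 0.0
    for (yt, yp, xp), c in trip.items():
        ratio = (c * ypast[yp]) / (pair[(yt, yp)] * past[(yp, xp)])
        te += (c / n) * np.log(ratio) / np.log(base)
    return te

# Example: y copies x three steps back (with some noise); the "state filter" keeps
# only epochs where a stand-in network-averaged signal g stays below a threshold.
rng = np.random.default_rng(1)
x = rng.integers(0, 2, 5000)
y = np.roll(x, 3) ^ (rng.random(5000) < 0.2).astype(int)
g = rng.random(5000)
print(state_conditioned_te(x, y, tau=3, mask=(g < 0.8)))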
Although such a general definition may appear hermetic, it becomes fairly natural
when specific constraints are taken. Simple constraints might be for instance based
on the dynamic range of the instantaneously sampled activity. A possible state se-
lection filter might therefore be: “The activity of every node of the network must be
below a given threshold value”. As a consequence, the overall sampled time-series
would be inspected, and time-epochs in which some network node has an activity
with an amplitude above the threshold level would be discarded and not sampled
for the evaluation of PY |XY ;C and PY |Y ;C . Other simple constraints might be defined
based on the spectral properties of the considered time-series. For instance, the state
selection filter could be: “The power in the theta range of frequencies of the aver-
age network activity must have been above a given threshold during the last 500
milliseconds at least”. In this way, only sufficiently long transients in which the
system displayed collectively a substantial theta oscillatory activity would be sam-
pled for the evaluation of PY |XY ;C and PY |Y ;C . Even more specifically, additional
Fig. 1 Bursting neuronal cultures in vitro and in silico. A Bright field image (left panel) of a
region of a neuronal culture at day in vitro 12, together with its corresponding fluorescence
image (right panel), integrated over 200 frames. Round objects are cell bodies of neurons.
B Examples of real (left) and simulated (right) calcium fluorescence time series for different
individual neurons. C Corresponding averages over the whole population of neurons. Syn-
chronous network bursts are clearly visible from these average traces. D Distribution of pop-
ulation averaged fluorescence amplitudes, for a real network (left) and a simulated one (right).
These distributions are strongly right skewed, with a right tail corresponding to the strong av-
erage fluorescence during bursting events. Figure adapted from [56]. (Copyright: Stetter et
al. 2012, Creative Commons licence).
originates from both fluctuations in the membrane potential and small noise cur-
rents in the pre-synaptic terminals [14]. To reproduce spontaneous firing, each neu-
ron is driven by statistically independent Poisson spike sources with a small rate, in
addition to recurrent synaptic inputs.
A key feature required for the reproduction of network bursting is the introduc-
tion of synaptic short-term depression, described through classic Tsodyks-Markram
equations [58], which take into account the limited availability of neurotransmit-
ter resources for synaptic release and the finite time needed to recharge a de-
pleted synaptic terminal. Dynamics comparable with experiments [23] are obtained by setting the synaptic weights of internal connections to give a network bursting rate of 0.10 ± 0.01 Hz. To achieve this target rate, an automated conductance adjustment procedure is used [56] for every considered topology.
Concerning the structural topologies used, connectivity is always sparse. The connection probability is fixed so as to yield an average degree of about 100 neighboring neurons, compatible with the average degrees reported previously for neuronal cultures in vitro of the mimicked age (days in vitro, DIV) and density [44, 54]. Networks with different degrees of clustering are generated by first randomly drawing connections and then rewiring them to reach a specified target degree of clustering (non-locally clustered ensemble). Another possibility to generate clustered networks is to adopt a connection probability law depending on spatial distance. Variations of the length-scale of connectivity then translate into more or less clustered networks (locally clustered ensemble).
Finally, surrogate calcium fluorescence signals are generated based on the spiking dynamics of the simulated cultured network. A common fluorescence model introduced in [60] gives rise to an initial fast increase of fluorescence after activation, followed by a decay with a slow time-constant τCa = 1 s. Such a model describes the intra-cellular concentration of calcium that is bound to the fluorescent probe. The concentration changes rapidly for each action potential locally elicited in a time bin corresponding to the acquisition frame. The net fluorescence level Fi associated with the activity of a neuron i is finally obtained by further feeding the calcium concentration into a saturating static non-linearity and by adding Gaussian-distributed noise. Example surrogate calcium fluorescence time-series, together with actual recordings for comparison, can be seen in Figure 1B.
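A rough sketch of such a surrogate fluorescence generator is shown below; the decay constant τCa = 1 s follows the description above, while the saturation constant, per-spike increment, frame duration and noise level are illustrative assumptions rather than the values used in [56].

import numpy as np

def surrogate_fluorescence(spike_counts, dt=0.02, tau_ca=1.0,
                           jump=50.0, k_sat=300.0, noise_sd=0.03, rng=None):
    # spike_counts: spikes of one neuron per acquisition frame of duration dt (s).
    rng = np.random.default_rng() if rng is None else rng
    ca = np.zeros(len(spike_counts))
    level = 0.0
    for i, n_spk in enumerate(spike_counts):
        level += -dt / tau_ca * level + jump * n_spk   # fast rise per spike, slow decay
        ca[i] = level
    fluo = ca / (ca + k_sat)                           # saturating static non-linearity
    return fluo + rng.normal(0.0, noise_sd, size=len(fluo))

# Example: Poisson spiking at 0.5 Hz sampled at 50 frames/s for 80 s.
rng = np.random.default_rng(2)
spikes = rng.poisson(0.5 * 0.02, size=int(80 / 0.02))
trace = surrogate_fluorescence(spikes, rng=rng)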
All the details and parameters of the neuronal and network models and calcium surrogate signals used (including the modeling of systematic artifacts like light scattering, for increased realism) can be found in the original publication by Olav Stetter et al. [56]. With the selected parameters, the simulated neuronal cultures display temporally irregular network bursting, as highlighted by Figure 1C, reporting fluorescence averaged over the entire network, and Figure 1D, showing the right-skewed distribution of average fluorescence, with its right tail associated with the high fluorescence during network bursts.
Fig. 2 Functional multiplicity in simulated cultures. A Three ranges of amplitude are highlighted in the distribution of network-averaged fluorescence G(t). Directed functional interactions associated with different dynamical regimes are assessed by conditioning the analysis to these specific amplitude ranges. Range I corresponds to low-amplitude noise, range II to fluorescence levels typical of sparse inter-burst activity, and range III to high average fluorescence during network bursts. B Visual representation of the reconstructed functional network topology in the three considered dynamical regimes (only the top 10% of TE-score links are shown). Qualitative topological differences in the three extracted networks are evident. C ROC analysis of the correspondence between inferred functional networks and the ground-truth structural network. Overlap is random for the noise-dominated range I, is marked for the inter-burst regime II, and is only partial for the bursting regime III.
three dynamical ranges I, II and III and their relation with structural connectivity are
shown, respectively in Figures 2B and 2C. For a fair comparison, an equal number
of samples is used to estimate TE in the three fluorescence ranges.
The lowest range I corresponds to a regime in which spiking-related signals are
buried in noise. Correspondingly, the associated functional connectivity is indistin-
guishable from random, as indicated by a ROC curve close to the diagonal. Note,
however, that a more extensive sampling (i.e. using all the available observation
samples) would show that limited information about structural topology is still con-
veyed by the activity in this regime [56].
At the other extreme, represented by range III —associated to fully developed
synchronous bursts— the functional connectivity has also a poor overlap with the
underlying structural network. The extracted functional networks are characterized
by the existence of hub nodes with an elevated out- and in-degree. The spatio-
temporal organization of bursting can be described in terms of these functional
connectivity hubs, since nodes within the neighborhood of the same functional hub experience a stronger mutual synchronization than arbitrary pairs of nodes across
the network [56]. In particular, figure 2B displays three visually-evident communi-
ties of “bursting-together” neurons.
The best agreement between functional and excitatory structural connectivity is
obtained for the middle range II, corresponding to above base-line noise activity
during inter-bursts epochs and the early building-up phases of synchronous bursts.
Thus, the retrieved TE-based functional networks confirm the intuitive expecta-
tions outlined in the previous section. The state-dependency of functional connec-
tivity is not limited to synthetic data. Very similar patterns of state-dependency are
observed also in real data from neuronal cultures. In particular, in both simulated
and real cultures, the functional connectivity associated to the buildup of bursts dis-
plays a stronger clustering level than during inter-burst periods [56].
The existence of such different topologies of functional interactions, stemming from different dynamical ranges of the same structural network, constitutes a perfect example of the notion of functional multiplicity outlined in the introduction. It is certainly possible to define ranges which are “right”, i.e. which lead to good structural network reconstruction, an important point for practical applications in connectomics.
However, this statement should not be over-interpreted to claim that the directed
functional connectivity inferred in a regime like the one associated to range III is
“wrong”. On the contrary, this functional connectivity is correctly capturing the
topology of causal influences in such a collective state, in which the firing of a sin-
gle neuron can trigger the firing of a whole community of nodes.
[Figure 3: A ROC analysis (true positives fraction vs. false positives fraction) of structural reconstruction, for state conditioning only and for conditioning plus zero-lag TE. B Reconstructed functional clustering vs. ground-truth structural clustering, compared with a cross-correlation analysis.]
thus defining the best TE-conditioning range for reconstruction of structural con-
nectivity of the culture. This range should exclude regimes of highly synchronized
activity (like range III) while keeping most of data points for the analysis. More de-
tails are provided in the original study by Stetter et al. [56], showing that very good
reconstruction performance is achieved on simulated data, by implementing a state-
selection filter with optimized threshold Gtop close to the upper limit of Range II
and no lower threshold Gbottom . ROCs corresponding to this choice can be seen in
Figure 3A, for the non-locally clustered ensemble. Good reconstruction is possible
for a vast spectrum of topologies, as denoted by a good correlation between ground-
truth structural clustering coefficient and reconstructed functional clustering level.
Note that a cross-correlation analysis performed over the same state-conditioned set of simulated observations would systematically overestimate the level of clustering (Figure 3B, cf. [56]). Similar results would be obtained for the locally clustered en-
semble, for which the overall reconstruction performance is poorer but an excellent
correlation still exists between the ground-truth and the reconstructed length-scales
of connectivity. Finally, we mention that the just described reconstruction approach
Fig. 4 Structural degeneracy in simulated cultures. A Examples of spike raster plots for
three simulated cultures with different structural clustering coefficients (non-local clustering
ensemble, structural clustering coefficient equal, respectively from left to right, to 0.1, 0.3 and
0.7). B As revealed by histograms of inter-burst intervals, the temporally irregular network bursting dynamics of these strongly different cultures are very similar; vertical lines indicate the mean of each distribution. C Panels below the IBI distributions illustrate through graphical cartoons the amount of clustering in the actual structural network and in the directed functional network reconstructed from fluorescence range III (bursting regime, cf. Figure 2). To different degrees of structural clustering correspond equivalent, elevated levels of functional clustering, due to common bursting statistics. Figure adapted from [56]. (Copyright:
Stetter et al. 2012, Creative Commons licence).
and firing rates (see Stetter et al. 2012 [56] for details on the procedure and on the models). The simulated spiking dynamics of the three cultures in silico are shown in the raster plots of Figure 4A. These three networks indeed display very similar bursting dynamics, not only in terms of the mean bursting rate, but also in terms of the entire inter-burst interval (IBI) distribution, shown in Figure 4B.
Based on these bursting dynamics, directed functional connectivity is extracted for the three differently clustered structural networks. TE is state-conditioned for the three networks on the same dynamic range, matching range III in Figure 2, i.e. the fully-developed burst regime is selected. As a result, the functional networks extracted in this range always have an elevated clustering level (close to 0.7), in contrast with the actual structural clustering, which varies in a broad range between 0.1 and 0.5 (see Figure 4C).
The illustrative simulations of Figure 4 thus genuinely confirm that the relation between network dynamics and network structure is not trivially “one-to-one”, manifesting the phenomenon of structural degeneracy outlined in the introduction.
The in silico approach used here also allows one to investigate how information encoded at the level of the detailed spiking activity of thousands of neurons is routed between the modeled areas. It then becomes possible to study how the specific routing modality depends on the active directed functional connectivity2.
The spiking of individual neurons can be very irregular even when the collective
rate oscillations are regular (cf. Figure 5B). Therefore, even local rhythms in which the firing rate is modulated in a very stereotyped way might correspond to irregular (highly entropic) sequences of codewords encoding information in a digital-like fashion (e.g. by the firing, “1”, or missed firing, “0”, of specific spikes at a given cycle [57]). In such a framework, oscillations would not directly represent information, but would rather act as a carrier of “data-packets” associated with spike
patterns of synchronously active cell assemblies. By quantifying through a Mutual
Information (MI) analysis the maximum amount of information encoded potentially
in the spiking activity of a local area and by evaluating how much of this informa-
tion is actually transferred to distant interconnected areas, it is possible to demon-
strate that different directed functional connectivity configurations lead to different
modalities of information routing. Therefore, the pathways along which information
propagates can be reconfigured within the time of a few reference oscillation cycles,
by switching to a different effective connectivity motif, for instance by means of a
spatially and temporally precise optogenetic stimulation [4, 67].
model can be found in [4]. For simplicity, only fully connected structural motifs
involving a few areas (K = 2, 3) are studied. Note, however, that the approach used here
might be extended to other structural motifs [55] or, in perspective, to large-scale
thalamocortical networks [35, 42].
In the more realistic case in which coherent oscillations and phase-locking arise only
transiently [59] —unlike in the model of [4] in which oscillations are stationary and
stable— additional constraints might be added, guaranteeing that the instantaneous
power of LFP time-series integrated over specified frequency band (e.g. the gamma
band) exceeds a given minimum threshold.
Since the sampling rate of the electrophysiological recordings simulated by the computational model is elevated, there is no need to incorporate zero-lag causal
interactions. Therefore, the standard settings (r = s = 0, p = q = 1) are used.
Fig. 6 Functional multiplicity in motifs of oscillating areas. Dynamical states and resulting directed functional connectivities, generated by structural motifs of K = 2, 3 mutually and symmetrically connected brain areas. A–C Simulated “LFPs” and spike trains of the two populations of a K = 2 motif for three different strengths of the symmetric inter-areal coupling, leading to phase-locked states with different degrees of periodicity. D–F Transfer entropies for the two possible directions of functional interaction, associated with the dynamic states in panels A–C. A grey band indicates the threshold for statistical significance. Below the TE plots: graphic depiction of the functional interactions between the two areas, captured by state-conditioned Transfer Entropy. Only arrows corresponding to significant causal interactions are shown. Arrow thickness reflects TE strength. G Analogous directed functional connectivity motifs generated by a K = 3 symmetric structural motif. The multiplier factors denote multistability between motifs with the same topology but different directions (functional motif families). Figure adapted from [4]. (Copyright: Battaglia et al. 2012, Creative Commons licence).
relations, only average TEs can be evaluated, yielding equally large TE values for
all pairwise directed interactions (Figure 6F, mutual driving).
Analogous unidirectional, leaky or mutual driving motifs of functional interac-
tion can be found in larger motifs with K = 3 areas, as shown by Figure 6G [4].
[Figure 7: panels A–C; axes include the phase of the perturbation, the switching frequency between functional motifs, and the normalized mutual information MI/H.]
due to the overall structural symmetry, configurations in which the areas exchange
their leader or laggard roles must also be stable, i.e. the complete set of dynamical
attractors continues to be symmetric, even if individual attractors are asymmetric.
Exploiting multi-stability, fast reconfiguration of directed functional influences can be obtained just by inducing switching between alternative multi-stable attractors, associated with functional motifs in the same family but with different directionality. As elaborated in [4], an efficient way to trigger “jumps” between phase-locked configurations is to perturb locally the dynamics of ongoing oscillations with precisely phased stimulation pulses. Such an external perturbation can be provided for instance by optogenetic stimulation, if a sufficient fraction of cells in the target area has been transduced with light-activated conductances. Simulation studies [67] suggest that even transduction rates as low as 5–10% might be sufficient to optogenetically induce functional motif switching, if the pulse perturbations are properly phased with respect to the ongoing rhythm (Figure 7A), as predicted also by a mean-
field theory [4]. But what is the impact of functional motif switching on the actual
flow of information encoded at the microscopic level of detailed spiking patterns?
In the studied model, rate fluctuations can encode only a limited amount of information, because firing rate oscillations are stereotyped and amplitude fluctuations are small with respect to the average excursion between peaks and troughs of the oscillation. Higher amounts of information can be carried by spiking patterns, since the spiking activity of single neurons during sparsely synchronized oscillations remains very irregular and is thus associated with a large entropy. To quantify information
exchanged by interacting areas, a reference code is considered, in which a “1” or a
“0” symbol denote respectively firing or missed firing of a spike by a specific neu-
ron at each given oscillation cycle. Based on such an encoding, the neural activity
of a group of neurons is mapped to digital-like streams, “clocked” by the network
rhythm, in which a different “word” is broadcast at each oscillation cycle3.
Focusing on a fully symmetric structural motif of K = 2 areas, the network
is modified by embedding into it transmission lines (TLs), i.e. mono-directional
fiber tracts dedicated to inter-areal communication. In more detail, selected sub-
populations of source excitatory neurons within each area establish synaptic con-
tacts with matching target excitatory or inhibitory cells in the other area, in a one-
to-one cell arrangement. Synapses in a TL are strengthened with respect to usual
synapses, in the attempt to enhance communication capacity, but not too much, in
order not to alter phase-relations between the collective oscillations of the two areas
(for more details, see [4]). The information transmission efficiency of each TL is
assessed —separately for different effective motifs— by quantifying Mutual Infor-
mation (MI) [57] between the “digitized” spike trains of pairs of source and target
cells. Since a source cell fires on average every five or six oscillation cycles, the firing of a single neuron conveys H ≈ 0.7 bits of information per oscillation cycle. MI normalized by the source entropy H indicates the fraction of this information
reaching the target cell. Due to the possibility of generating very long simulated
3 Such a code is here introduced uniquely as a theoretical construct grounding a rigorous
analysis of information transmission, without claim that it is actually being used in the
brain.
functions, could be nothing else than the design of structural networks acting as
emergent “functional collectivities” [27] with suitable dynamical regimes.
An advantageous feature allowing a dynamical network to transit fluently be-
tween qualitatively different dynamical regimes would be criticality [13]. Switching would indeed be highly facilitated for a system tuned to be close to the edge between multiple dynamic attractors. This is arguably the case for neuronal cultures, which undergo spontaneous switching to bursting due to their proximity to a rate instability (compensated for by synaptic resource depletion). Beyond that, networks at the edge of synchrony might undergo noise-induced switching between a baseline, essentially asynchronous activity and phase-locked transients with elevated local and inter-areal oscillatory coherence. In networks critically tuned to be at the edge of synchrony, specific patterns of directed functional interactions associated with a latent phase-locked attractor, which would become manifest only for fully developed synchrony, might be “switched on” just through the application of weak biasing inputs which stabilize its metastable strong-noise “ghost” [19].
Acknowledgements. The framework here reviewed would not have been developed without
the help of colleagues and students. Credit for these and other related results must be shared
with (in alphabetic order): Ahmed El Hady, Theo Geisel, Christoph Kirst, Erik Martens, An-
dreas Neef, Agostina Palmigiano, Javier Orlandi, Jordi Soriano, Olav Stetter, Marc Timme,
Annette Witt, Fred Wolf. I am also grateful to Dante Chialvo, Gustavo Deco and Viktor Jirsa
for inspiring discussions.
References
1. de Arcangelis, L., Perrone-Capano, C., Herrmann, H.J.: Self-organized criticality model
for brain plasticity. Phys. Rev. Lett. 96, 028107 (2006)
2. Battaglia, D., Brunel, N., Hansel, D.: Temporal decorrelation of collective oscillations
in neural networks with local inhibition and long-range excitation. Phys. Rev. Lett. 99,
238106 (2007)
3. Battaglia, D., Hansel, D.: Synchronous chaos and broad band gamma rhythm in a mini-
mal multi-layer model of primary visual cortex. PLoS Comp. Biol. 7, e1002176 (2011)
4. Battaglia, D., Witt, A., Wolf, F., Geisel, T.: Dynamic effective connectivity of inter-areal
brain circuits. PLoS Comp. Biol. 8, e1002438 (2012)
5. Beggs, J., Plenz, D.: Neuronal avalanches in neocortical circuits. Journal of Neuro-
science 23, 11167–11177 (2003)
6. Bosman, C.A., Schoffelen, J.-M., Brunet, N., Oostenveld, R., Bastos, A.M., et al.: At-
tentional stimulus selection through selective synchronization between monkey visual
areas. Neuron 75, 875–888 (2012)
7. Bressler, S.L., Seth, A.K.: Wiener-Granger causality: a well established methodology.
NeuroImage 58, 323–329 (2011)
8. Brovelli, A., Ding, M., Ledberg, A., Chen, Y., Nakamura, R., Bressler, S.L.: Beta oscil-
lations in a large-scale sensorimotor cortical network: directional influences revealed by
Granger causality. Proc. Natl. Acad. Sci. USA 101, 9849–9854 (2004)
9. Brunel, N., Wang, X.J.: What determines the frequency of fast network oscillations with
irregular neural discharges? J. Neurophysiol. 90, 415–430 (2003)
10. Brunel, N., Hansel, D.: How noise affects the synchronization properties of recurrent
networks of inhibitory neurons. Neural Comput. 18, 1066–1110 (2006)
11. Brunel, N., Hakim, V.: Sparsely synchronized neuronal oscillations. Chaos 18, 015113
(2008)
12. Buehlmann, A., Deco, G.: Optimal information transfer in the cortex through synchro-
nization. PLoS Comput. Biol. 6(9), e1000934 (2010)
13. Chialvo, D.R.: Emergent complex neural dynamics. Nat. Phys. 6, 744–750 (2010)
14. Cohen, E., Ivenshitz, M., Amor-Baroukh, V., Greenberger, V., Segal, M.: Determinants
of spontaneous activity in networks of cultured hippocampus. Brain Res. 1235, 21–30
(2008)
15. Dayan, P., Abbott, L.: Theoretical Neuroscience: Computational and Mathematical Mod-
eling of Neural Systems. MIT Press, Cambridge (2001)
16. Deco, G., Romo, R.: The role of fluctuations in perception. Trends Neurosci. 31, 591–
598 (2008)
17. Deco, G., Rolls, E.T., Romo, R.: Stochastic dynamics as a principle of brain function.
Prog. Neurobiol. 88, 1–16 (2009)
18. Deco, G., Jirsa, V.K., McIntosh, R.: Emerging concepts for the dynamical organization
of resting-state activity in the brain. Nat. Rev. Neurosci. 12, 43–56 (2011)
19. Deco, G., Jirsa, V.K.: Ongoing cortical activity at rest: criticality, multistability, and ghost
attractors. Journal of Neuroscience 32, 3366–3375 (2012)
20. Ding, M., Chen, Y., Bressler, S.L.: Granger causality: basic theory and application to
neuroscience. In: Schelter, B., Winterhalder, M., Timmer, J. (eds.) Handbook of Time
Series Analysis. Wiley, New York (2006)
21. Ditzinger, T., Haken, H.: Oscillations in the perception of ambiguous patterns: a model
based on synergetics. Biol. Cybern. 61, 279–287 (1989)
22. Eckhorn, R., Bauer, R., Jordan, W., Brosch, M., Kruse, W., Munk, M., Reitboeck, H.J.:
Coherent oscillations: a mechanism of feature linking in the visual cortex? Multiple elec-
trode and correlation analyses in the cat. Biol. Cybern. 60, 121–130 (1988)
23. Eckmann, J.P., Feinerman, O., Gruendlinger, L., Moses, E., Soriano, J., et al.: The
physics of living neural networks. Physics Reports 449, 54–76 (2007)
24. Engel, A., Fries, P., Singer, W.: Dynamic predictions: oscillations and synchrony in top-
down processing. Nat. Rev. Neurosci. 2, 704–716 (2001)
25. Eytan, D., Marom, S.: Dynamics and effective topology underlying synchronization in
networks of cortical neurons. J. Neurosci. 26, 8465–8476 (2006)
26. Fox, M.D., Snyder, A.Z., Vincent, J.L., Corbetta, M., Van Essen, D.C., et al.: The human
brain is intrinsically organized into dynamic, anticorrelated functional networks. Proc.
Natl. Acad. Sci. USA 102, 9673–9678 (2005)
27. Fraiman, D., Balenzuela, P., Foss, J., Chialvo, D.R.: Ising-like dynamics in large-scale
functional brain networks. Phys. Rev. E Stat. Nonlin. Soft Matter Phys. 79, 061922
(2009)
28. Freyer, F., Roberts, J.A., Becker, R., Robinson, P.A., Ritter, P., et al.: Biophysical mech-
anisms of multistability in resting-state cortical rhythms. J. Neurosci. 31, 6353–6361
(2011)
29. Fries, P.: A mechanism for cognitive dynamics: neuronal communication through neu-
ronal coherence. Trends Cogn. Sci. 9, 474–480 (2005)
30. Fries, P., Nikolić, D., Singer, W.: The gamma cycle. Trends Neurosci. 30, 309–316
(2007)
31. Fries, P., Womelsdorf, T., Oostenveld, R., Desimone, R.: The effects of visual stimulation
and selective visual attention on rhythmic neuronal synchronization in macaque area V4.
J. Neurosci. 28, 4823–4835 (2008)
32. Friston, K.J.: Functional and Effective Connectivity in Neuroimaging: A Synthesis. Hu-
man Brain Mapping 2, 56–78 (1994)
33. Friston, K.J.: Functional and Effective Connectivity: A Review. Brain Connectivity 1,
13–36 (2011)
34. Garofalo, M., Nieus, T., Massobrio, P., Martinoia, S.: Evaluation of the performance
of information theory-based methods and cross-correlation to estimate the functional
connectivity in cortical networks. PLoS One 4, e6482 (2009)
35. Ghosh, A., Rho, Y., McIntosh, A.R., Kötter, R., Jirsa, V.K.: Noise during rest enables the exploration of the brain’s dynamic repertoire. PLoS Comp. Biol. 4, e1000196 (2008)
36. Gourévitch, B., Bouquin-Jeannès, R.L., Faucon, G.: Linear and nonlinear causality be-
tween signals: methods, examples and neurophysiological applications. Biol. Cybern. 95,
349–369 (2006)
37. Granger, C.W.J.: Investigating causal relations by econometric models and cross-spectral
methods. Econometrica 37, 424–438 (1969)
38. Gregoriou, G.G., Gotts, S.J., Zhou, H., Desimone, R.: High-frequency, long-range cou-
pling between prefrontal and visual cortex during attention. Science 324, 1207–1210
(2009)
39. Grienberger, C., Konnerth, A.: Imaging Calcium in Neurons. Neuron 73, 862–885 (2012)
40. Haken, H., Kelso, J.A., Bunz, H.: A theoretical model of phase transitions in human hand
movements. Biol. Cybern. 51, 347–356 (1985)
41. Hlaváčková-Schindler, K., Paluš, M., Vejmelka, M., Bhattacharya, J.: Causality detection
based on information-theoretic approaches in time series analysis. Phys. Rep. 441, 1–46
(2007)
42. Honey, C.J., Kötter, R., Breakspear, M., Sporns, O.: Network structure of cerebral cortex
shapes functional connectivity on multiple time scales. Proc. Natl. Acad. Sci. USA 104,
10240–10245 (2007)
43. Ito, S., Hansen, M.E., Heiland, R., Lumsdaine, A., Litke, A.M., Beggs, J.M.: Extending
transfer entropy improves identification of effective connectivity in a spiking cortical
network model. PLoS One 6, e27431 (2011)
44. Jacobi, S., Soriano, J., Segal, M., Moses, E.: BDNF and NT-3 increase excitatory input
connectivity in rat hippocampal cultures. Eur. J. Neurosci. 30, 998–1010 (2009)
45. Levina, A., Herrmann, J.M., Geisel, T.: Dynamical synapses causing self-organized crit-
icality in neural networks. Nat. Phys. 3, 857–860 (2007)
46. Levina, A., Herrmann, J.M., Geisel, T.: Phase Transitions towards Criticality in a Neural
System with Adaptive Interactions. Phys. Rev. Lett. 102, 118110 (2009)
47. Misic, B., Mills, T., Taylor, M.J., McIntosh, A.R.: Brain noise is task-dependent and
region specific. J. Neurophysiol. 104, 2667–2676 (2010)
48. Moreno-Bote, R., Rinzel, J., Rubin, N.: Noise-induced alternations in an attractor net-
work model of perceptual bistability. J. Neurophysiol. 98, 1125–1139 (2007)
49. Orlandi, J., Stetter, O., Soriano, J., Geisel, T., Battaglia, D.: Transfer Entropy reconstruc-
tion and labeling of neuronal connections from simulated calcium imaging. PLoS One
(in press, 2014)
50. Politis, D.N., Romano, J.P.: Limit theorems for weakly dependent Hilbert space valued
random variables with applications to the stationary bootstrap. Statistica Sinica 4, 461–
476 (1994)
51. Salazar, R.F., Dotson, N.M., Bressler, S.L., Gray, C.M.: Content-specific fronto-parietal
synchronization during visual working memory. Science 338, 1097–1100 (2012)
52. Schreiber, T.: Measuring information transfer. Phys. Rev. Lett. 85, 461–464 (2000)
53. Seamans, J.K., Yang, C.R.: The principal features and mechanisms of dopamine modu-
lation in the prefrontal cortex. Prog. Neurobiol. 74, 1–58 (2004)
54. Soriano, J., Martinez, M.R., Tlusty, T., Moses, E.: Development of input connections in
neural cultures. Proc. Natl. Acad. Sci. USA 105, 13758–13763 (2008)
55. Sporns, O., Kötter, R.: Motifs in brain networks. PLoS Biol. 2, e369 (2004)
56. Stetter, O., Battaglia, D., Soriano, J., Geisel, T.: Model-free reconstruction of excitatory
neuronal connectivity from calcium imaging signals. PLoS Comp. Biol. 8, e1002653
(2012)
57. Strong, S.P., Koberle, R., de Ruyter van Steveninck, R.R., Bialek, W.: Entropy and infor-
mation in neural spike trains. Phys. Rev. Lett. 80, 197–200 (1998)
58. Tsodyks, M., Uziel, A., Markram, H.: Synchrony generation in recurrent networks with
frequency-dependent synapses. J. Neurosci. 20, 1–5 (2000)
59. Varela, F., Lachaux, J.P., Rodriguez, E., Martinerie, J.: The brainweb: Phase synchro-
nization and large-scale integration. Nat. Rev. Neurosci. 2, 229–239 (2001)
60. Vogelstein, J.T., Watson, B.O., Packer, A.M., Yuste, R., Jedynak, B., et al.: Spike in-
ference from calcium imaging using sequential Monte Carlo methods. Biophys. J. 97,
636–655 (2009)
61. Volgushev, M., Chistiakova, M., Singer, W.: Modification of discharge patterns of neo-
cortical neurons by induced oscillations of the membrane potential. Neuroscience 83,
15–25 (1998)
62. Wagenaar, D.A., Pine, J., Potter, S.M.: An extremely rich repertoire of bursting patterns
during the development of cortical cultures. BMC Neuroscience 7, 1–18 (2006)
63. Wang, X.J., Buzsáki, G.: Gamma oscillation by synaptic inhibition in a hippocampal
interneuronal network model. J. Neurosci. 16, 6402–6413 (1996)
64. Wang, X.J.: Neurophysiological and computational principles of cortical rhythms in cog-
nition. Physiol. Rev. 90, 1195–1268 (2010)
65. Whittington, M.A., Traub, R.D., Kopell, N., Ermentrout, B., Buhl, E.H.: Inhibition-based
rhythms: experimental and mathematical observations on network dynamics. Int. J. Psy-
chophysiol. 38, 315–336 (2000)
66. Wiener, N.: The theory of prediction. In: Beckenbach, E. (ed.) Modern Mathematics for
Engineers. McGraw-Hill, New York (1956)
67. Witt, A., Palmigiano, A., Neef, A., El Hady, A., Wolf, F., Battaglia, D.: Controlling
oscillation phase through precisely timed closed-loop optogenetic stimulation: a compu-
tational study. Front Neural Circuits 7, 49 (2013)
68. Womelsdorf, T., Lima, B., Vinck, M., Oostenveld, R., Singer, W., et al.: Orientation se-
lectivity and noise correlation in awake monkey area V1 are modulated by the gamma
cycle. Proc. Natl. Acad. Sci. USA 109, 4302–4307 (2012)
69. Yizhar, O., Fenno, L.E., Davidson, T.J., Mogri, M., Deisseroth, K.: Optogenetics in neu-
ral systems. Neuron 71, 9–34 (2011)
On Complexity and Phase Effects
in Reconstructing the Directionality
of Coupling in Non-linear Systems
Abstract. From the theoretical point of view, brain signals measured with electroencephalography (EEG) or magnetoencephalography (MEG) can be described as the manifestation of coupled nonlinear systems with time delays in coupling. From the empirical point of view, to understand how information is processed in the brain, there is a need to characterize the information flow in a network of spatially distinct brain areas. Tools for reconstructing the directionality of coupling, which can be formalized as Granger causality, provide a framework for gaining insight into the functional organization of brain networks. In turn, it is not completely understood what kinds of effects are captured by causal statistics. In the context of coupled non-linear oscillating systems with a time delay in coupling, we consider two effects that can contribute to the estimation of causality. First, we explore the problem of the ambiguity of phase delays observed between the dynamics of the driver and the response, and its effect on linear, spectral and information-theoretic statistics. Second, we show that the directionality of coupling can be understood in terms of the differences in signal complexity between the driver and the response.
1 Introduction
Rhythmic activity between neuronal ensembles is a widely observed phenomenon
in the brain [2]. The macroscopic oscillations can be detected with measurements of
Vasily A. Vakorin
Neurosciences & Mental Health, The Hospital for Sick Children, Toronto, Canada
e-mail: vasenka@gmail.com
Olga Krakovska
Department of Chemistry, York University, Toronto, Canada
Anthony R. McIntosh
Rotman Research Institute, Baycrest Centre and Department of Psychology,
University of Toronto, Toronto, Canada
show that, given the same directionality of coupling (as specified by the underlying
model), we can observe either phase delay or phase lead of the driver with respect
to the response. In turn, this phase difference affects the causal statistics, potentially
leading to spurious results. In the second part, we explore another mechanism that
can contribute to the causality estimation. Specifically, in spite of the confound-
ing effects of phase delays, the inference of the directionality of coupling may rely
on the differences in the complexity (information content) between the driver and
response. Intuitively, if the information is transferred from one system to another,
then the dynamics of the receiving system would reflect both its own complexity and
that of the sending system. Thus, the observed causality would depend on which of
the two effects, phase-related or complexity-related, would be stronger in a specific
situation.
x1(t) = ∑_{j=1}^{p} a11(j) x1(t − j) + ∑_{j=1}^{p} a12(j) x2(t − j) + ε1(t)
x2(t) = ∑_{j=1}^{p} a21(j) x1(t − j) + ∑_{j=1}^{p} a22(j) x2(t − j) + ε2(t)   (2)
where an optimal order of the model, the parameter p, can be estimated, for exam-
ple, according to Bayesian information criterion [27], and ε1 (t) and ε2 (t) are the
prediction errors for each time series. According to [11], if the variance of ε2 (t) is
reduced by including the terms a21 ( j) in the second equation of (2), compared to
keeping a21 ( j) = 0 for all j, then x1 (t) is thought to be causing x2 (t).
Formally, Granger causality F1→2 from x1 (t) to x2 (t) is quantified as an enhance-
ment of predictive power and defined as
F_{1→2} = ln [ var(ε2^{(21)}) / var(ε2) ] ,   (3)
where var(ε2^{(21)}) is the variance of ε2(t) derived from a model with a21(j) = 0 for all j, and var(ε2) is the variance of ε2(t) derived from the full model (2).
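A minimal least-squares sketch of this construction is given below; it is our own illustration (the chapter does not prescribe an implementation), with the model order p assumed given rather than selected by the Bayesian information criterion, and toy data chosen so that x1 drives x2.

import numpy as np

def ar_residual_variance(target, predictors, p):
    # Regress target(t) on p past values of each predictor series (plus an intercept)
    # and return the variance of the residuals.
    n = len(target)
    cols = [pred[p - j : n - j] for pred in predictors for j in range(1, p + 1)]
    X = np.column_stack(cols + [np.ones(n - p)])
    y = target[p:]
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.var(y - X @ beta)

def granger_f(x1, x2, p):
    # F_{1->2} = ln( var(eps2, restricted) / var(eps2, full) ), Eq. (3).
    var_restricted = ar_residual_variance(x2, [x2], p)        # a21(j) = 0 for all j
    var_full = ar_residual_variance(x2, [x1, x2], p)
    return np.log(var_restricted / var_full)

# Toy example: x2 is driven by the past of x1, so F_{1->2} should exceed F_{2->1}.
rng = np.random.default_rng(3)
x1 = rng.normal(size=3000)
x2 = np.zeros(3000)
for t in range(2, 3000):
    x2[t] = 0.4 * x2[t - 1] + 0.5 * x1[t - 2] + 0.1 * rng.normal()
print(granger_f(x1, x2, p=3), granger_f(x2, x1, p=3))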
Two extensions of bivariate Granger causality are proposed in the literature: spec-
tral and non-linear. The spectral version of Granger causality [25] is based on the
Fourier transform of autoregressive models:
( A11(f)  A12(f) ) ( X1(f) )   ( E1(f) )
( A21(f)  A22(f) ) ( X2(f) ) = ( E2(f) ) ,   (4)
Similar to (3), the spectral Granger causality G1→2 ( f ) from x1 (t) to x2 (t) is defined
as a function of the frequency f , and can be expressed in terms of the frequency-
specific covariance matrix of the residuals and the transfer function H( f ). More
details on the spectral causality can be found in [15].
A non-linear version of Granger causality in the time domain can be constructed using tools derived from information theory. Under the information-theoretic
approach, we do not need to explicitly specify a model of signals and their interac-
tions. Instead, the transfer of information from the past of one process to the future
of another process can be quantified in terms of individual and joint entropies, which
essentially measure the variability of the observed signals or the amount of infor-
mation contained in them.
Non-linear Granger causality I_{1→2} is thus expressed as a transfer of information from one signal to another, and can be quantified as the conditional mutual information I(x2^δ, x1 | x2) between x2^δ, the future of x2, and the past of x1 given the past of x2. It can be estimated in terms of the individual H(·) and joint entropies H(·,·) and H(·,·,·) of the processes x1, x2, and x2^δ as follows:
I(x2^δ, x1 | x2) = H(x2^δ, x2) + H(x1, x2) − H(x2) − H(x2^δ, x1, x2) ,   (6)
where the time lag δ between the future and the past of a signal is typically measured in multiples of the sampling interval. It can be shown that, under certain conditions, I(x2^δ, x1 | x2) is equivalent to the measure called transfer entropy [26, 19].
There are many ways to estimate the entropy of a signal. One approach is
based on an assumption that the observed time series are realizations of non-
linear dynamic systems. For example, the model (1) is a combination of two three-
dimensional systems, but we assume that only one dimension is observed (signals
x1 (t) and x2 (t)).
In this case, the dynamics in the multi-dimensional state space of the underlying
model should be reconstructed from a time series of observations. This can be done
with time delay embedding, wherein the time series x1(t) and x2(t) are converted to sequences of vectors in a multidimensional space:
X1(t) = [x1(t), x1(t − τ1), . . . , x1(t − (d1 − 1)τ1)] ,
X2(t) = [x2(t), x2(t − τ2), . . . , x2(t − (d2 − 1)τ2)] ,
where d1 and d2 are embedding dimensions, and τ1 and τ2 are embedding delays measured in multiples of the sampling interval. Note that the ultimate goal is not to reconstruct an orbit in the state space that is closest to the true one; rather, some invariants of a dynamical system, such as dimensions and entropy, can be determined if the embedding dimension is sufficiently high [31]. We estimate the individual and joint entropies in (6) by computing the corresponding correlation integrals, as proposed by [22], and tested using linear and non-linear models [3, 10, 33].
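The following is a crude, fixed-radius illustration of this route (our own sketch, not the estimator of [22]): each individual or joint entropy is approximated by minus the logarithm of a correlation sum under the maximum norm, and the estimates are combined as in (6). The radius, embedding parameters and toy data are illustrative assumptions, and the same embedding is used for both signals for brevity.

import numpy as np

def embed(x, dim, lag):
    # Time-delay embedding: rows are [x(t), x(t-lag), ..., x(t-(dim-1)*lag)].
    t0 = (dim - 1) * lag
    return np.column_stack([x[t0 - k * lag : len(x) - k * lag] for k in range(dim)])

def correlation_entropy(points, r):
    # H ~ -ln C(r), with C(r) the fraction of point pairs closer than r (max norm).
    d = np.max(np.abs(points[:, None, :] - points[None, :, :]), axis=2)
    n = len(points)
    c = (np.sum(d < r) - n) / (n * (n - 1))   # exclude the n self-pairs
    return -np.log(c)

def nonlinear_gc(x1, x2, delta=1, dim=3, lag=1, r=1.0):
    # I(x2_future, x1_past | x2_past) assembled from entropy estimates as in Eq. (6).
    X1 = embed(x1, dim, lag)[:-delta]
    X2 = embed(x2, dim, lag)[:-delta]
    x2f = x2[(dim - 1) * lag + delta:, None]   # future of x2, aligned with the past vectors
    H = lambda *blocks: correlation_entropy(np.hstack(blocks), r)
    return H(x2f, X2) + H(X1, X2) - H(X2) - H(x2f, X1, X2)

# Toy usage on short coupled series (the pairwise distance matrix is O(N^2) in memory);
# here x2 is driven by the past of x1.
rng = np.random.default_rng(4)
x1 = rng.normal(size=400)
x2 = 0.8 * np.roll(x1, 2) + 0.2 * rng.normal(size=400)
print(nonlinear_gc(x1, x2), nonlinear_gc(x2, x1))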
Similar to F1→2, G1→2(f), and I1→2, causal effects in the other direction, namely F2→1, G2→1(f), and I2→1, can also be estimated. The difference between the two measures may indicate the directionality of dominant coupling between x1(t) and x2(t). Thus, causality can be inferred from the standard Granger causality ΔF = F2→1 − F1→2, from the spectral Granger causality ΔG(f) = G2→1(f) − G1→2(f) as a function of frequency, and from the net information transfer ΔI(δ) = I2→1(δ) − I1→2(δ) as a function of the time lag δ. If these measures are positive, the directionality of dominant coupling is reconstructed as x2(t) → x1(t), and as x1(t) → x2(t) if negative. Note that these net measures will report no causality in the case of symmetric bidirectional systems.
Suppose that there are n realizations of the processes x1(t) and x2(t), and for each realization k = 1, . . . , n, the phase shift φ12^{(k)}(f) is computed. The relative stability of the phase difference φ12^{(k)}(f) across realizations quantifies the degree of phase-locking between the two signals at a given frequency:
R12(f) = | (1/n) ∑_{k=1}^{n} exp( i φ12^{(k)}(f) ) | .   (12)
By construction, the statistic R12 ( f ) is limited between 0 and 1. When the relative
phase distribution is concentrated around the mean, R12 ( f ) is close to one, whereas
phase scattering will result in a random distribution of phases and R12 ( f ) close to
zero.
The mean phase delay φ 12 ( f ) between two signals can also be computed by av-
eraging across the realizations. However, there is an ambiguity in cumulative phase
shift between harmonic signals as, in general, it is not known how many cycles the
phase completed. In this book chapter, the phase difference φ 12 ( f ) between −90◦
and 0◦ implies that the signal x1 (t) (response) is phase delayed with respect to x2 (t)
(driver) at frequency f , and vice versa.
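As a concrete illustration (our own sketch; the phase-extraction step shown here, FFT phases at the bin closest to f, is an assumption for illustration), both R12(f) and the mean phase delay can be computed across realizations as follows.

import numpy as np

def phase_locking_index(x1_trials, x2_trials, f, fs):
    # x*_trials: arrays of shape (n_realizations, n_samples); fs: sampling rate (Hz).
    n_samples = x1_trials.shape[1]
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
    k = np.argmin(np.abs(freqs - f))                      # FFT bin closest to f
    phi = np.angle(np.fft.rfft(x1_trials, axis=1)[:, k]) \
        - np.angle(np.fft.rfft(x2_trials, axis=1)[:, k])  # phi_12^(k) per realization
    R = np.abs(np.mean(np.exp(1j * phi)))                 # Eq. (12)
    mean_phase = np.degrees(np.angle(np.mean(np.exp(1j * phi))))
    return R, mean_phase

# Example: two 10 Hz oscillations with a fixed 30-degree phase difference plus noise.
rng = np.random.default_rng(5)
t = np.arange(0, 2, 1 / 250.0)
x2 = np.sin(2 * np.pi * 10 * t) + 0.3 * rng.normal(size=(50, t.size))
x1 = np.sin(2 * np.pi * 10 * t - np.pi / 6) + 0.3 * rng.normal(size=(50, t.size))
print(phase_locking_index(x1, x2, f=10, fs=250))   # R close to 1, phase near -30 degrees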
Fig. 1 Typical time series generated by the system (1) in three scenarios: (A) phase difference
is close to zero at 10 Hz (ε = 0.07 and T = 0.1083); (B) negative phase shift (ε = 0.07 and
T = 0.1208); (C) positive phase shift (ε = 0.07 and T = 0.0958)
In the first scenario (Fig. 2), the parameters ε and T were chosen such that the
phase difference at 10Hz was close to zero. In such a case, the measure of Δ G( f )
is positive for frequencies 1 − 15Hz, reaching a peak around 11Hz (Fig. 2a). The
net information transfer Δ I(δ ) was also positive for all time lags δ . Positive values
for the causal statistics imply that the directionality of coupling is correctly recon-
structed as 2 → 1.
In the second scenario (Fig. 3), wherein the responding signal x1 (t) is phase de-
layed with respect to the driving x2 (t) (Fig. 1b), both Δ I(δ ) and Δ G( f ) are positive,
reaching a peak around 10Hz, also implying (correctly) the directionality of cou-
pling as 2 → 1. Note that the peak in Δ G( f ) at 10 Hz is higher in Fig. 3, compared
to that at 11Hz in Fig. 2, although the strength of coupling was the same. The time
precedence, as specified by the directionality of coupling from the model, concurs
with the phase precedence, as detected from the phase-locking analysis.
Fig. 2 Reconstructed causality and phase effects in the case where there is no phase shift (φ12(f) = 0.4° for ε = 0.07 and T = 0.1083) at f = 10 Hz (see Fig. 1a): (A) spectral Granger causality as a function of frequency; (B) net information transfer as a function of the time lag δ; (C) phase-locking index and (D) phase shift as functions of frequency. For the results shown in Figs. 2–6, the embedding parameters, τ = 1 and d = 5, were kept the same, whereas p was estimated according to the Bayesian information criterion, separately for each pair of time series.
Fig. 3 Reconstructed causality and phase effects in the case where the phase shift between
the driver and response is φ 12 ( f ) = −44.2° at f = 10 Hz for ε = 0.07 and T = 0.1208 (see
Fig. 1b): (A) spectral Granger causality as a function of frequency; (B) net information trans-
fer as a function of the time lag δ ; (C) phase-locking index and (D) phase shift as functions
of frequency
Fig. 4 Performance of causal statistics and phase effects in the case of a positive phase dif-
ference (φ 12 ( f ) = 45.1◦ ) between them at f = 10 Hz for ε = 0.07 and T = 0.0958 (see
Fig. 1c): (A) spectral Granger causality as a function of frequency; (B) net information trans-
fer as a function of the time lag δ ; (C) phase-locking index and (D) phase shift as functions
of frequency
Fig. 4 represents the third scenario wherein the effects associated with phase
precedence counteract the effects related to the causal relations as implemented in
system (1). Specifically, in this case, the driver x2 (t) is phase delayed with respect to the response x1 (t), with φ 12 ( f ) ≈ 45° at f = 10 Hz. The causal effects related
to the phase shift are relatively strong compared to the inherent causality between
x1 (t) and x2 (t). The spectral Granger causality switches to negative values, implying
that the causal relations are spuriously reconstructed as 1 → 2. The net information
transfer is also sensitive to the phase shift, being either positive or negative, depend-
ing on the value of the time lag δ . It should be noted that Δ I(δ ) is more resistant
to the phase-locking effects, as the mean value Δ I(δ ) averaged across δ is positive
(2 → 1).
Notably, the performance of the standard Granger causality was similar to that of
the spectral Granger causality. When there was no phase shift at 10Hz, the mean Δ F
averaged across the realizations was 0.0168, whereas the confidence interval of Δ F
based on the corresponding surrogate data and defined by the 5%- and 95%-tails,
was [−0.0066, 0.0059]. In the case of φ 12 ( f ) close to 45° at 10 Hz, Δ F = 0.0574 with the confidence interval [−0.0060, 0.0055] based on the surrogate data. However, when φ 12 ( f ) is about −44°, the analysis produced Δ F = −0.0125, whereas the confidence interval for surrogate data was [−0.0050, 0.0048]. Thus, the stan-
dard Granger causal statistic was significantly affected by the differences in phase
between the two signals.
Fig. 5 Influence of time delay in coupling on: (A) standard Granger causality; (B) spectral
Granger causality and (C) net information transfer as functions of the observed time differ-
ence at 10 Hz; (D) phase difference at 10 Hz as a function of the time delay in coupling,
provided that the strength of coupling was unchanged (ε = 0.07)
The phase shift, at the frequency when the signals become phase-locked to each
other, depends not only on the time delay in coupling T , but also on the coupling
strength ε . Fig. 6 is based on the simulations wherein T was kept constant, whereas
ε varied from 0 to 0.1. The statistics Δ F, Δ G and Δ I, as well as the phase difference φ 12 estimated at 10 Hz, are shown as functions of ε . As can be seen, φ 12 at 10 Hz
can be either positive (phase delay) or negative (phase lead of the driver x2 (t) with
respect to the response x1 (t)).
Notably, the information-theoretic statistic Δ I is a monotonic function of ε as
can be seen in Fig. 6c (note, however, that for very high ε , when the driver and
Fig. 6 Influence of the strength of coupling on: (A) standard Granger causality; (B) spectral
causality and (C) net information transfer as functions of phase difference at 10 Hz; (D) phase
difference at 10 Hz as a function of the coupling strength, with the time delay in coupling kept
constant (T = 0.1083)
0.0075, which also corresponds to the phase delay of x2 (t) with respect to x1 (t)),
the standard and spectral Granger statistics correctly identify the directionality of
coupling.
To compute the multi-scale entropy (MSE), the original time series {xi } is first coarse-grained by averaging the data points within non-overlapping windows of length θ :

y^(θ ) (ζ ) = (1/θ ) ∑i=(ζ −1)θ +1…ζ θ xi ,   1 ≤ ζ ≤ n/θ ,   (13)
wherein the fluctuations at scales smaller than θ are eliminated. The window length,
measured in data points, represents the scale factor, θ = 1, 2, 3, .... Note that θ = 1
represents the original time series, whereas relatively large θ produces a smooth
signal, containing basically low frequency components of the original signal. To
obtain the MSE curve, sample entropy is computed for each coarse-grained time
series.
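A minimal sketch of the coarse-graining step in Eq. (13); the sample_entropy routine used to build the MSE curve is assumed to exist elsewhere and is only indicated in a comment:

```python
import numpy as np

def coarse_grain(x, theta):
    """Coarse-grained series y^(theta)(zeta) of Eq. (13): averages of
    non-overlapping windows of length theta (theta = 1 returns the original series)."""
    x = np.asarray(x, dtype=float)
    n_windows = len(x) // theta
    return x[:n_windows * theta].reshape(n_windows, theta).mean(axis=1)

# The MSE curve is then the sample entropy of each coarse-grained series, e.g.
#   mse = [sample_entropy(coarse_grain(x, theta)) for theta in range(1, 21)]
# where sample_entropy() is assumed to be provided elsewhere.
```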
Fig. 7 Influence of the time delay in coupling T on the differences in complexity (sample entropy) between the driver and response, and on the information transfer: (A) net information transfer; (B) difference in complexity at the fine time scales (scales 1–5); (C) difference in complexity at the fine time scales as a function of the net information transfer (r = 0.73, p-value < 0.0001); (D) difference in complexity at the coarse time scales (scales 16–20); (E) difference in complexity at the coarse time scales as a function of the net information transfer (r = −0.08, not significant). The positive correlation in panel C implies that at the fine scales, the signal complexity of the driver is higher than that of the response.
there exists a relatively strong and robust linear correlation between the information
transfer and differences in complexity at fine time scales (r = 0.73, p-value< 0.001).
Positive r implies that a system with higher variability at fine time scales can better
predict the behavior of a system with lower variability, than the other way around.
At the same time, the correlation between the information transfer and differences
at coarse time scales (Fig. 7e) is close to zero.
Fig. 8 Influence of the strength of coupling ε on the differences in complexity (sample entropy) between the driver x2 (t) and the response x1 (t) in (1), and on the information transfer: (A) net information transfer; (B) difference in complexity at the fine time scales; (C) difference in complexity at the fine time scales as a function of the net information transfer (r = 0.73, p-value < 0.0001); (D) difference in complexity at the coarse time scales; (E) difference in complexity at the coarse time scales as a function of the net information transfer. Note the negative correlation between the two statistics in panel E: the dominant amount of information transferred from the system with lower complexity (the driver x2 (t)) to the system with higher complexity (the response x1 (t)) is a monotonic function of the difference in their signal complexity at the time scales that are sensitive to non-linear deterministic effects.
9 Conclusion
We considered two effects that can contribute to the reconstruction of the driver-
response relations in coupled non-linear systems. The first effect reflects the idea
that the difference in complexity between the driver and the response is associated
with the dominant transfer of information. Specifically, the causality can be viewed
as a transfer of information from one system to another, which increases the signal
complexity of the individual subsystems as the information propagates along the
network. The time scales at which the complexity is computed are a critical factor. In
our example, at the coarse time scales as used in the multi-scale entropy estimation,
the difference in the complexity between the two coupled subsystems was propor-
tional to the strength of coupling. This suggests that it is the coarse scales that reflect
non-linear deterministic effects for the system (1). In addition, the net information
transfer was a monotonic function of the coupling strength. Thus, the propagation
of information, which is the basis for causality reconstruction, in general, induces
an increase of signal complexity at the time scales that reflect the deterministic ef-
fects underlying the observed time series. Expanding a model of two sources to a
larger network, this accumulated complexity may clarify the topological roles of
individual nodes in this network [16].
The second effect is based on the existence of possible phase differences be-
tween the driver and response at specific frequencies. This effect can either in-
tensify or counteract the causality effects considered as the propagation of com-
plexity. Depending on the strength of the effects associated with phase differences,
the complexity-related causal effects can be partly neutralized or even totally sup-
pressed. This can be explicitly observed in the scenarios wherein signals become
phase-locked to each other at some frequencies. In turn, this could have a dominant
influence on estimated causal statistics.
In our example, we considered the role of phase shifts in the context of non-linear
coupled systems, in contrast to the case of linear time-invariant systems. In the latter
scenario, the spectrum of the signal is not limited to a single harmonic component
but spans several frequencies. In the frequency domain, the slope of the phase (group
delay) produces an estimate for the time delay between the signals, which may be
used to solve the ambiguity of phase differences at a specific frequency [9]. There
exists a causal measure that explicitly exploits the cumulative phase delay as the
basis for causality [17]. However, as our examples show, in the case of non-linear
interactions, we should expect that such an approach may lead to spurious results.
In general, we found that all the statistics tested in this study were sensitive to
phase differences. However, in the situation wherein the driver was phase delayed
with respect to the response with φ 12 ( f ) approximately between 0◦ and 90◦ , both
the standard and spectral measures produced statistically significant, but spurious re-
sults. On the contrary, the information-theoretic measure performed reasonably well
in the same situations, correctly reconstructing the underlying relations as specified
by the model.
The spectral Granger statistic explicitly depends on the phase differences be-
tween harmonic components of tested signals, and the contribution from specific
frequencies can be intensified by the mechanism of phase-locking. In some sense,
inferring the directionality of coupling at a specific frequency can be viewed as an
extreme case of filtering the signals with a narrow band-pass filter. On the contrary,
we should expect that causality is ultimately based on interactions between differ-
ent frequency components. [6] explored the effects of different filtering techniques
on the performance of several causality measures. They found that, without strong
assumptions about the artifacts to be removed, filtering disturbs the information
content and leads to missed or spurious results.
Finally, the information transfer outperformed the standard Granger statistic, al-
though both measures work in the time domain. We believe that a critical difference
between the standard and non-linear versions of the causality lies in averaging the
causality measures across the length of forecast horizon, that is, across the param-
eter δ . As can be seen from the model (2), only one specific δ , namely, δ = 1, is
used for estimating the standard Granger measure. At the same time, the common
practice for computing transfer entropy is to average it across some range of the
lags δ . Originally, this was proposed in [20] with the idea of decreasing the variability of the estimated statistics and increasing the robustness of the results. The time lag δ
may affect the phase difference between the future and the past of the same signal.
In other words, δ = 1 may not be optimal. If the range of δ is large enough to
cover the entire period of the characteristic scales of the signal dynamics, averaging
across δ would smooth out the phase effects.
Acknowledgments. This research was supported by research grants from the J.S. McDonnell
Foundation to Dr. Anthony R. McIntosh. We thank Maria Tassopoulos-Karachalios for her
assistance in preparing this manuscript.
References
1. Arnhold, J., Grassberger, P., Lehnertz, K., Elger, C.E.: A robust method for detecting in-
terdependences: application to intracranially recorded EEG. Physica D: Nonlinear Phe-
nomena 134(4), 419–430 (1999)
2. Buzsaki, G.: Rhythms of the brain. Oxford University Press, New York (2006)
3. Chavez, M., Martinerie, J., Le Van Quyen, M.: Statistical assessment of nonlinear causal-
ity: application to epileptic EEG signals. J. Neurosci. Methods 124(2), 113–128 (2003)
4. Costa, M., Goldberger, A.L., Peng, C.K.: Multiscale entropy analysis of physiologic time
series. Phys. Rev. Lett. 89, 062102 (2002)
5. Deco, G., Jirsa, V., McIntosh, A.R., Sporns, O., Kötter, R.: Key role of coupling, delay,
and noise in resting brain fluctuations. Proceedings of the National Academy of Sci-
ences 106(25), 10302–10307 (2009)
6. Florin, E., Gross, J., Pfeifer, J., Fink, G.R., Timmermann, L.: The effect of filtering on
Granger causality based multivariate causality measures. Neuroimage 50(2), 577–578
(2010)
7. Geweke, J.: Measurement of linear dependence and feedback between multiple time se-
ries. Journal of the American Statistical Association 77, 304–313 (1982)
8. Ghosh, A., Rho, Y., McIntosh, A.R., Kötter, R., Jirsa, V.: Cortical network dynamics with
time delays reveals functional connectivity in the resting brain. Cognitive Neurodynam-
ics 2(2), 115–120 (2008)
9. Gotman, J.: Measurement of small time differences between EEG channels: method and
application to epileptic seizure propagation. Electroenceph. Clin. Neurophysiol. 56, 501–
514 (1983)
10. Gourévitch, B., Le Bouquin-Jeannès, R., Faucon, G.: Linear and nonlinear causality be-
tween signals: methods, examples and neurophysiological applications. Biological Cy-
bernetics 95(4), 349–369 (2007)
11. Granger, C.W.J.: Investigating causal relations by econometric models and cross spectral
methods. Econometrica 37, 428–438 (1969)
12. Grassberger, P., Procaccia, I.: Estimation of the Kolmogorov entropy from a chaotic sig-
nal. Phys. Rev. A 28, 2591–2593 (1983)
13. Hadjipapas, A., Casagrande, E., Nevado, A., Barnes, G.R., Green, G., Holliday, I.E.: Can
we observe collective neuronal activity from macroscopic aggregate signals? NeuroIm-
age 44(4), 1290–1303 (2009)
14. Haken, H.: Principles of brain functioning. Springer (1996)
15. Kamiński, M., Ding, M., Truccolo, W.A., Bressler, S.L.: Evaluating causal relations in
neural systems: Granger causality, directed transfer function and statistical assessment
of significance. Biological Cybernetics 85, 145–157 (2001)
16. Mišić, B., Vakorin, V., Paus, T., McIntosh, A.R.: Functional embedding predicts the vari-
ability of neural activity. Frontiers in Systems Neuroscience 5, 90 (2011)
17. Nolte, G., Ziehe, A., Nikulin, V.V., Brismar, T., Müller, K.R., Schlögl, A., Krämer, N.:
Robustly estimating the flow direction of information in complex physical systems. Phys.
Rev. Lett. 100(23), 234101 (2008)
18. Nunez, P.L.: Neocortical dynamics and human brain rhythms. Oxford University Press
(1995)
19. Paluš, M., Vejmelka, M.: Directionality of coupling from bivariate time series: How to
avoid false causalities and missed connections. Phys. Rev. E 75, 056211 (2007)
20. Paluš, M., Komárek, V., Hrnčı́ř, Z., Štěrbová, K.: Synchronization as adjustment of info-
mation rates: Detection from bivariate time series. Phys. Rev. E 63, 046211 (2001)
21. Pincus, S.M.: Approximate entropy as a measure of system complexity. Proc. Natl. Acad.
Sci. USA 88, 2297–2301 (1991)
22. Prichard, D., Theiler, J.: Generalized redundancies for time series analysis. Physica
D 84, 476–493 (1995)
23. Prokhorov, M.D., Ponomarenko, V.I.: Estimation of coupling between time-delay sys-
tems from time series. Physical Review E 72(1), 016210 (2005)
24. Quiroga, R.Q., Arnhold, J., Grassberger, P.: Learning driver-response relationships from
synchronization patterns. Phys. Rev. E 61, 5142–5148 (2000)
25. Richman, J.S., Moorman, J.R.: Physiological time-series analysis using approximate en-
tropy and sample entropy. Am. J. Physiol. Heart. Circ. Physiol. 278(6), H2039–H2049
(2000)
26. Schreiber, T.: Measuring information transfer. Phys. Rev. Letters 85(2), 461–464 (2000)
27. Schwarz, G.: Estimating the dimension of a model. The Annals of Statistics 6(2), 461–
464 (1978)
28. Silchenko, A.N., Adamchic, I., Pawelczyk, N., Hauptmann, C., Maarouf, M., Sturm, V.,
Tass, P.A.: Data-driven approach to the estimation of connectivity and time delays in the
coupling of interacting neuronal subsystems. Journal of Neuroscience Methods 191(1),
32–44 (2010)
29. Singer, W.: Neuronal synchrony: A versatile code for the definition of relations? Neu-
ron 24, 49–65 (1999)
30. Small, M., Tse, C.K.: Applying the method of surrogate data to cyclic time series. Phys-
ica D 164, 187–201 (2002)
31. Takens, F.: Detecting strange attractors in turbulence. In: Dynamical Systems and Tur-
bulence. Lecture Notes in Mathematics, vol. 898. Springer (1981)
32. Vakorin, V.A., McIntosh, A.R.: Mapping the multi-scale information content of complex
brain signals. In: Principles of Brain Dynamics: Global State Interactions, pp. 183–208.
The MIT Press (2012)
33. Vakorin, V.A., Krakovska, O.A., McIntosh, A.R.: Confounding effects of indirect con-
nections on causality estimation. Journal of Neuroscience Methods 184(1), 152–160
(2009)
34. Vakorin, V.A., Mišić, B., Krakovska, O., McIntosh, A.R.: Empirical and theoretical as-
pects of generation and transfer of information in a neuromagnetic source network. Fron-
tiers in Systems Neuroscience 5(96), 00096 (2012)
35. Vakorin, V.A., Mišić, B., Krakovska, O., Bezgin, G., McIntosh, A.R.: Confounding ef-
fects of phase delays on causality estimation. PLoS One 8(1), e5358 (2013)
36. Varela, F., Lachaux, J.P., Rodriguez, E., Martinerie, J.: The brainweb: phase synchro-
nization and large-scale integration. Nature Reviews Neuroscience 2(4), 229–239 (2001)
37. Vicente, R., Wibral, M., Lindner, M., Pipa, G.: Transfer entropy – a model-free measure of effective connectivity for the neurosciences. Journal of Computational Neuroscience 30(1), 45–67 (2011)
38. Zhang, Y.-C.: Complexity and 1/f noise. A phase space approach. J. Phys. I France 1
(1991)
Part III
Recent Advances in the Analysis of
Information Processing
Measuring the Dynamics of Information Processing on a Local Scale

Joseph T. Lizier
1 Introduction
Analysis of directed information transfer between variables in time-series brain
imaging data and models is currently gaining much attention in neuroscience. Mea-
sures of information transfer have been computed, for example, in fMRI measure-
ments in the human visual cortex between average signals at the regional level [38]
and between individual voxels [8], as well as between brain areas of macaques from
local field potential (LFP) time-series [48]. A particularly popular topic in this do-
main is the use of information transfer measures to infer effective network connec-
tivity between variables in brain-imaging data [39, 91, 49, 88, 69, 54, 63], as well
as studying modulation of connection strength with respect to an underlying task
Joseph T. Lizier
CSIRO Computational Informatics, Marsfield, Australia
e-mail: joseph.lizier@csiro.au
different specific source states may be more predictive of a target than other states, or
how coupling strength may relate to changing underlying experimental conditions.
Indeed, the ability to investigate time-series dynamics of distributed computation
in complex systems provides an important connection from information theory to
dynamical systems theory or non-linear time-series analysis (e.g. see [81, 41]). We
use the term information dynamics to describe the study of distributed computation
in complex systems in terms of how information is stored, transferred and modified
[59, 60, 62]. The word dynamics is a key component of this term, referring to both:
1. That we study the dynamic state updates of variables in the system, decompos-
ing information in the measurement of a variable in terms of information from
that variable’s own past (information storage), information from other variables
(information transfer) and how those information sources are combined (infor-
mation modification);
2. That we study local information-theoretic measures for each of these variables,
quantifying the dynamics of these operations in time and space.
In this chapter, we review how such local information-theoretic measurements
can be made, and describe how they are used to define local measures of informa-
tion storage and transfer in distributed computation in complex systems. We begin
by describing the relevant information-theoretic concepts in Sect. 2, before provid-
ing a detailed presentation of how local information-theoretic measures are defined
in Sect. 3. We then provide an overview of our framework for information dynam-
ics in Sect. 4, describing the measures used for information storage and transfer,
and how they can be localised within a system in space and time using the tech-
niques of Sect. 3. Next, we review in Sect. 5 the application of these local measures
of computation to cellular automata, a simple discrete dynamical model which is
known to exhibit complex behaviour and emergent coherent structures (known as
particles or gliders) resembling coherent waves in neural dynamics [27]. This appli-
cation demonstrates the utility of these local measures of information storage and
transfer, by providing key insights into the dynamics of cellular automata, includ-
ing demonstrating evidence for long-held conjectures regarding the computational
role of the emergent structures (e.g. gliders as information transfer entities). Most
importantly, the local measures are shown to provide insights into the dynamics
of information in the system that are simply not possible to obtain with traditional
averaged information-theoretic methods.
We finish the chapter by describing in Sect. 6 further such insights into the dy-
namics of information that have since been obtained with these local measures for
other systems. For example, the measures have revealed coherent information cas-
cades spreading across flocks (or swarms) [92] and in modular robots [57], in anal-
ogy to the aforementioned gliders in cellular automata. They have also demonstrated
the key role of information transfer in network synchronization processes, in partic-
ular in indicating when a synchronized state has been “computed” but not yet ob-
viously reached [9]. Just like the cellular automata examples, these demonstrate the
ability of local information dynamics to reveal how the computation in a system un-
folds in time, and the dynamics of how separate agents or entities interact to achieve
a collective task. Crucially, they allow one to answer meaningful questions about
the information processing in a system, in particular: “when and where is informa-
tion transferred in the brain during cognitive tasks?”, and we describe a preliminary
study where this precise question is explored using fMRI recordings during a but-
ton pressing task. As such, we demonstrate that local information dynamics enables
whole new lines of inquiry which were not previously possible in computational
neuroscience or other fields.
2 Information-Theoretic Preliminaries
To quantify the information dynamics of distributed computation, we first look to
information theory (e.g. see [85, 13, 65]) which has proven to be a useful framework
for the design and analysis of complex self-organized systems, e.g. [14, 77, 78, 66].
In this section, we give a brief overview of the fundamental quantities which will be
built on in exploring local information dynamics in the following sections.
The fundamental quantity of information theory is the Shannon entropy, which
represents the average uncertainty associated with any measurement x of a random
variable X (logarithms are taken by convention in base 2, giving units in bits):

H(X) = − ∑x p(x) log2 p(x).

The uncertainty H(X) associated with such a measurement is equal to the informa-
tion required to predict it (see self-information below).
The Shannon entropy was originally derived following an axiomatic approach.
This is important because it gives primacy to desired properties over candidate mea-
sures, rather than retrospectively highlighting properties of an appealing candidate
measure. It shifts the focus of any arguments over the form of measures onto the
more formal ground of selecting which axioms should be satisfied. This is partic-
ularly useful where a set of accepted axioms can uniquely specify a measure (as
in the cases discussed here). We highlight the axiomatic approach here because it
has persisted in later developments in information theory, in particular for the local
measures we discuss in Sect. 3 (as well as more recently in debate over measures of
information redundancy [95, 35, 53]).
So, the Shannon entropy was derived as the unique formulation (up to the base
of the logarithm) satisfying a certain set of properties or axioms [85] (with property
labels following [76]):
• continuity with respect to the underlying probability distribution function p(x)
(PDF). This sensibly ensures that small changes in p(x) only lead to small
changes in H(X).
• monotony: being a monotonically increasing function of the number of choices
n for x when each choice xi is equally likely (with probability p(xi ) = 1/n). In
Shannon’s words, this is desirable because: “With equally likely events there is
more choice, or uncertainty, when there are more possible events” [85].
• grouping: “If a choice (can) be broken down into two successive choices, the
original H should be the weighted sum of the individual values of H” [85]. That
is to say, “H is independent of how the process is divided into parts” [76]. This
is crucial because the intrinsic uncertainty we measure for the process should not
depend on any subjectivity in how we divide up the stages of the process to be
examined.
Further, note that the Shannon entropy for a measurement can be interpreted as
the minimal average number of bits required to encode or describe its value without
losing information [65, 13].
The joint entropy of two random variables X and Y is a generalization to quan-
tify the uncertainty of their joint distribution:

H(X,Y ) = − ∑x,y p(x, y) log2 p(x, y).
The mutual information (MI) between X and Y measures the average reduction
in uncertainty about x that results from learning the value of y, or vice versa:
I(X;Y ) = ∑x,y p(x, y) log2 [ p(x | y) / p(x) ]   (5)
= H(X) − H(X | Y ).   (6)
The MI is symmetric in the variables X and Y . The mutual information for measure-
ments of X and Y can be interpreted as the average number of bits saved in encoding
or describing X given that the receiver of the encoding already knows the value of Y ,
in comparison to the encoding of X without the knowledge of Y . These descriptions
of X with and without the value of Y are both minimal without losing information.
Note that one can compute the self-information I(X; X), which is the average in-
formation required to predict the value of X, and is equal to the uncertainty H(X)
associated with such a measurement.
The conditional mutual information between X and Y given Z is the mutual
information between X and Y when Z is known:
I(X;Y | Z) = ∑x,y,z p(x, y, z) log2 [ p(x | y, z) / p(x | z) ]   (7)
= H(X | Z) − H(X | Y, Z).   (8)
One can consider the MI from two variables Y1 ,Y2 jointly to another variable X,
I(X;Y1 ,Y2 ), and using (4), (6) and (8) decompose this into the information carried
by the first variable plus that carried by the second conditioned on the first:

I(X;Y1 ,Y2 ) = I(X;Y1 ) + I(X;Y2 | Y1 ).
The entropy rate of the process X may be defined as:

Hμ′ (X) = limn→∞ H(Xn^(n) ) / n

(where the limit exists), where we have used Xn^(k) = {Xn−k+1 , . . . , Xn−1 , Xn } to denote the k consecutive variables of X up to and including time step n. This quantity describes the limiting rate at which the entropy of n consecutive measurements of X grows with n. A related definition is given by:2

Hμ (X) = limk→∞ H(Xn+1 | Xn^(k) ).   (13)

2 Note that we have reversed the use of the primes in the notation from [13], in line with [14].
Cover and Thomas [13] point out that these two quantities correspond to two subtly
different notions: the first is something of an average per symbol entropy, while the
second is a conditional entropy of the last random variable given the past. These
authors go on to demonstrate that for stationary processes X, the limits for the two
quantities Hμ
(X) and Hμ (X) exist (i.e. the average entropy rate converges) and are
equal.
For our purposes in considering information dynamics, we are interested in the
latter formulation Hμ (X), since it explicitly describes how one random variable Xn is related to the previous instances Xn−1^(n−1) . For practical usage, we are particularly interested in estimation of Hμ (X) with finite lengths k, and in estimating it regarding the information at different time indices n. That is to say, we use the notation Hμ (Xn+1 , k) to describe the conditional entropy in Xn+1 given Xn^(k) :

Hμ (Xn+1 , k) = H(Xn+1 | Xn^(k) ).   (14)
Of course, letting k = n and joining (13) and (14) we have limn→∞ Hμ (Xn+1 , n) =
Hμ (X).
H(X) = − ∑m [ c(x = m) / N ] log2 p(x = m),   (18)

and then further expand using the identity c(x = m) = ∑g=1…c(x=m) 1:

H(X) = − ∑m ∑g=1…c(x=m) (1/N) log2 p(x = m).   (19)

This leaves a double sum running over i. each actual observation g, ii. each possible observation x = m. This is equivalent to a single sum over all N observations xi , i = 1 . . . N, giving:

H(X) = − (1/N) ∑i=1…N log2 p(xi ).   (20)
This identifies the local entropy h(xi ) = − log2 p(xi ) as the information content of each individual observation xi . The local joint entropy h(x, y) = − log2 p(x, y) and local conditional entropy h(x | y) = − log2 p(x | y) are defined analogously, and these quantities satisfy the chain rule in alignment with their averages:

h(x, y) = h(y) + h(x | y).
In this way, we see that the information content of a joint quantity (x, y) is the code
length of y plus the code length of x given y. Finally, we note that this quantity is also
referred to as conditional self-information and can also be derived (see [22, Chapter
2]) by starting with the local conditional mutual information (see Sect. 3.2).
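As a small illustration of the local/average correspondence of Eq. (20), a plug-in sketch for a discrete sample (names are illustrative only):

```python
import numpy as np
from collections import Counter

def local_entropy(observations):
    """Local entropy h(x_i) = -log2 p(x_i) for each observation, with plug-in
    probabilities; the average of the local values recovers H(X), cf. Eq. (20)."""
    obs = list(observations)
    counts, n = Counter(obs), len(obs)
    return np.array([-np.log2(counts[o] / n) for o in obs])

x = [0, 0, 1, 1, 1, 2]                                   # toy sample
plug_in = -sum(c / len(x) * np.log2(c / len(x)) for c in Counter(x).values())
assert np.isclose(local_entropy(x).mean(), plug_in)      # Eq. (20) in action
```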
The local mutual information is given by:

i(x; y) = log2 [ p(x | y) / p(x) ] .   (25)

In this way, we see that the local mutual information is the difference in code lengths between coding the value x in isolation (under the optimal encoding scheme for X), or coding the value x given y (under the optimal encoding scheme for X given Y ). In other words, this quantity captures the coding “cost” for x in not being aware of the value y. Similarly, the local conditional mutual information can be constructed as:

i(x; y | z) = log2 [ p(x | y, z) / p(x | z) ] .   (27)

Here, we see that the local conditional mutual information is the difference in code lengths (or coding cost) between coding the value x given z (under the optimal encoding scheme for X given Z), or coding the value x given both y and z (under the optimal encoding scheme for X given Y and Z).
More formally however, Fano [22, ch. 2] set out to quantify “the amount of infor-
mation provided by the occurrence of the event represented by yi about the occur-
rence of the event represented by xi .” He derived the local mutual information i(x; y)
(25) to capture this concept, as well as the local conditional mutual information
i(x; y | z) (27), directly from the following four postulates:
• once-differentiability with respect to the underlying probability distribution
functions p(x) and p(x | y);
• identical mathematical form for the local MI and local conditional
MI, only with p(x) replaced by p(x | z) and p(x | y) replaced by p(x | y, z);
• additivity for the information provided by y and z about x, i.e.: i({y, z} ; x) =
i(y; x) + i(z; x | y);
• separation for independent ensembles XY and UV , i.e. where we have
p(x, y, u, v) = p(x, y)p(u, v) then we must have i({x, u} ; {y, v}) = i(x; y) + i(u; v).
Crucially, Fano’s derivation means that i(x; y) and i(x; y | z) are uniquely specified,
up to the base of the logarithm.
Of course, we have I(X;Y ) = ⟨i(x; y)⟩ and I(X;Y | Z) = ⟨i(x; y | z)⟩ as per the averaged entropy quantities in the previous section. It is particularly interesting that
Fano made the derivation for local mutual information directly, and only computed
the averaged quantity as a result of that. This contrasts with contemporary perspec-
tives which generally give primary consideration to the averaged quantity. (This is
not the case however in natural language processing for example, where the local
MI is commonly used and known as the point-wise mutual information, e.g. [68]).
We also note that i(x; y) is symmetric in x and y (like I(X;Y )), though this was
not explicitly built into the above postulates.
Next, consider that the local MI and conditional MI values may be either posi-
tive or negative, in contrast to the local entropy which cannot take negative values.
Positive values are fairly intuitive to understand: the local mutual information in
(25) is positive where p(x | y) > p(x), i.e. knowing the value of y increased our
expectation of (or positively informed us about) the value of the measurement x.
The existence of negative values is often a concern for readers unfamiliar with
the concept, however they too are simple to understand. Negative values simply
occur in (25) where p(x | y) < p(x), i.e. knowing about the value of y actually
changed our belief p(x) about the probability of occurrence of the outcome x to
a smaller value p(x | y), and hence we considered it less likely that x would occur
when knowing y than when not knowing y, in a case where x nevertheless occurred.
As an example, consider the probability that it will rain today, p(rain = 1), and
the probability that it will rain given that the weather forecast said it would not,
p(rain = 1 | rain forecast = 0). Being generous to weather forecasters for a
moment, let’s say that p(rain = 1 | rain forecast = 0) < p(rain = 1), so
we would have i(rain = 1; rain forecast = 0) < 0, because we considered it
less likely that rain would occur today when hearing the forecast than without the
forecast, in a case where rain nevertheless occurred. These negative values of MI are
actually quite meaningful, and can be interpreted as there being negative informa-
tion in the value of y about x. We could also interpret the value y as being misleading
or misinformative about the value of x, because it had lowered our expectation of
observing x prior to that observation being made in this instance. In the above ex-
ample, the weather forecast was misinformative about the rain today. One can also
view the negative values using (24), seeing that i(x; y) is negative where knowing y
increased the uncertainty about x.
Importantly, these local measures always average to give a non-negative value.
Elaborating on an example from Cover and Thomas [13, p.28], “in a court case,
specific new evidence” y “might increase uncertainty” about the outcome x, “but
on the average evidence decreases uncertainty”. Similarly, in our above example,
while the weather forecast might misinform us about the rain on a particular day, on
average the weather forecast will provide positive (or at least zero!) information.
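To make this concrete, a tiny numerical sketch with made-up probabilities, using the local mutual information of Eq. (25):

```python
import numpy as np

p_rain = 0.3                     # assumed p(rain = 1), for illustration only
p_rain_given_forecast_no = 0.1   # assumed p(rain = 1 | rain_forecast = 0)

# local MI of the forecast about the rain that nevertheless occurred, Eq. (25)
i_local = np.log2(p_rain_given_forecast_no / p_rain)
print(round(i_local, 2))         # about -1.58 bits: the forecast was misinformative here
```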
Finally, we note that the local mutual information i(x; y) measures we consider
here are distinct from partial localization expressions, i.e. the partial mutual infor-
mation or specific information I(x;Y ) [18], which consider information contained in
specific values x of one variable X about the other (unknown) variable Y . Crucially,
there are two valid approaches to measuring partial mutual information, one which
preserves the additivity property and one which retains non-negativity [18]. As de-
scribed above however, there is only one valid approach for the fully local mutual
information i(x; y) (and see further discussion in [56]).
Using kernel-estimators (e.g. see [82, 41]), the relevant probabilities (e.g. p̂(x | y)
and p̂(x) for mutual information) are estimated with kernel functions, and then these
values are used directly in the equation for the given local quantity (e.g. (25)) as a
plug-in estimate (see e.g. [61]).
With the improvements to kernel-estimation for mutual information suggested by
Kraskov et al. [45, 44] (and extended to conditional mutual information and trans-
fer entropy by [24, 26]), the PDF evaluations are effectively bypassed, and for the
average measure one goes directly to estimates based on nearest neighbour counts
nx and ny in the marginal spaces for each observation. For example, for Kraskov’s
algorithm 1 we have:
I(X;Y ) = ψ (k) − ⟨ψ (nx + 1) + ψ (ny + 1)⟩ + ψ (N),   (30)
where ψ denotes the digamma function, and the values are returned in nats rather
than bits. Local values can be extracted here simply by unrolling the expectation
values and computing the nearest neighbour counts only at the given observation
(x, y), e.g. for algorithm 1:

i(x; y) = ψ (k) − ψ (nx + 1) − ψ (ny + 1) + ψ (N).
This has been observed as a “time-varying estimator” in [26] and used to estimate
the local transfer entropy in [50] and [89].
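A simplified sketch of such a local (time-varying) Kraskov estimate, unrolling the expectation as described above; it omits the usual noise addition and bias handling and assumes no duplicate samples:

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def ksg_local_mi(x, y, k=4):
    """Local MI values (in nats) per observation, following Kraskov et al.'s
    algorithm 1; a minimal sketch only."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    y = np.asarray(y, dtype=float).reshape(len(y), -1)
    n = len(x)
    joint = np.hstack([x, y])
    tree_j, tree_x, tree_y = cKDTree(joint), cKDTree(x), cKDTree(y)
    # distance to the k-th neighbour in the joint space (max-norm), self excluded
    eps = tree_j.query(joint, k=k + 1, p=np.inf)[0][:, -1]
    local = np.empty(n)
    for i in range(n):
        # neighbour counts strictly within eps[i] in each marginal space
        nx = len(tree_x.query_ball_point(x[i], eps[i] - 1e-12, p=np.inf)) - 1
        ny = len(tree_y.query_ball_point(y[i], eps[i] - 1e-12, p=np.inf)) - 1
        local[i] = digamma(k) - digamma(nx + 1) - digamma(ny + 1) + digamma(n)
    return local   # local.mean() recovers the average estimate of Eq. (30)
```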
Using permutation entropy approaches [3] (e.g. symbolic transfer entropy [87]),
the relevant probabilities are estimated based on the relative ordinal structure of
the joint vectors, and these values are directly used in the equations for the given
quantities as plug-in estimates (e.g. see local symbolic transfer entropy in [72]).
Finally, using a multivariate Gaussian model for X (which is of d dimensions),
the average entropy has the form [13]:
H(X) = (1/2) ln ((2π e)^d |Ω |),   (32)
(where μ is the expectation value of x), then using these values directly in the equa-
tion for the given local quantity as a plug-in estimate.5
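For instance, local entropy values under the Gaussian model can be read off directly from the fitted mean and covariance; a minimal sketch (values in nats), whose average approaches Eq. (32):

```python
import numpy as np

def gaussian_local_entropy(samples):
    """Local entropy -ln p(x) for each sample under a multivariate Gaussian
    fitted to the data; the mean of the local values approaches Eq. (32)."""
    X = np.asarray(samples, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    mu, d = X.mean(axis=0), X.shape[1]
    cov = np.atleast_2d(np.cov(X, rowvar=False))
    inv = np.linalg.inv(cov)
    _, logdet = np.linalg.slogdet(cov)
    diff = X - mu
    mahal = np.einsum('ij,jk,ik->i', diff, inv, diff)   # squared Mahalanobis distance
    return 0.5 * (d * np.log(2 * np.pi) + logdet + mahal)
```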
Now, the local active information storage aX (n + 1) is the local mutual informa-
tion between realizations xn^(k) of the past state Xn^(k) (as k → ∞) and the corresponding realizations xn+1 of the next value Xn+1 . This is computed as described for local mutual information values in Sect. 3.2. The average active information storage AX is the expectation of these local values:

AX = ⟨aX (n + 1)⟩ .
The local values of active information storage measure the dynamics of information
storage at different time points within a system, revealing to us how the use of mem-
ory fluctuates during a process. Where the observations used for the relevant PDFs
are from the whole time series of a process (under an assumption of stationarity, as
outlined in Sect. 3.3), then the average AX (k) is the time-average of the local values
aX (n + 1, k).
We also note that since [62]:

AX (k) = H(Xn+1 ) − Hμ (Xn+1 , k),

then the limit in (34) exists for stationary processes (i.e. A(X) converges with k → ∞). A proof for convergence of a(xn+1 ) with k → ∞ remains a topic for future work.
As described for the local mutual information in Sect. 3.2, aX (n + 1) may be pos-
itive or negative, meaning the past history of the process can either positively inform
us or actually misinform us about its next value [62]. An observer of the process is
misinformed where, conditioned on the past history the observed outcome was rel-
atively unlikely as compared to the unconditioned probability of that outcome (i.e. p(xn+1 | xn^(k) ) < p(xn+1 )). In deterministic systems (e.g. CAs), negative local active
information storage means that there must be strong information transfer from other
causal sources.
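A minimal plug-in sketch of the local active information storage for a discrete time series, following a(xn+1 ) = log2 [ p(xn^(k) , xn+1 ) / (p(xn^(k) ) p(xn+1 )) ] (illustrative only; in practice one would typically use an established estimator such as the JIDT toolkit mentioned later in this chapter):

```python
import numpy as np
from collections import Counter

def local_active_info_storage(x, k):
    """Local AIS a_X(n+1, k) in bits for a discrete series x, using plug-in
    (frequency-count) probability estimates; a minimal sketch only."""
    x = list(x)
    pasts = [tuple(x[n - k:n]) for n in range(k, len(x))]   # x_n^(k)
    nexts = [x[n] for n in range(k, len(x))]                # x_{n+1}
    n_obs = len(nexts)
    c_past, c_next = Counter(pasts), Counter(nexts)
    c_joint = Counter(zip(pasts, nexts))
    # a(n+1) = log2[ p(past, next) / (p(past) p(next)) ] expressed via counts
    local = [np.log2(c_joint[(p, s)] * n_obs / (c_past[p] * c_next[s]))
             for p, s in zip(pasts, nexts)]
    return np.array(local)   # its mean is the average AIS, A_X(k)
```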
The transfer entropy (TE) [82] captures the average mutual information from realizations yn^(l) of the state Yn^(l) of a source time-series process Y to the corresponding realizations xn+1 of the next value Xn+1 of the target time-series process X, conditioned on realizations xn^(k) of the previous state Xn^(k) :

TY→X (k, l) = ⟨tY→X (n + 1, k, l)⟩ ,
tY→X (n + 1, k, l) = log2 [ p(xn+1 | xn^(k) , yn^(l) ) / p(xn+1 | xn^(k) ) ]   (48)
= i(yn^(l) ; xn+1 | xn^(k) ).
These local information transfer values measure the dynamics of transfer in time be-
tween any given pair of processes within a system, revealing to us how information
is transferred across the system in time and space. Fig. 1 indicates a local transfer
entropy measurement for a pair of processes Y → X.
As above, where the observations used for the relevant PDFs are from the whole
time series of the processes (under an assumption of stationarity, as outlined in
Sect. 3.3) then the average TY →X (k, l) is the time-average of the local transfer values
tY →X (n + 1, k, l).
As described for the local conditional mutual information in Sect. 3.2, tY →X (n +
1) may be positive or negative, meaning the source process can either positively
inform us or actually misinform us about the next value of the target (in the context
of the target’s past state) [59]. An observer of the process is misinformed where,
conditioned on the source and the past of the target the observed outcome was rel-
atively unlikely, as compared to the probability of that outcome conditioned on the past history only (i.e. p(xn+1 | xn^(k) , yn^(l) ) < p(xn+1 | xn^(k) )).
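A corresponding plug-in sketch of the local transfer entropy for a pair of discrete series (illustrative only, with no bias correction):

```python
import numpy as np
from collections import Counter

def local_transfer_entropy(source, target, k=1, l=1):
    """Local transfer entropy t_{Y->X}(n+1, k, l) in bits for discrete series,
    via plug-in probability estimates; a minimal sketch only."""
    start = max(k, l)
    x_past = [tuple(target[n - k:n]) for n in range(start, len(target))]   # x_n^(k)
    y_past = [tuple(source[n - l:n]) for n in range(start, len(source))]   # y_n^(l)
    x_next = [target[n] for n in range(start, len(target))]                # x_{n+1}
    c_x = Counter(x_past)
    c_xn = Counter(zip(x_past, x_next))
    c_xy = Counter(zip(x_past, y_past))
    c_all = Counter(zip(x_past, y_past, x_next))
    # t = log2[ p(x+ | x^k, y^l) / p(x+ | x^k) ] expressed via counts
    local = [np.log2((c_all[(xp, yp, xn)] * c_x[xp]) /
                     (c_xy[(xp, yp)] * c_xn[(xp, xn)]))
             for xp, yp, xn in zip(x_past, y_past, x_next)]
    return np.array(local)   # its mean is the average TE, T_{Y->X}(k, l)
```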
Noting the equivalence of the transfer entropy and the concept of Granger causal-
ity [28] when the transfer entropy is estimated using a Gaussian model [4], we ob-
serve that the local transfer entropy – when estimated with a Gaussian model as
described in Sect. 3.4 – directly gives a local Granger causality measurement.
Now, the transfer entropy may also be conditioned on other possible sources Z
to account for their effects on the target. The conditional transfer entropy was
introduced for this purpose [59, 60]:
Note that Z may represent an embedded state of another variable and/or be explicitly
multivariate. Transfer entropies conditioned on other variables have been used in
several biophysical and neuroscience applications, e.g. [20, 21, 88].
We also have the corresponding local conditional transfer entropy:
TY→X|Z (k, l) = ⟨tY→X|Z (n + 1, k, l)⟩ ,   (51)
tY→X|Z (n + 1, k, l) = log2 [ p(xn+1 | xn^(k) , yn^(l) , zn ) / p(xn+1 | xn^(k) , zn ) ] ,   (52)
= i(yn^(l) ; xn+1 | xn^(k) , zn ).   (53)
Of course, this extra conditioning can prevent the (redundant) influence of a com-
mon drive Z from being attributed to Y , and can also include the synergistic contri-
bution when the source Y acts in conjunction with another source Z (e.g. where X is
the outcome of an XOR operation on Y and Z).
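To illustrate both Eq. (52) and the XOR point above, a plug-in sketch of the local conditional transfer entropy with single-sample source and conditional embeddings; the synthetic XOR data at the end are invented purely for demonstration:

```python
import numpy as np
from collections import Counter

def local_conditional_te(source, target, cond, k=1):
    """Local conditional transfer entropy t_{Y->X|Z}(n+1, k) of Eq. (52), for
    discrete series with single-sample source/conditional embeddings (l = 1);
    plug-in estimates only (a minimal sketch)."""
    x_past = [tuple(target[n - k:n]) for n in range(k, len(target))]
    y_prev = [source[n - 1] for n in range(k, len(target))]
    z_prev = [cond[n - 1] for n in range(k, len(target))]
    x_next = [target[n] for n in range(k, len(target))]
    c_xz = Counter(zip(x_past, z_prev))
    c_xzy = Counter(zip(x_past, z_prev, y_prev))
    c_xz_next = Counter(zip(x_past, z_prev, x_next))
    c_all = Counter(zip(x_past, z_prev, y_prev, x_next))
    # t = log2[ p(x+ | x^k, y, z) / p(x+ | x^k, z) ] expressed via counts
    return np.array([np.log2((c_all[(xp, z, y, xn)] * c_xz[(xp, z)]) /
                             (c_xzy[(xp, z, y)] * c_xz_next[(xp, z, xn)]))
                     for xp, z, y, xn in zip(x_past, z_prev, y_prev, x_next)])

# X is the XOR of the two sources at the previous step: each source alone tells
# us (almost) nothing, but conditioning on the other reveals a full bit.
rng = np.random.default_rng(1)
y = rng.integers(0, 2, 10_000)
z = rng.integers(0, 2, 10_000)
x = np.roll(y ^ z, 1)                            # x[n] = y[n-1] XOR z[n-1]
print(local_conditional_te(y, x, z).mean())      # close to 1 bit
```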
We specifically refer to the conditional transfer entropy as the complete transfer
entropy (with notation TYc→X (k, l) and tYc →X (n + 1, k, l) for example) when it con-
ditions on all other causal sources Z to the target X [59]. To differentiate the condi-
tional and complete transfer entropies from the original measure, we often refer to
TY →X simply as the apparent transfer entropy [59] - this nomenclature conveys that
the result is the information transfer that is apparent without accounting for other
sources.
Finally, note that one can decompose the mutual information from a set of sources
to a target as a sum of incrementally conditioned mutual information terms [60, 56,
53]. For example, for a two-source system we have:

I(Xn+1 ; {Xn^(k) ,Y1,n ,Y2,n }) = AX (k) + TY1→X (k) + TY2→X|Y1 (k).   (54)
This equation could be reversed in the order of Y1 and Y2 , and its correctness is
independent of k (so long as k is large enough to capture the causal sources in the
past of the target). Crucially, this equation reveals the nature in which information
storage (AX ) and transfer (TY1 →X , etc.) are complementary operations in distributed
computation.
Table 1 Rule table for ECA rule 54. The Wolfram rule number for this rule table is composed by taking the next cell value for each configuration, concatenating them into a binary code starting from the bottom of the rule table as the most significant bit (e.g. b00110110 here), and then forming the decimal rule number from that binary encoding.

Neighbourhood (left, centre, right)    Next cell value
000                                    0
001                                    1
010                                    1
011                                    0
100                                    1
101                                    1
110                                    0
111                                    0
processes, including fluid flow, earthquakes and biological pattern formation [70].
Indeed, CAs have even been used in neural network models to study criticality in
avalanches of activity [75, 67]. While they may not be the most realistic microscopic
neural model available, it is certainly true that CAs can exhibit certain phenomena
that are of particular interest in neuroscience, including avalanche behaviour (e.g.
[75, 80, 47, 67]) and coherent propagating wave-like structures (e.g. [27, 17]).
Indeed, the presence of such coherent emergent structures: particles, gliders,
blinkers and domains; is what has made CAs so interesting in complex systems
science in general. A domain is a set of background configurations in a CA, any of
which will update to another configuration in the set in the absence of any distur-
bance. Domains are formally defined by computational mechanics as spatial pro-
cess languages in the CA [33]. Particles are considered to be dynamic elements of
coherent spatiotemporal structure, which are disturbances or lie in contrast to the
background domain. Gliders are regular particles, blinkers are stationary gliders.
Formally, particles are defined by computational mechanics as a boundary between
two domains [33]; as such, they can be referred to as domain walls, though this term
is usually reserved for irregular particles. Several techniques exist to filter particles
from background domains (e.g. [29, 30, 33, 34, 98, 36, 37, 84, 59, 60, 62]).
These emergent structures have been quite important to studies of distributed
computation in CAs, for example in the design or identification of universal compu-
tation (see [70]), and analyses of the dynamics of intrinsic or other specific computa-
tion ([46, 33, 71]). This is because these studies typically discuss the computation in
terms of the three primitive functions of computation and their apparent analogues
in CA dynamics [70, 46]:
• blinkers as the basis of information storage, since they periodically repeat at a
fixed location;
• particles as the basis of information transfer, since they communicate information
about the dynamics of one spatial part of the CA to another part; and
• collisions between these structures as information modification, since collision
events combine and modify the local dynamical structures.
Previous to the work reviewed here however, these analogies remained conjecture
only, based on qualitative observation of CA dynamics. In the following subsections,
we review the applications [59, 60, 62, 58, 56] of the local information storage and
transfer measures described in Sect. 4 to cellular automata.
These experiments involved constructing 10 000 cell 1-dimensional CAs, and
executing the relevant update rules to generate 600 time steps of dynamics. All
resulting 6 × 106 observations of cell-updates are then used to compose the relevant
PDFs, and the local measures of information storage and transfer were computed
for each observation using these PDFs. Specifically, local active information storage
aX (n, k = 16) is computed for each cell X for each time step n, while local transfer
entropy tY →X (n, k = 16, l = 1) is computed for each time step n for each target cell
X and for the two causal sources Y on either side of X (referred to as channels j = 1
and −1 for transfer across 1 cell to the right or left). The use of all observations
across all cells and time steps implies an assumption of stationarity here. This is
justified in that the large CA length and relatively short number of time steps (together with ignoring the initial steps) are designed to ensure that an attractor is not reached while
the typical transient dynamics of the CA are well-sampled. Note also that l = 1 is
used since we directly observe the interacting values and only one previous time
step is a causal source here. As such, in line with (54) we have
I(Xn+1 ; {Xn^(k) ,Yl,n ,Yr,n }) = AX (k) + TYl→X (k) + TYr→X|Yl (k),   (55)
where Yl represents the causal source to the left (channel j = 1) and Yr the causal
source to the right (channel j = −1) – although their placement is interchangeable
in this equation.
Sample results of this application are displayed for rules 54 and 18 in Fig. 2
and Fig. 3. The figures displayed here were produced using the open source Java
Information Dynamics Toolkit (JIDT) [51], which can be used in Matlab, Octave
and Python as well as Java. All results can be reproduced using the Matlab/Octave
script DirectedMeasuresChapterDemo2013.m in the demos/octave/CellularAutomata example distributed with this toolkit.
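For readers without the toolkit at hand, a small-scale sketch of generating the raw ECA dynamics (far fewer cells and time steps than the 10 000 × 600 used above) might look as follows; the local measures would then be computed from the pooled cell-update observations as described:

```python
import numpy as np

def eca_step(state, rule=54):
    """One synchronous update of a 1-D elementary CA with periodic boundaries."""
    left, right = np.roll(state, 1), np.roll(state, -1)
    config = 4 * left + 2 * state + right           # neighbourhood as a 3-bit number
    lookup = np.array([(rule >> b) & 1 for b in range(8)])
    return lookup[config]

rng = np.random.default_rng(0)
state = rng.integers(0, 2, size=200)                # far smaller than 10 000 cells
raw = [state]
for _ in range(120):                                # and fewer than 600 time steps
    state = eca_step(state, rule=54)
    raw.append(state)
ca = np.array(raw)                                  # time x cells raw values, cf. Fig. 2(a)
# Local AIS and TE profiles would then be computed per cell, pooling observations
# over all cells and times (stationarity assumption), e.g. with the plug-in
# sketches above or with JIDT's discrete calculators.
```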
These applications provided the first quantitative evidence for the above conjec-
tures, and are discussed in the following subsections. But the most important result
for our purposes is that the local measures reveal richly-structured spatiotem-
poral profiles of the information storage and transfer dynamics here, with in-
teresting local features revealed at various points in space-time. It is simply not
possible for these dynamics to be revealed by the average measures, be they aver-
ages across all cells and times or averages just across all cells in time. These features
are uniquely provided by considering the local dynamics of information processing
in CAs, and are discussed in the following subsections.
Fig. 2 Local information dynamics in ECA rule 54 for the raw values in (a) (black for “1”,
white for “0”). 35 time steps are displayed for 35 cells, and time increases down the page
for all CA plots. All units are in bits. (b) Local active information storage; Local apparent
transfer entropy: (c) one cell to the right, and (d) one cell to the left per time step.
Fig. 3 Local information dynamics in ECA rule 18 for the raw values in (a) (black for “1”,
white for “0”). 50 time steps are displayed for 50 cells, and all units are in bits. (b) Local
active information storage; (c) Local apparent transfer entropy one cell to the left per time
step; (d) Local complete transfer entropy one cell to the left per time step.
moving gliders for rule 54 in Fig. 2(c) and Fig. 2(d) by transfer entropy to the left
and right respectively, and similarly for the left moving sections of domain walls for
rule 18 in Fig. 3(c) and Fig. 3(d) by transfer entropy to the left (TE to right omit-
ted). In these examples, the past state of the target cell xn^(k) is part of the background domain and so is misinformative about the next value xn+1 where the particle is encountered. In contrast, the source cell yn which is in the particle at the previous time step n (be that the left or right neighbour, as relevant for that particular particle) is highly predictive about the next value of the target (in the context of its past). As such, we have p(xn+1 | xn^(k) , yn ) > p(xn+1 | xn^(k) ), giving large positive values of
tY →X (n + 1, k) via (48).
These results for local transfer entropy are particularly important because they
provided the first quantitative evidence for the long-held conjecture that par-
ticles are the dominant information transfer agents in CAs. As stated above,
it is simply not possible for these space-time specific dynamics to be revealed
by the average transfer entropy, it specifically requires the local transfer entropy.
Furthermore, the average values do not give so much as a hint towards the complex-
ities of these local dynamics: ECA rule 22 has much larger average transfer entropy
values than rule 54 (0.19 versus 0.08 bits for each, respectively, in both left and right
directions), yet has no emergent self-organized particle structures [61].
As per the information storage results, we note that these results required a large
enough k to properly capture the past state of the cell, and could not be observed
with a value say of k = 1 (as discussed in [59]). When linked to the result of mis-
informative storage at the particles from Sect. 5.1, we see again the complementary
nature of information storage and transfer.
It is important to note that particles are not the only points with positive local
transfer entropy. Small positive non-zero values are also often measured in the
domain and in the orthogonal direction to glider motion in space-time (e.g. see
Fig. 2(d)) [59]. These correctly indicate non-trivial information transfer in these
regions (e.g. indicating the absence of a glider), though they are dominated by the
positive transfer in the direction of glider motion.
to complex dynamics in the domain here, with two interleaving phases. The first
phase occurs at every second cell (both in space and time), and is simply a ‘0’ –
at these cells there is strong information storage alone (see Fig. 3(b)) because the
cell value is predictable from its past (which predicts the phase accurately). The
other phase occurs at the alternate cells, and is a ‘0’ or a ‘1’ as determined via an
exclusive OR (or XOR) operation between the neighbouring left and right cells. As
such, apparent transfer entropy from either left or right cell alone provides almost no
information about the next value (hence absence of apparent transfer in the domain
– see Fig. 3(c)), whilst conditional transfer entropy provides full information about
the next value because the other contributing cell is taken into account (hence the
strong conditional transfer at every second cell in Fig. 3(d)).
The other noticeable difference between these profiles is that the conditional
transfer entropy does not have any negative local values, unlike the apparent trans-
fer entropy. This is because examining the source in the context of all other causal
sources in this deterministic system necessarily provides more information than not
examining the source. That is to say, there are no unaccounted sources here which
could mislead the observer, unlike that possibility for the apparent transfer entropy.
There are two key messages from the comparison of these measures:
1. The apparent and conditional transfer entropy reveal different aspects of
the dynamics of a system – neither is more correct than the other; they are both
useful and complementary. This is a particularly important message, since often
the importance of conditioning “out” all other sources using a conditional mea-
sure is emphasised, without acknowledging the complementary utility retained
by the pairwise transfer entropy. Both are required to have a full picture of the
dynamics of a system;
2. The differences in local dynamics that they reveal simply cannot be observed
here by using the average of each measure alone.
the background domain of rule 54. This means that the same causal effect occurs in
both types of dynamics.7
This is quite different to our interpretation of information transfer in the previ-
ous sections however. This interpretation can be restated as: predictive information
transfer refers to the amount of information that a source variable adds to the state
change of a target variable; i.e. “if I know the state of the source, how much does that
help to predict the state change of the target?” [56]. In dealing with state updates
of the target, and in particular in separating information storage from transfer, the
transfer entropy has a very different perspective to causal effect. As we have seen,
local transfer entropy attributes large positive local values at the gliders here, be-
cause the source cells help prediction in the context of a target’s past, but attributes
vanishing amounts in the domain, where stored information from a target’s past is
generally sufficient for prediction.
Again, neither perspective is more correct than the other – they both provide
useful insights and are complementary. This argument is explored in more depth in
[56]. Crucially, these insights are only fully revealed with our local perspective of
information dynamics here.
entities (e.g. particles and gliders in cellular automata [59], above, motion in flocks
and swarms [92], and in modular robotics [57]), one expects that this measure will
be used to provide similar insights into these structures in neural systems.
Yet local transfer entropy will find much broader application than simply identifying local coherent structure. It offers the opportunity to answer the question:
“Precisely when and where is information transferred between brain regions?”
The where is answerable with average transfer entropy, but the when is only pre-
cisely answerable with a local approach. This is a fundamentally important question
for us to have the opportunity to answer, because it will provide insight into the pre-
cise dynamics of how information is stored, transferred and modified in the brain
during neural computation.
For example, we have conducted a preliminary study applying this method to a
set of fMRI measurements where we could expect to see differences in local infor-
mation transfer between two conditions at specific time steps [50]. The fMRI data
set analyzed (from [86]) is a ‘Libet’-style experiment, which contains brain activity
recorded while subjects were asked to freely decide whether to push one of two but-
tons (with left or right index finger). Significant differences (at the group level) were
found in the local transfer entropy between left and right button presses from a sin-
gle source region (e.g. pre-SMA) into the left and right motor cortices respectively.
Furthermore, simple thresholding of these local transfer entropy values provides a
statistically significant prediction of which button was pressed.
These results are a strong demonstration that local transfer entropy can usefully
provide task-relevant insights into when and where information is transferred be-
tween brain regions. Once validation studies have been completed in this domain,
we expect that further utility will be found for these local information-theoretic mea-
sures in computational neuroscience. There are many studies in this domain which
will benefit from the ability to view local information storage, transfer and modifi-
cation operations on a local scale in space and time in the brain.
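As an illustration of what a local analysis adds over the average, the following minimal Python sketch (our own hypothetical example: the binary source-target coupling, history length k = 1 and the plug-in estimator are assumptions for illustration, not the fMRI pipeline of [50] or the JIDT toolkit [51]) computes the pointwise local transfer entropy values whose average is the usual transfer entropy. It shows when transfer occurs (values are elevated in epochs where the target actually copies the source) and that individual values can be negative, i.e. misinformative.

import numpy as np
from collections import Counter

rng = np.random.default_rng(7)
n = 20_000
y = rng.integers(0, 2, n)                       # binary source
x = rng.integers(0, 2, n)                       # binary target, initially random
coupled = (np.arange(n) % 2000) < 1000          # alternating 'on'/'off' epochs
x[1:][coupled[1:]] = y[:-1][coupled[1:]]        # during 'on' epochs the target copies the source

x_next, x_past, y_past = x[1:], x[:-1], y[:-1]

def cond_prob(target, *given):
    """Plug-in estimates of p(target | given) for every observed configuration."""
    joint = Counter(zip(target, *given))
    marg  = Counter(zip(*given))
    return {k: joint[k] / marg[k[1:]] for k in joint}

p_no_src   = cond_prob(x_next, x_past)
p_with_src = cond_prob(x_next, x_past, y_past)

# Local transfer entropy at each time step: log2 of the ratio of the two
# conditional probabilities evaluated at the realisation actually observed.
local_te = np.array([
    np.log2(p_with_src[(xn, xp, yp)] / p_no_src[(xn, xp)])
    for xn, xp, yp in zip(x_next, x_past, y_past)
])

print("average TE           :", local_te.mean())             # the usual transfer entropy
print("mean local TE ('on') :", local_te[coupled[1:]].mean()) # elevated where coupling is active
print("mean local TE ('off'):", local_te[~coupled[1:]].mean())
print("negative local values:", (local_te < 0).sum())         # misinformative time steps exist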
Acknowledgements. The author wishes to thank Michael Wibral for very helpful comments
on a draft paper and discussions on the topic, as well as Mikhail Prokopenko, Daniel Polani,
Ben Flecker and Paul Williams for useful discussions on these topics.
References
1. Ash, R.B.: Information Theory. Dover Publishers, Inc., New York (1965)
2. Ay, N., Polani, D.: Information Flows in Causal Networks. Advances in Complex Sys-
tems 11(1), 17–41 (2008)
3. Bandt, C., Pompe, B.: Permutation entropy: A natural complexity measure for time se-
ries. Physical Review Letters 88(17), 174102 (2002)
4. Barnett, L., Barrett, A.B., Seth, A.K.: Granger Causality and Transfer Entropy Are
Equivalent for Gaussian Variables. Physical Review Letters 103(23), 238701 (2009)
5. Barnett, L., Bossomaier, T.: Transfer Entropy as a Log-Likelihood Ratio. Physical Re-
view Letters 109, 138105 (2012)
6. Barnett, L., Buckley, C.L., Bullock, S.: Neural complexity and structural connectivity.
Physical Review E 79(5), 051914 (2009)
7. Boedecker, J., Obst, O., Lizier, J.T., Mayer, N.M., Asada, M.: Information processing in
echo state networks at the edge of chaos. Theory in Biosciences 131(3), 205–213 (2012)
8. Bressler, S.L., Tang, W., Sylvester, C.M., Shulman, G.L., Corbetta, M.: Top-Down Con-
trol of Human Visual Cortex by Frontal and Parietal Cortex in Anticipatory Visual Spatial
Attention. Journal of Neuroscience 28(40), 10056–10061 (2008)
9. Ceguerra, R.V., Lizier, J.T., Zomaya, A.Y.: Information storage and transfer in the syn-
chronization process in locally-connected networks. In: Proceedings of the 2011 IEEE
Symposium on Artificial Life (ALIFE), pp. 54–61. IEEE (2011)
10. Chávez, M., Martinerie, J., Le Van Quyen, M.: Statistical assessment of nonlinear causal-
ity: application to epileptic EEG signals. Journal of Neuroscience Methods 124(2), 113–
128 (2003)
11. Chicharro, D., Ledberg, A.: When Two Become One: The Limits of Causality Analysis
of Brain Dynamics. PLoS One 7(3), e32466 (2012)
12. Couzin, I.D., James, R., Croft, D.P., Krause, J.: Social Organization and Information
Transfer in Schooling Fishes. In: Brown, C., Laland, K.N., Krause, J. (eds.) Fish Cog-
nition and Behavior, Fish and Aquatic Resources, pp. 166–185. Blackwell Publishing
(2006)
13. Cover, T.M., Thomas, J.A.: Elements of Information Theory. Wiley-Interscience, New
York (1991)
14. Crutchfield, J.P., Feldman, D.P.: Regularities Unseen, Randomness Observed: Levels of
Entropy Convergence. Chaos 13(1), 25–54 (2003)
15. Crutchfield, J.P., Young, K.: Inferring statistical complexity. Physical Review Let-
ters 63(2), 105–108 (1989)
16. Dasan, J., Ramamohan, T.R., Singh, A., Nott, P.R.: Stress fluctuations in sheared Stoke-
sian suspensions. Physical Review E 66(2), 021409 (2002)
17. Derdikman, D., Hildesheim, R., Ahissar, E., Arieli, A., Grinvald, A.: Imaging spatiotem-
poral dynamics of surround inhibition in the barrels somatosensory cortex. The Journal
of Neuroscience 23(8), 3100–3105 (2003)
18. DeWeese, M.R., Meister, M.: How to measure the information gained from one symbol.
Network: Computation in Neural Systems 10, 325–340 (1999)
19. Effenberger, F.: A primer on information theory, with applications to neuroscience,
arXiv:1304.2333 (2013), http://arxiv.org/abs/1304.2333
20. Faes, L., Nollo, G., Porta, A.: Information-based detection of nonlinear Granger causality
in multivariate processes via a nonuniform embedding technique. Physical Review E 83,
051112 (2011)
21. Faes, L., Nollo, G., Porta, A.: Non-uniform multivariate embedding to assess the infor-
mation transfer in cardiovascular and cardiorespiratory variability series. Computers in
Biology and Medicine 42(3), 290–297 (2012)
22. Fano, R.M.: Transmission of information: a statistical theory of communications. MIT
Press, Cambridge (1961)
23. Flecker, B., Alford, W., Beggs, J.M., Williams, P.L., Beer, R.D.: Partial information de-
composition as a spatiotemporal filter. Chaos: An Interdisciplinary Journal of Nonlinear
Science 21(3), 037104 (2011)
24. Frenzel, S., Pompe, B.: Partial Mutual Information for Coupling Analysis of Multivariate
Time Series. Physical Review Letters 99(20), 204101 (2007)
25. Friston, K.J., Harrison, L., Penny, W.: Dynamic causal modelling. NeuroImage 19(4),
1273–1302 (2003)
26. Gomez-Herrero, G., Wu, W., Rutanen, K., Soriano, M.C., Pipa, G., Vicente, R.: Assess-
ing coupling dynamics from an ensemble of time series. arXiv:1008.0539 (2010),
http://arxiv.org/abs/1008.0539
27. Gong, P., van Leeuwen, C.: Distributed Dynamical Computation in Neural Circuits with
Propagating Coherent Activity Patterns. PLoS Computational Biology 5(12) (2009)
28. Granger, C.W.J.: Investigating causal relations by econometric models and cross-spectral
methods. Econometrica 37, 424–438 (1969)
29. Grassberger, P.: New mechanism for deterministic diffusion. Physical Review A 28(6),
3666 (1983)
30. Grassberger, P.: Long-range effects in an elementary cellular automaton. Journal of Sta-
tistical Physics 45(1-2), 27–39 (1986)
31. Grassberger, P.: Toward a quantitative theory of self-generated complexity. International
Journal of Theoretical Physics 25(9), 907–938 (1986)
32. Griffith, V., Koch, C.: Quantifying synergistic mutual information. In: Prokopenko, M.
(ed.) Guided Self-Organization: Inception, pp. 159–190. Springer, Heidelberg (2014)
33. Hanson, J.E., Crutchfield, J.P.: The Attractor-Basin Portrait of a Cellular Automaton. Journal of Statistical Physics 66, 1415–1462 (1992)
34. Hanson, J.E., Crutchfield, J.P.: Computational mechanics of cellular automata: An ex-
ample. Physica D 103(1-4), 169–189 (1997)
35. Harder, M., Salge, C., Polani, D.: Bivariate Measure of Redundant Information. Physical
Review E 87, 012130 (2013)
36. Helvik, T., Lindgren, K., Nordahl, M.G.: Local information in one-dimensional cellu-
lar automata. In: Sloot, P.M.A., Chopard, B., Hoekstra, A.G. (eds.) ACRI 2004. LNCS,
vol. 3305, pp. 121–130. Springer, Heidelberg (2004)
37. Helvik, T., Lindgren, K., Nordahl, M.G.: Continuity of Information Transport in Surjec-
tive Cellular Automata. Communications in Mathematical Physics 272(1), 53–74 (2007)
38. Hinrichs, H., Heinze, H.J., Schoenfeld, M.A.: Causal visual interactions as revealed by
an information theoretic measure and fMRI. NeuroImage 31(3), 1051–1060 (2006)
39. Honey, C.J., Kötter, R., Breakspear, M., Sporns, O.: Network structure of cerebral cortex shapes functional connectivity on multiple time scales. Proceedings of the National Academy of Sciences 104(24), 10240–10245 (2007)
40. Ito, S., Hansen, M.E., Heiland, R., Lumsdaine, A., Litke, A.M., Beggs, J.M.: Extending
Transfer Entropy Improves Identification of Effective Connectivity in a Spiking Cortical
Network Model. PLoS One 6(11), e27431 (2011)
41. Kantz, H., Schreiber, T.: Nonlinear Time Series Analysis. Cambridge University Press,
Cambridge (1997)
42. Katare, S., West, D.H.: Optimal complex networks spontaneously emerge when infor-
mation transfer is maximized at least expense: A design perspective. Complexity 11(4),
26–35 (2006)
43. Kerr, C.C., Van Albada, S.J., Neymotin, S.A., Chadderdon, G.L., Robinson, P.A., Lyt-
ton, W.W.: Cortical information flow in parkinson’s disease: a composite network/field
model. Frontiers in Computational Neuroscience 7(39) (2013)
44. Kraskov, A.: Synchronization and Interdependence Measures and their Applications to
the Electroencephalogram of Epilepsy Patients and Clustering of Data. Publication Se-
ries of the John von Neumann Institute for Computing, vol. 24. John von Neumann In-
stitute for Computing, Jülich (2004)
45. Kraskov, A., Stögbauer, H., Grassberger, P.: Estimating mutual information. Physical
Review E 69(6), 066138 (2004)
46. Langton, C.G.: Computation at the edge of chaos: phase transitions and emergent com-
putation. Physica D 42(1-3), 12–37 (1990)
47. Levina, A., Herrmann, J.M., Geisel, T.: Dynamical synapses causing self-organized crit-
icality in neural networks. Nature Physics 3(12), 857–860 (2007)
48. Liang, H., Ding, M., Bressler, S.L.: Temporal dynamics of information flow in the cere-
bral cortex. Neurocomputing 38-40, 1429–1435 (2001)
49. Lindner, M., Vicente, R., Priesemann, V., Wibral, M.: TRENTOOL: A Matlab open
source toolbox to analyse information flow in time series data with transfer entropy.
BMC Neuroscience 12(1), 119 (2011)
50. Lizier, J., Heinzle, J., Soon, C., Haynes, J.D., Prokopenko, M.: Spatiotemporal infor-
mation transfer pattern differences in motor selection. BMC Neuroscience 12(Suppl. 1),
P261 (2011)
51. Lizier, J.T.: JIDT: An information-theoretic toolkit for studying the dynamics of complex
systems (2012),
https://code.google.com/p/information-dynamics-toolkit/
52. Lizier, J.T.: The Local Information Dynamics of Distributed Computation in Complex
Systems. Springer Theses. Springer, Heidelberg (2013)
53. Lizier, J.T., Flecker, B., Williams, P.L.: Towards a synergy-based approach to measuring
information modification. In: Proceedings of the 2013 IEEE Symposium on Artificial
Life (ALIFE), pp. 43–51. IEEE (2013)
54. Lizier, J.T., Heinzle, J., Horstmann, A., Haynes, J.D., Prokopenko, M.: Multivariate
information-theoretic measures reveal directed information structure and task relevant
changes in fMRI connectivity. Journal of Computational Neuroscience 30(1), 85–107
(2011)
55. Lizier, J.T., Pritam, S., Prokopenko, M.: Information dynamics in small-world Boolean
networks. Artificial Life 17(4), 293–314 (2011)
56. Lizier, J.T., Prokopenko, M.: Differentiating information transfer and causal effect. Eu-
ropean Physical Journal B 73(4), 605–615 (2010)
57. Lizier, J.T., Prokopenko, M., Tanev, I., Zomaya, A.Y.: Emergence of Glider-like Struc-
tures in a Modular Robotic System. In: Bullock, S., Noble, J., Watson, R., Bedau, M.A.
(eds.) Proceedings of the Eleventh International Conference on the Simulation and Syn-
thesis of Living Systems (ALife XI), Winchester, UK, pp. 366–373. MIT Press, Cam-
bridge (2008)
58. Lizier, J.T., Prokopenko, M., Zomaya, A.Y.: Detecting Non-trivial Computation in Com-
plex Dynamics. In: Almeida e Costa, F., Rocha, L.M., Costa, E., Harvey, I., Coutinho, A.
(eds.) ECAL 2007. LNCS (LNAI), vol. 4648, pp. 895–904. Springer, Heidelberg (2007)
59. Lizier, J.T., Prokopenko, M., Zomaya, A.Y.: Local information transfer as a spatiotem-
poral filter for complex systems. Physical Review E 77(2), 026110 (2008)
60. Lizier, J.T., Prokopenko, M., Zomaya, A.Y.: Information modification and particle colli-
sions in distributed computation. Chaos 20(3), 037109 (2010)
61. Lizier, J.T., Prokopenko, M., Zomaya, A.Y.: Coherent information structure in complex
computation. Theory in Biosciences 131(3), 193–203 (2012)
62. Lizier, J.T., Prokopenko, M., Zomaya, A.Y.: Local measures of information storage in
complex distributed computation. Information Sciences 208, 39–54 (2012)
63. Lizier, J.T., Rubinov, M.: Multivariate construction of effective computational networks
from observational data. Tech. Rep. Preprint 25/2012, Max Planck Institute for Mathe-
matics in the Sciences (2012)
64. Lungarella, M., Sporns, O.: Mapping Information Flow in Sensorimotor Networks. PLoS
Computational Biology 2(10), e144 (2006)
65. MacKay, D.J.C.: Information Theory, Inference, and Learning Algorithms. Cambridge
University Press, Cambridge (2003)
66. Mahoney, J.R., Ellison, C.J., James, R.G., Crutchfield, J.P.: How hidden are hidden pro-
cesses? A primer on crypticity and entropy convergence. Chaos 21(3), 037112 (2011)
67. Manchanda, K., Yadav, A.C., Ramaswamy, R.: Scaling behavior in probabilistic neuronal
cellular automata. Physical Review E 87, 012704 (2013)
68. Manning, C.D., Schütze, H.: Foundations of Statistical Natural Language Processing.
The MIT Press, Cambridge (1999)
69. Marinazzo, D., Wu, G., Pellicoro, M., Angelini, L., Stramaglia, S.: Information flow
in networks and the law of diminishing marginal returns: evidence from modeling and
human electroencephalographic recordings. PLoS One 7(9), e45026 (2012)
70. Mitchell, M.: Computation in Cellular Automata: A Selected Review. In: Gramss, T.,
Bornholdt, S., Gross, M., Mitchell, M., Pellizzari, T. (eds.) Non-Standard Computation,
pp. 95–140. VCH Verlagsgesellschaft, Weinheim (1998)
71. Mitchell, M., Crutchfield, J.P., Hraber, P.T.: Evolving Cellular Automata to Perform
Computations: Mechanisms and Impediments. Physica D 75, 361–391 (1994)
72. Nakajima, K., Li, T., Kang, R., Guglielmino, E., Caldwell, D.G., Pfeifer, R.: Local infor-
mation transfer in soft robotic arm. In: 2012 IEEE International Conference on Robotics
and Biomimetics (ROBIO), pp. 1273–1280. IEEE (2012)
73. Obst, O., Boedecker, J., Asada, M.: Improving Recurrent Neural Network Perfor-
mance Using Transfer Entropy. In: Wong, K.W., Mendis, B.S.U., Bouzerdoum, A. (eds.)
ICONIP 2010, Part II. LNCS, vol. 6444, pp. 193–200. Springer, Heidelberg (2010)
74. Pearl, J.: Causality: Models, Reasoning, and Inference. Cambridge University Press,
Cambridge (2000)
75. Priesemann, V., Munk, M., Wibral, M.: Subsampling effects in neuronal avalanche dis-
tributions recorded in vivo. BMC Neuroscience 10(1), 40 (2009)
76. Prokopenko, M., Boschietti, F., Ryan, A.J.: An Information-Theoretic Primer on Com-
plexity, Self-Organization, and Emergence. Complexity 15(1), 11–28 (2009)
77. Prokopenko, M., Gerasimov, V., Tanev, I.: Evolving Spatiotemporal Coordination in a
Modular Robotic System. In: Nolfi, S., Baldassarre, G., Calabretta, R., Hallam, J.C.T.,
Marocco, D., Meyer, J.-A., Miglino, O., Parisi, D. (eds.) SAB 2006. LNCS (LNAI),
vol. 4095, pp. 558–569. Springer, Heidelberg (2006)
78. Prokopenko, M., Lizier, J.T., Obst, O., Wang, X.R.: Relating Fisher information to order
parameters. Physical Review E 84, 41116 (2011)
79. Prokopenko, M., Lizier, J.T., Price, D.C.: On thermodynamic interpretation of transfer
entropy. Entropy 15(2), 524–543 (2013)
80. Rubinov, M., Lizier, J., Prokopenko, M., Breakspear, M.: Maximized directed information transfer in critical neuronal networks. BMC Neuroscience 12(Suppl. 1), P18 (2011)
81. Schreiber, T.: Interdisciplinary application of nonlinear time series methods - the gener-
alized dimensions. Physics Reports 308, 1–64 (1999)
82. Schreiber, T.: Measuring Information Transfer. Physical Review Letters 85(2), 461–464
(2000)
83. Shalizi, C.R.: Causal Architecture, Complexity and Self-Organization in Time Series and
Cellular Automata. Ph.D. thesis, University of Wisconsin-Madison (2001)
84. Shalizi, C.R., Haslinger, R., Rouquier, J.B., Klinkner, K.L., Moore, C.: Automatic fil-
ters for the detection of coherent structure in spatiotemporal systems. Physical Review
E 73(3), 036104 (2006)
85. Shannon, C.E.: A mathematical theory of communication. Bell System Technical Jour-
nal 27, 379–423, 623–656 (1948)
86. Soon, C.S., Brass, M., Heinze, H.J., Haynes, J.D.: Unconscious determinants of free
decisions in the human brain. Nature Neuroscience 11(5), 543–545 (2008)
87. Staniek, M., Lehnertz, K.: Symbolic transfer entropy. Physical Review Letters 100(15),
158101 (2008)
88. Stramaglia, S., Wu, G.R., Pellicoro, M., Marinazzo, D.: Expanding the transfer entropy to
identify information subgraphs in complex systems. In: Proceedings of the 2012 Annual
International Conference of the IEEE Engineering in Medicine and Biology Society, pp.
3668–3671. IEEE (2012)
89. Ver Steeg, G., Galstyan, A.: Information-theoretic measures of influence based on con-
tent dynamics. In: Proceedings of the Sixth ACM International Conference on Web
Search and Data Mining, pp. 3–12 (2013)
90. Verdes, P.F.: Assessing causality from multivariate time series. Physical Review E 72(2),
026222 (2005)
91. Vicente, R., Wibral, M., Lindner, M., Pipa, G.: Transfer entropy–a model-free mea-
sure of effective connectivity for the neurosciences. Journal of Computational Neuro-
science 30(1), 45–67 (2011)
92. Wang, X.R., Miller, J.M., Lizier, J.T., Prokopenko, M., Rossi, L.F.: Quantifying and
Tracing Information Cascades in Swarms. PLoS One 7(7), e40084 (2012)
93. Wibral, M., Pampu, N., Priesemann, V., Siebenhühner, F., Seiwert, H., Lindner, M.,
Lizier, J.T., Vicente, R.: Measuring Information-Transfer delays. PLoS One 8(2), e55809
(2013)
94. Wibral, M., Rahm, B., Rieder, M., Lindner, M., Vicente, R., Kaiser, J.: Transfer entropy
in magnetoencephalographic data: quantifying information flow in cortical and cerebellar
networks. Progress in Biophysics and Molecular Biology 105(1-2), 80–97 (2011)
95. Williams, P.L., Beer, R.D.: Nonnegative Decomposition of Multivariate Information.
arXiv:1004.2515 (2010), http://arxiv.org/abs/1004.2515
96. Williams, P.L., Beer, R.D.: Generalized Measures of Information Transfer.
arXiv:1102.1507 (2011), http://arxiv.org/abs/1102.1507
97. Wolfram, S.: A New Kind of Science. Wolfram Media, Champaign (2002)
98. Wuensche, A.: Classifying cellular automata automatically: Finding gliders, filtering,
and relating space-time patterns, attractor basins, and the Z parameter. Complexity 4(3),
47–66 (1999)
Parametric and Non-parametric Criteria
for Causal Inference from Time-Series
Daniel Chicharro
Abstract. Granger causality constitutes a criterion for causal inference from time
series that has been largely applied to study causal interactions in the brain from
electrophysiological recordings. This criterion underlies the classical parametric implementation in terms of linear autoregressive processes, as well as transfer entropy, a non-parametric implementation in the framework of information theory. In the
spectral domain, partial directed coherence and the Geweke formulation are related
to Granger causality but rely on alternative criteria for causal inference which are in-
herently based on the parametric formulation in terms of autoregressive processes.
Here we clearly differentiate between criteria for causal inference and measures
used to test them. We compare the different criteria for causal inference from time-
series and we further introduce new criteria that complete a unified picture of how
the different approaches are related. Furthermore, we compare the different mea-
sures that implement these criteria in the information theory framework.
1 Introduction
The inference of causality in a system of interacting processes from recorded time-
series is a subject of interest in many fields. Particularly successful has been the
concept of Granger causality [29, 31], originally applied to economic time-series. In recent years, measures for causal inference have also been widely applied to electrophysiological signals, in particular to characterize causal interactions between different brain areas (see [46, 28, 10] for reviews of Granger causality measures applied to neural data).
In the original formulation of Granger causality, causality from a process Y to a
process X was examined based on the reduction of the prediction error of X when
Daniel Chicharro
Center for Neuroscience and Cognitive Systems@UniTn, Istituto Italiano di Tecnologia,
Via Bettini 31, 38068 Rovereto (TN)
e-mail: chicharro31@yahoo.es
including the past of Y [60, 29]. However, this prediction error criterion generalizes
to a criterion of conditional independence on probability distributions [31] that is
generally applicable to stationary and non-stationary stochastic processes.
Here we consider the criterion of Granger causality together with related criteria
of causal inference, like Sims causality [55]. We also consider the criteria underly-
ing other measures that have been introduced to infer causality but for which the
underlying criterion has not been made explicit. This includes the Geweke spectral
measures of causality (GSC) [25, 26], and partial directed coherence (PDC) [5]. We
make a clear distinction between criteria for causal inference and measures implementing them. Accordingly, by Granger causality we refer to the general criterion of causal inference and not, as is often the case, to the measure implementing it for linear processes. This means that we consider transfer entropy [54] as a particular measure to test for Granger causality in the information-theoretic framework (e.g. [56, 1]).
This distinction between criteria and measures is important because in practice
one is usually not only interested in assessing the existence of a causal connection
but in evaluating its strength (e.g. [11, 9, 8, 52, 59]). Causal inference can be associ-
ated with the construction of a causal graph representing which connections exist in
the system [19]. However, quantifying the causal effects resulting from these con-
nections is a more difficult task. Recently [16] examined how the general notion of
causality developed by Pearl [45] can be applied to study the natural dynamics of
complex systems. This notion is based on the idea of externally manipulating the
system to evaluate causal effects. For example, if one is studying causal connec-
tivity in the brain, this manipulation could be the deactivation of some connections
between brain areas, or electrically stimulating a given area. It is clear that these manipulations alter the normal dynamics of the brain, which are precisely the dynamics one wants to analyze in order to understand neural computations. Accordingly, [16] pointed out that if the main interest is not the effect of external perturbations, but how the causal connections participate in the generation of the unperturbed dynamics of the system, then it is only in some cases meaningful to characterize interactions between different subsystems in terms of the effect of one subsystem over another. To identify
these cases the notion of natural causal effects between dynamics was introduced
and conditions for their existence were provided. Consequently, Granger causality
measures, and in particular transfer entropy, cannot be used in general as measures
of the strength of causal effects [4, 39]. Alternatively, a different approach was de-
veloped in [15]. Instead of examining the causal effects resulting from the causal
connections, a unifying multivariate framework to study the dynamic dependencies
between the subsystems that arise from the causal interactions was proposed.
Considering this, here we focus on the criteria for causal inference and use the measures only as statistics to test these criteria. We closely follow [14] relating
the different formulations of Granger causality and the corresponding criteria of
causal inference, and integrating parametric and non-parametric formulations, as
well as time-domain and spectral formulations, for both bivariate and multivariate
systems. Furthermore, we do not discuss the fundamental assumptions that deter-
mine the valid applicability of the criterion of Granger causality. In particular we
assume that all the relevant processes are observed and well-defined. This is of course a big idealization for real applications, but our purpose is to examine the relation between the different criteria and measures that appear in the different formulations of Granger causality. (For a detailed discussion of the limitations of these criteria see [58, 16].) More generally, [45] offers a complete explanation of the limitations of causal inference without intervening in the system.
This Chapter is organized as follows: In Section 2 we review the non-parametric
formulation of the criteria of Granger and Sims causality and the information-
theoretic measures, including transfer entropy, used to test them. In section 3 we
review the parametric autoregressive representation of the processes and the time
domain and spectral measures of Granger causality, in particular GSC and PDC. We
make explicit the parametric criteria of causal inference underlying these measures
and discuss their relation to the non-parametric criteria. Furthermore we introduce
related new criteria for causal inference that allow us to complete a consistent uni-
fying picture that integrates all the criteria and measures. This picture is presented
all together in Section 4.
holds. Here X^t = {X_t, X_{t-1}, ..., X_1} is the past of the process at time t. From now on we will assume stationarity, so that the results do not depend on the particular time. Therefore we consider N → ∞ and select t such that X^t accounts for the infinite past of the process. See [56, 15] for a non-stationary formulation. According to Eq. 1, Granger causality indicates that there is no causality from Y to X when the future X_{t+1} is conditionally independent of the past Y^t given the partialization on its own past X^t. That is, the past of Y has no dependence with the future of X that cannot be accounted for by the past of X.
The related criterion of Sims causality [55] relies on the equality

p(X^{t+1:N} | X^t, Y^t) = p(X^{t+1:N} | X^t, Y^t, Y_{t+1}).   (2)

It states that there is no causality from Y to X if the whole future X^{t+1:N} is conditionally independent of Y_{t+1} given the past of the two processes. In fact, assuming stationarity it is not necessary to condition on Y^t, so that, like Granger causality, the criterion indicates that the future of X is completely determined by its own past (see [37] for a detailed review of the relation between the two criteria).
While Granger causality and Sims causality are equivalent criteria for the bivari-
ate case [12], this is not true for multivariate processes. When other processes also
interact with X and Y it is necessary to distinguish a causal connection from Y to
X from other connections that also result in statistical dependencies incompatible
with the equality in Eq. 1. These other connections are indirect causal connections
Y → Z → X as well as the effect of common drivers, i.e. a common parent Z such
that Z → Y and Z → X. The formulation of Granger causality turns out to be easily
generalizable to account for these influences, resulting in the equality

p(X_{t+1} | X^t, Z^t) = p(X_{t+1} | X^t, Y^t, Z^t),   (3)

where Z^t refers to the past of any other process that interacts with X and Y. In fact, which processes need to be conditioned on depends on the particular causal structure of the system, which is exactly what one wants to infer. This renders the criterion
of Granger causality context dependent [31]. This means that if Z does not include
all the relevant processes a false positive can be obtained when testing for causality
from Eq. 3. The problem of hidden variables for causal inference is an issue that is not specific to time series and that in general can only be addressed by an interventional treatment of causality [45]. In practice, from observational data, some procedures
can help to optimize the selection of the variables on which to condition [22, 41].
In this Chapter we do not further deal with this problem and we assume that all the
relevant processes are observed.
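A minimal numerical illustration of this context dependence (our own toy example; the coefficients, lags, and the use of the linear Geweke log-variance-ratio as the test statistic are assumptions made for illustration): a common driver Z influences Y at lag 1 and X at lag 2, so a bivariate test spuriously indicates causality from Y to X, while conditioning on the past of Z, as in Eq. 3, removes it.

import numpy as np

rng = np.random.default_rng(3)
T = 100_000
z = rng.normal(size=T)
x = np.zeros(T)
y = np.zeros(T)
y[2:] = 0.8 * z[1:-1] + 0.1 * rng.normal(size=T - 2)   # Z drives Y at lag 1
x[2:] = 0.8 * z[:-2]  + 0.1 * rng.normal(size=T - 2)   # Z drives X at lag 2 (no Y -> X link)

def lags(s, p, T):
    """Lag matrix with columns s_{t-1}, ..., s_{t-p}, aligned with the target s[p:]."""
    return np.column_stack([s[p - k: T - k] for k in range(1, p + 1)])

def resid_var(target, R):
    beta, *_ = np.linalg.lstsq(R, target, rcond=None)
    return np.var(target - R @ beta)

p = 3
Xl, Yl, Zl = lags(x, p, T), lags(y, p, T), lags(z, p, T)
target = x[p:]

G_bivariate   = np.log(resid_var(target, Xl) /
                       resid_var(target, np.hstack([Xl, Yl])))
G_conditional = np.log(resid_var(target, np.hstack([Xl, Zl])) /
                       resid_var(target, np.hstack([Xl, Yl, Zl])))

print(f"bivariate   G_Y->X   = {G_bivariate:.3f}")   # clearly > 0: spurious
print(f"conditional G_Y->X|Z = {G_conditional:.3f}") # ~ 0: conditioning on Z resolves it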
In contrast to Granger causality, Sims causality cannot be generalized to the multivariate case as a criterion for causal inference. The reason is that, since in Eq. 2 the whole future X^{t+1:N} is considered jointly, there is no way to disentangle direct from indirect causal connections from Y to X. This means that for multivariate processes the criterion of Granger causality in Eq. 3 remains the unique non-parametric criterion for causal inference between the time series.
predictability improvement [60], implements for linear processes a test on the equal-
ity of the mean of the distributions appearing in Eq. 1. More generally, if one wants
to test for the equality between two probability distributions without examining spe-
cific moments of a given order, the Kullback-Leibler divergence (KL-divergence)
[38]
KL(p*(x), p(x)) = ∑_x p*(x) log( p*(x) / p(x) )   (4)
is a non-negative measure that is zero if and only if the two distributions are iden-
tical. For a multivariate variable X, since it quantifies the divergence of the distri-
bution p(x) from p∗ (x), one can construct p(x) to reflect a specific null-hypothesis
about the dependence between the components of X. As particular applications of
the KL-divergence to quantify the interdependence between random variables one
has the conditional mutual information
I(X; Y | Z) = ∑_{x,y,z} p(x, y, z) log( p(x|y, z) / p(x|z) ).   (5)
We can see that the form of the probability distributions in the argument of the
logarithm is the same as the ones in Eqs. 1-3. Accordingly, testing the equality of
Eq. 1 is equivalent to having a zero transfer entropy [54, 44]
T_{Y→X} = I(X_{t+1}; Y^t | X^t) = 0.   (6)
[54] introduced the transfer entropy to test the equality of Eq. 1 further assum-
ing that the processes were Markovian with a finite order. A similar information-
theoretic quantity, the directed information, has been introduced in the context of
communication theory [42, 43, 36]. The directed information was originally formu-
lated for the non-stationary case and naturally appears in a causal decomposition of
the mutual information (e.g. [1]). Such a decomposition can also be expressed in
terms of transfer entropies, and is valid for both a non-stationary formulation of the
measures which is local in time and another that is cumulative on the whole time
series [15]. These two formulations converge for the stationary case resulting in
This equality, restricted to the stationary linear case, is indicated already in Theorem
1(ii) of [25], where no instantaneous causality is enforced by a normalization of the
covariance matrix.
Notice that here we consider the measures as particular instantiations of the KL-
divergence used as a statistic for hypothesis testing [38]. This is important to keep in
mind because the KL-divergence can be interpreted as well in terms of code length
[17], and in particular the transfer entropy (directed information) determines the
error-free transmission rate when applied to specific communication channels with
feedback [36], (and see also [47] for a discussion of different application of trans-
fer entropy). Furthermore, any conditional mutual information can be evaluated as
a difference of two conditional entropies, and interpreted as a reduction of uncer-
tainty. To test for causality only the significance of nonzero values is of interest,
but it is common to use the values of TY →X to characterize the causal dependencies.
Alternatively, the value of SY →X could be used, giving a not necessarily equivalent
characterization if the conditions of Eq. 10 are not fulfilled or depending on the
particular estimation procedure.
More generally, the KL-divergence is not the only option to test the criteria of
causality above in a non-parametric way. Other measures have been proposed based
on the same criterion (e.g. [33, 2]) that are sensitive to higher-order moments of
the distributions. A natural alternative that also considers all the moments of the
distributions is to use the Fisher information
F(Y; x) = ∫ dY p(Y|x) ( ∂ ln p(Y|x) / ∂x )²   (11)
which, by means of the Cramér-Rao bound [17], is related to the accuracy of an unbiased estimator of X from Y. For the particular equality of Eq. 1 this leads to testing

E_{Y^t}[ F(X_{t+1}; Y^t | X^t) ] = 0.   (12)

In the Appendix we examine in detail this expression for linear Gaussian autoregressive processes.
Granger [29, 30] proposed to test for causality from Y to X by examining whether there is an improvement in the predictability of X_{t+1} when using the past of Y in addition to the past of X for an optimal linear predictor. For a linear predictor h(X^t), using only information from the past of X, the squared error is determined by
E^{(x)} = ∫ dX_{t+1} dX^t (X_{t+1} − h(X^t))² p(X_{t+1}, X^t),   (21)
and analogously for E^{(xy)} using information from the past of X and Y. Since the optimal linear predictor is the conditional mean [40], we have that
E^{(x)} = ∫ dX^t p(X^t) ∫ dX_{t+1} (X_{t+1} − E[X_{t+1} | X^t])² p(X_{t+1} | X^t) = E_{X^t}[σ²(X_{t+1} | X^t)].   (22)
An analogous equality is obtained for E^{(xy)},

E^{(xy)} = E_{X^t Y^t}[σ²(X_{t+1} | X^t, Y^t)],   (23)

so that the Geweke measure of Granger causality is defined as

G_{Y→X} = ln( Σ_x^{(x)} / Σ_{xx}^{(xy)} ),   (24)
using the autoregressive representation of Eqs. 13-15. This measure, as indicated in [31], tests if there is causality from Y to X in mean, that is, the equality

E[X_{t+1} | X^t] = E[X_{t+1} | X^t, Y^t].   (25)

Accordingly,

G_{Y→X} = 0 ⇐ T_{Y→X} = 0,   (26)

since the first only tests for a difference in the first-order moment and the other in the whole probability distribution. In principle, the opposite implication is not always true. However, since Eq. 25, as well as Eqs. 1-3, imposes a stack of constraints (one for each value of the conditioning variables), we expect that, at least in general, an inequality in the higher-order moments is accompanied by one in the conditional means. Furthermore, when the autoregressive representations are assumed to be valid, testing for the equality in the mean or in the variance of the distributions is equivalent, given Eq. 23 and the fact that the conditional variance is independent of the conditioning value. Notice that Gaussianity does not have to be assumed for this equality; in [25] it is only further assumed in order to find the distribution of the measures under the null hypothesis of no causality.
The explanation above further relates the distinction in [31] between causation
in mean (Eq. 25) and causation prima facie (Eq. 1) to the equivalence between the
Geweke linear measure of Granger causality GY →X and the transfer entropy for
Gaussian processes. Since a Gaussian probability distribution is completely deter-
mined by its first two moments, and the conditional variance is independent on the
value conditioning, it is clear from the explanation above that for Gaussian vari-
ables causation in mean and prima facie have to be equivalent. This in practice can
be seen [7] taking into account that the entropy of an N-variate Gaussian distribution is completely determined by its covariance matrix Σ:

H(X^N_{Gaussian}) = (1/2) ln( (2πe)^N |Σ| ).   (27)
Accordingly, the two measures are such that

G_{Y→X} = 2 T_{Y→X}.   (28)
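A minimal numerical sketch of this equivalence (our own illustration; the VAR(1) coefficients, the history length k = 1 and the estimators are assumptions made for the example, not taken from [7] or [25]): G_{Y→X} is estimated from the residual variances of nested least-squares regressions (Eq. 24), and T_{Y→X} from Gaussian conditional variances via Eqs. 6 and 27; the two agree up to the factor of 2 in Eq. 28.

import numpy as np

rng = np.random.default_rng(0)
T = 100_000
A = np.array([[0.5, 0.4],     # X_t depends on X_{t-1} and Y_{t-1}
              [0.0, 0.7]])    # Y_t depends only on its own past
W = np.zeros((T, 2))
noise = rng.normal(size=(T, 2))
for t in range(1, T):
    W[t] = A @ W[t - 1] + noise[t]
x, y = W[:, 0], W[:, 1]
x_next, x_past, y_past = x[1:], x[:-1], y[:-1]

# Geweke measure (Eq. 24) from the residual variances of nested regressions (k = 1).
def resid_var(target, R):
    beta, *_ = np.linalg.lstsq(R, target, rcond=None)
    return np.var(target - R @ beta)

G = np.log(resid_var(x_next, x_past[:, None]) /
           resid_var(x_next, np.column_stack([x_past, y_past])))

# Transfer entropy (Eq. 6) from Gaussian conditional entropies (Eq. 27):
# T = H(X_{t+1}|X^t) - H(X_{t+1}|X^t,Y^t) = (1/2) ln of the ratio of conditional variances.
def cond_var(cols):
    """Gaussian conditional variance of the first column given the others."""
    S = np.cov(np.column_stack(cols), rowvar=False)
    return S[0, 0] - S[0, 1:] @ np.linalg.solve(S[1:, 1:], S[1:, 0])

TE = 0.5 * np.log(cond_var([x_next, x_past]) /
                  cond_var([x_next, x_past, y_past]))

print(f"G_Y->X = {G:.4f}   2*T_Y->X = {2 * TE:.4f}")   # the two agree up to estimation error (Eq. 28)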
For multivariate processes the conditional GSC [26] is defined in the time domain analogously to G_{Y→X} in Eq. 24, but now using the autoregressive representations of Eqs. 17-20:

G_{Y→X|Z} = ln( Σ_{xx}^{(xz)} / Σ_{xx}^{(xyz)} ).   (29)
It is straightforward to see that, given the form of the entropy for Gaussian variables (Eq. 27) and the definition of the conditional transfer entropy T_{Y→X|Z} (Eq. 8), the relation between Granger causality and transfer entropy also holds for the conditional measures for Gaussian variables:

G_{Y→X|Z} = 2 T_{Y→X|Z}.   (30)
Geweke [25] also proposed a spectral decomposition of the time domain Granger
causality measure (Eq. 24). Geweke derived the spectral measure of causality from
Y to X, gY →X (ω ), requiring the fulfillment of some properties:
1. The spectral measure should have an intuitive interpretation so that the spectral
decomposition is useful for empirical applications.
2. The measure has to be nonnegative.
3. The temporal and spectral measures have to be related so that
(1/2π) ∫_{−π}^{π} g_{Y→X}(ω) dω = G_{Y→X},   (31)

and in particular

G_{Y→X} = 0 ⇔ g_{Y→X}(ω) = 0 ∀ω.   (32)
The GSC is obtained from the spectral representation of the bivariate autoregres-
sive process as follows. Fourier transforming Eq. 14 leads to:
( A_{xx}^{(xy)}(ω)  A_{xy}^{(xy)}(ω) ) ( X(ω) )   ( ε_x^{(xy)}(ω) )
( A_{yx}^{(xy)}(ω)  A_{yy}^{(xy)}(ω) ) ( Y(ω) ) = ( ε_y^{(xy)}(ω) ) ,   (33)
Defining the transfer function H^{(xy)}(ω) as the inverse of the coefficients matrix of Eq. 33, the process is expressed in terms of the innovations as

( X(ω), Y(ω) )ᵀ = H^{(xy)}(ω) ( ε_x^{(xy)}(ω), ε_y^{(xy)}(ω) )ᵀ,   (34)

and the spectral matrix is

S(ω) = H^{(xy)}(ω) Σ^{(xy)} H^{(xy)*}(ω),   (35)

where * denotes complex conjugate and matrix transpose. Given the lack of instantaneous correlations,

S_{xx}(ω) = Σ_{xx}^{(xy)} |H_{xx}^{(xy)}(ω)|² + Σ_{yy}^{(xy)} |H_{xy}^{(xy)}(ω)|².   (36)
The spectral measure of causality from Y to X is then defined as

g_{Y→X}(ω) = ln( S_{xx}(ω) / ( Σ_{xx}^{(xy)} |H_{xx}^{(xy)}(ω)|² ) ).   (37)
This definition fulfills the requirement of being nonnegative since, given Eq. 36, S_{xx}(ω) is always higher than Σ_{xx}^{(xy)} |H_{xx}^{(xy)}(ω)|². It also fulfills the requirement of being intuitive, since g_{Y→X}(ω) quantifies the portion of the power spectrum which is associated with the intrinsic innovation process of X. Furthermore, the third condition is also accomplished (see [25, 57, 14] for details). This can be seen considering
that
g_{Y→X}(ω) = − ln( 1 − |C(X, ε_y^{(xy)})(ω)|² ),   (38)

where |C(X, ε_y^{(xy)})|² is the squared coherence of X with the innovations ε_y^{(xy)} of
Eq. 14. Given the general relation of the mutual information rate with the squared
coherence [24] we have that for Gaussian variables
T_{Y→X} = I(X^N; ε_y^{(xy)N}) = −(1/4π) ∫_{−π}^{π} ln( 1 − |C(X, ε_y^{(xy)})(ω)|² ) dω.   (39)
For the multivariate case, to derive the spectral representation of G_{Y→X|Z}, for simplicity we assume again that there is no instantaneous causality and that Σ^{(xyz)} and Σ^{(xz)} are diagonal (see [18] for a detailed derivation when instantaneous correlations exist). We rewrite Eq. 19 after Fourier transforming as

( ε_x^{(xz)}(ω) )   ( A_{xx}^{(xz)}(ω)  A_{xz}^{(xz)}(ω) ) ( X(ω) )
( ε_z^{(xz)}(ω) ) = ( A_{zx}^{(xz)}(ω)  A_{zz}^{(xz)}(ω) ) ( Z(ω) ) .   (40)
Accordingly, Eqs. 40 and 41 are combined to express ε_x^{(xz)}, Y, and ε_z^{(xz)} in terms of the innovations of the fully multivariate process:

( ε_x^{(xz)}(ω) )                 ( ε_x^{(xyz)}(ω) )
(     Y(ω)      ) = D H^{(xyz)}   ( ε_y^{(xyz)}(ω) ) ,   (43)
( ε_z^{(xz)}(ω) )                 ( ε_z^{(xyz)}(ω) )
where

    ( A_{xx}^{(xz)}(ω)   0   A_{xz}^{(xz)}(ω) )
D = (        0           1          0         ) .   (44)
    ( A_{zx}^{(xz)}(ω)   0   A_{zz}^{(xz)}(ω) )
Considering Q = D H^{(xyz)}, the spectral matrix of ε_x^{(xz)}, Y, and ε_z^{(xz)} is

S(ω) = Q(ω) Σ^{(xyz)} Q*(ω),   (45)

and in particular

S_{ε_x^{(xz)} ε_x^{(xz)}}(ω) = |Q_{xx}(ω)|² Σ_{xx}^{(xyz)} + |Q_{xy}(ω)|² Σ_{yy}^{(xyz)} + |Q_{xz}(ω)|² Σ_{zz}^{(xyz)}.   (46)
The conditional GSC from Y to X given Z is defined [26] as the portion of the power spectrum associated with ε_x^{(xyz)}, in analogy to Eq. 37:

g_{Y→X|Z}(ω) = g_{Y ε_z^{(xz)} → ε_x^{(xz)}}(ω) = ln( S_{ε_x^{(xz)} ε_x^{(xz)}}(ω) / ( |Q_{xx}(ω)|² Σ_{xx}^{(xyz)} ) ).   (47)
This measure also fulfills the requirements that [25] imposed on the spectral measures. Furthermore, in analogy to Eq. 38, g_{Y→X|Z}(ω) is related to a multiple coherence:

g_{Y→X|Z}(ω) = − ln( 1 − |C(ε_x^{(xz)}, ε_y^{(xyz)} ε_z^{(xyz)})|² ),   (48)
In the multivariate case the information partial directed coherence (iPDC) from Y to X [57] is

iπ_{xy}(ω) = C(ε_x^{(xyz)}, η_y^{(xyz)}) = A_{xy}^{(xyz)}(ω) √(S_{yy|W∖y}) / √(Σ_{xx}^{(xyz)}).   (52)

The associated criterion, analogous to the one of the bivariate case (Eq. 51), provides only an expression which involves the innovation processes ε_x^{(xyz)} and η_y^{(xyz)}.
For the multivariate case the Geweke measure is related (Eq. 49) to

p(ε_x^{(xz)N}) = p(ε_x^{(xz)N} | ε_y^{(xyz)N}, ε_z^{(xyz)N})   ∀ ε_y^{(xyz)N}, ε_z^{(xyz)N}.   (56)
Comparing the non-parametric criteria of Section 2.1 with these parametric criteria we can see another main difference, apart from the fact that the parametric ones all involve some innovation process. This difference is that in Eqs. 54-57 a temporal separation between future and past is not required to state the criteria, while the non-parametric criteria all rely explicitly on temporal precedence. The lack of temporal separation is exactly what allows the construction of the spectral measures based on the criteria of Eqs. 54-57. In [14] it was shown, based on this difference with respect to temporal separation, that transfer entropy does not have a non-parametric spectral representation. This lack of a non-parametric spectral representation of the transfer entropy can be further understood considering why a criterion without temporal separation that involves only the processes X, Y, and not innovation processes, cannot be used for causal inference: consider p(X^N) = p(X^N | Y^N) as a criterion to infer causality from Y to X, in contrast to the ones of Eqs. 1 and 54. Using the chain rule for the probability distributions, this equality implies checking p(X_{t+1} | X^t) = p(X_{t+1} | X^t, Y^N). But this equality does not hold if there is a causal connection in the opposite direction, from X to Y, because of the conditioning on the whole process Y^N instead of only on its past. In contrast,
p(X^N | ε_y^{(xy)N}) = ∏_{t=0}^{N−1} p(X_{t+1} | X^t, ε_y^{(xy)N}) = ∏_{t=0}^{N−1} p(X_{t+1} | X^t, ε_y^{(xy)t}) = ∏_{t=0}^{N−1} p(X_{t+1} | X^t, Y^t),   (58)
since by construction there are no causal connections from the processes to the innovation processes. The last equality can be understood considering that the autoregressive projections described in Section 3.1 introduce a functional relation between the variables, such that, for example, given Eq. 14, X_{t+1} is completely determined by ε_x^{(xy)t+1}, ε_y^{(xy)t}, and analogously for Y_{t+1}. Accordingly, it is equivalent to condition on X^t, ε_y^{(xy)t} or on X^t, Y^t.
The probability distributions in Eq. 1 and Eq. 54 are still not the same, as is clear from Eq. 58. However, under the assumption of stationarity, it is the functional relations that completely determine the processes from the innovation processes (and inversely) that lead to the equality in Eq. 39 of the transfer entropy with the mutual information corresponding to the comparison of the probability distributions in Eq. 54, and analogously for Eqs. 49 and 51. Remarkably, the mutual information associated with Eq. 57, as noticed above Eq. 53, is not equal to a mutual information associated with a non-parametric criterion. As indicated in Eq. 51 (see [14] for details), for bivariate processes the PDC is related to Sims causality. However, for the multivariate case, while there is no extension of Sims causality, it is clear from the comparison of the definitions in Eqs. 50 and 52, as well as from the comparison of the criteria of Eqs. 55 and 57, that the multivariate formulation appears as a natural extension of the bivariate one. This stresses the role of the functional relations that are assumed to implicitly define the innovation processes. It is not only the causal structure between the variables but the specific functional form in which they are related that guarantees the validity of the criteria in Eqs. 54-57. In general this functional form is not required to be linear, as long as it establishes that the processes and innovation processes are mutually determined.
Another interesting aspect is revealed from the comparison of the bivariate and
multivariate criteria respectively associated with GSC and PDC measures. While
for the PDC the multivariate criterion is a straightforward extension of the bivariate
one, this is not the case for the criteria associated with GSC. This can be noticed as
well comparing the autoregressive projections used for each measure. In particular,
for the bivariate case, gY →X (ω ) is obtained directly from the bivariate autoregres-
sive representation (Eq. 14), not by combining it with the univariate autoregressive
representation of X (Eq. 13). In contrast, g_{Y→X|Z}(ω) requires the combination of the full multivariate projection (Eq. 17) and the projection on the past of X, Z (Eq. 19). Below we show that in fact there is a natural counterpart for each of the criteria of Eqs. 54 and 56, respectively.
where

    ( a_{xx}^{(x)}(ω)  0 )
P = (       0          1 ) .   (61)
Considering Q̃ = P H^{(xy)}, the spectrum of ε_x^{(x)} is

S_{ε_x^{(x)} ε_x^{(x)}}(ω) = |Q̃_{xx}|² Σ_{xx}^{(xy)} + |Q̃_{xy}|² Σ_{yy}^{(xy)} = |a_{xx}^{(x)}(ω)|² S_{xx}(ω),   (62)
and comparing the total power to the portion related to ε_x^{(xy)} one can define an analogous spectral decomposition, with

T_{Y→X} = I(ε_x^{(x)N}; ε_y^{(xy)N}) = I(X^N; ε_y^{(xy)N}).   (65)
This equality indicates that although the procedure used for the multivariate case is apparently not reducible to the bivariate case for Z = ∅, the spectral decomposition g_{Y→X}(ω) is the same. The criterion for causal inference that results from straightforwardly reducing the one of Eq. 56 to the bivariate case is

p(ε_x^{(x)N}) = p(ε_x^{(x)N} | ε_y^{(xy)N})   ∀ ε_y^{(xy)N}.   (66)
Again, the particular functional relation between the processes and the innovation processes determines that X^N and ε_x^{(x)N} share the same information with ε_y^{(xy)N}, given that they are mutually determined in Eq. 13.
Analogously, we want to find the criterion that results from a straightforward extension of the one in Eq. 54. An alternative way to construct g_{Y→X|Z}(ω) is suggested by the relation between the bivariate and the conditional measures stated by Geweke [26]:

G_{Y→X|Z} = G_{YZ→X} − G_{Z→X},   (67)
which is just an application of the chain rule for the mutual information [17]. In
analogy to Eq. 37
g_{YZ→X}(ω) = ln( S_{xx}(ω) / ( |H_{xx}^{(xyz)}(ω)|² Σ_{xx}^{(xyz)} ) )   (68)

and

g_{Z→X}(ω) = ln( S_{xx}(ω) / ( |H_{xx}^{(xz)}(ω)|² Σ_{xx}^{(xz)} ) ),   (69)
where H^{(xz)} is the inverse of the coefficients matrix of Eq. 40. This leads to

g̃_{Y→X|Z}(ω) = ln( ( |H_{xx}^{(xz)}(ω)|² Σ_{xx}^{(xz)} ) / ( |H_{xx}^{(xyz)}(ω)|² Σ_{xx}^{(xyz)} ) ).   (70)
Notice that while g̃_{Y→X}(ω) = g_{Y→X}(ω) in the bivariate case, the two measures are different in the conditional case. This means that two alternative spectral decompositions are possible, although their integration is equivalent. This can be seen considering that the integration of the logarithm terms including |H_{xx}^{(xz)}|² and |H_{xx}^{(xyz)}|² is zero, based on Theorem 4.2 of Rozanov [51] (see [14] for details). Accordingly,

T_{Y→X|Z} = I(X^N; ε_y^{(xyz)N} ε_z^{(xyz)N}) − I(X^N; ε_z^{(xz)N}).   (71)
The fact that the conditioning variable on the left-hand side is not preserved among the conditioning variables on the right-hand side is what determines that the information-theoretic statistic to test this equality is not a single KL-divergence (in particular a mutual information) but a difference of two.
We examine whether the alternative spectral measures fulfill the three conditions imposed by Geweke described in Section 3.2.1. In the bivariate case the measure is equal, so it clearly does. In the multivariate case the measure has an intuitive interpretation and fulfills the relation with the time-domain measure under integration. However, nonnegativity is not guaranteed for every frequency, since the measure is related to a difference of mutual informations.
the parametric criteria rely not only on the causal structure but also on the functional relations assumed between the processes and the innovation processes. This is particularly clear in the multivariate criteria (Eqs. 56, 57 and 72) because the criteria combine innovations from different projections. This prevents considering the autoregressive models as actual generative models whose structure can be mapped to a causal graph. Here we introduce an alternative type of parametric criterion which relies on a single projection, which can be considered as the model from which the processes are generated.
In the bivariate case the criterion is

p(ε_x^{(xy)N} | X^N) = p(ε_x^{(xy)N} | X^N, ε_y^{(xy)N})   ∀ X^N, ε_y^{(xy)N},   (73)
practice, the actual value estimated would also reflect how valid the chosen autoregressive model is.
Table 1. Criteria for causal inference from Y to X: bivariate case.
Non-parametric
1   p(X_{t+1} | X^t) = p(X_{t+1} | X^t, Y^t)
2   p(X^{t+1:N} | X^t, Y^t) = p(X^{t+1:N} | X^t, Y^t, Y_{t+1})
Parametric
3   p(ε_x^{(x)N}) = p(ε_x^{(x)N} | ε_y^{(xy)N})
4   p(X^N) = p(X^N | ε_y^{(xy)N})
5   p(ε_x^{(xy)N}) = p(ε_x^{(xy)N} | η_y^{(xy)N})
6   p(ε_x^{(xy)N} | X^N) = p(ε_x^{(xy)N} | X^N, ε_y^{(xy)N})
Table 2. Criteria for causal inference from Y to X: multivariate (conditional) case.
Non-parametric
1   p(X_{t+1} | X^t, Z^t) = p(X_{t+1} | X^t, Y^t, Z^t)
2   −
Parametric
3   p(ε_x^{(xz)N}) = p(ε_x^{(xz)N} | ε_y^{(xyz)N}, ε_z^{(xyz)N})
4   p(X^N | ε_z^{(xz)N}) = p(X^N | ε_y^{(xyz)N}, ε_z^{(xyz)N})
5   p(ε_x^{(xyz)N}) = p(ε_x^{(xyz)N} | η_y^{(xyz)N})
6   p(ε_x^{(xyz)N} | X^N, Z^N) = p(ε_x^{(xyz)N} | X^N, Z^N, ε_y^{(xyz)N})
Table 3. Information-theoretic measures associated with the criteria of Tables 1 and 2.
Bivariate
1   I(X_{t+1}; Y^t | X^t) = I(X^N; ε_y^{(xy)N}) = I(ε_x^{(x)N}; ε_y^{(xy)N})
2   I(Y_{t+1}; X^{t+1:N} | Y^t, X^t) = I(ε_x^{(xy)N}; η_y^{(xy)N})
3   I(ε_x^{(xy)N}; ε_y^{(xy)N} | X^N)
Multivariate
4   I(X_{t+1}; Y^t | X^t, Z^t) = I(ε_x^{(xz)N}; ε_y^{(xyz)N}, ε_z^{(xyz)N})
      = I(X^N; ε_y^{(xyz)N}, ε_z^{(xyz)N}) − I(X^N; ε_z^{(xz)N})
5   I(ε_x^{(xyz)N}; η_y^{(xyz)N})
6   I(ε_x^{(xyz)N}; ε_y^{(xyz)N} | X^N, Z^N)
Table 4. Criteria for causal inference from Y to X considering the whole set of processes W.
Non-parametric
1   p(X_{t+1} | {W∖Y}^t) = p(X_{t+1} | {W}^t)
Parametric
5 Conclusion
We have reviewed criteria for causal inference related to Granger causality and proposed some new ones in order to complete a unified framework of criteria and measures to test for causality in a parametric and non-parametric way, in the time or spectral domain, and for bivariate or multivariate processes. These criteria and measures are summarized in Tables 1-4. This offers an integrating picture comprising
the measures proposed by Geweke [25, 26] and partial directed coherence [5]. The
contributions of this Chapter are complementary to the work in [57] and [14]. The
distinction between parametric and non-parametric criteria further emphasizes the
necessity to check the validity of the autoregressive representation when applying a
measure which inherently relies on the definition of the innovation processes. The
distinction between criteria and measures stresses that causal inference and the char-
acterization of the dynamic dependencies resulting from them should be addressed
by different approaches [16, 15].
Finally, we notice again that we have here focused on the formal relation between the different criteria and measures. For practical applications, problems like the influence of hidden variables [21], or time and temporal aggregation [23], constitute serious challenges that can prevent these criteria from being successfully applied. For example, in the case of causal analysis of the brain it is now clear that a successful characterization can only be obtained if the application of these criteria is combined with a biologically plausible reconstruction of how the recorded data are generated by the neural activity [50, 58, 23]. Even at a more practical level, estimating the information-theoretic measures used to test for causality from small data sets is complicated [34]. Most often stationarity is assumed for simplification, but event-related estimation is also possible [3, 27]. We believe that a clear understanding of the underlying criteria for causal inference and of their relation to measures can also help to better interpret and address these practical problems.
Appendix

E_{Y^t}[F(X_{t+1}; y^t | X^t)] = ∫ dy^t p(y^t) ∫ dX^t p(X^t | y^t) ∫ dX_{t+1} p(X_{t+1} | X^t, y^t) ( ∂ log p(X_{t+1} | X^t, y^t) / ∂y^t )².   (77)
We start by considering the term F(X_{t+1}; y^t | x^t) corresponding to the first integral. For a Gaussian process, p(X_{t+1} | x^t, y^t) = N(μ(X_{t+1} | x^t, y^t), σ(X_{t+1} | x^t, y^t)) is Gaussian. Therefore
F(X_{t+1}; y^t | x^t) = ∫ dX_{t+1} N(μ(X_{t+1} | x^t, y^t), σ(X_{t+1} | x^t, y^t)) [ ( ∂ log √(2π) σ(X_{t+1} | x^t, y^t) / ∂y^t )² + ( ∂ (1/2) ( (x_{t+1} − ∑_{s=0}^{∞} (a_{xx s}^{(xy)} X_{t−s} + a_{xy s}^{(xy)} Y_{t−s})) / σ(X_{t+1} | x^t, y^t) )² / ∂y^t )² ].   (78)
The first summand inside the integral is zero because the term on which the derivative is taken is independent of y^t. For the second summand, since it is linear, we consider for simplification just the partial derivative with respect to a single variable y_t. We get

F(X_{t+1}; y_t | x^t) = ∫ dX_{t+1} N(μ(X_{t+1} | x^t, y^t), σ(X_{t+1} | x^t, y^t)) ( (x_{t+1} − ∑_{s=0}^{∞} a_{xx s}^{(xy)} X_{t−s} − a_{xy t} y_t) / σ(X_{t+1} | x^t, y^t) )² ( a_{xy t} / σ(X_{t+1} | x^t, y^t) )² = a_{xy t}² / σ²(X_{t+1} | x^t, y^t).   (79)
This term is independent of both x^t and y^t, so that the other two integrations in Eq. 77 can be done straightforwardly. We have

E_{Y^t}[F(X_{t+1}; y^t | X^t)] = a_{xy t}² / σ²(X_{t+1} | x^t, y^t),   (80)
so that each coefficient in the autoregressive representation can be given a meaning in terms of the Fisher information. This further illuminates the relation between the coefficients and G_{Y→X} [55, 40]:

G_{Y→X} = 0 ⇔ a_{xy s}^{(xy)} = 0 ∀s.   (81)
References
1. Amblard, P.O., Michel, O.: On directed information theory and Granger causality graphs.
J. Comput. Neurosci. 30, 7–16 (2011)
2. Ancona, N., Marinazzo, D., Stramaglia, S.: Radial basis function approach to nonlinear
Granger causality of time series. Phys. Rev. E 70(5), 056221 (2004)
3. Andrzejak, R.G., Ledberg, A., Deco, G.: Detection of event-related time-dependent di-
rectional couplings. New. J. Phys. 8, 6 (2006)
4. Ay, N., Polani, D.: Information flows in causal networks. Advances in Complex Sys-
tems 11, 17–41 (2008)
5. Baccala, L., Sameshima, K.: Partial directed coherence: a new concept in neural structure
determination. Biol. Cybern. 84(1), 463–474 (2001)
6. Baccala, L., Sameshima, K., Ballester, G., Do Valle, A., Timo-Iaria, C.: Studying the
interaction between brain structures via directed coherence and Granger causality. Appl.
Sig. Process. 5, 40–48 (1999)
7. Barnett, L., Barrett, A.B., Seth, A.K.: Granger causality and transfer entropy are equiva-
lent for Gaussian variables. Phys. Rev. Lett. 103(23), 238701 (2009)
8. Besserve, M., Schoelkopf, B., Logothetis, N.K., Panzeri, S.: Causal relationships be-
tween frequency bands of extracellular signals in visual cortex revealed by an informa-
tion theoretic analysis. J. Comput. Neurosci. 29(3), 547–566 (2010)
9. Bressler, S.L., Richter, C.G., Chen, Y., Ding, M.: Cortical functional network organiza-
tion from autoregressive modeling of local field potential oscillations. Stat. Med. 26(21),
3875–3885 (2007)
10. Bressler, S.L., Seth, A.K.: Wiener-Granger causality: A well established methodology.
Neuroimage 58(2), 323–329 (2011)
11. Brovelli, A., Ding, M., Ledberg, A., Chen, Y., Nakamura, R., Bressler, S.L.: Beta oscil-
lations in a large-scale sensorimotor cortical network: Directional influences revealed by
Granger causality. P Natl. Acad. Sci. USA 101, 9849–9854 (2004)
12. Chamberlain, G.: The general equivalence of Granger and Sims causality. Economet-
rica 50(3), 569–581 (1982)
13. Chen, Y., Bressler, S., Ding, M.: Frequency decomposition of conditional Granger
causality and application to multivariate neural field potential data. J. Neurosci.
Meth. 150(2), 228–237 (2006)
14. Chicharro, D.: On the spectral formulation of Granger causality. Biol. Cybern. 105(5-6),
331–347 (2011)
15. Chicharro, D., Ledberg, A.: Framework to study dynamic dependencies in networks of
interacting processes. Phys. Rev. E 86, 041901 (2012)
16. Chicharro, D., Ledberg, A.: When two become one: The limits of causality analysis of
brain dynamics. PLoS One 7(3), e32466 (2012)
17. Cover, T.M., Thomas, J.A.: Elements of Information Theory, 2nd edn. John Wiley and
Sons (2006)
18. Ding, M., Chen, Y., Bressler, S.L.: Granger causality: Basic theory and application to
neuroscience. In: Handbook of Time Series Analysis: Recent Theoretical Developments
and Applications, pp. 437–460. Wiley-VCH Verlag (2006)
19. Eichler, M.: A graphical approach for evaluating effective connectivity in neural systems.
Phil. Trans. R Soc. B 360, 953–967 (2005)
20. Eichler, M.: On the evaluation of information flow in multivariate systems by the directed
transfer function. Biol. Cybern. 94(6), 469–482 (2006)
21. Eichler, M.: Granger causality and path diagrams for multivariate time series. J. Econo-
metrics 137, 334–353 (2007)
22. Faes, L., Nollo, G., Porta, A.: Information-based detection of nonlinear Granger causality
in multivariate processes via a nonuniform embedding technique. Phys. Rev. E 83(5),
051112 (2011)
23. Friston, K.J.: Functional and effective connectivity: A review. Brain Connectivity 1(1),
13–36 (2012)
24. Gelfand, I., Yaglom, A.: Calculation of the amount of information about a random func-
tion contained in another such function. Am. Math. Soc. Transl. Ser. 2(12), 199–246
(1959)
25. Geweke, J.F.: Measurement of linear dependence and feedback between multiple time
series. J. Am. Stat. Assoc. 77(378), 304–313 (1982)
26. Geweke, J.F.: Measures of conditional linear dependence and feedback between time
series. J. Am. Stat. Assoc. 79(388), 907–915 (1984)
27. Gómez-Herrero, G., Wu, W., Rutanen, K., Soriano, M.C., Pipa, G., Vicente, R.: Assess-
ing coupling dynamics from an ensemble of time series. arXiv:1008.0539v1 (2010)
28. Gourevitch, B., Le Bouquin-Jeannes, R., Faucon, G.: Linear and nonlinear causality
between signals: methods, examples and neurophysiological applications. Biol. Cy-
bern. 95(4), 349–369 (2006)
29. Granger, C.W.J.: Economic processes involving feedback. Information and Control 6,
28–48 (1963)
30. Granger, C.W.J.: Investigating causal relations by econometric models and cross-spectral
methods. Econometrica 37(3), 424–438 (1969)
31. Granger, C.W.J.: Testing for causality: A personal viewpoint. J. Econ. Dynamics and
Control 2(1), 329–352 (1980)
32. Guo, S., Seth, A.K., Kendrick, K.M., Zhou, C., Feng, J.: Partial Granger causality - elim-
inating exogenous inputs and latent variables. J. Neurosci. Meth. 172(1), 79–93 (2008)
33. Hiemstra, C., Jones, J.D.: Testing for linear and nonlinear Granger causality in the stock
price-volume relation. J. Financ. 49(5), 1639–1664 (1994)
34. Hlaváčková-Schindler, K., Paluš, M., Vejmelka, M., Bhattacharya, J.: Causality detection
based on information-theoretic approaches in time-series analysis. Phys. Rep. 441, 1–46
(2007)
35. Kaminski, M., Blinowska, K.: A new method of the description of the information flow
in the brain structures. Biol. Cybern. 65(3), 203–210 (1991)
36. Kramer, G.: Directed information for channels with feedback. PhD dissertation, Swiss
Federal Institute of Technology, Zurich (1998)
37. Kuersteiner, G.: Granger-Sims causality. In: The New Palgrave Dictionary of Economics, 2nd edn. (2008)
38. Kullback, S.: Information Theory and Statistics. Dover, Mineola (1959)
39. Lizier, J.T., Prokopenko, M., Zomaya, A.Y.: Local information transfer as a spatiotem-
poral filter for complex systems. Phys. Rev. E 77, 026110 (2008)
40. Lütkepohl, H.: New introduction to multiple time series analysis. Springer, Berlin (2006)
41. Marinazzo, D., Pellicoro, M., Stramaglia, S.: Causal information approach to partial con-
ditioning in multivariate data sets. Comput. Math. Meth. Med., 303601 (2012)
42. Marko, H.: Bidirectional communication theory - generalization of information theory.
IEEE Trans. Commun. 12, 1345–1351 (1973)
43. Massey, J.: Causality, feedback and directed information. In: Proc. Intl. Symp. Info. Th.
Appli., Waikiki, Hawaii, USA (1990)
44. Paluš, M., Komárek, V., Hrnčíř, Z., Štěrbová, K.: Synchronization as adjustment of in-
formation rates: Detection from bivariate time series. Phys. Rev. E 63, 046211 (2001)
45. Pearl, J.: Causality: Models, Reasoning, and Inference, 2nd edn. Cambridge University Press,
New York (2009)
46. Pereda, E., Quian Quiroga, R., Bhattacharya, J.: Nonlinear multivariate analysis of neu-
rophysiological signals. Prog. Neurobiol. 77, 1–37 (2005)
47. Permuter, H., Kim, Y., Weissman, T.: Interpretations of directed information in portfolio
theory, data compression, and hypothesis testing. IEEE Trans. Inf. Theory 57(3), 3248–
3259 (2009)
48. Priestley, M.: Spectral analysis and time series. Academic Press Inc., San Diego (1981)
49. Quinn, C.J., Coleman, T.P., Kiyavash, N., Hatsopoulos, N.G.: Estimating the directed
information to infer causal relationships in ensemble neural spike train recordings. J.
Comput. Neurosci. 30, 17–44 (2011)
50. Roebroeck, A., Formisano, E., Goebel, R.: The identification of interacting networks in
the brain using fMRI: Model selection, causality and deconvolution. NeuroImage 58(2),
296–302 (2011)
51. Rozanov, Y.: Stationary random processes. Holden-Day, San Francisco (1967)
52. Schelter, B., Timmer, J., Eichler, M.: Assessing the strength of directed influences among
neural signals using renormalized partial directed coherence. J. Neurosci. Meth. 179(1),
121–130 (2009)
53. Schelter, B., Winterhalder, M., Eichler, M., Peifer, M., Hellwig, B., Guschlbauer, B.,
Lucking, C., Dahlhaus, R., Timmer, J.: Testing for directed influences among neural
signals using partial directed coherence. J. Neurosci. Meth. 152(1-2), 210–219 (2006)
54. Schreiber, T.: Measuring information transfer. Phys. Rev. Lett. 85, 461–464 (2000)
55. Sims, C.: Money, income, and causality. American Economic Rev. 62(4), 540–552
(1972)
56. Solo, V.: On causality and mutual information. In: Proceedings of the 47th IEEE Con-
ference on Decision and Control, pp. 4639–4944 (2008)
57. Takahashi, D.Y., Baccala, L.A., Sameshima, K.: Information theoretic interpretation of
frequency domain connectivity measures. Biol. Cybern. 103(6), 463–469 (2010)
58. Valdes-Sosa, P., Roebroeck, A., Daunizeau, J., Friston, K.: Effective connectivity: Influ-
ence, causality and biophysical modeling. Neuroimage 58(2), 339–361 (2011)
59. Vicente, R., Wibral, M., Lindner, M., Pipa, G.: Transfer entropy: A model-free measure
of effective connectivity for the neurosciences. J. Comput. Neurosci. 30, 45–67 (2010)
60. Wiener, N.: The theory of prediction. In: Modern Mathematics for Engineers, pp. 165–
190. McGraw-Hill, New York (1956)
Subject Index

K
kernel estimation 14, 45, 174
Kolmogorov entropy 151
Kozachenko-Leonenko estimator 46
Kraskov-Stögbauer-Grassberger estimator 15, 48, 53, 174

N
natural causal effects 196
network motifs 101, 107, 113, 124, 126
networks 88
neuronal cultures 115
non-linear time-series analysis 163
non-uniform embedding 73