
Understanding Complex Systems

Michael Wibral
Raul Vicente
Joseph T. Lizier Editors

Directed Information
Measures in
Neuroscience
Understanding Complex Systems

Founding Editor
Prof. Dr. J.A. Scott Kelso
Center for Complex Systems & Brain Sciences
Florida Atlantic University
Boca Raton FL, USA
E-mail: kelso@walt.ccs.fau.edu

Editorial and Programme Advisory Board


Dan Braha
New England Complex Systems Institute and University of Massachusetts, Dartmouth
Péter Érdi
Center for Complex Systems Studies, Kalamazoo College, USA and Hungarian Academy of
Sciences, Budapest, Hungary
Karl Friston
Institute of Cognitive Neuroscience, University College London, London, UK
Hermann Haken
Center of Synergetics, University of Stuttgart, Stuttgart, Germany
Viktor Jirsa
Centre National de la Recherche Scientifique (CNRS), Université de la Méditerranée, Marseille,
France
Janusz Kacprzyk
System Research, Polish Academy of Sciences, Warsaw, Poland
Kunihiko Kaneko
Research Center for Complex Systems Biology, The University of Tokyo, Tokyo, Japan
Scott Kelso
Center for Complex Systems and Brain Sciences, Florida Atlantic University, Boca Raton, USA
Markus Kirkilionis
Mathematics Institute and Centre for Complex Systems, University of Warwick, Coventry, UK
Jürgen Kurths
Potsdam Institute for Climate Impact Research (PIK), Potsdam, Germany
Andrzej Nowak
Department of Psychology, Warsaw University, Poland
Linda Reichl
Center for Complex Quantum Systems, University of Texas, Austin, USA
Peter Schuster
Theoretical Chemistry and Structural Biology, University of Vienna, Vienna, Austria
Frank Schweitzer
System Design, ETH Zürich, Zürich, Switzerland
Didier Sornette
Entrepreneurial Risk, ETH Zürich, Zürich, Switzerland

For further volumes:


http://www.springer.com/series/5394
Understanding Complex Systems
Future scientific and technological developments in many fields will necessarily depend upon coming
to grips with complex systems. Such systems are complex in both their composition - typically many
different kinds of components interacting simultaneously and nonlinearly with each other and their envi-
ronments on multiple levels - and in the rich diversity of behavior of which they are capable.
The Springer Series in Understanding Complex Systems (UCS) promotes new strategies and
paradigms for understanding and realizing applications of complex systems research in a wide variety of
fields and endeavors. UCS is explicitly transdisciplinary. It has three main goals: First, to elaborate the
concepts, methods and tools of complex systems at all levels of description and in all scientific fields,
especially newly emerging areas within the life, social, behavioral, economic, neuro and cognitive sci-
ences (and derivatives thereof); second, to encourage novel applications of these ideas in various fields
of engineering and computation such as robotics, nano-technology and informatics; third, to provide a
single forum within which commonalities and differences in the workings of complex systems may be
discerned, hence leading to deeper insight and understanding.
UCS will publish monographs, lecture notes and selected edited contributions aimed at communicat-
ing new findings to a large multidisciplinary audience.

Springer Complexity
Springer Complexity is an interdisciplinary program publishing the best research and academic-level
teaching on both fundamental and applied aspects of complex systems - cutting across all traditional dis-
ciplines of the natural and life sciences, engineering, economics, medicine, neuroscience, social and com-
puter science.
Complex Systems are systems that comprise many interacting parts with the ability to generate a new
quality of macroscopic collective behavior the manifestations of which are the spontaneous formation of
distinctive temporal, spatial or functional structures. Models of such systems can be successfully mapped
onto quite diverse “real-life” situations like the climate, the coherent emission of light from lasers, chem-
ical reaction-diffusion systems, biological cellular networks, the dynamics of stock markets and of the
internet, earthquake statistics and prediction, freeway traffic, the human brain, or the formation of opin-
ions in social systems, to name just some of the popular applications.
Although their scope and methodologies overlap somewhat, one can distinguish the following main
concepts and tools: self-organization, nonlinear dynamics, synergetics, turbulence, dynamical systems,
catastrophes, instabilities, stochastic processes, chaos, graphs and networks, cellular automata, adaptive
systems, genetic algorithms and computational intelligence.
The two major book publication platforms of the Springer Complexity program are the monograph
series “Understanding Complex Systems” focusing on the various applications of complexity, and the
“Springer Series in Synergetics”, which is devoted to the quantitative theoretical and methodological foun-
dations. In addition to the books in these two core series, the program also incorporates individual titles
ranging from textbooks to major reference works.
Michael Wibral · Raul Vicente
Joseph T. Lizier
Editors

Directed Information
Measures in Neuroscience

Editors

Michael Wibral
Brain Imaging Center
Frankfurt am Main
Germany

Raul Vicente
Max-Planck Institute for Brain Research
Frankfurt am Main
Germany

Joseph T. Lizier
CSIRO Computational Informatics
Marsfield, Sydney
Australia

ISSN 1860-0832 ISSN 1860-0840 (electronic)


ISBN 978-3-642-54473-6 ISBN 978-3-642-54474-3 (eBook)
DOI 10.1007/978-3-642-54474-3
Springer Heidelberg New York Dordrecht London

Library of Congress Control Number: 2014932427

© Springer-Verlag Berlin Heidelberg 2014


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology
now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection
with reviews or scholarly analysis or material supplied specifically for the purpose of being entered
and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of
this publication or parts thereof is permitted only under the provisions of the Copyright Law of the
Publisher’s location, in its current version, and permission for use must always be obtained from Springer.
Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations
are liable to prosecution under the respective Copyright Law.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
While the advice and information in this book are believed to be true and accurate at the date of pub-
lication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any
errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect
to the material contained herein.

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)


Preface

In scientific discourse and the media it is commonplace to state that brains exist to
‘process information’. Curiously enough, however, we only have a certain under-
standing of what is meant by this when we refer to some specific tasks solved by
information processing like perceiving or remembering objects, or making decisions
– just to name a few. Information processing itself is rather general, however, and it seems much more difficult to quantify it exactly without referring to specific tasks. These difficulties arise mostly because only for specific tasks is it easy to restrict the parts of the neural system to include in the analysis and to define the roles they assume, e.g. as inputs or outputs (for the task under consideration). In contrast to
these difficulties that arise when trying to treat information processing in brains, we
have no difficulty quantifying information processing in a digital computer, e.g. in terms of the information stored on its hard disk, or the amount of information transferred per second from its hard disk to its random access memory, and then on
to the CPU. In the case of the digital computer it seems completely unnecessary to refer to specific tasks to understand the general principles of information processing implemented in this multi-purpose machine, and components of its information processing are easily quantified and are also understood to some degree by almost everyone. Why, then, is it so difficult to perform a similar quantification for biological, and
especially neural information processing?
One answer to this question is the conceptual difference between a digital com-
puter and a neural system: In a digital computer all components are laid out such
that they only perform specific operations on information: a hard disk should store
information, the CPU should quickly modify it, and system buses exist only to trans-
fer information. In contrast, in neural systems it is safe to assume that each agent
of the system (each neuron) simultaneously stores, transfers and modifies informa-
tion in variable amounts, and that these component processes are hard to separate
and quantify. This is because of the recurrent nature of neural circuits that defy the
traditional separation of inputs and outputs, and because the general ’computations’
that are performed may be of a nature that renders the explicit definition or analysis
of a ’code’ exceedingly difficult. Thus, while in digital computers the distinction
between information storage, transfer and modification comes practically for free, in neural systems separating the components of distributed information processing requires explicit mathematical definitions of information storage, transfer and modification.
These necessary mathematical definitions were recently derived building on Alan
Turing’s old idea that every act of information processing can be decomposed into
the component processes of information storage, transfer and modification – in line
with our everyday view of the subject. A key concept here is that the total infor-
mation found in the state of an agent, or a part of a computational system, should be
decomposable into the contributions from Turing’s three component processes – see
the chapter by Lizier in this book for details. This decomposition of the total (Shan-
non) information then provides a link to information theory as originally introduced
by Claude Shannon for communication processes only. In recent years, Langton and
others have expanded Turing’s concepts to describe the emergence of the capacity
to perform arbitrary information processing algorithms, or universal computation,
in complex systems, such as cellular automata, swarms, or neural systems. In all
of these systems there now is considerable agreement on how to properly measure
information transfer and storage, and important progress has been made towards a
proper mathematical definition of information modification, as well. There is also
agreement that it is useful to decompose information processing in this way to guess
the algorithms that are implemented in complex systems.
In particular, the quantification of information transfer via directed information
measures has met rapidly increasing interest both in complex systems theory and
in neuroscience, and is the scope of this book. The book itself is based on the dis-
cussions on information transfer that took place at the LOEWE-NeFF symposium
on “Nonlinear and Model-free Interdependence Measures in Neuroscience” at the
Ernst-Strüngmann Institute (ESI) and the Goethe University in Frankfurt in April
2012.
While the interest of the neuroscience community in analysing neural data by information-based methods was initially ignited by the relationship between directed interactions and interdependencies and information transfer, as still evi-
denced in the title of the symposium, recent work by the editors and authors of
this book (most notably Lizier and Chicharro) and other groups in the field has
made clear that directed information measures reach beyond serving as a model-
free proxy of causal interactions. In fact, directed information measures introduce a
critical differentiation of causal interactions with respect to their use in computation
and provide valuable constraints on algorithms carried out in neural systems. These
recent developments are reflected in the book.
This book is divided into three parts. In the introductory part, the two opening
chapters by Lindner, Vicente and Wibral lay the foundations for a proper under-
standing of transfer entropy – the most popular measure of information transfer.
These chapters deal with the concept of information transfer, the interpretation of
the transfer entropy measure, as well as its estimation from real world data. In the
second part, the chapters by Faes and Porta, Marinazzo, and Battaglia, and Vakorin
present a more in-depth treatment of recent advances that are necessary to under-
stand information transfer in networks with many agents, such as those we encounter
in neuroscience. In the last part, the chapters by Lizier and Chicharro suggest two
new interesting contexts for the study of information transfer in neuroscience. The
chapter by Lizier shows how to quantify the dynamics of information transfer on a
local scale in space and time, thereby opening the possibility to follow information
processing step by step in time and node by node in space. Chicharro then points
out the relation between different measures of information transfer and criteria to
infer causal interactions in complex systems.
The editors and authors gratefully acknowledge the generous funding of the Land
Hessen via the LOEWE grant “Neuronale Koordination Forschungsschwerpunkt
Frankfurt (NeFF)” that sponsored a workshop that gave rise to this book. The editors
also acknowledge the help of Daniel Chicharro in reviewing.

Frankfurt, Tartu, Sydney
January 2014

Michael Wibral
Raul Vicente
Joseph T. Lizier
Contents

Part I: Introduction to Directed Information Measures

Transfer Entropy in Neuroscience . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3


Michael Wibral, Raul Vicente, Michael Lindner
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1 Physical Systems, Time Series, Random Processes
and Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Basic Information Theory . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3 The Transfer Entropy Functional . . . . . . . . . . . . . . . . . . . . 7
2.4 Interpretation of TE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3 Practical Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.1 Signal Representation and State Space
Reconstruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.2 Transfer Entropy Estimators . . . . . . . . . . . . . . . . . . . . . . . 13
3.3 A Graphical Summary of the TE Principle . . . . . . . . . . . . 16
3.4 Information Transfer Delay Estimation . . . . . . . . . . . . . . 18
3.5 Practical TE Estimation and Open Source Tools . . . . . . . 20
4 Common Problems and Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.1 Statistical Testing to Overcome Bias and Variance
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.2 Multivariate TE and Approximation Techniques . . . . . . . 24
4.3 Observation Noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.4 Stationarity and Ensemble Methods . . . . . . . . . . . . . . . . . 26
5 Relation to Other Directed Information Measures . . . . . . . . . . . . . 27
5.1 Time-Lagged Mutual Information . . . . . . . . . . . . . . . . . . . 27
5.2 Transfer Entropy and Massey’s Directed Information . . . 28
5.3 Momentary Information Transfer . . . . . . . . . . . . . . . . . . . 30
6 Summary and Outlook . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

Efficient Estimation of Information Transfer . . . . . . . . . . . . . . . . . . . . . . . . . 37


Raul Vicente, Michael Wibral
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2 Why Information Theory? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.1 Transfer Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3 A Zoo of Estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.1 Parametric Estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.2 Non-parametric Estimators . . . . . . . . . . . . . . . . . . . . . . . . . 43
4 Estimating Transfer Entropy from Time Series via Nearest
Neighbor Statistics: Step by Step . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.1 Step 1: Reconstructing the State Space . . . . . . . . . . . . . . . 49
4.2 Step 2: Computing the Transfer Entropy Numerical
Estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.3 Step 3: Using Transfer Entropy as a Statistic . . . . . . . . . . 51
4.4 Toolboxes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5 Coping with Non-stationarity: An Ensemble Estimator . . . . . . . . . 52
6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

Part II: Information Transfer in Neural and Other Physiological Systems

Conditional Entropy-Based Evaluation of Information Dynamics in Physiological Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
Luca Faes, Alberto Porta
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
2 Information Dynamics in Coupled Systems . . . . . . . . . . . . . . . . . . . 63
2.1 Self Entropy, Cross Entropy and Transfer Entropy in
Bivariate Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
2.2 Self Entropy, Cross Entropy and Transfer Entropy in
Multivariate Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
2.3 Self Entropy, Cross Entropy and Transfer Entropy as
Components of System Predictive Information . . . . . . . . 66
3 Strategies for the Estimation of Information Dynamics
Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.1 Corrected Conditional Entropy . . . . . . . . . . . . . . . . . . . . . 69
3.2 Corrected Conditional Entropy from Non-uniform
Embedding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.3 Parameter Setting and Open Issues . . . . . . . . . . . . . . . . . . 75
4 Applications to Physiological Systems . . . . . . . . . . . . . . . . . . . . . . . 78
4.1 Applications of Self Entropy Analysis . . . . . . . . . . . . . . . 78
4.2 Applications of Cross Entropy Analysis . . . . . . . . . . . . . . 80
4.3 Applications of Transfer Entropy Analysis . . . . . . . . . . . 81
5 Conclusions and Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . 83
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

Information Transfer in the Brain: Insights from a Unified Approach . . . 87


Daniele Marinazzo, Guorong Wu, Mario Pellicoro, Sebastiano Stramaglia
1 Economics of Information Transfer in Networks . . . . . . . . . . . . . . 88
1.1 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
1.2 Electroencephalographic Recordings . . . . . . . . . . . . . . . . 93
2 Partial Conditioning of Granger Causality . . . . . . . . . . . . . . . . . . . . 95
2.1 Finding the Most Informative Variables . . . . . . . . . . . . . . 96
2.2 Partial Conditioning in a Dynamical Model . . . . . . . . . . . 98
2.3 Partial Conditioning in Resting State fMRI . . . . . . . . . . . 100
3 Informative Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
3.1 Identification of Irreducible Subgraphs . . . . . . . . . . . . . . . 101
4 Expansion of the Transfer Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . 102
4.1 Applications: Magnetic Resonance and EEG Data . . . . . 105
4.2 Relationship with Information Storage . . . . . . . . . . . . . . . 107
5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
Function Follows Dynamics: State-Dependency of Directed Functional
Influences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
Demian Battaglia
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
2 State-Conditioned Transfer Entropy . . . . . . . . . . . . . . . . . . . . . . . . . 113
3 Directed Functional Interactions in Bursting Cultures . . . . . . . . . . 115
3.1 Neuronal Cultures “in silico” . . . . . . . . . . . . . . . . . . . . . . . 115
3.2 Extraction of Directed Functional Networks . . . . . . . . . . 118
3.3 Zero-Lag Causal Interactions for Slow-Rate Calcium
Imaging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
3.4 State-Selection Constraints for Neuronal Cultures . . . . . 119
3.5 Functional Multiplicity in Simulated Cultures . . . . . . . . . 120
3.6 Structural Connectivity from Directed Functional
Connectivity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
3.7 Structural Degeneracy in Simulated Cultures . . . . . . . . . . 123
4 Directed Functional Interactions in Motifs of Oscillating
Areas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
4.1 Oscillating Local Areas “in silico” . . . . . . . . . . . . . . . . . . 125
4.2 State-Selection Constraints for Motifs of Oscillating
Areas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
4.3 Functional Multiplicity in Motifs of Oscillating
Areas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
4.4 Control of Information Flow Directionality . . . . . . . . . . . 129
5 Function from Structure, via Dynamics . . . . . . . . . . . . . . . . . . . . . . 131
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132

On Complexity and Phase Effects in Reconstructing the Directionality of Coupling in Non-linear Systems . . . . . . . . . . . . . . . . . . . . . . . . . 137
Vasily A. Vakorin, Olga Krakovska, Anthony R. McIntosh
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
2 Coupled Non-linear Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
3 Granger Causality: Standard, Spectral and Non-linear . . . . . . . . . . 139
4 Phase Synchronization and Phase Delays . . . . . . . . . . . . . . . . . . . . 142
5 Causality and Phase Differences: Three Scenarios . . . . . . . . . . . . . 142
6 Influence of the Parameters of Coupling on Causality and
Phase Delays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
7 Information Content of the Observed Time Series . . . . . . . . . . . . . 151
8 Directionality of Coupling and Differences in Complexity . . . . . . 152
9 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157

Part III: Recent Advances in the Analysis of Information Processing

Measuring the Dynamics of Information Processing on a Local Scale in Time and Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
Joseph T. Lizier
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
2 Information-Theoretic Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . 164
3 Local Information Theoretic Measures . . . . . . . . . . . . . . . . . . . . . . 167
3.1 Shannon Information Content and Its Meaning . . . . . . . . 168
3.2 Local Mutual Information and Conditional Mutual
Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
3.3 Local Information Measures for Time Series . . . . . . . . . . 172
3.4 Estimating the Local Quantities . . . . . . . . . . . . . . . . . . . . . 173
4 Local Measures of Information Processing . . . . . . . . . . . . . . . . . . . 175
4.1 Local Information Storage . . . . . . . . . . . . . . . . . . . . . . . . . 175
4.2 Local Information Transfer . . . . . . . . . . . . . . . . . . . . . . . . 176
5 Local Information Processing in Cellular Automata . . . . . . . . . . . . 180
5.1 Blinkers and Background Domains as Information
Storage Entities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
5.2 Particles, Gliders and Domain Walls as Dominant
Information Transfer Entities . . . . . . . . . . . . . . . . . . . . . . . 183
5.3 Sources Can Be Locally Misinformative . . . . . . . . . . . . . 185
5.4 Conditional Transfer Entropy Is Complementary . . . . . . 185
5.5 Contrasting Information Transfer and Causal Effect . . . . 186
6 Discussion: Relevance of Local Measures to Computational
Neuroscience . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188

Parametric and Non-parametric Criteria for Causal Inference from Time-Series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
Daniel Chicharro
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
2 Non-parametric Approach to Causal Inference from
Time-Series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
2.1 Non-parametric Criteria for Causal Inference . . . . . . . . . 197
2.2 Measures to Test for Causality . . . . . . . . . . . . . . . . . . . . . . 198
3 Parametric Approach to Causal Inference from Time-Series . . . . . 200
3.1 The Autoregressive Process Representation . . . . . . . . . . . 201
3.2 Parametric Measures of Causality . . . . . . . . . . . . . . . . . . . 202
3.3 Parametric Criteria for Causal Inference . . . . . . . . . . . . . . 208
3.4 Alternative Geweke Spectral Measures . . . . . . . . . . . . . . . 210
3.5 Alternative Parametric Criteria Based on Innovations
Partial Dependence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
4 Comparison of Non-parametric and Parametric Criteria for
Causal Inference from Time-Series . . . . . . . . . . . . . . . . . . . . . . . . . 213
5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
6 Appendix: Fisher Information Measure of Granger Causality
for Linear Autoregressive Gaussian Processes . . . . . . . . . . . . . . . . 215
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
Subject Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
Part I
Introduction to Directed Information
Measures

This part of the book provides an introduction to the concepts of directed infor-
mation measures, especially transfer entropy, to the relation of causal interactions
and information transfer, and to practical aspects of estimating information theoretic
quantities from real-world data.
Transfer Entropy in Neuroscience

Michael Wibral, Raul Vicente, and Michael Lindner

Abstract. Information transfer is a key component of information processing, next
to information storage and modification. Information transfer can be measured by a
variety of directed information measures of which transfer entropy is the most popu-
lar, and most principled one. This chapter presents the basic concepts behind transfer
entropy in an intuitive fashion, including graphical depictions of the key concepts.
It also includes a special section devoted to the correct interpretation of the mea-
sure, especially with respect to concepts of causality. The chapter also provides an
overview of estimation techniques for transfer entropy and pointers to popular open
source toolboxes. It also introduces recent extensions of transfer entropy that serve
to estimate delays involved in information transfer in a network. By touching upon
alternative measures of information transfer, such as Massey’s directed information
transfer and Runge’s momentary information transfer, it may serve as a frame of ref-
erence for more specialised treatments and as an overview of the field of studies
in information transfer in general.

1 Introduction
This chapter introduces transfer entropy, which to date is arguably the most widely
used directed information measure, especially in neuroscience. The presentation of
Michael Wibral
MEG Unit, Brain Imaging Center, Goethe University, Heinrich-Hoffmann Strasse 10,
60528 Frankfurt am Main, Germany
e-mail: wibral@em.uni-frankfurt.de
Raul Vicente
Max-Planck Institute for Brain Research, 60385 Frankfurt am Main, Germany
e-mail: raulvicente@gmail.com
Michael Lindner
School of Psychology and Clinical Language Science, University of Reading
e-mail: m.lindner@reading.ac.uk


the basic concepts behind transfer entropy and a special section devoted to the cor-
rect interpretation of the measure are meant to prepare the reader for more in depth
treatments in chapters that follow. The chapter should also serve as a frame of refer-
ence for these more specialised treatments and present an overview of the field of studies in information transfer. In this sense, it may be treated as both an opening and a closing chapter to this volume.
Since its introduction by Paluš [55] and Schreiber [60] transfer entropy has
proven extremely useful in a wide variety of application scenarios ranging from
neuroscience [69, 73, 55, 66, 67, 68, 8, 1, 4, 6, 7, 17, 19, 40, 47, 52, 59, 62, 34, 22,
52, 36, 5, 63, 27, 6, 35, 6, 7], physiology [11, 13, 12], climatology [57], complex
systems theory [44, 45, 40] and other fields, such as economics [32, 29]. This wide
variety of application fields suggests that transfer entropy measures a useful and
fundamental quantity to understand complex systems, especially those that can be
conceptualized as some kind of network of interacting agents or processes. It is the
purpose of this chapter to organise the available material so as to help the reader in
understanding why transfer entropy is an indispensable tool in understanding com-
plex systems.
In the first section of this chapter we will introduce the fundamental concepts
behind transfer entropy, give a guide to its interpretation and will help to distinguish
it from measures of causal influences based on interventions. In the second section
we will then proceed to show how the concepts of transfer entropy can be cast into
an efficient estimator to obtain reliable transfer entropy values from empirical data,
and also consider aspects of computationally efficient implementations. The third
section deals with common problems and pitfalls encountered in transfer entropy
estimation. We will also briefly discuss two other directed information measures
that have been proposed for the analysis of information transfer – Marko’s directed
information and Pompe’s momentary information transfer. We will show in what
aspects they differ from transfer entropy, and what these differences mean for their
application in neuroscience. In the concluding remarks we will explain how transfer
entropy is much more than just the tool for model-free investigations of directed
interactions that it is often portrayed to be, and point out the important role it may
play even in the analysis of detailed neural models.

2 Concepts

2.1 Physical Systems, Time Series, Random Processes and


Random Variables
In this section we introduce the necessary notation and basic information theoretic
concepts that are indispensable to understand transfer entropy. This is done to obtain
a self-contained presentation of the material for readers without a background in
information theory. Readers familiar with elementary information theory may safely
skip ahead to the next section.

To avoid confusion when introducing the concept of transfer entropy, we first
have to state what systems we wish to apply the measure to, and how we for-
malize observations from these systems mathematically. For the remainder of this
text we therefore assume that the coupled physical systems (e.g. neurons, brain
areas) X , Y , Z , . . . , produce the observed time series {x1 , . . . , xT }, {y1 , . . . , yT },
{z1 , . . . , zT }, . . . ,via measurements at discrete times t ∈ 1...T . These time series are
understood as realizations xt , yt , zt , . . . of random variables Xt ,Yt , Zt , . . . that form
random processes X, Y, Z, . . . (indicated by typewriter font), unless stated other-
wise. Random processes are nothing but collections of random variables, sorted
by an index (t in our case). Hence, our notation should be understood as all of the
{X1 , . . . , XT }, {Y1 , . . . ,YT }, {Z1 , . . . , ZT }, . . . being individual random variables that
produce their own (potentially multiple) realizations at the indexed time point t. As
pointed out below, in section 4.4, these multiple realizations of each random variable
can for example be obtained from multiple copies of the physical systems.
If, however, such copies are unavailable, we will obtain only a single realization
from each random variable in a process, which is not enough to evaluate probabil-
ities or information theoretic quantities. Hence, we have to assume that all random
variables of a process are essentially the same in terms of their underlying proba-
bility distributions, i.e. that the random process is stationary. Under this stationarity
assumption we can treat all values of an observed time series as realizations from a
single underlying probability distribution (the one common to all our random vari-
ables), and estimate this distribution from multiple time samples. The stationarity
assumption for the random processes is convenient here as it allows us to replace ex-
pectations taken over an ensemble of copies of the random process by simple time
averages over observations from a single process, and we will assume stationarity
from here on, unless stated otherwise. Nevertheless, the reader should keep in mind
that all of the methods presented here will also work for ensemble averaging, and
thus for non-stationary time-series [18].
In the remainder of the text, upper case letters X, Y, Z refer to these (stationary)
random processes, Xt ,Yt , Zt to the random variables the processes are composed of,
while lower case letters with subscript indices xt , yt , zt refer to scalar realizations
of these random variables. Bold case letters Xt , Yt , Zt , xt , yt , zt refer to the cor-
responding random variables, and their realizations in a state space representation
(see section 3.1 for the meaning and construction of these state spaces).

2.2 Basic Information Theory


Based on the above definitions we now define the necessary basic information theo-
retic quantities. Assume a random variable X with possible outcomes X ∈ AX and a
probability distribution pX (X = x) over these outcomes, for which we use the short-
hand notation p(x). Then the Shannon information is the reduction in uncertainty
that we obtain when that specific outcome x (with probability p(x)) is observed. To
understand this quantity we have to consider what fraction of all possible chains
of events – that together have a total probability of 1 – remains possible after the

observation of x. It is clear that after observation of x only chains of events that
start with x are still possible. All of these together initially had a probability of
p(x). Hence, before observing x, events with a total probability of 1 were possible,
whereas afterwards only events with a total probability of p(x) remain possible. For
reasons given below it makes sense to quantify uncertainty not directly as the to-
tal remaining probability but as a monotonic function, the logarithm, thereof. The
reduction of uncertainty in our example thus is: log(1) − log(p(x)). Shannon there-
fore defined the information content gained by the observation of an event x with
probability p(x) as:
$$h(x) = \log \frac{1}{p(x)} \qquad (1)$$
We easily see that less probable outcomes yield more information in case they are
actually observed, and that for two outcomes x, y of two independent random
variables X,Y the Shannon information is additive, as suggested by our intuition
about an information measure for this case:
$$h(x, y) = \log \frac{1}{p(x, y)} = \log \frac{1}{p(x)} + \log \frac{1}{p(y)} \quad \text{iff} \quad p(x, y) = p(x)\,p(y) \qquad (2)$$

The average Shannon information that we obtain by repeatedly observing outcomes from a random variable X is called the (Shannon) entropy H of the random variable:

$$H(X) = \sum_{x \in A_X} p(x) \log \frac{1}{p(x)} \qquad (3)$$
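To make these definitions concrete, a minimal Python sketch (illustrative only; the function names are not taken from any of the toolboxes discussed later in this chapter) can compute the information content of equation 1 and the entropy of equation 3 for a discrete distribution. Natural logarithms are used, so all quantities are given in nats rather than bits.

```python
import numpy as np

def information_content(p_x):
    # h(x) = log(1 / p(x)): the less probable the outcome, the larger the surprise
    return np.log(1.0 / p_x)

def entropy(p):
    # H(X) = sum_x p(x) log(1 / p(x)): the average information content
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                    # outcomes with p(x) = 0 contribute nothing
    return float(np.sum(p * np.log(1.0 / p)))

# A biased coin: the rare outcome carries more information,
# and the entropy is below that of a fair coin (log 2 nats).
print(information_content(0.1), information_content(0.9))
print(entropy([0.9, 0.1]), entropy([0.5, 0.5]))
```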

The Shannon information of an outcome x of X, given we have already observed the outcome y of another variable Y, that is not necessarily independent of X, is:

$$h(x|y) = \log \frac{1}{p(x|y)} \qquad (4)$$

Averaging this for all possible outcomes of X, weighted by their probabilities p(x|y) after the outcome y was observed, and then over all possible outcomes y, that occur with p(y), yields a definition for the conditional entropy:

$$H(X|Y) = \sum_{y \in A_Y} p(y) \sum_{x \in A_X} p(x|y) \log \frac{1}{p(x|y)} = \sum_{x \in A_X,\, y \in A_Y} p(x, y) \log \frac{1}{p(x|y)} \qquad (5)$$

The conditional entropy H(X|Y ) is the average amount of information that we get
from making an observation of X after having already made an observation of Y . In
terms of uncertainties H(X|Y ) is the average remaining uncertainty in X once Y was
observed. We can also say H(X|Y ) is the information that is unique to X. Conditional
entropy is useful if we want to express the amount of information shared between
the two variables X, Y. This is because the shared information is the total average information in the one variable, H(X), minus the average information that is unique to this variable, H(X|Y). Hence, we define mutual information as:

$$I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) \qquad (6)$$


Similarly to conditional entropy we can also define a conditional mutual information between two variables X, Y, given the value of a third variable Z is known, as:

$$I(X;Y|Z) = H(X|Z) - H(X|Y, Z) \qquad (7)$$
This conditional mutual information is the basic information theoretic functional
used for defining transfer entropy as we will explain in the next section.
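The entropy decompositions of equations 5–7 translate directly into code. The following minimal sketch (again purely illustrative, operating on a known joint probability table rather than on empirical data) evaluates conditional entropy, mutual information and conditional mutual information from joint entropies.

```python
import numpy as np

def H(p):
    # joint Shannon entropy of a probability table of any dimensionality (nats)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def cond_entropy(p_xy):
    # H(X|Y) = H(X,Y) - H(Y); axes of p_xy are (x, y)
    return H(p_xy) - H(p_xy.sum(axis=0))

def mutual_info(p_xy):
    # I(X;Y) = H(X) - H(X|Y)
    return H(p_xy.sum(axis=1)) - cond_entropy(p_xy)

def cond_mutual_info(p_xyz):
    # I(X;Y|Z) = H(X|Z) - H(X|Y,Z); axes of p_xyz are (x, y, z)
    p_xz = p_xyz.sum(axis=1)
    p_yz = p_xyz.sum(axis=0)
    return (H(p_xz) - H(p_xz.sum(axis=0))) - (H(p_xyz) - H(p_yz))

# Two independent fair bits share no information: I(X;Y) = 0.
print(mutual_info(np.full((2, 2), 0.25)))
```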

2.3 The Transfer Entropy Functional


At a conceptual level, TE is a model free implementation of Wiener’s principle of
observational causality1 [76]. Wiener’s principle states:
• For two simultaneously observed processes X, Y, we call X “causal” to Y if knowl-
edge about the past of X improves our prediction of Y over and above what is
predictable from the past of Y alone.
We obtain an information theoretic implementation of this principle by recasting it
in information theoretic terms as:
• How much additional information does the past state of process X contain about
the future observation of a value of Y given that we already know the past state
of Y?
Writing down the second statement as a formula directly yields the transfer entropy
functional:
$$TE(X \rightarrow Y) = I(X^-, Y^+ \mid Y^-)\,, \qquad (8)$$
where I(·, ·|·) is the conditional mutual information, Y^+ is the future random variable of process Y, and X^-, Y^- are the past state variables of processes X, Y, respectively. The state variables introduced here are collections of past random variables that
contain all the “relevant” past information on the time evolution of the random pro-
cesses – see section 3.1 for more information on state variables. Figure 2 illustrates
the relationship between Wiener’s principle and TE.
TE has been independently formulated as a conditional mutual information by
Schreiber [60] and Paluš [55]. The quantity measured by the TE functional has been
termed predictive information transfer to emphasize the predictive, rather than the
causal, interpretation of the measure [44]. It is useful to remember that a perfect self-
prediction of the target variable (Y + ) automatically results in the transfer entropy
1 Today, there is general agreement that statements about causality require interventions in
the system in question [56, 3]. Also see section 2.4.

being zero. It is also important to note that Wiener’s principle requires the best self-
prediction possible, as sub-optimal self prediction will lead to erroneously inflated
transfer entropy values [72, 69].
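For discrete, stationary time series the functional of equation 8 can be written down as a simple plug-in estimate. The sketch below is a toy illustration only (all names are ours): it uses a single past sample of each process as its 'state', whereas real analyses require the state-space reconstruction of section 3.1 and the statistical testing discussed in section 4.

```python
import numpy as np
from collections import Counter

def plugin_te(x, y):
    # TE(X -> Y) = I(X-; Y+ | Y-) in nats, with one-sample (order-1) past states
    n = len(y) - 1
    joint   = Counter(zip(y[1:], y[:-1], x[:-1]))   # counts of (y+, y-, x-)
    past_xy = Counter(zip(y[:-1], x[:-1]))          # counts of (y-, x-)
    fut_y   = Counter(zip(y[1:], y[:-1]))           # counts of (y+, y-)
    past_y  = Counter(y[:-1])                       # counts of (y-)
    te = 0.0
    for (yp, ym, xm), c in joint.items():
        p_fut_given_both = c / past_xy[(ym, xm)]            # p(y+ | y-, x-)
        p_fut_given_past = fut_y[(yp, ym)] / past_y[ym]     # p(y+ | y-)
        te += (c / n) * np.log(p_fut_given_both / p_fut_given_past)
    return te

# Usage: y copies x with a delay of one sample, so one bit (log 2 nats) is
# transferred per time step, while y's own past is uninformative about y's future.
rng = np.random.default_rng(0)
x = rng.integers(0, 2, 10000)
y = np.roll(x, 1)
print(plugin_te(x, y))   # close to log(2) ~ 0.69 nats
print(plugin_te(y, x))   # close to 0
```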

2.4 Interpretation of TE
After introducing the concept of transfer entropy and its relation to other
information-theoretic measures, it is important to now take a broader perspective
and describe the interpretation and the use of TE in the field of complex systems
analysis, including neuroscience, before turning to actual estimation techniques in
section 3, below.

2.4.1 Transfer Entropy and Causality


Historically, TE was introduced with interactions in physical systems in mind by
Schreiber, and has been widely adopted as a model-free measure of directed inter-
actions in neuroscience [69, 73, 55, 66, 67, 8, 16, 1, 4, 6, 7, 17, 19, 40, 47, 52, 59,
62, 68, 34, 22], physiology [11, 13, 12], and other fields [32, 29, 44, 45, 40]. Often,
TE is even given a causal interpretation.
Before we go into more detail on the distinction between information transfer and
causal interactions, a simple thought experiment may serve to illustrate our case:
Imagine you have bought a new record, say with music so far unknown to you. As
you come home and play the record the elements of a long chain of causal interac-
tions all conspire to transfer information about the music on the record to your brain.
These causal, physical interactions happen between the walls of the record’s grooves
and the needle, the magnetic transducer system behind the needle, and so on, up to
the conversion of pressure modulations to neural signals in the cochlea that finally
drive activity in your cortex. In this situation, there undeniably is information trans-
fer, as the information read out from the source, the record, at any given moment is
not yet known in the target process, i.e. the neural activity in your brain. However,
this information transfer rapidly ceases if your new record has a crack, making the
needle skip and repeat a certain part of the music. Obviously, no new information
is transferred, which, under certain mild conditions, is equivalent to no information
transfer at all. Interestingly, an analysis of TE between sound and neural activity
will yield the same result: The repetitive sound leads to repetitive neural activity (at
least after a while). This neural activity is thus predictable by its own past – at least
under the condition of vanishing neural ’noise’ – leaving no room for a prediction
improvement by the sound source signal. Hence, we obtain a TE of zero, which is
the correct result from a conceptual point of view. Remarkably, at the same time
the chain of causal interactions remains practically unchanged. Therefore, a causal
model able to fit the data from the original situation will have no problem to fit the
data of the situation with the cracked record, as well. Again, this is conceptually the
correct result, but this time from a causal point of view.

In line with this simple example, several recent studies demonstrate clearly that
this measure should be strictly interpreted as predictive information transfer [45]
for at least four reasons:
1. An investigation of the presence of causal interactions will ultimately require
interventions to come to definite conclusions, as can be seen by a well-known
toy example (see figure 1). In fact, a causal measure that is intimately related to
TE, but employs Pearl’s ’do-formalism’ [56] for interventions, has been proposed
by Ay and Polani [3].
2. TE values are not a measure of ’causal effect size’, as noted by Chicharro and
colleagues [9]. Chicharro and colleagues found that the concept of a causal effect
size even lacked proper definition, and when defined properly did not align with
the quantities determined by measures of predictive information transfer such as
TE or other Wiener-type measures.
3. TE is not a measure of coupling strength and cannot be used to recover an es-
timate of a coupling parameter. This is illustrated by the fact that TE often de-
pends in a non-monotonic way on the coupling strengths between two systems.
For example, increasing the interaction strength between two systems may lead
to their complete synchronization. In this case, the systems’ dynamics are iden-
tical copies of each other, and information can not be transferred. Hence, TE is
zero by definition in this case and thus smaller than in cases with smaller cou-
pling strength and incomplete synchronization (see figure 1 in [28], and figure 1
in [24]).
4. Not all causal interactions in a system serve the purpose of information transfer
from the perspective of distributed computation, because some interactions serve
active information storage, rather than transfer, depending on network topology
[38], and dynamic regime [42].
The last item on this list deserves special attention as it points out a particular
strength of TE: It can differentiate between interactions in service of information
storage and those in service of information transfer. This differentiation is absolutely
crucial to understanding distributed computation in systems composed of many in-
teracting, similar agents that dynamically change their roles in a computation. Im-
portantly, this differentiation is not possible using measures of causal interactions
based on interventions [43], as these ultimately reveal physical interaction struc-
ture rather than computational structure. In neuroscience, this physical interaction
structure can be equated to anatomical connectivity at all spatial scales.
Another advantage of an information theoretic approach as compared to a causal
one arises when we want to understand a specific computation in a neural system
that specifically relies on the absence of interventions (e.g. spontaneous perceptual
switches). In this case the investigation of causal interactions could only be carried
out under certain fortunate circumstances that may be rarely met in neural systems
[9]. In contrast, an analysis of the information transfer underlying the computation
is still well defined in information theoretic terms [10] and fruitful as long as one is
aware of the conceptual difference between information transfer and causal interac-
tions.

Fig. 1 Causality without information transfer. Two example systems that demonstrate the
difference between causal interactions and information transfer. (A) A system of two nodes
where each node has only internal dynamics that make each node's state flip alternatingly
between the two states of a bit, 1 (black) and 0 (white). There are no causal interactions
between the nodes, and no information transfer (TE=0). (B) Another system with no internal
dynamics in the two nodes, but with mutual causal interactions that always impose the bit
state of the source node onto the target node at each update. In this example there is a causal
interaction, but again no information transfer (TE=0). Note that the states of the full system
of two nodes are identical to the ones in (A). (C) The same system as in (B), but this time
'programmed' with a different initial state (0,0). Example simplified from the one given in
[3].
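The situation in panel (B) of figure 1 can be checked numerically with a few lines of illustrative Python: both nodes of the copying system produce strictly alternating bit sequences, so each node's future is perfectly predicted by its own past, the conditional entropy H(Y^+|Y^-) vanishes, and the transfer entropy, which can never exceed this conditional entropy, is zero despite the ongoing causal interaction.

```python
import numpy as np
from collections import Counter

# Bit sequences produced by the copying system of panel (B):
x = [1, 0] * 500          # node 1: 1, 0, 1, 0, ...
y = [0, 1] * 500          # node 2: 0, 1, 0, 1, ... (each node copies the other's last bit)

# Each node's next state is perfectly predicted by its own past, so the
# conditional entropy H(Y+ | Y-) is zero, and TE(X -> Y) <= H(Y+ | Y-) = 0.
n = len(y) - 1
pairs = Counter(zip(y[1:], y[:-1]))      # counts of (y+, y-)
past = Counter(y[:-1])                   # counts of (y-)
h_cond = sum(c / n * np.log(past[ym] / c) for (yp, ym), c in pairs.items())
print(h_cond)             # 0.0: nothing is left for the causal input from x to explain
```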

To sum up, the quantity measured by TE is the amount of predictive informa-
tion actually transferred between two processes and, in neuroscience, may best be
interpreted as information transferred in service of a distributed computation. As
such, TE (together with local active information storage – presented in the chapter
of Lizier in this book, and local information modification [37]) gives us certain con-
straints on computational algorithms carried out by the system under specific experi-
mental conditions, while measures of causal interactions, such as causal information
flow [3], reveal the set of all possible interactions, i.e. information on the anatomy
and biophysics of a neural system. Nevertheless, measures of information trans-
fer may sometimes serve as proxies for causal interactions, if other (interventional)
methods are not available or practical. In these cases however, causal interpretation
should proceed carefully, as, to put it simply, there is indeed no information transfer
without a causal interaction, but the reverse does not hold.

2.4.2 State Dependent and State-Independent Transfer Entropy


In equation 8 we introduced TE as the mutual information between the past state variable (X^-) of the source process X and the future variable (Y^+) of the target process Y, conditional on the past state variable (Y^-) of the destination process Y: I(Y^+; X^- | Y^-). At that point one may have been tempted to read this as the condi-
tioning effectively removing what we already knew from the past of the target about
its future. This removal of redundant information between the past of the source,
X− , and the past of the target, Y− , about the future of the target, Y + , would after all
be true to the spirit of Wiener’s principle.
However, conditioning on the past of the target in equation 8 also introduces
synergies between the past of the source and the past of the target [78], i.e. informa-
tion in one of the variables is only ’visible’ when the value of the other is known,
e.g. by conditioning on it. This synergistic contribution is not stated explicitly in
Wiener’s principle (but it is also not explicitly excluded), and has therefore often
been overlooked when interpreting the results of measures derived from the Wiener
principle, e.g. in linear Granger causality implementations.
As a toy example take two binary processes where the source process is random
and stationary, and the future of the target process is formed in an XOR-operation
on the past of the source and the past of the target. In this case, the information
that the past of the source conveys about the future of the target can only be seen
when conditioning on the past of the target. Not conditioning in this case would
actually underestimate the transfer entropy instead of overestimating it. In sum, the
information in the past of both source and target processes about the future of the
target process can be decomposed into information that is (1) unique to the past of
the target, or (2) redundantly shared between source and target – this information is ’conditioned away’ – (3) information that is unique to the past of the source – this information has been called ’state-independent transfer entropy’ – and (4) information that arises synergistically from the past of the source and the target together – this information is called ’state-dependent transfer entropy’. State-dependent transfer entropy is effectively ’conditioned in’ by conditioning on the past of the target. Two recent studies by Paul Williams and Randall Beer provide more details
on the decomposition of mutual information in general, and of transfer entropy in
particular [77, 78].
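The XOR example can be reproduced with a short, self-contained sketch (plug-in estimates on binary data; all names are illustrative). The time-lagged mutual information I(X^-; Y^+) is close to zero, while the conditional form I(X^-; Y^+ | Y^-) recovers the full bit of purely synergistic, state-dependent transfer.

```python
import numpy as np
from collections import Counter

def plugin_cmi(a, b, c=None):
    # plug-in I(A;B) or, if c is given, I(A;B|C), for discrete samples (nats)
    n = len(a)
    c = [0] * n if c is None else c          # a constant condition reduces CMI to MI
    abc, ac, bc, cc = Counter(zip(a, b, c)), Counter(zip(a, c)), Counter(zip(b, c)), Counter(c)
    return sum(k / n * np.log(k * cc[z] / (ac[(u, z)] * bc[(v, z)]))
               for (u, v, z), k in abc.items())

rng = np.random.default_rng(1)
x = rng.integers(0, 2, 20000)
y = np.zeros_like(x)
for t in range(len(x) - 1):
    y[t + 1] = x[t] ^ y[t]                   # target future = XOR of source past and target past

x_past, y_past, y_fut = x[:-1], y[:-1], y[1:]
print(plugin_cmi(x_past, y_fut))             # ~0: the source alone appears uninformative
print(plugin_cmi(x_past, y_fut, y_past))     # ~log(2): conditioning reveals the synergy
```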
As a consequence of the potential presence of synergies between the past of
source and target, the criterion of optimal self-prediction of the target [72], which is incorporated in Ragwitz' criterion [58], for example, should in the future be extended to also incorporate these joint influences.

2.4.3 Multivariate TE
So far we have mostly considered the information transfer from one source process
X to another target process Y. In neuroscience, however, we deal with networks com-
posed of many nodes, and therefore have to consider information transfer between
multiple processes. If we look at a target process Y and a set of source processes X(i) ,
we immediately see the possibility of redundant or synergistic information transfer
from this set of sources to the target. Redundancy and synergy were described for
the past of one source and the past of the target in the preceding section (2.4.2), but
the same principle naturally applies to the case of multiple source processes. It is
clear that a simple pairwise analysis of transfer entropies between one source at a
time and the target will not reveal redundant or synergistic information transfers,
and that to this end a multivariate transfer entropy analysis is necessary. An analysis
that specifically considers the information transfer from sets of multiple sources into
a target. A quick analysis of this problem reveals that uncovering the full structure
of the information transfers in a multi-process network is an NP-hard problem [46],
hence, approximative solutions have to be used. We will therefore present several
such approximative algorithms in section 4.2 below.

3 Practical Application
In this section we now turn to the problem of obtaining transfer entropy values from experimental data, and of judging the significance of these estimates.

3.1 Signal Representation and State Space Reconstruction


One of the essential requirements of Wiener’s principle is that the self-prediction of
the future realizations y+ of the target process Y has to be done optimally. If this is
not done, an improvement of this prediction by including information from the past
of process X might be only due to the suboptimal self prediction.
An optimal prediction of a future realization of the target process typically re-
quires looking at more than realizations of just one past random variable of this
process. The simple example of a pendulum may illustrate this: If only one posi-
tion value yt of the pendulum at time t is observed, we do not know whether the
pendulum is going left or right at that moment and the future position yt+1 is dif-
ficult to predict. Observation of a second past position value yt−1 , however, allows
us to distinguish left-going and right-going motion and therefore a much better pre-
diction. In this case, knowledge of two past position values is in principle enough
for prediction, given the general properties of a pendulum are known. In contrast,
more complicated systems (processes) may require knowledge about realizations of
additional past random variables for optimal prediction. A vector collecting past re-
alizations, such that they are sufficient for prediction is called a state of the system.
More formally, if there is any dependence between the Yt that form a random process Y, we have to form the smallest collection of variables into a state variable, $\mathbf{Y}_t = (Y_t, Y_{t_1}, Y_{t_2}, \ldots, Y_{t_i}, \ldots)$ with $t_i < t$, that jointly make $Y_{t+1}$ conditionally independent of all $Y_{t_k}$ with $t_k < \min(t_i)$, i.e.:

$$p(y_{t+1}, y_{t_k} \mid \mathbf{y}_t) = p(y_{t+1} \mid \mathbf{y}_t)\, p(y_{t_k} \mid \mathbf{y}_t) \qquad (9)$$
$$\forall\, t_k < \min(t_i) \;\wedge\; \forall\, y_{t+1} \in A_{Y_{t+1}},\; y_{t_k} \in A_{Y_{t_k}},\; \mathbf{y}_t \in A_{\mathbf{Y}_t}$$

A realization yt of Yt is called a state of the random process Y at time t.


The procedure of obtaining this state yt from the observations yt , yt1 , . . . is called
state space reconstruction. The procedure most often used in this context is Takens’
delay embedding [65]. Takens’ delay embedding contains two parameters (d and τ ,
see below) which are often optimized2 via Ragwitz’ criterion [58]. The use of Rag-
witz’ criterion yields delay embedding states that provide optimal self prediction for
a large class of systems, either deterministic or stochastic in nature, but alternatives
exist [13, 61], and may be more data efficient in some cases.
Delay embedding states of the systems under investigation can be written as delay
vectors of the form:

\[
\mathbf{x}^{d}_{t} = \big(x_t,\, x_{t-\tau},\, x_{t-2\tau},\, \ldots,\, x_{t-(d-1)\tau}\big) \,, \qquad (10)
\]

where d denotes the embedding dimension, describing how many past time samples
are used, and τ denotes Takens’ embedding delay, describing how far apart these
samples are in time (compare figure 2, A, where the relevant samples are spaced τ
time steps apart). The space containing all delay embedding vectors is the delay-
embedding space (compare figure 2, B). This delay-embedding space is the state
space of the process, if embedding successfully captured all past information in the
process that is relevant to its future.
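To make the embedding step concrete, the following minimal sketch (in Python, used here purely for illustration; the function name delay_embed and all variable names are ours and not part of any toolbox discussed later) constructs the delay vectors of equation 10 from a scalar time series:

import numpy as np

def delay_embed(x, d, tau):
    # Return delay-embedding vectors (x_t, x_{t-tau}, ..., x_{t-(d-1)tau}).
    # x: 1-D array of scalar observations, d: embedding dimension, tau: embedding delay.
    # The first (d-1)*tau samples cannot be embedded and are dropped.
    x = np.asarray(x, dtype=float)
    n = len(x) - (d - 1) * tau
    if n <= 0:
        raise ValueError("time series too short for the chosen d and tau")
    # column j holds x_{t - j*tau}; row i corresponds to time index i + (d-1)*tau
    return np.column_stack([x[(d - 1 - j) * tau : (d - 1 - j) * tau + n]
                            for j in range(d)])

# example: states of a noisy sine wave with d = 3 and tau = 5
t = np.arange(1000)
y = np.sin(2 * np.pi * t / 50) + 0.1 * np.random.randn(t.size)
states = delay_embed(y, d=3, tau=5)
print(states.shape)   # (990, 3)

Applying the analogous construction to source and target series yields the past states used in the transfer entropy functional introduced in the next section.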
The importance of a proper state-space reconstruction cannot be overstated as
insufficient state-space reconstruction may lead to false positive results and reversed
directions of information transfer (see [69] for toy examples).

3.2 Transfer Entropy Estimators


Transfer entropy estimation (as opposed to its analytic computation based on known
probability distributions, see for example [28]) revolves around reconstructing prob-
ability distributions, or functions of these distributions, for the processes of interest
from finite data samples. Using the states obtained by delay embedding we can
rewrite transfer entropy as:
 
\[
TE_{SPO}(X \to Y, u) = \sum_{y_t,\, \mathbf{y}^{d_y}_{t-1},\, \mathbf{x}^{d_x}_{t-u}} p\big(y_t, \mathbf{y}^{d_y}_{t-1}, \mathbf{x}^{d_x}_{t-u}\big) \log \frac{p\big(y_t \mid \mathbf{y}^{d_y}_{t-1}, \mathbf{x}^{d_x}_{t-u}\big)}{p\big(y_t \mid \mathbf{y}^{d_y}_{t-1}\big)} \,, \qquad (11)
\]

2 Hence, these parameters do not feature explicitly in the TE estimation, but can be consid-
ered part of the algorithm itself.

where the parameter u is the assumed time that the information transfer needs to get
from X to Y, and the subscript SPO (for self-prediction optimal) is a reminder that
the past state of Y, $\mathbf{y}^{d_y}_{t-1}$, has to be constructed such that self-prediction is optimal.
We can rewrite equation 11 using a representation in the form of four Shannon
(differential) entropies H(·), as:
   
\[
TE_{SPO}(X \to Y, u) = H\big(\mathbf{y}^{d_y}_{t-1}, \mathbf{x}^{d_x}_{t-u}\big) - H\big(y_t, \mathbf{y}^{d_y}_{t-1}, \mathbf{x}^{d_x}_{t-u}\big) + H\big(y_t, \mathbf{y}^{d_y}_{t-1}\big) - H\big(\mathbf{y}^{d_y}_{t-1}\big) \,. \qquad (12)
\]

Thus, TE_SPO estimation amounts to computing a combination of different joint
and marginal differential entropies.
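For intuition, the following sketch evaluates this four-entropy combination with simple plug-in (relative-frequency) estimates for discrete-valued data, where the differential entropies reduce to ordinary Shannon entropies. It assumes an embedding delay of one sample; the function names are ours and the example system is hypothetical:

import numpy as np
from collections import Counter

def plugin_entropy(samples):
    # plug-in Shannon entropy (in bits) of a sequence of discrete symbols or tuples
    counts = np.array(list(Counter(samples).values()), dtype=float)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def te_plugin(x, y, dy=1, dx=1, u=1):
    # TE_SPO(X -> Y, u) for discrete sequences via the four-entropy form of eq. (12)
    start = max(dy, u + dx - 1)                          # first index with full histories
    yt, ypast, xpast = [], [], []
    for t in range(start, len(y)):
        yt.append((y[t],))
        ypast.append(tuple(y[t - dy:t]))                 # (y_{t-dy}, ..., y_{t-1})
        xpast.append(tuple(x[t - u - dx + 1:t - u + 1])) # (x_{t-u-dx+1}, ..., x_{t-u})
    return (plugin_entropy([yp + xp for yp, xp in zip(ypast, xpast)])
            - plugin_entropy([yp + xp + s for yp, xp, s in zip(ypast, xpast, yt)])
            + plugin_entropy([yp + s for yp, s in zip(ypast, yt)])
            - plugin_entropy(ypast))

# example: y simply copies x with a lag of one sample, so one bit is transferred per step
rng = np.random.default_rng(0)
x = rng.integers(0, 2, 50_000)
y = np.roll(x, 1)
print(te_plugin(x, y))   # close to 1 bit

For continuous-valued neural data, however, such plug-in estimates on binned amplitudes are exactly the problematic case discussed below, and nearest-neighbour estimators are preferable.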
In the estimation process, several partially conflicting goals have to be balanced.
First, we expect a small finite-sample bias of the estimate, and fast convergence to
the expected value with increasing sample size. At the same time we expect the
computational cost to be low enough to compute TE estimates not only from the
empirical data at hand, but also from several surrogate data sets, which are needed to
compensate for residual bias of the estimate. The number of surrogate data sets that
is required may be large, ranging from hundreds to hundreds of thousands, depending
on the required statistical threshold. This threshold can be very small, and the required
number of surrogates correspondingly high, if for example a correction for multiple
comparisons is necessary in an application.
While TE estimators are presented in a separate chapter by Raul Vicente and
Michael Wibral in this book, we give some rough guidelines here about what to
expect in terms of the three requirements above.
Shannon (differential) entropies can be estimated by
• Binning, or coarse graining approaches. In these approaches the joint space of
all relevant variables in the TE formula, equation 12, is partitioned into a grid of
bins, or boxes, and the data points in each box are counted and divided by the
box volume to estimate the probability densities. Algorithmically this cannot be
implemented in a naive way, as some of the boxes may be empty. One would
therefore check first which of the boxes in the joint space are non-empty and run
all necessary evaluations in the joint and the marginal spaces only over non-
empty boxes, relying for the empty boxes on the convention that 0 log(0) = 0,
which makes empty boxes disappear from the summations. Note that a binning
approach for continuous variables (instead of naturally discrete variables) is not
recommended as it can produce spurious reversals of information transfers [23].
• Kernel estimators are similar to coarse graining approaches but replace the count
of points in a bin by a weighted sum over points, where the weighting function
is some kernel. As the kernels are only placed around the observed points in
the joint embedding space, it is guaranteed that no empty bins exist. Typically
box kernels are used for simplicity but any mono-modal kernel could be used in
principle.
• Nearest-neighbour techniques. These techniques exploit the statistics of distances
between neighbouring data points in a given embedding space in a data effi-
cient way. This efficiency is necessary to estimate entropies in high-dimensional
spaces from limited real data [30, 70]. Nearest-neighbour estimators are as local
as possible given the available data, and can be thought of as ’variable-width’
kernel-estimators. The assumption behind nearest-neighbour estimators is only a
certain smoothness of the underlying probability distribution. Nearest-neighbour
estimators can therefore be considered as essentially non-parametric techniques,
as desired for a model-free approach to transfer entropy estimation. While the
number of neighbours to consider in the estimation process is indeed a remain-
ing parameter, results are typically relatively robust with respect to reasonable
variations in this parameter.
Of these possibilities, the first two are fast to compute but have unfavourable bias
properties [31], and may even reverse the estimated direction of information flows
[23]. Hence, in the remainder of the text we will present only estimators based on
nearest neighbour techniques.
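As an illustration of the nearest-neighbour idea, here is a minimal sketch of the Kozachenko-Leonenko entropy estimator [30] (Python, maximum norm, in the form discussed by Kraskov and colleagues [31]; the function name is ours):

import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def kl_entropy(points, k=4):
    # Kozachenko-Leonenko estimate of differential entropy (in nats) from an (N, d) array,
    # using the maximum norm, for which the unit-ball volume term vanishes
    pts = np.asarray(points, dtype=float)
    n, d = pts.shape
    tree = cKDTree(pts)
    # distance to the k-th neighbour; k+1 because the query point itself is returned first
    dist = tree.query(pts, k=k + 1, p=np.inf)[0][:, -1]
    eps = 2.0 * dist   # epsilon is the full 'diameter' in Kraskov's convention
    return digamma(n) - digamma(k) + d * np.mean(np.log(eps))

# sanity check: the entropy of a 2-D standard Gaussian is log(2*pi*e), about 2.84 nats
rng = np.random.default_rng(0)
print(kl_entropy(rng.standard_normal((5000, 2))))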
Unfortunately, it is problematic to estimate TE by simply applying a naive
nearest-neighbour estimator, such as the Kozachenko-Leonenko estimator [30], sep-
arately to each of the terms appearing in equation 12. The reason is that the dimen-
sionality of the spaces involved in equation 12 can differ greatly across terms. Thus,
fixing a given number of neighbours for the search will set very different spatial
scales (range of distances) for each term [31]. Since the error bias of each term is
dependent on these scales, the errors would not cancel each other but accumulate.
The Kraskov-Stögbauer-Grassberger estimator handles this problem by only fixing
the number of neighbours k in the highest dimensional space and by projecting the
resulting distances to the lower dimensional spaces as the range to look for neigh-
bours there [31]. After adapting this technique to the TE formula [18], the suggested
estimator can be written as

  
\[
TE(X \to Y, u) = \psi(k) + \Big\langle \psi\big(n_{\mathbf{y}^{d_y}_{t-1}} + 1\big) - \psi\big(n_{y_t \mathbf{y}^{d_y}_{t-1}} + 1\big) - \psi\big(n_{\mathbf{y}^{d_y}_{t-1} \mathbf{x}^{d_x}_{t-u}} + 1\big) \Big\rangle_t \,, \qquad (13)
\]

or, following the second suggestion by Kraskov [31] as:



\[
TE(X \to Y, u) = \psi(k) - \frac{2}{k} + \Big\langle \psi\big(n_{\mathbf{y}^{d_y}_{t-1}}\big) - \psi\Big(n_{y_t \mathbf{y}^{d_y}_{t-1}} + \tfrac{1}{n_{y_t \mathbf{y}^{d_y}_{t-1}}}\Big) - \psi\Big(n_{\mathbf{y}^{d_y}_{t-1} \mathbf{x}^{d_x}_{t-u}} + \tfrac{1}{n_{\mathbf{y}^{d_y}_{t-1} \mathbf{x}^{d_x}_{t-u}}}\Big) \Big\rangle_t \,, \qquad (14)
\]

where ψ denotes the digamma function, while the angle brackets ⟨·⟩_t indicate an
averaging over different time points. The distances to the k-th nearest neighbour
in the highest dimensional space (spanned by $y_t$, $\mathbf{y}^{d_y}_{t-1}$, $\mathbf{x}^{d_x}_{t-u}$) define the diameter of
the hypercubes (or rectangles, for eq. 14) for the counting of the number of points
n(·) that are (1) strictly inside these hypercubes (equation 13), or (2) inside or on the
borders of the hyper-rectangles (equation 14) around each state vector in all the
marginal spaces (·) involved. Equation 14 yields an estimator that is thought to be
more precise when very large sample sizes are available, whereas equation 13 yields
an estimator that is more robust when only small sample sizes are available, but has
more bias. Since bias problems can be handled based on surrogate data techniques
(see section 4.1), in neuroscience equation 13 seems to be the generally preferred
option.
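The following sketch implements the estimator of equation 13 directly (Python, maximum norm; all function and variable names are ours). It is meant only to illustrate the estimation principle and is not a substitute for the optimized implementations in the toolboxes discussed in section 3.5:

import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def te_ksg(x, y, dy=2, dx=2, tau=1, u=1, k=4):
    # Kraskov-style nearest-neighbour estimate of TE_SPO(X -> Y, u), eq. (13), in nats
    def embed(sig, d, last):
        # state (sig[last], sig[last - tau], ..., sig[last - (d - 1) * tau])
        return [sig[last - j * tau] for j in range(d)]

    first = max(1 + (dy - 1) * tau, u + (dx - 1) * tau)  # earliest t with full histories
    yt, ypast, xpast = [], [], []
    for t in range(first, len(y)):
        yt.append([y[t]])
        ypast.append(embed(y, dy, t - 1))
        xpast.append(embed(x, dx, t - u))
    yt, ypast, xpast = (np.asarray(v, dtype=float) for v in (yt, ypast, xpast))

    # k-th neighbour distance in the joint space (y_t, y-state, x-state), max norm
    joint = np.hstack((yt, ypast, xpast))
    eps = cKDTree(joint).query(joint, k=k + 1, p=np.inf)[0][:, -1]

    def count(space):
        # number of points strictly within eps in a marginal space (max norm)
        tree = cKDTree(space)
        return np.array([len(tree.query_ball_point(pt, r, p=np.inf)) - 1
                         for pt, r in zip(space, eps * (1.0 - 1e-12))])

    n_ypast    = count(ypast)
    n_yt_ypast = count(np.hstack((yt, ypast)))
    n_ypast_x  = count(np.hstack((ypast, xpast)))
    return digamma(k) + np.mean(digamma(n_ypast + 1) - digamma(n_yt_ypast + 1)
                                - digamma(n_ypast_x + 1))

# toy test: y is driven by x with a lag of two samples
rng = np.random.default_rng(1)
x = rng.standard_normal(3000)
y = np.zeros_like(x)
for t in range(2, len(x)):
    y[t] = 0.6 * y[t - 1] + 0.8 * x[t - 2] + 0.3 * rng.standard_normal()
print(te_ksg(x, y, u=2), te_ksg(y, x, u=2))   # the first value should be clearly larger

In the toy test, the residual non-zero value in the uncoupled direction illustrates the estimator bias that the surrogate-data statistics of section 4.1 are designed to handle.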

3.3 A Graphical Summary of the TE Principle


After having introduced states, state space reconstruction, and TE estimation
techniques, we are now in a position to give a brief graphical summary of TE es-
timation. Figure 2 illustrates the main ideas introduced so far. Two time series are
obtained experimentally, and are interpreted as realizations of two random processes
X,Y (Figure 2,A). These processes are assumed to be stationary for the purpose of
illustration. For each time point we now obtain a data point in the full embedding
space by embedding and reconstructing the states of each time series: xt−u (samples
at the black circles in the grey box on time series X in Figure 2,A), yt−1 (samples
at the black circles in the grey box on time series Y ), and the current sample yt
(star symbol on time series Y , labelled ’prediction point’). All the amplitude val-
ues in the state-vectors are now combined into a vector that indicates a point in the
full embedding space (Figure 2,B; the dashed lines show how the amplitude values
at xt−u ,yt−1 ,yt end up as coordinates in a 3 dimensional rendering of the full em-
bedding space). From this joint embedding space the conditional probability
distributions $p(y_t \mid \mathbf{y}^{d_y}_{t-1}, \mathbf{x}^{d_x}_{t-u})$ in the numerator of equation 11, which are necessary
for TE estimation, can now be obtained (Figure 2,C; the black columns indicate
approximate bins used to estimate the conditional probabilities shown in the left part
of this subfigure). As the TE functional also involves conditional probability
distributions that are independent of the source process X, the full embedding space is
[Figure 2 appears here. Panels: (A) source and target time series with embedded states and the prediction point; (B) the resulting delay-embedding space spanned by xt−u, yt−1, yt; (C) conditional distributions p(yt |yt−1, xt−u) obtained from the joint distribution; (D) the source-independent distribution p(yt |yt−1); (E) comparison of the two. See the caption below.]

Fig. 2 Central TE concepts. (A) Coupled systems X → Y. To quantify TE(X → Y) we
predict a future y (star) once from past values (circles) of Y, and once from past values of X and
Y. d is the number of past time steps used for prediction (embedding dimension, see sec-
tion 3.1), τ is the time span between two past time points used for prediction (embedding
lag). (B) Embedding. yt, yt−1, xt−u are coordinates in the embedding space; repetition of the em-
bedding for all t gives an estimate of the probability p(yt, yt−1, xt−u) (part C, d limited to
1). (C) p(yt |yt−1, xt−u) is the probability to observe yt after yt−1 and xt−u were observed. This
probability enables prediction of yt from yt−1 and xt−u. Here, p(yt |yt−1, xt−u) is obtained by
a binning approach for illustration: we compute p(yt ± Δ, yt−1 ± Δ, xt−u ± Δ), let Δ → 0 and
normalize by p(yt−1, xt−u). (D) p(yt |yt−1) predicts yt from yt−1, without xt−u. (E) If xt−u
is irrelevant, the conditional distributions p(yt |yt−1, xt−u) should all be equal to p(yt |yt−1).
Log-differences indicate information transfer; their weighted sum is TE. Modified from [36],
Creative Commons Attribution (CC BY) license. Modified figure courtesy of C. Stawowsky.

“flattened” by simply ignoring the x-related coordinates of all points (Figure 2,D),
and the distribution $p(y_t \mid \mathbf{y}^{d_y}_{t-1})$ is obtained (again shown for a binning approach).
Last, the obtained x-dependent conditional distributions (Figure 2,E) are compared
to the x-independent distribution (Figure 2,D; distribution on the left of subfigure).


This process is repeated for all values of yt , and all the log-differences are summed
to obtain TE as per equation 11.

3.4 Information Transfer Delay Estimation


The function of complex systems in the world around us, such as traffic systems,
gene regulatory networks, or neural circuits can often be only understood if we
identify the pattern of information transfers in the network. As information transfer
is necessarily coupled to a physical interaction (the reverse does not always hold -
see section 2.4.1), there will always be a certain finite time delay δ involved in the
information transfer. These delays influence network function, because the correct
function of the network depends on the information being received at the right point
in time, as much as on information being transferred at all. Hence, the delays of
information transfers are critical for network function. In neuroscience, information
transfer delays arise mainly due to propagation of action potentials (‘spikes’) along
axonal processes and can amount to several tens of milliseconds. The presence of
axonal delays is of particular importance for the coordination of neural activity be-
cause they add an intrinsic component to the relative timing between spikes. For
example, two neurons projecting to a downstream neuron will be observed to spike
simultaneously by this downstream neuron only when their relative timing of spikes
compensates the difference in their axonal delays and in the dendritic delays to
the soma of the target neuron. Temporally coordinated input to neurons, in turn, is
thought to be critical for a variety of neural phenomena, e.g. synchronization [20],
Hebbian learning [26], or spike-timing dependent plasticity. Indeed, disruption of co-
ordinated activity by the pathological modification of axonal delays is thought to
account for some deficits in diseases such as multiple sclerosis [15], schizophre-
nia [71], and autism [64]. Thus, the estimation of information transfer delays from
multichannel brain recordings seems to be necessary to understand the distributed
computation that neural networks perform.
The estimation of an information transfer delay δ between two processes X and
Y is possible using the TE estimator from equation 13, by scanning the delay pa-
rameter u in the estimator.
A mathematical proof in [72] shows that the delay parameter u that results in the
maximal TE value in equation 13 is identical to the true information transfer delay
δ between the two processes X and Y (see figure 3 for an intuitive representation of
the main idea):

\[
\delta = \arg\max_{u} \big( TE_{SPO}(X \to Y, u) \big) \,. \qquad (15)
\]

Given enough data, this estimation of the information transfer delay δ works
robustly, and can even separate out differential delays for the two directions of
transfer between two bidirectionally coupled systems (figure 4) – as long as the

Fig. 3 Illustration of the idea behind interaction delay reconstruction using the TESPO
estimator. (A) Scalar time courses of processes X,Y coupled X → Y with delay δ , as indi-
cated by the solid arrow. Light grey boxes with circles indicate data belonging to a certain
state of the respective process. The star on the Y time series indicates the scalar observation
y(t) to be predicted in Wiener’s sense. Three settings for the delay parameter u are depicted:
(1) u < δ – u is chosen such that influences of the state X(t − u1 ) on Y arrive in the future
of the prediction point. Hence, the information in this state is useless and yields no transfer
entropy. (2) u = δ – u is chosen such that influences of the state X(t − u2 ) arrive exactly at
the prediction point, and influence it. Information about this state is useful, and we obtain
non-zero transfer entropy. (3) u > δ – u is chosen such that influences of the state X(t − u3 )
arrive in the far past of prediction point. This information is already available in the past of
the states of Y that we condition upon in T ESPO . Information about this state is useless again,
and we obtain zero transfer entropy. (B) Depiction of the same idea in a more detailed view,
depicting states (grey boxes) of X and the samples of the most informative state (black cir-
cles) and non-informative states (white circles). The curve in the left column indicates
the approximate dependency of T ESPO versus u. The solid black circles on the curves on the
left indicate the TE value obtained with the respective states on the right. Modified from [72].
Creative Commons Attribution (CC-BY) license.

Fig. 4 Interaction delay reconstruction between a pair of bidirectionally coupled Lorenz


systems. Transfer entropy (T ESPO ) values and significance as a function of the assumed delay
u for two bidirectionally coupled, chaotic Lorenz systems with non-identical parameters. The
simulated delays were δXY = 45 and δY X = 75, and the coupling constants were γXY = γY X =
0.1. The delays were recovered as δ̂XY = 46 and δ̂Y X = 76. Modified from [72]. Creative
Commons Attribution (CC-BY) license.

bidirectional coupling does not lead to full synchronization. The reconstruction of


information transfer delays in LFP data has been demonstrated with a precision of
approximately 4% [72]. If all information transfer delays are known, the graph of
all transfers, weighted by the respective delays, can be used as the basis for a graph-
theoretical removal of cascade and common driver effects [74] (see below).
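Procedurally, the scan of equation 15 amounts to the following sketch (Python; te_estimator is a placeholder for any TE estimator with an adjustable delay parameter, for example the te_ksg sketch of section 3.2):

import numpy as np

def reconstruct_delay(te_estimator, x, y, u_range):
    # scan the assumed delay u and return the maximizer of TE (eq. 15) plus the full curve
    te_values = np.array([te_estimator(x, y, u) for u in u_range])
    return list(u_range)[int(np.argmax(te_values))], te_values

# usage, with the te_ksg sketch from section 3.2:
#   u_hat, curve = reconstruct_delay(lambda a, b, u: te_ksg(a, b, u=u), x, y, range(1, 11))

In practice, each TE value in the scan should additionally be tested for statistical significance against surrogate data (section 4.1) before the maximum is interpreted.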

3.5 Practical TE Estimation and Open Source Tools


Practical TE estimation is a demanding task, but established toolboxes, as introduced
below, can simplify it considerably. TE analysis is demanding because several relatively
complex analysis steps, such as state space reconstruction, TE estimation, surrogate
data creation, and statistical testing have to be carried out in sequence. Failure to
correctly perform any one of these steps typically yields meaningless results. Prac-
tical transfer entropy estimation is also demanding because of high computational
cost, especially when using data-efficient kernel or nearest-neighbour based estimators
– as recommended for continuous processes. The main computational cost comes
from the underlying search for nearest neighbours in the reconstructed phase space. For
nearest-neighbour estimators and typical multichannel neuroscience data, estimation
is virtually impossible using naive approaches to finding nearest neighbours. Several
fast neighbour search algorithms exist: For single-threaded CPU applications the
algorithm implemented in TSTOOL [51] seems to be the fastest algorithm for the
neighbourhood structure found in neural data. In addition, the computational prob-
lem of transfer entropy estimation is trivially parallel in terms of the pairs of source
and target nodes for bivariate TE analysis of multi-node networks and parallel in the
number of nodes for a greedy multivariate approach. This parallelism can be eas-
ily exploited on a cluster. Moreover, efficient GPU-based algorithms for neighbour

Fig. 5 Interaction delay reconstruction in the turtle brain. (A) Electroretinogram (green),
and LFP recordings (blue), light pulses are marked by yellow boxes. (B) Schematic depiction
of stimulation and recording, including the investigated interactions and the identified delays.
Modified from [72], Creative Commons Attribution (CC-BY) license.

search have been developed recently (see [75], and http://www.trentool.de). Open
source toolboxes that already include these algorithms offer an elegant way to save
on coding work here, and typically provide code that is tested thoroughly.
Toolboxes differ in what type of data they can handle (discrete or continuous
valued), how they deal with multivariate time series in the input to avoid the detec-
tion of spurious information transfer (approximate algorithms vs. fully multivariate
treatment), which estimators are implemented (binned, kernel, Kraskov), how efficient
their implementation is (algorithms for nearest neighbour search, parallel computing
on GPU or CPU), what preprocessing tools they offer for state space reconstruction,
and how flexibly the creation of surrogate data and statistical testing is handled.
At the time of writing the most established toolboxes for TE analysis of
neural data seem to be TRENTOOL (www.trentool.de) [36], a MATLAB® toolbox,
the transfer-entropy-toolbox (TET) (http://code.google.com/p/transfer-entropy-
toolbox/) [27], which provides C-code callable from MATLAB® (mex-files), and
the Java Information Dynamics Toolkit (JIDT, http://code.google.com/p/information-dynamics-toolkit/) [41].
• TRENTOOL is aimed at the analysis of analogue neural data3 , and offers
automatic state space reconstruction [69], estimation of information transfer
delays [72], non-parametric significance testing [36], and a variety of sur-
rogate data creation algorithms. Estimation is done based on the Kraskov-
Stögbauer-Grassberger estimator and efficient nearest neighbour search is imple-
mented via TSTOOL (http://www.physik3.gwdg.de/tstool/), or
fast GPU based algorithms [75]. Furthermore, parallel computation is possible
via Mathworks parallel computing toolbox. TRENTOOL’s input data format is
compatible with the format used by the open source analysis toolbox FieldTrip
[54] for electrophysiological data, e.g. from EEG, MEG and local field potential
(LFP) recordings. As volume conduction (in EEG, LFP) and field spread (in MEG)
are a serious problem for the analysis of information transfer in these data
[53], TRENTOOL offers algorithms both for the detection of volume conduc-
tion [36] and for the removal of its influence [14]. A fully multivariate analysis is
not possible, but a graph-based algorithm to remove or label cascade and simple
common driver effects that can lead to the detection of spurious links is present
in the toolbox. TRENTOOL is available in version 3.0 at the time of writing and
is a relatively mature toolbox tested on hundreds of data sets.
• TET is aimed at the analysis of binary spiking data, offers transfer entropy esti-
mation and delay reconstruction and can operate on embedding states (i.e. past
firing pattern of a certain length), but embedding parameters have to be supplied
by the user, and the embedding delay τ seems to be fixed at one sample at the
time of writing. As the toolbox is intended for binary data only, binned estima-
tion strategies can be used, and the computation is in general very fast because
of this. No statistical routines for surrogate data creation and significance testing
are provided at the moment.
• JIDT can handle discrete as well as continuous process data, and for certain types
of computations even joint variables of mixed type are allowed. JIDT offers a
wide range of estimators for information theoretic analysis, such as mutual infor-
mation, conditional mutual information, transfer entropy, and active information
storage. Since the conditional mutual information can be calculated on multi-
variate data, approximate multivariate transfer entropy calculation following the
algorithm described in [46] is possible. As a unique outstanding feature, JIDT
offers the computation of local measures instead of average ones. These local
measures compare to average ones the way Shannon Information compares to
Shannon Entropy. The analysis of local measures makes it possible to follow the
dynamics of information processing in a system [37] – for more details see the
chapter of Lizier in this book. JIDT provides basic routines for embedding and
statistical testing. While the toolkit is written in Java, the routines can be
conveniently called from within MATLAB®, from the open source

3 Spiking data can be analysed after convolution with a kernel modelling post-synaptic
potentials.

software GNU Octave (http://www.gnu.org/software/octave/), or from Python, and examples for these uses are provided on the toolkit’s website.
Both TRENTOOL and JIDT are licensed under GPL v3; TET is licensed under the
new BSD license. This makes code reuse in other open-source projects possible.

4 Common Problems and Solutions

4.1 Statistical Testing to Overcome Bias and Variance Problems


Finite data from two time series are almost never entirely uncorrelated, simply be-
cause they are finite. Accordingly, TE estimators often evaluate to non-zero TE val-
ues for finite data even in the absence of information transfer. This phenomenon is
called the bias of the estimator. Depending on the type of estimator (see section 3.2)
this bias may be small or large, and typically there is a trade off between systematic
errors (bias) of an estimator and variance of its random fluctuations observed over
different realizations of those time series. To see if a specific estimated TE value
is non-zero because of correlations due to finite data and bias in the estimator or
because of ’real’ information transfer, we have to estimate the expected values of
TE for finite data that are as close as possible to the original data but have no in-
formation transfer. These data are called ’surrogate data’ or ’surrogates’. Creating
surrogate data for TE bias correction requires them to have the same finite length
and (at least) the same autocorrelation properties, while at the same time the sur-
rogates should be guaranteed to have no predictive information transfer. This can
be achieved by destroying the temporal precedence structure between the source
and the target processes that would underlie a potential predictive informa-
tion transfer in the original data. For two processes X and Y, and a transfer entropy
estimator T E(X → Y) this can be achieved by resampling of the source time series
xt [8]. For continuous time series, e.g. from a recording of resting state neural activ-
ity, this can be done by randomly sampling blocks of data from the original source
time series and composing a surrogate source time series from them. However, a
better approach may be to randomly assign a cut point in the source time series xt
and to exchange the resulting two data pieces. This approach is preferable because
using more than one cut point will potentially make the surrogate data more sta-
tionary than the original data, adding another source of bias. In a TE analysis based
on epochs, surrogate data can be created by exchanging whole epochs of the source
time series [36]. A good strategy for the latter approach is to swap neighboring odd
and even epochs to preserve drifts in the data that may be present at longer time
scales.
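The two surrogate-creation strategies just described can be sketched as follows (Python; the function names are ours, and the epoch-wise variant assumes the source data are arranged as an epochs-by-samples array):

import numpy as np

def cut_surrogate(x, rng=None):
    # for continuous recordings: cut the source time series at one random point and swap
    # the two pieces; length and (most of the) autocorrelation are preserved, while the
    # temporal precedence structure relative to the target is destroyed
    rng = np.random.default_rng() if rng is None else rng
    cut = int(rng.integers(1, len(x) - 1))
    return np.concatenate((x[cut:], x[:cut]))

def epoch_swap_surrogate(x_epochs):
    # for epoched data: swap neighbouring odd and even source epochs (first axis),
    # preserving slow drifts across the recording
    x_sur = np.array(x_epochs, copy=True)
    for i in range(0, len(x_sur) - 1, 2):
        x_sur[[i, i + 1]] = x_sur[[i + 1, i]]
    return x_sur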
For a statistical verification that non-zero TE values are not due to bias we have
to compare the original (distribution of) TE value(s) to a distribution of TE values
from surrogate data.
For epoched data, this is typically done by computing TE values for the set of
epochs of the original data and for the set of epochs of one surrogate data set, yield-
ing two sets of TE values. As we cannot assume normality of the distribution of
the TE values in either dataset the statistical comparison has to be non-parametric,


i.e. via permutation testing. This approach is computationally efficient as TE values
have to be computed only twice for the amount of data at hand – once for the
original epochs, once for the surrogate epochs. For more details see for example
[36].
The computation of multiple original TE values for epochs of the original data
as described above may not always be possible, e.g. because there were no epochs
in the data or because data from all epochs were put into a joint phase space to use
ensemble-average methods [18]. In this case, we can create multiple surrogate data
sets to obtain a reference distribution of TE values from the surrogates and test at
what percentile the TE value from the original data is found. This approach is well
suited to the analysis of cyclostationary data – a common type of nonstationary data
from neuroscience experiments that are based on stereotypically repeated stimuli.
However, this approach is computationally heavier than the epoch based approach,
because the TE values for the surrogate data have to be computed for many sets of
surrogates – typically several thousand – instead of just one set as above.
Statistical testing is also necessary as the outcome of TE estimation from a fi-
nite sample of data can be considered a random variable in itself, having a certain
data-size dependent variance. This variance increases with diminishing length of the
available data, rendering transfer entropy values unreliable. Again, this necessitates
statistical testing to allow a sound interpretation of the TE values obtained.

4.2 Multivariate TE and Approximation Techniques


It is well known that a pairwise analysis of information transfers in a system with
multiple interacting processes leads to several problems:
1. Spurious information flows are picked up due to cascade effects, where an infor-
mation transfer from a source A to an intermediate target B and further on to a
final target C will also be seen as information transfer from A to C, directly. This
problem is particularly severe when the intermediate process B neither computes
new information to send on, nor adds much information on its own. In networks
composed of nodes with dynamics that have a high information rate, this prob-
lem is reduced (see the three Lorenz systems coupled into a ring in [72]) but may
still be present.
2. Spurious information flows are picked up due to common driver effects, where
a single source process A transfers information to two target processes B and
C, albeit with a different delay of the information transfer, e.g. such that the
information from A arrives first at C and then at B. In this case, information
from C is predictive for information arriving at B, and non-zero bivariate transfer
entropy is observed from C to B.
3. Pairwise analysis must miss out on synergistic information transfer, where infor-
mation from two sources A, B is combined in a nontrivial way, e.g. an XOR
function for binary data, before being transferred into the target C. For the
XOR example above and for memoryless random processes A,B, each pairwise
transfer entropy T E(A → C) and T E(B → C) is zero, while the transfer entropy
T E(A, B → C) from the joint process A, B to the target C is non-zero.
While the first two problems are widely recognized, the last problem seems to
be less well known, potentially because synergies and redundancies were defined
in various ways in the past and a satisfactory axiomatic definition of synergies and
redundancies has only emerged recently [77, 39, 21, 25].
To address the first and second problem, it was proposed to reconstruct the timing
of information transfers in a bivariate analysis [72] and to then identify cascade and
common driver effects based on their signature in the graph of delays - for cascade
effects the spurious link has a delay that is equal to the sum of delays on the true
path; for common driver effects, the difference of the summed delays on the driving
paths is equal to the delay of the spurious link [74]. If a link meets neither of these
two conditions it cannot be due to cascade or common driver effects.
To address the third problem, Lizier and colleagues proposed an ap-
proximate greedy approach to a fully multivariate analysis [46]. This approach tries
recursively to find for each target node all source nodes, or combinations thereof,
that have significant information transfer into that target node – conditional on the
information provided by other nodes with significant information transfer that have
already been included. The approach also solves the first and second problem. It is
an approximation to a fully multivariate approach. In a fully multivariate approach,
TE would be evaluated for each pair of source and target, conditioned on the past of
all other processes in the network. In practice, however, this ’approximation’ even
yields more accurate results than the fully multivariate approach, because it is more
data efficient and therefore more robust on small sample sizes.
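In pseudocode form, the greedy scheme can be sketched as follows (Python; cond_te and is_significant stand for a conditional transfer entropy estimator and a surrogate-based significance test, both assumed to be supplied by the user; details of the published algorithm, such as the selection of individual past variables and final pruning steps, are omitted):

def greedy_source_selection(target, candidates, cond_te, is_significant):
    # iteratively add the candidate source with the largest TE into the target,
    # conditioned on the sources selected so far, and stop when no remaining
    # candidate contributes significant transfer
    selected = []
    remaining = list(candidates)
    while remaining:
        scores = {s: cond_te(s, target, selected) for s in remaining}
        best = max(scores, key=scores.get)
        if not is_significant(best, target, selected):
            break
        selected.append(best)
        remaining.remove(best)
    return selected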
More in-depth treatments of multivariate transfer entropy can be found in the
chapters by Lizier and Faes in this book.

4.3 Observation Noise


Real world data are typically not free of observation noise, and it is fair to ask to
what extent we should expect transfer entropy estimation to suffer from observation
noise. To answer this question we start by noting that the random variables of the
source process X and the target process Y are observed under added noise from
processes N(X) and N(Y ) as X̃t and Ỹt :
\[
\tilde{X}_t = X_t + N^{(X)}_t \qquad (16)
\]
\[
\tilde{Y}_t = Y_t + N^{(Y)}_t \qquad (17)
\]

where we assume that N^{(X)} and N^{(Y)} are statistically independent noise processes.
The most important practical problem arising from observation noise is that Markov
processes X and Y are transformed into hidden Markov processes X̃ and Ỹ of which
the states are not easily reconstructed. Without properly reconstructed states, how-
ever, transfer entropy estimation may fail or produce spurious results (e.g. [69]). A
proper analytical treatment of transfer entropy on noisy variables is hampered by the
fact that the Shannon entropy of the sum of two variables, H(X_t + N^{(X)}_t), cannot be
decomposed into terms containing just one or the other variable. In fact, the entropy
H(X + Y ) for two random variables can be infinite or zero, even if both entropies
H(X) and H(Y ) exist and are finite [33].
In the absence of analytical approaches to the problem, simulation studies
must demonstrate the applicability of TE estimation. Indeed, it was shown that both
transfer entropy estimation and the reconstruction of information transfer delays are
quite robust under Gaussian white noise [72, 69, 36]. Nevertheless, simulations for
other typical (neuro-)physiological noise profiles seem warranted.

4.4 Stationarity and Ensemble Methods


The basic definition of transfer entropy in equation 12 does not require the processes
X, Y – which yield the random variables X^−, Y^−, Y^+ – to be stationary, as long as we
can obtain the probability distributions of these variables to compute their condi-
tional mutual information. Samples from the random variables could for example
come from running multiple identical copies of the processes involved, which are all
observed at time t so that we get multiple realizations of X− , Y− , Y + at a single
time point t. Indeed, this is the most fundamental way of evaluating equation 12.
It is only when multiple realizations of a random variable are not obtainable that
we have to resort to other methods of evaluation. The closest alternative to mul-
tiple copies of the processes in question are identical repeats of a process over
time – in neuroscience this would typically take the form of experimental trials and
require the assumption that a process is repeatable experimentally. Still, only this
repeatability is required, not stationarity of the process itself, as again multiple
realizations of the variables in the processes, e.g. Xt , can be obtained: xt (n1 ), xt (n2 ),
. . . , where ni is the trial number, and we assume that the random variables Xt (n1 ),
Xt (n2 ), . . . , which are evaluated at the same time in the trial (“t”) but at different abso-
lute physical times t(ni ), all have the same probability distribution. If this is the case,
the process X is called cyclostationary. Efficient methods for the computation of TE
from cyclostationary processes exist [18, 75]. Given the trial structure often found in
neuroscience experiments, these methods are ideal to handle the nonstationarities in
neural dynamics as long as these nonstationarities are repeatable. Only when there
is no trial structure in the data, or not enough trials are available for a proper estima-
tion of probabilities, do we have to resort to the assumption that all random variables
X1 , X2 , . . . ;Y1 ,Y2 , . . . that form the random processes X,Y have essentially identical
probability distributions, i.e. that the processes are stationary. In this case the sum
in our estimators runs over all points in time (equation 13). In practice, ensemble
methods can be mixed with temporal averaging, if the cyclostationary process is
approximately stationary over short time intervals within the trial [18].
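The ensemble idea amounts to a simple re-arrangement of the data before estimation: at a fixed within-trial time t, one realization of (y_t, past state of Y, past state of X) is collected from every trial, and a nearest-neighbour estimator such as the one in equation 13 is then applied across trials rather than across time [18]. A minimal sketch of this re-arrangement (Python; names are ours, embedding as in section 3.1):

import numpy as np

def ensemble_points(x_trials, y_trials, t, dy=2, dx=2, tau=1, u=1):
    # collect one (y_t, y-state, x-state) realization per trial at a fixed within-trial
    # time t, for trial-based (ensemble) TE estimation of cyclostationary data;
    # x_trials, y_trials have shape (n_trials, n_samples), and t must be large enough
    # for the embeddings to fit
    yt    = y_trials[:, [t]]
    ypast = np.stack([y_trials[:, t - 1 - j * tau] for j in range(dy)], axis=1)
    xpast = np.stack([x_trials[:, t - u - j * tau] for j in range(dx)], axis=1)
    return yt, ypast, xpast   # one row per trial; feed these to a CMI/TE estimator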

5 Relation to Other Directed Information Measures


TE is not the only measure that was proposed for measuring directed information
relations between random processes. However, TE seems to emerge as the measure
most suitable for the analysis of distributed computation and hence for the anal-
ysis of neural information processing. In the next sections we will review several
alternative measures and describe their similarities with and their differences to TE.

5.1 Time-Lagged Mutual Information


One fundamental principle behind transfer entropy is the sequential observation of
information first in a source process, followed by its observation in a target process.
For this information, Wiener’s principle mandates that the information in the past of
the source should not be present in the past of the target as well – otherwise there
will be no information transfer. Nevertheless, we may ask what is lost if we drop
this second condition of the Wiener principle – and simply ask for the information
shared between the past of a source process and the future of a target process, i.e. in
the form of a time-lagged mutual information:

p(xt−u , yt )
I(Xt−u ,Yt ) = ∑ p(xt−u , yt ) log
p(xt−u ), p(yt )
(18)
xt−u ,yt

What exactly is lost if we simply consider this measure of shared information, where
the mutual information is not conditional on the past of the target process as in TE
(equation 8)?
One answer to this question was already given by Schreiber in the initial paper
that introduced TE [60]. Schreiber pointed out that the additional conditioning on
the past of the target that is included in TE, but not in the time-lagged mutual in-
formation, creates a measure of the influence of the past of the source process on
the state-transitions occurring in the target process. This adds a dynamical systems
aspect to the measure [60]. This dynamical systems aspect is also closely related
to the notion of state-dependent influences from control theory as pointed out by
Williams and Beer [78].
In more detail, only conditioning on the past of the target reveals synergistic in-
formation transfer from the past of the target and the source jointly to the future of
the target (state dependent transfer entropy, [78]) and also removes redundant
information between the past of source and target (see section 2.4.2 for more de-
tails). As a consequence, it is easily possible to construct two processes X, Y , such
that the time-lagged mutual information between them is always zero whereas there
is non-zero TE. For example, we may choose the source process X as being com-
posed of random variables Xt that are independent identically distributed random
bits, and to construct Y such that Y0 is also a random bit, whereas all other Yt are con-
structed such that an their outcomes are determined by an exclusive OR-operation:
yt = XOR(xt−1 , yt−1 ). In this example it can be easily verified that I(Xt−u ,Yt ) = 0 for
all time-lags u, whereas T ESPO (X → Y, 1) = 1 bit.
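This claim is easy to check numerically with plug-in estimates on simulated data (Python; a minimal sketch with dx = dy = 1 and u = 1):

import numpy as np
from collections import Counter

def entropy_bits(symbols):
    counts = np.array(list(Counter(symbols).values()), dtype=float)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

rng = np.random.default_rng(0)
n = 100_000
x = rng.integers(0, 2, n)
y = np.zeros(n, dtype=int)
y[0] = rng.integers(0, 2)
for t in range(1, n):
    y[t] = x[t - 1] ^ y[t - 1]          # y_t = XOR(x_{t-1}, y_{t-1})

# time-lagged mutual information I(X_{t-1}, Y_t) = H(X) + H(Y) - H(X, Y)
mi = (entropy_bits(x[:-1]) + entropy_bits(y[1:])
      - entropy_bits(list(zip(x[:-1], y[1:]))))
# TE_SPO(X -> Y, 1) = I(Y_t; X_{t-1} | Y_{t-1}) via the four-entropy form of eq. (12)
te = (entropy_bits(list(zip(y[:-1], x[:-1])))
      - entropy_bits(list(zip(y[1:], y[:-1], x[:-1])))
      + entropy_bits(list(zip(y[1:], y[:-1])))
      - entropy_bits(y[:-1]))
print(round(mi, 3), round(te, 3))       # approximately 0.0 bits and 1.0 bits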

Last, conditioning on the past of the target process is necessary to separate in-
formation transfer and information storage in the sense of component processes of
distributed computation (see section 2.4.1, and the chapter by Lizier in this book for
more details).

5.2 Transfer Entropy and Massey’s Directed Information


A measure closely related to transfer entropy is Massey’s directed information (DI).
DI is often unnecessarily confused with, or seen as an improvement on, TE. There-
fore, we quickly describe their relation here, and also hint at the differences between
information-theoretic and physical indexing that are at the source of much of this
confusion. A reader not interested in DI may safely skip ahead to the next section.
DI was originally defined by Massey [50] (based on earlier work by Marko [48])
as:
\[
DI(X^N \to Y^N) = \sum_{n=1}^{N} I(X^n; Y_n \mid Y^{n-1}) \qquad (19)
\]

where X^N and Y^N are joint ensembles (N-tuples) of N random variables, i.e.
Xi , i = 1 . . . N, X^N = [X1 , . . . , XN ], where Xk is the k-th variable of sequence X^N .
Note that this definition, in particular the mutual information inside the sum,
requires that we can obtain multiple realizations from these joint ensembles,
i.e. x^N (r) = [x1 (r), . . . , xN (r)] with r = 1 . . . M for empirical estimation from M re-
alizations. Only for random variables that form part of a stationary random process
with finite memory (finite Markov processes) can these realizations be replaced by
averaging over realizations at subsequent indices to obtain the mutual information
estimate. In practice, therefore, stationarity is required for DI analyses.
To understand DI correctly, it is important to know why Massey introduced it.
Massey had in mind a system composed of a signal source that would produce mes-
sages (symbol sequences) uK of the joint random variable U K , independent of any
transmission process. These uK would then be encoded to yield encoded messages
xN that were sent over the channel to a receiver and yield received messages yN ,
to be decoded into V K . The system X → Y is called a channel with input X N and
output Y N in information theory. Massey was considering the particular case where
the encoded messages were fed into the channel symbol by symbol, i.e. x1 , x2 , . . . ,
and where the corresponding received message symbols y1 , y2 , . . . would be able to
change the encoding process for the respective next symbol xn+1 of the en-
coder (’feedback’). In this system the following dependencies hold a priori:
• U K is not influenced by any other source of influence.
• U K as a whole has influence on the encoded message X N .
• An encoded symbol xn at index n is sent and influences the received symbol yn
with the same index. This is called transmission through the channel.
• The received symbol yn may influence the encoding for the next source symbol
xn+1 . If this is the case, we say the channel has feedback.

For this system with a known and highly specific causal structure, Massey wanted
to find a more precise bound for the information that could be transmitted through
this channel when feedback was present, because information theory had not con-
sidered feedback as part of a communication channel correctly before [50]. While
for the channel without feedback the mutual information between input and output
I(X N ;Y N ) is an upper bound for the information that can be transmitted, I(U K ;V K ),
Massey could show that in the presence of feedback a tighter limit on the transmit-
table information holds [50]:

\[
DI(X^N \to Y^N) - I(U^K; V^K) \;\le\; I(X^N; Y^N) - I(U^K; V^K) \qquad (20)
\]


The interpretation of this is straightforward: Not all of the dependencies between
X N and Y N arise because of information transmission from X to Y , a part of them
arises because yn influences the encoding process of the original message at xn+1 ,
thereby creating correlations that show in I(X N ;Y N ), but do not support information
transfer.
When considering applying Massey’s DI in practice, it is very important to note
that it relies on a known (!) dependency structure and is inherently asymmetric as
it clearly distinguishes the known roles of the original source (U K ) that cannot be
influenced, the encoder/sender (xn ) and the receiver/decoder (yn ). To sum up, DI was
initially not intended to be used for inference on a dependency structure, in partic-
ular not a causal one (this task should be left to Ay and Polani’s causal information
flow measure based on interventions [3]), but to infer achievable information rates
on a known communication channel with feedback.
As a consequence of the above definitions, the forward influence from X → Y
is of a fundamentally different type than the feedback from Y to X. This is also visible
in the fact that the action of xn affects yn at the same n and that Massey clearly
attributes a directedness to this interaction [50]. While this is sometimes confused
with ’instantaneous causality’, the correct interpretation is as follows: in the direction
of the channel Massey does not care about transmission time, and treats yn as the
output caused by xn (we can think of the message as carrying a label ’n’), whereas
in the reverse direction he makes sure that yn can only influence xn+1 via feedback.
Because of the differences between forward and feedback influences, we cannot
use DI analysis without modifications to Massey’s original definition if we don’t
know a priori which of X and Y to consider as sender and receiver. Unfortunately,
this situation is the norm rather than an exception in neuroscience. In the case of
such an unknown sender-receiver relationship, the asymmetry in indexing (the influ-
ence from xn → yn is causal, whereas the influence from yn → xn is not) gets highly
problematic. This is because the prescribed causal structure implied in the informa-
tion theoretic channel with feedback cannot be mapped easily to a bidirectionally
coupled dynamical system. For such a system at least two ad-hoc decompositions of
the dependency structure are possible, (1) into directed information from X → Y and
the feedback from Y to X, and (2) into directed information from Y → X and the feed-
back from X to Y. Due to the inherent asymmetries in DI, the two decompositions are not
equivalent, but represent different a priori assumptions on the system. Considering
all four contributions on the other hand is also not an option, because information is
accounted for multiple times this way.
Therefore, authors applying DI to systems with unknown dependency structure,
such as neural systems, often modify DI, or rather the interpretation thereof by in-
terpreting the index n as physical time (’t’) rather than a channel-use index, and by
subsequently stripping the ’instantaneous’ information transfer from Xn to Yn of its
directedness, based on the argument that Xn and Yn ’happen simultaneously’ [1, 2].
While this indeed yields a useful measure for neuroscience, it is a clear violation of
the original ideas by Massey. Therefore, the reinterpreted measure should perhaps
be given another name to highlight the fact that instantaneous causality is seen as
uninterpretable in terms of a direction in the new measure, a problem that Massey
did not face because of indexing by channel-use, and the prespecified causal struc-
ture in his use case. We leave the renaming of Massey’s directed information to
the community and simply refer to this new interpretation by DI’. Using this new
interpretation of directed information one can show that [2]:

\[
I(X^N; Y^N) = DI'(X^{N-1} \to Y^{N-1}) + DI'(Y^{N-1} \to X^N) + DI'(X^N \leftrightarrow Y^{N-1}) \qquad (21)
\]
\[
DI'(X^N \leftrightarrow Y^N) := \sum_{i=1}^{N} I(X_i; Y_i \mid Y^{i-1}, X^{i-1}) \qquad (22)
\]

which is a useful decomposition of the mutual information into two directed infor-
mation transfers and a contribution of the undirected, instantaneously shared infor-
mation, called instantaneous information exchange, DI′(X^{N−1} ↔ Y^{N−1}). Further-
more, it can be shown that in the limit of t → ∞, the rates of the directed parts in
equation 21 are nothing but the transfer entropy rates for X → Y and Y → X [1, 2].
In sum, most of the confusion about the use of transfer entropy and directed
information arises because the use case for directed information is no longer the one
intended by Massey, and indeed the measure has been changed via reinterpretation,
while references are still made to the original use case and claims by Massey.

5.3 Momentary Information Transfer


Pompe and Runge [57] recently proposed to reconstruct interaction delays using an
information-theoretic functional, called momentary information transfer (MIT). In
their functional the information transfer delay between two systems is introduced in
the form of a parameter of a conditional mutual information term – just as it was
done for transfer entropy in equation 12 for T ESPO . As for T ESPO , this parameter is
scanned in order to maximize the value of MIT . In contrast to the transfer entropy
approach, however, conditioning of the mutual information in MIT is done with
respect to the joint history of the two variables in question:

\[
MIT(X \to Y, u) = I\big(Y_t;\, X_{t-u} \mid \mathbf{Y}_{t-1},\, \mathbf{X}_{t-u-1}\big) \,, \qquad (23)
\]

That is, while MIT retains the conditioning on the immediately previous state
of the target Yt−1 that is used in T ESPO , MIT additionally conditions on the state
variable of the source, Xt−u−1 , immediately preceding the scalar source observation
under consideration, Xt−u .
The essence of Pompe and Runge’s argument is that their conditioning on Xt−u−1
seeks to find the delay over which the transferred information is first available in
the source. While the measure does indeed have this property, we note that for a
measure of information transfer delays it is important to identify the point in time
when the information in the source is most relevant to predict the future of the target,
as was shown by mathematical proof in [72]. As shown by example in the same
study, MIT may therefore slightly misidentify information transfer delays, yielding
inflated delay values. The mathematical reason for this is the removal of memory
in the source via the additional conditioning before determining the information
transfer.

6 Summary and Outlook


Given that one of the major advantages of transfer entropy is its model-freeness,
we may ask what the use of transfer entropy and related information theoretic ap-
proaches will be once highly detailed, and provably correct, simulations of large
neural systems become available. While this may not happen in the near future, it
seems only a matter of time. The answer to this question will also again help us
to see the specific meaning of the quantity ’predictive information transfer’ that is
measured by transfer entropy.
To see the meaning of transfer entropy and other methods from the field of in-
formation dynamics we may turn to recent developments in the field of cellular
automata. Elementary cellular automata (ECAs) are composed of simple agents that
can only take binary values and are connected to their next neighbours [79]. These
agents update their own state in discrete time steps based on a simple rule applied
to the states of their neighbours at the previous time step. ECAs are simple to simu-
late, and we might think that we fully understand how they work, since the rules
for their behaviour are all available to us (we created them and therefore know all
about their causal structure). Despite their deceptive simplicity, ECA show complex
unpredicted emergent behaviour and have been shown to be able to perform universal
computation, i.e. to solve any mathematical task for which a solution exists. This means
that we understand how to simulate a cellular automaton, but not how it functions
or computes. The same may be said for realistic large-scale neural simulations – just
reproducing the dynamics of a system does not entail an understanding of the al-
gorithms it runs to solve a task. For the case of ECA this problem can be elegantly
solved by evaluating local transfer entropy in space and time, and related measures
from information dynamics (see the chapter by Lizier in this book, and [37]). Look-
ing at the structures revealed by this analysis, it is then easy to see that the ECA
performs computations based on coherent activity structures called particles.
The example just given demonstrates that in a complex system there are typically
several levels of understanding. These levels of understanding have been laid out for
the case of neuroscience in an elegant treatment by David Marr in his book Vision
[49]:
• The computational level: What is computed by the neural system, and why is this
computation ecologically relevant to the organism?
• The algorithmic level: What representations of quantities of the outside world
exist (in the neural system) and in what algorithms are they used?
• The implementation level: How are these algorithms implemented in the bio-
physics of the neural system?
As noted already by Marr and later emphasized by Poggio (in the afterword added
to [49]), these levels of understanding only loosely constrain each other as any re-
alization at one level may map to multiple possibilities at the other levels. Poggio
also emphasized the need for analysis approaches that bring the levels closer to-
gether again, after their initial separation brought clarity to neuroscientific study.
If we take into account that transfer entropy quantifies the amount of information
transferred in service of a computation, we see that the analysis of transfer entropy
in a neural system uses data from the implementational level but gives constraints
on the algorithms the system runs. This way, transfer entropy effectively links the
implementation level to the algorithmic level – and does so both for empirical data
and models. As models offer the possibility of virtually unlimited access to data,
and as this is highly beneficial for reliable analyses of information theoretic meth-
ods, we think that the understanding of neural systems will strongly profit from the
application of transfer entropy analysis specifically to data from detailed, large scale
neural simulations that will become available in the near future.

References
1. Amblard, P.O., Michel, O.J.J.: On directed information theory and Granger causality
graphs. J. Comput. Neurosci. 30(1), 7–16 (2011)
2. Amblard, P.O., Michel, O.J.J.: The relation between Granger causality and directed in-
formation theory: A review. Entropy 15(1), 113–143 (2012)
3. Ay, N., Polani, D.: Information flows in causal networks. Adv. Complex Syst. 11, 17
(2008)
4. Barnett, L., Barrett, A.B., Seth, A.K.: Granger causality and transfer entropy are equiva-
lent for Gaussian variables. Phys. Rev. Lett. 103(23), 238701 (2009)
5. Battaglia, D., Witt, A., Wolf, F., Geisel, T.: Dynamic effective connectivity of inter-areal
brain circuits. PLoS Comput. Biol. 8(3), e1002 (2012)
6. Besserve, M., Schölkopf, B., Logothetis, N.K., Panzeri, S.: Causal relationships between
frequency bands of extracellular signals in visual cortex revealed by an information the-
oretic analysis. J. Comput. Neurosci. 29(3), 547–566 (2010)
7. Bühlmann, A., Deco, G.: Optimal information transfer in the cortex through synchro-
nization. PLoS Comput. Biol. 6(9), e1000934 (2010)
8. Chávez, M., Martinerie, J., Le Van Quyen, M.: Statistical assessment of nonlinear causal-
ity: application to epileptic EEG signals. J. Neurosci. Methods 124(2), 113–128 (2003)
9. Chicharro, D., Ledberg, A.: When two become one: the limits of causality analysis of
brain dynamics. PLoS One 7(3), e32466 (2012)

10. Cover, T.M., Thomas, J.A.: Elements of information theory. Wiley-Interscience, New
York (1991)
11. Faes, L., Nollo, G.: Bivariate nonlinear prediction to quantify the strength of complex
dynamical interactions in short-term cardiovascular variability. Med. Biol. Eng. Com-
put. 44(5), 383–392 (2006)
12. Faes, L., Nollo, G., Porta, A.: Information-based detection of nonlinear granger causal-
ity in multivariate processes via a nonuniform embedding technique. Phys. Rev. E Stat.
Nonlin. Soft. Matter Phys. 83(5 Pt. 1), 051112 (2011)
13. Faes, L., Nollo, G., Porta, A.: Non-uniform multivariate embedding to assess the infor-
mation transfer in cardiovascular and cardiorespiratory variability series. Comput. Biol.
Med. 42(3), 290–297 (2012)
14. Faes, L., Nollo, G., Porta, A.: Compensated transfer entropy as a tool for reliably esti-
mating information transfer in physiological time series. Entropy 15(1), 198–219 (2013)
15. Felts, P.A., Baker, T.A., Smith, K.J.: Conduction in segmentally demyelinated mam-
malian central axons. J. Neurosci. 17(19), 7267–7277 (1997)
16. Freiwald, W.A., Valdes, P., Bosch, J., Biscay, R., Jimenez, J.C., Rodriguez, L.M., Ro-
driguez, V., Kreiter, A.K., Singer, W.: Testing non-linearity and directedness of interac-
tions between neural groups in the macaque inferotemporal cortex. J. Neurosci. Meth-
ods 94(1), 105–119 (1999)
17. Garofalo, M., Nieus, T., Massobrio, P., Martinoia, S.: Evaluation of the performance
of information theory-based methods and cross-correlation to estimate the functional
connectivity in cortical networks. PLoS One 4(8), e6482 (2009)
18. Gomez-Herrero, G., Wu, W., Rutanen, K., Soriano, M.C., Pipa, G., Vicente, R.: Assess-
ing coupling dynamics from an ensemble of time series. arXiv preprint arXiv:1008.0539
(2010)
19. Gourevitch, B., Eggermont, J.J.: Evaluating information transfer between auditory corti-
cal neurons. J. Neurophysiol. 97(3), 2533–2543 (2007)
20. Gray, C.M., König, P., Engel, A.K., Singer, W.: Oscillatory responses in cat visual cortex
exhibit inter-columnar synchronization which reflects global stimulus properties. Na-
ture 338(6213), 334–337 (1989)
21. Griffith, V., Koch, C.: Quantifying synergistic mutual information. In: Prokopenko, M.
(ed.) Guided Self-Organization: Inception, pp. 159–190. Springer, Heidelberg (2014)
22. Hadjipapas, A., Hillebrand, A., Holliday, I.E., Singh, K.D., Barnes, G.R.: Assessing in-
teractions of linear and nonlinear neuronal sources using MEG beamformers: a proof of
concept. Clin. Neurophysiol. 116(6), 1300–1313 (2005)
23. Hahs, D.W., Pethel, S.D.: Distinguishing anticipation from causality: anticipatory bias in
the estimation of information flow. Phys. Rev. Lett. 107(12), 128701 (2011)
24. Hahs, D.W., Pethel, S.D.: Transfer entropy for coupled autoregressive processes. En-
tropy 15(3), 767–788 (2013)
25. Harder, M., Salge, C., Polani, D.: Bivariate measure of redundant information. Phys. Rev.
E Stat. Nonlin. Soft Matter Phys. 87(1), 012130 (2013)
26. Hebb, D.O.: The organization of behavior: A neuropsychological theory. Wiley, New
York (1949)
27. Ito, S., Hansen, M.E., Heiland, R., Lumsdaine, A., Litke, A.M., Beggs, J.M.: Extending
transfer entropy improves identification of effective connectivity in a spiking cortical
network model. PLoS One 6(11), e27431 (2011)
28. Kaiser, A., Schreiber, T.: Information transfer in continuous processes. Physica D 166,
43 (2002)
29. Kim, J., Kim, G., An, S., Kwon, Y.K., Yoon, S.: Entropy-based analysis and
bioinformatics-inspired integration of global economic information transfer. PLoS
One 8(1), e51986 (2013)

30. Kozachenko, L., Leonenko, N.: Sample estimate of entropy of a random vector. Probl.
Inform. Transm. 23, 95–100 (1987)
31. Kraskov, A., Stoegbauer, H., Grassberger, P.: Estimating mutual information. Phys. Rev.
E Stat. Nonlin. Soft Matter Phys. 69(6 Pt. 2), 066138 (2004)
32. Kwon, O., Yang, J.S.: Information flow between stock indices. EPL (Europhysics Let-
ters) 82(6), 68003 (2008)
33. Lapidoth, A., Pete, G.: On the entropy of the sum and of the difference of independent
random variables. In: IEEE 25th Convention of Electrical and Electronics Engineers in
Israel, IEEEI 2008, pp. 623–625. IEEE (2008)
34. Leistritz, L., Hesse, W., Arnold, M., Witte, H.: Development of interaction measures
based on adaptive non-linear time series analysis of biomedical signals. Biomed. Tech.
(Berl.) 51(2), 64–69 (2006)
35. Li, X., Ouyang, G.: Estimating coupling direction between neuronal populations with
permutation conditional mutual information. NeuroImage 52(2), 497–507 (2010)
36. Lindner, M., Vicente, R., Priesemann, V., Wibral, M.: TRENTOOL: A Matlab open source
toolbox to analyse information flow in time series data with transfer entropy. BMC Neu-
rosci. 12(119), 1–22 (2011)
37. Lizier, J.: The Local Information Dynamics of Distributed Computation in Complex Sys-
tems. Springer theses. Springer (2013)
38. Lizier, J.T., Atay, F.M., Jost, J.: Information storage, loop motifs, and clustered structure
in complex networks. Phys. Rev. E Stat. Nonlin. Soft Matter Phys. 86(2 Pt. 2), 026110
(2012)
39. Lizier, J.T., Flecker, B., Williams, P.L.: Towards a synergy-based approach to measuring
information modification. In: Proceedings of the 2013 IEEE Symposium on Artificial
Life (ALIFE), pp. 43–51. IEEE (2013)
40. Lizier, J.T., Heinzle, J., Horstmann, A., Haynes, J.D., Prokopenko, M.: Multivariate
information-theoretic measures reveal directed information structure and task relevant
changes in fMRI connectivity. J. Comput. Neurosci. 30(1), 85–107 (2011)
41. Lizier, J.T., Mahoney, J.R.: Moving frames of reference, relativity and invariance in
transfer entropy and information dynamics. Entropy 15(1), 177–197 (2013)
42. Lizier, J.T., Pritam, S., Prokopenko, M.: Information dynamics in small-world Boolean
networks. Artif. Life 17(4), 293–314 (2011)
43. Lizier, J.T., Prokopenko, M.: Differentiating information transfer and causal effect. Eur.
Phys. J. B 73, 605–615 (2010)
44. Lizier, J.T., Prokopenko, M., Zomaya, A.Y.: Local information transfer as a spatiotem-
poral filter for complex systems. Phys. Rev. E 77(2 Pt. 2), 026110 (2008)
45. Lizier, J.T., Prokopenko, M., Zomaya, A.Y.: Information modification and particle colli-
sions in distributed computation. Chaos 20(3), 037109 (2010)
46. Lizier, J.T., Rubinov, M.: Multivariate construction of effective computational networks
from observational data. Max Planck Preprint 25/2012. Max Planck Institute for Mathe-
matics in the Sciences (2012)
47. Lüdtke, N., Logothetis, N.K., Panzeri, S.: Testing methodologies for the nonlinear anal-
ysis of causal relationships in neurovascular coupling. Magn. Reson. Imaging 28(8),
1113–1119 (2010)
48. Marko, H.: The bidirectional communication theory–a generalization of information the-
ory. IEEE Transactions on Communications 21(12), 1345–1351 (1973)
49. Marr, D.: Vision: A Computational Investigation into the Human Representation and
Processing of Visual Information. Henry Holt and Co. Inc., New York (1982)
50. Massey, J.: Causality, feedback and directed information. In: Proc. Int. Symp. Informa-
tion Theory Application (ISITA 1990), pp. 303–305 (1990)

51. Merkwirth, C., Parlitz, U., Lauterborn, W.: Fast nearest-neighbor searching for nonlinear
signal processing. Phys. Rev. E Stat. Phys. Plasmas. Fluids Relat. Interdiscip. Topics 62(2
Pt. A), 2089–2097 (2000)
52. Neymotin, S.A., Jacobs, K.M., Fenton, A.A., Lytton, W.W.: Synaptic information trans-
fer in computer models of neocortical columns. J. Comput. Neurosci. 30(1), 69–84
(2011)
53. Nolte, G., Ziehe, A., Nikulin, V.V., Schlögl, A., Krämer, N., Brismar, T., Müller, K.R.:
Robustly estimating the flow direction of information in complex physical systems. Phys.
Rev. Lett. 100(23), 234101 (2008)
54. Oostenveld, R., Fries, P., Maris, E., Schoffelen, J.M.: FieldTrip: Open source software for
advanced analysis of MEG, EEG, and invasive electrophysiological data. Comput. Intell.
Neurosci. 2011, 156869 (2011)
55. Paluš, M.: Synchronization as adjustment of information rates: detection from bivariate
time series. Phys. Rev. E 63, 046211 (2001)
56. Pearl, J.: Causality: models, reasoning, and inference. Cambridge University Press
(2000)
57. Pompe, B., Runge, J.: Momentary information transfer as a coupling measure of time
series. Phys. Rev. E Stat. Nonlin. Soft Matter Phys. 83(5 Pt. 1), 051122 (2011)
58. Ragwitz, M., Kantz, H.: Markov models from data by simple nonlinear time series pre-
dictors in delay embedding spaces. Phys. Rev. E Stat. Nonlin. Soft Matter Phys. 65(5 Pt.
2), 056201 (2002)
59. Sabesan, S., Good, L.B., Tsakalis, K.S., Spanias, A., Treiman, D.M., Iasemidis, L.D.:
Information flow and application to epileptogenic focus localization from intracranial
EEG. IEEE Trans. Neural. Syst. Rehabil. Eng. 17(3), 244–253 (2009)
60. Schreiber, T.: Measuring information transfer. Phys. Rev. Lett. 85(2), 461–464 (2000)
61. Small, M., Tse, C.: Optimal embedding parameters: a modelling paradigm. Physica D:
Nonlinear Phenomena 194, 283–296 (2004)
62. Staniek, M., Lehnertz, K.: Symbolic transfer entropy: inferring directionality in biosig-
nals. Biomed. Tech (Berl.) 54(6), 323–328 (2009)
63. Stetter, O., Battaglia, D., Soriano, J., Geisel, T.: Model-free reconstruction of excita-
tory neuronal connectivity from calcium imaging signals. PLoS Comput. Biol. 8(8),
e1002653 (2012)
64. Sun, L., Grützner, C., Bölte, S., Wibral, M., Tozman, T., Schlitt, S., Poustka, F., Singer,
W., Freitag, C.M., Uhlhaas, P.J.: Impaired gamma-band activity during perceptual orga-
nization in adults with autism spectrum disorders: evidence for dysfunctional network
activity in frontal-posterior cortices. J. Neurosci. 32(28), 9563–9573 (2012)
65. Takens, F.: Detecting Strange Attractors in Turbulence. In: Dynamical Systems and
Turbulence, Warwick. Lecture Notes in Mathematics, vol. 898, pp. 366–381. Springer
(1980)
66. Vakorin, V.A., Kovacevic, N., McIntosh, A.R.: Exploring transient transfer entropy based
on a group-wise ICA decomposition of EEG data. Neuroimage 49(2), 1593–1600 (2010)
67. Vakorin, V.A., Krakovska, O.A., McIntosh, A.R.: Confounding effects of indirect con-
nections on causality estimation. J. Neurosci. Methods 184(1), 152–160 (2009)
68. Vakorin, V.A., Mišić, B., Krakovska, O., McIntosh, A.R.: Empirical and theoretical aspects
of generation and transfer of information in a neuromagnetic source network. Front Syst.
Neurosci. 5, 96 (2011)
69. Vicente, R., Wibral, M., Lindner, M., Pipa, G.: Transfer entropy – a model-free measure
of effective connectivity for the neurosciences. J. Comput. Neurosci. 30(1), 45–67 (2011)
70. Victor, J.: Binless strategies for estimation of information from neural data. Phys. Rev.
E 72, 051903 (2005)

71. Whitford, T.J., Ford, J.M., Mathalon, D.H., Kubicki, M., Shenton, M.E.: Schizophrenia,
myelination, and delayed corollary discharges: a hypothesis. Schizophr Bull. 38(3), 486–
494 (2012)
72. Wibral, M., Pampu, N., Priesemann, V., Siebenhühner, F., Seiwert, H., Lindner, M., Lizier,
J.T., Vicente, R.: Measuring information-transfer delays. PLoS One 8(2), e55809 (2013)
73. Wibral, M., Rahm, B., Rieder, M., Lindner, M., Vicente, R., Kaiser, J.: Transfer entropy
in magnetoencephalographic data: Quantifying information flow in cortical and cerebel-
lar networks. Prog. Biophys. Mol. Biol. 105(1-2), 80–97 (2011)
74. Wibral, M., Wollstadt, P., Meyer, U., Pampu, N., Priesemann, V., Vicente, R.: Revisiting
wiener’s principle of causality – interaction-delay reconstruction using transfer entropy
and multivariate analysis on delay-weighted graphs. Conf. Proc. IEEE Eng. Med. Biol.
Soc. 2012, 3676–3679 (2012)
75. Wollstadt, P., Martínez-Zarzuela, M., Vicente, R., Díaz-Pernas, F., Wibral, M.: Ef-
ficient transfer entropy analysis of non-stationary neural time series. arXiv preprint
arXiv:1401.4068 (2014)
76. Wiener, N.: The theory of prediction. In: Beckenbach, E.F. (ed.) Modern Mathematics
for the Engineer. McGraw-Hill, New York (1956)
77. Williams, P.L., Beer, R.D.: Nonnegative decomposition of multivariate information.
arXiv preprint arXiv:1004.2515 (2010)
78. Williams, P.L., Beer, R.D.: Generalized measures of information transfer. arXiv preprint
arXiv:1102.1507 (2011)
79. Wolfram, S.: A new kind of science. Wolfram Media, Champaign (2002)
Efficient Estimation of Information Transfer

Raul Vicente and Michael Wibral

Abstract. Any measure of interdependence can lose much of its appeal due to a
poor choice of its numerical estimator. Information theoretic functionals are partic-
ularly sensitive to this problem, especially when applied to noisy signals of only
a few thousand data points or less. Unfortunately, this is a common scenario in
applications to electrophysiology data sets. In this chapter, we will review the state-
of-the-art estimators based on nearest-neighbor statistics for information transfer
measures. Nearest-neighbor techniques are more data-efficient than naive partition
or histogram estimators and rely on milder assumptions than parametric approaches.
However, they also come with limitations and several parameter choices that influ-
ence the numerical estimation of information theoretic functionals. We will describe
step by step the efficient estimation of transfer entropy for a typical electrophysi-
ology data set, and how the multi-trial structure of such data sets can be used to
partially alleviate the problem of non-stationarity.

1 Introduction
Inferring interdependencies between subsystems from empirical data is a com-
mon task across different fields of science. In neuroscience the subsystems from
which we would like to infer an interdependency can consist of a set of stimuli
and a region of the brain [1], two regions of the brain [2], or even two frequency
bands registered at the same brain region [3]. An important characterization of di-
rected dependency is the information transfer between subsystems, especially when
Raul Vicente
Max-Planck Institute for Brain Research, 60385 Frankfurt am Main, Germany
e-mail: raulvicente@gmail.com
Michael Wibral
MEG Unit, Brain Imaging Center, Goethe University, Heinrich-Hoffmann Strasse 10,
60528 Frankfurt am Main, Germany
e-mail: wibral@em.uni-frankfurt.de


describing the information processing capabilities of a system [4, 5]. The success of
this task crucially depends not only on the quality of the data but on the numerical
estimator of the interdependency measure [6]. In this chapter we will review the dif-
ferent stages in obtaining a numerical estimate of information transfer, as measured
by transfer entropy, from a typical electrophysiology data set. Specifically, in Sec-
tion 2 we answer why transfer entropy is used as a quantifier of information transfer.
Next, we describe different strategies to estimate transfer entropy along with their
advantages and drawbacks. Section 4 explains step by step the procedure to numer-
ically estimate transfer entropy from nearest neighbor statistics. The section covers everything
from the choice of parameters for the embedding of raw time series to the testing of
statistical significance. In Section 5, we illustrate how to integrate multi-trial infor-
mation to improve the temporal resolution of transfer entropy. Finally, in Section 6
we briefly discuss the current status of the field and some future developments that
will be needed to advance the application of information transfer measures
in neuroscience.

2 Why Information Theory?


Any top ranking of popular measures of interdependence will certainly include
cross-correlation, coherence, and Granger causality. These measures quantify the
strength of different linear relations and thus belong to the class of parametric measures, which assume a specific form for the interdependence between two or more
processes. By highly constraining the type of interdependence evaluated, the nu-
merical evaluation of parametric measures typically amounts to estimating a few
coefficients, which in the case of linear measures can be usually obtained by matrix
manipulations. Thus, parametric measures are often data-efficient, generalizable to
multivariate settings, and easy to interpret. It is probably no exaggeration to say that
one should always start the inspection of a new data set with linear techniques [7].
However, statistical relationships between processes are more naturally and gen-
erally formulated within the probabilistic framework, which relaxes the need to as-
sume explicit models on how variables relate to each other [8]. Thus, exploratory
rather than confirmatory analysis of a particular model should ideally be carried out
by techniques formulated in probabilistic terms. After all, if two random variables
X and Y are independent

P(X,Y ) = P(X)P(Y ) , (1)

they must be uncorrelated

E[XY ] = E[X]E[Y ] , (2)

but the reverse is not true.
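As a small illustration of this asymmetry (a minimal sketch in Python with numpy; the toy variables, the number of bins, and all names are our own and not part of any toolbox discussed later), take X standard normal and Y = X^2: the two are uncorrelated, yet Y is a deterministic function of X, and even a crude histogram estimate of the mutual information is clearly positive.

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100000)
y = x ** 2                      # deterministic function of x, hence fully dependent

# The correlation is (close to) zero because E[X Y] = E[X^3] = 0 for a symmetric X
print(np.corrcoef(x, y)[0, 1])

# A crude histogram-based mutual information estimate is clearly positive
pxy, _, _ = np.histogram2d(x, y, bins=30)
pxy = pxy / pxy.sum()
px = pxy.sum(axis=1, keepdims=True)
py = pxy.sum(axis=0, keepdims=True)
nz = pxy > 0
print(np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])))   # > 0 bits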


Information theory is precisely formulated in probabilistic terms and quantifies
properties such as the information shared by two random variables X and Y by
simply measuring how much the probabilities at the two sides of Eq. 1 differ. The

information content of a random variable (Shannon entropy), or the information


shared by two random variables (mutual information) does not appeal to any explicit
model for the random variables or their interrelations [9]. Instead these information
theoretic measures are simply scalars obtained directly from the probability distri-
bution of a single random variable (Shannon entropy) or the joint probability of a
pair of random variables (mutual information). Implicit assumptions include that
probability mass or density functions can describe our knowledge about the ran-
dom variables or the existence of a given communication channel. The generality of
the information theory approach endows it with the ability to deal with a diversity
of systems and still employ a common currency, the bit. Conceptually, information
theoretic measures are very appealing because they stem from a simple and elegant
set of axioms that include intuitive properties such as that the information gath-
ered from independent sources should be additive. Also concepts like synergy and
redundancy can be naturally formulated in information theoretic terms [10, 11]. Re-
cent developments have also given rise to new branches such as information dynamics,
which aims to describe the local dynamics in space and time of the transfer, storage,
and transformation of information [5]. See Chapter 7 by Lizier in this volume for a
short introduction to local information dynamics.
In neurobiology mutual information has been widely used to assess how neu-
rons represent information, i.e., to investigate the neural code [12]. Indeed, just over
a decade after Shannon’s seminal article [13] and book [14], Horace Barlow pro-
posed in 1961 the efficient coding hypothesis by which sensory systems use neural
codes that optimize the number of spikes needed to transmit a given signal [15].
Measures of mutual information between stimuli from the outside world and neu-
ronal response in early sensory areas have provided lower bounds for the channel
capacity of sensory systems [9, 12, 16]. In fact, only when tested with natural stim-
uli different sensory modalities have been found to transmit stimulus information at
rates close to their theoretical optimal capacities, as predicted by the efficient coding
hypothesis [17, 18].
However, the simplest model of communication in information theory (a single
unidirectional chain composed of source, transmitter, channel, receiver, and destina-
tion) might not be the most appropriate to study the information processing in com-
plex networks of neurons [9]. Feedback, delays, and plasticity are defining features
of neuronal circuitry that can alter the original interpretation of information theoret-
ical functionals. Another basic difficulty in applying information theory to natural
systems such as cell networks is that, unlike in human-designed engineering systems, the encoding and decoding processes as well as the channels used for transmission need to be discovered and characterized in the first place [19]. Finally, one must also
keep in mind that the main function of the nervous system is the processing of infor-
mation to attain fast and specific responses to environmental and internal demands.
Processing, as opposed to mere transmission of information, often involves a loss of information (e.g., generating some type of invariance), and mutual information alone cannot decide how much information was lost due to deterministic processing and how much was lost simply due to noisy transmission.

Taking into account these conceptual difficulties and severe measurement limitations (experimental recordings only capture a hugely subsampled and coarse-grained version of the underlying neural processes), it is probably safe to affirm that many over-interpretations can be found in results dealing with information theory applied to systems neuroscience and other fields [20]. This partly justifies the skepticism that the application of information theory to neural data has generated among rigorous information theoreticians [21, 22]. However, information theory, even in its most classical and
simple framework can still provide very useful insights and lower bounds on funda-
mental quantities characterizing the transmission of information. The latter aspect
is the basis of many analyses that try to determine the flexible routing of information
across brain areas on top of its anatomical architecture. To this end a generalization
of mutual information named transfer entropy (TE) has become the tool of choice.

2.1 Transfer Entropy


Transfer entropy from a process X to another process Y is the amount of uncer-
tainty reduced in the future values of Y by knowing the past values of X once the
past values of Y are given [23]. Importantly, this definition directly embodies an
operational version of statistical causal dependencies introduced by the great math-
ematician Norbert Wiener in 1956 [24]. In particular, a signal X is said to cause
a signal Y when the future of signal Y is better predicted by adding knowledge
from the past and present of signal X than by using the present and past of Y alone.
Once prediction enhancement and reduction of uncertainty are identified with each
other it is clear that transfer entropy implements Wiener’s principle of causality in
information theoretic terms [25].
Mathematically, for processes X and Y that can be approximated by Markov chains, transfer entropy can simply be expressed as the mutual information (MI) between the future of Y and the past of X, conditioned on the past of Y,

TE(X → Y) = MI(Y^+; X^− | Y^−) ,   (3)

where the superscripts + and − denote adequate future and past state reconstruc-
tions of the respective random variables.
The conditioning in the former equation equips transfer entropy with several ad-
vantages over the unconditioned mutual information to describe information trans-
fer. First, it enables transfer entropy to consider transitions between states and thus
incorporates the dynamics of the processes. Second, transfer entropy is inherently
asymmetric with respect to the exchange of X and Y and thus can distinguish the two
possible directions of interaction. These two properties allow one to assess the di-
rected information being dynamically transferred between two processes as opposed
to the information being merely statically shared [23]. This can also be observed
from rewriting Eq. 3 as

TE(X → Y) = MI(Y^+; (Y^−, X^−)) − MI(Y^+; Y^−) ,   (4)

which makes explicit that transfer entropy is the reduction of uncertainty in one vari-
able due to another that is not explained by its own past alone. Another arrangement
of transfer entropy, in this case in terms of Shannon entropies, reads
   
TE(X → Y) = H(Y^−, X^−) − H(Y^+, Y^−, X^−) + H(Y^+, Y^−) − H(Y^−) .   (5)

For a detailed review on the concept of transfer entropy and its application to
neuroscience see Chapter 1 by Wibral in this volume.
Note also that we refer to transfer entropy as capturing causal dependencies only
in the sense that there is some value in the past of an observed signal in explaining
the future evolution of another signal beyond its own past. Observational causal-
ity as defined by Wiener differs in general from interventional causality, in which perturbations of one process, while conditioning or controlling the state of others, are necessary to infer the graph of causal interactions. Indeed, transfer entropy captures the notion of information transfer as opposed to quantifying the strength of causal interactions [26]. The two concepts are different, as reviewed in Chap-
ter 8 by Chicharro in this volume. An easy observation illustrating this difference
is that transfer entropy will be zero for both independent and fully synchronized
processes, possibly due to a null and very strong causal interaction, respectively
[27, 28]. However, information transfer across brain regions is arguably the quantity
of interest to study the flexible information routing in the brain rather than interven-
tional causal connectivity, which is directly related to the brain's relatively fixed anatomical
circuitry [29, 25, 30, 31].
Regarding the estimation of transfer entropy, the innocent formulation in Eq. 3
does not make explicit its dependence on several probability densities. For Markov
processes indexed by a discrete-valued time index t, this reads

TE(X → Y) = \sum_{y_{t+1}, y_t^{d_y}, x_t^{d_x}} p(y_{t+1}, y_t^{d_y}, x_t^{d_x}) \log \frac{p(y_{t+1} \mid y_t^{d_y}, x_t^{d_x})}{p(y_{t+1} \mid y_t^{d_y})} ,   (6)

where x_t^{d_x} = (x_t, ..., x_{t−d_x+1}) and y_t^{d_y} = (y_t, ..., y_{t−d_y+1}), while d_x and d_y are the orders
(memory) of the Markov processes X and Y, respectively.
The formula in Eq. 6 does not appear so innocent anymore and the appearance of
several probability densities in possibly high dimensions already hints that the esti-
mation procedure might be difficult. In the next sections we will describe different
types of estimators of transfer entropy from a collection of time series.

3 A Zoo of Estimators
Given a data set and application in mind, which is a good estimator for transfer
entropy? Before addressing this question we shall recall some basic notions about
estimators and their classification.
An estimator is a function or rule that takes observed data as input and out-
puts an estimate of an unknown parameter or variable [32]. Any estimator can be

characterized in terms of the bias and variance of its estimates, that is, its systematic deviation from the true value and its variability across different realizations of the
sampling. Often one is interested in knowing how the bias of the estimate and its
convergence to the expected value behave as the number of samples grows, i.e., the
asymptotic bias and consistency of the estimator, respectively. Among unbiased estimators, one estimator is more efficient than another if it needs fewer samples to
achieve a given level of variance. More generally, one can be interested in control-
ling the balance between bias and variance. For example, if one decides to contrast
the estimate for one data set with that of surrogate data (see Section 5), the analyst
might choose to reduce the variance or statistical error of the estimate at the expense
of increasing its bias. The reason is that if the surrogate data is suspected to have a
similar bias to the observed data, the bias can be canceled out in the comparison. In
another case, the analyst might be interested in a direct interpretation of the value of
an estimate. To attain such a goal, a low-bias estimator is mandatory.
Thus, selecting the appropriate numerical estimator for a given application is cru-
cial since too large biases or statistical errors can severely hamper the interpretation
of the estimated functionals or their practical use. An optimal selection is indeed a
complex question that depends on several factors including the number of samples
available, the dimensionality of the problem, the levels of quantization of the samples,
the desired bias versus variance balance, and the computational resources. Differ-
ent estimators can be classified according to several criteria and each class will
exhibit different advantages and costs depending on the above-mentioned factors.
Taxonomies of transfer entropy estimators closely follow those of other information theoretic functionals such as Shannon entropy or mutual information [33]. A usual first di-
vision consists of the separation between parametric and non-parametric estimators.
Here we will focus on those estimators that can readily be applied to high-
dimensional spaces, for two reasons. First, time series from electrophysiology experiments can typically only be represented faithfully after embedding in some high-dimensional space; this step is necessary to properly represent the true state of the underlying system (see
Section 4). Second, the evaluation of transfer entropy involves joint probability
densities, compounding the problem. Furthermore, only continuous signals will be
considered. The existence of a reliable estimator of transfer entropy for point pro-
cesses such as spike trains has yet to be established. For some heuristic approaches see
[34, 35, 36].

3.1 Parametric Estimators


No matter how large, a finite number of samples does not completely determine
an arbitrary continuous probability density. However, assumptions about the shape
of a probability density constrain the searching space and thus help to reduce the
number of samples needed to characterize a density. Thus, when there is confidence
that data are sampled from a family of densities described by some parameters it is
possible to use that information to obtain more data-efficient estimators. Parametric
estimators assume that the probability densities involved belong to a certain family

and start by first inferring the parameters of the family that best fits the sampled dis-
tribution. Note that due to the need to embed the time series into a high dimensional
space, the distributions considered here, both the parametric family and the sampled distribution,
are necessarily multi-dimensional. After the inference of the density parameters, a
direct estimation of transfer entropy or other information theoretic functionals then
proceeds by applying the proper functional to the estimated density function.
An advantage of the parametric approach is that in some cases it allows for an-
alytical insight on how an information theoretic functional depends on relevant pa-
rameters. For example, for the Gaussian family and other distributions some func-
tionals of the densities can be computed analytically [27, 37]; see, for instance, the work by Hlaváčková for the derivation of differential entropy and transfer entropy
for Gaussian, log-normal, Weinman, and generalized normal distributions [38]. Also
if time series are well fitted by some generative models it is possible to estimate
transfer entropy directly from the parameters of the generative equations. For exam-
ple, for coupled first-order auto-regressive models or second-order linear differential
equations with Gaussian input transfer entropy is analytically solvable [27, 39, 40].
Furthermore, for linear Gaussian systems the distribution of estimates given the data length, as well as that of surrogates of the given data, is analytically known [41], which simplifies the evaluation of statistical significance for these systems (see
Section 4.3 for a discussion on assessing statistical significance for transfer entropy
in the general case).
The success of the parametric approach depends on the correctness of the as-
sumptions. For example, under certain conditions it might be reasonable to assume
that during resting state samples from certain electrodes are distributed according
to a Gaussian law (in which case transfer entropy is proportional to Granger causal-
ity). Even if the data are not distributed according to any member of a well known
family of distributions it is possible to apply some transform to bring them into one.
This procedure can also be useful to estimate bounds for certain functionals. For
example, since the data processing inequality implies that I(X,Y ) ≥ I( f (X), g(Y )),
where f and g are deterministic functions, a lower bound can be obtained if the
distributions of f (X) and g(Y ) are easier to estimate.
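As a minimal sketch of the parametric route for the linear-Gaussian case (Python with numpy; the coupled autoregressive toy system, the regression-based computation, and all names are our own illustrative assumptions): for jointly Gaussian variables the transfer entropy reduces to half the log-ratio of the residual variance of predicting the future of Y from its own past to that of predicting it from its own past plus the past of X.

import numpy as np

rng = np.random.default_rng(1)

# Toy example: unidirectionally coupled AR(1) processes, x drives y
N = 20000
x = np.zeros(N)
y = np.zeros(N)
for t in range(1, N):
    x[t] = 0.5 * x[t - 1] + rng.standard_normal()
    y[t] = 0.6 * y[t - 1] + 0.4 * x[t - 1] + rng.standard_normal()

def residual_variance(target, regressors):
    """Variance of the residual of an ordinary least-squares fit."""
    beta, *_ = np.linalg.lstsq(regressors, target, rcond=None)
    return np.var(target - regressors @ beta)

y_future = y[1:]
y_past = y[:-1, None]
x_past = x[:-1, None]

var_reduced = residual_variance(y_future, y_past)                    # own past only
var_full = residual_variance(y_future, np.hstack((y_past, x_past)))  # own past plus past of x

te_gauss = 0.5 * np.log(var_reduced / var_full)   # transfer entropy in nats under the Gaussian assumption
print(te_gauss)                                   # > 0, since x drives y

When the Gaussian assumption holds, this quantity coincides, up to a factor of two, with the log-likelihood form of Granger causality, which is what makes the parametric route so data-efficient; when the assumption fails, the resulting estimate can be arbitrarily misleading.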

3.2 Non-parametric Estimators


In most cases there is no simple family of distributions that fits the data, or the distribution changes across time and experimental conditions. Thus, time se-
ries during strongly evoked or induced activity, or in general non-stationary regimes,
are not adequately described by a single family of distributions. In such situations
the application of a parametric approach to estimate entropy or other functionals is
not recommended.
Non-parametric approaches make only mild assumptions about the continuity or
smoothness of the distributions, which in any case are not assumed to belong to any
particular family. Here, we will follow Hlaváčková et al. [33] in their division of
three main classes of non-parametric estimators.

3.2.1 Partition Based Estimators


The most intuitive method to estimate probability densities is arguably that of his-
tograms. The idea is simply to estimate probabilities by counting how many samples
fall into each division of a certain partition of the state space. Indeed, such a procedure
corresponds to the maximum likelihood estimator of probability densities. Hence,
in principle it is possible to estimate transfer entropy by using the frequency of vis-
itation of different states as approximations for each probability involved in Eq. 6.
However, due to the concavity of the log function even an asymptotically unbiased
estimator for the density can result in a significant bias for Shannon or transfer en-
tropy. For this reason several bias correction formulas have been derived, such as the Miller-Madow correction or the "jack-knifed" method by Efron and Stein [42, 43].
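As a minimal sketch of this plug-in idea for a process with a small discrete alphabet (Python with numpy; the binary toy process, the Markov orders d_x = d_y = 1, and all names are our own assumptions), transfer entropy can be computed directly from the visitation frequencies of the triplets (y_{t+1}, y_t, x_t) appearing in Eq. 6:

import numpy as np

rng = np.random.default_rng(2)

# Toy binary processes (our own example): y tends to copy the previous value of x
N = 50000
x = rng.integers(0, 2, size=N)
y = np.empty(N, dtype=int)
y[0] = 0
noise = rng.random(N) < 0.1          # 10% of the copies are corrupted
y[1:] = np.where(noise[1:], 1 - x[:-1], x[:-1])

def transfer_entropy_hist(x, y):
    """Plug-in (histogram) estimate of TE(X -> Y) in bits for d_x = d_y = 1."""
    triples = np.stack((y[1:], y[:-1], x[:-1]), axis=1)        # rows are (y_{t+1}, y_t, x_t)
    counts = np.zeros((2, 2, 2))
    for a, b, c in triples:
        counts[a, b, c] += 1
    p_joint = counts / counts.sum()                            # p(y_{t+1}, y_t, x_t)
    p_yy = p_joint.sum(axis=2, keepdims=True)                  # p(y_{t+1}, y_t)
    p_y = p_joint.sum(axis=(0, 2), keepdims=True)              # p(y_t)
    p_yx = p_joint.sum(axis=0, keepdims=True)                  # p(y_t, x_t)
    nz = p_joint > 0
    ratio = (p_joint * p_y) / (p_yy * p_yx)
    return np.sum(p_joint[nz] * np.log2(ratio[nz]))

print(transfer_entropy_hist(x, y))   # close to 1 - H_b(0.1), i.e. roughly 0.53 bits for this toy channel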
When the process under study can be naturally divided into a small number h
of different levels or states, histograms are the most straightforward approach to
compute probability distributions. Thus, cellular automata or other processes such as
Markov chains with a small number of distinct states are ideal candidates to com-
pute transfer entropy by histogram techniques [5]. Also it is possible that several
states or ranges of a continuous variable can be merged into representative states
or symbols. For example, for a scalar time series one possibility is to only assign
different symbols to each of the possible orderings (permutations) of a sequence of
amplitude values. That is, the sequence of amplitudes {1, 2, 3} within a time series
will be assigned to the same symbol as the sequence {6, 7, 8} (same relative order)
but to a different symbol than the sequence {5, 4, 2} (different relative order). Thus
one could compute symbolic functionals such as symbolic transfer entropy from the
frequency of visitation of each symbol [44]. However, unless there is a natural criterion to justify the symbolization, this procedure might hamper the interpretability of transfer entropy. For example, as shown in [45], symbolic transfer entropy may
lose relevant information as it assumes that all relevant information is given by the
relative orderings. In general, the use of arbitrary partitions is difficult to justify
or becomes impractical for most applications dealing with continuous time series
consisting of a few hundreds or thousands samples. Let us see why.
For continuous signals there is an important difficulty. Since two samples falling
into the same cell will be considered identical for all purposes, a partition of the state
space for continuous variables will introduce some level of quantization. Ideally,
a large number of equidistant levels is necessary to resolve fine differences in the
signal and not lose most of the structure of the process. However, this is impracti-
cal. The reason is that the number of bins in a regular partition grows exponentially
(h^{dim}) with the number of dimensions of the data and easily exceeds the number of observed samples N. This results in most of the bins being unoccupied and, in turn,
in large biases for information functionals applied to such sparse probability den-
sities. For example, if bins are so sparsely occupied that each bin only contains 0
or 1 samples, values of Shannon entropy will only reflect the number of distinct
states rather than any internal structure of the process. Since transfer entropy can be
decomposed into four entropy terms in marginal and joint spaces of different dimensionality, a straightforward application of this approach to practical data sets

will most probably saturate the term with highest dimensionality and underestimate
transfer entropy.
Until now we have considered partitions of fixed size and independent of the
data but it is possible to generate partitions with cells or divisions of different sizes
adaptively to the observed samples. One possibility to overcome some of the above-
mentioned problems is to partition the observation space such that it is guaranteed that the occupation of the bins satisfies some desired property. For example, for mu-
tual information Paluš proposed that some problems of over-quantization can be-
come less critical by using partitions that ensure an equal occupation of bins in the
marginal spaces [46]. Such equiquantization ensures a maximization of the entropy
for the marginal probabilities, which makes the mutual information depend only on the joint entropy term. In general, Paluš suggests that the condition h < N^{1/(dim+1)} should be met for the practical estimation of mutual information by equiquantized
marginal partitions [46]. A different adaptive technique is based on the local recur-
sive refinement of a partition to uniformize the distribution of samples in the joint
space [47, 48, 49]. Yet a third type of approach considers fuzzy bins by allowing
a continuous weighting of a sample at multiple bins [50]. While in principle these
strategies can be generalized to estimate other information theoretical functionals,
no systematic study has tested their convergence properties for transfer entropy.
More generally, the curse of dimensionality is the major impediment for applying
these techniques to data sets living in moderate or high dimensional spaces, which
is the usual case in electrophysiology.
Note that the histogram technique is not readily applicable to spike trains in prac-
tical settings. Although they are usually considered binary processes (two states),
they have indeed a mixed topology due to the continuous nature of the time stamp
at which each spike occurs [51]. Only in the case of very long recordings and after the application of some bias corrections have histogram strategies produced reliable results for spike trains or signals with some continuous variable [12].

3.2.2 Plug-in Estimators


The idea behind this type of estimator is to find consistent estimates for the probability densities and plug them into the corresponding functionals. However, in con-
trast to the parametric approach, no prior assumptions are made about the overall
shape of the densities. Thus densities are not forced to be members of any known
family of distributions. Instead, the densities are typically estimated using more
flexible techniques such as a kernel density (or Parzen) estimator [52, 53, 23].
In kernel estimators a density is written as a sum of decaying kernel functions,
such as Gaussian or box kernels, centered on the observed samples x1 , x2 , ...xN .
Such an expression is theoretically justified since it can be shown to be equivalent to estimating the density function via the inverse Fourier transform of its empirical characteristic function \frac{1}{N} \sum_{t=1}^{N} \exp(iλ x_t). The bandwidth or smoothness of the local kernel win-
dows controls the bias versus statistical error balance. Kernel density estimators
are a popular solution to overcome some problems that binning approaches have
such as sensitivity to noise near bin borders and arbitrary location of bin origins.

For continuous random variables, the sum of smooth kernels converges faster to the underlying density than binning-based techniques [54].
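A minimal sketch of such a kernel (Parzen) density estimate in one dimension (Python with numpy; the Gaussian kernel, the fixed bandwidth, and the names are our own illustrative choices):

import numpy as np

def gaussian_kde_1d(samples, query_points, bandwidth):
    """Parzen estimate of a 1-D density: one Gaussian kernel centered on every sample."""
    diffs = (query_points[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * diffs ** 2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

rng = np.random.default_rng(3)
samples = rng.standard_normal(1000)
grid = np.linspace(-4, 4, 200)
density = gaussian_kde_1d(samples, grid, bandwidth=0.3)
print(np.trapz(density, grid))   # ~1, as expected for a density

The bandwidth plays the role of the bin width above and again trades bias against variance; in the multivariate case the kernel is simply replaced by a product kernel or a multivariate kernel over dimensions.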
However, the evaluation of functionals of a density that is expressed as a sum of
kernels centered at the irregularly distributed sample points is difficult. For exam-
ple, the estimation of transfer entropy for continuous variables could in principle be
carried out by combining four Shannon entropy terms, or more strictly speaking, differential entropy terms. While the application of kernel density estimation is straightforward, evaluating each of the entropy terms requires numerically computing an integral over joint spaces, which for electrophysiology signals can easily
reach a dimensionality of fifteen or higher.

3.2.3 Metric Based Estimators


This class of estimators appeals to the notion that the larger the distance from a point to its nearest neighbor, the lower the local density around that point. Par-
ticularly relevant is the seminal work of Kozachenko and Leonenko (KL) who ex-
ploited the statistics of nearest neighbors and the assumption of continuity of the
probability density to provide an asymptotically unbiased and consistent estimator
for differential entropy [55].
Importantly, the KL estimator illustrates that to estimate an information theoretic
functional it is not always necessary to explicitly estimate the full probability dis-
tribution from the observed samples to later plug it into the respective functional.
A familiar illustration of this occurs for example when we estimate the mean of N
samples (xt ) without first estimating its probability distribution and then taking its
first moment. Instead we simply apply the formula

1 N
x̂ = ∑ xt .
N t=1
(7)

which provides a direct estimation of the mean from the samples without computing
the full distribution in the first place.
The derivation of the KL estimator starts by noticing that a differential entropy term −\int p(x) \log(p(x)) dx can be approximated by the (negative) sample average of \log(p(x)) evaluated at the sampled points x = x_t [55, 51]. The next ingredient is the assumption that
the probability density near each point xt is locally uniform and equal to p(x = xt ),
which is an approximation of p as local as possible given the data available. Given
the former assumption and using the trinomial formula, log (p(xt )) can be calculated
from the probability that after N − 1 other samples have been drawn according to
p, the nearest neighbor to xt is at least at distance ε . Finally, the sample average of
log (p(xt )) can be shown to be, up to a constant, equal to the sample average of the
log of the distance of each sample point to its nearest neighbor. The general form of
the KL estimator for differential entropy finally reads
\hat{H}(X) = −ψ(k) + ψ(N) + \log(|B_d|) + \frac{d}{N} \sum_{t=1}^{N} \log(2ε_t) ,   (8)

where ψ denotes the digamma function, |Bd | is the volume of the unit ball in the d-
dimensional Euclidean space and εt is the distance of xt to its k-th nearest neighbor.
Note that norms other than the Euclidean norm, such as the maximum norm, can also be used in the former formula to estimate the distances to nearest neighbors.
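A minimal sketch of the estimator of Eq. 8 (Python with numpy and scipy; the use of scipy.spatial.cKDTree and of the maximum norm, under which the volume term can be taken equal to one following the convention of Kraskov and colleagues and therefore drops out, are our own implementation choices):

import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def kl_entropy_maxnorm(x, k=4):
    """Kozachenko-Leonenko estimate of differential entropy (in nats) from samples x of
    shape (N, d), using the maximum norm; with this convention the volume term of Eq. 8
    vanishes and only the log of twice the neighbor distances remains."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    N, d = x.shape
    tree = cKDTree(x)
    # k+1 neighbors are requested because the query point itself is returned at zero distance
    dist, _ = tree.query(x, k=k + 1, p=np.inf)
    eps = dist[:, -1]                                # distance to the k-th nearest neighbor
    return -digamma(k) + digamma(N) + d * np.mean(np.log(2.0 * eps))

# Sanity check on a toy case: the entropy of a standard normal is 0.5*log(2*pi*e) per dimension
rng = np.random.default_rng(4)
print(kl_entropy_maxnorm(rng.standard_normal((5000, 1))), 0.5 * np.log(2 * np.pi * np.e))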
The KL estimator for differential entropy is endowed with several properties that
make it particularly attractive for practical applications. First, Kozachenko and Leo-
nenko demonstrated that under mild assumptions for the continuity of p, the above
estimator is asymptotically unbiased and consistent. Even for finite samples the pa-
rameter k (number of nearest neighbors) still allows for a certain control of the bias
versus variance balance (larger k reduces statistical errors at the expense of a higher
bias). Second, Victor has reported that the data-efficiency in estimating differential
entropy with the KL formula can reach a thousand times that of histogram strategies
for typical electrophysiology data sets [51]. Additionally, compared to histogram
techniques KL and other nearest neighbor approaches are centered on each data
sample and thus avoid certain arbitrariness typical of binning procedures. Third, the KL
estimator effectively implements an adaptive resolution where the distance scale
used changes according to the underlying density. And fourth, the search of nearest
neighbors from a set of points is a classic problem that has received a lot of attention
and for which several algorithms exist [56, 57, 58, 59].
However, there remain important drawbacks. For example, the application of the KL estimator might return unreliable results if the assumption of continuity of p is not appropriate. The validity of such an assumption seems natural for most continuous electrophysiology signals but is a condition to check for each application. Also, if the number of samples is small and the dimensionality very high, the KL estimator will suffer from the curse of dimensionality. In addition, for some applications
with a large data size living in very high-dimensional spaces the computation of ex-
act nearest neighbor distances can be computationally expensive. Unfortunately, less
expensive alternatives such as approximate nearest neighbors, where some margin of error is allowed in finding the k-th nearest neighbors, lead to an amplification of errors in the entropy estimate that renders the advantage gained from the approximate search less useful. However, hierarchical clustering techniques and
parallel computing possibly assisted by GPU technology have paved the way to-
wards high-performance computing of massive exact nearest neighbor calculations
[60, 61].
Since most relevant information theoretic functionals can be decomposed in
terms of differential entropies, a naive estimator for such functionals would consist
of summing the individual differential entropy estimators. For example, for transfer
entropy the naive approach would consist of estimating each term of Eq. 5 from
a KL estimator. This is however not adequate for many applications. To see why,
it is important to first note that the probability densities involved in computing TE
or MI from individual terms can be of very different dimensionality (from dy up
to dy + dx + 1 for the case of bivariate TE). For a fixed k, this means that different
distance scales are effectively used for spaces of different dimension. The second
important aspect is to note that the KL estimator is based on the assumption that the
density of the distribution of samples is constant within an ε -ball. The bias of the

final entropy estimate depends on the validity of this assumption, and thus, on the
values of εt . Since the size of the ε -balls depends directly on the dimensionality of
the random samples, the biases of estimates for the differential entropies in Eq. 5 will,
in general, not cancel, leading to a poor estimator of the transfer entropy.
The solution to this problem came from Kraskov, Stögbauer, and Grassberger
(KSG) who provided a methodology to adapt the KL estimator to estimate mutual
information [62, 63]. This set the path to estimators for other information theoretic
functionals such as transfer entropy. Their solution came from the insight that Eq. 8
holds for any k and thus, it is not necessary to use a fixed k. Therefore, we can
vary the value of k at each data point so that the radii of the corresponding ε-balls
would be approximately the same for the joint and the marginal spaces. The key
idea is then to use a fixed mass (k) only in the higher dimensional space and project
the distance scale set by this mass into the lower dimensional spaces. Thus, the
procedure designed for mutual information suggests first determining the distances to the k-th nearest neighbors in the joint space. Then, an estimator of MI can be obtained by counting the number of neighbors n that fall within such distances for each point in the marginal spaces. The estimator of MI based on this method displays many good statistical properties: it inherits the data-efficiency of the KL estimator, it greatly
reduces the bias obtained with individual KL estimates, and it seems to become an
exact estimator in the case of independent variables.
The idea can be generalized to estimate other functionals such as conditional mu-
tual information, including its specific formulation for transfer entropy [64]. Finally,
the KSG estimator of transfer entropy for Markov processes indexed by the discrete
time variable t (Eq. 6) is written as
  
TE(X → Y; u) = ψ(k) + \frac{1}{N} \sum_t \left[ ψ\left(n_{y_{t-1}^{d_y}} + 1\right) − ψ\left(n_{y_t y_{t-1}^{d_y}} + 1\right) − ψ\left(n_{y_{t-1}^{d_y} x_{t-u}^{d_x}} + 1\right) \right] ,   (9)

where the distances to the k-th nearest neighbor in the highest dimensional space (spanned by y_t y_{t-1}^{d_y} x_{t-u}^{d_x}) define the radii of the balls for the counting of the number
of points n(·) in these balls around each state vector in all the marginal spaces (·)
involved. In the above formulation we have also included a temporal parameter u
which accounts for the time delay for the information transfer to occur between two
processes as explained in [45].
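A minimal sketch of how Eq. 9 can be evaluated in practice (Python with numpy and scipy; the use of scipy.spatial.cKDTree with the maximum norm, the handling of the strict inequality in the neighbor counts, and the toy usage example are all our own choices; for simplicity the sketch omits the Theiler correction discussed in Section 4.2 and is not TRENTOOL code):

import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def ksg_transfer_entropy(x, y, dx=1, dy=1, tau=1, u=1, k=4):
    """Nearest-neighbor (KSG-type) estimate of TE(X -> Y; u) in nats, following Eq. 9."""
    N = len(y)

    def embed_at(sig, d, last):
        # state vectors (sig[t], sig[t - tau], ..., sig[t - (d-1)*tau]) for every index t in `last`
        return np.column_stack([sig[last - i * tau] for i in range(d)])

    t0 = max(1 + (dy - 1) * tau, u + (dx - 1) * tau)   # first admissible prediction index
    T = np.arange(t0, N)
    y_fut = y[T][:, None]                   # the scalar to be predicted, y_t
    y_past = embed_at(y, dy, T - 1)         # target past y_{t-1}^{dy}
    x_past = embed_at(x, dx, T - u)         # source past x_{t-u}^{dx}

    joint = np.hstack((y_fut, y_past, x_past))
    yfut_ypast = np.hstack((y_fut, y_past))
    ypast_xpast = np.hstack((y_past, x_past))

    # 'fixed mass' search: distance to the k-th neighbor in the joint space (maximum norm)
    dist, _ = cKDTree(joint).query(joint, k=k + 1, p=np.inf)
    radius = dist[:, -1]

    def count_within(points, radii):
        # 'fixed radius' search: neighbors strictly inside each ball, excluding the point itself
        tree = cKDTree(points)
        return np.array([len(tree.query_ball_point(pt, r * (1 - 1e-12), p=np.inf)) - 1
                         for pt, r in zip(points, radii)])

    n_ypast = count_within(y_past, radius)
    n_yfut_ypast = count_within(yfut_ypast, radius)
    n_ypast_xpast = count_within(ypast_xpast, radius)

    return digamma(k) + np.mean(digamma(n_ypast + 1)
                                - digamma(n_yfut_ypast + 1)
                                - digamma(n_ypast_xpast + 1))

# Usage on a toy pair of unidirectionally coupled AR(1) processes (x drives y)
rng = np.random.default_rng(5)
x = np.zeros(3000)
y = np.zeros(3000)
for t in range(1, 3000):
    x[t] = 0.5 * x[t - 1] + rng.standard_normal()
    y[t] = 0.6 * y[t - 1] + 0.4 * x[t - 1] + rng.standard_normal()
print(ksg_transfer_entropy(x, y), ksg_transfer_entropy(y, x))   # the first value is clearly larger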
In summary, since the KSG estimator is more data efficient and accurate than
other techniques (especially those based on binning), it allows one to analyze shorter
data sets possibly contaminated by small levels of noise. At the same time, the
method is especially geared to handle the biases of high dimensional spaces nat-
urally occurring after the embedding of raw signals. Thus, the use of KSG enhances
the applicability of information theoretic functionals in practical scenarios with lim-
ited data of unknown distribution such as in neuroscience applications [25]. As
such, in the next section, we focus on the use of this estimator in a typical elec-
trophysiological data set. However, even using this improved estimator, the curse of

dimensionality and inaccuracies in estimation are unavoidable, especially for the re-
strictive conditions of electrophysiology data sets. For these reasons it is suggested
that the raw value of transfer entropy may be less reliable than its use as a test statistic (in a statistical significance test against the null hypothesis that the measured time series are independent) to infer a directed relationship between time series.

4 Estimating Transfer Entropy from Time Series via Nearest


Neighbor Statistics: Step by Step
Next we proceed to describe how to obtain a numerical estimation of transfer en-
tropy from raw time series in a typical electrophysiology data set [25]. We assume
time series {x1 , x2 , . . . , xN } and {y1 , y2 , . . . , yN } from two different channels have
been simultaneously registered and preprocessed to exclude artifacts, filter compo-
nents of interest, or perform any type of source reconstruction. N typically amounts
to a few hundred or thousand points, which sampled at 1 kHz correspond to a few hundred or thousand milliseconds, a relevant scale for the dynamics of cognitive tasks. We also consider a typical laboratory setup where a few dozen or hundred trials (R) under similar experimental conditions have been registered.

4.1 Step 1: Reconstructing the State Space


Experimental recordings can only access a limited number of the relevant variables
that determine the full state of the underlying system. However, we are usually inter-
ested in (and formulate hypotheses about) the underlying systems that give rise to the
signals actually being measured. To partially overcome this problem it is possible to
approximately reconstruct the full state space of a dynamical system from a single
series of observations. Takens' delay embedding is the technique of choice for such
reconstructions [65]. It allows one to map a scalar time series into a trajectory in a
state space of possibly high dimension that resembles the trajectory of the underly-
ing system. The mapping uses delay-coordinates to create a set of vectors or points
in a higher dimensional space according to

x_t^{d_x} = \left(x_t, x_{t−τ}, x_{t−2τ}, \ldots, x_{t−(d_x−1)τ}\right) .   (10)

This procedure depends on two parameters, the dimension d and the delay τ of the
embedding. The parameters d and τ considerably affect the outcome of the TE es-
timates. For instance, a low value of d can be insufficient to unfold the state space
of a system and consequently degrade the meaning of transfer entropy. On the other
hand, a too large dimensionality reduces the accuracy of the estimators given a sam-
ple size and can significantly increase the computing time.
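A minimal sketch of the delay-embedding step of Eq. 10 (Python with numpy; the function name and index conventions are ours):

import numpy as np

def delay_embed(x, dim, tau):
    """Map a scalar time series into delay vectors (x_t, x_{t-tau}, ..., x_{t-(dim-1)tau}),
    one row per admissible time index t (Eq. 10)."""
    x = np.asarray(x)
    start = (dim - 1) * tau
    return np.column_stack([x[start - i * tau : len(x) - i * tau] for i in range(dim)])

x = np.arange(10.0)
print(delay_embed(x, dim=3, tau=2))   # rows are (x_t, x_{t-2}, x_{t-4}) for t = 4, ..., 9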
While there is an extensive literature on how to choose such parameters, the dif-
ferent methods proposed are far from reaching a consensus. A popular option is to take the embedding delay as the auto-correlation decay time of the signal or the first minimum, if any, of the auto-mutual information [66]. Once the delay

time of the embedding has been fixed, the Cao criterion offers an algorithm to de-
termine the embedding dimension. This is based on detecting false neighbors due
to points being projected into a too low dimensional state space [67]. However, for
the purpose of interpreting transfer entropy as an information theoretical incarna-
tion of Wiener’s principle of causality, it is important not only that the embedding
parameters allow one to reconstruct the state space but also that they provide an
optimal self-prediction [45, 24]. Otherwise, if the reconstruction is not optimal in
the self-predicting sense, there might be a trivial reason for which the past states of
one system help to predict the future of another system better than from its own past
alone (see Chapter 1 in this volume for more details). Fortunately, the Ragwitz cri-
terion yields delay embedding states that provide optimal self-prediction for a large
class of systems, either deterministic or stochastic in nature. The Ragwitz criterion
is based on scanning the (d, τ) plane to identify the point in that plane that mini-
mizes the locally constant predictor error [68]. This is how we finally recommend
making the choice of the embedding parameters for each time series. However, it is
always a good idea to check how transfer entropy measurements depend on values
for d and τ around those found by any criterion.
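A minimal sketch of such a scan (Python with numpy and scipy; the candidate grid, the leave-one-out locally constant predictor with k neighbors, and all names are our own simplifications rather than the exact procedure of [68]):

import numpy as np
from scipy.spatial import cKDTree

def delay_embed(x, dim, tau):
    start = (dim - 1) * tau
    return np.column_stack([x[start - i * tau : len(x) - i * tau] for i in range(dim)])

def local_prediction_error(x, dim, tau, k=4):
    """Mean squared error of a locally constant predictor: the next value of each
    delay vector is predicted as the average next value of its k nearest neighbors."""
    states = delay_embed(x[:-1], dim, tau)          # states whose successor is known
    targets = x[(dim - 1) * tau + 1:]               # the value following each state
    tree = cKDTree(states)
    # request k+1 neighbors and drop the first, which is the query point itself
    _, idx = tree.query(states, k=k + 1)
    predictions = targets[idx[:, 1:]].mean(axis=1)
    return np.mean((targets - predictions) ** 2)

def ragwitz_scan(x, dims=(1, 2, 3, 4, 5), taus=(1, 2, 3, 4)):
    """Return the (dim, tau) pair with the smallest locally constant prediction error."""
    errors = {(d, tau): local_prediction_error(x, d, tau) for d in dims for tau in taus}
    return min(errors, key=errors.get)

In practice one would additionally exclude temporally close neighbors from the prediction (see the Theiler correction in the next section) and, as recommended above, check how the final transfer entropy values change for parameters around the selected optimum.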

4.2 Step 2: Computing the Transfer Entropy Numerical Estimator


After the state spaces for each time series have been reconstructed, the next step is
to evaluate numerically Eq. 9, which depends explicitly on the propagation time u
and implicitly on the number of nearest neighbors k. We propose to estimate transfer
entropy for each value of u within an interval that includes some crude a priori esti-
mation of the propagation time between the subsystems generating the time series.
The value of k partially controls the bias versus statistical error. Kraskov suggested
that a value of k = 4 gave a good compromise for ECoG recordings from human
epileptic patients [63]. For each application it is also a good idea to scan k and check
how TE depends on it.
Given u and k, the numerical estimation relies on two types of nearest neighbor
searches. The first type ("fixed mass") searches the distance from each state vector in the highest dimensional space (spanned by y_t y_{t-1}^{d_y} x_{t-u}^{d_x}) to its k-th nearest neighbor, typically using the maximum norm. This set of distances (one distance per time index t) defines the radii of the balls for the second search type ("fixed radius"): counting the number of points n_(·) in these balls around each state vector in all the marginal
spaces. To exclude biases due to the slow auto-correlation of signals, it is important
to discard, in all previous searches, state vectors too close in time. This correction, named Theiler correction [69] or dynamic correlation exclusion [66], introduces an extra parameter Th, which is typically set to the largest auto-correlation decay time
of the two time series. Finally, after plugging the set n(·) into Eq. 9 we can obtain a
numerical value for transfer entropy.
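As a small illustration of how the Theiler correction enters the "fixed radius" counting (Python with numpy and scipy; the helper function and the parameter name theiler are our own, not TRENTOOL routines):

import numpy as np
from scipy.spatial import cKDTree

def count_within_radius(points, radii, theiler):
    """For every state vector, count neighbors within the given radius (maximum norm),
    excluding the point itself and all points within +/- theiler time steps of it."""
    tree = cKDTree(points)
    counts = np.empty(len(points), dtype=int)
    for t, (pt, r) in enumerate(zip(points, radii)):
        idx = np.asarray(tree.query_ball_point(pt, r, p=np.inf))
        counts[t] = np.sum(np.abs(idx - t) > theiler)
    return counts

These counts then take the place of the n_(·) terms in Eq. 9; the same temporal exclusion must of course also be applied to the "fixed mass" search that sets the radii.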
Both the ”fixed mass” and ”fixed radius” nearest neighbor searches involved in
the estimation can be computationally expensive in large data sets of high dimen-
sionality [59]. In addition, for typical multichannel recordings there are hundreds to

thousands of possible pairwise combinations of channels for which transfer entropy


is to be estimated. In those conditions the use of efficient algorithms for the nearest neighbor searches is absolutely mandatory. Vejmelka compared box-assisted, k-d trie, and projection algorithms and concluded that the k-d trie method has a clear advantage for dimensions greater than or equal to 4 [70]. Although the state-of-the-art for very
high dimensions is still unsatisfactory, algorithms exploiting parallelization includ-
ing those using GPU technology have been reported to increase the speed of nearest
neighbor searches by factors up to 100 [61].

4.3 Step 3: Using Transfer Entropy as a Statistic


A final estimate of transfer entropy is typically produced by averaging the individual TE estimates over as many trials as possible to reduce its variance. Under stationary conditions, numerical simulations suggest that if the product N · R (data length times number of trials) is large enough, reliable estimates can be obtained even from short time series of a few hundred samples. However, even when averaging over many trials and using the KSG estimator, a certain bias unavoidably remains in the estimate. Thus, the obtained TE values have to be compared against suitable surrogate data using non-parametric statistical testing to infer the presence or absence of directed information transfer [25, 30]. In short, the surrogate data must be produced under the null hypothesis of no source-target directed information transfer, while retaining as many other statistical properties as possible (in particular the state transition probabilities p(y_{t+1} | y_t^{d_y})).
There are at least two natural options to build surrogate data that minimally destroy features of the signals other than their possible dependency. If the data are organized in multiple trials, one way to construct surrogate data is to pair the time series of one of the two signals with the time series of the other signal from the next trial, thus preserving as many data features as possible. If the data are not organized in trials, it is possible to construct surrogates for transfer entropy by cutting one of the time series at a random point and swapping the two resulting blocks (see detailed descriptions of the statistical routines in Lindner et al. [71]).
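The two constructions can be sketched as follows (illustrative helper names of our own; x_trials is an array of shape (R, N) holding R trials, x a single series without trial structure).

import numpy as np

def trial_shift_surrogate(x_trials):
    # pair the source series of each trial with the target series of the *next*
    # trial: within-trial structure is kept, source-target coupling is destroyed
    return np.roll(x_trials, shift=1, axis=0)

def block_swap_surrogate(x, rng=np.random.default_rng()):
    # for data without a trial structure: cut at a random point and swap the blocks
    cut = rng.integers(1, len(x) - 1)
    return np.concatenate([x[cut:], x[:cut]])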
TE values can be quantified as excess TE values with respect to surrogate data:

ΔTE(X → Y; u) = TE(X → Y; u) − TE(X′ → Y; u),  (11)

where X′ denotes the surrogate data. Statistical significance can then be obtained for the excess transfer entropy by non-parametric permutation testing, as detailed in Vicente et al. [25] or Lindner et al. [71], to minimize the potential effects of bias introduced by the small sample size.
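A permutation scheme of this kind can be sketched as follows (illustrative only; the routines actually used in practice are described in [25, 71]). It takes the per-trial TE values obtained for original and surrogate data and builds the null distribution of the excess TE by randomly exchanging the original/surrogate labels within trials.

import numpy as np

def permutation_test_excess_te(te_orig, te_surr, n_perm=10000,
                               rng=np.random.default_rng()):
    # te_orig, te_surr: per-trial TE estimates for original and surrogate data
    observed = np.mean(te_orig) - np.mean(te_surr)       # excess TE of Eq. 11
    both = np.stack([np.asarray(te_orig), np.asarray(te_surr)])
    null = np.empty(n_perm)
    for i in range(n_perm):
        flip = rng.integers(0, 2, size=both.shape[1]).astype(bool)
        perm = both.copy()
        perm[:, flip] = perm[::-1, flip]                  # swap labels within trials
        null[i] = perm[0].mean() - perm[1].mean()
    p_value = (np.sum(null >= observed) + 1) / (n_perm + 1)
    return observed, p_value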

4.4 Toolboxes
Several toolboxes have been developed to tackle some or all of the three steps described above. Here we mention three toolboxes that handle the complexity of the KSG estimation of transfer entropy, but we make no claim that this short list is exhaustive and encourage readers to find the toolbox or software that best fits their intended application domain.
TRENTOOL is an open source MATLAB toolbox co-developed by the authors and M. Lindner that is especially geared towards neurophysiological data sets [71, 72]. It is integrated with the popular FieldTrip toolbox and handles the state space reconstruction, estimation, and non-parametric statistical significance testing of transfer entropy and mutual information for multichannel recordings. It also features parallel search of nearest neighbors and the analysis of non-stationary time series by the ensemble method (see the next section).
TIM is an open source C++ toolbox by K. Rutanen that estimates a large range
of information functionals for continuous-valued signals including transfer entropy,
mutual information, Kullback-Leibler divergence, Shannon entropy, Renyi entropy,
and Tsallis entropy [73].
The Java Information Dynamics Toolkit is software written by J. Lizier that implements all of the relevant information dynamics measures, including basic measures such as entropy, joint entropy, mutual information, and conditional mutual information, as well as advanced measures such as transfer entropy, active information storage, excess entropy, and separable information. It features discrete-valued estimators, kernel estimators, nearest neighbor estimators, and Gaussian approximation based estimators [74].

5 Coping with Non-stationarity: An Ensemble Estimator


Instead of averaging different estimations of TE over trials, it is possible to use the
multi-trial structure much earlier to our advantage. When independent repetitions
or trials of an experimental condition are available, it is possible to use ensemble
averages rather than temporal averages [66] to approximate the quantities involved
in the KSG estimator. By reducing the need for time averaging, ensemble methods
can improve the temporal resolution of functionals of probability densities such as
transfer entropy. The ensemble approach follows [64] and relies on using nearest
neighbor statistics across event-locked trials to estimate a time-resolved transfer
entropy. Crucially, if the data are non-stationary this approach makes better use of the multi-trial structure of the data than averaging over the individual trial estimates of
transfer entropy as described in Section 4. However, the temporal resolution comes
at a high computational cost as the number of nearest neighbor searches can increase
considerably.
Here we describe the formulation of the ensemble method applied to estimate
transfer entropy. In particular, we consider R trials for each of which two time series
X = xt (r) and Y = yt (r) are collected (r = 1, 2, . . . , R). As in Section 4, we assume
that each time series can be approximated by a Markov process and thus that the
state space of each process can be reconstructed by an appropriate delay embedding
(for example using the Ragwitz criterion). To simplify computations for a given pro-
cess (X or Y ) we set its embedding dimension to the largest embedding dimension
estimated over all trials. Thus, for each signal and trial we are led to consider a set of embedded points of the form

x_t^{d_x}(r) = [x_t(r), x_{t−τ}(r), x_{t−2τ}(r), . . . , x_{t−(d_x−1)τ}(r)].  (12)

A time-resolved transfer entropy can be formulated by using only the data points
from all trials belonging to a particular time window (t − σ ,t + σ ). This ensem-
ble TE can be decomposed into a sum of four time-resolved individual Shannon
entropies as in Eq. 5
   
TE(X → Y, t; u) = H(y_{t−1}^{d_y}(r), x_{t−u}^{d_x}(r)) − H(y_t(r), y_{t−1}^{d_y}(r), x_{t−u}^{d_x}(r))
                + H(y_t(r), y_{t−1}^{d_y}(r)) − H(y_{t−1}^{d_y}(r)),  (13)

where r denotes the trial index in the full set of trials. We have also taken into ac-
count that propagation delays u between two processes X and Y affect the timing of
information transfer. Now it is again possible to adapt the KSG estimator to partially
cancel the errors of the different terms. The main difference is that in the ensemble variant of transfer entropy the search for nearest neighbors is carried out across points from all trials, and not only within the same trial as the reference point of the search. If all the trials are aligned according to meaningful
events (such as stimulus or response onset) then it is possible to restrict the search
of neighbors around a time stamp t within a temporal window of width σ to control
the temporal resolution of the estimator. Thus, the ensemble estimator of transfer
entropy reads [64]
 
TE(X → Y, t; u, σ) = ⟨ ψ(k) + ψ(n_{y_{t−1}^{d_y}} + 1) − ψ(n_{y_t, y_{t−1}^{d_y}} + 1) − ψ(n_{y_{t−1}^{d_y}, x_{t−u}^{d_x}} + 1) ⟩_r ,  (14)

where ψ denotes the digamma function and the angle brackets ⟨·⟩_r denote an average over points from different trials at the time index t. Thus, in contrast to the time averaging used in Eq. 9, in this expression averages are taken over points across different trials and the nearest neighbor searches are defined within the temporal window (t − σ, t + σ). The distances to the k-th nearest neighbor in the space spanned by (y_t(r), y_{t−1}^{d_y}(r), x_{t−u}^{d_x}(r)) define the radii of the balls for counting the number of points n(·) in these balls around each state vector in all the marginal spaces (·) involved. This counting is restricted to points within the interval (t − σ, t + σ) across all trials.
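A minimal Python sketch of this scheme is given below (again purely illustrative, not the TRENTOOL implementation; the Theiler correction is omitted). The trial data are assumed to be already embedded, with y_now of shape (R, N), y_past of shape (R, N, d_y) and x_past of shape (R, N, d_x); reference points are taken at time index t in each trial, while the neighbour searches run over all points of all trials inside the window (t − σ, t + σ).

import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def ensemble_te_at_t(y_now, y_past, x_past, t, sigma, k=4):
    dy, dx = y_past.shape[-1], x_past.shape[-1]
    win = slice(max(t - sigma, 0), t + sigma + 1)
    # search set: all embedded points of all trials inside the window
    yn_s = y_now[:, win].reshape(-1, 1)
    yp_s = y_past[:, win].reshape(-1, dy)
    xp_s = x_past[:, win].reshape(-1, dx)
    # reference points: one per trial, at the time index t
    yn_r, yp_r, xp_r = y_now[:, t, None], y_past[:, t, :], x_past[:, t, :]

    # "fixed mass": distance from each reference point to its k-th neighbour
    # in the pooled joint space (the reference point itself lies in the pool)
    eps = cKDTree(np.hstack([yn_s, yp_s, xp_s])).query(
        np.hstack([yn_r, yp_r, xp_r]), k=k + 1, p=np.inf)[0][:, -1] - 1e-12

    def count(space_s, space_r):
        # "fixed radius": neighbour counts around the reference points
        tree = cKDTree(space_s)
        return np.array([len(tree.query_ball_point(p, r, p=np.inf)) - 1
                         for p, r in zip(space_r, eps)])

    return np.mean(digamma(k) + digamma(count(yp_s, yp_r) + 1)
                   - digamma(count(np.hstack([yn_s, yp_s]), np.hstack([yn_r, yp_r])) + 1)
                   - digamma(count(np.hstack([yp_s, xp_s]), np.hstack([yp_r, xp_r])) + 1))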
To facilitate its computation, the ensemble estimator, including the state space
reconstruction and the statistical significance testing, has been recently added to the
TRENTOOL open source toolbox [72].

The ensemble estimator has recently been applied to detect time-dependent couplings between processes [64, 61], and it is closely related to the local measures of information discussed by Lizier in Chapter 7 of this volume.

6 Discussion
The characterization of a system in terms of the information transfer between its
subsystems is a common goal in many fields of science. However, this approach
seems particularly necessary when dealing with systems such as the nervous sys-
tem for which implementing a flexible routing of information is a key function.
Throughout this chapter we have described different methods to estimate transfer entropy from continuous-valued time series typical of electrophysiological recordings. We have also argued that methods based on nearest neighbor statistics provide efficient estimators, and we have detailed the KSG estimator as an attractive option for practical applications.
In this description we have restricted ourselves to a bivariate formulation of trans-
fer entropy. However, to distinguish cascade effects and common drive interactions
in networks of possibly interacting systems (as measured by multichannel record-
ings), it is essential to go beyond the bivariate formulation. While the mathematical extension to the multivariate case is straightforward, its numerical estimation is rather challenging. The curse of dimensionality and the combinatorial explosion of possibilities make an exhaustive computation of transfer entropies beyond order 3 impractical for applications with tens or hundreds of channels, as occurs in typical EEG/MEG recordings. Fortunately, recent developments on the optimal subselection of channels as well as efficient multivariate embedding reconstructions have paved the way for practical approximations to higher-order transfer entropies in multichannel recordings [75, 76].
Future developments are expected to fully exploit the low dimensionality of the
manifolds on which the dynamics of many systems live. Since the manifold dimensionality is typically far lower than that of the Euclidean embedding space, it is possible that non-linear manifold learning techniques might provide a substantial leap over current standard techniques. A fully rigorous mathematical formulation of transfer entropy for point processes such as spike trains, including an adequate state space reconstruction, would also be very welcome. On the applications side, the numerical decomposition of transfer entropy into state-dependent and state-independent contributions seems a very useful tool to better discern the role of a receiving system in processing information.
Finally, it should be noted that since the seminal 1948 works of Wiener (on cybernetics [77]) and Shannon (on the quantification of information [13]), the idea that unifying informational aspects run deep below the diverse physical descriptions of many phenomena has been slowly gaining importance [78]. We believe that the characterization of complex systems using transfer entropy, as well as other functionals describing the dynamics of information [5, 79], is a promising approach towards understanding what diverse complex systems of computational units really have in common.

Acknowledgements. The authors would like to thank Wei Wu and Joe Lizier for fruitful
discussions and suggestions.

References
1. Hubel, D.H., Wiesel, T.N.: Receptive fields and functional architecture of monkey striate
cortex. The Journal of Physiology 195(1), 215–243 (1968)
2. Gray, C.M., König, P., Engel, A.K., Singer, W.: Oscillatory responses in cat visual cortex
exhibit inter-columnar synchronization which reflects global stimulus properties. Na-
ture 338(6213), 334–337 (1989)
3. Canolty, R.T., Knight, R.T.: The functional role of cross-frequency coupling. Trends in
Cognitive Sciences 14(11), 506–515 (2010)
4. Victor, J.D.: Approaches to information-theoretic analysis of neural activity. Biological
Theory 1(3), 302 (2006)
5. Lizier, J.T.: The Local Information Dynamics of Distributed Computation in Complex
Systems. Springer theses. Springer (2013)
6. Lehmann, E.L., Casella, G.: Theory of point estimation, vol. 31. Springer (1998)
7. Niso, G., Bruña, R., Pereda, E., Gutiérrez, R., Bajo, R., Maestú, F., Del-Pozo, F.: HERMES:
Towards an integrated toolbox to characterize functional and effective brain connectivity.
Neuroinformatics 11, 405–434 (2013)
8. Pereda, E., Quiroga, R.Q., Bhattacharya, J.: Nonlinear multivariate analysis of neuro-
physiological signals. Progress in Neurobiology 77(1), 1–37 (2005)
9. Cover, T.M., Thomas, J.A.: Elements of information theory. Wiley-Interscience, New
York (1991)
10. Latham, P.E., Nirenberg, S.: Synergy, redundancy, and independence in population
codes, revisited. J. Neurosci. 25(21), 5195–5206 (2005)
11. Williams, P.L., Beer, R.D.: Nonnegative decomposition of multivariate information.
arXiv preprint arXiv:1004.2515 (2010)
12. Rieke, F., Warland, D., de Ruyter van Steveninck, R., Bialek, W.: Spikes: exploring the
neural code (computational neuroscience). MIT Press (1999)
13. Shannon, C.E.: A mathematical theory of communication. The Bell System Technical Journal 27(3), 379–423 (1948)
14. Shannon, C.E., Weaver, W.: The Mathematical Theory of Communication. University of Illinois Press, Urbana, IL (1949)
15. Barlow, H.B.: Possible principles underlying the transformation of sensory messages.
Sensory Communication, 217–234 (1961)
16. de Ruyter van Steveninck, R.R., Laughlin, S.B.: The rate of information transfer at
graded-potential synapses. Nature 379(6566), 642–645 (1996)
17. Lewicki, M.S.: Efficient coding of natural sounds. Nature Neuroscience 5(4), 356–363
(2002)
18. Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current Opinion in Neu-
robiology 14(4), 481–487 (2004)
19. Johnson, D.H.: Information theory and neuroscience: Why is the intersection so small?
In: IEEE Information Theory Workshop, ITW 2008, pp. 104–108 (2008)
20. Shannon, C.E.: The bandwagon. IRE Transactions on Information Theory 2(1), 3 (1956)
21. Nirenberg, S.H., Victor, J.D.: Analyzing the activity of large populations of neurons: how
tractable is the problem? Current Opinion in Neurobiology 17(4), 397–400 (2007)
22. Johnson, D.H.: Information theory and neural information processing. IEEE Transac-
tions on Information Theory 56(2), 653–666 (2010)
23. Schreiber, T.: Measuring information transfer. Phys. Rev. Lett. 85(2), 461–464 (2000)
24. Wiener, N.: The theory of prediction. In: Beckmann, E.F. (ed.) Modern Mathematics for
the Engineer. McGraw-Hill, New York (1956)
25. Vicente, R., Wibral, M., Lindner, M., Pipa, G.: Transfer entropy – a model-free measure
of effective connectivity for the neurosciences. J. Comput. Neurosci. 30(1), 45–67 (2011)
26. Ay, N., Polani, D.: Information flows in causal networks. Adv. Complex Syst. 11, 17
(2008)
27. Kaiser, A., Schreiber, T.: Information transfer in continuous processes. Physica D 166,
43 (2002)
28. Chicharro, D., Ledberg, A.: When two become one: the limits of causality analysis of
brain dynamics. PLoS One 7(3), e32466 (2012)
29. Chávez, M., Martinerie, J., Le Van Quyen, M.: Statistical assessment of nonlinear causal-
ity: application to epileptic EEG signals. J. Neurosci. Methods 124(2), 113–128 (2003)
30. Wibral, M., Rahm, B., Rieder, M., Lindner, M., Vicente, R., Kaiser, J.: Transfer entropy
in magnetoencephalographic data: Quantifying information flow in cortical and cerebel-
lar networks. Prog. Biophys. Mol. Biol. 105(1-2), 80–97 (2011)
31. Vicente, R., Gollo, L.L., Mirasso, C.R., Fischer, I., Pipa, G.: Dynamical relaying can
yield zero time lag neuronal synchrony despite long conduction delays. Proceedings of
the National Academy of Sciences 105(44), 17157–17162 (2008)
32. Kay, S.M.: Fundamentals of statistical signal processing. In: Estimation Theory, vol. 1
(1993)
33. Hlaváčková-Schindler, K., Paluš, M., Vejmelka, M., Bhattacharya, J.: Causality detec-
tion based on information-theoretic approaches in time series analysis. Physics Re-
ports 441(1), 1–46 (2007)
34. Gourevitch, B., Eggermont, J.J.: Evaluating information transfer between auditory corti-
cal neurons. J. Neurophysiol. 97(3), 2533–2543 (2007)
35. Ito, S., Hansen, M.E., Heiland, R., Lumsdaine, A., Litke, A.M., Beggs, J.M.: Extending
transfer entropy improves identification of effective connectivity in a spiking cortical
network model. PLoS One 6(11), e27431 (2011)
36. Li, Z., Li, X.: Estimating temporal causal interaction between spike trains with permuta-
tion and transfer entropy. PLoS One 8(8), e70894 (2013)
37. Barnett, L., Barrett, A.B., Seth, A.K.: Granger causality and transfer entropy are equiva-
lent for Gaussian variables. Phys. Rev. Lett. 103(23), 238701 (2009)
38. Hlavácková-Schindler, K.: Equivalence of Granger causality and transfer entropy: A gen-
eralization. Applied Mathematical Sciences 5(73), 3637–3648 (2011)
39. Nichols, J.M., Seaver, M., Trickey, S.T., Todd, M.D., Olson, C., Overbey, L.: Detecting
nonlinearity in structural systems using the transfer entropy. Phys. Rev. E Stat. Nonlin.
Soft Matter Phys. 72(4 Pt. 2), 046217 (2005)
40. Hahs, D.W., Pethel, S.D.: Transfer entropy for coupled autoregressive processes. En-
tropy 15(3), 767–788 (2013)
41. Barnett, L., Bossomaier, T.: Transfer entropy as a log-likelihood ratio. Physical Review
Letters 109(13), 138105 (2012)
42. Miller, G.A.: Note on the bias of information estimates. Information Theory in Psychol-
ogy: Problems and Methods 2, 95–100 (1955)
43. Efron, B., Stein, C.: The jackknife estimate of variance. The Annals of Statistics, 586–
596 (1981)
44. Pompe, B., Runge, J.: Momentary information transfer as a coupling measure of time
series. Phys. Rev. E Stat. Nonlin. Soft Matter Phys. 83(5 Pt. 1), 051122 (2011)
45. Wibral, M., Pampu, N., Priesemann, V., Siebenhühner, F., Seiwert, H., Lindner, M., Lizier,
J.T., Vicente, R.: Measuring information-transfer delays. PLoS One 8(2), e55809 (2013)
46. Paluš, M.: Testing for nonlinearity using redundancies: Quantitative and qualitative as-
pects. Physica D: Nonlinear Phenomena 80(1), 186–205 (1995)
47. Fraser, A.M., Swinney, H.L.: Independent coordinates for strange attractors from mutual
information. Phys. Rev. A. 33, 1134 (1986)
48. Darbellay, G.A., Vajda, I.: Estimation of the information by an adaptive partitioning
of the observation space. IEEE Transactions on Information Theory 45(4), 1315–1321
(1999)
49. Cellucci, C.J., Albano, A.M., Rapp, P.E.: Statistical validation of mutual information
calculations: Comparison of alternative numerical algorithms. Physical Review E 71(6),
066208 (2005)
50. Daub, C.O., Steuer, R., Selbig, J., Kloska, S.: Estimating mutual information using b-
spline functions–an improved similarity measure for analysing gene expression data.
BMC Bioinformatics 5(1), 118 (2004)
51. Victor, J.: Binless strategies for estimation of information from neural data. Phys. Rev.
E 72, 051903 (2005)
52. Silverman, B.W.: Density estimation for statistics and data analysis, vol. 26. CRC Press
(1986)
53. Young-Il, M., Rajagopalan, B., Lall, U.: Estimation of mutual information using kernel
density estimators. Physical Review E 52(3), 2318 (1995)
54. Steuer, R., Kurths, J., Daub, C.O., Weise, J., Selbig, J.: The mutual information: detecting
and evaluating dependencies between variables. Bioinformatics 18(suppl. 2), S231–S240
(2002)
55. Kozachenko, L.F., Leonenko, N.N.: Sample estimate of entropy of a random vector.
Probl. Inform. Transm. 23, 95–100 (1987)
56. Knuth, D.E.: The art of computer programming. In: Sorting and Searching, vol. 3 (1973)
57. Vaidya, P.M.: An O(n logn) algorithm for the all-nearest-neighbors problem. Discrete &
Computational Geometry 4(1), 101–115 (1989)
58. Zezula, P., Amato, G., Dohnal, V., Batko, M.: Similarity search: The metric space ap-
proach. Advances in Database Systems, vol. 32. Springer, Secaucus (2005)
59. Heineman, G.T., Pollice, G., Selkow, S.: Algorithms in a Nutshell. O’Reilly Media, Inc.
(2009)
60. Merkwirth, C., Parlitz, U., Lauterborn, W.: Fast nearest-neighbor searching for nonlinear signal pro-
cessing. Phys. Rev. E Stat. Phys. Plasmas Fluids Relat. Interdiscip. Topics 62(2 Pt. A),
2089–2097 (2000)
61. Wollstadt, P., Martinez-Zarzuela, M., Vicente, R., Wibral, M.: Efficient transfer entropy
analysis of nonstationary neural time series. arXiv preprint arXiv:1401.4068 (2014)
62. Kraskov, A., Stoegbauer, H., Grassberger, P.: Estimating mutual information. Phys. Rev.
E Stat. Nonlin. Soft Matter Phys. 69(6 Pt. 2), 066138 (2004)
63. Kraskov, A.: Synchronization and Interdependence measures and their application to the
electroencephalogram of epilepsy patients and clustering of data. PhD thesis, University
of Wuppertal (February 2004)
64. Gomez-Herrero, G., Wu, W., Rutanen, K., Soriano, M.C., Pipa, G., Vicente, R.: Assess-
ing coupling dynamics from an ensemble of time series. arXiv preprint arXiv:1008.0539
(2010)
65. Takens, F.: Detecting Strange Attractors in Turbulence. In: Dynamical Systems and Tur-
bulence, Warwick, 1980. Lecture Notes in Mathematics, vol. 898, pp. 366–381. Springer
(1981)
66. Kantz, H., Schreiber, T.: Nonlinear Time Series Analysis, 2nd edn. Cambridge University
Press (November 2003)
67. Cao, L.Y.: Practical method for determining the minimum embedding dimension of a
scalar time series. Physica D 110(1-2), 43–50 (1997)
68. Ragwitz, M., Kantz, H.: Markov models from data by simple nonlinear time series pre-
dictors in delay embedding spaces. Phys. Rev. E Stat. Nonlin. Soft Matter Phys. 65(5 Pt.
2), 056201 (2002)
69. Theiler, J.: Spurious dimension from correlation algorithms applied to limited time-series
data. Physical Review A 34(3), 2427 (1986)
70. Vejmelka, M., Hlaváčková-Schindler, K.: Mutual information estimation in higher di-
mensions: A speed-up of a k-nearest neighbor based estimator. In: Beliczynski, B.,
Dzielinski, A., Iwanowski, M., Ribeiro, B. (eds.) ICANNGA 2007, Part I. LNCS,
vol. 4431, pp. 790–797. Springer, Heidelberg (2007)
71. Lindner, M., Vicente, R., Priesemann, V., Wibral, M.: Trentool: A Matlab open source
toolbox to analyse information flow in time series data with transfer entropy. BMC Neu-
rosci. 12(119), 1–22 (2011)
72. Lindner, M., Vicente, R., Wibral, M., Pampu, N., Wollstadt, P., Martinez-Zarzuela, M.:
TRENTOOL, http://www.trentool.de
73. Rutanen, K.: TIM 1.2.0,
http://www.cs.tut.fi/˜timhome/tim-1.2.0/tim.htm
74. Lizier, J.: Java Information Dynamics Toolkit,
http://code.google.com/p/information-dynamics-toolkit/
75. Faes, L., Nollo, G., Porta, A.: Non-uniform multivariate embedding to assess the infor-
mation transfer in cardiovascular and cardiorespiratory variability series. Comput. Biol.
Med. 42(3), 290–297 (2012)
76. Lizier, J.T., Rubinov, M.: Inferring effective computational connectivity using incremen-
tally conditioned multivariate transfer entropy. BMC Neuroscience 14(suppl. 1), P337
(2013)
77. Wiener, N.: Cybernetics. Hermann, Paris (1948)
78. Davies, P.C.W., Gregersen, N.H.: Information and the Nature of Reality, vol. 3. Cam-
bridge University Press, Cambridge (2010)
79. Barnett, L., Lizier, J.T., Harré, M., Seth, A.K., Bossomaier, T.: Information flow in a
kinetic Ising model peaks in the disordered phase. Physical Review Letters 111(17),
177203 (2013)
Part II
Information Transfer in Neural and Other
Physiological Systems

This part of the book provides example applications of measures of information transfer to physiological systems, and models thereof. Specific methodological top-
ics covered here are the analysis of information transfer in multivariate (multi-node)
systems in the chapters by Faes and Porta, and by Marinazzo and colleagues, and
the dependence of information transfer patterns on the dynamic state of a system despite an unchanged causal architecture in the chapter by Battaglia. The final chapter
of the second part of this book, by Vakorin and colleagues, analyses the effect that
the information content of source and target have on the transfer of information be-
tween them, as well as the influence of the relative phase between source and target
time series.
Conditional Entropy-Based Evaluation of
Information Dynamics in Physiological Systems

Luca Faes and Alberto Porta

Abstract. We present a framework for quantifying the dynamics of information in coupled physiological systems based on the notion of conditional entropy (CondEn).
First, we revisit some basic concepts of information dynamics, providing definitions
of self entropy (SE), cross entropy (CE) and transfer entropy (TE) as measures of
information storage and transfer in bivariate systems. We discuss also the gener-
alization to multivariate systems, showing the importance of SE, CE and TE as
relevant factors in the decomposition of the system predictive information. Then,
we show how all these measures can be expressed in terms of CondEn, and devise
accordingly a framework for their data-efficient estimation. The framework builds
on a CondEn estimator that follows a sequential conditioning procedure whereby
the conditioning vectors are formed progressively according to a criterion for Con-
dEn minimization, and performs a compensation for the bias occurring for condi-
tioning vectors of increasing dimension. The framework is illustrated on numerical
examples showing its capability to deal with the curse of dimensionality in the mul-
tivariate computation of CondEn, and to reliably estimate SE, CE and TE in the
challenging conditions of biomedical time series analysis featuring noise and small
sample size. Finally, we illustrate the practical application of the presented frame-
work to cardiovascular and neural time series, reporting some applicative examples
in which SE, CE and TE are estimated to quantify the information dynamics of the
underlying physiological systems.
Luca Faes
Department Physics and BIOtech Center, University of Trento, Trento, Italy
e-mail: luca.faes@unitn.it
Alberto Porta
Department of Biomedical Sciences for Health, Galeazzi Orthopaedic Institute,
University of Milan, Milan, Italy
e-mail: alberto.porta@unimi.it


1 Introduction
The study of many physical phenomena is often performed according to a reduction-
ist approach whereby the dynamics of the observed complex system are described
as resulting from the activity of less complex subsystems and from the interaction
among these subsystems. For instance, the human brain can be seen as a complex
network characterized by distinct neural ensembles, each represented by a single
oscillator, which are highly interconnected with each other according to specific
patterns of connectivity [1]. With a broader perspective, the whole human organism
can be seen as an integrated network where multiple physiological systems under
the neural regulation, such as the cardiac, circulatory, respiratory and muscular sys-
tems, each with its own internal dynamics, continuously interact to preserve the
overall physiological function [2]. The aim of this approach is to describe how com-
plex properties of the observed network arise from the dynamics and the dynamical
interaction of simpler and likely more accessible parts. When the physiology of the
composite system and the way in which the subsystems interact is well known, the
analysis may be performed constructing suitable generative models and comparing
the dynamics of these models with the available experimental data. If, as often hap-
pens, the available knowledge is insufficient to support the definition of a generative
model, data-driven approaches are needed whereby the properties of the subsystems
and their interactions are estimated from the measured data.
When the data-driven approach is considered, the central need is to identify a
suitable framework for describing the properties of the observed complex network
in terms of meaningful measures of system activity and connectivity. In this chapter,
we focus on the well-posed analysis framework provided by dynamical information
theory [3]. Compared with other frameworks commonly used for the analysis of
physiological networks, for instance the linear parametric representation of multi-
ple time series performed either in time or frequency domains [4], the information-
theoretic approach offers the intriguing possibility of exploring system dynamics
from a nonlinear and model-free perspective. Attracted by this opportunity, several
researchers have defined and developed different measures of information dynam-
ics based on the computation of entropy rates. In particular, single-process condi-
tional entropy measures computed via their various formulations (e.g., approximate
entropy [5], sample entropy [6], corrected conditional entropy [7]) are typical mea-
sures of system complexity, while mutual information and cross-entropies quan-
tify the information shared between coupled systems [8, 9], and measures based
on transfer entropy [10] quantify the directional information flow from one sys-
tem to another. Recent advances in the field of information dynamics have shown
that, if properly defined and contextualized, these measures form the basis of dis-
tributed computation in complex networks [11] and are not independent of each other when the aim is to characterize the overall behavior of a network of interacting systems [12]. In this contribution, new definitions of the so-called self entropy (SE) and cross entropy (CE) are integrated, together with the well known transfer entropy (TE), into a unified framework for the study of information dynamics.
These elements are combined to compute the reduction of the information associated with the target system due to the knowledge of the dynamics of a network of interacting dynamical components.
Another aspect of paramount importance in practical analysis is the design of
data-efficient procedures for the estimation of information-theoretic measures in the
challenging conditions of physiological signal analysis. Surveying our past and re-
cent research in the field [7],[9],[13],[14],[15],[16],[17],[18],[19],[20], we present
in the second part of the chapter a general strategy for the estimation of SE, CE
and TE from short realizations of multivariate processes. The strategy is based on
the utilization of a corrected conditional entropy estimator and of appropriate em-
bedding schemes, and aims at dealing with the curse of dimensionality –an issue
unavoidably affecting the estimation of information-theoretic quantities defined in
high-dimensional state spaces from time series of limited length. As suggested by
the reported practical applications this approach successfully describes, in terms of
SE, CE and TE estimated from multivariate time series, both individual and collec-
tive properties of the systems composing brain and physiological networks.

2 Information Dynamics in Coupled Systems


2.1 Self Entropy, Cross Entropy and Transfer Entropy in Bivariate Systems
Let us consider two dynamical processes X and Y, taken as descriptive of two pos-
sibly coupled dynamical systems. In order to study the information relevant to an
assigned target process, say Y, standard information-theoretic measures can be ex-
ploited [21]. First, we can consider the central quantity in information theory, that
is the Shannon entropy, which expresses the amount of information carried by the
target process in terms of the average uncertainty about Y:

H (Y ) = − ∑ p (y) log p (y), (1)

where p(y) is the probability for the variable Y to take the value y. The conditional
entropy (CondEn) quantifies the average uncertainty that remains about Y when X is
known as:
H (Y |X) = − ∑ p (x, y) log p (y|x), (2)
while the mutual information (MI) quantifies the amount of information shared
between X and Y as:
I(X; Y) = ∑ p(x, y) log [ p(y|x) / p(y) ],  (3)
where p(x,y) is the joint probability of observing simultaneously the values x and
y for the variables X and Y, and p(y|x) is the conditional probability of observing y given that x has been observed. Note that the sums in (1-3) extend over the sets of
all values with nonzero probability. Combining (1), (2) and (3) one can easily see
that the three measures are linked to each other by the relation I(X;Y)=H(Y)-H(Y|X).

Moreover, considering a third process Z possibly affecting the relation between X and Y, the conditional mutual information between X and Y given Z:

I(X; Y | Z) = ∑ p(x, y, z) log [ p(y|x, z) / p(y|z) ],  (4)

quantifies the mutual information between X and Y when Z is known. Similarly to the MI, the conditional MI in (4) can also be stated in terms of CondEn as I(X;Y|Z) = H(Y|Z) − H(Y|X,Z). The formulations (1-4) hold equivalently when one or
more processes are composed of multiple sub-processes; in such a case, the relevant
variables X, Y, Z are treated as vector variables. The base of the logarithms in (1-4)
can be any convenient real number; a common choice in dynamical system analysis
is to choose e as the base, so that entropy is measured in natural units.
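For discrete-valued realizations, Eqs. (1)-(4) can be estimated directly from relative frequencies. The following minimal Python sketch (helper names are our own) does this with the natural logarithm; it is the plain plug-in estimator, without any bias correction.

import numpy as np
from collections import Counter

def entropy(*series):
    # joint Shannon entropy of one or more aligned discrete series (plug-in estimate)
    states = list(zip(*series))
    p = np.array(list(Counter(states).values())) / len(states)
    return -np.sum(p * np.log(p))

def cond_entropy(y, *cond):                 # H(Y | cond) = H(Y, cond) - H(cond), Eq. (2)
    return entropy(y, *cond) - entropy(*cond)

def mutual_info(x, y):                      # I(X; Y) = H(Y) - H(Y | X), cf. Eq. (3)
    return entropy(y) - cond_entropy(y, x)

def cond_mutual_info(x, y, z):              # I(X; Y | Z) = H(Y | Z) - H(Y | X, Z), cf. Eq. (4)
    return cond_entropy(y, z) - cond_entropy(y, x, z)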
The information measured in (1-4) is static in the sense that the computation
of entropy, CondEn and MI does not take the temporal evolution of the observed
systems into account. The dynamic properties of a system can be studied in the in-
formation domain by introducing the concept of transition probability, which is the probability associated with the transition of a system from its past states to its present state. To introduce the notation for information dynamics, let X_n denote the present of the process X (i.e., the scalar variable obtained by sampling the process at the present time n), and X_n^- = [X_{n−1} X_{n−2} ···] denote the state variable of the process X (i.e., the infinite-dimensional vector variable collecting the whole past of X up to time n−1). Moreover, single realizations of the variables X_n and X_n^- are denoted by the lowercase letters x_n and x_n^- = [x_{n−1} x_{n−2} ···] (the same notation holds for the target process Y). Then, for instance, the transition probabilities p(y_n | y_n^-) and p(y_n | x_n^-) measure the probability for the target process to take the value y_n knowing, respectively, that its own state variable has taken the value y_n^-, or that the state variable of the source system has taken the value x_n^-.
The dynamic information contained in the transition probabilities can be ex-
ploited to assess how the state transitions contribute to the information carried by a
system. First, the influence of the past states of the target process Y onto its present
state can be assessed by means of the self entropy (SE), which we define as:

S_Y = ∑ p(y_n, y_n^-) log [ p(y_n | y_n^-) / p(y_n) ],  (5)

that quantifies the average reduction in uncertainty about Y_n resulting from the knowledge of Y_n^-. The SE ranges from S_Y = 0, measured when the past states Y_n^- do not provide any reduction in the uncertainty about the present state Y_n (i.e., when p(y_n | y_n^-) = p(y_n)), to its maximum value S_Y = H(Y_n), measured when the whole uncertainty about Y_n is reduced by learning Y_n^- (i.e., when p(y_n | y_n^-) = 1). Moreover, the influence of the past states of the source process X onto the present state of the target process Y can be assessed by means of the cross entropy (CE), defined here as:

C_{X→Y} = ∑ p(y_n, x_n^-) log [ p(y_n | x_n^-) / p(y_n) ],  (6)

that quantifies the average reduction in uncertainty about Y_n resulting from the knowledge of X_n^-. Similarly to the SE, the CE ranges from C_{X→Y} = 0 to C_{X→Y} = H(Y_n), measured respectively when X_n^- does not bring any uncertainty reduction about Y_n, and when X_n^- reduces the whole uncertainty about Y_n. As an alternative, the influence of X_n^- on Y_n can be assessed by the well known transfer entropy (TE) [10]:

T_{X→Y} = ∑ p(y_n, y_n^-, x_n^-) log [ p(y_n | x_n^-, y_n^-) / p(y_n | y_n^-) ].  (7)

In contrast to the CE, the average reduction in uncertainty about Y_n resulting from the knowledge of X_n^- is here assessed by taking into account the past values Y_n^- of the target process. In fact, the TE ranges from T_{X→Y} = 0, measured when X_n^- does not bring any uncertainty reduction about Y_n beyond that brought by Y_n^-, to T_{X→Y} = H(Y_n | Y_n^-), measured when the whole uncertainty about Y_n that was not already reduced by Y_n^- is reduced by X_n^-.
Note that the presented formulation presupposes working with stationary and discrete processes, because all measures are computed as expectation values of each logarithmic quantity over all possible configurations and assuming a finite alphabet for the possible values taken by the variables X_n and Y_n. Nevertheless, non-stationary and continuous-valued processes can be treated within this framework by exploiting suitable local formulations and entropy estimators [12, 22].
The three measures of information dynamics introduced above may be expressed
in compact form in terms of MI and conditional MI, or equivalently in terms of
entropy and CondEn, obtaining respectively:

S_Y = I(Y_n; Y_n^-) = H(Y_n) − H(Y_n | Y_n^-),  (8)

C_{X→Y} = I(Y_n; X_n^-) = H(Y_n) − H(Y_n | X_n^-),  (9)

T_{X→Y} = I(Y_n; X_n^- | Y_n^-) = H(Y_n | Y_n^-) − H(Y_n | X_n^-, Y_n^-).  (10)

From these compact formulations, it is intuitive to see that SE, CE and TE mea-
sure the reduction in the information carried by Y respectively due to the introduc-
tion of its own past, due to the introduction of the past of X when the contribution
of the past of Y is not taken into account, and due to the introduction of the past of
X when the contribution of the past of Y is taken into account.
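Using the discrete-estimation helpers sketched after Eq. (4), the three measures can be approximated for discrete series by representing the past state with a single lagged sample (d = 1), a deliberately crude choice made only for illustration.

def self_entropy(y):
    y_now, y_past = y[1:], y[:-1]
    return mutual_info(y_past, y_now)                  # S_Y = I(Y_n; Y_n^-), Eq. (8)

def cross_entropy_xy(x, y):
    y_now, x_past = y[1:], x[:-1]
    return mutual_info(x_past, y_now)                  # C_{X->Y} = I(Y_n; X_n^-), Eq. (9)

def transfer_entropy_xy(x, y):
    y_now, y_past, x_past = y[1:], y[:-1], x[:-1]
    return cond_mutual_info(x_past, y_now, y_past)     # T_{X->Y} = I(Y_n; X_n^- | Y_n^-), Eq. (10)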

2.2 Self Entropy, Cross Entropy and Transfer Entropy in Multivariate Systems
In this section we extend the formulation presented in Sect. 2.1 to the general situation of M interacting (sub)systems composing an overall observed dynamical system. The extension is based on properly conditioning on the M−2 processes other than the two considered source and destination processes, in order to rule out the side information related to these other processes that may possibly confound the analysis of information dynamics. This is achieved by defining multivariate (conditional) variants of the self, cross and transfer entropy measures as follows. Suppose that we are interested in evaluating the information of the target process Y in relation to the source process X, collecting the remaining processes in the set Z = {Z^{(k)}}_{k=1,...,M−2}. Then, the multivariate SE of Y given Z quantifies the additional reduction of information about Y_n due to the introduction of Y_n^- in addition to Z_n^-, thus accounting for the contribution of Y_n^- that is not already provided by Z_n^-:

S_{Y|Z} = I(Y_n; Y_n^- | Z_n^-) = H(Y_n | Z_n^-) − H(Y_n | Y_n^-, Z_n^-);  (11)

the multivariate CE from X to Y given Z quantifies the additional reduction of information about Y_n due to the introduction of X_n^- in addition to Z_n^-, thus accounting for the contribution of X_n^- that is not already provided by Z_n^-:

C_{X→Y|Z} = I(Y_n; X_n^- | Z_n^-) = H(Y_n | Z_n^-) − H(Y_n | X_n^-, Z_n^-);  (12)

the multivariate TE from X to Y given Z quantifies the additional reduction of information about Y_n due to the introduction of X_n^- in addition to Z_n^- and Y_n^-, thus accounting for the contribution of X_n^- that is not already provided by Y_n^- and by Z_n^-:

T_{X→Y|Z} = I(Y_n; X_n^- | Y_n^-, Z_n^-) = H(Y_n | Y_n^-, Z_n^-) − H(Y_n | X_n^-, Y_n^-, Z_n^-).  (13)

The multivariate measures defined in (11-13) are natural extensions of the bivariate measures (8-10), defined to study in the information domain the dynamical
dependencies within a single system or between two systems. As we will see in the
next Section, these extensions are useful to understand how dynamical dependencies
change when the two systems are considered as constituents of a larger network of
interacting systems rather than as an isolated bivariate system.
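With the same crude single-lag approximation of the past states used above, the conditioned measures of Eqs. (11)-(13) can be sketched by simply enlarging the conditioning set of the discrete helpers; for example, for a single additional process Z:

def multivariate_self_entropy(y, z):
    y_now, y_past, z_past = y[1:], y[:-1], z[:-1]
    return cond_mutual_info(y_past, y_now, z_past)     # S_{Y|Z} = I(Y_n; Y_n^- | Z_n^-), Eq. (11)

def multivariate_transfer_entropy(x, y, z):
    y_now, y_past, x_past, z_past = y[1:], y[:-1], x[:-1], z[:-1]
    return (cond_entropy(y_now, y_past, z_past)        # T_{X->Y|Z}, Eq. (13)
            - cond_entropy(y_now, x_past, y_past, z_past))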

2.3 Self Entropy, Cross Entropy and Transfer Entropy as Components of System Predictive Information
The three measures of information dynamics presented in Sect. 2.1 represent key elements for the description of the dynamical information generated and shared between coupled systems. In the study of dynamical systems, the SE has been associated with the concept of active information storage [23], and with that of regularity intended as the inverse of complexity [5, 7]. The CE has been used as a measure of
directed information [14, 24, 25], and also as a measure of coupling when evaluated
as the maximum information exchanged over the two opposite directions of interac-
tion between two coupled systems [9, 13]. The TE is a very well known measure of
predictive information transfer between systems [10, 26], closely reflecting the ubiq-
uitous concept of Granger causality [27, 28]. Moreover, in situations where more
than two dynamical systems are known to interact with each other, the utilization
of conditional MI measures has proven useful to identify statistical dependencies
between pairs of systems in the context of their multivariate representation where
the remaining interacting systems are considered. For instance, the multivariate ver-
sion of the TE has been proven useful to address the confounding effects of indirect
connections in the estimation of direct information transfer between the nodes of a
network [29]. In addition, the conditional MI I(X;Y|Z) has been given a novel interpretation in terms of partial information decomposition [30], which has been used to show, e.g., in cellular automata [3], that the multivariate TE T_{X→Y|Z} assesses the predic-
tive information transfer by suppressing the redundant information provided by X
and Z about Y, but also incorporating the synergistic information found in X and Z
about Y.
Interestingly, SE, CE and TE do not describe isolated aspects of the dynamics of
information in a composite dynamical system. In accordance with a recently pro-
posed information-theoretic framework for the study of dependencies in networks
of dynamical systems [12], we show here that SE, CE and TE appear naturally as
terms in the decomposition of the system predictive information about the observed
target system. The predictive information is defined for an assigned target process
as the amount of information about the present state of the process that is explained
by its past states and the past states of all other available processes. This can be
measured, for a bivariate process {X,Y} where Y is taken as target process, through
the conditional MI:

P_Y = I(Y_n; X_n^-, Y_n^-) = H(Y_n) − H(Y_n | X_n^-, Y_n^-),  (14)

which can be further decomposed into two terms related to the bivariate SE and TE of (8) and (10), or alternatively into two terms related to the bivariate CE (9) and the multivariate SE (11):

P_Y = H(Y_n) − H(Y_n | Y_n^-) + H(Y_n | Y_n^-) − H(Y_n | X_n^-, Y_n^-) = S_Y + T_{X→Y},  (15a)

P_Y = H(Y_n) − H(Y_n | X_n^-) + H(Y_n | X_n^-) − H(Y_n | X_n^-, Y_n^-) = C_{X→Y} + S_{Y|X}.  (15b)

Consistently, for a multivariate process {X,Y,Z} the system predictive information about Y is:

P_Y = I(Y_n; X_n^-, Y_n^-, Z_n^-) = H(Y_n) − H(Y_n | X_n^-, Y_n^-, Z_n^-),  (16)

which can be expanded, according to the chain rule for the decomposition of the conditional MI, in six different ways:

P_Y = I(Y_n; Y_n^-) + I(Y_n; X_n^- | Y_n^-) + I(Y_n; Z_n^- | X_n^-, Y_n^-) = S_Y + T_{X→Y} + T_{Z→Y|X},  (17a)

P_Y = I(Y_n; Y_n^-) + I(Y_n; Z_n^- | Y_n^-) + I(Y_n; X_n^- | Y_n^-, Z_n^-) = S_Y + T_{Z→Y} + T_{X→Y|Z},  (17b)

P_Y = I(Y_n; X_n^-) + I(Y_n; Y_n^- | X_n^-) + I(Y_n; Z_n^- | X_n^-, Y_n^-) = C_{X→Y} + S_{Y|X} + T_{Z→Y|X},  (17c)

P_Y = I(Y_n; Z_n^-) + I(Y_n; Y_n^- | Z_n^-) + I(Y_n; X_n^- | Y_n^-, Z_n^-) = C_{Z→Y} + S_{Y|Z} + T_{X→Y|Z},  (17d)

P_Y = I(Y_n; X_n^-) + I(Y_n; Z_n^- | X_n^-) + I(Y_n; Y_n^- | X_n^-, Z_n^-) = C_{X→Y} + C_{Z→Y|X} + S_{Y|X,Z},  (17e)

P_Y = I(Y_n; Z_n^-) + I(Y_n; X_n^- | Z_n^-) + I(Y_n; Y_n^- | X_n^-, Z_n^-) = C_{Z→Y} + C_{X→Y|Z} + S_{Y|X,Z}.  (17f)

The decompositions in (15) and (17) are useful to explain how the uncertainty
about the states visited by the target system is reduced as a result of the state tran-
sitions relevant to the overall bivariate or multivariate system. In particular, they
show that SE, CE and TE are the elements through which this uncertainty reduc-
tion is achieved. It is worth noting that the different decompositions in (15) or in (17) are equally valid, as they simply reflect different orders in which conditioning on the past of the constituent processes is performed [12]. Therefore, as
none of the decompositions can be considered as preeminent, SE, CE and TE can
be seen as equally important terms of the description of a target system in terms
of predictive information. Which of these decompositions should be chosen to dis-
sect the system predictive information about the target system may depend only on
side information, e.g., based on physiological knowledge. For instance, when the
target process Y is known to be a passive process a decomposition evidencing the
CE might be preferred to another evidencing the TE, to limit the underestimation
of information transfer which may result by conditioning to Y − n ; on the contrary,
when Y exhibits self-sustained oscillatory activity the SE should be highlighted to
rule out the possibility that such an activity is misinterpreted as information transfer.
Moreover, formulations evidencing CX→Y |Z and TX→Y |Z should be preferred when
some of the processes composing Z may potentially affect both X and Y, while the
computation of CX→Y and TX→Y may suffice when X and Z can be considered in-
dependent. In any case, one particular decomposition can be supported a posteriori,
i.e., by showing how useful it is for understanding how the overall bivariate or multivariate system behaves when examined under different conditions.

3 Strategies for the Estimation of Information Dynamics Measures
An interesting property of all the measures of information dynamics presented in
the previous section is that they can be expressed as the difference between two
conditional entropies (see Eqs. (8-10) for the bivariate formulation and Eqs. (11-13)
for the multivariate formulation). Therefore, a natural way to estimate SE, CE and
TE from time series data is to exploit a common strategy based on CondEn estima-
tion. In this Section we present an efficient framework for the practical estimation
of CondEn, and show how this framework may be used to estimate SE, CE and TE
from short-length realizations of the observed multivariate processes.
The CondEn terms involved in Eqs. (8-10) and (11-13) have to be computed by
conditioning on the past history of one or more observed systems. In practical analy-
sis, this is achieved through the so-called state-space reconstruction of the observed
dynamical systems [31]. State space reconstruction refers to identifying the finite
dimensional state variables that best approximate the past states of the observed processes X_n^-, Y_n^-, and Z_n^-. The most commonly followed approach is to perform uniform time delay embedding, whereby each scalar process is mapped into trajectories described by delayed coordinates uniformly spaced in time [32]. In this way the state variable of the target process Y, Y_n^-, is approximated with the delay vector [Y_{n−u} Y_{n−u−τ} ··· Y_{n−u−(d−1)τ}], with d, τ and u representing respectively the so-
called embedding dimension, embedding time and prediction time. This procedure
suffers from many disadvantages. First, univariate embedding whereby coordinate
selection is performed separately for each process does not guarantee optimality of
the reconstruction for the multivariate state space [33]. Moreover, selection of the
embedding parameters d, τ and u is not straightforward, as many competing criteria
exist, all of which are heuristic and somewhat mutually exclusive [34]. Most importantly, uniform embedding exposes the state space reconstruction procedure to the so-called “curse of dimensionality”, a problem related to the sparsity of the available data within state spaces of increasing volume [35]; this problem is exacerbated in the presence of multivariate time series and when the series are of limited length, as commonly happens in physiological system analysis due to lack of data or stationarity requirements. In these conditions the estimation of CondEn suffers from serious limitations, as it is found that – whatever the underlying dynamics – short time series generate estimates of entropy rates that progressively decrease towards zero as the embedding dimension increases [7], thus rendering the computed measures useless. This issue forced many authors to fix the embedding dimension
at very small arbitrary values to obtain reliable CondEn estimates (see, e.g., [36]).
To show how these problems can be counteracted in the practical estimation of mea-
sures of information dynamics, we describe in the following an estimation strategy
based on the utilization of a corrected CondEn estimator [7], and an improvement
of this strategy based on a non-uniform embedding technique [18].

3.1 Corrected Conditional Entropy


Let us consider a single process, say Y, and focus on the estimation of the SE defined in (8), S_Y = I(Y_n; Y_n^-) = H(Y_n) − H(Y_n | Y_n^-), from a realization of length N of the process, {y_n, n = 1,...,N}. This SE can be intuitively estimated in terms of Shannon entropies, computing estimates of the entropy H(Y_n) and of the CondEn H(Y_n | Y_n^-), which in turn can be seen as the difference between two entropies, H(Y_n | Y_n^-) = H(Y_n, Y_n^-) − H(Y_n^-). Moreover, according to state space reconstruction based on uniform embedding, the past states Y_n^- can be adequately represented by the d-dimensional embedding vector Y_n^{(d,u,τ)} = [Y_{n−u} Y_{n−u−τ} ··· Y_{n−u−(d−1)τ}]. Therefore, the problem of SE estimation amounts to estimating the entropy of a scalar variable, H(Y_n) = −∑ p(y_n) log p(y_n), that of a d-dimensional variable, H(Y_n^-) = H(Y_n^{(d,u,τ)}) = −∑ p(y_n^{(d,u,τ)}) log p(y_n^{(d,u,τ)}), and that of a (d+1)-dimensional variable, H(Y_n, Y_n^-) = H(Y_n, Y_n^{(d,u,τ)}) = −∑ p(y_n, y_n^{(d,u,τ)}) log p(y_n, y_n^{(d,u,τ)}).
There are a number of entropy estimators based on histograms, kernels, nearest
neighbors and splines [37], with advantages and disadvantages for each estimator.
Here we describe estimation based on fixed state space partitioning, because it is
widely used in the literature (e.g., see [38]), and because it favors the development
of the correction for CondEn estimates that is presented below (Eq. (19)). This ap-
proach is based on performing uniform quantization of the time series and then
estimating the entropy approximating probabilities with the frequency of visitation
of the quantized states. Specifically, the series y is coarse grained spreading its dy-
namics over ξ quantization levels of amplitude r=(ymax – ymin )/ξ , where ymax and
ymin represent minimum and maximum values of the normalized series. Quantiza-
tion assigns to each sample the number of the level to which it belongs, so that the
quantized time series yξ takes values within the alphabet A=(0,1,...,ξ –1). Uniform
quantization of embedding vectors of dimension d builds an uniform partition of
the d-dimensional state space into ξ d disjoint hypercubes of size r, such that all
vectors V falling within the same hypercube are associated with the same quantized
vector Vξ , and are thus indistinguishable within the tolerance r. The entropy is then
estimated as:
H(V^ξ) = − ∑_{V^ξ ∈ A^d} p(V^ξ) log p(V^ξ),  (18)

where the summation is extended over all states (i.e., hypercubes) in the embedding
space, and the probabilities p(Vξ ) are estimated for each hypercube simply as the
fraction of quantized vectors Vξ falling into the hypercube (i.e., the frequency of
occurrence of V^ξ within A^d). An illustrative example is reported in Fig. 1, showing the estimation of H(y_n), H(y_{n−1}, y_{n−3}) and H(y_n, y_{n−1}, y_{n−3}), representing respectively the entropies H(Y_n), H(Y_n^-) and H(Y_n, Y_n^-) computed with an embedding vector y_n^{(d,u,τ)} = [y_{n−1}, y_{n−3}].
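A minimal Python sketch of this fixed-partitioning estimator is given below (helper names are our own); the series is coarse-grained over ξ levels and the entropy of any embedding vector is computed from the visitation frequencies of the resulting hypercubes.

import numpy as np
from collections import Counter

def quantize(y, xi=6):
    # uniform quantization over xi levels of amplitude r = (ymax - ymin)/xi
    r = (y.max() - y.min()) / xi
    q = np.floor((y - y.min()) / r).astype(int)
    return np.clip(q, 0, xi - 1)          # the sample at ymax falls in the top level

def entropy_quantized(*columns):
    # Eq. 18: plug-in entropy of the quantized embedding vectors
    states = list(zip(*columns))
    p = np.array(list(Counter(states).values())) / len(states)
    return -np.sum(p * np.log(p))

# example corresponding to Fig. 1 (u = 1, tau = 2, d = 2):
# yq = quantize(y); H_joint = entropy_quantized(yq[3:], yq[2:-1], yq[:-3])   # H(y_n, y_{n-1}, y_{n-3})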
A major problem in estimating the CondEn from time series of limited length is that it always decreases towards zero as the embedding dimension d increases. This results from the fact that, as d increases, the embedding vectors become more and more isolated in the state space of increasing dimension, and this isolation results in an increasing number of vectors V^ξ found alone inside a hypercube of the quantized space. This effect is seen already at low dimensions in Fig. 1c, noting that using y_{n−1} as embedding vector would have resulted in only one single point, while the use of [y_{n−1}, y_{n−3}] as in the figure results in four single points. The problem with single points is that, when a vector y_n^{(d,u,τ)} is alone inside a hypercube of the d-dimensional space, the vector [y_n, y_n^{(d,u,τ)}] is also alone in the (d+1)-dimensional space. Therefore, single points in the d-dimensional space give to H(Y_n^-) the same contribution given to H(Y_n, Y_n^-) by the corresponding points in the (d+1)-dimensional space, bringing a null contribution to H(Y_n | Y_n^-). Thus, the increase of the number of single points with d leads to a progressive decrease of the estimated CondEn.

Fig. 1 Example of state space partitioning of a time series for the computation of entropy and conditional entropy. (a) The values y_n of the series descriptive of the process Y, ranging from y_min to y_max, are uniformly quantized using ξ = 6 quantization levels; (b) the values of y_n are binned according to quantization, and the entropy H(Y_n) is estimated as H(Y_n) = −∑ p(y_n) log p(y_n), where the probabilities p(y_n) are estimated as the relative frequency of visitation of each bin; (c) assuming a prediction time u = 1, an embedding time τ = 2 and an embedding dimension d = 2, all embedding vectors of the form V = [y_{n−1}, y_{n−3}] built from the time series are represented in a bidimensional state space, and are assigned to square bins resulting from the uniform quantization of the two coordinates (gray grid); then the entropy H(Y_n^-) is estimated as H(V) = −∑ p(V) log p(V); (d) the analysis is repeated for all values assumed by the vector [y_n, V] to estimate the entropy H(Y_n, Y_n^-) as H(y_n, V) = −∑ p(y_n, V) log p(y_n, V), where cubic bins now result from the uniform quantization of three coordinates. Then, the CondEn is estimated as H(Y_n | Y_n^-) = H(y_n, V) − H(V), and the CorrCondEn as H^c(Y_n | Y_n^-) = H(Y_n | Y_n^-) + n(V)·H(Y_n), where n(V) is the fraction of vectors V found alone inside a hypercube in panel (c) (gray squares). Note that single points in (c) always remain single also in the higher-dimensional space in (d), while other single points may appear (black squares).

This occurs even for completely unpredictable processes for
which the conditional entropy should stay at high values regardless of the embed-
ding dimension (an example is in Fig. 2a,c). To counteract this bias, a corrected
conditional entropy (CorrCondEn) can be defined as [7, 18]:
H^c(Y_n | Y_n^-) = H^c(Y_n | Y_n^{(d,u,τ)}) = H(Y_n | Y_n^{(d,u,τ)}) + n(Y_n^{(d,u,τ)}) · H(Y_n),  (19)

(d,u,τ )
where in the context of uniform quantization n(Y n ) is the fraction of single
(d,u,τ )
points in the quantized space, i.e. the fraction of vectors Y n , represented in
(d,u,τ )
their quantized form, found only once within Ad (0 ≤ n(Yn ) ≤ 1). The scale
factor H(Yn ) is chosen because it represents the CondEn of a white noise with the
same probability distribution of the considered process; with this choice, the null
contribution of single points is substituted with the maximal information carried
by a white noise, so that the CondEn of the relevant white noise is estimated after
finding 100% of single points.
The CorrCondEn is the sum of two terms, the first decreasing and the second increasing with the dimension of the explored state space. Hence, Hc(Yn|Y−n) = Hc(Yn|Yn(d,u,τ)) exhibits a minimum over d, and this minimum value may be taken as an estimate of the CondEn. Following this idea, CondEn analysis may be performed over short time series without constraining the embedding dimension to low predetermined values. An example is shown in Fig. 2, illustrating the computation of the CorrCondEn Hc(Yn|Yn(d,u,τ)) as a function of d, with parameters u=τ=1, ξ=6, for a second-order autoregressive process y defined by two complex conjugated poles with modulus ρ and phase ϕ = π/4: Yn = 2ρ cos(ϕ) Yn−1 − ρ² Yn−2 + wn (w is a white noise innovation process). The regularity of the process is determined by the parameter ρ: with ρ=0 the process reduces to a fully unpredictable white noise (Fig. 2a), while with ρ=0.98 a partially predictable stochastic oscillation is set (Fig. 2b). Accordingly, the entropy of Yn conditioned to Yn(d,u,τ) = [Yn−1, ..., Yn−d] is expected to be high and constant at varying d when ρ=0, and to show a minimum reflecting the predictability of the process when ρ=0.98. These two situations are well reproduced by the CorrCondEn. For the white noise process, the slow decrease of H(Yn|Yn(d,u,τ)) with increasing d (dashed line) is fully compensated by the corrective term n(Yn(d,u,τ))·H(Yn) (dotted line), resulting in a roughly flat profile of Hc(Yn|Yn(d,u,τ)) (solid line) that determines a minimum estimate close to the expected CondEn (Fig. 2c). For the partially predictable process, the decrease of H(Yn|Yn(d,u,τ)) is substantial already at low values of d, due to the usefulness of past samples in describing the present of Y, while the corrective term intervenes at higher values of d, thus producing a well defined minimum at d=5 (Fig. 2d).
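For instance, the profiles of Fig. 2 can be reproduced qualitatively with the `corr_cond_en` sketch above by scanning the embedding dimension and locating the minimum of the corrected estimate (parameters as in the text: u=τ=1, ξ=6, N=300; the random seed and this particular realization are of course arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
rho, phi, N = 0.98, np.pi / 4, 300
y = np.zeros(N)
for n in range(2, N):   # y(n) = 2*rho*cos(phi)*y(n-1) - rho^2*y(n-2) + w(n)
    y[n] = 2 * rho * np.cos(phi) * y[n - 1] - rho**2 * y[n - 2] + rng.standard_normal()

profile = [corr_cond_en(y, d) for d in range(1, 16)]   # CorrCondEn for d = 1..15
d_min = 1 + int(np.argmin(profile))                    # dimension at the minimum
print(d_min, profile[d_min - 1])
```

Setting ρ=0 in the same script should instead produce a roughly flat profile, as in Fig. 2c.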
In accordance with the above described procedure, an estimate of the SE in (8) results simply by subtracting the estimated CorrCondEn from the Shannon entropy of the series. The same procedure may be easily followed to estimate the CE in (9), simply conditioning on X−n instead of on Y−n (i.e., using Xn(d,u,τ) in place of Yn(d,u,τ)) in the computation of the CorrCondEn [9]. However, a possible limitation of this procedure is the fact that the terms used for conditioning are included progressively into the embedding vector without checking their effective relevance for describing the dynamics of the target process. While the progressive inclusion based on the time lag of the past terms (i.e., the terms Yn−1, Yn−2, ... are sequentially added to the embedding vector when conditioning on Y−n) is intuitive and works well under most circumstances, it is exposed to an inclusion of irrelevant terms that is likely to impair the detection of dependencies. This problem is exacerbated in the presence of short time series, for which the corrective term prevents the exploration of high-dimensional state spaces.
Fig. 2 Example of computation of the corrected conditional entropy for short time series (N=300 points) with different levels of predictability [7]. (a) realization of a fully unpredictable white noise; (b) realization of a partially predictable autoregressive process; (c,d) corresponding estimated profiles of the CondEn (dashed line), the corrective term (dotted line) and the CorrCondEn (solid line) obtained by varying the dimension d of the uniform embedding from 1 to 15.
Moreover, the issue may become critical when one aims
at estimating information measures that account for conditioning schemes involv-
ing several different variables, such as the bivariate TE in (10) and the multivariate
extensions of SE, CE and TE defined in (11-13). In these situations, a reliable esti-
mation of the CondEn in the presence of short realizations of multiple conditioning
processes may be performed only through an intelligent embedding strategy that al-
lows to include into the embedding vector only the terms which are relevant to the
dynamics of the target process. This is achieved by the procedure for nonuniform
embedding presented in the next subsection.

3.2 Corrected Conditional Entropy from Non-uniform Embedding

The strategy proposed in [18] for estimating the CorrCondEn through non-uniform
embedding is based on a sequential procedure which updates the embedding vector
progressively, taking all relevant processes into consideration at each step and se-
lecting the components that better describe the destination process. Specifically, a
set of candidate terms is first defined including the past states (and, if relevant, also
the present state) of all systems relevant to the estimation of the considered CondEn.
For instance, if we consider the two CondEn terms involved in the computation of
the multivariate TE in (13), the candidate set for estimating Hc (Yn |Y − −
n ,Zn ) will be
the set Ω 1 ={Yn−1 ,...,Yn−L ,Zn−1,...,Zn−L }, and the candidate set for the estimation
of H(Yn|X−n, Y−n, Z−n) will be the set Ω2 = {Ω1, Xn−1,...,Xn−L}, where L is the number
of time lagged terms to be tested for each process. Given the generic candidate set
Ω , the procedure for estimating the CorrCondEn Hc (Yn |Ω ) starts with an empty
embedding vector V0 =[·], and proceeds as follows:
• at each step k ≥ 1, form the candidate vector [s, Vk−1], where s is an element of Ω not already included in Vk−1, and compute the CorrCondEn of the target process y given the considered candidate vector, Hc(Yn|[s, Vk−1]);
• repeat the previous step for all possible candidates, and then retain the candidate for which the CorrCondEn is minimum, i.e., set Vk = [s′, Vk−1] where s′ = arg min_s Hc(Yn|[s, Vk−1]);
• terminate the procedure when a minimum in the CorrCondEn is found, i.e., at the step k′ such that Hc(Yn|Vk′) ≥ Hc(Yn|Vk′−1), and set Vd = Vk′−1 as the embedding vector.
This procedure is devised to try to include into the embedding vector only the com-
ponents that effectively contribute to resolving the uncertainty of the target process
(in terms of CondEn reduction), while leaving out the irrelevant components. This
feature, together with the termination criterion which prevents the selection of new
terms when they do not bring further resolution of uncertainty for the destination
process, helps escape the curse of dimensionality in the multivariate estimation of the CondEn. Moreover, the procedure avoids the nontrivial task of setting the embedding parameters d, τ and u (the only parameter here is the number L of candidates to be tested for each process, which can be as high as the affordable computational time allows).
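A compact sketch of this sequential selection is given below; it reuses the `quantize` and `entropy` helpers of the sketch in Sect. 3.1 and is again only an illustrative implementation under our own naming conventions (candidates are represented as (process name, lag) pairs):

```python
import numpy as np

def ccen(y_sym, V):
    """CorrCondEn of the quantized target y_sym given the conditioning matrix V."""
    Hy = entropy(y_sym.reshape(-1, 1))
    if V.shape[1] == 0:                       # empty embedding vector: Hc = H(Yn)
        return Hy
    _, counts = np.unique(V, axis=0, return_counts=True)
    n_single = np.sum(counts == 1) / V.shape[0]
    return entropy(np.column_stack([y_sym, V])) - entropy(V) + n_single * Hy

def nonuniform_embedding(series, target, L=5, xi=6):
    """Sequential candidate selection: series is a dict name -> 1-D array;
    the candidate set contains lags 1..L of every process in the dict."""
    sym = {name: quantize(np.asarray(s, dtype=float), xi) for name, s in series.items()}
    idx = np.arange(L, len(series[target]))
    y = sym[target][idx]
    candidates = [(name, lag) for name in series for lag in range(1, L + 1)]
    V = np.empty((len(idx), 0), dtype=int)
    selected, best = [], ccen(y, V)
    while candidates:
        trials = [ccen(y, np.column_stack([V, sym[name][idx - lag]]))
                  for name, lag in candidates]
        k = int(np.argmin(trials))
        if trials[k] >= best:                 # no further reduction of uncertainty
            break
        best = trials[k]
        name, lag = candidates.pop(k)
        selected.append((name, lag))
        V = np.column_stack([V, sym[name][idx - lag]])
    return selected, best
```

Calling this function for a given target with and without the source process in the dict yields the two CondEn minima whose difference estimates the TE of (13), in the spirit of the example discussed next.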
To illustrate the procedure we report an example of computation of the multivariate TE on a short realization (N=300 points) of the M=3 processes described as [19]:

Xn = 1.4 − Xn−1² + 0.3 Xn−2
Yn = 1.4 − 0.5 (Xn−1 + Yn−1) Yn−1 + 0.1 Yn−2                  (20)
Zn = |Xn−3| + Yn−1

In this simulation, X is an autonomous Henon process, Y is another Henon process unidirectionally driven by X, and Z is a passive process driven both by X and by Y. Fig. 3 depicts the TE analysis, performed according to (13), between each possible pair of systems (analysis parameters: L=5, ξ=6). In each panel, the non-uniform embedding of the target system is described by reporting, at each step k of the sequential procedure, the selected candidate term and the corresponding estimated minimum of the CorrCondEn; the procedure is repeated two times, either excluding (black) or including (red) terms from the source process in the set of initial candidate terms. A difference between the two repetitions occurs when terms from the source process are selected in the second repetition, leading to a lower estimated CorrCondEn and thus to nonzero information transfer. On the contrary, if no source terms are selected even when considered as possible candidates, the two CondEn profiles overlap and the estimated information transfer is zero.
For instance, if we consider the analysis from X to Y (Fig. 3a), we see that the embedding of Yn based on the set of candidates Ω1 = {Y−n, Z−n} ≈ {Yn−1,...,Yn−L, Zn−1,...,Zn−L} terminates at the step d=4, returning the embedding vector V4 = [yn−1, yn−3, zn−3, zn−1] and the corresponding CorrCondEn Hc(Yn|Y−n, Z−n) = 0.396; at the second repetition the set of candidates is Ω2 = {X−n, Y−n, Z−n} ≈ {Ω1, Xn−1,...,Xn−L}, and we see that the procedure selects a term from the input system, xn−1, at the second step, leading to a decreased CorrCondEn minimum, Hc(Yn|X−n, Y−n, Z−n) = 0.263, and ultimately to a positive information transfer measured by the TE TX→Y|Z. Note that for this realization the obtained embedding vector is exactly the one expected from the generating equation of y, i.e., V3 = [yn−1, xn−1, yn−2] (see (20)). On the contrary, if we consider the opposite direction of interaction from Y to X (Fig. 3d), we see that the two repetitions of the embedding procedure yield the same embedding vector, V4 = [xn−1, xn−2, xn−5, xn−3]. In this case two terms in excess are selected besides the two terms entering the equation for X in (20), but – though confounding the interpretation of the internal dynamics of X – this does not lead to the detection of spurious information transfer, as the embedding vector is unchanged from the first to the second repetition, so that Hc(Xn|X−n, Z−n) = Hc(Xn|X−n, Y−n, Z−n) = 0.308 and TY→X|Z = 0. Moreover, the procedure detects the non-negligible information transfer imposed from X to Z and from Y to Z, documented by the proper selection of source terms at the second repetition (respectively, xn−3 in Fig. 3b and yn−1 in Fig. 3c) that leads to decreased CorrCondEn with respect to the first repetition and thus to large values of the estimated TEs TX→Z|Y and TY→Z|X, and also the absence of information transfer from Z to X and from Z to Y, documented by the unchanged embedding vectors for the two repetitions of the procedure (Figs. 3e and 3f), leading to unchanged CorrCondEn and to TZ→X|Y = TZ→Y|X = 0. A thorough validation involving several realizations with changing parameters and noisy conditions is reported in [19], where a simulation with stochastic processes is also reported.
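To experiment with this kind of analysis, a single realization of system (20) can be generated with the following minimal sketch (initial conditions and the length of the discarded transient are our own arbitrary choices); the three series can then be fed, for example, to the `nonuniform_embedding` sketch given earlier:

```python
import numpy as np

def simulate_system_20(N=300, transient=1000):
    """Iterate the deterministic maps of Eq. (20), discarding an initial transient."""
    T = N + transient
    x, y, z = np.zeros(T), np.zeros(T), np.zeros(T)
    x[:3] = y[:3] = 0.1            # arbitrary initial conditions near the attractor
    for n in range(3, T):
        x[n] = 1.4 - x[n - 1]**2 + 0.3 * x[n - 2]
        y[n] = 1.4 - 0.5 * (x[n - 1] + y[n - 1]) * y[n - 1] + 0.1 * y[n - 2]
        z[n] = abs(x[n - 3]) + y[n - 1]
    return x[transient:], y[transient:], z[transient:]

x, y, z = simulate_system_20()
# e.g.: nonuniform_embedding({'X': x, 'Y': y, 'Z': z}, target='Y')
```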

3.3 Parameter Setting and Open Issues


One major advantage of the estimation approach proposed in Sect. 3.2 is that it releases the user from the non-trivial setting of most of the analysis parameters that otherwise would considerably affect the estimation outcome. Indeed, in
the sequential procedure of Sect. 3.1 the dimension of the embedding vectors, d, is
a free parameter that results from the progressive search of a minimum CorrCon-
dEn. Moreover, with the non-uniform embedding strategy described in Sect. 3.2 the
choice of the remaining embedding parameters, i.e., the prediction time u and delay
τ , and maximum number L of lagged terms to consider as candidates, is not criti-
cal because the strategy is devised to select only relevant components and exclude
irrelevant ones. Nevertheless, it is good practice to set u and τ so as to allow a proper unfolding of the system dynamics described by the state variables resulting from the embedding; while τ is normally set so as to reduce the number of correlated points within a single process (e.g., taking the first zero of the autocorrelation function or the first minimum of the auto-information function [34]), u can be set on the basis of prior knowledge about the propagation times in the overall dynamical system (see, e.g., [20] for cardiovascular variability or [39] for magnetoencephalography).
Fig. 3 Example of application of nonuniform embedding to the estimation of the TE from a single realization of the processes X, Y and Z generated by Eq. (20). Plots depict the CorrCondEn estimated for a given target system either after excluding (black circles) or including (red triangles) past terms from the source system in the set of initial candidates used for nonuniform embedding. Filled symbols indicate the minimum CorrCondEn Hc, taking the value reported in the panel, while the corresponding estimated TE is reported above the panel. The candidate terms selected at each step k of the procedure are also reported in each panel.

The parameter related to the binning procedure for entropy estimation is the number of quantization levels ξ used to discretize the dynamics of the observed time series. Theoretically, increasing ξ would lead to a finer partition of the state space and better estimates of the conditional probabilities. However, this observation holds for time series of infinite length, while in practical applications with series of length N the number of quantization levels should remain low enough that ξ^d ≈ N [7, 18]. In the studies of short-term cardiovascular and cardiorespiratory variability reviewed in Sect. 4, where N≈300 and CorrCondEn estimates were usually obtained from three lagged terms (or at most four in a few cases), the common choice is to use ξ=6 levels (indeed, 6³ = 216 ≈ 300).
A number of levels such that ξ^d ≈ N may seem too high according to some other prescriptions (e.g., Lungarella et al. [40] recommend working with a number of hypercubes at least three times lower than the series length). However, the suitability of our choice may be explained by the fact that the search for relevant components achieved by non-uniform embedding allows it to target only a restricted “typical set”
of hypercubes with higher probability than the other regions of the state space (see
[21], chapter 3), thus allowing some extent of over-quantization with respect to tra-
ditional embedding.
As seen in section 3.2, the non-uniform embedding approach for computing Cor-
rCondEn allows reliable estimation of information dynamics measures from short
realizations of multivariate processes. Nevertheless, it suffers from some limitations
that leave room for improving the estimation of information dynamics measures. A
main problem of the approach is the selection of some terms in excess during the
sequential embedding. This is seen in the reported simulation example where xn−5
and xn−3 are selected in the embedding of X (Fig. 3d,e) and xn−4 is selected in the
embedding of Z (Fig. 3b,c); while this mis-selection is not problematic in terms of
TE computation, it may hamper the estimation of the other terms of an information
decomposition, or other tasks like delay estimation. A first explanation for the de-
tection of excess terms may be the fact that the contribution of the corrective term
is not strong enough to produce the CorrCondEn minimum before the inclusion of
irrelevant terms. From this point of view, we tested alternative corrections: e.g., a
(d,u,τ )
more strict selection is proposed in [7, 9] using the corrective term n(Yn ,Y n ) in
(d,u,τ )
place of the term n(Y n ) used here in (19) and in [18, 20]. Nevertheless, a bal-
ance need always to be found because a more strict selection decreases the rate of
false detections but at the same time increases the number of missed detections.
More generally, factors that may affect the accuracy of component selection are:
(i) the estimator of CondEn; (ii) the empirical nature of the correction; and (iii) the
sub-optimal nature of the exploration of candidates, which, being sequential and not
exhaustive, somehow disregards joint effects that more candidates may have on the
reduction of the CondEn. The binning entropy estimator used here in Eq. (17) may
be inaccurate due to its known bias [37] and to the fact that the associated quantiza-
tion may leave a certain amount of information unexplained even after selection of
the correct causal sources, and thus leave room for excess source selection. While
in principle any alternative entropy estimator might be used, we remark that in the context of non-uniform embedding the introduction of a corrective term serves, besides compensating the bias, to guarantee the existence of a CondEn minimum, which we use to terminate the sequential procedure in an attempt to avoid the inclusion of irrelevant terms. Therefore, the integration within the proposed procedure of any improved entropy measure has to cope with the need to find a clear minimum of the estimated CondEn as the embedding dimension increases.
From this point of view, the utilization of accurate Shannon entropy estimators such
as those based on kernels or nearest neighbors [37] would face the necessity of
counteracting the isolation of the embedding vectors in state spaces of increasing
dimension through a corrective term. An interesting alternative solution might be
that recently proposed in [41], where a k-nearest neighbor approach was pursued to
estimate the CondEn directly in one step (rather than in two steps as the difference
between entropy estimates) yielding an estimate which exhibits a minimum over
the embedding dimension without requiring the addition of a corrective term. An-
other way to avoid the use of a corrective term would be to assess, at each step of
the selection procedure, the statistical significance of the contribution brought by the
selected candidate to the description of the target process, so that only the candidates
bringing significant contribution can be selected and the procedure would terminate
when the contribution of the selected candidate is not significant. We are currently
exploring this alternative criterion, both using the binning entropy estimator [42]
and using nearest neighbor estimators [26]. As to point (iii), the problem is that a sequential exploration of the candidate space does not guarantee convergence to the absolute minimum of the CondEn, and thus it does not assure a non-negative value for the measures defined as the difference between two CondEn terms, like those defined in Eqs. (10-13). Nevertheless, a sequential approach needs to be adopted because an exhaustive exploration of all possible candidate combinations, which would lead to the absolute CondEn minimum, would become computationally intractable already at low embedding dimensions; e.g., in a common practical situation with M=4 conditioning processes and L=5 candidates explored per process (i.e., 20 candidate terms overall), the number of combinations to be tested would be 4845 for k=4 and 15504 for k=5. The possibility of finding negative values for SE, CE and TE computed with this approach suggests the need of assessing the statistical significance of each estimated measure,
e.g. through the utilization of surrogate data [18, 20]. Of note, the introduction of a
significance criterion for candidate selection as mentioned above would implicitly
provide a tool to assess the statistical significance of information measures without
resorting to surrogate approaches [42, 43].

4 Applications to Physiological Systems


This section reviews the main studies from our group in which the approach for
estimating the CondEn presented here was exploited to characterize, in terms of SE,
CE and TE, the information dynamics of physiological systems.

4.1 Applications of Self Entropy Analysis


The SE has been extensively used, mostly in its formulation evidencing solely the CondEn (i.e., using H(Yn|Y−n) instead of SY in (8)), to characterize the short-term complexity (or its opposite, regularity) of individual physiological systems.
It was first applied to beat-to-beat sequences of the sympathetic discharge ob-
tained from decerebrate artificially ventilated cats [7]. In the experimental protocol
considered in the study, in which non-linear interactions between the periodic forc-
ing input (i.e., ventilation) and the spontaneous sympathetic discharge are found,
more regular dynamics were detected in the presence of phase-locked patterns, while less regular dynamics were observed after disruption of the nonlinear relation between mechanical ventilation and sympathetic outflow via spinalization.
The same analysis was applied to evaluate the regularity of cardiovascular vari-
ability signals (heart period, arterial pressure and muscle sympathetic nerve activity)
during sympathetic activation induced by head-up tilt, during the perturbing action
produced by paced ventilation, and after peripheral muscarinic blockade provoked
by atropine administration [13]. The results suggested that the regularity of heart
period variability increased with tilt and paced ventilation at low breathing rates,
likely due to the entrainment of multiple physiological mechanisms into specific fre-
quency bands. In the case of administration of a high dose of atropine the reduction
of complexity is not due to the entrainment of different physiological mechanisms
but, more likely, to the reduction of the complexity of the neural inputs to the sinus
node due to the cholinergic blockade. Moreover, systolic arterial pressure and mus-
cle sympathetic variability series were respectively more regular and more complex
than heart period variability, and their regularity was not markedly affected by the
specified experimental conditions.
The results about SE analysis of the heart period variability during tilt test and
paced breathing protocols were strengthened and further interpreted in a following
study [44]. Moreover, a subsequent study demonstrated the ability of SE measures
to evidence a progressive decrease of the complexity of heart period variability as
a function of the tilt table inclination during graded head-up tilt [16]. This finding
was of great relevance as it established a straightforward link between physiolog-
ical mechanisms and the behavior of an information dynamical quantity like the
SE. Indeed, since graded head-up tilt produces a gradual shift of the sympathovagal
balance toward sympathetic activation and parasympathetic deactivation, the cor-
responding gradual decrease of CorrCondEn observed in the study indicated that
complexity of heart period variability is under the control of the autonomic nervous
system. Another interesting result of the study was that standard measures related
to SE like the approximate entropy [5] were unable to reveal the same gradual de-
crease in complexity during the protocol unless they were corrected according to a
strategy similar to that presented in Sect. 3.1. This pointed out the necessity of ex-
ploiting the CorrCondEn or similar measures devised according to the same strategy
to extract fruitful information from the short data sequences commonly available in
experimental settings.
Another interesting applicative context of SE was the characterization of the neu-
ral control on heart rate variability during sleep [15, 45], a condition which is known
to be associated with important changes of the autonomic cardiovascular regulation.
In [15], the complexity of heart period variability of healthy subjects was found
to follow a circadian pattern characterized by larger CorrCondEn during night-
time than during daytime; this day-night variation was lost in heart failure patients
due to a tendency of complexity to increase during daily activities and decrease at
night, corroborating the association between SE and sympathetic modulation. Inter-
estingly, significant circadian variations were observed only when normalizing the CorrCondEn to the entropy of the heart period series; this suggested the opportunity of reducing the dependence of the estimated SE on the shape of the static distribution of the observed process through normalization, so as to magnify the dynamical
complexity in the resulting normalized measure. In [45], the short term complexity
of heart period variability was characterized during different sleep stages in young
and elderly healthy persons, observing a significant reduction of CorrCondEn in
older subjects, especially during REM sleep. These results suggested that with ag-
ing REM sleep is associated with a simplification of the mechanisms of cardiac
control, that could lead to an impaired ability of the cardiovascular system to react
to adverse events.

4.2 Applications of Cross Entropy Analysis


The first utilizations of the CE based on CorrCondEn computation were aimed at
determining the degree of coupling in bivariate processes [9, 13]. More specifically,
given two processes X and Y the two CEs CX→Y and CY →X were computed as de-
scribed in Sect. 3.1, and then a synchronization index was taken as the maximum
between normalized versions of the two CEs obtained at varying embedding di-
mensions: χx,y = maxd(CcX→Y/H(Yn), CcY→X/H(Xn)) (the apex c denotes that CX→Y and CY→X were derived from Hc(Yn|X−n) and Hc(Xn|Y−n) computed as in (18) for
varying d). This synchronization measure was first used to measure the coupling
strength between the beat-to-beat variability of the sympathetic discharge and ven-
tilation in decerebrate artificially ventilated cats [9]. The measure was able to reflect
the coupling between sympathetic discharge and ventilation, being very large in the
presence of periodic dynamics in which the sympathetic discharge is locked to the
respiratory forcing input and close to zero for quasiperiodic or aperiodic dynam-
ics resulting, for instance, after spinalization. The synchronization index was also
utilized in humans to evaluate the coupling degree of bivariate systems compris-
ing cardiac, vascular, pulmonary and muscular systems in response to experimental
maneuvers or in pathologic conditions, leading to important results which were re-
lated to physiological mechanisms in health and disease. Specifically, Porta et al. [9]
observed that the synchronization between the beat-to-beat variability of the heart
period and the ventricular repolarization interval was not changed by experimental
conditions that alter the sympathovagal balance but strongly decreased after my-
ocardial infarction. Nollo et al. [14] found also that after infarction the synchroniza-
tion index is associated with an impaired cardiovascular response to head-up tilt,
observing that the index computed between heart period and arterial pressure variability decreased significantly in post-infarction patients, while it increased in healthy
subjects. Moreover, relevant results from [13] were that: the cardiovascular cou-
pling was significant but weak at rest, and increased with head-up tilt and paced
breathing; the cardiopulmonary and vasculo-pulmonary couplings were significant
and increased with paced breathing at 10 breaths/min; muscle nerve activity and respiration were uncoupled in the control condition but became coupled after atropine administration.
The CE based on CorrCondEn was also successfully exploited as an asymmetric
measure of coupling quantifying the directed information in bivariate physiolog-
ical systems, with special emphasis on the study of the closed loop interactions
between the spontaneous variability of heart period and arterial pressure in humans.
In this applicative context, the CE has been proven useful in disentangling this in-
tricate closed loop, evidencing information flows directed either through the barore-
flex (i.e., from systolic pressure to heart period) or through circulatory mechanics
(i.e. from heart period to systolic pressure). Nollo et al. [14] pointed out that the
information flow was balanced over the two directions and higher during head-up
tilt than at rest in young healthy subjects, while it was unbalanced (with prevalence
of the information flow from heart period to systolic arterial pressure) and lower
during head-up tilt in post-myocardial infarction patients. Porta et al. [25] demon-
strated the usefulness of CE, compared with the traditional approach based on the
analysis of Fourier phases, in detecting the dominant direction of interaction in the
cardiovascular loop. They showed that: (i) CE is able to detect the lack of informa-
tion transfer through the baroreflex in heart transplant recipients, and the gradual
restoration of this transfer with time after transplantation; (ii) CE quantitatively re-
flects the progressive shift from the prevalence of information transfer through the
circulatory mechanics to the prevalence of information transfer through the barore-
flex with tilt table inclination during graded head-up tilt in healthy subjects. Recent
studies [24, 46] focused on how the information transfer through the baroreflex,
monitored by the CorrCondEn of heart period given the systolic pressure, is modi-
fied at varying the prediction time u. In protocols of head-up tilt and pharmacologi-
cal blockade of receptors, the authors showed that the expected monotonic decrease of the CE (i.e., increase of the CorrCondEn) observed while increasing the prediction
time can be further typified looking at the rate at which this decrease of information
transfer occurs. It was shown that such a rate contains useful information about the
baroreflex control of heart rate in different experimental conditions.

4.3 Applications of Transfer Entropy Analysis


The nonuniform embedding strategy presented in Sect. 3.2, handling the issues of
arbitrariness and redundancy associated with the embedding of multiple time se-
ries, introduced the possibility of efficiently computing – even for small sample size datasets – the CorrCondEn in the presence of several conditioning processes. This favored the move from the estimation of SE and CE, which involve only one conditioning process, to that of TE, which requires dealing with two or more conditioning processes in the computation of the CondEn.
The first application of the strategy was the computation of a normalized ver-
sion of the multivariate TE to elicit direct transfer of information in physiological
systems composed of multiple interacting subsystems, such as the cardiovascular
and cardiorespiratory ones, and in spatially extended physiological systems, such
as the human cortical system where EEG activity is supposed to propagate across
different scalp locations [18]. The study indicated that the purposeful state space
reconstruction achieved by nonuniform embedding allows describing patterns of di-
rectional connectivity consistent with known mechanisms of cardiovascular physiology, such as the rise of causal connectivity along the baroreflex with the transition from the
supine to the upright position, and of neural physiology, such as the presence of
causality from the posterior towards the central and anterior EEG recorded during
eyes closed wakefulness.
The feasibility of estimating multivariate TE on the basis of CorrCondEn
and nonuniform embedding in cardiovascular neuroscience was investigated more
deeply in [17, 19]. The studies were aimed at the data-driven investigation of the
modes of cardiovascular, cardiopulmonary and vasculo-pulmonary interactions both
in resting physiological conditions and during experimental maneuvers like head-
up tilt and paced breathing. TE analysis was able to describe well known mecha-
nisms of cardiovascular and cardiorespiratory regulation, as well as to support the
interpretation of other more debated mechanisms. Examples were the shift from
balanced bidirectional exchange of information between heart period and arterial
pressure in the supine position to the prevalence of information transfer through the
baroreflex in the upright position, and the mechanical effects of respiration on both
heart period and arterial pressure variability with their enhancement during paced
breathing and dampening during head-up tilt. Moreover, the utilization of a fully multivariate approach made it possible to disambiguate the role of respiration in the closed
loop interactions between heart period and arterial pressure variability. In particu-
lar, the estimated information flows suggested that short-term heart rate variability
is mainly explained by central mechanisms of respiratory sinus arrhythmia in the
resting supine position during spontaneous and paced breathing, and by baroreflex-
mediated phenomena in the upright position.
In a recent study we have dealt with a common problem in the practical esti-
mation of the multivariate TE from real physiological data, that is, the presence of
instantaneous effects which likely impair or confound the assessment of the infor-
mation transfer between coupled systems [20]. Instantaneous effects are effects oc-
curring between two time series within the same time lag, and may reflect either fast,
within sample physiologically meaningful interactions or be void of physiological
meaning (e.g., may be due to unobserved confounders). While the traditional for-
mulation of the TE does not account for instantaneous effects, we faced this issue by allowing for the possible presence of instantaneous effects through proper inclusion of
the zero-lag term in the computation of CorrCondEn based on nonuniform embed-
ding. The approach was devised according to two different strategies for the com-
pensation of instantaneous effects, respectively accounting for causally meaningful
and non-meaningful zero-lag effects. The resulting measure, denoted as compen-
sated TE, was validated on simulations and then evaluated on physiological time
series. In cardiovascular and cardiorespiratory variability, where the construction of
the time series suggests the existence of physiological causal effects occurring at lag
zero, the compensated TE evidenced better than the traditional TE the presence of
expected interaction mechanisms (e.g., the baroreflex). In magnetoencephalography
analysis performed at the sensor level, where instantaneous effects are likely the
result of the simultaneous mapping of single sources of brain activity onto several
recording sensors, utilization of the proposed compensation suggested the activa-
tion of multisensory integration mechanisms in response to a specific stimulation
paradigm.
Finally, we have recently started considering an integrated perspective in which
the TE is an element of the information domain characterization of coupled phys-
iological systems. In [47] we studied the TE and the SE as factors in the decom-
position of the predictive information in bivariate physiological systems, according
to the interpretation suggested here in Sect. 3.2 (Eq. (15a)). The study was aimed
at characterizing cardiovascular regulation, from the analysis of heart period and
systolic arterial pressure variability, and cerebrovascular regulation, from the anal-
ysis of mean arterial pressure and mean cerebral blood flow variability, in subjects
developing orthostatic syncope in response to prolonged head-up tilt testing. We ev-
idenced specific patterns of information processing, jointly described by the SE and
TE and by their modifications after tilt and in the proximity of the syncopal event,
that were associated with the impairment of physiological mechanisms of cardio-
vascular and cerebrovascular regulation. These results documented for the first time the added value of studying the different aspects of information processing in an integrated way for enhancing the interpretation of multiple intertwined physiological mechanisms, and suggested the usefulness of further integration (e.g., directed towards
moving from bivariate to fully multivariate decomposition) for providing the most
complete picture of autonomic integration in pathologic conditions.

5 Conclusions and Future Directions


In this chapter we have presented a unified framework for the analysis of infor-
mation dynamics in physiological systems, which integrates the computation of the
known SE, CE and TE measures under the dynamic definition of conditional en-
tropy. The proposed strategy for the estimation of CondEn, based on the utilization
of a sequential procedure that avoids predetermining the dimension of the embed-
ding and of a CorrCondEn estimator that compensates for the CondEn bias, allows
the reliable estimation from short-length datasets of SE and bivariate CE as mea-
sures of system complexity and connectivity. Moreover the introduction of nonuni-
form embedding whereby the conditioning terms are selected on the basis of their
actual relevance opens the way to entropy estimation in the presence of several dif-
ferent conditioning processes, and thus to the assessment of SE, CE and TE in fully
multivariate settings. The feasibility of this approach in practice is documented by
the several applications surveyed in the chapter, showing that SE, CE and TE may
be efficiently estimated from short realizations of biomedical processes, and may
thus be exploited to support the interpretation of the mechanisms underlying the
behavior of coupled physiological systems in different experimental conditions or
pathologic situations.
Starting from the established feasibility of the computation of information dy-
namics measures in physiological systems, the main direction for future studies
should bring the research in this field towards the full integration of different
measures and their combined interpretation in the information domain. Existing
approaches for the information-theoretic analysis of physiological systems mainly
focus on single aspects of the observed dynamics, e.g., how complex is the activity
of one system or how it is coupled with that of another system. These approaches
are highly encouraging since they evidence a strong relation between information
domain measures and physiological function, but are mostly limited to the study
of dynamics within a system (e.g., variability of the heart rate, connectivity within
the brain) or at most between two systems (e.g., cardiovascular or cardiorespiratory
interactions). Recent theoretical developments have shown that different aspects of
information dynamics (e.g., information storage, transfer and modification) are con-
nected with each other and thus should be addressed through a unified approach
rather than in isolation. This view has been reinforced in the present chapter, where
we showed that CE, SE and TE are separate but complementary elements of infor-
mation dynamics, as they constitute factors of the decomposition of the predictive
information for an assigned target system in a network. Therefore, we believe that
looking at the combined activity of different physiological systems from the inte-
grated perspective offered by information dynamics will provide the methodological
strength to assess physiological control mechanisms in health and disease in a more
informed way than possible before.

References
1. Bullmore, E., Sporns, O.: Complex brain networks: graph theoretical analysis of struc-
tural and functional systems. Nat. Rev. Neurosci. 10, 186 (2009)
2. Bashan, A., Bartsch, R.P., Kantelhardt, J.W., Havlin, S., Ivanov, P.C.: Network physiol-
ogy reveals relations between network topology and physiological function. Nat. Com-
municat. 3 (2012)
3. Lizier, J.T.: The local information dynamics of distributed computation in complex sys-
tems. Springer, Heidelberg (2013)
4. Faes, L., Nollo, G.: Multivariate frequency domain analysis of causal interactions in
physiological time series. In: Laskovski, A.N. (ed.) Biomedical Engineering, Trends in
Electronics, Communications and Software. InTech, Rijeka (2011)
5. Pincus, S.M.: Approximate Entropy As A Measure of System-Complexity. Proc. Natl.
Acad. Sci. USA 88, 2297–2301 (1991)
6. Richman, J.S., Moorman, J.R.: Physiological time-series analysis using approximate
entropy and sample entropy. Am. J. Physiol. Heart Circ. Physiol. 278, H2039–H2049
(2000)
7. Porta, A., Baselli, G., Liberati, D., Montano, N., Cogliati, C., Gnecchi-Ruscone, T.,
Malliani, A., Cerutti, S.: Measuring regularity by means of a corrected conditional en-
tropy in sympathetic outflow. Biol. Cybern. 78, 71–78 (1998)
8. Paluš, M., Komárek, V., Hrnčíř, Z., Štěrbová, K.: Synchronization as adjustment of in-
formation rates: detection from bivariate time series. Phys. Rev. E 63, 046211 (2001)
9. Porta, A., Baselli, G., Lombardi, F., Montano, N., Malliani, A., Cerutti, S.: Conditional
entropy approach for the evaluation of the coupling strength. Biol. Cybern. 81, 119–129
(1999)
10. Schreiber, T.: Measuring information transfer. Phys. Rev. Lett. 85, 461–464 (2000)
11. Lizier, J.T., Pritam, S., Prokopenko, M.: Information Dynamics in Small-World Boolean
Networks. Artificial Life 17, 293–314 (2011)
12. Chicharro, D., Ledberg, A.: Framework to study dynamic dependencies in networks of
interacting processes. Phys. Rev. E 86, 041901 (2012)
13. Porta, A., Guzzetti, S., Montano, N., Pagani, M., Somers, V., Malliani, A., Baselli, G.,
Cerutti, S.: Information domain analysis of cardiovascular variability signals: evaluation
of regularity, synchronisation and co-ordination. Med. Biol. Eng. Comput. 38, 180–188
(2000)
14. Nollo, G., Faes, L., Porta, A., Pellegrini, B., Ravelli, F., Del Greco, M., Disertori, M.,
Antolini, R.: Evidence of unbalanced regulatory mechanism of heart rate and systolic
pressure after acute myocardial infarction. Am. J. Physiol. Heart Circ. Physiol. 283,
H1200–H1207 (2002)
15. Porta, A., Faes, L., Mase, M., D’Addio, G., Pinna, G.D., Maestri, R., Montano, N.,
Furlan, R., Guzzetti, S., Nollo, G., Malliani, A.: An integrated approach based on uni-
form quantization for the evaluation of complexity of short-term heart period variability:
Application to 24 h Holter recordings in healthy and heart failure humans. Chaos 17,
015117 (2007)
16. Porta, A., Gnecchi-Ruscone, T., Tobaldini, E., Guzzetti, S., Furlan, R., Montano, N.:
Progressive decrease of heart period variability entropy-based complexity during graded
head-up tilt. J. Appl. Physiol. 103, 1143–1149 (2007)
17. Faes, L., Nollo, G., Porta, A.: Information domain approach to the investigation of
cardio-vascular, cardio-pulmonary, and vasculo-pulmonary causal couplings. Front.
Physiol. 2, 1–13 (2011)
18. Faes, L., Nollo, G., Porta, A.: Information-based detection of nonlinear Granger causal-
ity in multivariate processes via a nonuniform embedding technique. Phys. Rev. E 83,
051112 (2011)
19. Faes, L., Nollo, G., Porta, A.: Non-uniform multivariate embedding to assess the infor-
mation transfer in cardiovascular and cardiorespiratory variability series. Comput. Biol.
Med. 42, 290–297 (2012)
20. Faes, L., Nollo, G., Porta, A.: Compensated transfer entropy as a tool for reliably esti-
mating information transfer in physiological time series. Entropy 15, 198–219 (2013)
21. Cover, T.M., Thomas, J.A.: Elements of information theory. Wiley, New York (2006)
22. Kaiser, A., Schreiber, T.: Information transfer in continuous processes. Physica D 166,
43–62 (2002)
23. Lizier, J.T., Prokopenko, M., Zomaya, A.Y.: Local measures of information storage in
complex distributed computation. Information Sciences 208, 39–54 (2012)
24. Porta, A., Catai, A.M., Takahashi, A.C.M., Magagnin, V., Bassani, T., Tobaldini, E.,
Montano, N.: Information Transfer through the Spontaneous Baroreflex in Healthy Hu-
mans. Meth. Inf. Med. 49, 506–510 (2010)
25. Porta, A., Catai, A.M., Takahashi, A.C., Magagnin, V., Bassani, T., Tobaldini, E., van de Borne, P., Montano, N.: Causal relationships between heart period and systolic arterial
pressure during graded head-up tilt. Am. J. Physiol Regul. Integr. Comp. Physiol. 300,
R378–R386 (2011)
26. Vicente, R., Wibral, M., Lindner, M., Pipa, G.: Transfer entropy-a model-free measure of
effective connectivity for the neurosciences. Journal of Computational Neuroscience 30,
45–67 (2011)
27. Barnett, L., Barrett, A.B., Seth, A.K.: Granger causality and transfer entropy are equiva-
lent for Gaussian variables. Phys. Rev. Lett. 103, 238701 (2009)
28. Amblard, P.O., Michel, O.J.: The relation between Granger causality and directed infor-
mation theory: a review. Entropy 15, 113–143 (2013)
29. Vakorin, V.A., Krakovska, O.A., McIntosh, A.R.: Confounding effects of indirect con-
nections on causality estimation. J. Neurosci. Methods 184, 152–160 (2009)
30. Williams, P.L.: Nonnegative decomposition of multivariate information. ArXiv,
1004.2515 (2010)
31. Schreiber, T.: Interdisciplinary application of nonlinear time series methods. Phys.
Rep. 308, 1–64 (1999)
32. Takens, F.: Detecting strange attractors in fluid turbulence. In: Rand, D., Young, S.L.
(eds.) Dynamical Systems and Turbulence. Springer, Berlin (1981)
33. Vlachos, I., Kugiumtzis, D.: Nonuniform state-space reconstruction and coupling detec-
tion. Phys. Rev. E 82, 016207 (2010)
34. Small, M.: Applied nonlinear time series analysis: applications in physics, physiology
and finance. World Scientific (2005)
35. Runge, J., Heitzig, J., Petoukhov, V., Kurths, J.: Escaping the Curse of Dimensionality in
Estimating Multivariate Transfer Entropy. Phys. Rev. Lett. 108, 258701 (2012)
36. Pincus, S.M.: Approximated entropy (ApEn) as a complexity measure. Chaos, 110–117
(1995)
37. Hlaváčková-Schindler, K., Paluš, M., Vejmelka, M., Bhattacharya, J.: Causality detection
based on information-theoretic approaches in time series analysis. Phys. Rep. 441, 1–46
(2007)
38. Kugiumtzis, D., Tsimpiris, A.: Measures of Analysis of Time Series (MATS): A MAT-
LAB Toolkit for Computation of Multiple Measures on Time Series Data Bases. J. Stat.
Software 33, 1–30 (2010)
39. Wibral, M., Rahm, B., Rieder, M., Lindner, M., Vicente, R., Kaiser, J.: Transfer entropy
in magnetoencephalographic data: Quantifying information flow in cortical and cerebel-
lar networks. Progr. Biophys. Mol. Biol. 105, 80–97 (2011)
40. Lungarella, M., Pegors, T., Bulwinkle, D., Sporns, O.: Methods for quantifying the in-
formational structure of sensory and motor data. Neuroinformatics 3, 243–262 (2005)
41. Porta, A., Castiglioni, P., Bari, V., Bassani, T., Marchi, A., Cividjian, A., Quintin, L., Di
Rienzo, M.: K-nearest-neighbor conditional entropy approach for the assessment of the
short-term complexity of cardiovascular control. Phys. Meas. 34, 17–33 (2013)
42. Faes, L., Nollo, G.: Decomposing the transfer entropy to quantify lag-specific Granger
causality in cardiovascular variability. In: Proc. of the 35th Annual Int. Conf. IEEE-
EMBS, pp. 5049–5052 (2013)
43. Kugiumtzis, D.: Direct-coupling information measure from nonuniform embedding.
Phys. Rev. E 87, 062918 (2013)
44. Porta, A., Guzzetti, S., Montano, N., Furlan, R., Pagani, M., Malliani, A., Cerutti, S.:
Entropy, entropy rate, and pattern classification as tools to typify complexity in short
heart period variability series. IEEE Trans. Biomed. Eng. 48, 1282–1291 (2001)
45. Viola, A.U., Tobaldini, E., Chellappa, S.L., Casali, K.R., Porta, A., Montano, N.: Short-
Term Complexity of Cardiac Autonomic Control during Sleep: REM as a Potential Risk
Factor for Cardiovascular System in Aging. PLoS One 6 (2011)
46. Porta, A., Castiglioni, P., Di Rienzo, M., Bari, V., Bassani, T., Marchi, A., Wu, M.A.,
Cividjian, A., Quintin, L.: Information domain analysis of the spontaneous baroreflex
during pharmacological challenges. Auton. Neurosci. 178(1-2), 67–75 (2013)
47. Faes, L., Porta, A., Rossato, G., Adami, A., Tonon, D., Corica, A., Nollo, G.: Investi-
gating the mechanisms of cardiovascular and cerebrovascular regulation in orthostatic
syncope through an information decomposition strategy. Auton. Neurosci. 178(1-2), 76–
82 (2013)
Information Transfer in the Brain: Insights from
a Unified Approach

Daniele Marinazzo, Guorong Wu, Mario Pellicoro, and Sebastiano Stramaglia

Abstract. Measuring directed interactions in the brain in terms of information flow
is a promising approach, mathematically treatable and amenable to encompass sev-
eral methods. In this chapter we propose some approaches rooted in this framework
for the analysis of neuroimaging data. First we will explore how the transfer of in-
formation depends on the network structure, showing how for hierarchical networks
the information flow pattern is characterized by exponential distribution of the in-
coming information and a fat-tailed distribution of the outgoing information, as a
signature of the law of diminishing marginal returns. This was reported to be true
also for effective connectivity networks from human EEG data. Then we address
the problem of partial conditioning to a limited subset of variables, chosen as the
most informative ones for the driver node. We will then propose a formal expansion
of the transfer entropy to put in evidence irreducible sets of variables which provide
information for the future state of each assigned target. Multiplets characterized by
a large contribution to the expansion are associated to informational circuits present
in the system, with an informational character (synergetic or redundant) which can
be associated to the sign of the contribution. Applications are reported for EEG and
fMRI data.
Daniele Marinazzo
University of Gent, Department of Data Analysis, 1 Henri Dunantlaan, B9000 Gent, Belgium
e-mail: daniele.marinazzo@ugent.be
Guorong Wu
University of Gent, Department of Data Analysis, 1 Henri Dunantlaan, B9000 Gent, Belgium
and Key Laboratory for NeuroInformation of Ministry of Education, School of Life Science
and Technology, University of Electronic Science and Technology of China, Chengdu, China
e-mail: guorong.wu@ugent.be
Mario Pellicoro · Sebastiano Stramaglia
University of Bari, Physics Department,
Via Amendola 173, 70126 Bari, Italy
e-mail: {mario.pellicoro,sebastiano.stramaglia}@ba.infn.it


1 Economics of Information Transfer in Networks


Most social, biological, and technological systems can be modeled as complex net-
works, and display substantial non-trivial topological features [4, 10]. Moreover,
time series of simultaneously recorded variables are available in many fields of sci-
ence; the inference of the underlying network structure from these time series is an important problem that has received great attention in recent years.
In many situations it can be expected that each node of the network may handle a
limited amount of information. This structural constraint suggests that information
transfer networks should exhibit some topological evidences of the law of dimin-
ishing marginal returns [36], a fundamental principle of economics which states
that when the amount of a variable resource is increased, while other resources are
kept fixed, the resulting change in the output will eventually diminish [26]. Here we
introduce a simple dynamical network model where the topology of connections,
assumed to be undirected, gives rise to a peculiar pattern of the information flow
between nodes: a fat tailed distribution of the outgoing information, while the aver-
age incoming information transfer does not depend on the connectivity of the node.
In the proposed model the units, at the nodes the network, are characterized by a
transfer function that allows them to process just a limited amount of the incom-
ing information. In this case a possible way to quantify the law of the diminishing
marginal returns can be the discrepancy of the distributions, expressed as the ratio
of their standard deviations.

1.1 Model
We use a simple dynamical model with a threshold in order to quantify and inves-
tigate this phenomenon. Given an undirected network of n nodes and symmetric
connectivity matrix Ai j ∈ {0, 1}, to each node we associate a real variable xi whose
evolution, at discrete times, is given by:


xi(t + 1) = F( ∑j=1…n Aij xj(t) ) + σ ξi(t),        (1)

where ξ are unit variance Gaussian noise terms, whose strength is controlled by σ ;
F is a transfer function chosen as follows:
F(α) = a·α      if |α| < θ
F(α) = a·θ      if α > θ        (2)
F(α) = −a·θ     if α < −θ

where θ is a threshold value. This transfer function is chosen to mimic the fact that each unit is capable of handling only a limited amount of information. For large θ our model becomes a linear map. At intermediate values of θ, the nonlinearity connected to the threshold will affect mainly the most connected nodes (hubs): the input ∑j Aij xj to nodes with low connectivity will typically remain sub-threshold in this case.
Fig. 1 Examples of the three network architectures used in this study. Left: Preferential Attachment. Center: Homogeneous. Right: Scale-free.

Fig. 2 Segments of 200 time points from typical time series simulated in the scale-free network for three values of θ (θ = 0.001, 0.012, 0.1)

We consider hierarchical networks generated by the preferential attachment mechanism [2], which in the deterministic case leads to a scale-free network. Examples of a preferential attachment network, a scale-free network and a homogeneous network are reported in figure 1. A segment of 200 time points of a typical time series for three values of θ is plotted in figure 2.
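A minimal sketch of the model is given below (our own simplified implementation: the preferential-attachment routine attaches each new node with a single link, which is a simplification of the construction used for the figures, and the parameter values a = 0.1, σ = 0.1 are those quoted later in the chapter):

```python
import numpy as np

def preferential_attachment(n_nodes=30, rng=None):
    """Grow an undirected graph: each new node links to one existing node
    chosen with probability proportional to its current degree."""
    rng = rng or np.random.default_rng(0)
    A = np.zeros((n_nodes, n_nodes), dtype=int)
    A[0, 1] = A[1, 0] = 1
    for i in range(2, n_nodes):
        deg = A[:i, :i].sum(axis=1)
        j = rng.choice(i, p=deg / deg.sum())
        A[i, j] = A[j, i] = 1
    return A

def F(alpha, a=0.1, theta=0.012):
    """Saturating transfer function of eq. (2)."""
    return np.clip(a * alpha, -a * theta, a * theta)

def simulate(A, T=10000, a=0.1, sigma=0.1, theta=0.012, rng=None):
    """Iterate eq. (1): x_i(t+1) = F(sum_j A_ij x_j(t)) + sigma * xi_i(t)."""
    rng = rng or np.random.default_rng(1)
    x = np.zeros((T, A.shape[0]))
    for t in range(T - 1):
        x[t + 1] = F(A @ x[t], a, theta) + sigma * rng.standard_normal(A.shape[0])
    return x

x = simulate(preferential_attachment())
```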
From numerical simulations of eq. (1), we evaluate the linear causality pattern for this system as the threshold is varied. We verify that, in spite of the threshold, the variables are nearly Gaussian, so that we may identify the causality with the information flow between variables [5]. We compute the incoming and outgoing information flow from and to each node, cin and cout, summing respectively all the sources for a given target and all the targets for a given source.
Fig. 3 The ratio between the standard deviation of cout and that of cin, R, is plotted versus θ for the three network architectures: preferential attachment (PRE), deterministic scale-free (SFN) and homogeneous (HOM). The parameters of the dynamical system are a = 0.1 and σ = 0.1. Networks built by preferential attachment are made of 30 nodes and 30 undirected links, while the deterministic scale-free network has 27 nodes. The homogeneous networks have 27 nodes, each connected to two other randomly chosen nodes.

It is worth underlining that no threshold is applied to the connectivity matrix, so that all the information flowing in the network is accounted for. We then evaluate the standard deviation of the distributions of cin and cout over all the nodes, varying the realization of the preferential attachment network and iterating eq. (1) for 10000 time points.
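Since the variables are close to Gaussian, the pairwise information flow can be approximated by a linear (Granger-type) measure; a rough sketch with order-1 regressions, together with the aggregation into cin, cout and R, is given below (our own simplified code, reusing the array `x` returned by the simulation sketch above):

```python
import numpy as np

def gaussian_te(source, target):
    """Linear estimate of the information flow source -> target with one lag:
    0.5 * log( var(restricted residuals) / var(full residuals) )."""
    Y = target[1:]
    Restr = np.column_stack([target[:-1], np.ones(len(Y))])            # target past only
    Full = np.column_stack([target[:-1], source[:-1], np.ones(len(Y))])
    res_r = Y - Restr @ np.linalg.lstsq(Restr, Y, rcond=None)[0]
    res_f = Y - Full @ np.linalg.lstsq(Full, Y, rcond=None)[0]
    return 0.5 * np.log(np.var(res_r) / np.var(res_f))

def in_out_ratio(x):
    """c_in and c_out per node, and the ratio R of their standard deviations."""
    n = x.shape[1]
    G = np.zeros((n, n))                        # G[i, j]: flow from node i to node j
    for i in range(n):
        for j in range(n):
            if i != j:
                G[i, j] = gaussian_te(x[:, i], x[:, j])
    c_in, c_out = G.sum(axis=0), G.sum(axis=1)
    return c_in, c_out, c_out.std() / c_in.std()

c_in, c_out, R = in_out_ratio(x)
```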
In figure 3 we depict R, the ratio of the standard deviation of cout to that of cin, as a function of θ. As the threshold is varied, we encounter a range of values for which the distribution of cin is much narrower than that of cout. In the same figure we also depict the corresponding curve for deterministic scale-free networks [3], which exhibits a similar peak, and for homogeneous random graphs (or Erdos-Renyi networks [17]), for which R is always very close to one. The discrepancy between the distributions of the incoming and outgoing causalities thus arises in hierarchical networks. We remark that, in order to quantify the difference between the distributions of cin and cout, here we use the ratio of standard deviations, but qualitatively similar results would have been obtained using other measures of discrepancy.
In figure 4 we report the scatter plot in the plane cin − cout for preferential attachment networks and for some values of the threshold. The distributions of cin and cout, with θ equal to 0.012 and corresponding to the peak of figure 3, are depicted in figure 5: cin appears to be exponentially distributed, whilst cout shows a fat tail. In other words, the power-law connectivity of the underlying network influences just the distribution of the outgoing directed influences.
Fig. 4 Scatter plot in the plane cin − cout for undirected networks of 30 nodes and 30 links
built by means of the preferential attachment mechanism. The parameters of the dynamical
system are a = 0.1 and σ = 0.1. The points correspond to all the nodes pooled from 100
realizations of preferential attachment networks, each with 10 simulations of eqs. (1) for
10000 time points. (Top-left) Scatter plot of the distribution for all nodes at θ = 0.001. (Top-
right) Contour plot of the distribution for all nodes at θ = 0.012. (Bottom-left) Scatter plot of
the distribution for all nodes at θ = 0.1. (Bottom-right) The total Granger causality (directed
influence) (obtained summing over all pairs of nodes) is plotted versus θ ; circles point to the
values of θ in the previous subfigures.

average value of cin and cout versus the connectivity k of the network node: cout
grows uniformly with k, thus confirming that its fat tail is a consequence of the
power law of the connectivity. On the contrary cin appears to be almost constant: on
average the nodes receive the same amount of information, irrespective of k, whilst
the outgoing information from each node depends on the number of neighbors. It
is worth mentioning that, since a precise estimation of the information flow is computationally
expensive, our simulations are restricted to rather small networks; in particular, the
distribution of cout appears to have a fat tail but, due to our limited data, we cannot claim
that it corresponds to a simple power law. The same model
was then implemented on an anatomical connectivity matrix obtained via diffusion
spectrum imaging (DSI) and white matter tractography [22]. Also in this case we
observe a modulation of R and some scatter plots (figure 7) qualitatively similar to
the ones depicted in figures 3 and 4. In this case a multimodal distribution emerges
for high values of θ , as we can observe also in the histograms in figure 8. In figure 9
we can clearly identify some nodes in the structural connection matrix in which the

Fig. 5 For the preferential attachment network, at θ = 0.012, the distributions (by smooth-
ing spline estimation) of cin and cout for all the nodes, pooled from all the realizations, are
depicted. Units on the vertical axis are arbitrary.

Fig. 6 In the ensemble of preferential attachment networks of figure (2), at θ = 0.012, cin and
cout are averaged over nodes with the same connectivity and plotted versus the connectivity k.

law of diminishing marginal returns is highly expressed. The value of the threshold also has
an influence on the ratio S between interhemispheric and intrahemispheric information
transfer (figure 10). Interestingly, the maximum of this ratio occurs at a finite value of θ,
different from the one at which R is maximal.


Fig. 7 Top left: the ratio R between the standard deviation of cout and that of cin is plotted
versus θ when the threshold model is implemented on the connectome structure. Plots in the
plane cin − cout for three values of θ: 0.01 (top right), 0.0345 (bottom left), 0.5 (bottom right).


Fig. 8 The distributions of cin and cout for three values of θ (0.01, 0.0345 and 0.5) when the
threshold model is implemented on the connectome structure. Units on the vertical axis are arbitrary.

1.2 Electroencephalographic Recordings


As a real example we consider electroencephalogram (EEG) data. We used recordings
obtained at rest from 10 healthy subjects. During the experiment, which lasted
for 15 min, the subjects were instructed to relax and keep their eyes closed. To avoid
drowsiness, every minute the subjects were asked to open their eyes for 5 s. EEG
was measured with a standard 10-20 system consisting of 19 channels [31]. Data
were analyzed using the linked mastoids reference, and are available from [46].
For each subject we considered several epochs of 4 seconds in which the subjects
kept their eyes closed. For each epoch we computed multivariate Kernel Granger


Fig. 9 The ratio R between the standard deviation of cout and that of cin is mapped onto the
66 regions of the structural connectivity matrix. In the figure 998 nodes are displayed; those
belonging to the same region in the coarser template have the same color and size.

Fig. 10 The ratio S between intrahemispheric and interhemispheric information transfer in the
threshold model implemented on the connectome structure as a function of θ. The circles
indicate the same values of θ as in figures 7 and 8.

Causality [27] using a linear kernel and a model order of 5, determined by leave-
one-out cross-validation. We then pooled all the values for information flow towards
and from any electrode and analyzed their distribution.
In figure 11 we plot the incoming versus the outgoing values of the information
transfer, as well as the distributions of the two quantities: the incoming information
seems exponentially distributed whilst the outgoing information shows a fat tail.
These results suggest that overall brain effective connectivity networks may also be
considered in the light of the law of diminishing marginal returns.
More interestingly, this pattern is reproduced locally but with a clear modulation:
a topographic analysis has also been made considering the distribution of incoming
and outgoing causalities at each electrode. In figure 12 we show the distributions
of incoming and outgoing connections corresponding to the electrodes locations on
the scalp, and the corresponding map of the parameter R; the law of diminishing
marginal returns seems to affect mostly the temporal regions. This well defined pat-
tern suggests a functional role for the distributions. It is worth to note that this pattern
has been reproduced in other EEG data at rest from 9 healthy subjects collected for
another study with a different equipment.

Fig. 11 For the EEG data the distributions of cin and cout are depicted as a scatter plot (left)
and in terms of their distributions, obtained by smoothing spline estimation (right).

Fig. 12 Left: the distributions for incoming (above, light grey) and outgoing (below, dark
grey) information at each EEG electrode displayed on the scalp map (original binning and
smoothing spline estimation). Right: the distribution on the scalp of R, the ratio between the
standard deviations of the distributions of outgoing and incoming information, for EEG data.

2 Partial Conditioning of Granger Causality


Granger causality has become the method of choice to determine whether and how
two time series exert causal influences on each other [23],[13]. This approach is
based on prediction: if the prediction error of the first time series is reduced by
including measurements from the second one in the linear regression model, then
the second time series is said to have a causal influence on the first one. This framework
has been used in many fields of science, including neural systems [24],[9],[34], and
cardiovascular variability [18].
From the beginning [21],[41], it has been known that if two signals are influenced by a
third one that is not included in the regressions, this leads to spurious causalities, so an
extension to the multivariate case is in order. The conditional
Granger causality analysis (CGCA) [19] is based on a straightforward expansion
of the autoregressive model to a general multivariate case including all measured

variables. CGCA has been proposed to correctly estimate coupling in multivariate data sets
[6],[14],[15],[45]. Sometimes, though, a fully multivariate approach can lead to problems
that are not only computational but also conceptual: in the presence of redundant variables,
the application of the standard analysis leads to an underestimation of causalities [1].
Several approaches have been proposed in order to reduce dimensionality in mul-
tivariate sets, relying on generalized variance [6], principal components analysis
[45] or Granger causality itself [29].
Here we will address the problem of partial conditioning to a limited subset of
variables, in the framework of information theory. Intuitively, one may expect that
conditioning on a small number of variables should remove most of the indirect
interactions if the connectivity pattern is sparse. We will show that this subgroup
of variables might be chosen as the most informative for the driver variable, and
describe the application to simulated examples and a real data set.

2.1 Finding the Most Informative Variables


We start by describing the connection between Granger causality and information-
theoretic approaches like the transfer entropy [38]. Let {ξn}n=1,...,N+m be a time series that
may be approximated by a stationary Markov process of order m, i.e. p(ξn | ξn−1, . . . , ξn−m) =
p(ξn | ξn−1, . . . , ξn−m−1). We will use the shorthand notation
Xi = (ξi , . . . , ξi+m−1 ) and xi = ξi+m , for i = 1, . . . , N, and treat these quantities as N
realizations of the stochastic variables X and x. The minimizer of the risk functional

R[ f ] = ∫ dX dx (x − f (X))² p(X, x)    (3)

represents the best estimate of x, given X, and corresponds [32] to the regression function
f ∗(X) = ∫ dx p(x|X) x. Now, let {ηn}n=1,...,N+m be another time series of simultaneously
acquired quantities, and denote Yi = (ηi , . . . , ηi+m−1 ). The best estimate of x, given X and
Y , is now g∗(X,Y ) = ∫ dx p(x|X,Y ) x. If the generalized Markov property holds, i.e.
Markov property holds, i.e.

p(x|X,Y ) = p(x|X), (4)

then f ∗ (X) = g∗ (X,Y ) and the knowledge of Y does not improve the prediction of
x. Transfer entropy [38] is a measure of the violation of (4): it follows that Granger causality
implies non-zero transfer entropy [27]. Under the Gaussian assumption it can be shown that
Granger causality and transfer entropy are entirely equivalent, differing only by a factor of
two [5]. The generalization of Granger causality to the multivariate case, described in the
following, allows the analysis of dynamical networks [28] and makes it possible to discern
between direct and indirect interactions.
Let us consider n time series {xα (t)}α =1,...,n ; the state vectors are denoted

Yα (t) = (xα (t − m), . . . , xα (t − 1)),



m being the window length (the choice of m can be made using the standard cross-validation
scheme). Let ε(xα |X) be the mean squared error of the prediction of xα on the basis of all
the vectors X (corresponding to linear regression, or nonlinear regression by the kernel
approach described in [27]). The multivariate Granger causality index c(β → α) is defined
as follows: consider the prediction of xα on the basis of all the variables but Xβ and the
prediction of xα using all the variables; the causality then measures the variation of the
error between the two conditions, i.e.

c(β → α) = log [ ε(xα | X \ Xβ) / ε(xα | X) ] .    (5)

Note that in [27] a different definition of causality has been used,



δ(β → α) = [ ε(xα | X \ Xβ) − ε(xα | X) ] / ε(xα | X \ Xβ) ;    (6)

The two definitions are clearly related by a monotonic transformation:

c(β → α ) = − log [1 − δ (β → α )]. (7)

Here we first evaluate the causality δ(β → α) using the selection of significant eigenvalues
described in [27] to address the problem of over-fitting in (6); then we use (7) and express
our results in terms of c(β → α), because it is with this definition that causality is twice
the transfer entropy, equal to I{xα ; Xβ | X \ Xβ }, in the Gaussian case [5].
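As a small illustration of definition (5) in the linear case, the following sketch (our own toy
code, not the kernel method of [27]) estimates c(β → α) by comparing the residual variances
of two least-squares regressions, with and without the lagged values of Xβ among the
regressors; under the Gaussian assumption this index is twice the transfer entropy [5]:

import numpy as np

def lagged_design(data, m):
    """data has shape (T, n). Returns the targets x_t (shape (T-m, n)) and the
    stacked lagged regressors (shape (T-m, n*m)); column j of the regressors
    holds variable j % n at lag j // n + 1."""
    T, n = data.shape
    X = np.hstack([data[m - k - 1:T - k - 1] for k in range(m)])
    return data[m:], X

def granger_index(data, beta, alpha, m=2):
    """Linear multivariate Granger causality c(beta -> alpha), eq. (5)."""
    x, X = lagged_design(data, m)
    n = data.shape[1]
    keep = [j for j in range(X.shape[1]) if j % n != beta]   # drop all lags of X_beta
    def resid_var(Z):
        coef, *_ = np.linalg.lstsq(Z, x[:, alpha], rcond=None)
        return np.var(x[:, alpha] - Z @ coef)
    return np.log(resid_var(X[:, keep]) / resid_var(X))

# Toy example in which variable 1 drives variable 0
rng = np.random.default_rng(1)
data = rng.standard_normal((2000, 5))
data[1:, 0] += 0.4 * data[:-1, 1]
print(granger_index(data, beta=1, alpha=0), granger_index(data, beta=2, alpha=0))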
We now address the problem of coping with a large number of variables, when the application
of multivariate Granger causality may be questionable or even unfeasible, whilst a bivariate
analysis would also detect indirect influences. Here we show that conditioning on a small
number of variables, chosen as the most informative for the candidate driver variable, is
sufficient to remove most of the indirect interactions for sparse connectivity patterns.
Conditioning on a large number of variables requires a large number of samples in order to
get reliable results. Reducing the number of variables that one has to condition on would
thus provide better results for small data sets. In the general formulation of Granger
causality, one has no way to choose this reduced set of variables; on the other hand, in the
framework of information theory, it is possible to identify the most informative variables
one by one. Once it has been demonstrated [5] that Granger causality is equivalent to the
information flow between Gaussian variables, partial conditioning becomes possible for
Granger causality estimation; to our knowledge this is the first time that such an approach
has been proposed.
Concretely, let us consider the causality β → α; we fix the number of variables to be used
for conditioning equal to nd. We denote by Z = {Xi1 , . . . , Xind } the set of the nd variables,
in X \ Xβ, most informative for Xβ. In other words, Z maximizes the mutual information
I{Xβ ; Z} among all the subsets Z of nd variables. Then, we evaluate the causality

c(β → α) = log [ ε(xα | Z) / ε(xα | Z ∪ Xβ) ] .    (8)

Under the Gaussian assumption, the mutual information I{Xβ ; Z} can be easily evaluated,
see [5]. Moreover, instead of searching among all the subsets of nd variables, we adopt the
following approximate strategy. Firstly, the mutual information between the driver variable
and each of the other variables is estimated, in order to choose the first variable of the
subset. The second variable of the subset is then selected among the remaining ones as the
one that, jointly with the previously chosen variable, maximizes the mutual information with
the driver variable. One keeps adding variables by iterating this procedure: calling Zk−1 the
selected set of k − 1 variables, the set Zk is obtained by adding to Zk−1 the variable, among
the remaining ones, that provides the greatest information gain. This is repeated until nd
variables are selected. This greedy algorithm for the selection of relevant variables is
expected to give good results under the assumption of sparseness of the connectivity.
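A compact sketch of this greedy selection under the Gaussian assumption is given below; for
brevity each variable is represented by a single sample rather than by its full lagged state
vector, and the function names, like the final call to the partially conditioned causality of
eq. (8), are our own illustration:

import numpy as np

def gaussian_mi(cov, a, b):
    """Mutual information (in nats) between the groups of jointly Gaussian
    variables indexed by lists a and b, from the full covariance matrix."""
    logdet = lambda idx: np.linalg.slogdet(cov[np.ix_(idx, idx)])[1]
    return 0.5 * (logdet(a) + logdet(b) - logdet(a + b))

def greedy_informative_set(cov, beta, nd):
    """Greedily select nd variables jointly most informative about the driver
    variable beta (approximate maximization of I{X_beta; Z})."""
    Z, remaining = [], [i for i in range(cov.shape[0]) if i != beta]
    for _ in range(nd):
        gains = [gaussian_mi(cov, [beta], Z + [i]) for i in remaining]
        best = remaining[int(np.argmax(gains))]
        Z.append(best)
        remaining.remove(best)
    return Z

def partial_gc(data, beta, alpha, Z):
    """Partially conditioned causality c(beta -> alpha) of eq. (8),
    using a single lag per variable for brevity."""
    x = data[1:, alpha]
    def resid_var(cols):
        X = data[:-1][:, cols]
        coef, *_ = np.linalg.lstsq(X, x, rcond=None)
        return np.var(x - X @ coef)
    return np.log(resid_var(Z) / resid_var(Z + [beta]))

rng = np.random.default_rng(2)
data = rng.standard_normal((3000, 8))
cov = np.cov(data, rowvar=False)
Z = greedy_informative_set(cov, beta=3, nd=4)
print(Z, partial_gc(data, beta=3, alpha=0, Z=Z))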

2.2 Partial Conditioning in a Dynamical Model


Let us consider linear dynamical systems on a lattice of n nodes, with equations, for
i = 1, . . . , n:
xi,t = ∑j=1,...,n aij xj,t−1 + s τi,t ,    (9)

where a’s are the couplings, s is the strength of the noise and τ ’s are unit variance
i.i.d. Gaussian noise terms. The level of noise determines the minimal amount of
samples needed to assess that the structures recovered by the proposed approach
are genuine and not due to randomness, as happens for the standard Granger causality (see
discussions in [27] and [28]); in particular, the noise should not be so high as to obscure
deterministic effects.
As an example, we fix n = 34 and construct the couplings in terms of the well-known Zachary
data set [44], an undirected network of 34 nodes. We assign a direction to each link with
equal probability, and set aij equal to 0.015 for each link of the directed graph thus
obtained, and to zero otherwise. The noise level is set to s = 0.5. The goal is again to
estimate this directed network from the time series measured at the nodes.
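A minimal simulation of eqs. (9) along these lines might look as follows (the use of
networkx's built-in karate-club graph and the convention A[i, j] = coupling from node j to
node i are our own choices):

import numpy as np
import networkx as nx

def simulate_zachary(T=1000, coupling=0.015, s=0.5, seed=0):
    """Simulate eq. (9), x_{i,t} = sum_j a_ij x_{j,t-1} + s tau_{i,t},
    on the Zachary karate-club network with randomly oriented links."""
    rng = np.random.default_rng(seed)
    G = nx.karate_club_graph()                  # undirected, 34 nodes
    n = G.number_of_nodes()
    A = np.zeros((n, n))
    for u, v in G.edges():
        if rng.random() < 0.5:                  # choose a direction at random
            u, v = v, u
        A[v, u] = coupling                      # node u drives node v
    x = np.zeros((T, n))
    for t in range(1, T):
        x[t] = A @ x[t - 1] + s * rng.standard_normal(n)
    return x, A

x, A = simulate_zachary()
print(x.shape, int((A > 0).sum()), "directed links")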
In figure (13) we show the application of the proposed methodology to data sets
generated by eqs. (9), in terms of sensitivity and specificity, for different numbers
of samples. The bivariate analysis detects several false interactions; however, conditioning
on a few variables is sufficient to highlight just the direct causalities. Due to the
sparseness of the underlying graph, we get a result very close to that of the full
multivariate analysis; the multivariate analysis here recovers the true network, since the
number of samples is sufficiently high. In figure (14), concerning the stage of selecting the
variables on which to condition, we plot the mutual information gain Δy as a function of the
number of included variables nd: it decreases as nd increases.


Fig. 13 Sensitivity and specificity for the recovery of the Zachary network structure from the
dynamics at its nodes are plotted versus nd, the number of variables selected for conditioning,
for two values of the number of samples N: 500 (left) and 1000 (right). The order is m = 2;
similar results are obtained varying m. The results are averaged over 100 realizations of the
linear dynamical system described in the text. The empty square, in correspondence to nd = 0,
is the result from the bivariate analysis. The horizontal line is the outcome from the
multivariate analysis, where all variables are used for conditioning.

Fig. 14 The mutual information gain Δy for the Zachary network, when the (nd + 1)-th variable
is included, is plotted versus nd for two values of the number of samples N, 500 (top) and
1000 (bottom). The order is m = 2. The information gain is averaged over all the variables.

2.3 Partial Conditioning in Resting State fMRI


We used a resting state dataset from a public repository1. Data were acquired using a
single-shot gradient echo planar imaging (EPI) sequence (repetition time [TR]: 645 ms; echo
time: 30 ms; slices: 33; thickness: 3 mm; gap: 0.6 mm; field of view: 200 × 200 mm²; in-plane
resolution: 64 × 64; flip angle: 90°). Preprocessing of resting-state images was performed
using the Statistical Parametric Mapping software (SPM8, http://www.fil.ion.ucl.ac.uk/spm)
and included: slice-timing correction relative to the middle axial slice for the temporal
difference in acquisition among slices; realignment with the corresponding 3-D structural
image; head motion correction (for all subjects, the translational or rotational parameters
of a data set did not exceed ±1 mm or ±1°); and spatial normalization into a standard
stereotaxic space, in which the parameters from normalizing the 3-D structural images to the
Montreal Neurological Institute T1 template in SPM8 were applied to the fMRI images, which
were then resampled to 3-mm isotropic voxels. The functional images were segmented into 90
regions of interest (ROIs) using the automated anatomical labeling (AAL) template [40]. For
each subject, the representative time series of each ROI was obtained by averaging the fMRI
time series across all voxels in the ROI. Several procedures were used to remove possible
spurious variance from the data through linear regression: 1) the six head motion parameters
obtained in the realignment step, 2) the signal from a region in cerebrospinal fluid, 3) the
signal from a region centered in the white matter, and 4) the global signal averaged over the
whole brain. The hemodynamic response function was deconvolved from the BOLD time series.
In order to select the variables on which to condition, in figure 15 we plot the mutual
information gain for a given target (left posterior cingulate gyrus) as a function of the
number of included variables nd: as expected, it decreases as nd increases. The same behavior
is reproduced for all the targets. We can observe that the curve starts to become less steep
after nd = 6. This phenomenon could be explained by considering that multivariate analysis by
hierarchical clustering and multidimensional scaling consistently defined six major systems
in the resting brain [35]. This is confirmed by looking at figure 16, in which the variables
most frequently chosen for the same given target are reported. It is evident that these are
generally sampled at a large scale across the brain, picking up information even from distant
regions.

3 Informative Clustering
In this last section we propose a formal expansion of the transfer entropy to highlight
irreducible sets of variables which provide information about the future state of each
assigned target. Multiplets characterized by a high value will be associated with
informational circuits present in the system, with an informational character (synergetic or
redundant) that can be associated with the sign of the contribution. We also present results
on fMRI and EEG data sets.
1 http://www.nitrc.org/projects/fcon_1000/

Fig. 15 The mutual information gain when the target is the left posterior cingulate gyrus,
when the (nd + 1)-th variable is included, is plotted versus nd.

Fig. 16 Variables chosen among the 10 most informative when the target is the left posterior
cingulate gyrus (in blue). The diameter of the red spheres is proportional to the number of
times that a region is selected across different subjects.

3.1 Identification of Irreducible Subgraphs


Information theoretic treatment of groups of correlated degrees of freedom can re-
veal their functional roles as memory structures or those capable of processing infor-
mation [12]. Information quantities reveal whether a group of variables is mutually
redundant or synergetic [37, 7]. The application of these insights to identify functional
connectivity structure is a promising line of research. Most approaches for the
identification of functional relations among the nodes of a complex network rely on the
statistics of motifs, subgraphs of k nodes that appear more abundantly than expected
in randomized networks with the same number of nodes and degree of connectivity
[30, 42]. An approach to identify functional subgraphs in complex networks, relying
on an exact expansion of the mutual information with a group of variables, has been
presented in [8].
On the other hand, understanding couplings between dynamical subsystems is
a topic of general interest. Transfer entropy [38], which is related to the concept
of Granger causality [21], has been proposed to distinguish effectively driving and
responding elements and to detect asymmetry in the interaction of subsystems. By
appropriate conditioning of transition probabilities this quantity has been shown to

be superior to the standard time-delayed mutual information, which fails to distinguish
information that is actually exchanged from shared information due to common history and
input signals. Granger causality, for its part, formalizes the notion that, if the prediction
of one time series can be improved by incorporating knowledge of past values of a second one,
then the latter is said to have a causal influence on the former. Initially developed for
econometric applications, Granger
causality has gained popularity also in neuroscience (see, e.g., [9, 39, 16, 27]). A
discussion about the practical estimation of information theoretic indexes for sig-
nals of limited length can be found in [33].
Here we present a formal expansion of the transfer entropy to highlight irreducible sets of
variables which provide information about the future state of the target. Multiplets
characterized by a high value, unjustifiable by chance, will be associated with informational
circuits present in the system, with an informational character (synergetic or redundant)
that can be associated with the sign of the contribution.

Fig. 17 Concerning fMRI data, the distributions of the first order terms in the expansions,
eqs. (18) and (13), are depicted.

4 Expansion of the Transfer Entropy


We start by describing the work in [8]. Given a stochastic variable X and a family of
stochastic variables {Yk}k=1,...,n , the following expansion for the mutual information has
been derived there:
S(X|{Y}) − S(X) = −I(X; {Y}) = ∑i ΔS(X)/ΔYi + ∑i>j Δ²S(X)/(ΔYi ΔYj) + · · · + ΔⁿS(X)/(ΔY1 · · · ΔYn) ,    (10)

where the variational operators are defined as

ΔS(X)/ΔYi = S(X|Yi ) − S(X) = −I(X; Yi ) ,    (11)

Δ²S(X)/(ΔYi ΔYj ) = −ΔI(X; Yi )/ΔYj = I(X; Yi ) − I(X; Yi |Yj ),    (12)

and so on.

Fig. 18 Concerning fMRI data, the distribution of the first order term in the expansion of the
transfer entropy, eq. (18), is compared with the results corresponding to a reshuffling of the
target time series.

Fig. 19 Concerning fMRI data, the distributions of the second order terms in the expansions,
eqs. (19) and (14), are depicted.

Now, let us consider n + 1 time series {xα (t)}α =0,...,n . The lagged state vectors
are denoted
Yα (t) = (xα (t − m), . . . , xα (t − 1)),
m being the window length.
Firstly we may use the expansion (10) to model the statistical dependencies
among the x variables at equal times. We take x0 as the target time series, and the
first terms of the expansion are

Wi0 = −I (x0 ; xi ) (13)

for the first order;


Zi0j = I (x0 ; xi ) − I (x0 ; xi |x j ) (14)
for the second order; and so on. Here we propose to consider also

S(x0 | {Yk}k=1,...,n ) − S(x0 ) = −I(x0 ; {Yk}k=1,...,n ) ,    (15)

which measures to what extent the remaining variables contribute to specifying the
future state of x0 . This quantity can be expanded according to (10):

S(x0 | {Yk}k=1,...,n ) − S(x0 ) = ∑i ΔS(x0 )/ΔYi + ∑i>j Δ²S(x0 )/(ΔYi ΔYj ) + · · · + ΔⁿS(x0 )/(ΔY1 · · · ΔYn ) .    (16)

Fig. 20 Concerning fMRI data, the distribution of the second order term in the expansion of
the transfer entropy, eq. (19), is compared with the results corresponding to a reshuffling of
the target time series.

A drawback of the expansion above is that it does not remove shared information
due to common history and input signals; therefore we propose to condition on the
past of x0 , i.e. Y0 . To this aim we introduce the conditioning operator CY0 :

CY0 S(X) = S(X|Y0 ),

and observe that CY0 and the variational operators (11) commute. It follows that we
can condition the expansion (16) term by term, thus obtaining

S(x0 | {Yk}k=1,...,n , Y0 ) − S(x0 |Y0 ) = −I(x0 ; {Yk}k=1,...,n | Y0 ) = ∑i ΔS(x0 |Y0 )/ΔYi + ∑i>j Δ²S(x0 |Y0 )/(ΔYi ΔYj ) + · · · + ΔⁿS(x0 |Y0 )/(ΔY1 · · · ΔYn ) .    (17)

Fig. 21 Concerning fMRI data, the distribution of the third order term in the expansion of the
transfer entropy, eq. (20), is compared with the results corresponding to a reshuffling of the
target time series.

We note that variations at every order in (17) are symmetrical under permutations
of the Yi. Moreover, statistical independence among any of the Yi results in a vanishing
contribution at that order: each nonvanishing term in this expansion accounts for an
irreducible set of variables providing information for the specification of the target.
The first order terms in the expansion are given by:

A0i = ΔS(x0 |Y0 )/ΔYi = −I(x0 ; Yi |Y0 ) ,    (18)
and coincide with the bivariate transfer entropies i → 0 (times -1). The second order
terms are
B0i j = I (x0 ;Yi |Y0 ) − I (x0 ;Yi |Y j ,Y0 ) , (19)
whilst the third order terms are
C0ijk = I(x0 ; Yi |Yj ,Y0 ) + I(x0 ; Yi |Yk ,Y0 ) − I(x0 ; Yi |Y0 ) − I(x0 ; Yi |Yj ,Yk ,Y0 ) .    (20)

An important property of (17) is that the sign of nonvanishing terms reveals the
informational character of the corresponding set of variables: a negative sign indicates that
the group of variables contributes more information to the state of the target than the sum
of its subgroups (synergy), while positive contributions correspond to redundancy.
Another important point that we address here is how to get a reliable estimate of the
conditional mutual information from data. In this work we adopt the assumption of
Gaussianity and use the exact expression that holds in this case [5], which reads as follows.
Given multivariate Gaussian random variables X, W and Z, the conditioned mutual information
is

I(X; W |Z) = (1/2) ln [ |Σ(X|Z)| / |Σ(X|W ⊕ Z)| ] ,    (21)

where | · | denotes the determinant, and the partial covariance matrix is defined

Σ(X|Z) = Σ(X) − Σ(X, Z) Σ(Z)⁻¹ Σ(X, Z)ᵀ ,    (22)

in terms of the covariance matrix Σ (X) and the cross covariance matrix Σ (X, Z); the
definition of Σ (X|W ⊕ Z) is analogous.
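A short sketch of this Gaussian estimator, together with the first- and second-order terms of
eqs. (18) and (19) computed with a single lag (our own simplification of the lagged state
vectors), is given below; column 0 of the data matrix plays the role of the target x0 and
i, j ≥ 1 index the sources:

import numpy as np

def cond_mi(data, X, W, Z):
    """Conditional mutual information I(X; W | Z), eqs. (21)-(22), under the
    Gaussian assumption; X, W, Z are lists of column indices (Z may be [])."""
    cov = np.cov(data, rowvar=False)
    def partial_cov(a, b):
        Sa = cov[np.ix_(a, a)]
        if not b:
            return Sa
        Sab, Sb = cov[np.ix_(a, b)], cov[np.ix_(b, b)]
        return Sa - Sab @ np.linalg.solve(Sb, Sab.T)
    return 0.5 * (np.linalg.slogdet(partial_cov(X, Z))[1]
                  - np.linalg.slogdet(partial_cov(X, W + Z))[1])

def expansion_terms(ts, i, j):
    """First- and second-order terms A0_i (eq. 18) and B0_ij (eq. 19) for
    target column 0 and source columns i, j >= 1, with a single lag."""
    lagged = np.hstack([ts[1:, [0]], ts[:-1]])   # columns: x0(t), then Y_k = x_k(t-1)
    x0, Y0, Yi, Yj = [0], [1], [1 + i], [1 + j]
    A_i = -cond_mi(lagged, x0, Yi, Y0)
    B_ij = cond_mi(lagged, x0, Yi, Y0) - cond_mi(lagged, x0, Yi, Yj + Y0)
    return A_i, B_ij

rng = np.random.default_rng(3)
ts = rng.standard_normal((5000, 4))
ts[1:, 0] += 0.5 * ts[:-1, 1] + 0.5 * ts[:-1, 2]
print(expansion_terms(ts, 1, 2))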

4.1 Applications: Magnetic Resonance and EEG Data


In order to test this approach on a real neuroimaging dataset we used the resting state fMRI
data described in the previous section.
For each subject, we evaluated the first terms in the expansions of the conditional
mutual information. We then pooled all the values of the terms in the expansions,
from all subjects and all targets, and we report their distributions in the following
figures. In figure (17) we compare the distributions of A0i , the first order terms in the
expansion of the information flow (equivalent to the bivariate transfer entropy), with
those of the equal time dependencies Wi0 . This figure shows that the expansion terms
of the mutual information have a quite wide distribution, and also that the maximum
of the distribution is not at zero, suggesting that the data set is characterized by
many equal time statistical dependencies and by nontrivial causal connections. In
figure (18) the distribution of the bivariate transfer entropies is compared with those

Fig. 22 Concerning EEG data, the distributions of the first order terms in the expansions,
eqs. (18) and (13), are depicted.

Fig. 23 Concerning EEG data, the distribution of the first order term in the expansion of the
transfer entropy, eq. (18), is compared with the results corresponding to a reshuffling of the
target time series.

obtained after a random reshuffling of the target time series: the surrogate test at 5%
confidence shows that a relevant fraction of bivariate interactions is statistically sig-
nificant. In figure (19) we report the distributions of the second order terms, both for
information flow and for instantaneous correlations: negative and positive terms are
present, i.e. both synergetic and redundant circuits of three variables are evidenced
by the proposed approach. Some of these interactions are statistically significant,
see figure (20).
In figure (21) we report the distribution of the third order terms for the information flow
which correspond to the target posterior cingulate gyrus, a major node within the default
mode network (DMN) with high metabolic activity and dense structural connectivity to
widespread brain regions, which suggests it has a role as a cortical hub. The region appears
to be involved in internally directed thought, for example, memory recollection. We compare
the distribution with the corresponding one for the shuffled target; it appears that there
are significant circuits of four variables, involving the posterior cingulate gyrus, and most
of them are redundant.
As another example, we consider electroencephalogram (EEG) data obtained at
rest from 10 healthy subjects and described in the first section. In figure (22) we
compare the distributions of A0i and Wi0. This figure shows that EEG data, too, are
characterized by nontrivial causal connections. In figure (23) the distribution
of the bivariate transfer entropies is compared with those obtained after a random
reshuffling of the target time series: it shows that a remarkable amount of bivariate

Fig. 24 Concerning EEG data, the distributions of the second order terms in the expansions,
eqs. (19) and (14), are depicted.

interactions is statistically significant. In figure (24) we report the distributions of the
second order terms, both for information flow and for instantaneous correlations.

4.2 Relationship with Information Storage


Information storage is a fundamental aspect of the dynamics of all the processes on
complex networks. The full comprehension of the relationship between network properties and
information storage remains a challenge; however, some novel insights have been suggested in
a recent paper [25], where a connection between information storage and network motifs has
been pointed out. In this subsection we
show that the information storage at each node of a network is also connected to the
presence of multiplets of variables sending information to that node. Let us consider
the following set of three variables, evolving according to:
xt+1 = c yt + 0.1 ξ(1)t+1 ,
yt+1 = c zt + 0.1 ξ(2)t+1 ,    (23)
zt+1 = c xt + 0.1 ξ(3)t+1 ,

thus constituting a realization of the network motif (a) in figure 1 of [25]. In figure 25
(left) we depict, as a function of the coupling c, both the information storage at the node
corresponding to the variable x and the information flow term {y, z} → x. In this case the
three variables are redundant and a relation between information storage and information flow
can be established. Figure 25, center and right, refers to similar dynamical systems of 3 and
4 variables, corresponding to the motifs (c) and (d), respectively, of figure 1 of [25].
These two cases correspond to synergy: still, the presence of these informational terms is
connected to information storage in the small network. Summarizing, we have shown that the
expansion of the transfer entropy is deeply connected with the expansion of the information
storage developed in [25]; hence the search for redundant and synergetic multiplets of
variables sending information to each given target will also highlight the mechanisms for
information storage at that node.
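A short sketch reproducing the ring motif of eqs. (23) is given below; the storage estimate
shown is only a crude single-term Gaussian proxy based on the dependence of x(t) on x(t − 3)
(the length of the loop), not the full information storage measure of [25], and the multiplet
flow term {y, z} → x could be estimated with the expansion sketch of the previous section:

import numpy as np

def simulate_ring_motif(c, T=20000, seed=0):
    """Three-variable ring motif of eqs. (23): x <- y <- z <- x, each unit
    driven by its own small independent Gaussian noise term."""
    rng = np.random.default_rng(seed)
    x, y, z = np.zeros(T), np.zeros(T), np.zeros(T)
    noise = 0.1 * rng.standard_normal((3, T))
    for t in range(T - 1):
        x[t + 1] = c * y[t] + noise[0, t + 1]
        y[t + 1] = c * z[t] + noise[1, t + 1]
        z[t + 1] = c * x[t] + noise[2, t + 1]
    return x, y, z

for c in (0.2, 0.4, 0.6, 0.8):
    x, _, _ = simulate_ring_motif(c)
    rho3 = np.corrcoef(x[3:], x[:-3])[0, 1]    # the loop closes after 3 steps
    storage = -0.5 * np.log(1.0 - rho3 ** 2)   # crude Gaussian storage proxy
    print(f"c = {c:.1f}   storage proxy at x ~ {storage:.4f} nats")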

Fig. 25 Information storage (squares) and information flow term {y, z} → x (crosses) for
three motifs described in [25], figure 1. Left: motif (a), redundant variables. Center: motif
(c), synergetic variables. Right: motif (d), synergetic variables.

5 Conclusions
The transfer entropy analysis describes the information flow pattern in complex sys-
tems in terms of an N × N matrix, N being the number of subcomponents, each element being
the information flowing from one subsystem to another. The ap-
proaches described in the present chapter represent our attempts to deal with phys-
ical constraints (e.g., the limited capacity of nodes and the limited number of data
samples) within this picture, and to go beyond the N × N description when the actual
senders of information are network motifs rather than single nodes.
Concerning the physical constraints, we have shown that information flow pat-
terns show a signature of the law of diminishing marginal returns and we addressed
the problem of partial conditioning to a limited subset of variables.
As far as the search for multiplets of correlated variables is concerned, we have proposed a
formal expansion of the transfer entropy to highlight irreducible sets of variables which
provide information about the future state of each assigned target. The applications to real
data sets show the effectiveness of the proposed methodology.

References
1. Angelini, L., de Tommaso, M., Marinazzo, D., Nitti, L., Pellicoro, M., Stramaglia, S.:
Redundant variables and Granger causality. Physical Review E 81(3), 037201 (2010)
2. Barabási, A., Albert, R.: Emergence of scaling in random networks. Science 286, 509–
512 (1999)
3. Barabási, A., Ravasz, E., Vicsek, T.: Deterministic scale-free networks. Physica A: Sta-
tistical Mechanics and its Applications 299, 559–564 (2001)
4. Barabási, A.-L.: Linked: The New Science of Networks. Perseus Books, New York (2002)
5. Barnett, L., Barrett, A., Seth, A.: Granger causality and transfer entropy are equivalent
for gaussian variables. Physical Review Letters 103, 238701 (2009)
6. Barrett, A., Barnett, L., Seth, A.K.: Multivariate Granger causality and generalized vari-
ance. Physical Review E 81(4), 041907 (2010)
7. Bettencourt, L.M.A., Stephens, G.J., Ham, M.I., Gross, G.W.: Functional structure of
cortical neuronal networks grown in vitro. Phys. Rev. E 75(2), 21915–21924 (2007)

8. Bettencourt, L.M.A., Gintautas, V., Ham, M.I.: Identification of functional information
subgraphs in complex networks. Phys. Rev. Lett. 100, 238701–238704 (2008)
9. Blinowska, K., Kuś, R., Kamiński, M.: Granger causality and information flow
in multivariate processes. Physical Review E 70(5), 050902 (2004)
10. Boccaletti, S., Latora, V., Moreno, Y., Chavez, M., Hwang, D.: Complex networks: Struc-
ture and dynamics. Physics Reports 424, 175–308 (2006)
11. Boccaletti, S., Hwang, D., Chavez, M., Amann, A., Kurths, J., Pecora, L.: Synchro-
nization in dynamical networks: Evolution along commutative graphs. Physical Review
E 74(1), 016102 (2006)
12. Borst, A., Theunissen, F.E.: Information theory and neural coding. Nature Neuro-
science 2, 947–957 (1999)
13. Bressler, S.L., Seth, A.K.: Wiener-Granger causality: A well established methodology.
NeuroImage 58(2), 323–329 (2011)
14. Chen, Y., Bressler, S.L., Ding, M.: Frequency decomposition of conditional Granger
causality and application to multivariate neural field potential data. Journal of Neuro-
science Methods 150(2), 228–237 (2006)
15. Deshpande, G., LaConte, S., James, G.A., Peltier, S., Hu, X.: Multivariate Granger
causality analysis of fMRI data. Human Brain Mapping 30(4), 1361–1373 (2009)
16. Dhamala, M., Rangarajan, G., Ding, M.: Estimating Granger causality from Fourier and
wavelet transforms of time series data. Phys. Rev. Lett. 100, 18701–18704 (2008)
17. Erdős, P., Rényi, A.: On the evolution of random graphs. Publications of the Mathemati-
cal Institute of the Hungarian Academy of Sciences 5, 17–61 (1960)
18. Faes, L., Nollo, G., Chon, K.H.: Assessment of Granger causality by nonlinear model
identification: Application to short-term cardiovascular variability. Annals of Biomedical
Engineering 36(3), 381–395 (2008)
19. Geweke, J.F.: Measures of conditional linear dependence and feedback between time
series. Journal of the American Statistical Association 79(388), 907–915 (1984)
20. Ghahramani, Z.: Learning dynamic bayesian networks. In: Giles, C.L., Gori, M. (eds.)
IIASS-EMFCSC-School 1997. LNCS (LNAI), vol. 1387, pp. 168–197. Springer, Hei-
delberg (1998)
21. Granger, C.W.J.: Investigating causal relations by econometric models and cross-spectral
methods. Econometrica 37(3), 424–438 (1969)
22. Hagmann, P., Cammoun, L., Gigandet, X., Meuli, R., Honey, C., Wedeen, V.J., Sporns,
O.: Mapping the Structural Core of Human Cerebral Cortex. PLoS Biology 6(7), e159
(2008)
23. Hlaváčková-Schindler, K., Paluš, M., Vejmelka, M., Bhattacharya, J.: Causality detec-
tion based on information-theoretic approaches in time series analysis. Physics Re-
ports 441(1), 1–46 (2007)
24. Kamiński, M., Ding, M., Truccolo, W.A., Bressler, S.L.: Evaluating causal relations in
neural systems: Granger causality, directed transfer function and statistical assessment
of significance. Biological Cybernetics 85(2), 145–157 (2001)
25. Lizier, J.T., Atay, F.M., Jost, J.: Information storage, loop motifs and clustered structure
in complex networks. Physical Review E 86, 026110 (2012)
26. López, L., Sanjuán, M.: Relation between structure and size in social networks. Physical
Review E 65, 036107 (2002)
27. Marinazzo, D., Pellicoro, M., Stramaglia, S.: Kernel method for nonlinear Granger
causality. Physical Review Letters 100, 144103 (2008)
28. Marinazzo, D., Pellicoro, M., Stramaglia, S.: Kernel Granger causality and the analysis
of dynamical networks. Physical Review E 77, 052615 (2008)

29. Marinazzo, D., Liao, W., Pellicoro, M., Stramaglia, S.: Grouping time series by pairwise
measures of redundancy. Physics Letters A 374(39), 4040–4044 (2010)
30. Milo, R., Shen-Orr, S., Itzkovitz, S., Kashtan, N., Chklovskii, D., Alon, U.: Network
Motifs: Simple Building Blocks of Complex Networks. Science 298, 824–827 (2002)
31. Nolte, G., Ziehe, A., Nikulin, V., Schlögl, A., Krämer, N., Brismar, T., Müller, K.: Ro-
bustly estimating the flow direction of information in complex physical systems. Physical
Review Letters 100, 234101 (2008)
32. Papoulis, A.: Probability, Random Variables, and Stochastic Processes. McGraw-Hill,
New York (1985)
33. Porta, A., Catai, A.M., Takahashi, A.C.M., Magagnin, V., Bassani, T., Tobaldini, E.,
Montano, N.: Information Transfer through the Spontaneous Baroreflex in Healthy Hu-
mans. Methods of Information in Medicine 49, 506–510 (2010)
34. Roebroeck, A., Formisano, E., Goebel, R.: Mapping directed influence over the brain
using Granger causality and fMRI. NeuroImage 25(1), 230–242 (2005)
35. Salvador, R., Suckling, J., Coleman, M.R., Pickard, J.D., Menon, D., Bullmore, E.:
Neurophysiological Architecture of Functional Magnetic Resonance Images of Human
Brain. Cerebral cortex 15(9), 1332–1342 (2005)
36. Samuelson, P., Nordhaus, W.: Microeconomics. McGraw-Hill, Oklahoma City (2001)
37. Schneidman, E., Bialek, W., Berry II, M.J.: Synergy, redundancy, and independence in
population codes. J. Neuroscience 23, 11539–11553 (2003)
38. Schreiber, T.: Measuring information transfer. Physical Review Letters 85(2), 461 (2000)
39. Smirnov, D.A., Bezruchko, B.P.: Estimation of interaction strength and direction from
short and noisy time series. Phys. Rev. E 68, 046209–046218 (2003)
40. Tzourio-Mazoyer, N., Landeau, B., Papathanassiou, D., Crivello, F., Etard, O., Delcroix,
N., Mazoyer, B., Joliot, M.: Automated anatomical labeling of activations in SPM using
a macroscopic anatomical parcellation of the MNI MRI single-subject brain. NeuroIm-
age 15(1), 273–289 (2002)
41. Wiener, N.: The theory of prediction, vol. 1. McGraw-Hill, New York (1956)
42. Yeger-Lotem, E., Sattath, S., Kashtan, N., Itzkovitz, S., Milo, R., Pinter, R.J., Alon, U.,
Margalit, H.: Network motifs in integrated cellular networks of transcription regulation
and protein-protein interaction. Proc. Natl. Acad. Sci. U.S.A. 101, 5934–5939 (2004)
43. Yu, D., Righero, M., Kocarev, L.: Estimating topology of networks. Physical Review
Letters 97(18), 188701 (2006)
44. Zachary, W.: An information flow model for conflict and fission in small groups. J. An-
thropol. Res. 33(2), 452–473 (1977)
45. Zhou, Z., Chen, Y., Ding, M., Wright, P., Lu, Z., Liu, Y.: Analyzing brain networks
with PCA and conditional Granger causality. Human Brain Mapping 30(7), 2197–2206
(2009)
46. http://clopinet.com/causality/data/nolte/ (accessed July 6, 2012)
Function Follows Dynamics: State-Dependency
of Directed Functional Influences

Demian Battaglia

Abstract. Brain function requires the control of inter-circuit interactions on time-


scales faster than synaptic changes. In particular, strength and direction of causal in-
fluences between neural populations (described by the so-called directed functional
connectivity) must be reconfigurable even when the underlying structural connectiv-
ity is fixed. Such influences can be quantified through causal analysis of time-series
of neural activity with tools like Transfer Entropy. But how can manifold func-
tional networks stem from fixed structures? Considering model systems at different
scales, like neuronal cultures or cortical multi-areal motifs, we show that “function
and information follow dynamics”, rather than structure. Different dynamic states of the same
structural network, characterized by different synchronization properties, are indeed
associated with different directed functional networks, corresponding to alternative
information flow patterns. Here we discuss how suitable generalizations of
Transfer Entropy, taking into account switching between collective states of the ana-
lyzed circuits, can provide a picture of directed functional interactions in agreement
with a “ground-truth” description at the dynamical systems level.

1 Introduction
Even before unveiling how neuronal activity represents information, it is crucial to
understand how this information, independently from the used encoding, is routed
across the complex multi-scale circuits of the brain. Flexible exchange of informa-
tion lies at the core of brain function. A daunting amount of computations must be
performed in a way dependent on external context and internal brain states. But how
Demian Battaglia
Aix-Marseille University, Institute for Systems Neuroscience, INSERM UMR 1106,
27, Boulevard Jean Moulin, F-13005 Marseille and
Max Planck Institute for Dynamics and Selforganization and
Bernstein Center for Computational Neuroscience, Am Faßberg 17, D-37077 Göttingen
e-mail: demian.battaglia@univ-amu.fr


can information be rerouted “on demand”, given that anatomic inter-areal connec-
tions can be considered as fixed, on timescales relevant for behavior?
In systems neuroscience, a distinction is made between structural and directed
functional connectivities [32, 33]. Structural connectivity describes actual synaptic
connections. On the other hand, directed functional connectivity is estimated from
time-series of simultaneous neural recordings using causal analysis [20, 36, 41],
to quantify, beyond correlation, directed influences between brain areas. If the
anatomic structure of brain circuits unavoidably constrains to some extent the functional
interactions that these circuits can support (see e.g. [42]), it is however not sufficient to
specify them fully. Indeed, a given structural network might give rise to
multiple possible collective dynamical states, and such different states could lead to
different information flow patterns. It has been suggested, for instance, that multi-
stability of neural circuits underlies switching between different perceptions or be-
haviors [21, 40, 48]. In this view, transitions between alternative attractors of the
neural dynamics would occur under the combined influence of structured “brain
noise” [47] and of the bias exerted by sensory or cognitive driving [16, 17, 18].

Due to a possibly non-trivial attractor dynamics, the interrelation between structural and
functional connectivity becomes inherently complex. Therefore, dependencies on the analyzed
dynamical regime have to be taken into account explicitly when designing metrics of directed
interactions.

Dynamic multi-stability can give rise, in particular, to transitions between dif-


ferent oscillatory states of brain dynamics [28]. This is highly relevant in this con-
text, because long-range oscillatory coherence [59, 64] —in particular in the beta
or gamma band of frequency [6, 8, 22, 24, 29, 30, 51, 64]— is believed to play a
central role in inter-areal communication. According to the “communication-
through-coherence” hypothesis [29], information exchange between two neuronal
populations is enhanced when the oscillations of their coherent activity are phase-locked
with a suitable phase relation. Therefore the efficiency and the directionality of information
transmission between neuronal populations are affected by changes in
their synchronization pattern, as also advocated by modeling studies [4, 12]. From a
general perspective, the correct timing of exchanged signals is arguably crucial for
a correct relay of information and a natural device to achieve such temporal coordi-
nation might be self-organized dynamic synchronization of neural activity. Beyond
tightly [65] or sparsely-synchronized [9, 10, 11] periodic-like oscillations, synchro-
nization in networks of spiking neurons can arise in other forms, including low-
dimensional chaotic rhythms [2, 3] or avalanche-like bursting [1, 5, 45, 46], which
are both temporally irregular, and yet able to support modulation of information flow.
This chapter will concentrate on the directed functional connectivity analysis
of simulated neural dynamics, rather than of actual experiments. It will focus in
particular on two representative systems at different spatial scales, both described
as large networks of hundreds or thousands of model spiking neurons. The analysis will first
delve into cultures of dissociated neurons, which, after a certain critical

maturation age, are known to spontaneously develop an episodic synchronous bursting
activity [14, 25, 62]. Then, mesoscopic circuits of a few interconnected oscillat-
ing brain areas will be considered, stressing how even simple structural motifs can
give rise to a rich repertoire of dynamic configurations. Emphasis on simulated sys-
tems will allow disentangling the role played by collective dynamics in mediating the
link between structural connectivity and emergent directed functional interactions.
In analogous experimental systems, the ground-truth connectivity or the actual on-
going dynamics would not be known with precision. On the contrary, on in silico
neural circuits, structural topology can be freely chosen and its impact on network
dynamics thoroughly explored, showing directly that a correspondence exists be-
tween the supported dynamical regimes and the inferred functional connectivities.

Two phenomena will be highlighted: on one side, functional multiplicity, aris-


ing when multiple functional topologies stem from a system with a given
structural topology (supporting multiple possible dynamics); on the other side,
structural degeneracy, arising when systems with different structural topolo-
gies (but similar dynamics) give rise to equivalent functional topologies.

2 State-Conditioned Transfer Entropy


In this contribution, directed functional connectivity —used with the meaning of
causal connectivity or exploratory data-driven effective connectivity, as commented
in [7]— is characterized in terms of a generalized version of Transfer Entropy (TE)
[52], an information-theoretic implementation of the well-known notion of Wiener-
Granger causality [37, 66]. Transfer Entropy is extensively discussed in other chap-
ters of this book. Here we will introduce a specific generalization which is used
for the analyses presented in the next sections. A bivariate definition will be given,
although a multi-variate extension is straightforward.
Let us consider a pair of continuous time-series describing the dynamics of two
different neural circuit elements x and y, like e.g. LFPs from different brain areas,
or calcium imaging recordings of single neuron activity in a neuronal culture. These
time-series are quantized into B discrete amplitude levels ℓ1 , . . . , ℓB (equal-sized for
simplicity) and are thus converted into (discretely-sampled) sequences X(t) and Y (t) of
symbols from a small alphabet.
Usually, two transition probability matrices are sampled as normalized his-
tograms over very long symbolic sequences:

[PY |XY (τ )]ijk = P[Y (t) = ℓi | Y (t − τ ) = ℓj , X(t − τ ) = ℓk ]

[PY |Y (τ )]ij = P[Y (t) = ℓi | Y (t − τ ) = ℓj ]

where the lag τ is an arbitrary temporal scale on which causal interactions are
probed. The causal influence TEx→y (τ ) of circuit element x on circuit element y
is then operatively defined as the functional:

TEx→y (τ ) = ∑ PY |XY (τ ) log2 [ PY |XY (τ ) / PY |Y (τ ) ]    (1)

where the sum runs over all the three indices i, j and k of the transition matrices.
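For concreteness, a plain sketch of eq. (1) (quantization into B equal-sized levels followed
by histogram sampling; the function name and the toy series are our own illustration) could
read:

import numpy as np

def transfer_entropy(x, y, bins=4, lag=1):
    """Bivariate TE_{x->y}(tau) of eq. (1): quantize both series into `bins`
    amplitude levels, sample P[Y(t), Y(t-tau), X(t-tau)] as a normalized
    histogram and sum the log2 ratio (result in bits)."""
    def quantize(s):
        edges = np.linspace(s.min(), s.max(), bins + 1)[1:-1]
        return np.digitize(s, edges)                   # symbols 0 .. bins-1
    X, Y = quantize(np.asarray(x)), quantize(np.asarray(y))
    sample = np.column_stack([Y[lag:], Y[:-lag], X[:-lag]])
    p, _ = np.histogramdd(sample, bins=(bins,) * 3,
                          range=[(-0.5, bins - 0.5)] * 3)
    p /= p.sum()
    p_jk = p.sum(axis=0, keepdims=True)                # P[Y(t-tau), X(t-tau)]
    p_ij = p.sum(axis=2, keepdims=True)                # P[Y(t), Y(t-tau)]
    p_j = p.sum(axis=(0, 2), keepdims=True)            # P[Y(t-tau)]
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(p > 0, p * np.log2(p * p_j / (p_ij * p_jk)), 0.0)
    return terms.sum()

# Toy check: x drives y with a one-step delay, so TE x->y should exceed TE y->x
rng = np.random.default_rng(4)
x = rng.standard_normal(20000)
y = np.empty_like(x)
y[0] = 0.0
y[1:] = 0.8 * x[:-1] + 0.6 * rng.standard_normal(19999)
print(transfer_entropy(x, y), transfer_entropy(y, x))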
Higher Markov order descriptions of the time-series evolution can also be adopted for the
modeling of the source and target time-series [52]. In general, the conditioning on the
single past values X(t − τ ) and Y (t − τ ) appearing in the definition of the matrices
PY |XY (τ ) and PY |Y (τ ) is replaced by conditioning on vectors of several past values
Yrp = [Y (t − rτ ), Y (t − (r + 1)τ ), . . . , Y (t − (p − 1)τ ), Y (t − pτ )] and
Xsq = [X(t − sτ ), X(t − (s + 1)τ ), . . . , X(t − (q − 1)τ ), X(t − qτ )]. Here p and q correspond
to the Markov orders taken for the target and source time-series Y (t) and X(t) respectively.
The parameters r, s < p, q are standardly set to r, s = 1, but might assume different values
for specific applications (see later). A general Markov order transfer entropy
TEx→y (τ ; r, s, p, q) can then be written straightforwardly.
More importantly, to characterize the dependency of directed functional interac-
tions on dynamical states, a further state conditioning is introduced. Let S(t) be a
vector describing the history of the entire system —i.e. not only the two considered
circuit elements x and y but the whole neural circuit to which they belong— over the
time-interval [t − T,t]. We define then a “state selection filter”, i.e. a set of time in-
stants C for which the system history S(t) satisfies some arbitrary set of constraints.
The definition of C is left on purpose very general and will have to be instantiated
depending on the specific concrete application. It is then possible to introduce an
(arbitrary Markov orders) state-conditioned Transfer Entropy:

TECx→y (τ ; r, s, p, q) = ∑ PY |XY ;C (τ ; r, s, p, q) log2 [ PY |XY ;C (τ ; r, s, p, q) / PY |Y ;C (τ ; r, s) ]    (2)

where the sum runs over all the possible values of Y , Yrp and Xsq , and the transition
probability matrices PY |XY ;C (τ ; r, s, p, q) = P[Y (t) | Yrp (t), Xsq (t); t ∈ C ] and
PY |Y ;C (τ ; r, s) = P[Y (t) | Yrp (t); t ∈ C ] are sampled only over time epochs in which the
ongoing collective dynamics is compliant with the imposed constraints.
Although such a general definition may appear hermetic, it becomes fairly natural
when specific constraints are taken. Simple constraints might be for instance based
on the dynamic range of the instantaneously sampled activity. A possible state se-
lection filter might therefore be: “The activity of every node of the network must be
below a given threshold value”. As a consequence, the overall sampled time-series
would be inspected, and time-epochs in which some network node has an activity
with an amplitude above the threshold level would be discarded and not sampled
for the evaluation of PY |XY ;C and PY |Y ;C . Other simple constraints might be defined
based on the spectral properties of the considered time-series. For instance, the state
selection filter could be: “The power in the theta range of frequencies of the average network
activity must have been above a given threshold during at least the last 500 milliseconds”.
In this way, only sufficiently long transients in which the
system displayed collectively a substantial theta oscillatory activity would be sam-
pled for the evaluation of PY |XY ;C and PY |Y ;C . Even more specifically, additional

constraints might be imposed by requiring specific phase-relations between two network nodes
to be fulfilled. Once again, the result of imposing a constraint would

network nodes to be fulfilled. Once again, the result of imposing a constraint would
be to restrict the set of time-instants C over which the transition matrices PY |XY ;C
and PY |Y ;C are sampled for the evaluation of TECx→y .
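A minimal sketch of this state conditioning (Markov orders p = q = 1, numpy, and an
amplitude-threshold filter on a surrogate population signal; all names and data are our own
illustration) simply masks the time instants before the transition histograms are sampled:

import numpy as np

def state_conditioned_te(x, y, valid, bins=4, lag=1):
    """State-conditioned TE of eq. (2) with p = q = 1: identical to the plain
    TE of eq. (1), except that the histograms are sampled only at times t for
    which valid[t] is True (t belongs to the state-selection set C)."""
    def quantize(s):
        edges = np.linspace(s.min(), s.max(), bins + 1)[1:-1]
        return np.digitize(s, edges)
    X, Y = quantize(np.asarray(x)), quantize(np.asarray(y))
    sel = np.asarray(valid, dtype=bool)[lag:]          # filter on target times t
    sample = np.column_stack([Y[lag:][sel], Y[:-lag][sel], X[:-lag][sel]])
    p, _ = np.histogramdd(sample, bins=(bins,) * 3,
                          range=[(-0.5, bins - 0.5)] * 3)
    p /= p.sum()
    p_jk, p_ij = p.sum(axis=0, keepdims=True), p.sum(axis=2, keepdims=True)
    p_j = p.sum(axis=(0, 2), keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(p > 0, p * np.log2(p * p_j / (p_ij * p_jk)), 0.0)
    return terms.sum()

# Example filter: keep only instants at which a population signal stays below a
# threshold, one of the simple amplitude constraints mentioned above
rng = np.random.default_rng(5)
pop = rng.standard_normal(20000)
x = rng.standard_normal(20000)
y = np.concatenate(([0.0], 0.8 * x[:-1] + 0.6 * rng.standard_normal(19999)))
print(state_conditioned_te(x, y, valid=np.abs(pop) < 1.0))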

Therefore, state-conditioned Transfer Entropy provides a measure of the directed functional
interactions associated with some definite dynamical regime, specified through an ad hoc set
of state-selection filtering constraints.

3 Directed Functional Interactions in Bursting Cultures


Neuronal cultures provide simple, yet versatile model systems [23] exhibiting a rich
repertoire of spontaneous activity [14, 62]. These aspects make cultures of disso-
ciated neurons particularly appealing for studying the interplay between activity
and connectivity. The activity of hundreds to thousands of cells in in vitro cultured
neuronal networks can be simultaneously monitored using calcium fluorescence
imaging techniques [39, 54] (cf. Figure 1A). Calcium imaging can be applied both
in vitro and in vivo and can potentially be combined with interventional techniques
like optogenetic stimulation [69]. A major drawback of this technique, however, is
that the typical frame rate during acquisition is slower than the cell’s firing dynam-
ics by an order of magnitude. Furthermore the poor signal-to-noise ratio is such to
make hard the detection of elementary firing events.
The experimental possibility of following in parallel the activity of most nodes
of a large network provides ideal datasets for the extraction of directed functional
connectivity. In particular, model-free information theory-based metrics [34, 43, 56]
can be applied, since recordings can be stable over several hours [54]. A proper understanding of the state-dependency of directed functional connectivity then allows the analysis to be restricted to regimes in which directed functional connectivity and structural connectivity are expected to have a good match, thus opening the way to the algorithmic reconstruction of the connectivity of an entire neuronal network in vitro.
Such understanding can be built by the systematic analysis of semi-realistic syn-
thetic data from simulated neuronal cultures, in which the ground-truth structural
connectivity is known and can be arbitrarily tuned to observe its impact on the re-
sulting dynamics and functional interactions.

3.1 Neuronal Cultures “in silico”


A neuronal culture is modeled as a random network of N leaky integrate-and-fire
neurons. Synapses provide post-synaptic currents with a difference-of-exponentials
time-course [15]. For simplicity, all synapses are excitatory, to mimic common
experimental conditions in which inhibitory synaptic transmission is pharmaco-
logically blocked [54]. Neurons in culture show a rich spontaneous activity that

[Figure 1, panels A–D: bright-field and fluorescence micrographs (scale bars 100 μm); single-neuron fluorescence (a.u.) vs. time (s) for experiment and simulation; corresponding population-averaged fluorescence traces; histograms of the number of occurrences of fluorescence amplitudes. See caption below.]

Fig. 1 Bursting neuronal cultures in vitro and in silico. A Bright field image (left panel) of a
region of a neuronal culture at day in vitro 12, together with its corresponding fluorescence
image (right panel), integrated over 200 frames. Round objects are cell bodies of neurons.
B Examples of real (left) and simulated (right) calcium fluorescence time series for different
individual neurons. C Corresponding averages over the whole population of neurons. Syn-
chronous network bursts are clearly visible from these average traces. D Distribution of pop-
ulation averaged fluorescence amplitudes, for a real network (left) and a simulated one (right).
These distributions are strongly right skewed, with a right tail corresponding to the strong av-
erage fluorescence during bursting events. Figure adapted from [56]. (Copyright: Stetter et
al. 2012, Creative Commons licence).

originates from both fluctuations in the membrane potential and small noise cur-
rents in the pre-synaptic terminals [14]. To reproduce spontaneous firing, each neu-
ron is driven by statistically independent Poisson spike sources with a small rate, in
addition to recurrent synaptic inputs.
A key feature required for the reproduction of network bursting is the introduc-
tion of synaptic short-term depression, described through classic Tsodyks-Markram
equations [58], which take into account the limited availability of neurotransmit-
ter resources for synaptic release and the finite time needed to recharge a de-
pleted synaptic terminal. Dynamics comparable with experiments [23] are obtained by setting the synaptic weights of internal connections so as to give a network bursting rate of 0.10 ± 0.01 Hz. To achieve this target rate, an automated conductance adjustment
procedure is used [56] for every considered topology.
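To illustrate the depression mechanism just invoked, the sketch below integrates a simplified, depression-only version of the synaptic resource variable in the spirit of the Tsodyks-Markram description [58]: a fraction u of the available resources is released at every pre-synaptic spike and recovers with time constant tau_rec. Parameter values and the function name are illustrative assumptions, not those used in [56].

import numpy as np

def depressing_synapse_resources(spike_times, tau_rec=0.5, u=0.3,
                                 t_max=10.0, dt=1e-3):
    # Fraction x(t) of available synaptic resources (depression only,
    # no facilitation); spike_times given in seconds.
    times = np.arange(0.0, t_max, dt)
    spike_bins = set(int(round(t / dt)) for t in spike_times)
    x = np.ones_like(times)
    for i in range(1, len(times)):
        x[i] = x[i - 1] + dt * (1.0 - x[i - 1]) / tau_rec   # slow recovery
        if i in spike_bins:
            x[i] -= u * x[i]                                # release and depletion
    return times, x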
Concerning the structural topologies used, connectivity is always sparse. The probability of connection is “frozen” so as to yield an average degree of about 100 neighbor neurons, compatible with average degrees reported previously
for neuronal cultures in vitro of the mimicked age (DIV) and density [44, 54]. Net-
works with different degrees of clustering are generated by first randomly drawing
connections and then rewiring them to reach a specified target degree of clustering
(non-locally clustered ensemble). Another possibility to generate clustered networks is to adopt a connection probability law depending on spatial distance. Variations of the length-scale of connectivity then translate into more or less clustered networks
(locally clustered ensemble).
Finally, surrogate calcium fluorescence signals are generated based on the spik-
ing dynamics of the simulated cultured network. A common fluorescence model
introduced in [60] gives rise to an initial fast increase of fluorescence after acti-
vation, followed by a decay with a slow time-constant τCa = 1 s. Such a model
describes the intra-cellular concentration of calcium that is bound to the fluorescent
probe. The concentration changes rapidly for each action potential locally elicited
in a time bin corresponding to the acquisition frame. The net fluorescence level Fi associated to the activity of a neuron i is finally obtained by feeding the calcium concentration into a saturating static non-linearity and by adding Gaussian-distributed noise. Example surrogate calcium fluorescence time-series, together
with actual recordings for comparison, can be seen in Figure 1B.
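A minimal sketch of such a fluorescence model, assuming illustrative parameter values (the amplitude, saturation constant and noise level below are not those of [56, 60]), could read:

import numpy as np

def surrogate_fluorescence(spike_counts, dt=0.02, tau_ca=1.0,
                           amp=50.0, k_d=300.0, noise_sd=0.03, seed=0):
    # spike_counts: spikes per acquisition frame for one neuron;
    # dt: frame duration (s); tau_ca: calcium decay time constant (s).
    rng = np.random.default_rng(seed)
    ca = np.zeros(len(spike_counts))
    level = 0.0
    for t, n_spikes in enumerate(spike_counts):
        level += amp * n_spikes                  # fast rise at each spike
        level -= level * dt / tau_ca             # slow exponential decay
        ca[t] = level
    fluo = ca / (ca + k_d)                       # saturating static non-linearity
    return fluo + rng.normal(0.0, noise_sd, len(fluo))   # Gaussian noise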
All the details and the parameters of the used neuronal and network models and
calcium surrogate signals —including the modeling of systematic artifacts like light
scattering for an increased realism— can be found in the original publication by
Olav Stetter et al. [56]. With the selected parameters, the simulated neuronal cultures display temporally irregular network bursting, as highlighted by Figure 1C, reporting fluorescence averaged over the entire network, and Figure 1D, showing the right-skewed distribution of average fluorescence, with its right tail associated to the high fluorescence during network bursts.

3.2 Extraction of Directed Functional Networks


A generalized TE score is calculated for every possible directed pair of nodes in the
analyzed simulated culture. The adjacency matrix of a directed functional network
is then obtained by applying a threshold to the TE values at an arbitrary level. Only links whose TE value rises above this threshold are retained in the reconstructed digraph. Selecting a threshold for the inclusion of links corresponds to setting the average degree of the reconstructed network. An expectation about the average degree in the culture thus translates directly into a specific number of links to include.
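As a minimal sketch of this step (function and argument names are illustrative), the threshold can simply be taken as the TE score of the k-th strongest link, where k is fixed by the expected average degree:

import numpy as np

def functional_digraph(te_matrix, target_avg_degree):
    # te_matrix: (N, N) array of pairwise TE scores (diagonal ignored);
    # target_avg_degree: expected average out-degree of the culture.
    n = te_matrix.shape[0]
    scores = te_matrix.astype(float).copy()
    np.fill_diagonal(scores, -np.inf)
    n_links = max(1, int(round(target_avg_degree * n)))
    ranked = np.sort(scores[np.isfinite(scores)])[::-1]
    threshold = ranked[min(n_links, ranked.size) - 1]
    return (scores >= threshold).astype(int)     # adjacency matrix of the digraph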
The estimation problem for TE scores themselves is, in this context, less severe
than usual. Indeed, time-series generated by models are less noisy than real experimental recordings. Furthermore, they can be generated to be as long as required for proper estimation. Yet, the length of simulated calcium fluorescence time-series is restricted in [56] to a duration achievable in actual experiments. It is important to mention that, for network reconstruction, it is not required to correctly estimate the values of individual TE scores. Indeed, only their relative ranking matters. Since firing and connectivity are homogeneous across the simulated network, biases are not expected to vary strongly for different edges. Moreover, the problem of assessing statistical significance is also irrelevant, since the threshold used for deciding link inclusion is based on an extrinsic criterion (i.e. achieving a specific target average degree compatible with experimental knowledge), not dependent on TE estimation itself. Thus, even rough plug-in estimates of generalized TE can be adopted.¹

3.3 Zero-Lag Causal Interactions for Slow-Rate Calcium Imaging
Original formulations of Transfer Entropy were meant to detect the causal influence of past events on events at a later time. However, since the acquisition frame of calcium imaging techniques is an order of magnitude longer than the actual synaptic and integration delays of neurons in the culture, it is conceivable that many “cause”-and-“effect” spike pairs may occur within the same acquisition frame. A practical trick to avoid completely ignoring such causally-relevant correlation events is to include “same-bin” interactions in the evaluation of (state-conditioned) Transfer Entropy [56]. In practice, referring to the parameter labeling in Equation 2, this amounts to setting r = 1 but s = 0, i.e. to conditioning the probability of transitions from past to present values of the time-series Y(t) on present values of the (putative cause) time-series X(t). When not otherwise specified, Transfer Entropy analyses of calcium fluorescence time-series from neuronal cultures will be performed taking (r = 1, s = 0, p = 2, q = 1).
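In terms of the illustrative estimator sketched earlier, this parameter choice would read, for instance:

# Zero-lag correction for slow calcium imaging: condition on the present
# value of the putative cause X (s = 0) and on two past bins of Y
# (p = 2, r = 1); variable names are illustrative.
te_xy = state_conditioned_te(x, y, mask, p=2, q=1, r=1, s=0)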
Note that a similar approach is adopted in this volume’s chapter by Luca Faes, to
cope with volume conduction in a Granger Causality analysis of EEG signals.
1 We have verified, in particular, that bootstrap corrections would not alter the obtained
results.

3.4 State-Selection Constraints for Neuronal Cultures


Neuronal cultures in vitro and in silico display stochastic-like switching between
relatively quiet inter-burst periods, characterized by low-rate and essentially asynchronous firing of a few neurons at a time, and bursting events, characterized by

[Figure 2, panels A–C: histogram (frequency of observation) of the network-averaged fluorescence (a.u.), with amplitude ranges I, II and III marked; graphs of the reconstructed functional networks for the three ranges; ROC curves (fraction of true positives vs. fraction of false positives) for the three ranges. See caption below.]

Fig. 2 Functional multiplicity in simulated cultures. A Three ranges of amplitude are high-
lighted in the distribution of network-averaged fluorescence G(t). Directed functional inter-
actions associated to different dynamical regimes are assessed by conditioning the analysis
to these specific amplitude ranges. Range I corresponds to low-amplitude noise. Range II
to fluorescence level typical of sparse inter-burst activity. Range III to high average fluo-
rescence during network bursts. B Visual representation of the reconstructed functional net-
works topology in the three considered dynamical regimes (top 10% of TE score links only
are shown). Qualitative topological differences in the three extracted networks are evident.
C ROC analysis of the correspondence between inferred functional networks and the ground-
truth structural network. Overlap is random for noise-dominated range I, is marked for inter-
burst regime II and is only partial for bursting regime III.

exponentially fast rise of the number of recruited synchronously firing neurons. In general, there is no reason to expect that these two regimes are associated to identical directed functional connectivity networks. As a matter of fact, firing of a
neuron during an inter-burst period is facilitated by firing of pre-synaptic neurons.
As a consequence, it is reasonable to expect that directed functional connectivity as-
sociated to inter-burst epochs has a large overlap with the underlying structural con-
nectivity of the culture. On the contrary, during a bursting event and its advanced
buildup phase, the network is over-excitable and the firing of a single neuron can
cause within a very short time the firing of many other neurons not necessarily
connected to it. For this reason, intuition suggests that the directed functional con-
nectivity during bursting events is dominated by collective behavior, rather than by
synaptic coupling.
To confirm these expectations, it is necessary to extract directed functional inter-
actions from calcium fluorescence time-series separately for each dynamical regime.
This can be achieved by defining an appropriate set of filtering constraints for the
evaluation of state-conditioned Transfer Entropy. A fast way to implement these
constraints is to track variations of the average fluorescence G(t) = ∑_{i=1}^{N} F_i(t) of the entire network. Fully developed network bursts will be associated to anoma-
lously high average network fluorescence G(t) (fluorescence range denoted as III
in Figure 2A). Conversely, inter-bursts epochs will be associated to weaker network
fluorescence (fluorescence range denoted as II in Figure 2A). Too low network flu-
orescence would be indistinguishable from mere baseline noise (fluorescence range
denoted as I in Figure 2A).
A straightforward way to define a “state” based on average fluorescence might
thus be to restrict sampling to acquisition frames t in which the network-averaged
fluorescence G(t) falls within a prescribed range:

C = {t|Gbottom < G(t) ≤ Gtop } (3)

Different ranges of fluorescence will identify different dynamical regimes, to which the evaluation of state-conditioned Transfer Entropy will be particularized.
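A minimal sketch of such a state-selection filter, operating on an array of per-neuron fluorescence traces, might be the following (function and argument names are illustrative):

import numpy as np

def fluorescence_state_mask(fluo, g_bottom=-np.inf, g_top=np.inf):
    # fluo: (N_neurons, N_frames) array of fluorescence traces F_i(t).
    # Returns True at the frames t whose network-averaged fluorescence G(t)
    # falls inside the prescribed range, as in Equation 3.  (Equation 3
    # writes G(t) as a sum over neurons; using the mean only rescales
    # the two thresholds.)
    g = fluo.mean(axis=0)
    return (g > g_bottom) & (g <= g_top)

# Example: fluorescence_state_mask(fluo, g_top=0.112) would restrict the
# analysis to inter-burst and burst-buildup epochs; 0.112 is the optimized
# upper threshold reported for simulated data in [56] (cf. Figure 3A).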

3.5 Functional Multiplicity in Simulated Cultures


The state dependency of directed functional connectivity is illustrated by generating a random network (e.g., from the locally clustered ensemble, for the sake of better visualization) and by simulating its dynamics. The resulting distribution of network-
averaged fluorescence and the three dynamical ranges we focus on in detail are
highlighted in Figure 2A.
For simulated data, the inferred connectivity can be directly compared to the
ground truth. A standard Receiver Operating Characteristic (ROC) analysis is used
to quantify the quality of reconstruction. ROC curves are generated by gradually
moving a threshold level from the lowest to the highest TE value, and by plotting
at each point the fraction of included true positive links against the corresponding
fraction of included false positive links. The functional networks extracted in the

three dynamical ranges I, II and III and their relation with structural connectivity are shown, respectively, in Figures 2B and 2C. For a fair comparison, an equal number
of samples is used to estimate TE in the three fluorescence ranges.
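A compact sketch of this ROC construction (illustrative, assuming a known ground-truth adjacency matrix) is the following:

import numpy as np

def roc_curve_from_te(te_matrix, true_adjacency):
    # Fractions of true and false positives as the inclusion threshold is
    # lowered from the highest to the lowest TE score (off-diagonal links).
    n = te_matrix.shape[0]
    off_diag = ~np.eye(n, dtype=bool)
    scores = te_matrix[off_diag]
    truth = true_adjacency[off_diag].astype(bool)
    order = np.argsort(scores)[::-1]              # decreasing threshold
    tp = np.cumsum(truth[order])
    fp = np.cumsum(~truth[order])
    return fp / max(fp[-1], 1), tp / max(tp[-1], 1)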
The lowest range I corresponds to a regime in which spiking-related signals are
buried in noise. Correspondingly, the associated functional connectivity is indistin-
guishable from random, as indicated by a ROC curve close to the diagonal. Note,
however, that a more extensive sampling (i.e. using all the available observation
samples) would show that limited information about structural topology is still con-
veyed by the activity in this regime [56].
At the other extreme, represented by range III —associated to fully developed synchronous bursts— the functional connectivity also has a poor overlap with the underlying structural network. The extracted functional networks are characterized by the existence of hub nodes with an elevated out- and in-degree. The spatio-temporal organization of bursting can be described in terms of these functional connectivity hubs, since nodes within the neighborhood of a same functional hub experience a stronger mutual synchronization than arbitrary pairs of nodes across the network [56]. In particular, Figure 2B displays three visually-evident communities of “bursting-together” neurons.
The best agreement between functional and excitatory structural connectivity is obtained for the middle range II, corresponding to activity above baseline noise during inter-burst epochs and the early build-up phases of synchronous bursts.
Thus, the retrieved TE-based functional networks confirm the intuitive expecta-
tions outlined in the previous section. The state-dependency of functional connec-
tivity is not limited to synthetic data. Very similar patterns of state-dependency are
observed also in real data from neuronal cultures. In particular, in both simulated
and real cultures, the functional connectivity associated to the buildup of bursts dis-
plays a stronger clustering level than during inter-burst periods [56].
The existence of such different topologies of functional interactions, stemming from different dynamical ranges of a same structural network, constitutes a perfect example of the notion of functional multiplicity, outlined in the introduction. It is certainly possible to define ranges which are “right”, i.e. ranges that lead to good structural network reconstruction, a point of importance for practical applications in connectomics.
However, this statement should not be over-interpreted to claim that the directed
functional connectivity inferred in a regime like the one associated to range III is
“wrong”. On the contrary, this functional connectivity is correctly capturing the
topology of causal influences in such a collective state, in which the firing of a sin-
gle neuron can trigger the firing of a whole community of nodes.

3.6 Structural Connectivity from Directed Functional Connectivity
A more refined analysis of function-to-structure overlap suggests that the best matching is achieved for a range including fluorescence levels just to the right of the Gaussian-like peak in the histogram of Fig. 2A [56]. Characterizing state-dependency thus allows

[Figure 3, panels A–B: ROC curves (true positives fraction vs. false positives fraction) for reconstructions with state-conditioning only and with state-conditioning plus zero-lag TE; functional clustering vs. structural clustering for TE-based and cross-correlation-based reconstructions. See caption below.]

Fig. 3 From functional to structural connectivity in simulated cultures. Good matching between structural and inferred directed functional connectivity is achieved in simulated neuronal cultures (here, from the non-locally clustered ensemble) by optimizing the state-conditioning of TE and by correcting for the slow acquisition rate of calcium imaging. A ROC curves for a network reconstruction with generalized TE, with fluorescence data optimally conditioned at G < Gtop = 0.112. The area surrounded by dashed lines depicts the ROC fluctuation interval, based on the analysis of 6 networks. The black ROC curve refers to reconstruction performed with TE using (r = 1, s = 0, p = 2, q = 1), i.e. introducing zero-lag causal interactions. The gray curve is for (r = s = 1, p = q = 1), i.e. still Markov order 2, but not correcting for the slow acquisition rate. B Clustering of inferred directed functional connectivity as a function of ground-truth structural clustering. In TE-based reconstructions, functional and structural clustering are linearly correlated, in contrast with cross-correlation-based reconstructions, which overestimate clustering. Figure adapted from [56]. (Copyright: Stetter et al. 2012, Creative Commons licence).

defining the best TE-conditioning range for the reconstruction of the structural connectivity of the culture. This range should exclude regimes of highly synchronized activity (like range III) while keeping most of the data points for the analysis. More de-
tails are provided in the original study by Stetter et al. [56], showing that very good
reconstruction performance is achieved on simulated data, by implementing a state-
selection filter with optimized threshold Gtop close to the upper limit of Range II
and no lower threshold Gbottom . ROCs corresponding to this choice can be seen in
Figure 3A, for the non-locally clustered ensemble. Good reconstruction is possible
for a vast spectrum of topologies, as denoted by a good correlation between ground-
truth structural clustering coefficient and reconstructed functional clustering level.
Note that a cross-correlation analysis performed over the same state-conditioned set of simulated observations would systematically overestimate the level of clustering (Figure 3B, cfr. [56]). Similar results would be obtained for the locally clustered ensemble, for which the overall reconstruction performance is poorer but an excellent correlation still exists between the ground-truth and the reconstructed length-scales of connectivity. Finally, we mention that the reconstruction approach just described

[Figure 4, panels A–C: spike raster plots (neurons 1–100, 20 s scale bars) for the three simulated cultures; histograms (frequency of observation) of inter-burst intervals (s); cartoons comparing structural and functional clustering coefficients (Struct. CC ≈ 0.1, 0.3, 0.7 from left to right; Func. CC ≈ 0.7 in all three cases). See caption below.]

Fig. 4 Structural degeneracy in simulated cultures. A Examples of spike raster plots for
three simulated cultures with different structural clustering coefficients (non-local clustering
ensemble, structural clustering coefficient equal, respectively from left to right, to 0.1, 0.3 and
0.7). B As revealed by histograms of inter-burst intervals, the temporally-irregular network bursting dynamics of these strongly different cultures are very similar. Vertical lines indicate the mean of each distribution. C Panels below the IBI distributions illustrate through graphical cartoons the amount of clustering in the actual structural network and in the directed functional network reconstructed from fluorescence range III (bursting regime, cf. Figure 2). To different degrees of structural clustering correspond equivalently elevated levels of functional clustering, due to the common bursting statistics. Figure adapted from [56]. (Copyright:
Stetter et al. 2012, Creative Commons licence).

extends naturally to the reconstruction of networks including inhibitory interactions, although additional steps are required, in this case, to label a link as inhibitory, once its mere existence has been inferred more straightforwardly [49].

3.7 Structural Degeneracy in Simulated Cultures


Different dynamical regimes of a structural network can give rise to multiple func-
tional networks. At the same time, functional networks associated to comparable
dynamical regimes are similar. Therefore, since comparable dynamical regimes can
be generated by very different networks, the same functional connectivity topology can be generated by multiple structural topologies.
Figure 4 illustrates the dynamics of three simulated cultures with different clustering coefficients (with the same total number of links). The synaptic strength is ad-
justed in each network using an automated procedure to obtain comparable bursting

and firing rates (see Stetter et al. 2012 [56] for details on the procedure and on the
models). The simulated spiking dynamics of the three cultures in silico are shown in the raster plots of Figure 4A. These three networks indeed display very similar bursting dynamics, not only in terms of the mean bursting rate, but also in terms of the entire inter-burst interval (IBI) distribution, shown in Figure 4B.
Based on these bursting dynamics, directed functional connectivity is extracted
for the three differently clustered structural networks. TE is state-conditioned for the three networks on the same dynamic range, matching range III in Figure 2, i.e. the fully-developed burst regime is selected. As a result, the functional networks extracted in this range always have an elevated clustering level (close to 0.7), in contrast with the actual structural clusterings, which vary in a broad range between 0.1 and 0.5 (see Figure 4C).
The illustrative simulations of Figure 4 thus confirm that the relation between network dynamics and network structure is not trivially “one-to-one”, manifesting the phenomenon of structural degeneracy outlined in the introduction.

4 Directed Functional Interactions in Motifs of Oscillating Areas
Ongoing local oscillatory activity rhythmically modulates neuronal excitability in the cerebral cortex [61]. The communication-through-coherence hypothesis [29] states that neuronal groups oscillating in a suitable phase coherence relation —such as to align their respective “communication windows”— are likely to interact more efficiently than neuronal groups which are not synchronized. Similar mechanisms are believed
to be involved in selective attention and top-down modulation [6, 24, 31, 38].
To cast light on the role of self-organized collective dynamics in establishing
flexible patterns of communication-through-coherence, it is possible to introduce
simple models of generic motifs of interacting brain areas (Figure 5A), each one
undergoing locally generated coherent oscillations (Figure 5B). Simple mesoscopic
circuits involving a small number of local areas, mutually coupled by long-range excitatory projections (Figure 5C), are considered in particular. As analyzed also with mean-field developments in [2, 4], phase-locking between the oscillations of different local areas develops naturally in such structural motifs. Phase-relations between the oscillations of different areas depend non-trivially on the delays of local and long-range interactions and on the actual strength of local inhibition. When
local inhibition gets sufficiently strong, phase-locking tends to occur in an out-of-
phase fashion, in which phase-leading and phase-lagging areas emerge, despite the
symmetry of their mutual long-range excitatory coupling [2, 4].
Through large-scale simulations of modular spiking networks —representing
structural motifs of interconnected brain areas [55]— directed functional connec-
tivity is extracted with state-conditioned TE analyses of simulated parallel local-field-potential (LFP) recordings. Once again, it is found that “causality follows dynamics”, in the sense that different phase-locked patterns of collective os-
cillations are mapped to different directed functional connectivity motifs [4]. The

in silico approach used also allows investigating how information encoded at the level of the detailed spiking activity of thousands of neurons is routed between the modeled areas. It then becomes possible to study how the specific routing modality depends on the active directed functional connectivity.²
The spiking of individual neurons can be very irregular even when the collective
rate oscillations are regular (cfr. Figure 5B). Therefore, even local rhythms in which the firing rate is modulated in a very stereotyped way might correspond to irregular (highly entropic) sequences of codewords encoding information in a digital-like
fashion (e.g. by the firing —“1”— or missed firing —“0”— of specific spikes at
a given cycle [57]). In such a framework, oscillations would not directly represent
information, but would rather act as a carrier of “data-packets” associated to spike
patterns of synchronously active cell assemblies. By quantifying through a Mutual
Information (MI) analysis the maximum amount of information potentially encoded in the spiking activity of a local area and by evaluating how much of this informa-
tion is actually transferred to distant interconnected areas, it is possible to demon-
strate that different directed functional connectivity configurations lead to different
modalities of information routing. Therefore, the pathways along which information
propagates can be reconfigured within the time of a few reference oscillation cycles,
by switching to a different effective connectivity motif, for instance by means of a
spatially and temporally precise optogenetic stimulation [4, 67].

4.1 Oscillating Local Areas “in silico”


Each local area is represented by a random network of excitatory and inhibitory
Wang-Buzsáki-type conductance-based neurons [63]. The Wang-Buzsáki model is
described by a single compartment endowed with sodium and potassium currents.
Each neuron receives an external noisy driving current due to background Poisson
synaptic bombardment, representing cortical noise. Other inputs are due to recur-
rent interactions with other neurons in the network. Excitatory synapses are of the
AMPA-type and inhibitory synapses of the GABA_A-type, and both are modeled as time-dependent conductances with a difference-of-exponentials time-course [15]. LFP signals Λ(t) = ⟨V(t)⟩ are defined as the average membrane potential over all the cells in each area (N ∼ O(10^4)).
Connectivity is random. Short-range connections within a local area are exci-
tatory and inhibitory. Excitatory neurons also establish long-range connections toward distant areas. For the parameters used, each area develops a sparsely syn-
chronized collective oscillation with a collective frequency in the 40-60 Hz range.
The firing frequency of individual neurons remains low, on average one spike every 5-10 LFP oscillation cycles. The oscillations of different areas have similar frequencies
and self-organize into phase-locked configurations. A complete description of the
2 Note that TE-based analysis of “macroscopic” signals, like LFPs, is not guaranteed a priori
to describe information transmission at the level of “microscopic” spiking activity. Trans-
fer Entropy does not measure (directly) transfer of information, in the usual sense of neural
computation!

Fig. 5 Model oscillating areas. A A local area is modeled as a random network of conductance-based excitatory and inhibitory neurons. A moderate fraction of them is transduced with Channelrhodopsine (ChOP) conductances [69], allowing optogenetic perturbation. B Sparsely-synchronized oscillations develop, in which Poisson-like firing of single neurons and strongly oscillating LFPs coexist. C Two local areas mutually coupled by long-range excitation. [Panels show the excitatory-inhibitory circuit diagram with ChOP-transduced cells; spike rasters of 100 neurons with the corresponding “LFP” (40 ms scale bar); and the two-area motif.]

model can be found in [4]. For simplicity, only fully connected structural motifs
involving a few areas (K = 2, 3) are studied. Note, however, that the approach used might be extended to other structural motifs [55] or, in the future, to large-scale thalamocortical networks [35, 42].

4.2 State-Selection Constraints for Motifs of Oscillating Areas


The dynamical regimes generated by motifs of interconnected areas are phase-
locked oscillatory configurations. Therefore a natural way of defining state-selection
constraints is to restrict the analysis to epochs with consistent phase-relations be-
tween the oscillations of different areas. Phases are extracted from LFP time-series with spectral analysis techniques like the Hilbert transform. Considering then the instantaneous phase-differences ΔΦ_ab(t) = (Φ[Λ_a(t)] − Φ[Λ_b(t)]) mod 2π (between pairs of areas a and b) and the stable values φ_ab around which they fluctuate in a given locking mode, state selection constraints can be written as:

C = {t | ∀(a, b), (φ_ab − δ) < ΔΦ_ab(t) < (φ_ab + δ)}     (4)
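A possible sketch of this phase-based state selection, assuming the LFPs have already been band-pass filtered around the oscillation frequency (the function name, the dictionary of target phase-lags and the use of a circular distance to handle the wrap-around of Equation 4 are illustrative choices), is the following:

import numpy as np
from scipy.signal import hilbert

def phase_locking_mask(lfps, phi_target, delta):
    # lfps: (K, T) array of area LFPs; phi_target: {(a, b): phi_ab} stable
    # phase-differences of the locking mode; delta: tolerance in radians.
    phases = np.angle(hilbert(lfps, axis=1))          # instantaneous phases
    mask = np.ones(lfps.shape[1], dtype=bool)
    for (a, b), phi in phi_target.items():
        dphi = phases[a] - phases[b]
        # Circular distance between the running and the target phase-lag.
        dist = np.abs(np.angle(np.exp(1j * (dphi - phi))))
        mask &= dist < delta
    return mask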

In the more realistic case in which coherent oscillations and phase-locking arise only
transiently [59] —unlike in the model of [4], in which oscillations are stationary and stable— additional constraints might be added, guaranteeing that the instantaneous power of the LFP time-series, integrated over a specified frequency band (e.g. the gamma band), exceeds a given minimum threshold.
Since the sampling rate of the electrophysiological recordings simulated by the computational model is high, there is no need to incorporate zero-lag causal interactions. Therefore, the standard settings (r = s = 0, p = q = 1) are used.

Confidence intervals and the statistical significance of causal interaction strengths are assessed by comparisons with TE estimates from surrogate time-series, randomly resampled through a geometric bootstrap procedure [50], which preserves the auto-correlation structure of the individual time-series and is therefore compliant with their oscillatory nature. Details can be found in [4].
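The geometric bootstrap of [50] resamples blocks of random, geometrically distributed length with wrap-around; a minimal sketch generating one surrogate series (the mean block length and the function name are illustrative) is:

import numpy as np

def geometric_bootstrap_surrogate(x, mean_block_len=50, seed=0):
    # One surrogate series: concatenate blocks of geometrically distributed
    # length, each starting at a random position (with wrap-around), so that
    # the short-range autocorrelation structure of x is preserved.
    rng = np.random.default_rng(seed)
    n = len(x)
    surrogate = np.empty(n)
    i = 0
    while i < n:
        start = rng.integers(n)
        block_len = rng.geometric(1.0 / mean_block_len)
        for k in range(min(block_len, n - i)):
            surrogate[i + k] = x[(start + k) % n]
        i += block_len
    return surrogate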

4.3 Functional Multiplicity in Motifs of Oscillating Areas


Different dynamical states —characterized by oscillations with different phase-
locking relations and degrees of periodicity— arise from simple symmetric
structural topological motifs [2, 4]. Changes in the strength of local inhibition, of
long-range excitation or of delays of local and long-range connections can lead
to phase transitions between qualitatively distinct dynamical states (Figure 6A–C).

[Figure 6, panels A–G; the ×6 multiplier factors shown next to some functional motifs are explained in the caption below.]

Fig. 6 Functional multiplicity in motifs of oscillating areas. Dynamical states and resulting
directed functional connectivities, generated by structural motifs of K = 2, 3 mutually and
symmetrically connected brain areas. A–C simulated “LFPs” and spike trains of the two pop-
ulations of a K = 2 motif for three different strengths of the symmetric inter-areal coupling,
leading to phase-locked states with different degrees of periodicity. D–F Transfer entropies for the two possible directions of functional interaction, associated to the dynamic states in panels A–C. A grey band indicates the threshold for statistical significance. Below the TE plots: graphic depictions of the functional interactions between the two areas, captured by state-conditioned Transfer Entropy. Only arrows corresponding to significant causal interactions are shown. Arrow thickness reflects TE strength. G Analogous directed functional
connectivity motifs generated by a K = 3 symmetric structural motif. The multiplier factors
denote multistability between motifs with same topology but different directions (functional
motif families). Figure adapted from [4]. (Copyright: Battaglia et al. 2012, Creative Com-
mons licence).

Moreover, within broad ranges of parameters, multi-stabilities between different phase-locking patterns take place even without changes in connection strength or delay.
Multivariate time-series of simulated “LFPs” are generated for different dynam-
ical states of the model structural motifs and TEs for all the possible directed pair-
wise interactions are calculated. The resulting directed connectivities are depicted
in diagrammatic form by drawing an arrow for each statistically significant causal interaction, the thickness of each arrow encoding the strength of the corresponding interaction (Figure 6D–F). These graphical representations thus make it apparent that many directed functional connectivity motifs emerge from a same structural motif.
Such functional motifs are organized into families. Motifs within a same family cor-
respond to dynamical states which are multi-stable for a given choice of parameters,
while different families of motifs are obtained for different ranges of parameters
leading to different ensembles of dynamical states.
A first family of functional motifs occurs for weak inter-areal coupling. In this
case, neuronal activity oscillates in a roughly periodic fashion (Figure 6A). When
local inhibition is strong, the local oscillations generated within different areas lock
in an out-of-phase fashion. It is therefore possible to identify a leader area whose oscillations lead in phase over the oscillations of laggard areas [2]. In this family, causal interactions are statistically significant only for pairwise interactions proceeding from a phase-leading area to a phase-lagging area, as shown by the box-plots of Figure 6D (unidirectional driving). The anisotropy of functional influences in the leader-to-laggard and laggard-to-leader directions can be understood in terms of the communication-through-coherence theory. Indeed, the longer latency from the oscillations of the laggard area to the oscillations of the leader area reduces the likelihood that rate fluctuations originating locally within a laggard area trigger correlated rate fluctuations within a leading area [68].
A second family of functional motifs occurs for intermediate inter-areal coupling.
In this case, the periodicity of the “LFP” oscillations is disrupted by the emergence
of large correlated fluctuations in oscillation cycle amplitudes and durations. Phase-
locking between “LFPs” becomes only approximate, even if still out-of-phase on
average. The rhythm of the laggard area is now more irregular than the rhythm in
the leader area (Figure 6B). Fluctuations in cycle length do occasionally shorten
the laggard-to-leader latencies, enhancing non-linearly and transiently the influence
of laggard areas on the leader activity. Correspondingly, TEs in leader-to-laggard
directions continue to be larger, but TEs in laggard-to-leader directions are now also
statistically significant (Figure 6E). The associated effective motifs are no longer unidirectional, but continue to display a dominant direction (leaky driving).
A third family of effective motifs occurs for stronger inter-areal coupling. In
this case the rhythms of all the areas become equally irregular, characterized by an analogous level of fluctuations in cycle amplitudes and durations. During brief tran-
sients, leader areas can still be identified, but these transients do not lead to a stable
dynamic behavior and different areas in the structural motif continually exchange
their leadership role (Figure 6C). As a result of the instability of phase-leadership

relations, only average TEs can be evaluated, yielding equally large TE values for all pairwise directed interactions (Figure 6F, mutual driving).
Analogous unidirectional, leaky or mutual driving motifs of functional interac-
tion can be found in larger motifs with K = 3 areas, as shown by Figure 6G [4].

4.4 Control of Information Flow Directionality


The considered structural motifs are invariant under permutations of the intercon-
nected areas. However, while anti-phase or in-phase locking configurations would
share this permutation symmetry with the full system, this is not true for the out-of-
phase-locking configurations which are stable for strong local inhibition (cfr. Figure
6A–B). A situation in which a system with specific symmetry properties assumes
dynamic configurations whose degree of symmetry is reduced with respect to the
full symmetry of the system is termed spontaneous symmetry breaking. However,

[Figure 7, panels A–C: switching frequency as a function of the stimulation phase (curves labeled 50% and 100%); boxplots of MI/H on a logarithmic scale for the different transmission lines and functional motifs. See caption below.]
Fig. 7 Switching information flow in motifs of oscillating areas. A A precisely-phased optogenetic or electric stimulation pulse can trigger switching between alternative phase-locking modes of a structural motif of oscillating areas (shown here is a switching from “black-preceding-gray” to “gray-preceding-black” out-of-phase locking). For a given perturbation
intensity, the probability that a pulse induces an attractor switching event concentrates within
a narrow interval of stimulation phases. B-C: Actual information transmission efficiency is
quantified by the Mutual Information (MI) between spike trains of pairs of source and target
cells connected by a unidirectional transmission-line (TL) synapse, normalized by the en-
tropy (H) of the source cell. Boxplots show values of MI/H for different groups of cell pairs
and directed functional motifs. Black and pale gray arrows below boxplots indicate pairs of
cells interconnected by the TL marked with the corresponding color. A dot indicates control
pairs of cells interconnected by ordinary weak synapses. The dominant directionality of the
active functional motif is also shown. B Unidirectional driving functional motif family. Com-
munication efficiency is enhanced only along the TL aligned to the directionality of the active functional motif, while it is indistinguishable from control along the other TL. C Leaky driv-
ing functional motif family. Communication efficiency is enhanced along both TLs, but more
along the TL aligned to the dominant directionality of the active functional motif. Figure
adapted from [4]. (Copyright: Battaglia et al. 2012, Creative Commons licence).

due to the overall structural symmetry, configurations in which the areas exchange
their leader or laggard roles must also be stable, i.e. the complete set of dynamical
attractors continues to be symmetric, even if individual attractors are asymmetric.
Exploiting multi-stability, fast reconfiguration of directed functional influences
can be obtained just by inducing switching between alternative multi-stable attrac-
tors, associated to functional motifs in a same family but with different directional-
ity. As elaborated in [4], an efficient way to trigger “jumps” between phase-locked
configurations is to perturb locally the dynamics of ongoing oscillations with pre-
cisely phased stimulation pulses. Such an external perturbation can be provided for
instance by optogenetic stimulation, if a sufficient fraction of cells in the target area
has been transduced with light-activated conductances. Simulation studies [67] suggest that even transduction rates as low as 5-10% might be sufficient to optogenetically induce functional motif switching, if the pulse perturbations are properly phased with respect to the ongoing rhythm (Figure 7A), as predicted also by a mean-
field theory [4]. But what is the impact of functional motif switching on the actual
flow of information encoded at the microscopic level of detailed spiking patterns?
In the studied model, rate fluctuations can encode only a limited amount of infor-
mation, because firing rate oscillations are stereotyped and amplitude fluctuations
are small with respect to the average excursion between peaks and troughs of the
oscillation. Higher amounts of information can be carried by spiking patterns, since
the spiking activity of single neurons during sparsely synchronized oscillations re-
mains very irregular and thus associated to a large entropy. To quantify information
exchanged by interacting areas, a reference code is considered, in which a “1” or a
“0” symbol denote respectively firing or missed firing of a spike by a specific neu-
ron at each given oscillation cycle. Based on such an encoding, the neural activity
of a group of neurons is mapped to digital-like streams, “clocked” by the network rhythm, in which a different “word” is broadcast at each oscillation cycle.³
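A sketch of this digitization step (illustrative; in practice the cycle boundaries would be taken, e.g., from successive troughs of the network rhythm) could be:

import numpy as np

def spikes_to_cycle_words(spike_times, neuron_ids, cycle_edges, n_neurons):
    # Entry (c, i) of the returned array is 1 if neuron i fired at least one
    # spike during oscillation cycle c, and 0 otherwise ("missed firing").
    n_cycles = len(cycle_edges) - 1
    words = np.zeros((n_cycles, n_neurons), dtype=int)
    cycles = np.searchsorted(cycle_edges, spike_times) - 1
    for c, i in zip(cycles, neuron_ids):
        if 0 <= c < n_cycles:
            words[c, i] = 1
    return words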
Focusing on a fully symmetric structural motif of K = 2 areas, the network
is modified by embedding into it transmission lines (TLs), i.e. mono-directional
fiber tracts dedicated to inter-areal communication. In more detail, selected sub-
populations of source excitatory neurons within each area establish synaptic con-
tacts with matching target excitatory or inhibitory cells in the other area, in a one-
to-one cell arrangement. Synapses in a TL are strengthened with respect to usual
synapses, in the attempt to enhance communication capacity, but not too much, in
order not to alter phase-relations between the collective oscillations of the two areas
(for more details, see [4]). The information transmission efficiency of each TL is
assessed —separately for different effective motifs— by quantifying Mutual Infor-
mation (MI) [57] between the “digitized” spike trains of pairs of source and target
cells. Since a source cell fires on average every five or six oscillation cycles, the firing of a single neuron conveys H ≃ 0.7 bits of information per oscillation cycle.
MI normalized by the source entropy H indicates the fraction of this information
reaching the target cell. Due to the possibility of generating very long simulated
3 Such a code is here introduced uniquely as a theoretical construct grounding a rigorous
analysis of information transmission, without claim that it is actually being used in the
brain.

recordings in stationary conditions, straight plug-in estimates of MI and H already provide reasonable levels of accuracy (in the sense that taking into account finite sampling corrections [57] would not change the described phenomenology [4]).
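For concreteness, a plug-in computation of MI/H for a pair of digitized source and target spike trains might look as follows (a sketch with illustrative names); the final comment verifies the entropy value quoted above.

import numpy as np

def plugin_entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mi_over_h(source_bits, target_bits):
    # Plug-in mutual information between two binary per-cycle spike trains,
    # normalized by the entropy of the source train.
    joint = np.zeros((2, 2))
    for s, t in zip(source_bits, target_bits):
        joint[s, t] += 1
    joint /= joint.sum()
    h_source = plugin_entropy(joint.sum(axis=1))
    h_target = plugin_entropy(joint.sum(axis=0))
    h_joint = plugin_entropy(joint.ravel())
    return (h_source + h_target - h_joint) / h_source

# Sanity check of the quoted source entropy: firing once every ~5-6 cycles
# gives p ~ 0.18, and H = -p*log2(p) - (1-p)*log2(1-p) ~ 0.68 bits per cycle,
# consistent with the H of about 0.7 bits mentioned in the text.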
As shown by Figure 7B–C, the communication efficiency of embedded TLs de-
pends strongly on the active functional motif. When the structural motif is prepared
in a dynamical state corresponding to a unidirectional driving functional motif (Fig-
ure 7B), communication is nearly optimal along the TL aligned with the functional
motif itself. The misaligned TL, however, shows no enhancement with respect to
control (i.e. pairs of connected cells not belonging to a TL). In the case of leaky
driving functional motifs (Figure 7C), communication efficiency is boosted for both
TLs, but more for the TL aligned with the dominant functional influence direc-
tion. For both families of functional motifs, communication efficiencies of the two
embedded TLs can be swiftly “swapped” by reversing the dominant functional in-
fluence direction through a suitably phased stimulation pulse.
In conclusion, the parallelism between TE analyses of directed functional con-
nectivity and MI analyses of information transmission is manifest. In simulated
structural motifs, indeed, information flow quantified by spike-based MI follows
closely in direction and strength the functional topology inferred by LFP-based TE.

5 Function from Structure, via Dynamics


The architect Louis Sullivan first popularized a celebrated tag-line stating that “form
follows function”. The two model systems here reviewed, cultures of dissociated
neurons and motifs of interacting oscillating areas, seem on the contrary to indi-
cate that function doesn’t follow structure, or, at least, not in a trivial sense. Both
functional multiplicity and structural degeneracy can be naturally understood if we
assume a primacy of dynamics in determining emergent functional interactions. In
other words, function follows dynamics, rather than structure. Still and all, func-
tional connectivity patterns are known to be strongly determined by structure. A
clear example is provided by resting-state functional connectivity [26], which can
largely be understood in terms of noise-driven fluctuations of the spontaneous dy-
namics of thalamocortical macroscale structures [18, 35, 42].
In the examples considered here, structure was fixed a priori. However, in nature (or in the dish), networks are far from being hardwired and are gradually shaped by activity- and context-dependent processes such as learning. We speculate that
this self-organized design of structural networks might be chasing an optimization
goal: the attempt to guarantee functional flexibility via the maximization of func-
tional multiplicity. In this view, specific structures which generate a particularly
rich dynamical repertoire [18, 35] would be maintained through development and,
ultimately, selected through evolution because of the fitness they confer.
In the end, then, it might well be that Louis Sullivan’s motto applies to the description of brain circuits as well, even if the structure-to-function relation is indirect and involves a detour through nonlinear dynamics. As a matter of fact, for evolution or development, the problem of engineering a circuit implementing a given set of functions could be nothing else than the design of structural networks acting as emergent “functional collectivities” [27] with suitable dynamical regimes.
An advantageous feature allowing a dynamical network to transit fluently be-
tween qualitatively different dynamical regimes would be criticality [13]. Switching
would be indeed highly facilitated for a system tuned to be close to the edge between
multiple dynamic attractors. This is indeed the case for neuronal cultures, which undergo spontaneous switching to bursting due to their proximity to a rate instability (compensated for by synaptic resource depletion). Beyond that, networks at the
edge of synchrony might undergo noise-induced switching between a baseline es-
sentially asynchronous activity and phase-locked transients with elevated local and
inter-areal oscillatory coherence. In networks critically tuned to be at the edge of
synchrony, specific patterns of directed functional interactions associated to a latent phase-locked attractor —becoming manifest only for fully developed synchrony— might be “switched on” just through the application of weak biasing inputs which stabilize its metastable strong-noise “ghost” [19].

Acknowledgements. The framework here reviewed would not have been developed without
the help of colleagues and students. Credit for these and other related results must be shared
with (in alphabetic order): Ahmed El Hady, Theo Geisel, Christoph Kirst, Erik Martens, An-
dreas Neef, Agostina Palmigiano, Javier Orlandi, Jordi Soriano, Olav Stetter, Marc Timme,
Annette Witt, Fred Wolf. I am also grateful to Dante Chialvo, Gustavo Deco and Viktor Jirsa
for inspiring discussions.

References
1. de Arcangelis, L., Perrone-Capano, C., Herrmann, H.J.: Self-organized criticality model
for brain plasticity. Phys. Rev. Lett. 96, 028107 (2006)
2. Battaglia, D., Brunel, N., Hansel, D.: Temporal decorrelation of collective oscillations
in neural networks with local inhibition and long-range excitation. Phys. Rev. Lett. 99,
238106 (2007)
3. Battaglia, D., Hansel, D.: Synchronous chaos and broad band gamma rhythm in a mini-
mal multi-layer model of primary visual cortex. PLoS Comp. Biol. 7, e1002176 (2011)
4. Battaglia, D., Witt, A., Wolf, F., Geisel, T.: Dynamic effective connectivity of inter-areal
brain circuits. PLoS Comp. Biol. 8, e1002438 (2012)
5. Beggs, J., Plenz, D.: Neuronal avalanches in neocortical circuits. Journal of Neuro-
science 23, 11167–11177 (2003)
6. Bosman, C.A., Schoffelen, J.-M., Brunet, N., Oostenveld, R., Bastos, A.M., et al.: At-
tentional stimulus selection through selective synchronization between monkey visual
areas. Neuron 75, 875–888 (2012)
7. Bressler, S.L., Seth, A.K.: Wiener-Granger causality: a well established methodology.
NeuroImage 58, 323–329 (2011)
8. Brovelli, A., Ding, M., Ledberg, A., Chen, Y., Nakamura, R., Bressler, S.L.: Beta oscil-
lations in a large-scale sensorimotor cortical network: directional influences revealed by
Granger causality. Proc. Natl. Acad. Sci. USA 101, 9849–9854 (2004)
9. Brunel, N., Wang, X.J.: What determines the frequency of fast network oscillations with
irregular neural discharges? J. Neurophysiol. 90, 415–430 (2003)

10. Brunel, N., Hansel, D.: How noise affects the synchronization properties of recurrent
networks of inhibitory neurons. Neural Comput. 18, 1066–1110 (2006)
11. Brunel, N., Hakim, V.: Sparsely synchronized neuronal oscillations. Chaos 18, 015113
(2008)
12. Buehlmann, A., Deco, G.: Optimal information transfer in the cortex through synchro-
nization. PLoS Comput. Biol. 6(9), 1000934 (2010)
13. Chialvo, D.R.: Emergent complex neural dynamics. Nat. Phys. 6, 744–750 (2010)
14. Cohen, E., Ivenshitz, M., Amor-Baroukh, V., Greenberger, V., Segal, M.: Determinants
of spontaneous activity in networks of cultured hippocampus. Brain Res. 1235, 21–30
(2008)
15. Dayan, P., Abbott, L.: Theoretical Neuroscience: Computational and Mathematical Mod-
eling of Neural Systems. MIT Press, Cambridge (2001)
16. Deco, G., Romo, R.: The role of fluctuations in perception. Trends Neurosci. 31, 591–
598 (2008)
17. Deco, G., Rolls, E.T., Romo, R.: Stochastic dynamics as a principle of brain function.
Prog. Neurobiol. 88, 1–16 (2009)
18. Deco, G., Jirsa, V.K., McIntosh, R.: Emerging concepts for the dynamical organization
of resting-state activity in the brain. Nat. Rev. Neurosci. 12, 43–56 (2011)
19. Deco, G., Jirsa, V.K.: Ongoing cortical activity at rest: criticality, multistability, and ghost
attractors. Journal of Neuroscience 32, 3366–3375 (2012)
20. Ding, M., Chen, Y., Bressler, S.L.: Granger causality: basic theory and application to
neuroscience. In: Schelter, B., Winterhalder, M., Timmer, J. (eds.) Handbook of Time
Series Analysis. Wiley, New York (2006)
21. Ditzinger, T., Haken, H.: Oscillations in the perception of ambiguous patterns: a model
based on synergetics. Biol. Cybern. 61, 279–287 (1989)
22. Eckhorn, R., Bauer, R., Jordan, W., Brosch, M., Kruse, W., Munk, M., Reitboeck, H.J.:
Coherent oscillations: a mechanism of feature linking in the visual cortex? Multiple elec-
trode and correlation analyses in the cat. Biol. Cybern. 60, 121–130 (1988)
23. Eckmann, J.P., Feinerman, O., Gruendlinger, L., Moses, E., Soriano, J., et al.: The
physics of living neural networks. Physics Reports 449, 54–76 (2007)
24. Engel, A., Fries, P., Singer, W.: Dynamic predictions: oscillations and synchrony in top-
down processing. Nat. Rev. Neurosci. 2, 704–716 (2001)
25. Eytan, D., Marom, S.: Dynamics and effective topology underlying synchronization in
networks of cortical neurons. J. Neurosci. 26, 8465–8476 (2006)
26. Fox, M.D., Snyder, A.Z., Vincent, J.L., Corbetta, M., Van Essen, D.C., et al.: The human
brain is intrinsically organized into dynamic, anticorrelated functional networks. Proc.
Natl. Acad. Sci. USA 102, 9673–9678 (2005)
27. Fraiman, D., Balenzuela, P., Foss, J., Chialvo, D.R.: Ising-like dynamics in large-scale
functional brain networks. Phys. Rev. E Stat. Nonlin. Soft Matter Phys. 79, 061922
(2009)
28. Freyer, F., Roberts, J.A., Becker, R., Robinson, P.A., Ritter, P., et al.: Biophysical mech-
anisms of multistability in resting-state cortical rhythms. J. Neurosci. 31, 6353–6361
(2011)
29. Fries, P.: A mechanism for cognitive dynamics: neuronal communication through neu-
ronal coherence. Trends Cogn. Sci. 9, 474–480 (2005)
30. Fries, P., Nikolić, D., Singer, W.: The gamma cycle. Trends Neurosci. 30, 309–316
(2007)
31. Fries, P., Womelsdorf, T., Oostenveld, R., Desimone, R.: The effects of visual stimulation
and selective visual attention on rhythmic neuronal synchronization in macaque area V4.
J. Neurosci. 28, 4823–4835 (2008)

32. Friston, K.J.: Functional and Effective Connectivity in Neuroimaging: A Synthesis. Hu-
man Brain Mapping 2, 56–78 (1994)
33. Friston, K.J.: Functional and Effective Connectivity: A Review. Brain Connectivity 1,
13–36 (2011)
34. Garofalo, M., Nieus, T., Massobrio, P., Martinoia, S.: Evaluation of the performance
of information theory-based methods and cross-correlation to estimate the functional
connectivity in cortical networks. PLoS One 4, e6482 (2009)
35. Ghosh, A., Rho, Y., McIntosh, A.R., Kötter, R., Jirsa, V.K.: Noise during rest enables the
exploration of the brain’s dynamic repertoire. PLoS Comp. Biol. 4, 1000196 (2008)
36. Gourévitch, B., Bouquin-Jeannès, R.L., Faucon, G.: Linear and nonlinear causality be-
tween signals: methods, examples and neurophysiological applications. Biol. Cybern. 95,
349–369 (2006)
37. Granger, C.W.J.: Investigating causal relations by econometric models and cross-spectral
methods. Econometrica 37, 424–438 (1969)
38. Gregoriou, G.G., Gotts, S.J., Zhou, H., Desimone, R.: High-frequency, long-range cou-
pling between prefrontal and visual cortex during attention. Science 324, 1207–1210
(2009)
39. Grienberger, C., Konnerth, A.: Imaging Calcium in Neurons. Neuron 73, 862–885 (2012)
40. Haken, H., Kelso, J.A., Bunz, H.: A theoretical model of phase transitions in human hand
movements. Biol. Cybern. 51, 347–356 (1985)
41. Hlaváčková-Schindler, K., Paluš, M., Vejmelka, M., Bhattacharya, J.: Causality detection
based on information-theoretic approaches in time series analysis. Phys. Rep. 441, 1–46
(2007)
42. Honey, C.J., Kötter, R., Breakspear, M., Sporns, O.: Network structure of cerebral cortex
shapes functional connectivity on multiple time scales. Proc. Natl. Acad. Sci. USA 104,
10240–10245 (2007)
43. Ito, S., Hansen, M.E., Heiland, R., Lumsdaine, A., Litke, A.M., Beggs, J.M.: Extending
transfer entropy improves identification of effective connectivity in a spiking cortical
network model. PLoS One 6, e27431 (2011)
44. Jacobi, S., Soriano, J., Segal, M., Moses, E.: BDNF and NT-3 increase excitatory input connectivity in rat hippocampal cultures. Eur. J. Neurosci. 30, 998–1010 (2009)
45. Levina, A., Herrmann, J.M., Geisel, T.: Dynamical synapses causing self-organized crit-
icality in neural networks. Nat. Phys. 3, 857–860 (2007)
46. Levina, A., Herrmann, J.M., Geisel, T.: Phase Transitions towards Criticality in a Neural
System with Adaptive Interactions. Phys. Rev. Lett. 102, 118110 (2009)
47. Misic, B., Mills, T., Taylor, M.J., McIntosh, A.R.: Brain noise is task-dependent and
region specific. J. Neurophysiol. 104, 2667–2676 (2010)
48. Moreno-Bote, R., Rinzel, J., Rubin, N.: Noise-induced alternations in an attractor net-
work model of perceptual bistability. J. Neurophysiol. 98, 1125–1139 (2007)
49. Orlandi, J., Stetter, O., Soriano, J., Geisel, T., Battaglia, D.: Transfer Entropy reconstruc-
tion and labeling of neuronal connections from simulated calcium imaging. PLoS One
(in press, 2014)
50. Politis, D.N., Romano, J.P.: Limit theorems for weakly dependent Hilbert space valued
random variables with applications to the stationary bootstrap. Statistica Sinica 4, 461–
476 (1994)
51. Salazar, R.F., Dotson, N.M., Bressler, S.L., Gray, C.M.: Content-specific fronto-parietal
synchronization during visual working memory. Science 338, 1097–1100 (2012)
52. Schreiber, T.: Measuring information transfer. Phys. Rev. Lett. 85, 461–464 (2000)
53. Seamans, J.K., Yang, C.R.: The principal features and mechanisms of dopamine modu-
lation in the prefrontal cortex. Prog. Neurobiol. 74, 1–58 (2004)
Function Follows Dynamics: State-Dependency of Directed Functional Influences 135

54. Soriano, J., Martinez, M.R., Tlusty, T., Moses, E.: Development of input connections in
neural cultures. Proc. Natl. Acad. Sci. USA 105, 13758–13763 (2008)
55. Sporns, O., Kötter, R.: Motifs in brain networks. PLoS Biol. 2, e369 (2004)
56. Stetter, O., Battaglia, D., Soriano, J., Geisel, T.: Model-free reconstruction of excitatory
neuronal connectivity from calcium imaging signals. PLoS Comp. Biol. 8, e1002653
(2012)
57. Strong, S.P., Koberle, R., de Ruyter van Steveninck, R.R., Bialek, W.: Entropy and infor-
mation in neural spike trains. Phys. Rev. Lett. 80, 197–200 (1998)
58. Tsodyks, M., Uziel, A., Markram, H.: Synchrony generation in recurrent networks with
frequency-dependent synapses. J. Neurosci. 20, 1–5 (2000)
59. Varela, F., Lachaux, J.P., Rodriguez, E., Martinerie, J.: The brainweb: Phase synchro-
nization and large-scale integration. Nat. Rev. Neurosci. 2, 229–239 (2001)
60. Vogelstein, J.T., Watson, B.O., Packer, A.M., Yuste, R., Jedynak, B., et al.: Spike in-
ference from calcium imaging using sequential Monte Carlo methods. Biophys. J. 97,
636–655 (2009)
61. Volgushev, M., Chistiakova, M., Singer, W.: Modification of discharge patterns of neo-
cortical neurons by induced oscillations of the membrane potential. Neuroscience 83,
15–25 (1998)
62. Wagenaar, D.A., Pine, J., Potter, S.M.: An extremely rich repertoire of bursting patterns
during the development of cortical cultures. BMC Neuroscience 7, 1–18 (2006)
63. Wang, X.J., Buzsáki, G.: Gamma oscillation by synaptic inhibition in a hippocampal
interneuronal network model. J. Neurosci. 16, 6402–6413 (1996)
64. Wang, X.J.: Neurophysiological and computational principles of cortical rhythms in cog-
nition. Physiol. Rev. 90, 1195–1268 (2010)
65. Whittington, M.A., Traub, R.D., Kopell, N., Ermentrout, B., Buhl, E.H.: Inhibition-based
rhythms: experimental and mathematical observations on network dynamics. Int. J. Psy-
chophysiol. 38, 315–336 (2000)
66. Wiener, N.: The theory of prediction. In: Beckenbach, E. (ed.) Modern Mathematics for
Engineers. McGraw-Hill, New York (1956)
67. Witt, A., Palmigiano, A., Neef, A., El Hady, A., Wolf, F., Battaglia, D.: Controlling
oscillation phase through precisely timed closed-loop optogenetic stimulation: a compu-
tational study. Front Neural Circuits 7, 49 (2013)
68. Womelsdorf, T., Lima, B., Vinck, M., Oostenveld, R., Singer, W., et al.: Orientation se-
lectivity and noise correlation in awake monkey area V1 are modulated by the gamma
cycle. Proc. Natl. Acad. Sci. USA 109, 4302–4307 (2012)
69. Yizhar, O., Fenno, L.E., Davidson, T.J., Mogri, M., Deisseroth, K.: Optogenetics in neu-
ral systems. Neuron 71, 9–34 (2011)
On Complexity and Phase Effects
in Reconstructing the Directionality
of Coupling in Non-linear Systems

Vasily A. Vakorin, Olga Krakovska, and Anthony R. McIntosh

Abstract. From the theoretical point of view, brain signals measured with the electroencephalogram (EEG) or magnetoencephalogram (MEG) can be described as the manifestation of coupled nonlinear systems with time delays in coupling. From the empirical point of view, to understand how information is processed in the brain, there is a need to characterize the information flow in a network of spatially distinct brain areas. Tools for reconstructing the directionality of coupling, which can be formalized as Granger causality, provide a framework for gaining insight into the functional organization of brain networks. However, it is not completely understood what kinds of effects are captured by causal statistics. In the context of coupled non-linear oscillating systems with a time delay in coupling, we consider two effects that can contribute to the estimation of causality. First, we explore the problem of the ambiguity of phase delays observed between the dynamics of the driver and the response, and its effect on the linear, spectral and information-theoretic statistics. Second, we show that the directionality of coupling can be understood in terms of the differences in signal complexity between the driver and the response.

Vasily A. Vakorin
Neurosciences & Mental Health, The Hospital for Sick Children, Toronto, Canada
e-mail: vasenka@gmail.com

Olga Krakovska
Department of Chemistry, York University, Toronto, Canada

Anthony R. McIntosh
Rotman Research Institute, Baycrest Centre and Department of Psychology,
University of Toronto, Toronto, Canada

1 Introduction
Rhythmic activity between neuronal ensembles is a widely observed phenomenon
in the brain [2]. The macroscopic oscillations can be detected with measurements of
local field potentials (LFP), electroencephalographic (EEG), or magnetoencephalographic (MEG) recordings [18]. Mathematically, these neuronal ensembles can be
represented by single oscillators [14]. In turn, different neuronal ensembles can be
coupled with long-range connections, forming a large-scale network of coupled os-
cillators. Numerous studies have suggested that cognitive function can be explained
in terms of synchronous dynamics of large neuronal ensembles coupled within and
across systems [29]. In particular, encouraging results were obtained in modeling
the resting state networks under the context of non-linear dynamics, wherein time
delays in coupling play a crucial role in the generation of realistic fluctuations in
brain signals [8, 5].
One approach to gain insight into the mechanisms underlying functional net-
works is to explore the transfer of information in the networks. In this case, we want
not only to estimate the strength of the functional connectivity between the nodes
in a network, but also to infer the directionality of coupling. In other words, there is
a need to reconstruct causal relations between the observed signals. The notion of
Granger causality was introduced based on an idea of asymmetry in signals’ ability
to predict each other [11]. Under this framework, a process X is considered a cause
of another process Y , if the incorporation of the knowledge about the past of X sig-
nificantly improves the prediction of the future of Y , compared to the prediction
that is based solely on the knowledge about the past of Y . Granger causality thus is
based on the idea of temporal precedence where a cause precedes its consequences
(see, however, the chapter by Chicharro in this volume for a detailed discussion on
conceptual problems for inferring causal interactions using criteria based solely on
the temporal precedence).
In the case of brain oscillations carried at specific frequencies, time delay, in
general, cannot be converted into phase delay without ambiguity due to shifting
a wave backward or forward a full cycle (360◦). In the case of a linear transfer
function, there are strategies used to overcome the phase ambiguity that may exist
at a specific frequency. For example, computing the slope of the phase over a range
of frequencies, which is in essence the group delay, can be helpful [9]. However, the
situation is different in the case of non-linear systems, with possible time delays in
coupling.
In general, a connection can be characterized by the directionality, strength of
coupling, and time delay in coupling. The temporal precedence between the driver
and the response may materialize as either phase delays or phase advances at spe-
cific frequencies. Furthermore, in a network with many mutually connected nodes,
observed phase differences for a specific connection would be a result of intrinsic
combinations of all the parameters of coupling for all the connections. Thus, the
effects related to what is observed as a phase delay of the driver with respect to the
response, may counteract the inherent temporal precedence, as defined by physical
interactions of coupled systems.
To demonstrate what effects can be captured by causal statistics, we use a pro-
totypical non-linear system of coupled oscillators and explore the performance of
the standard Granger statistic as well as its spectral and information-theoretic ver-
sions. In the first part of this book chapter, we control the parameters of coupling to

show that, given the same directionality of coupling (as specified by the underlying
model), we can observe either phase delay or phase lead of the driver with respect
to the response. In turn, this phase difference affects the causal statistics, potentially
leading to spurious results. In the second part, we explore another mechanism that
can contribute to the causality estimation. Specifically, in spite of the confound-
ing effects of phase delays, the inference of the directionality of coupling may rely
on the differences in the complexity (information content) between the driver and
response. Intuitively, if the information is transferred from one system to another,
then the dynamics of the receiving system would reflect both its own complexity and
that of the sending system. Thus, the observed causality would depend on which of
the two effects, phase-related or complexity-related, would be stronger in a specific
situation.

2 Coupled Non-linear Systems


The interplay between causality and phase, and between causality and complexity
will be illustrated using a system of coupled Rössler oscillators [35]. Such a model
represents a relatively simple non-linear system able to generate self-sustained non-
periodic oscillations. Oscillatory behavior of the brain rhythms has been extensively
studied as a plausible mechanism for neuronal communication [29, 36], and under
this context, the coupled Rössler oscillators can be viewed as a prototypical example
of oscillatory brain networks [13].
Explicitly, the model reads
\[
\begin{aligned}
\frac{dx_1}{dt} &= -\omega_1 y_1 - z_1 + \varepsilon\, x_2(t - T), &\qquad \frac{dx_2}{dt} &= -\omega_2 y_2 - z_2,\\
\frac{dy_1}{dt} &= \omega_1 x_1 + 0.15\, y_1, &\qquad \frac{dy_2}{dt} &= \omega_2 x_2 + 0.15\, y_2,\\
\frac{dz_1}{dt} &= 0.2 + z_1 (x_1 - 10), &\qquad \frac{dz_2}{dt} &= 0.2 + z_2 (x_2 - 10),
\end{aligned}
\qquad (1)
\]
where ω1 and ω2 are the natural frequencies of the oscillators, ε is the coupling
strength, and T denotes the time delay in coupling. Roughly speaking, each Rössler
system describes an oscillatory trajectory in the x-y plane, with spike-like behavior
in the z direction. All further analyses were based on the assumption that only the
variables x1 (t) and x2 (t) could be observed.
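To make the simulation setup concrete, the sketch below integrates system (1) with a plain fixed-step Euler scheme and a ring buffer for the delayed forcing term ε x2(t − T). It is only an illustration: the integrator, step size, initial conditions and the natural frequencies ω1 and ω2 used here are assumptions, not the settings used to produce the figures in this chapter.

```python
import numpy as np

def simulate_coupled_roessler(omega1=0.98, omega2=1.02, eps=0.07, T=0.1083,
                              dt=0.001, n_steps=200_000):
    """Euler integration of the delay-coupled Roessler system (1).

    Only x1 and x2 are returned, mimicking the assumption that the remaining
    coordinates are unobserved. All parameter values are illustrative.
    """
    buf_len = max(int(round(T / dt)), 1)                # delay expressed in integration steps
    state = np.array([0.1, 0.0, 0.0, 0.2, 0.0, 0.0])    # x1, y1, z1, x2, y2, z2
    x2_history = np.full(buf_len, state[3])             # ring buffer approximating x2(t - T)
    x1_out, x2_out = np.empty(n_steps), np.empty(n_steps)

    for i in range(n_steps):
        x1, y1, z1, x2, y2, z2 = state
        x2_delayed = x2_history[i % buf_len]            # oldest stored value, ~ x2(t - T)
        derivs = np.array([
            -omega1 * y1 - z1 + eps * x2_delayed,       # dx1/dt, delayed coupling from x2
            omega1 * x1 + 0.15 * y1,                    # dy1/dt
            0.2 + z1 * (x1 - 10.0),                     # dz1/dt
            -omega2 * y2 - z2,                          # dx2/dt (autonomous driver)
            omega2 * x2 + 0.15 * y2,                    # dy2/dt
            0.2 + z2 * (x2 - 10.0),                     # dz2/dt
        ])
        state = state + dt * derivs
        x2_history[i % buf_len] = state[3]              # overwrite the slot just consumed
        x1_out[i], x2_out[i] = state[0], state[3]
    return x1_out, x2_out
```

The coupling term appears only in the equation for x1, so the generated data follow the x2 → x1 direction of dominant coupling assumed throughout the chapter.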

3 Granger Causality: Standard, Spectral and Non-linear


The dynamics of two finite time series x1 (t) and x2 (t), t = 1, ..., n, and the interac-
tions between them can be described by an autoregressive model based on p lagged
observations:

\[
\begin{aligned}
x_1(t) &= \sum_{j=1}^{p} a_{11}(j)\, x_1(t - j) + \sum_{j=1}^{p} a_{12}(j)\, x_2(t - j) + \varepsilon_1(t)\\
x_2(t) &= \sum_{j=1}^{p} a_{21}(j)\, x_1(t - j) + \sum_{j=1}^{p} a_{22}(j)\, x_2(t - j) + \varepsilon_2(t)
\end{aligned}
\qquad (2)
\]

where an optimal order of the model, the parameter p, can be estimated, for exam-
ple, according to Bayesian information criterion [27], and ε1 (t) and ε2 (t) are the
prediction errors for each time series. According to [11], if the variance of ε2 (t) is
reduced by including the terms a21 ( j) in the second equation of (2), compared to
keeping a21 ( j) = 0 for all j, then x1 (t) is thought to be causing x2 (t).
Formally, Granger causality F1→2 from x1 (t) to x2 (t) is quantified as an enhance-
ment of predictive power and defined as
\[
F_{1 \to 2} = \ln \frac{\operatorname{var}\!\big(\varepsilon_2^{(21)}\big)}{\operatorname{var}(\varepsilon_2)}, \qquad (3)
\]
where var(ε2^(21)) is the variance of ε2 (t) derived from a model with a21 ( j) = 0 for
all j, and var(ε2 ) is the variance of ε2 (t) derived from the full model (2).
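As a concrete illustration of (2) and (3), the sketch below estimates F from a source to a target by ordinary least squares, comparing the residual variance of the restricted model (past of the target only) with that of the full bivariate model. Fixing the model order p is an assumption made here for brevity; in the chapter p is selected by the Bayesian information criterion.

```python
import numpy as np

def granger_F(source, target, p=5):
    """Estimate F_{source -> target} as in Eq. (3): ln(var restricted / var full)."""
    n = len(target)
    y = target[p:]
    # Columns j = 1..p hold the j-th lag of each series, aligned with y(t)
    lags_tgt = np.column_stack([target[p - j:n - j] for j in range(1, p + 1)])
    lags_src = np.column_stack([source[p - j:n - j] for j in range(1, p + 1)])

    def residual_variance(X):
        X = np.column_stack([np.ones(len(y)), X])        # include an intercept
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        return np.var(y - X @ beta)

    var_restricted = residual_variance(lags_tgt)                   # cross terms set to zero
    var_full = residual_variance(np.hstack([lags_tgt, lags_src]))  # full model (2)
    return np.log(var_restricted / var_full)
```

The net statistic ΔF = F2→1 − F1→2 introduced below would then be granger_F(x2, x1, p) − granger_F(x1, x2, p), positive when the dominant coupling is reconstructed as x2 → x1.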
Two extensions of bivariate Granger causality are proposed in the literature: spec-
tral and non-linear. The spectral version of Granger causality [25] is based on the
Fourier transform of autoregressive models:
     
\[
\begin{pmatrix} A_{11}(f) & A_{12}(f)\\ A_{21}(f) & A_{22}(f) \end{pmatrix}
\times
\begin{pmatrix} X_1(f)\\ X_2(f) \end{pmatrix}
=
\begin{pmatrix} E_1(f)\\ E_2(f) \end{pmatrix},
\qquad (4)
\]

where f is the frequency, and Ai j , Xi , and Ei , i, j = 1, 2, are the Fourier coefficients


of the corresponding variables ai j , xi , and εi . The model (4) can be rewritten in terms
of the transfer function Hi j :
       
\[
\begin{pmatrix} X_1(f)\\ X_2(f) \end{pmatrix}
=
\begin{pmatrix} H_{11}(f) & H_{12}(f)\\ H_{21}(f) & H_{22}(f) \end{pmatrix}
\times
\begin{pmatrix} E_1(f)\\ E_2(f) \end{pmatrix}
\equiv
H(f) \times
\begin{pmatrix} E_1(f)\\ E_2(f) \end{pmatrix}.
\qquad (5)
\]

Similar to (3), the spectral Granger causality G1→2 ( f ) from x1 (t) to x2 (t) is defined
as a function of the frequency f , and can be expressed in terms of the frequency-
specific covariance matrix of the residuals and the transfer function H( f ). More
details on the spectral causality can be found in [15].
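Although the chapter defers the explicit expression to [15], one widely used form (due to Geweke) writes G1→2(f) in terms of the transfer function H(f) of Eq. (5) and the residual covariance Σ of model (2). The sketch below assumes that formulation, and assumes the VAR coefficients and Σ have already been estimated; it is meant only to show how H(f) enters the computation.

```python
import numpy as np

def spectral_gc_1_to_2(coeffs, Sigma, freqs, fs=1.0):
    """Geweke-style spectral Granger causality G_{1->2}(f).

    coeffs : array of shape (p, 2, 2), coeffs[j-1][k, l] = a_{kl}(j) from Eq. (2)
    Sigma  : 2x2 residual covariance of [eps1, eps2]
    freqs  : frequencies in Hz; fs is the sampling rate in Hz.
    """
    p = coeffs.shape[0]
    G = np.empty(len(freqs))
    for idx, f in enumerate(freqs):
        # A(f) = I - sum_j a(j) exp(-2*pi*i*f*j/fs), consistent with Eq. (4)
        Af = np.eye(2, dtype=complex)
        for j in range(1, p + 1):
            Af -= coeffs[j - 1] * np.exp(-2j * np.pi * f * j / fs)
        H = np.linalg.inv(Af)                    # transfer function of Eq. (5)
        S = H @ Sigma @ H.conj().T               # spectral density matrix
        S22 = S[1, 1].real
        intrinsic = Sigma[0, 0] - Sigma[0, 1] ** 2 / Sigma[1, 1]
        G[idx] = np.log(S22 / (S22 - intrinsic * np.abs(H[1, 0]) ** 2))
    return G
```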
A non-linear version of Granger causality in the time domain can be constructed
using tools derived from information theory. Under the information-theoretic
approach, we do not need to explicitly specify a model of signals and their interac-
tions. Instead, the transfer of information from the past of one process to the future
of another process can be quantified in terms of individual and joint entropies, which
essentially measure the variability of the observed signals or the amount of infor-
mation contained in them.
Non-linear Granger causality I1→2 is thus expressed as a transfer of informa-
tion from one signal to another, and can be quantified as the conditional mutual

information I(xδ2 , x1 |x2 ) between xδ2 , the future of x2 , and the past of x1 given the
past of x2 . It can be estimated in terms of individual H(·) and joint entropies H(·, ·)
and H(·, ·, ·) of the processes x1 , x2 , and xδ2 as follows:

I1→2 (δ ) ≡ I(xδ2 , x1 |x2 ) = H(xδ2 , x2 ) + H(x1, x2 ) − H(xδ2 , x1 , x2 ) − H(x2), (6)

where the time lag δ between the future and the past of a signal is typically measured
in multiples of the sampling interval. It can be shown that under certain conditions,
I(xδ2 , x1 |x2 ) is equivalent to the measure called transfer entropy [26, 19].
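A minimal sketch of Eq. (6) with a naive histogram (binned) entropy estimator is shown below; it uses a single past value in place of the embedded state vectors introduced in Eq. (7), and binning rather than the correlation-integral estimator the chapter actually relies on, so it should be read only as a way of making the entropy decomposition and the sign convention of the net measure in Eq. (10) explicit.

```python
import numpy as np

def binned_entropy(*columns, bins=8):
    """Joint Shannon entropy (nats) of the given 1-D variables via a joint histogram."""
    counts, _ = np.histogramdd(np.column_stack(columns), bins=bins)
    p = counts.ravel() / counts.sum()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def info_transfer(source, target, delta=1, bins=8):
    """I_{source -> target}(delta) via the entropy decomposition of Eq. (6)."""
    tgt_future = target[delta:]
    src_past, tgt_past = source[:-delta], target[:-delta]
    return (binned_entropy(tgt_future, tgt_past, bins=bins)
            + binned_entropy(src_past, tgt_past, bins=bins)
            - binned_entropy(tgt_future, src_past, tgt_past, bins=bins)
            - binned_entropy(tgt_past, bins=bins))

# Net transfer as in Eq. (10): positive values point to dominant coupling 2 -> 1
# delta_I = info_transfer(x2, x1, delta) - info_transfer(x1, x2, delta)
```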
There are many ways to estimate the entropy of a signal. One approach is
based on an assumption that the observed time series are realizations of non-
linear dynamic systems. For example, the model (1) is a combination of two three-
dimensional systems, but we assume that only one dimension is observed (signals
x1 (t) and x2 (t)).
In this case, the dynamics in the multi-dimensional state space of the underlying
model should be reconstructed from a time series of observations. This can be done
with time delay embedding wherein the time series x1 (t) and x2 (t) are converted to
a sequence of vectors in a multidimensional space:

\[
\begin{aligned}
\mathbf{x}_1(t) &= [x_1(t),\, x_1(t-\tau_1),\, x_1(t-2\tau_1),\, \ldots,\, x_1(t-\tau_1(d_1-1))]^{T}\\
\mathbf{x}_2(t) &= [x_2(t),\, x_2(t-\tau_2),\, x_2(t-2\tau_2),\, \ldots,\, x_2(t-\tau_2(d_2-1))]^{T}
\end{aligned}
\qquad (7)
\]

where d1 and d2 are embedding dimensions, and τ1 and τ2 are embedding delays
measured in multiples of the sampling interval. Note that the ultimate goal is not
to reconstruct an orbit in the state space that is closest to the true one. However,
some invariants of a dynamical system, such as dimensions and entropy, can be
determined if the embedding dimension is sufficiently high [31]. We estimate the individual and joint entropies in (6) by computing the corresponding correlation integrals, as proposed in [22] and tested with linear and non-linear models
[3, 10, 33].
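Constructing the delay vectors of Eq. (7) is straightforward; a minimal sketch is shown below (the subsequent entropy estimation via correlation integrals [22] is not reproduced here).

```python
import numpy as np

def delay_embed(x, dim, tau):
    """Rows are the delay vectors [x(t), x(t - tau), ..., x(t - (dim - 1) * tau)] of Eq. (7)."""
    x = np.asarray(x, dtype=float)
    n = len(x) - (dim - 1) * tau                  # number of complete delay vectors
    return np.column_stack([x[(dim - 1 - k) * tau:(dim - 1 - k) * tau + n]
                            for k in range(dim)])

# e.g. delay_embed(x1, dim=5, tau=1), matching d1 = d2 = 5 and tau1 = tau2 = 1 used later
```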
Similar to F1→2 , G1→2 ( f ), and I1→2 , causal effects in the other direction, namely,
F2→1 , G2→1 ( f ), and I2→1 , can also be estimated. The difference between the two directional measures may indicate the directionality of dominant coupling between x1 (t) and x2 (t). Thus,
causality can be inferred from the standard Granger causality

Δ F = F2→1 − F1→2, (8)

spectral Granger causality

Δ G( f ) = G2→1 ( f ) − G1→2 ( f ), (9)

as a function of frequency f , and information transfer

Δ I(δ ) = I2→1 (δ ) − I1→2(δ ), (10)



as a function of time lag δ . If these measures are positive, the directionality of dom-
inant coupling is reconstructed as x2 (t) → x1 (t), and x1 (t) → x2 (t) if negative. Note
that these net measures will report no causality in a case of symmetric bidirectional
systems.

4 Phase Synchronization and Phase Delays


In the case of a coupled non-linear system with a possible time delay in coupling,
time delay, in general, cannot be converted into phase delay without ambiguity. We
will control the parameters of coupling to show that, given the same directionality
of coupling, as specified by the underlying model, either phase delay or phase lead
between the driver and the response can be observed. In turn, this phase lag affects
the estimation of Granger statistics.
The phase shift φ12 ( f ) between two signals x1 (t) and x2 (t) at a specific frequency
f can be computed from the cross spectrum Γ12 ( f ):

\[
\Gamma_{12}(f) \sim e^{\,i\,\phi_{12}(f)}. \qquad (11)
\]

Suppose that there are n realizations of the processes x1 (t) and x2 (t), and for each
realization k = 1, ..., n, the phase shift φ12^(k) ( f ) is computed. Relative stability of the
phase difference φ12^(k) ( f ) across realizations quantifies the degree of phase-locking
between two signals at a given frequency:

\[
R_{12}(f) = \left| \frac{1}{n} \sum_{k=1}^{n} e^{\,i\,\phi_{12}^{(k)}(f)} \right|. \qquad (12)
\]

By construction, the statistic R12 ( f ) is limited between 0 and 1. When the relative
phase distribution is concentrated around the mean, R12 ( f ) is close to one, whereas
phase scattering will result in a random distribution of phases and R12 ( f ) close to
zero.
The mean phase delay φ 12 ( f ) between two signals can also be computed by av-
eraging across the realizations. However, there is an ambiguity in cumulative phase
shift between harmonic signals as, in general, it is not known how many cycles the
phase completed. In this book chapter, the phase difference φ 12 ( f ) between −90◦
and 0◦ implies that the signal x1 (t) (response) is phase delayed with respect to x2 (t)
(driver) at frequency f , and vice versa.
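A sketch of estimating the phase shift of Eq. (11) and the phase-locking statistic R12(f) of Eq. (12) from an ensemble of realizations is given below. It uses a plain FFT-based cross-spectrum per realization; Welch averaging, wavelet or Hilbert-transform phases would be equally valid choices, and the sign of the recovered phase depends on the cross-spectrum convention adopted.

```python
import numpy as np

def phase_locking(x1_trials, x2_trials, fs):
    """Phase-locking index R12(f) and mean phase shift (degrees) across realizations.

    x1_trials, x2_trials: arrays of shape (n_realizations, n_samples).
    """
    n_samples = x1_trials.shape[1]
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
    X1 = np.fft.rfft(x1_trials, axis=1)
    X2 = np.fft.rfft(x2_trials, axis=1)
    cross = X1 * np.conj(X2)                      # cross-spectrum per realization, Eq. (11)
    phasors = np.exp(1j * np.angle(cross))        # unit phasors exp(i * phi_12^(k)(f))
    mean_phasor = phasors.mean(axis=0)
    R12 = np.abs(mean_phasor)                     # Eq. (12), between 0 and 1
    mean_phase_deg = np.degrees(np.angle(mean_phasor))
    return freqs, R12, mean_phase_deg
```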

5 Causality and Phase Differences: Three Scenarios


Now we consider three scenarios, showing an interplay between causality estima-
tion and observed phase differences. All simulations were based on the model (1)
with the directionality of coupling x2 → x1 . The generated signals were designed

to represent oscillations approximately at 10 Hz. Gaussian noise was added to the


signals.
Each scenario was characterized by a pair of the model parameters, the coupling
strength ε and the time delay in coupling T . For a given pair of ε and T , 50 real-
izations of the model (1) were generated. For each realization, a corresponding pair
of surrogate time series were also generated. Surrogate signals are artificial data
that mimic some properties of the original data. For example, some linear proper-
ties of the original signals remain unchanged, but non-linear characteristics can be
destroyed. We generated surrogate signals according to a method designed to test
pseudo-periodic data [30]. Specifically the surrogates were generated by preserv-
ing the large scale behavior of the data (the periodic structure), and destroying any
additional small scale structure. Thus, for each pair of ε and T , two ensembles of
the original and surrogate time series were created. Then, three measures of causal-
ity were computed. The spectral causality Δ G( f ) was estimated for the frequencies
1 − 25 Hz. The net information transfer Δ I(δ ) was estimated for the time lags δ =1-
51, with τ1 = τ2 = 1 and d1 = d2 = 5 (see also the chapter by Vicente and Wibral in
this volume for details on different methods for estimating the transfer entropy). In
addition, cumulative performance of Δ G( f ) and Δ I(δ ) was computed by averaging
these statistics across a range of f = 1 − 25 Hz and δ =1-51, respectively. Finally,
for each ensemble of the original and surrogate signals, the phase-locking index and
mean phase shift, R12 ( f ) and φ 12 ( f ), were computed at the frequencies f = 1 − 25
Hz. More details on the estimation of phase-locking and causal measures can be
found in [35].
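The surrogate-based significance assessment can be sketched generically: for any net causal statistic, the 5%- and 95%-tails of its surrogate distribution define the acceptance band shown as the grey areas in the figures. The chapter uses pseudo-periodic surrogates [30]; the circular time-shift surrogate below is only a simple stand-in to make the procedure concrete, and granger_F refers to the earlier sketch.

```python
import numpy as np

def surrogate_interval(stat_fn, x1, x2, n_surrogates=50, seed=None):
    """5%- and 95%-tails of a net causal statistic under a simple surrogate null.

    stat_fn(x1, x2) -> float may be any net statistic, e.g. Delta F or Delta I.
    Circular time-shifting of x1 preserves each signal's own structure but
    destroys the temporal alignment between the two signals.
    """
    rng = np.random.default_rng(seed)
    null_values = [stat_fn(np.roll(x1, rng.integers(1, len(x1) - 1)), x2)
                   for _ in range(n_surrogates)]
    return np.percentile(null_values, [5, 95])

# Example with the earlier Granger sketch: Delta F = F_{2->1} - F_{1->2}
# band = surrogate_interval(lambda a, b: granger_F(b, a) - granger_F(a, b), x1, x2)
```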
Around f = 10 Hz, the signals x1 (t) and x2 (t) become phase-locked, with R12
close to 1 (see panels (c) in Fig 2-4). However, depending on a combination of the
coupling strength ε and the time delay in coupling T , the observed phase difference
φ 12 ( f ) at f = 10 Hz can be zero, positive, or negative. Remember that negative
φ 12 ( f ) between −90◦ and 0◦ implies that the response x1 (t) is phase delayed with
respect to the driver x2 (t) at frequency f . And vice versa, positive φ 12 ( f ) between
0◦ and 90◦ can be interpreted as a phase lead of the response x1 (t) with respect to
the driver x2 (t).
Phase differences can affect the estimation of the causal statistics. To show that
we consider, for the same directionality of coupling x2 → x1 , three combinations of
ε and T that correspond to three values of φ 12 ( f ) at f = 10Hz: (i) 0.1◦ (Fig. 1a
and 2d), (ii) −44.2◦ (Fig. 1b and 3d ), and (iii) 45.1◦ (Fig. 1c and 4d). Typical time
series, for each of the three scenarios, are shown in Fig. 1.
Figures 2-4 represent these three scenarios, showing an interplay between recon-
structed causality and phase differences. Specifically, each figure shows: (a) spectral
Granger causality Δ G( f ), (c) phase locking index R12 ( f ) and (d) phase difference
φ 12 ( f ) as functions of frequency f , and (b) net information transfer Δ I(δ ) as a
function of the time lag δ . Solid lines represent the mean of the causal statistics,
computed from the original data and subsequently averaged across the realizations.
The limits of the dark grey area are defined by the 5%- and 95%-tails of the distri-
butions computed using the surrogate data.

[Figure 1 appears here: three stacked panels (A–C) plotting driver and response amplitude against time (0–2 s); the panel titles read "Phase shift = 0.4 degrees at 10 Hz", "Phase shift = −44.2 degrees at 10 Hz", and "Phase shift = 45.1 degrees at 10 Hz".]

Fig. 1 Typical time series generated by the system (1) in three scenarios: (A) phase difference
is close to zero at 10 Hz (ε = 0.07 and T = 0.1083); (B) negative phase shift (ε = 0.07 and
T = 0.1208); (C) positive phase shift (ε = 0.07 and T = 0.0958)

In the first scenario (Fig. 2), the parameters ε and T were chosen such that the
phase difference at 10Hz was close to zero. In such a case, the measure of Δ G( f )
is positive for frequencies 1 − 15Hz, reaching a peak around 11Hz (Fig. 2a). The
net information transfer Δ I(δ ) was also positive for all time lags δ . Positive values
for the causal statistics imply that the directionality of coupling is correctly recon-
structed as 2 → 1.
In the second scenario (Fig. 3), wherein the responding signal x1 (t) is phase de-
layed with respect to the driving x2 (t) (Fig. 1b), both Δ I(δ ) and Δ G( f ) are positive,
reaching a peak around 10Hz, also implying (correctly) the directionality of cou-
pling as 2 → 1. Note that the peak in Δ G( f ) at 10 Hz is higher in Fig. 3, compared
to that at 11Hz in Fig. 2, although the strength of coupling was the same. The time
precedence, as specified by the directionality of coupling from the model, concurs
with the phase precedence, as detected from the phase-locking analysis.

[Figure 2 appears here, "Case 1: no phase shift": panel A, spectral causality vs. frequency (Hz); panel B, transfer entropy vs. time lag (samples); panel C, phase-locking index vs. frequency (Hz); panel D, phase shift in degrees vs. frequency (Hz), titled "Phase shift = 0.4 degrees at 10 Hz".]

Fig. 2 Reconstructed causality and phase effects in the case where there is no phase shift
(φ 12 ( f ) = 0.4o for ε = 0.07 and T = 0.1083) at f = 10 Hz (see Fig. 1a): (A) spectral Granger
causality as a function of frequency; (B) net information transfer as a function of the time lag
δ ; (C) phase-locking index and (D) phase shift as functions of frequency. For the results
shown in Fig. 2-6, the embedding parameters, τ = 1 and d = 5, were kept the same, whereas
p was estimated according to Bayesian information criterion, separately for each pair of the
time series.

[Figure 3 appears here, "Case 2: negative phase shift": panel A, spectral causality vs. frequency (Hz); panel B, transfer entropy vs. time lag (samples); panel C, phase-locking index vs. frequency (Hz); panel D, phase shift in degrees vs. frequency (Hz), titled "Phase shift = −44.2 degrees at 10 Hz".]

Fig. 3 Reconstructed causality and phase effects in the case where the phase shift between
the driver and response is φ 12 ( f ) = −44.2o at f = 10 Hz for ε = 0.07 and T = 0.1208 (see
Fig. 1b): (A) spectral Granger causality as a function of frequency; (B) net information trans-
fer as a function of the time lag δ ; (C) phase-locking index and (D) phase shift as functions
of frequency

[Figure 4 appears here, "Case 3: positive phase shift": panel A, spectral causality vs. frequency (Hz); panel B, transfer entropy vs. time lag (samples); panel C, phase-locking index vs. frequency (Hz); panel D, phase shift in degrees vs. frequency (Hz), titled "Phase shift = 45.1 degrees at 10 Hz".]

Fig. 4 Performance of causal statistics and phase effects in the case of a positive phase dif-
ference (φ 12 ( f ) = 45.1◦ ) between them at f = 10 Hz for ε = 0.07 and T = 0.0958 (see
Fig. 1c): (A) spectral Granger causality as a function of frequency; (B) net information trans-
fer as a function of the time lag δ ; (C) phase-locking index and (D) phase shift as functions
of frequency

Fig. 4 represents the third scenario wherein the effects associated with phase
precedence counteract the effects related to the causal relations as implemented in
system (1). Specifically, in this case, the driver x2 (t) is phase delayed with respect
to the response x1 (t), with φ 12 ( f ) = 45◦ at f = 10 Hz. The causal effects related
to the phase shift are relatively strong compared to the inherent causality between
x1 (t) and x2 (t). The spectral Granger causality switches to negative values, implying
that the causal relations are spuriously reconstructed as 1 → 2. The net information
transfer is also sensitive to the phase shift, being either positive or negative, depend-
ing on the value of the time lag δ . It should be noted that Δ I(δ ) is more resistant
to the phase-locking effects, as the mean value Δ I(δ ) averaged across δ is positive
(2 → 1).
Notably, the performance of the standard Granger causality was similar to that of
the spectral Granger causality. When there was no phase shift at 10Hz, the mean Δ F
averaged across the realizations was 0.0168, whereas the confidence interval of Δ F
based on the corresponding surrogate data and defined by the 5%- and 95%-tails,
was [−0.0066 0.0059]. In the case of φ 12 ( f ) close to 45o at 10Hz, Δ F = 0.0574
with the confidence interval [−0.0060 0.0055] based on the surrogate data. How-
ever, when φ 12 ( f ) is about −44o , the analysis produced Δ F = −0.0125, whereas
the confidence interval for surrogate data was [−0.0050 0.0048]. Thus, the stan-
dard Granger causal statistic was significantly affected by the differences in phase
between the two signals.

6 Influence of the Parameters of Coupling on Causality and Phase Delays
In the previous section, we showed how the effects related to phase advance or
phase delay can facilitate or counteract the reconstruction of causal relations. In this
section, we focus on the aggregated performance of the causal measures Δ G and Δ I,
averaged across frequencies f = 1 − 25 Hz and time lags δ = 1 − 51, respectively.
The measures Δ G and Δ I as well as Δ F are considered functions of the time delay T
or the strength of coupling ε (Fig. 5 and 6). The solid lines represent the mean values
of Δ F, Δ G and Δ I, computed for the original data and averaged across realizations.
The dark grey area reflects the variability of Δ F, Δ G, and Δ I, computed for the
surrogate data (5%- and 95%-tails of the corresponding distributions).
Fig. 5 is based on the simulations wherein the coupling strength was kept con-
stant, whereas the time delay T covered the range from 0.060 to 0.148. In this case,
the phase difference at 10Hz covers the entire period from −180o to 180o (Fig. 5d).
In turn, the standard and spectral Granger statistic, Δ F and Δ G, as well as the infor-
mation transfer are plotted as the functions of φ 12 at 10 Hz. For all T , Δ I is positive,
correctly reconstructing the coupling as 2 → 1. In contrast to Δ I, both Δ F and Δ G
produced false-positive results (1 → 2) for the phase differences approximately be-
tween 10o and 100o , which belong to the scenario wherein the driver x2 (t) is phase
delayed with respect to the response x1 (t).

[Figure 5 appears here, "Effects of time delay in coupling": panels A–C plot standard Granger causality, spectral causality, and transfer entropy against the phase shift (degrees); panel D plots the time delay in coupling against the phase shift (degrees).]

Fig. 5 Influence of time delay in coupling on: (A) standard Granger causality; (B) spectral
Granger causality, and (C) net information transfer as functions of the observed phase difference at 10 Hz; (D) phase difference at 10 Hz as a function of the time delay in coupling,
provided that the strength of coupling was unchanged (ε = 0.07)

The phase shift, at the frequency when the signals become phase-locked to each
other, depends not only on the time delay in coupling T , but also on the coupling
strength ε . Fig. 6 is based on the simulations wherein T was kept constant, whereas
ε varied from 0 to 0.1. The statistics Δ F, Δ G and Δ I as well as the phase difference
φ 12 estimated at 10Hz are shown as the functions of ε . As can be seen, φ 12 at 10Hz
can be either positive (phase delay) or negative (phase lead of the driver x2 (t) with
respect to the response x1 (t)).
Notably, the information-theoretic statistic Δ I is a monotonic function of ε as
can be seen in Fig. 6c (note, however, that for very high ε , when the driver and

[Figure 6 appears here, "Influence of coupling strength" (panel A scaled by 10^-3): panels A–C plot standard Granger causality, spectral causality, and transfer entropy against the coupling strength (0–0.1); panel D plots the phase shift (degrees) against the coupling strength.]

Fig. 6 Influence of the strength of coupling on: (A) standard Granger causality; (B) spectral causality, and (C) net information transfer, each as a function of the coupling strength; (D) phase difference at 10 Hz as a function of the coupling strength, with the time delay in coupling kept constant (T = 0.1083)

response are getting fully synchronized, Δ I might decay). Specifically, Δ I is able


to correctly reconstruct the causal relations for ε > 0.012, producing insignificant
values for small coupling strengths. At the same time, both standard and spectral
Granger statistics, Δ F and Δ G, are very sensitive to the phase delay when coupling
is weak. In particular, for ε = 0 and ε = 0.008 when we observe a phase delay of
5 − 7◦ of the driving x2 (t) with respect to the responding x1 (t), both Δ F and Δ G are
small, but statistically different from the null hypothesis (surrogate data). In other
words, the effects related to the phase locking and phase delay are relatively strong
compared to the effects associated with modeled causality. If the inherent causality
is relatively strong (for example, when the coupling strength ε is between 0.025 and

0.075, which also corresponds to the phase delay of x2 (t) with respect to x1 (t)),
the standard and spectral Granger statistics correctly identify the directionality of
coupling.

7 Information Content of the Observed Time Series


In the scenario wherein the phase delay of x2 (t) with respect to x1 (t) at the main frequency (10 Hz) was close to zero, both spectral and information-theoretic measures were able to correctly reconstruct the directionality of coupling. In other words, in a situation where the phase-related causal effects are minimized, there must exist at least one other mechanism which allows a causal statistic to correctly identify the driver-response relations. We hypothesize that one possible mechanism is based on detecting the differences in the information content (complexity) of the signals under investigation. Previously, it was found that asymmetries in interdependency may reflect the different degrees of complexity of two systems at the scales to which the observed measures are most sensitive [1, 24].
Intuitively, this can be understood if we first consider two uncoupled systems,
each system being characterized by its own information content or complexity of
dynamics. When we turn on a unidirectional coupling, the signal variability (com-
plexity) of the responding system should reflect not only its own information con-
tent but also include some of the variability (complexity) of the driving system.
Thus, in general, the complexity of the response would be higher than that of the
driver. Vakorin et al. [34] studied the generation and transfer of information in the
system (1). Local generation of information was quantified with a method known
as multi-scale entropy [4, 32]. In turn, transfer entropy (6) was used to infer the di-
rectionality of coupling between the subsystems. The net information transfer was
correlated with the differences in the signal complexity between the two systems.
Various statistics quantifying signal variability based on the presence of non-linear deterministic effects have been developed to compare time series. Among others,
sample entropy was designed as a measure of signal regularity [25]. The sample
entropy was proposed as a refined version of approximate entropy [21]. In turn,
approximate entropy was devised as an attempt to estimate Kolmogorov entropy
[12], the rate of information generated by a dynamic system, from noisy and short
time series of clinical data.
For estimating sample entropy of time series xt , two multi-dimensional repre-
sentations of xt are used, as defined, according to (7), by two sets of embedding
parameters: {d, τ } and {d + 1, τ }. Sample entropy can be estimated in terms of the
average natural logarithm of conditional probability that two delay vectors (points in
a multi-dimensional state-space), which are close in the d-dimensional space (mean-
ing that the distance between them is less than the scale length r), will remain close
in the (d + 1)-dimensional space. A greater likelihood of remaining close results in
smaller values for the sample entropy statistic, indicating fewer irregularities. Con-
versely, higher values are associated with the signals having more variability and
less regular patterns in their representations.
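The verbal definition above translates into the following sketch of sample entropy. It is a simplified variant (full pairwise distance matrices, all available template vectors at each length, so it is quadratic in the series length and only suitable for short series), and the tolerance r as a fraction of the standard deviation is an assumed convention, since the chapter does not state the value used.

```python
import numpy as np

def sample_entropy(x, d=2, tau=1, r=None):
    """Sample entropy: -ln(A / B), where B counts pairs of d-dimensional delay vectors
    within Chebyshev distance r and A counts pairs still within r in d + 1 dimensions."""
    x = np.asarray(x, dtype=float)
    if r is None:
        r = 0.15 * np.std(x)                      # assumed tolerance, a common convention

    def match_count(dim):
        n = len(x) - (dim - 1) * tau
        vecs = np.column_stack([x[k * tau:k * tau + n] for k in range(dim)])
        dist = np.max(np.abs(vecs[:, None, :] - vecs[None, :, :]), axis=2)
        return np.sum(dist <= r) - n              # exclude self-matches on the diagonal

    B = match_count(d)                            # pairs close in d dimensions
    A = match_count(d + 1)                        # pairs still close in d + 1 dimensions
    return -np.log(A / B)
```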

Multi-scale entropy (MSE) was proposed to estimate sample entropy of finite


time series at different time scales [38, 4]. First, multiple coarse-grained time series
y(ζ ) are constructed from the original signal x(t) = {x1 , ..., xi−1 , xi , xi+1 ..., xn }. This
is performed by averaging the data points from the original time series within non-
overlapping windows of increasing length. Specifically, the amplitude of the coarse-
grained time series y(θ ) (ζ ) at time scale θ is calculated according to

\[
y^{(\theta)}(\zeta) = \frac{1}{\theta} \sum_{i=(\zeta-1)\theta + 1}^{\zeta\theta} x_i, \qquad 1 \le \zeta \le n/\theta, \qquad (13)
\]

wherein the fluctuations at scales smaller than θ are eliminated. The window length,
measured in data points, represents the scale factor, θ = 1, 2, 3, .... Note that θ = 1
represents the original time series, whereas relatively large θ produces a smooth
signal, containing basically low frequency components of the original signal. To
obtain the MSE curve, sample entropy is computed for each coarse-grained time
series.
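The coarse-graining of Eq. (13) and the resulting MSE curve can be sketched as follows, reusing the sample_entropy function from the previous sketch; the commented example at the end mirrors the fine-scale (1–5) and coarse-scale (16–20) averaging used later in the chapter, with choices that are otherwise illustrative.

```python
import numpy as np

def coarse_grain(x, theta):
    """Coarse-grained series y^(theta) of Eq. (13): averages over non-overlapping windows."""
    x = np.asarray(x, dtype=float)
    n_windows = len(x) // theta
    return x[:n_windows * theta].reshape(n_windows, theta).mean(axis=1)

def mse_curve(x, max_scale=20, d=2, tau=1):
    """Multi-scale entropy: sample entropy of each coarse-grained series, theta = 1..max_scale."""
    return np.array([sample_entropy(coarse_grain(x, theta), d=d, tau=tau)
                     for theta in range(1, max_scale + 1)])

# Example: fine- and coarse-scale complexity differences (driver x2 minus response x1)
# fine_diff   = mse_curve(x2)[:5].mean()    - mse_curve(x1)[:5].mean()
# coarse_diff = mse_curve(x2)[15:20].mean() - mse_curve(x1)[15:20].mean()
```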

8 Directionality of Coupling and Differences in Complexity


Now we will consider the net information transfer (transfer entropy) as a function
of the difference in the complexity between the two signals, computed at fine and
coarse time scales. Similar to what was done above, for a given pair of ε and T , an
ensemble of the signals x1 (t) and x2 (t) was generated. Net transfer entropy Δ I was
obtained by averaging across the time lags δ and realizations. The complexity at
fine time scales was estimated by averaging the sample entropy across the first five
scale factors, whereas the variability of coarse-grained time series was computed
by averaging the sample entropy across the last five time scales (scales 16 − 20 in
this example). Note that sample entropy is sensitive to both linear stochastic and
non-linear deterministic effects. The effects of additive noise at fine time scales were
much stronger, compared to the coarse scales. In the coarse-grained time series,
the noise was filtered out according to (13), and the deterministic non-linear effects
were more pronounced. Thus, for a given combination of ε and T , we obtained an
estimate of: (i) net transfer entropy Δ I, (ii) difference in the complexity (sample
entropy) between the driver x2 (t) and the response x1 (t) at fine time scales, and (iii)
difference in the complexity between x1 (t) and x2 (t) at coarse time scales.
First, we considered the influence of the time delay, T , varied on some interval,
with the coupling parameter ε fixed. The effects of its variability on complexity
and information exchange are shown in Fig. 7, namely, net transfer entropy (a),
differences in sample entropy at fine (b) and coarse time scales (d) as functions
of the time delay T . Note that, in dealing with real data, such relations cannot be
observed as typically the true values of T are not known (see, however, the chapter
by Wibral in this volume as well as [23, 28, 37] for attempts in recovering time
delays in coupling). What we can observe is the relations between the net transfer
entropy and the differences in sample entropy (Fig. 7c and Fig. 7e). In Fig. 7c,

[Figure 7 appears here: panel (a), net transfer entropy vs. time delay T; panels (b) and (d), difference in sample entropy at fine and coarse scales vs. time delay T; panels (c) and (e), the same differences vs. net transfer entropy.]

Fig. 7 Influence of time delay in coupling T on the differences in complexity (sample entropy) between the driver and response, and on the information transfer: (A) net information transfer, (B) difference in complexity measured at the fine time scales (scales 1–5), and (D) difference in complexity at the coarse time scales (scales 16–20), each as a function of T; (C) difference in complexity at the fine time scales and (E) difference in complexity at the coarse time scales (r = −0.08, not significant) as functions of the net information transfer. The positive correlation r = 0.73 (p-value < 0.0001) in panel C implies that at the fine scales, the signal complexity of the driver is higher than that of the response.

there exists a relatively strong and robust linear correlation between the information
transfer and differences in complexity at fine time scales (r = 0.73, p-value< 0.001).
Positive r implies that a system with higher variability at fine time scales can better
predict the behavior of a system with lower variability, than the other way around.
At the same time, the correlation between the information transfer and differences
at coarse time scales (Fig. 7e) is close to zero.

[Figure 8 appears here: panel (a), net transfer entropy vs. coupling parameter ε; panels (b) and (d), difference in sample entropy at fine and coarse scales vs. ε; panels (c) and (e), the same differences vs. net transfer entropy.]

Fig. 8 Influence of the strength of coupling ε on the differences in complexity (sample entropy) between the driver x2 (t) and the response x1 (t) in (1) and on the information transfer: (A) net information transfer, (B) difference in complexity measured at the fine time scales, and (D) difference in complexity at the coarse time scales, each as a function of ε; (C) difference in complexity at the fine time scales (r = 0.73, p-value < 0.0001) and (E) difference in complexity at the coarse time scales as functions of the net information transfer. Note the negative correlation between the two statistics in panel E: the dominant amount of information transferred from the system with lower complexity (the driver x2 (t)) to the system with higher complexity (the response x1 (t)) is a monotonic function of the difference in their signal complexity at the time scales that are sensitive to non-linear deterministic effects.

Second, we considered the influence of the strength of coupling ε on the relations


between the information transfer and differences in complexity, keeping T constant.
Fig. 8 shows net transfer entropy (a), differences in sample entropy at fine (b) and
coarse time scales (d) as functions of ε . The influence of ε on the differences in
fine-grained sample entropy was ambiguous, as shown in Fig. 8b and c. What is
interesting is that both the net transfer entropy and the difference in coarse-grained
sample entropy were found to be monotonically increasing and decreasing functions
of ε , respectively (Fig. 8a and d). That is, the stronger the coupling, the larger the
differences in the complexity of the observed signals. In turn, this led to the nega-
tive correlation between the complexity difference at coarse scales and net transfer
entropy (Fig. 8e). This negative correlation supports the conclusion that at coarse
time scales when the focus is on the deterministic non-linear effects, other condi-
tions being equal, the driver is characterized as a subsystem with lower complexity,
in comparison to the response.

9 Conclusion
We considered two effects that can contribute to the reconstruction of the driver-
response relations in coupled non-linear systems. The first effect reflects the idea
that the difference in complexity between the driver and the response is associated
with the dominant transfer of information. Specifically, the causality can be viewed
as a transfer of information from one system to another, which increases the signal
complexity of the individual subsystems as the information propagates along the
network. The time scales at which the complexity is computed are a critical factor. In
our example, at the coarse time scales as used in the multi-scale entropy estimation,
the difference in the complexity between the two coupled subsystems was propor-
tional to the strength of coupling. This suggests that it is the coarse scales that reflect
non-linear deterministic effects for the system (1). In addition, the net information
transfer was a monotonic function of the coupling strength. Thus, the propagation
of information, which is the basis for causality reconstruction, in general, induces
an increase of signal complexity at the time scales that reflect the deterministic ef-
fects underlying the observed time series. Expanding a model of two sources to a
larger network, this accumulated complexity may clarify the topological roles of
individual nodes in this network [16].
The second effect is based on the existence of possible phase differences be-
tween the driver and response at specific frequencies. This effect can either in-
tensify or counteract the causality effects considered as the propagation of com-
plexity. Depending on the strength of the effects associated with phase differences,
the complexity-related causal effects can be partly neutralized or even totally sup-
pressed. This can be explicitly observed in the scenarios wherein signals become
phase-locked to each other at some frequencies. In turn, this could have a dominant
influence on estimated causal statistics.
In our example, we considered the role of phase shifts in the context of non-linear
coupled systems, in contrast to the case of linear time-invariant systems. In the latter

scenario, the spectrum of the signal is not limited to a single harmonic component
but spans several frequencies. In the frequency domain, the slope of the phase (group
delay) produces an estimate for the time delay between the signals, which may be
used to solve the ambiguity of phase differences at a specific frequency [9]. There
exists a causal measure that explicitly exploits the cumulative phase delay as the
basis for causality [17]. However, as our examples show, in the case of non-linear
interactions, we should expect that such an approach may lead to spurious results.
In general, we found that all the statistics tested in this study were sensitive to
phase differences. However, in the situation wherein the driver was phase delayed
with respect to the response with φ 12 ( f ) approximately between 0◦ and 90◦ , both
the standard and spectral measures produced statistically significant, but spurious re-
sults. On the contrary, the information-theoretic measure performed reasonably well
in the same situations, correctly reconstructing the underlying relations as specified
by the model.
The spectral Granger statistic explicitly depends on the phase differences be-
tween harmonic components of tested signals, and the contribution from specific
frequencies can be intensified by the mechanism of phase-locking. In some sense,
inferring the directionality of coupling at a specific frequency can be viewed as an
extreme case of filtering the signals with a narrow band-pass filter. On the contrary,
we should expect that causality is ultimately based on interactions between differ-
ent frequency components. Florin et al. [6] explored the effects of different filtering techniques
on the performance of several causality measures. They found that, without strong
assumptions about the artifacts to be removed, filtering disturbs the information
content and leads to missed or spurious results.
Finally, the information transfer outperformed the standard Granger statistic, al-
though both measures work in the time domain. We believe that a critical difference
between the standard and non-linear versions of the causality lies in averaging the
causality measures across the length of forecast horizon, that is, across the param-
eter δ . As can be seen from the model (2), only one specific δ , namely, δ = 1, is
used for estimating the standard Granger measure. At the same time, the common
practice for computing transfer entropy is to average it across some range of the
lags δ . Originally this was proposed in [20] with the idea to decrease the variability
of estimated statistics and to increase the robustness of the results. The time lag δ
may affect the phase difference between the future and the past of the same signal.
In other words, δ = 1 may not be optimal. If the range of δ is relatively large to
cover the entire period of the characteristic scales of the signal dynamics, averaging
across δ would smooth out the phase effects.

Acknowledgments. This research was supported by research grants from the J.S. McDonnell
Foundation to Dr. Anthony R. McIntosh. We thank Maria Tassopoulos-Karachalios for her
assistance in preparing this manuscript.

References
1. Arnhold, J., Grassberger, P., Lehnertz, K., Elger, C.E.: A robust method for detecting in-
terdependences: application to intracranially recorded EEG. Physica D: Nonlinear Phe-
nomena 134(4), 419–430 (1999)
2. Buzsaki, G.: Rhythms of the brain. Oxford University Press, New York (2006)
3. Chavez, M., Martinerie, J., Le Van Quyen, M.: Statistical assessment of nonlinear causal-
ity: application to epileptic eeg signals. J. Neurosci. Methods 124(2), 113–128 (2003)
4. Costa, M., Goldberger, A.L., Peng, C.K.: Multiscale entropy analysis of physiologic time
series. Phys. Rev. Lett. 89, 062102 (2002)
5. Deco, G., Jirsa, V., McIntosh, A.R., Sporns, O., Kötter, R.: Key role of coupling, delay,
and noise in resting brain fluctuations. Proceedings of the National Academy of Sci-
ences 106(25), 10302–10307 (2009)
6. Florin, E., Gross, J., Pfeifer, J., Fink, G.R., Timmermann, L.: The effect of filtering on
Granger causality based multivariate causality measures. Neuroimage 50(2), 577–578
(2010)
7. Geweke, J.: Measurement of linear dependence and feedback between multiple time se-
ries. Journal of the American Statistical Association 7, 304–313 (1982)
8. Ghosh, A., Rho, Y., McIntosh, A.R., Kötter, R., Jirsa, V.: Cortical network dynamics with
time delays reveals functional connectivity in the resting brain. Cognitive Neurodynam-
ics 2(2), 115–120 (2008)
9. Gotman, J.: Measurement of small time differences between EEG channels: method and
application to epileptic seizure propagation. Electroenceph. Clin. Neurophysiol. 56, 501–
514 (1983)
10. Gourévitch, B., Le Bouquin-Jeannès, R., Faucon, G.: Linear and nonlinear causality be-
tween signals: methods, examples and neurophysiological applications. Biological Cy-
bernetics 95(4), 349–369 (2007)
11. Granger, C.W.J.: Investigating causal relations by econometric models and cross spectral
methods. Econometrica 37, 428–438 (1969)
12. Grassberger, P., Procaccia, I.: Estimation of the Kolmogorov entropy from a chaotic sig-
nal. Phys. Rev. A 28, 2591–2593 (1983)
13. Hadjipapas, A., Casagrande, E., Nevado, A., Barnes, G.R., Green, G., Holliday, I.E.: Can
we observe collective neuronal activity from macroscopic aggregate signals? NeuroIm-
age 44(4), 1290–1303 (2009)
14. Haken, H.: Principles of brain functioning. Springer (1996)
15. Kamiński, M., Ding, M., Truccolo, W.A., Bressler, S.L.: Evaluating causal relations in
neural systems: Granger causality, directed transfer function and statistical assessment
of significance. Biological Cybernetics 85, 145–157 (2001)
16. Mišić, B., Vakorin, V., Paus, T., McIntosh, A.R.: Functional embedding predicts the vari-
ability of neural activity. Frontiers in Systems Neuroscience 5, 90 (2011)
17. Nolte, G., Ziehe, A., Nikulin, V.V., Brismar, T., Müller, K.R., Schlögl, A., Krämer, N.:
Robustly estimating the flow direction of information in complex physical systems. Phys.
Rev. Lett. 100(23), 234101 (2008)
18. Nunez, P.L.: Neocortical dynamics and human brain rhythms. Oxford University Press
(1995)
19. Paluš, M., Vejmelka, M.: Directionality of coupling from bivariate time series: How to
avoid false causalities and missed connections. Phys. Rev. E 75, 056211 (2007)
20. Paluš, M., Komárek, V., Hrnčíř, Z., Štěrbová, K.: Synchronization as adjustment of information rates: Detection from bivariate time series. Phys. Rev. E 63, 046211 (2001)

21. Pincus, S.M.: Approximate entropy as a measure of system complexity. Proc. Natl. Acad.
Sci. USA 88, 2297–2301 (1991)
22. Prichard, D., Theiler, J.: Generalized redundancies for time series analysis. Physica
D 84, 476–493 (1995)
23. Prokhorov, M.D., Ponomarenko, V.I.: Estimation of coupling between time-delay sys-
tems from time series. Physical Review E 72(1), 016210 (2005)
24. Quiroga, R.Q., Arnhold, J., Grassberger, P.: Learning driver-response relationships from
synchronization patterns. Phys. Rev. E 61, 5142–5148 (2000)
25. Richman, J.S., Moorman, J.R.: Physiological time-series analysis using approximate en-
tropy and sample entropy. Am. J. Physiol. Heart. Circ. Physiol. 278(6), H2039–H2049
(2000)
26. Schreiber, T.: Measuring information transfer. Phys. Rev. Letters 85(2), 461–464 (2000)
27. Schwarz, G.: Estimating the dimension of a model. The Annals of Statistics 6(2), 461–
464 (1978)
28. Silchenko, A.N., Adamchic, I., Pawelczyk, N., Hauptmann, C., Maarouf, M., Sturm, V.,
Tass, P.A.: Data-driven approach to the estimation of connectivity and time delays in the
coupling of interacting neuronal subsystems. Journal of Neuroscience Methods 191(1),
32–44 (2010)
29. Singer, W.: Neuronal synchrony: A versatile code for the definition of relations? Neu-
ron 24, 49–65 (1999)
30. Small, M., Tse, C.K.: Applying the method of surrogate data to cyclic time series. Phys-
ica D 164, 187–201 (2002)
31. Takens, F.: Detecting strange attractors in turbulence. In: Dynamical Systems and Tur-
bulence. Lecture Notes in Mathematics, vol. 898. Springer (1981)
32. Vakorin, V.A., McIntosh, A.R.: Mapping the multi-scale information content of complex
brain signals. In: Principles of Brain Dynamics: Global State Interactions, pp. 183–208.
The MIT Press (2012)
33. Vakorin, V.A., Krakovska, O.A., McIntosh, A.R.: Confounding effects of indirect con-
nections on causality estimation. Journal of Neuroscience Methods 184(1), 152–160
(2009)
34. Vakorin, V.A., Mišić, B., Krakovska, O., McIntosh, A.R.: Empirical and theoretical as-
pects of generation and transfer of information in a neuromagnetic source network. Fron-
tiers in Systems Neuroscience 5(96), 00096 (2012)
35. Vakorin, V.A., Mišić, B., Krakovska, O., Bezgin, G., McIntosh, A.R.: Confounding ef-
fects of phase delays on causality estimation. PLoS One 8(1), e5358 (2013)
36. Varela, F., Lachaux, J.P., Rodriguez, E., Martinerie, J.: The brainweb: phase synchro-
nization and large-scale integration. Nature Reviews Neuroscience 2(4), 229–239 (2001)
37. Vicente, R., Wibral, M., Lindner, M., Pipa, G.: Transfer entropy - a model-free measure of effective connectivity for the neurosciences. J. Comput. Neurosci. 30(1), 45–67 (2011)
38. Zhang, Y.-C.: Complexity and 1/f noise. A phase space approach. J. Phys. I France 1
(1991)
Part III
Recent Advances in the Analysis of
Information Processing

This part introduces recent advances in the analysis of information processing using information-theoretic methods. The chapter by Lizier provides an important extension of current techniques for analyzing information processing by investigating information transfer on a local scale in space and time, and by also providing a novel measure of information storage at this local scale, which is another highly important aspect of information processing in neural and other complex systems. The finishing chapter by Chicharro takes up the discussion around causality presented in the opening chapter and points out how to differentiate between criteria for causal inference and the measures used to test them. He further introduces new criteria that complete a unified picture of how the different approaches to causality are related.
Measuring the Dynamics of Information
Processing on a Local Scale in Time and Space

Joseph T. Lizier

Abstract. Studies of how information is processed in natural systems, in particular


in nervous systems, are rapidly gaining attention. Less known however is that the lo-
cal dynamics of such information processing in space and time can be measured. In
this chapter, we review the mathematics of how to measure local entropy and mutual
information values at specific observations of time-series processes. We then review
how these techniques are used to construct measures of local information storage
and transfer within a distributed system, and we describe how these measures can
reveal much more intricate details about the dynamics of complex systems than their
more well-known “average” measures do. This is done by examining their applica-
tion to cellular automata, a classic complex system, where these local information
profiles have provided quantitative evidence for long-held conjectures regarding the
information transfer and processing role of gliders and glider collisions. Finally, we
describe the outlook in anticipating the broad application of these local measures of
information processing in computational neuroscience.
1 Introduction
Analysis of directed information transfer between variables in time-series brain
imaging data and models is currently gaining much attention in neuroscience. Mea-
sures of information transfer have been computed, for example, in fMRI measure-
ments in the human visual cortex between average signals at the regional level [38]
and between individual voxels [8], as well as between brain areas of macaques from
local field potential (LFP) time-series [48]. A particularly popular topic in this do-
main is the use of information transfer measures to infer effective network connec-
tivity between variables in brain-imaging data [39, 91, 49, 88, 69, 54, 63], as well
as studying modulation of connection strength with respect to an underlying task
Joseph T. Lizier
CSIRO Computational Informatics, Marsfield, Australia
e-mail: joseph.lizier@csiro.au

[94]. Furthermore, measures of information transfer are used to reveal differences
between healthy and diseased states in neural data (e.g. for EEG measurements of
epilepsy patients in [10]) and in models (e.g. for Parkinson’s disease in [43]).
Much of this work quantifies information transfer from a source variable to a
target variable using the information-theoretic measure known as the transfer en-
tropy [82], or its equivalent under linear-Gaussian conditions, the Granger causality
[28]. This information-theoretic approach to studying directed interactions in neural
systems can be viewed as part of a more broad effort to study distributed compu-
tation in complex systems in terms of how information is stored, transferred and
modified (e.g. [59, 60, 62]). The approach is highly appropriate in computational
neuroscience, and indeed for complex systems in general, because:
• these concepts of computation are meaningful and well-understood (e.g. infor-
mation transfer as reflecting directed coupling between two variables, informa-
tion storage as predictability or structure in a time-series process);
• the quantities measured (e.g. transfer entropy for measuring information transfer)
are well-defined and can be measured on any type of time-series data (continuous
or discrete-valued);
• the quantities are at heart model-free (in contrast to the Granger causality lineari-
sation)1 and detect non-linear interactions and structure; and
• distributed computation is the language in which dynamics are often described
in neuroscience (e.g. “the brain represents and processes information in a dis-
tributed fashion and in a dynamical way” [27]) and complex systems in general
(e.g. claims that small-world structures have “maximum capability to store, pro-
cess and transfer information” [42]).
Now, such work on distributed computation to date typically focuses on the
(time) average information transfer, which is how the transfer entropy and other
information-theoretic measures are traditionally defined. Yet the dynamics of trans-
fer from a source to a target can also be quantified at individual observations or
configurations of the variables using the local transfer entropy [59]. Such local mea-
sures can be defined for any traditional information-theoretic variable, including for
related measures of information storage and processing (e.g. [62]). To be explicit,
local information-theoretic measures characterise the information attributed with
specific measurements x and y of variables X and Y , rather than the average infor-
mation associated with these variables.
This local perspective can reveal dynamical structure that the average cannot.
Applied to time-series data, local measures tell us about the dynamics of information
in the system, since they vary with the specific observations in time, and local values
are known to reveal more details about the system than the averages alone [16, 83,
84]. To be specific, a measured average of transfer entropy does not tell us about
how the directed relationship between two variables fluctuates through time, how
1 This also contrasts with dynamic causal modeling, a model-based approach that compares
a set of a priori defined neural models and tests how well they explain the experimental
data [25].
different specific source states may be more predictive of a target than other states, or
how coupling strength may relate to changing underlying experimental conditions.
Indeed, the ability to investigate time-series dynamics of distributed computation
in complex systems provides an important connection from information theory to
dynamical systems theory or non-linear time-series analysis (e.g. see [81, 41]). We
use the term information dynamics to describe the study of distributed computation
in complex systems in terms of how information is stored, transferred and modified
[59, 60, 62]. The word dynamics is a key component of this term, referring to both:
1. That we study the dynamic state updates of variables in the system, decompos-
ing information in the measurement of a variable in terms of information from
that variable’s own past (information storage), information from other variables
(information transfer) and how those information sources are combined (infor-
mation modification);
2. That we study local information-theoretic measures for each of these variables,
quantifying the dynamics of these operations in time and space.
In this chapter, we review how such local information-theoretic measurements
can be made, and describe how they are used to define local measures of informa-
tion storage and transfer in distributed computation in complex systems. We begin
by describing the relevant information-theoretic concepts in Sect. 2, before provid-
ing a detailed presentation of how local information-theoretic measures are defined
in Sect. 3. We then provide an overview of our framework for information dynam-
ics in Sect. 4, describing the measures used for information storage and transfer,
and how they can be localised within a system in space and time using the tech-
niques of Sect. 3. Next, we review in Sect. 5 the application of these local measures
of computation to cellular automata, a simple discrete dynamical model which is
known to exhibit complex behaviour and emergent coherent structures (known as
particles or gliders) resembling coherent waves in neural dynamics [27]. This appli-
cation demonstrates the utility of these local measures of information storage and
transfer, by providing key insights into the dynamics of cellular automata, includ-
ing demonstrating evidence for long-held conjectures regarding the computational
role of the emergent structures (e.g. gliders as information transfer entities). Most
importantly, the local measures are shown to provide insights into the dynamics
of information in the system that are simply not possible to obtain with traditional
averaged information-theoretic methods.
We finish the chapter by describing in Sect. 6 further such insights into the dy-
namics of information that have since been obtained with these local measures for
other systems. For example, the measures have revealed coherent information cas-
cades spreading across flocks (or swarms) [92] and in modular robots [57], in anal-
ogy to the aforementioned gliders in cellular automata. They have also demonstrated
the key role of information transfer in network synchronization processes, in partic-
ular in indicating when a synchronized state has been “computed” but not yet ob-
viously reached [9]. Just like the cellular automata examples, these demonstrate the
ability of local information dynamics to reveal how the computation in a system un-
folds in time, and the dynamics of how separate agents or entities interact to achieve
a collective task. Crucially, they allow one to answer meaningful questions about
the information processing in a system, in particular: “when and where is informa-
tion transferred in the brain during cognitive tasks?”, and we describe a preliminary
study where this precise question is explored using fMRI recordings during a but-
ton pressing task. As such, we demonstrate that local information dynamics enables
whole new lines of inquiry which were not previously possible in computational
neuroscience or other fields.

2 Information-Theoretic Preliminaries
To quantify the information dynamics of distributed computation, we first look to
information theory (e.g. see [85, 13, 65]) which has proven to be a useful framework
for the design and analysis of complex self-organized systems, e.g. [14, 77, 78, 66].
In this section, we give a brief overview of the fundamental quantities which will be
built on in exploring local information dynamics in the following sections.
The fundamental quantity of information theory is the Shannon entropy, which
represents the average uncertainty associated with any measurement x of a random
variable X (logarithms are taken by convention in base 2, giving units in bits):
H(X) = − ∑_x p(x) log2 p(x).   (1)
The uncertainty H(X) associated with such a measurement is equal to the informa-
tion required to predict it (see self-information below).
The Shannon entropy was originally derived following an axiomatic approach.
This is important because it gives primacy to desired properties over candidate mea-
sures, rather than retrospectively highlighting properties of an appealing candidate
measure. It shifts the focus of any arguments over the form of measures onto the
more formal ground of selecting which axioms should be satisfied. This is partic-
ularly useful where a set of accepted axioms can uniquely specify a measure (as
in the cases discussed here). We highlight the axiomatic approach here because it
has persisted in later developments in information theory, in particular for the local
measures we discuss in Sect. 3 (as well as more recently in debate over measures of
information redundancy [95, 35, 53]).
So, the Shannon entropy was derived as the unique formulation (up to the base
of the logarithm) satisfying a certain set of properties or axioms [85] (with property
labels following [76]):
• continuity with respect to the underlying probability distribution function p(x)
(PDF). This sensibly ensures that small changes in p(x) only lead to small
changes in H(X).
• monotony: being a monotonically increasing function of the number of choices
n for x when each choice xi is equally likely (with probability p(xi ) = 1/n). In
Shannon’s words, this is desirable because: “With equally likely events there is
more choice, or uncertainty, when there are more possible events” [85].
• grouping: “If a choice (can) be broken down into two successive choices, the
original H should be the weighted sum of the individual values of H” [85]. That
is to say, “H is independent of how the process is divided into parts” [76]. This
is crucial because the intrinsic uncertainty we measure for the process should not
depend on any subjectivity in how we divide up the stages of the process to be
examined.
Further, note that the Shannon entropy for a measurement can be interpreted as
the minimal average number of bits required to encode or describe its value without
losing information [65, 13].
The joint entropy of two random variables X and Y is a generalization to quan-
tify the uncertainty of their joint distribution:
H(X,Y) = − ∑_{x,y} p(x, y) log2 p(x, y).   (2)
The conditional entropy of X given Y is the average uncertainty that remains
about x when y is known:

H(X | Y) = − ∑_{x,y} p(x, y) log2 p(x | y).   (3)
The conditional entropy for a measurement of X can be interpreted as the minimal
average number of bits required to encode or describe its value without losing infor-
mation, given that the receiver of the encoding already knows the value of Y . The
previous quantities are related by the following chain rule:
H(X,Y) = H(X) + H(Y | X).   (4)
The mutual information (MI) between X and Y measures the average reduction
in uncertainty about x that results from learning the value of y, or vice versa:
I(X;Y) = ∑_{x,y} p(x, y) log2 [ p(x | y) / p(x) ]   (5)
       = H(X) − H(X | Y).   (6)
The MI is symmetric in the variables X and Y . The mutual information for measure-
ments of X and Y can be interpreted as the average number of bits saved in encoding
or describing X given that the receiver of the encoding already knows the value of Y ,
in comparison to the encoding of X without the knowledge of Y . These descriptions
of X with and without the value of Y are both minimal without losing information.
Note that one can compute the self-information I(X; X), which is the average in-
formation required to predict the value of X, and is equal to the uncertainty H(X)
associated with such a measurement.
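To make these average quantities concrete before we localise them, the following minimal Python sketch (illustrative only; the function names and the noisy-copy test data are ours, not from the references) estimates H(X), H(X | Y) and I(X;Y) by plug-in counts from two discrete series and checks the identity (6):

```python
import numpy as np
from collections import Counter

def entropy(samples):
    """Plug-in Shannon entropy (in bits) of a sequence of discrete symbols."""
    n = len(samples)
    return -sum((c / n) * np.log2(c / n) for c in Counter(samples).values())

def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), estimated from joint counts."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 10000)
x = np.where(rng.random(10000) < 0.9, y, 1 - y)      # X is a noisy copy of Y

h_x_given_y = entropy(list(zip(x, y))) - entropy(y)  # H(X|Y) = H(X,Y) - H(Y), via (4)
print(entropy(x) - h_x_given_y)                      # these two agree, as in (6)
print(mutual_information(x, y))
```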
The conditional mutual information between X and Y given Z is the mutual
information between X and Y when Z is known:
I(X;Y | Z) = ∑_{x,y,z} p(x, y, z) log2 [ p(x | y, z) / p(x | z) ]   (7)
           = H(X | Z) − H(X | Y, Z).   (8)
One can consider the MI from two variables Y1 ,Y2 jointly to another variable X,
I(X;Y1 ,Y2 ), and using (4), (6) and (8) decompose this into the information carried
by the first variable plus that carried by the second conditioned on the first:
I(X;Y1, Y2) = I(X;Y1) + I(X;Y2 | Y1).   (9)
Of course, this chain rule generalises to multivariate Y of dimension greater than
two.
Note that a conditional MI I(X;Y | Z) may be either larger or smaller than the
related unconditioned MI I(X;Y ) [65]. The conditioning removes information re-
dundantly held by the source Y and the conditioned variable Z about X (e.g. if both
Y and Z were copies of X). Furthermore, the conditioning also includes synergistic
information about X which can only be decoded with knowledge of both the source
Y and conditioned variable Z (e.g. where X is the result of an exclusive-OR or XOR
operation from Y and Z). These components cannot be teased apart with traditional
information-theoretic analysis; the partial information decomposition approach was
introduced for this purpose [95] (and see also [35, 32, 53]).
We now move on to consider measures of information in time-series pro-
cesses X of the random variables {. . . Xn−1 , Xn , Xn+1 . . .} with process realisations
{. . . xn−1 , xn , xn+1 . . .} for countable time indices n. We refer to measures which con-
sider how the information in variable Xn is related to previous variables, e.g. Xn−1 ,
of the process or other processes as measures of information dynamics.
The entropy rate is defined by [13]:
H′μ(X) = lim_{n→∞} (1/n) H(X1, X2, . . . , Xn)   (10)
       = lim_{n→∞} (1/n) H(Xn^(n)),   (11)

(where the limit exists) where we have used Xn^(k) = {Xn−k+1, . . . , Xn−1, Xn} to de-
note the k consecutive variables of X up to and including time step n. This quantity
describes the limiting rate at which the entropy of n consecutive measurements of X
grow with n. A related definition is given by:2
Hμ(X) = lim_{n→∞} H[Xn | X1, X2, . . . , Xn−1]   (12)
      = lim_{n→∞} H[Xn | Xn−1^(n−1)].   (13)
2 Note that we have reversed the use of the primes in the notation from [13], in line with
[14].
Cover and Thomas [13] point out that these two quantities correspond to two subtly
different notions: the first is something of an average per symbol entropy, while the
second is a conditional entropy of the last random variable given the past. These
authors go on to demonstrate that for stationary processes X, the limits for the two
quantities H′μ(X) and Hμ(X) exist (i.e. the average entropy rate converges) and are
equal.
For our purposes in considering information dynamics, we are interested in the
latter formulation Hμ(X), since it explicitly describes how one random variable Xn
is related to the previous instances Xn−1^(n−1). For practical usage, we are particularly
interested in estimation of Hμ(X) with finite lengths k, and in estimating it regard-
ing the information at different time indices n. That is to say, we use the notation
Hμ(Xn+1, k) to describe the conditional entropy in Xn+1 given Xn^(k):

Hμ(Xn+1, k) = H[Xn+1 | Xn^(k)].   (14)
Of course, letting k = n and joining (13) and (14) we have limn→∞ Hμ (Xn+1 , n) =
Hμ (X).
3 Local Information Theoretic Measures
In this section, we describe how one may obtain local information measures with
reference to their more well-known average information-theoretic counterparts. Lo-
cal information-theoretic measures characterise the information attributed with spe-
cific measurements x and y of variables X and Y , rather than the average information
associated with these variables. Local values within a global average are known to
provide important insights into the dynamics of nonlinear systems [16].
We begin by defining local values of the entropy and conditional entropy (Shan-
non information content values) in Sect. 3.1, and then describe local mutual in-
formation and conditional mutual information in Sect. 3.2. Next, in Sect. 3.3 we
consider the meaning and properties of these local values where X and Y are
time-series processes and local information-theoretic measures characterise the in-
formation attributed at each local point in time in these series. Finally, we describe
in Sect. 3.4 the mechanics of how these local information-theoretic measures can be
practically quantified, using various types of estimators.
Before beginning, we note that such local information-theoretic measures have
been used (with less explicit presentation) in various earlier studies in complex sys-
tems science, e.g. for the local excess entropy [83], the local statistical complexity
[83, 84], and the local information [36]. Yet relatively little exploration has been
made into the dynamics of these local information measures in complex systems,
and certainly none had been made into the local dynamics of information storage,
transfer and modification, as we will review in Sect. 4.
3.1 Shannon Information Content and Its Meaning
The Shannon information content or local entropy of an outcome x of measure-
ment of the variable X is [65]:
h(x) = − log2 p(x).   (15)

Note that by convention we use lower-case symbols to denote local information-
theoretic measures throughout this chapter. The Shannon information content can
be shown to be the unique formulation (up to the base of the logarithm) satisfying
the following properties [1]:
• grouping: h(p1 (x1 ) × p2 (x2 )) = h(p1 (x1 )) + h(p2 (x2 )), where h(p(x)) =
− log2 p(x) = h(x), and p1 and p2 (both satisfying 0 < p ≤ 1) can be interpreted
as representing the probabilities of two independent events x1 and x2 ;
• monotonically decreasing with p(x); and
• continuity with p(x).
Note that these three properties map directly to the three properties for the (average)
Shannon entropy (see Sect. 2). Also, noting that this quantity is also equivalent to a
local self-information, it can also be derived (see [22, Chapter 2]) by starting with
the local mutual information (see Sect. 3.2).
Now, the quantity h(x) is simply the information content attributed to the specific
symbol x, or the information required to predict or uniquely specify that specific
value. Less probable outcomes x have higher information content than more prob-
able outcomes, and we have h(x) ≥ 0. Specifically, the Shannon information con-
tent of a given symbol x is the code-length for that symbol in an optimal encoding
scheme for the measurements X, i.e. one that produces the minimal expected code
length.3
In this light, one views the Shannon entropy as the “entropy of an ensemble” [65]
of the outcomes x of the random variable X, with probabilities p defined over the
alphabet Ax of possible outcomes. That is, H(X) is the average or expectation value
of the Shannon information content for each symbol x ∈ Ax (compare to (1)):
H(X) = ∑_x p(x) h(x),   (16)
     = ⟨h(x)⟩.   (17)
As we will see, each average information-theoretic measure is an average over its
associated local quantity.
In the mathematics above, we see the average or expectation value as being taken
over each symbol x = m (where m ∈ {0, . . . , M − 1} without loss of generality for
some M discrete symbols). We can also view it however as being an average over
each observation or measurement xi (where i is a measurement index) of X that
3 Note that this “optimal code-length” may specify non-integer choices; full discussion of
the implications of this, practical issues in selecting integer code-lengths, and block-coding
optimisations are contained in [13, Chapter 5].
we used to construct our probability distribution function p(x). To do this, we start
from the operational definition of the PDF for each symbol: p(x = m) = c(x = m)/N,
where c(x = m) is the count of observations of the symbol m out of the N total
observations. To precisely compute this probability, the ratio should be composed
over all realisations of the observed variables (as described in [83]); realistically
however, estimates will be made from a finite number of observations N. We then
re-write (1) using this definition:
H(X) = − ∑_m [c(x = m)/N] log2 p(x = m),   (18)

and then further expand using the identity c(x = m) = ∑_{g=1}^{c(x=m)} 1:

H(X) = − ∑_m ∑_{g=1}^{c(x=m)} (1/N) log2 p(x = m).   (19)
This leaves a double sum running over i. each actual observation g, ii. for each
possible observation x = m. This is equivalent to a single sum over all N observations
xi , i = 1 . . . N, giving:
H(X) = − (1/N) ∑_{i=1}^{N} log2 p(xi),   (20)
     = ⟨h(xi)⟩_i,   (21)
as required. To reiterate, we refer to h(xi ) as a local entropy because it is defined
locally for each observation xi .
At this point, we note that the above derivation shows that the PDF p(x) for the
local value h(x) is evaluated at a specific local observation x, but the function p is
defined using all of the relevant observations. This is a subtle point - the evaluation
of p is local to the observation x, but we need other observations to define the func-
tion p in order to make this evaluation. We revisit this concept when we consider
time-series processes in Sect. 3.3.
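As a brief illustration of (15)–(21) (a sketch with made-up data, not code from the chapter), the fragment below builds the PDF p(x) from all observations of a discrete series, evaluates the local entropy h(xi) at each individual observation, and confirms that the average of the local values recovers the plug-in H(X):

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(1)
x = rng.choice([0, 1, 2], size=5000, p=[0.6, 0.3, 0.1])

# The PDF p(x) is defined from all observations ...
p = {symbol: count / len(x) for symbol, count in Counter(x).items()}

# ... but evaluated locally at each observation: h(x_i) = -log2 p(x_i), as in (15)
local_h = np.array([-np.log2(p[xi]) for xi in x])

print(local_h[:5])                                   # local entropies, one per observation
print(local_h.mean())                                # their average, as in (21), ...
print(-sum(pv * np.log2(pv) for pv in p.values()))   # ... equals the plug-in H(X) of (1)
```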
Now, we note that one can also define conditional Shannon information con-
tent (or local conditional entropy) [65]:

h(x | y) = − log2 p(x | y), (22)

and that these quantities satisfy the chain rule in alignment with their averages:

h(x, y) = h(y) + h(x | y). (23)

In this way, we see that the information content of a joint quantity (x, y) is the code
length of y plus the code length of x given y. Finally, we note that this quantity is also
referred to as conditional self-information and can also be derived (see [22, Chapter
2]) by starting with the local conditional mutual information (see Sect. 3.2).
3.2 Local Mutual Information and Conditional Mutual Information
Next, we consider localisations of the mutual information. One way to think about
this quantity is to build the local mutual information directly from Shannon infor-
mation content or local entropy measures, in alignment with its average definition,
i.e.:
i(x; y) = h(x) − h(x | y),   (24)
        = log2 [ p(x | y) / p(x) ].   (25)
In this way, we see that the local mutual information is the difference in code lengths
between coding the value x in isolation (under the optimal encoding scheme for X),
or coding the value x given y (under the optimal encoding scheme for X given Y ). In
other words, this quantity captures the coding “cost” for x in not being aware of the
value y. Similarly, the local conditional mutual information can be constructed as:
i(x; y | z) = h(x | z) − h(x | y, z),   (26)
            = log2 [ p(x | y, z) / p(x | z) ].   (27)
Here, we see that the local conditional mutual information is the difference in code
lengths (or coding cost) between coding the value x given z (under the optimal en-
coding scheme for X given Z), or coding the value x given both y and z (under the
optimal encoding scheme for X given Y and Z).
More formally however, Fano [22, ch. 2] set out to quantify “the amount of infor-
mation provided by the occurrence of the event represented by yi about the occur-
rence of the event represented by xi .” He derived the local mutual information i(x; y)
(25) to capture this concept, as well as the local conditional mutual information
i(x; y | z) (27), directly from the following four postulates:
• once-differentiability with respect to the underlying probability distribution
functions p(x) and p(x | y);
• identical mathematical form for the conditional MI and local conditional
MI, only with p(x) replaced by p(x | z) and p(x | y) replaced by p(x | y, z);
• additivity for the information provided by y and z about x, i.e.: i({y, z} ; x) =
i(y; x) + i(z; x | y);
• separation for independent ensembles XY and UV , i.e. where we have
p(x, y, u, v) = p(x, y)p(u, v) then we must have i({x, u} ; {y, v}) = i(x; y) + i(u; v).
Crucially, Fano’s derivation means that i(x; y) and i(x; y | z) are uniquely specified,
up to the base of the logarithm.
Of course, we have I(X;Y) = ⟨i(x; y)⟩ and I(X;Y | Z) = ⟨i(x; y | z)⟩ as per the
averaged entropy quantities in the previous section. It is particularly interesting that
Fano made the derivation for local mutual information directly, and only computed
the averaged quantity as a result of that. This contrasts with contemporary perspec-
tives which generally give primary consideration to the averaged quantity. (This is
not the case however in natural language processing for example, where the local
MI is commonly used and known as the point-wise mutual information, e.g. [68]).
We also note that i(x; y) is symmetric in x and y (like I(X;Y )), though this was
not explicitly built into the above postulates.
Next, consider that the local MI and conditional MI values may be either posi-
tive or negative, in contrast to the local entropy which cannot take negative values.
Positive values are fairly intuitive to understand: the local mutual information in
(25) is positive where p(x | y) > p(x), i.e. knowing the value of y increased our
expectation of (or positively informed us about) the value of the measurement x.
The existence of negative values is often a concern for readers unfamiliar with
the concept, however they too are simple to understand. Negative values simply
occur in (25) where p(x | y) < p(x), i.e. knowing about the value of y actually
changed our belief p(x) about the probability of occurrence of the outcome x to
a smaller value p(x | y), and hence we considered it less likely that x would occur
when knowing y than when not knowing y, in a case where x nevertheless occurred.
As an example, consider the probability that it will rain today, p(rain = 1), and
the probability that it will rain given that the weather forecast said it would not,
p(rain = 1 | rain forecast = 0). Being generous to weather forecasters for a
moment, let’s say that p(rain = 1 | rain forecast = 0) < p(rain = 1), so
we would have i(rain = 1; rain forecast = 0) < 0, because we considered it
less likely that rain would occur today when hearing the forecast than without the
forecast, in a case where rain nevertheless occurred. These negative values of MI are
actually quite meaningful, and can be interpreted as there being negative informa-
tion in the value of y about x. We could also interpret the value y as being misleading
or misinformative about the value of x, because it had lowered our expectation of
observing x prior to that observation being made in this instance. In the above ex-
ample, the weather forecast was misinformative about the rain today. One can also
view the negative values using (24), seeing that i(x; y) is negative where knowing y
increased the uncertainty about x.
Importantly, these local measures always average to give a non-negative value.
Elaborating on an example from Cover and Thomas [13, p.28], “in a court case,
specific new evidence” y “might increase uncertainty” about the outcome x, “but
on the average evidence decreases uncertainty”. Similarly, in our above example,
while the weather forecast might misinform us about the rain on a particular day, on
average the weather forecast will provide positive (or at least zero!) information.
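The rain/forecast example can be written out numerically. In the sketch below the joint probabilities are hypothetical values chosen purely for illustration; it evaluates the local MI (25) for each joint outcome and shows that individual values can be negative (the misinformative forecast) while the average I(X;Y) stays non-negative:

```python
import numpy as np

# Hypothetical joint distribution p(rain, forecast): the forecast is usually right.
# rows: rain = 0, 1 ; columns: forecast = 0, 1
p_joint = np.array([[0.45, 0.05],
                    [0.10, 0.40]])
p_rain = p_joint.sum(axis=1)          # p(rain)
p_forecast = p_joint.sum(axis=0)      # p(forecast)

avg_mi = 0.0
for rain in (0, 1):
    for forecast in (0, 1):
        # local MI i(rain; forecast) = log2 [ p(rain | forecast) / p(rain) ], as in (25)
        p_cond = p_joint[rain, forecast] / p_forecast[forecast]
        local_mi = np.log2(p_cond / p_rain[rain])
        avg_mi += p_joint[rain, forecast] * local_mi
        print(f"rain={rain}, forecast={forecast}: i = {local_mi:+.3f} bits")

print(f"average I = {avg_mi:.3f} bits")   # non-negative, as discussed above
```

With these numbers the outcome (rain = 1, forecast = 0) receives a negative local value, i.e. the forecast was misinformative about the rain that nevertheless occurred, while the average remains positive.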
Finally, we note that the local mutual information i(x; y) measures we consider
here are distinct from partial localization expressions, i.e. the partial mutual infor-
mation or specific information I(x;Y ) [18], which consider information contained in
specific values x of one variable X about the other (unknown) variable Y . Crucially,
there are two valid approaches to measuring partial mutual information, one which
preserves the additivity property and one which retains non-negativity [18]. As de-
scribed above however, there is only one valid approach for the fully local mutual
information i(x; y) (and see further discussion in [56]).
3.3 Local Information Measures for Time Series
Now, consider Xn , Yn and Zn as the variables of time-series processes X, Y and Z
with specific measurements (xn , yn , zn ) at each time point n = 1, . . . , N (though the
specific time interval is arbitrary).
The local information-theoretic measures, e.g. i(xn ; yn ), then characterise the in-
formation attributed at each local point in time in these series. Furthermore, where
X is a multivariate spatiotemporal series with measurements xi,n at spatial points i
for each time n, then local information-theoretic measures, e.g. i(xi,n ; xi,n+1 ), charac-
terise the information attributed at each local spatiotemporal point in the series, and
one can form spatiotemporal profiles of the information characteristics. Such local
characterisation is what we mean by the local measures being useful for studying
the dynamics of information in space and time. We shall explore examples of such
dynamics in the next sections.
As described earlier for h(x), computing a local measure requires evaluating the
probability distribution function (PDF) p(x) for the given local observation x, how-
ever the PDF itself must be defined using all of the relevant observations of the
variable X. Furthermore, where X is a time series, it is clear that the observations
to construct the PDF for evaluating p(xn ) at xn are not local in time to that obser-
vation xn . We must carefully consider which parts of the time series X are used to
construct the PDF – one should select observations across which the time series
is stationary or in the same phase of a cyclostationary process when constructing
PDFs for information-theoretic functions.
Often, this may mean using a sliding window technique to construct the PDF
– i.e. to evaluate p(xn ) we may use observations {xn−T , . . . , xn+T } (for some T ) to
construct the PDF, assuming that the time series is stationary over that time-interval.
While one would wish to maximise the size of the time-window in order to have
many samples to estimate the PDF, this must be balanced against these stationarity
considerations.
An alternate ensemble approach may be to sample many repeat time series Xi
(where i is an instance, trial or realisation index of the time-series) with measure-
ments xi,n , where stationarity is assumed at fixed time points n over all samples
i. In this case, p(xi,n ) is constructed for each xi,n using the ensemble of samples
for all time-series instances i but with the same n, and the PDF is then somewhat
local in time. Gómez-Herrero et al. [26] use a hybrid ensemble – sliding-window
approach, estimating PDFs over values xi,n for all trials i within some time-window
t − σ ≤ n ≤ t + σ , giving the measures a local flavour (discussed further in the
chapter by Vicente and Wibral in this book). Also, note that TRENTOOL (transfer
entropy toolbox) [49] implements such an ensemble approach for PDF estimation.
For ergodic processes, the time-window and ensemble approaches are theoretically
equivalent.
Now, note that the sliding-window technique described above only refers to con-
structing the PDF using all observations from that window – it does not force us
to compute the average measure, e.g. H(X), over all observations in that window
{xn−T , . . . , xn+T }. Instead, once the PDF is obtained, we may evaluate the local
values of entropy and (conditional) mutual information. Averaging can of course
be done, e.g. [90], but while averaging in a sliding-window approach does provide a
more local measure than averaging over all available observations in the time series
X, it is not local in the same sense as the term is used here (i.e. it does not look at
the information involved in a computation at a single specific time step).
Still on averages, recall that average information-theoretic measures represent
averages over local measures at each observation (see (21)). For time-series X, if
the whole series is stationary (or if we look at data from identical phases of a cyclo-
stationary process) then we can take the time-average of all local values in order to
compute the relevant averaged information-theoretic measure, i.e.:
H(X) = ⟨h(xn)⟩_n.   (28)
Alternatively, if we are taking an ensemble approach with observations xi,n for
each time series realisation or trial Xi , then we can take an average across all reali-
sations, e.g.:
H(Xn) = ⟨h(xi,n)⟩_i,   (29)
to compute an average measure at the given time index n (across realisations or
trials). Indeed, this approach can be quite useful to obtain a “local” quantity in time
H(Xn ), while mitigating against the large variance in local values (noted in [26]).
Of course, the PDFs could be estimated using a hybrid ensemble – sliding-window
approach, as noted above [26].

3.4 Estimating the Local Quantities
As described above, appropriately selecting the observations to use in the PDF is
one challenge associated with estimating these local quantities properly. Another
challenge is to select the type of estimator to use, and to properly extract local prob-
ability estimates from it for evaluating the local information quantities. Full details
on information-theoretic estimators are given in a separate chapter of this book by
Vicente and Wibral. In this subsection we specifically describe evaluation of the
local quantities using various estimators.4
When we have discrete-valued data, estimating the local measures is relatively
straightforward. One simply counts the matching configurations in the available data
to obtain the relevant probability estimates ( p̂(x | y) and p̂(x) for mutual informa-
tion), and then uses these values directly in the equation for the given local quantity
(e.g. (25) for local mutual information) as a plug-in estimate.
For continuous-valued data where we deal with the differential entropy [13] and
probability density functions, estimation of the local quantities is slightly more com-
plicated and depends on the estimator being used.
4 Open-source code is available for local information-theoretic measures (using all of the
estimator types considered here) in the Java Information Dynamics Toolkit on Google
code [51].
Using kernel-estimators (e.g. see [82, 41]), the relevant probabilities (e.g. p̂(x | y)
and p̂(x) for mutual information) are estimated with kernel functions, and then these
values are used directly in the equation for the given local quantity (e.g. (25)) as a
plug-in estimate (see e.g. [61]).
With the improvements to kernel-estimation for mutual information suggested by
Kraskov et al. [45, 44] (and extended to conditional mutual information and trans-
fer entropy by [24, 26]), the PDF evaluations are effectively bypassed, and for the
average measure one goes directly to estimates based on nearest neighbour counts
nx and ny in the marginal spaces for each observation. For example, for Kraskov’s
algorithm 1 we have:
I(X;Y) = ψ(k) − ⟨ψ(nx + 1) + ψ(ny + 1)⟩ + ψ(N),   (30)
where ψ denotes the digamma function, and the values are returned in nats rather
than bits. Local values can be extracted here simply by unrolling the expectation
values and computing the nearest neighbour counts only at the given observation
(x, y), e.g. for algorithm 1:

i(x; y) = ψ (k) − ψ (nx + 1) − ψ (ny + 1) + ψ (N). (31)

This has been observed as a “time-varying estimator” in [26] and used to estimate
the local transfer entropy in [50] and [89].
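As an illustrative sketch of how (31) can be evaluated (a bare re-implementation for clarity, not the estimator code of the cited references; the function name and test data are ours), the fragment below applies Kraskov et al.'s algorithm 1 to continuous data: for each sample the k-th nearest-neighbour distance is found in the joint space under the max-norm, the marginal neighbour counts nx and ny within that radius are taken, and the local value follows from (31), in nats:

```python
import numpy as np
from scipy.special import digamma
from scipy.spatial import cKDTree

def local_mi_ksg1(x, y, k=4):
    """Local MI i(x;y) (nats) for each sample, Kraskov et al. algorithm 1, eq. (31)."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    y = np.asarray(y, dtype=float).reshape(len(y), -1)
    n = len(x)
    joint = np.hstack((x, y))
    # distance to the k-th neighbour (excluding the point itself), max-norm
    eps = cKDTree(joint).query(joint, k=k + 1, p=np.inf)[0][:, -1]
    tree_x, tree_y = cKDTree(x), cKDTree(y)
    local = np.empty(n)
    for i in range(n):
        # neighbour counts strictly within eps in each marginal space
        # (strict inequality approximated by shrinking the radius slightly)
        nx = len(tree_x.query_ball_point(x[i], eps[i] - 1e-12, p=np.inf)) - 1
        ny = len(tree_y.query_ball_point(y[i], eps[i] - 1e-12, p=np.inf)) - 1
        local[i] = digamma(k) - digamma(nx + 1) - digamma(ny + 1) + digamma(n)
    return local

rng = np.random.default_rng(2)
y = rng.normal(size=1000)
x = y + rng.normal(scale=0.5, size=1000)
loc = local_mi_ksg1(x, y)
print(loc[:5], loc.mean())   # the mean of the local values approximates (30)
```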
Using permutation entropy approaches [3] (e.g. symbolic transfer entropy [87]),
the relevant probabilities are estimated based on the relative ordinal structure of
the joint vectors, and these values are directly used in the equations for the given
quantities as plug-in estimates (e.g. see local symbolic transfer entropy in [72]).
Finally, using a multivariate Gaussian model for X (which is of d dimensions),
the average entropy has the form [13]:
H(X) = (1/2) ln((2πe)^d |Ω|),   (32)
(in nats) where |Ω| is the determinant of the d × d covariance matrix Ω = \overline{XᵀX}
(for row vectors X), and the overbar “represents an average over the statistical en-
semble” [6]. Any standard information-theoretic measure of the variables (at the
same time step), e.g. mutual information, can then be obtained from sums and dif-
ferences of these joint entropies. While the PDFs were again effectively bypassed in
the average, the local entropies (and by sums and difference other local measures)
can be obtained by first reconstructing the probability of a given observation x in a
multivariate process with covariance matrix Ω :
p(x) = [ 1 / ((√(2π))^d |Ω|^{1/2}) ] exp( −(1/2) (x − μ) Ω⁻¹ (x − μ)ᵀ ),   (33)
(where μ is the expectation value of x), then using these values directly in the equa-
tion for the given local quantity as a plug-in estimate.5
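A short sketch of this Gaussian plug-in route (illustrative; the variable names and test covariance are ours): the covariance is fitted from the data, (33) is evaluated at every observation to give local entropies h(x) = −ln p(x), and their average is compared against the analytic form (32):

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 3, 20000
cov_true = np.array([[1.0, 0.5, 0.2],
                     [0.5, 1.0, 0.3],
                     [0.2, 0.3, 1.0]])
x = rng.multivariate_normal(np.zeros(d), cov_true, size=n)

# Fit the Gaussian model: mean and covariance estimated from the observations
mu = x.mean(axis=0)
omega = np.cov(x, rowvar=False)
omega_inv = np.linalg.inv(omega)
log_det = np.log(np.linalg.det(omega))

# Local entropy h(x) = -ln p(x) from (33), evaluated at each observation (in nats)
diff = x - mu
mahal = np.einsum('ij,jk,ik->i', diff, omega_inv, diff)
local_h = 0.5 * (d * np.log(2 * np.pi) + log_det + mahal)

print(local_h.mean())                                                  # average of local values
print(0.5 * np.log((2 * np.pi * np.e) ** d * np.linalg.det(omega)))   # analytic form (32)
```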
4 Local Measures of Information Processing
In this section, we build on the fundamental quantities of information theory, our
first look at dynamic measures of information, and on the dynamics of local in-
formation measures in time, to present measures of the dynamics of information
processing. We briefly review the framework for information dynamics which was
recently introduced in [58, 59, 60, 62, 52].
The fundamental question the measures of this framework address is: “where
does the information in a random variable Xn+1 in a time series come from?”. This
question is addressed in terms of information from the past of process X (i.e. the
information storage), information contributed from other source processes Y (i.e. the
information transfer), and how these sources combine (information modification).
Here we describe local measures of information storage and transfer, and refer the
reader to [60, 23, 53] regarding information modification.
4.1 Local Information Storage
The active information storage AX was introduced [62] to measure how much of
the information from the past of the process is observed to be in use in computing
its next state.6 The active information storage AX is the average mutual information
between realizations xn^(k) of the past state Xn^(k) (as k → ∞) and the corresponding
realizations xn+1 of the next value Xn+1 of a given time series process X:
AX = lim_{k→∞} AX(k),   (34)
AX(k) = I(Xn^(k); Xn+1).   (35)
We note that the limit k → ∞ is required in general so as to capture all relevant
information in the past of X, unless the next value xn+1 is conditionally independent
of the far past values xn−k^(∞) given xn^(k) [62]. Empirically of course, one is limited to
finite-k estimates AX (k).
5 See the next section, Sect. 4.2, regarding how this method can be used to produce a local
Granger causality, as a local transfer entropy using a Gaussian model estimator.
6 This contrasts with related measures including: the statistical complexity [15] which mea-
sures all information stored by the system which may be used in the future; and the excess
entropy [31, 14] which measures that information which is used by the system at some
point in the future. Of course, this means that the excess entropy measures information
storage that will possibly but not necessarily be used at the next time step n + 1, which
is greater than or equal to that measured by the active information storage. See further
discussion in [62].
Now, the local active information storage aX (n + 1) is the local mutual informa-
tion between realizations xn^(k) of the past state Xn^(k) (as k → ∞) and the corresponding
realizations xn+1 of the next value Xn+1 . This is computed as described for local mu-
tual information values in Sect. 3.2. The average active information storage AX is
the expectation of these local values:
AX = ⟨aX(n + 1)⟩,   (36)
aX(n + 1) = lim_{k→∞} aX(n + 1, k),   (37)
AX(k) = ⟨aX(n + 1, k)⟩,   (38)
aX(n + 1, k) = i(xn^(k); xn+1),   (39)
             = log2 [ p(xn+1 | xn^(k)) / p(xn+1) ].   (40)

The local values of active information storage measure the dynamics of information
storage at different time points within a system, revealing to us how the use of mem-
ory fluctuates during a process. Where the observations used for the relevant PDFs
are from the whole time series of a process (under an assumption of stationarity, as
outlined in Sect. 3.3), then the average AX (k) is the time-average of the local values
aX (n + 1, k).
We also note that since [62]:

A(X) = H(X) − Hμ (X), (41)

then the limit in (34) exists for stationary processes (i.e. A(X) converges with k →
∞). A proof for convergence of a(xn+1 ) with k → ∞ remains a topic for future work.
As described for the local mutual information in Sect. 3.2, aX (n + 1) may be pos-
itive or negative, meaning the past history of the process can either positively inform
us or actually misinform us about its next value [62]. An observer of the process is
misinformed where, conditioned on the past history the observed outcome was rel-
atively unlikely as compared to the unconditioned probability of that outcome (i.e.
p(xn+1 | xn^(k)) < p(xn+1)). In deterministic systems (e.g. CAs), negative local active
information storage means that there must be strong information transfer from other
causal sources.
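To make (39)–(40) concrete, here is a minimal plug-in sketch for a discrete series (illustrative only; the results discussed in Sect. 5 were produced with the JIDT estimators): both PDFs are built from all observations of the series, and each time step then receives its own local value, so that unusual transitions show up as negative (misinformative) storage.

```python
import numpy as np
from collections import Counter

def local_active_info_storage(x, k):
    """Local a_X(n+1, k) in bits for a discrete 1D series x, by plug-in counts (eq. (40))."""
    x = list(x)
    obs = [(tuple(x[n - k + 1:n + 1]), x[n + 1]) for n in range(k - 1, len(x) - 1)]
    joint_c = Counter(obs)
    past_c = Counter(past for past, _ in obs)
    next_c = Counter(nxt for _, nxt in obs)
    n_obs = len(obs)
    local = [np.log2((joint_c[(past, nxt)] / past_c[past]) / (next_c[nxt] / n_obs))
             for past, nxt in obs]
    return np.array(local)

# Period-2 process with a few injected "glitches"
rng = np.random.default_rng(4)
x = np.array([n % 2 for n in range(2000)])
x[rng.integers(0, 2000, 30)] ^= 1
a = local_active_info_storage(x, k=2)
print(a.mean(), a.min())   # storage is high on average; the glitches give negative local values
```

In this toy example the average storage sits close to 1 bit, while the injected glitches produce strongly negative local values, in loose analogy to the moving structures discussed in Sect. 5.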
4.2 Local Information Transfer
Information transfer is defined as the amount of information that a source process
provides about a target (or destination) process’ next state that was not contained in
the target’s past. This definition pertains to Schreiber’s transfer entropy measure
[82], which has become a very popular tool in complex systems in general (e.g.
[96, 64, 73, 5, 59, 55, 7]) and in computational neuroscience in particular (e.g. [91,
49, 40, 88, 54, 19]).
The transfer entropy (TE) [82] captures the average mutual information from re-
alizations yn^(l) of the state Yn^(l) of a source time-series process Y to the corresponding
realizations xn+1 of the next value Xn+1 of the target time-series process X, condi-
tioned on realizations xn^(k) of the previous state Xn^(k):
TY→X(l) = lim_{k→∞} TY→X(k, l),   (42)
TY→X(k, l) = I(Yn^(l); Xn+1 | Xn^(k)).   (43)
Schreiber emphasized that, unlike the (unconditioned) time-differenced mutual in-
formation, the transfer entropy was a properly directed, dynamic measure of infor-
mation transfer rather than shared information.
There are a number of important considerations regarding the use of this measure.
These are described more fully in the chapter by Wibral et al. in this book, and
summarised as follows.
First, in general, one should take the limit as k → ∞ in order to properly embed
or represent the previous state Xn^(k) as relevant to the relationship between the next
value Xn+1 and the source Yn^(l) [59]. Note that k can be limited here where the next
value xn+1 is conditionally independent of the far past values xn−k^(∞) given (xn^(k), yn).
We observe that this historical information conditioned on by the transfer entropy is
exactly that provided by the active information storage. As such, setting k properly
in this manner gives the observer the perspective to properly separate information
storage and transfer in the distributed computation in the systems, and allows one to
interpret the transfer entropy as properly representing information transfer [59, 56].
Empirically of course one is restricted to finite-k estimates TY →X (k, l).
Also, note that the transfer entropy can be defined for an arbitrary source-target
delay, i.e. measuring I(Yn−u^(l); Xn+1 | Xn^(k)), and indeed that this should be done for
the appropriate causal delay u > 0 [93]. For ease of presentation here, we describe
the measures for u = 1 only, though all are straightforward to generalise.
Furthermore, considering the source state yn^(l) rather than a scalar yn is most ap-
propriate where the observations y mask a hidden Markov process which is causal to
X, or where multiple past values of Y in addition to yn are causal to xn+1 . Otherwise,
where yn is directly causal to xn+1 , and where it is the only direct causal source in
Y , we use only l = 1 [59, 56].
Finally, for proper interpretation as information transfer, Y is constrained among
the causal information contributors to X [56]. We have also provided a thermody-
namic interpretation of transfer entropy in [79], as being proportional to external
entropy production, possibly due to irreversibility.
Now, we continue on to extract the local transfer entropy tY →X (n + 1) [59] as
a local conditional mutual information using the approach described in Sect. 3.2. It
is the amount of information transfer attributed to the specific configuration or real-
ization (xn+1, xn^(k), yn^(l)) at time step n + 1; i.e. the amount of information transferred
from process Y to X at time step n + 1:
TY→X(l) = ⟨tY→X(n + 1, l)⟩,   (44)
tY→X(n + 1, l) = lim_{k→∞} tY→X(n + 1, k, l),   (45)
TY→X(k, l) = ⟨tY→X(n + 1, k, l)⟩,   (46)
tY→X(n + 1, k, l) = i(yn^(l); xn+1 | xn^(k)),   (47)
                  = log2 [ p(xn+1 | xn^(k), yn^(l)) / p(xn+1 | xn^(k)) ].   (48)

These local information transfer values measure the dynamics of transfer in time be-
tween any given pair of processes within a system, revealing to us how information
is transferred across the system in time and space. Fig. 1 indicates a local transfer
entropy measurement for a pair of processes Y → X.
As above, where the observations used for the relevant PDFs are from the whole
time series of the processes (under an assumption of stationarity, as outlined in
Sect. 3.3) then the average TY →X (k, l) is the time-average of the local transfer values
tY →X (n + 1, k, l).
As described for the local conditional mutual information in Sect. 3.2, tY →X (n +
1) may be positive or negative, meaning the source process can either positively

Fig. 1 Local transfer entropy tY→X(n + 1, k, l = 1) indicated by the blue arrow: information
contained in the realization yn of the source variable Y about the next value xn+1 of the
destination variable X at time n + 1, in the context of the corresponding realization xn^(k) of the
destination’s past state
inform us or actually misinform us about the next value of the target (in the context
of the target’s past state) [59]. An observer of the process is misinformed where,
conditioned on the source and the past of the target the observed outcome was rel-
atively unlikely, as compared to the probability of that outcome conditioning on the
past history only (i.e. p(xn+1 | xn^(k), yn^(l)) < p(xn+1 | xn^(k))).
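A matching plug-in sketch for the local transfer entropy (47)–(48) on discrete data (again illustrative, with l = 1 and source–target delay u = 1, and not the estimator used for the results in Sect. 5): most steps of a noisily-copied target receive positive local transfer from the source, while the corrupted steps are exactly the negative, misinformative ones.

```python
import numpy as np
from collections import Counter

def local_transfer_entropy(source, target, k=1):
    """Local t_{Y->X}(n+1, k, l=1) in bits for discrete series, by plug-in counts (eq. (48))."""
    y, x = list(source), list(target)
    obs = [(tuple(x[n - k + 1:n + 1]), y[n], x[n + 1]) for n in range(k - 1, len(x) - 1)]
    c_pyx = Counter(obs)                                   # (past, y_n, x_{n+1})
    c_py = Counter((past, yv) for past, yv, _ in obs)      # (past, y_n)
    c_px = Counter((past, nxt) for past, _, nxt in obs)    # (past, x_{n+1})
    c_p = Counter(past for past, _, _ in obs)              # (past,)
    local = []
    for past, yv, nxt in obs:
        p_with_source = c_pyx[(past, yv, nxt)] / c_py[(past, yv)]   # p(x_{n+1} | x_n^(k), y_n)
        p_without = c_px[(past, nxt)] / c_p[past]                   # p(x_{n+1} | x_n^(k))
        local.append(np.log2(p_with_source / p_without))
    return np.array(local)

# Target is a noisy one-step-delayed copy of the source
rng = np.random.default_rng(5)
y = rng.integers(0, 2, 5000)
x = np.empty_like(y)
x[0] = 0
flips = rng.random(5000) < 0.1
x[1:] = np.where(flips[1:], 1 - y[:-1], y[:-1])
t = local_transfer_entropy(y, x, k=1)
print(t.mean(), t.min())   # average roughly 1 - H(0.1) ≈ 0.53 bits; flipped steps come out negative
```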
Noting the equivalence of the transfer entropy and the concept of Granger causal-
ity [28] when the transfer entropy is estimated using a Gaussian model [4], we ob-
serve that the local transfer entropy – when estimated with a Gaussian model as
described in Sect. 3.4 – directly gives a local Granger causality measurement.
Now, the transfer entropy may also be conditioned on other possible sources Z
to account for their effects on the target. The conditional transfer entropy was
introduced for this purpose [59, 60]:
TY→X|Z(l) = lim_{k→∞} TY→X|Z(k, l),   (49)
TY→X|Z(k, l) = I(Yn^(l); Xn+1 | Xn^(k), Z),   (50)
Note that Z may represent an embedded state of another variable and/or be explicitly
multivariate. Transfer entropies conditioned on other variables have been used in
several biophysical and neuroscience applications, e.g. [20, 21, 88].
We also have the corresponding local conditional transfer entropy:
TY→X|Z(k, l) = ⟨tY→X|Z(n + 1, k, l)⟩,   (51)
tY→X|Z(n + 1, k, l) = log2 [ p(xn+1 | xn^(k), yn^(l), zn) / p(xn+1 | xn^(k), zn) ],   (52)
                    = i(yn^(l); xn+1 | xn^(k), zn).   (53)

Of course, this extra conditioning can prevent the (redundant) influence of a com-
mon drive Z from being attributed to Y , and can also include the synergistic contri-
bution when the source Y acts in conjunction with another source Z (e.g. where X is
the outcome of an XOR operation on Y and Z).
We specifically refer to the conditional transfer entropy as the complete transfer
entropy (with notation T^c_Y→X(k, l) and t^c_Y→X(n + 1, k, l) for example) when it con-
ditions on all other causal sources Z to the target X [59]. To differentiate the condi-
tional and complete transfer entropies from the original measure, we often refer to
TY →X simply as the apparent transfer entropy [59] - this nomenclature conveys that
the result is the information transfer that is apparent without accounting for other
sources.
Finally, note that one can decompose the mutual information from a set of sources
to a target as a sum of incrementally conditioned mutual information terms [60, 56,
53]. For example, for a two source system we have:
I(Xn+1; {Xn^(k), Y1,n, Y2,n}) = I(Xn+1; Xn^(k)) + I(Xn+1; Y1,n | Xn^(k))
                              + I(Xn+1; Y2,n | Xn^(k), Y1,n),   (54)
                            = AX(k) + TY1→X(k) + TY2→X|Y1(k).

This equation could be reversed in the order of Y1 and Y2 , and its correctness is
independent of k (so long as k is large enough to capture the causal sources in the
past of the target). Crucially, this equation reveals the nature in which information
storage (AX ) and transfer (TY1 →X , etc.) are complementary operations in distributed
computation.
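The decomposition (54) can also be checked numerically. The sketch below (illustrative; the XOR construction and variable names are ours) uses a generic plug-in conditional MI on a target that is the XOR of two independent sources: neither source is individually predictive and the storage term is negligible, so essentially all of the one bit in the target appears in the final, conditioned transfer term, with the three terms summing to the joint mutual information as required.

```python
import numpy as np
from collections import Counter

def cond_mi(a, b, c):
    """Plug-in I(A;B|C) in bits from three aligned sequences of hashable symbols."""
    n = len(a)
    abc = Counter(zip(a, b, c))
    ac = Counter(zip(a, c))
    bc = Counter(zip(b, c))
    cc = Counter(c)
    total = 0.0
    for (av, bv, cv), count in abc.items():
        # local i(a;b|c) = log2 [ p(a|b,c) / p(a|c) ], as in (27)
        local = np.log2((count / bc[(bv, cv)]) / (ac[(av, cv)] / cc[cv]))
        total += (count / n) * local
    return total

rng = np.random.default_rng(6)
n = 20000
y1 = rng.integers(0, 2, n)
y2 = rng.integers(0, 2, n)
x_next = y1 ^ y2                          # target: XOR of the two sources (purely synergistic)
x_past = rng.integers(0, 2, n)            # an unrelated past state of the target (k = 1)
const = np.zeros(n, dtype=int)            # dummy conditioning variable for unconditioned terms

total = cond_mi(x_next, list(zip(x_past, y1, y2)), const)   # I(X_{n+1}; {X_n, Y1, Y2})
a_x = cond_mi(x_next, x_past, const)                        # A_X(k)
t_y1 = cond_mi(x_next, y1, x_past)                          # T_{Y1 -> X}(k)
t_y2_c = cond_mi(x_next, y2, list(zip(x_past, y1)))         # T_{Y2 -> X | Y1}(k)
print(total, a_x + t_y1 + t_y2_c)                           # equal, and close to 1 bit
```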
5 Local Information Processing in Cellular Automata
In this section, we review the application of local information storage and transfer
measures to cellular automata (as first presented in [58, 59, 56, 60, 62, 61]), in order
to demonstrate the ability of the local measures to reveal deeper insights into the dy-
namics of complex systems than their averaged and more well-known counterparts.
Cellular automata (CAs) are discrete dynamical systems with an array of cells
that synchronously update their value as a function of a fixed number of spatial
neighbours using a uniform rule [97]. The update rule is specified by listing the next
value for a given cell as a function of each possible configuration of its neighbour-
hood in a rule table – see Table 1 – and summarising this specification in a single
number (known as a Wolfram number; see [97]). We focus here on Elementary CAs
(ECAs), which are 1D arrays of binary-valued cells with one neighbour on either
side.
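To make the rule-table encoding concrete, here is a brief Python sketch (ours, for illustration only; the chapter's figures below were generated with JIDT) that unpacks a Wolfram rule number into the lookup table of Table 1 and iterates an ECA with periodic boundary conditions:

```python
import numpy as np

def eca_run(rule, width=80, steps=40, seed=0):
    """Run an elementary CA: 'rule' is the Wolfram rule number (e.g. 54 or 18)."""
    # Bit i of the rule number is the next value for neighbourhood index
    # i = 4*left + 2*centre + 1*right, i.e. Table 1 read from bottom to top.
    table = np.array([(rule >> i) & 1 for i in range(8)], dtype=np.uint8)
    rng = np.random.default_rng(seed)
    cells = rng.integers(0, 2, width, dtype=np.uint8)
    history = [cells.copy()]
    for _ in range(steps - 1):
        left, right = np.roll(cells, 1), np.roll(cells, -1)   # periodic boundaries
        cells = table[4 * left + 2 * cells + right]
        history.append(cells.copy())
    return np.array(history)      # shape (steps, width): time increases down the rows

ca = eca_run(54, width=35, steps=35)
print(ca[:5])
```

Reading bit i of the rule number with i = 4·left + 2·centre + right reproduces the bottom-to-top reading of the rule table described in the caption of Table 1.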
Although the behaviour of each individual cell in a CA is very simple, the (non-
linear) interactions between all cells can lead to very intricate global behaviour,
meaning CAs have become a classic example of self-organised complex dynamics.
Of particular importance, CAs have been used to model real-world spatial dynamical

Table 1 Rule table for ECA rule 54. The Wolfram rule number for this rule table is composed
by taking the next cell value for each configuration, concatenating them into a binary code
starting from the bottom of the rule table as the most significant bit (e.g. b00110110 here),
and then forming the decimal rule number from that binary encoding.
Neighbourhood configuration for cell i at time n                         Next cell value
cell xi−1,n value (left)   cell xi,n value   cell xi+1,n value (right)   xi,n+1 at time n + 1
0                          0                 0                           0
0                          0                 1                           1
0                          1                 0                           1
0                          1                 1                           0
1                          0                 0                           1
1                          0                 1                           1
1                          1                 0                           0
1                          1                 1                           0
processes, including fluid flow, earthquakes and biological pattern formation [70].
Indeed, CAs have even been used in neural network models to study criticality in
avalanches of activity [75, 67]. While they may not be the most realistic microscopic
neural model available, it is certainly true that CAs can exhibit certain phenomena
that are of particular interest in neuroscience, including avalanche behaviour (e.g.
[75, 80, 47, 67]) and coherent propagating wave-like structures (e.g. [27, 17]).
Indeed, the presence of such coherent emergent structures: particles, gliders,
blinkers and domains; is what has made CAs so interesting in complex systems
science in general. A domain is a set of background configurations in a CA, any of
which will update to another configuration in the set in the absence of any distur-
bance. Domains are formally defined by computational mechanics as spatial pro-
cess languages in the CA [33]. Particles are considered to be dynamic elements of
coherent spatiotemporal structure, which are disturbances or lie in contrast to the
background domain. Gliders are regular particles, blinkers are stationary gliders.
Formally, particles are defined by computational mechanics as a boundary between
two domains [33]; as such, they can be referred to as domain walls, though this term
is usually reserved for irregular particles. Several techniques exist to filter particles
from background domains (e.g. [29, 30, 33, 34, 98, 36, 37, 84, 59, 60, 62]).
These emergent structures have been quite important to studies of distributed
computation in CAs, for example in the design or identification of universal compu-
tation (see [70]), and analyses of the dynamics of intrinsic or other specific computa-
tion ([46, 33, 71]). This is because these studies typically discuss the computation in
terms of the three primitive functions of computation and their apparent analogues
in CA dynamics [70, 46]:
• blinkers as the basis of information storage, since they periodically repeat at a
fixed location;
• particles as the basis of information transfer, since they communicate information
about the dynamics of one spatial part of the CA to another part; and
• collisions between these structures as information modification, since collision
events combine and modify the local dynamical structures.
Previous to the work reviewed here however, these analogies remained conjecture
only, based on qualitative observation of CA dynamics. In the following subsections,
we review the applications [59, 60, 62, 58, 56] of the local information storage and
transfer measures described in Sect. 4 to cellular automata.
These experiments involved constructing 10 000 cell 1-dimensional CAs, and
executing the relevant update rules to generate 600 time steps of dynamics. All
resulting 6 × 10^6 observations of cell-updates are then used to compose the relevant
PDFs, and the local measures of information storage and transfer were computed
for each observation using these PDFs. Specifically, local active information storage
aX (n, k = 16) is computed for each cell X for each time step n, while local transfer
entropy tY →X (n, k = 16, l = 1) is computed for each time step n for each target cell
X and for the two causal sources Y on either side of X (referred to as channels j = 1
and −1 for transfer across 1 cell to the right or left). The use of all observations
across all cells and time steps implies an assumption of stationarity here. This is
justified in that the large CA length and relatively short number of time steps (and
ignoring of initial steps) is designed to ensure that an attractor is not reached while
the typical transient dynamics of the CA are well-sampled. Note also that l = 1 is
used since we directly observe the interacting values and only one previous time
step is a causal source here. As such, in line with (54) we have
(k)
I(Xn+1 ; {Xn ,Yl,n ,Yr,n }) = AX (k) + TYl →X (k) + TYr →X|Yl (k), (55)

where Yl represents the causal source to the left (channel j = 1) and Yr the causal
source to the right (channel j = −1) – although their placement is interchangeable
in this equation.
Sample results of this application are displayed for rules 54 and 18 in Fig. 2
and Fig. 3. The figures displayed here were produced using the open source Java
Information Dynamics Toolkit (JIDT) [51], which can be used in Matlab, Octave
and Python as well as Java. All results can be reproduced using the Matlab/Octave
script DirectedMeasuresChapterDemo2013.m in the demos/octave/CellularAutomata example distributed with this toolkit.
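
Where the JIDT demo script is not to hand, the following is a minimal, self-contained sketch (in Python, and not the JIDT implementation referenced above) of how such local profiles can be estimated with simple plug-in PDFs: it evolves an elementary CA, pools observations over all cells and time steps (the stationarity assumption discussed above), and evaluates local active information storage and local transfer entropy pointwise. The rule number, sizes and history length k are illustrative parameters only (k is kept small here for brevity, whereas the results above use k = 16).

```python
import numpy as np
from collections import Counter

def run_eca(rule, width=200, steps=300, seed=0):
    """Evolve a 1D binary elementary CA with periodic boundaries."""
    table = [(rule >> i) & 1 for i in range(8)]            # output bit for neighbourhood index i
    rng = np.random.default_rng(seed)
    ca = np.zeros((steps, width), dtype=int)
    ca[0] = rng.integers(0, 2, width)
    for t in range(steps - 1):
        left, centre, right = np.roll(ca[t], 1), ca[t], np.roll(ca[t], -1)
        ca[t + 1] = [table[4 * l + 2 * c + r] for l, c, r in zip(left, centre, right)]
    return ca

def local_ais(ca, k=4):
    """Local active information storage a_X(n, k), plug-in estimate in bits."""
    steps, width = ca.shape
    joint, past, nxt = Counter(), Counter(), Counter()
    obs = []
    for t in range(k, steps - 1):
        for x in range(width):
            p, n = tuple(ca[t - k + 1:t + 1, x]), ca[t + 1, x]
            joint[(p, n)] += 1; past[p] += 1; nxt[n] += 1
            obs.append((t + 1, x, p, n))
    N, a = len(obs), np.zeros(ca.shape)
    for t, x, p, n in obs:                                  # log2 p(next | past) / p(next)
        a[t, x] = np.log2((joint[(p, n)] / past[p]) / (nxt[n] / N))
    return a

def local_te(ca, k=4, j=1):
    """Local transfer entropy t_{Y->X}(n, k) for channel j (source j cells to the left of the target)."""
    steps, width = ca.shape
    c_spn, c_sp, c_pn, c_p = Counter(), Counter(), Counter(), Counter()
    obs = []
    for t in range(k, steps - 1):
        for x in range(width):
            p = tuple(ca[t - k + 1:t + 1, x])               # target past x_n^(k)
            s, n = ca[t, (x - j) % width], ca[t + 1, x]     # source value y_n and next target value
            c_spn[(s, p, n)] += 1; c_sp[(s, p)] += 1; c_pn[(p, n)] += 1; c_p[p] += 1
            obs.append((t + 1, x, s, p, n))
    te = np.zeros(ca.shape)
    for t, x, s, p, n in obs:                               # log2 p(next | past, source) / p(next | past)
        te[t, x] = np.log2((c_spn[(s, p, n)] / c_sp[(s, p)]) / (c_pn[(p, n)] / c_p[p]))
    return te

ca = run_eca(54)
print("mean a_X (bits):", local_ais(ca)[5:].mean())          # skip the unpopulated initial rows
print("mean t_{Y->X}, j=1 (bits):", local_te(ca, j=1)[5:].mean())
```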
These applications provided the first quantitative evidence for the above conjec-
tures, and are discussed in the following subsections. But the most important result
for our purposes is that the local measures reveal richly-structured spatiotem-
poral profiles of the information storage and transfer dynamics here, with in-
teresting local features revealed at various points in space-time. It is simply not
possible for these dynamics to be revealed by the average measures, be they aver-
ages across all cells and times or averages just across all cells in time. These features
are uniquely provided by considering the local dynamics of information processing
in CAs, and are discussed in the following subsections.

5.1 Blinkers and Background Domains as Information Storage Entities

The first and most expected result is that blinkers (regular, stationary particles)
and regular background domains are dominant information storage entities
[62], e.g. see Fig. 2(b). This is because these structures are temporally periodic, and
so the past state of a cell x_n^{(k)} is highly predictive of the next value x_{n+1} – this means
that we have p(x_{n+1} | x_n^{(k)}) > p(x_{n+1}), giving large positive values of a_X(n + 1, k)
via (40).
In contrast, we see in Fig. 2(b) and Fig. 3(b) that moving particle structures (both
regular gliders and domain walls) are associated with negative local information
storage a_X(n + 1, k). This is because at these locations, the past state of a cell x_n^{(k)}
is part of the background domain and observing it would normally predict that the
background domain continues. Since a particle is encountered at the cell instead,
however, this past state x_n^{(k)} is in fact misinformative about the next value x_{n+1}.
That is to say, we have p(x_{n+1} | x_n^{(k)}) < p(x_{n+1}), giving negative values of a_X(n + 1, k) via (40).
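As a purely illustrative numerical reading of (40), with hypothetical probabilities chosen only to show the sign behaviour: writing a_X(n + 1, k) = log2 [ p(x_{n+1} | x_n^{(k)}) / p(x_{n+1}) ], a cell inside a periodic domain with p(x_{n+1} | x_n^{(k)}) = 1 and p(x_{n+1}) = 0.5 stores log2 2 = 1 bit, whereas a cell where an incoming particle overturns the prediction of the domain past, say p(x_{n+1} | x_n^{(k)}) = 0.1 against p(x_{n+1}) = 0.5, yields log2 0.2 ≈ −2.32 bits.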
[Fig. 2 space–time plots appear here, with glider structures marked by α, β and γ labels: (a) raw CA; (b) a_X(n, k = 16); (c) t_{Y→X}(n, k = 16), one cell to the right (j = 1 channel); (d) t_{Y→X}(n, k = 16), one cell to the left (j = −1 channel).]

Fig. 2 Local information dynamics in ECA rule 54 for the raw values in (a) (black for “1”,
white for “0”). 35 time steps are displayed for 35 cells, and time increases down the page
for all CA plots. All units are in bits. (b) Local active information storage; Local apparent
transfer entropy: (c) one cell to the right, and (d) one cell to the left per time step.

We note that these misinformative values can only occur (for this deterministic system) where another information source is having a relatively large predictive effect on the target – to explore these further, we turn our attention to local information transfer in the next subsection.
Finally, we note that these results required a large enough k to properly capture
the past state of the cell, and could not be observed with a value of, say, k = 1 (as
discussed in [62]).

5.2 Particles, Gliders and Domain Walls as Dominant Information Transfer Entities

Perhaps the most important result from our application to CAs is that local infor-
mation transfer is typically strongly positive at moving particles in comparison to
blinkers and background domains [59]. To clarify, this is when the local information
transfer is measured at a particle in the same direction or channel j as the macro-
scopic motion of that particle.
[Fig. 3 space–time plots appear here, with structures marked by γ labels: (a) raw CA; (b) a_X(n, k = 16); (c) t_{Y→X}(n, k = 16), one cell to the left (j = −1 channel); (d) t^c_{Y→X}(n, k = 16), one cell to the left (j = −1 channel).]

Fig. 3 Local information dynamics in ECA rule 18 for the raw values in (a) (black for “1”,
white for “0”). 50 time steps are displayed for 50 cells, and all units are in bits. (b) Local
active information storage; (c) Local apparent transfer entropy one cell to the left per time
step; (d) Local complete transfer entropy one cell to the left per time step.

For example, see the highlighting of the right- and left-moving gliders for rule 54 in
Fig. 2(c) and Fig. 2(d) by the transfer entropy to the right and to the left respectively,
and similarly the left-moving sections of the domain walls for rule 18 in Fig. 3(c) and
Fig. 3(d) by the transfer entropy to the left (TE to the right omitted). In these examples,
the past state of the target cell x_n^{(k)} is part of the background domain and so is
misinformative about the next value x_{n+1} where the particle is encountered. In contrast,
the source cell y_n which is in the particle at the previous time step n (be that the left or
right neighbour, as relevant for that particular particle) is highly predictive about the next
value of the target (in the context of its past). As such, we have
p(x_{n+1} | x_n^{(k)}, y_n) > p(x_{n+1} | x_n^{(k)}), giving large positive values of
t_{Y→X}(n + 1, k) via (48).
These results for local transfer entropy are particularly important because they
provided the first quantitative evidence for the long-held conjecture that par-
ticles are the dominant information transfer agents in CAs. As stated above,
it is simply not possible for these space-time specific dynamics to be revealed
by the average transfer entropy; revealing them specifically requires the local transfer entropy.
Furthermore, the average values do not give so much as a hint towards the complex-
ities of these local dynamics: ECA rule 22 has much larger average transfer entropy
values than rule 54 (0.19 versus 0.08 bits for each, respectively, in both left and right
directions), yet has no emergent self-organized particle structures [61].
As per the information storage results, we note that these results required a large
enough k to properly capture the past state of the cell, and could not be observed
with a value of, say, k = 1 (as discussed in [59]). When linked to the result of mis-
informative storage at the particles from Sect. 5.1, we see again the complementary
nature of information storage and transfer.
It is important to note that particles are not the only points with positive local
transfer entropy. Small positive non-zero values are also often measured in the
domain and in the orthogonal direction to glider motion in space-time (e.g. see
Fig. 2(d)) [59]. These correctly indicate non-trivial information transfer in these
regions (e.g. indicating the absence of a glider), though they are dominated by the
positive transfer in the direction of glider motion.

5.3 Sources Can Be Locally Misinformative


Next, we note that local information transfer is often found to be negative at moving
particles, when measured in the orthogonal direction to macroscopic particle motion
in space-time [59]. For example, see the right-moving gliders in Fig. 2(d) or the right-moving
domain walls in Fig. 3(c). This is because the source Y here, being on the
opposite side of the target to the incoming particle and therefore still part of the
domain observed in the target's past, would suggest that this domain pattern would
continue, which is misinformative. That is to say, we have here
p(x_{n+1} | x_n^{(k)}, y_n) < p(x_{n+1} | x_n^{(k)}), giving negative values of t_{Y→X}(n + 1, k) via (48).
As described in Sect. 4.2, a source can be locally misinformative but must be pos-
itively informative on average (or at least provide zero information). These negative
or misinformative values are quite useful, since they imply that there is an extra
feature in the dynamics that is unaccounted for in the past of the source and target
alone. In the case of deterministic systems, this means that more sources must be
examined to explain the dynamics, as explored in the next subsection.

5.4 Conditional Transfer Entropy Is Complementary


Fig. 3(d) displays a profile of the local conditional transfer entropy tY →X|Z applied
to rule 18 (discussed in detail in [59]). This is the transfer entropy from the source
cell Y on the right of the target X, conditioned on the other source cell Z on the left.
Because we condition on all of the other causal sources here, this measurement may
also be referred to as a complete transfer entropy [59].
This profile is rather different to that of the apparent transfer entropy tY →X for
the same channel (i.e. from the same relative source) displayed in Fig. 3(c). The
first noticeable difference is the checkerboard pattern of transfer in the background
domain, which is only visible with the conditional measure. This pattern forms due
to complex dynamics in the domain here, with two interleaving phases. The first
phase occurs at every second cell (both in space and time), and is simply a ‘0’ –
at these cells there is strong information storage alone (see Fig. 3(b)) because the
cell value is predictable from its past (which predicts the phase accurately). The
other phase occurs at the alternate cells, and is a ‘0’ or a ‘1’ as determined via an
exclusive OR (or XOR) operation between the neighbouring left and right cells. As
such, apparent transfer entropy from either left or right cell alone provides almost no
information about the next value (hence absence of apparent transfer in the domain
– see Fig. 3(c)), whilst conditional transfer entropy provides full information about
the next value because the other contributing cell is taken into account (hence the
strong conditional transfer at every second cell in Fig. 3(d)).
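
As a toy illustration of this XOR effect (a minimal sketch on synthetic data, not the rule-18 dynamics themselves, and with the target's own past omitted from the conditioning since it is uninformative in this construction): when the next target value is the XOR of the two neighbouring sources, a plug-in estimate of the apparent (pairwise) quantity is essentially zero, while the conditional one recovers the full 1 bit.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200_000
y = rng.integers(0, 2, N)                     # left source Y_n
z = rng.integers(0, 2, N)                     # right source Z_n
x_next = y ^ z                                # target X_{n+1} = Y_n XOR Z_n

def entropy_bits(*cols):
    """Plug-in joint Shannon entropy of the given discrete columns, in bits."""
    _, counts = np.unique(np.column_stack(cols), axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

# Apparent (pairwise): I(X_{n+1}; Y_n) = H(X) + H(Y) - H(X, Y)
apparent = entropy_bits(x_next) + entropy_bits(y) - entropy_bits(x_next, y)
# Conditional (complete): I(X_{n+1}; Y_n | Z_n) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)
conditional = (entropy_bits(x_next, z) + entropy_bits(y, z)
               - entropy_bits(x_next, y, z) - entropy_bits(z))

print(f"apparent    I(X_n+1; Y_n)       = {apparent:.3f} bits (approx. 0)")
print(f"conditional I(X_n+1; Y_n | Z_n) = {conditional:.3f} bits (approx. 1)")
```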
The other noticeable difference between these profiles is that the conditional
transfer entropy does not have any negative local values, unlike the apparent trans-
fer entropy. This is because examining the source in the context of all other causal
sources in this deterministic system necessarily provides more information than not
examining the source. That is to say, there are no unaccounted sources here which
could mislead the observer, unlike that possibility for the apparent transfer entropy.
There are two key messages from the comparison of these measures:
1. The apparent and conditional transfer entropy reveal different aspects of
the dynamics of a system – neither is more correct than the other; they are both
useful and complementary. This is a particularly important message, since often
the importance of conditioning “out” all other sources using a conditional mea-
sure is emphasised, without acknowledging the complementary utility retained
by the pairwise transfer entropy. Both are required to have a full picture of the
dynamics of a system;
2. The differences in local dynamics that they reveal simply cannot be observed
here by using the average of each measure alone.

5.5 Contrasting Information Transfer and Causal Effect


Finally, we note that differences between the concepts of information transfer (as
captured by the transfer entropy) and causal effect are now well established [2, 56,
11]. We briefly review how the local perspective of transfer entropy was used to
provide insight into these differences in [56].
Causal effect refers to the extent to which the source variable has a direct influ-
ence or drive on the next state of a target variable, i.e. “if I change the value of the
source, to what extent does that alter the value of the target?” [74, 2, 56]. In this
light, consider the causal effect of the left cell xi−1,n in the seventh row of the rule
table for rule 54 in Table 1, i.e. “1 1 0 → 0”. Altering the value of this source has a
clear causal effect on the target, since it changes the rule being executed to “0 1 0 →
1” (i.e. we have a different outcome at the target). Crucially though, this particular
configuration (“1 1 0 → 0”) is observed both in the (right-moving) gliders and in
the background domain of rule 54. This means that the same causal effect occurs in
both types of dynamics (see [56], which covers this issue in more depth and explores measuring the causal effect in these dynamics using the measure presented in [2]).
This is quite different to our interpretation of information transfer in the previ-
ous sections however. This interpretation can be restated as: predictive information
transfer refers to the amount of information that a source variable adds to the state
change of a target variable; i.e. “if I know the state of the source, how much does that
help to predict the state change of the target?” [56]. In dealing with state updates
of the target, and in particular in separating information storage from transfer, the
transfer entropy has a very different perspective to causal effect. As we have seen,
local transfer entropy attributes large positive local values at the gliders here, be-
cause the source cells help prediction in the context of a target’s past, but attributes
vanishing amounts in the domain, where stored information from a target’s past is
generally sufficient for prediction.
Again, neither perspective is more correct than the other – they both provide
useful insights and are complementary. This argument is explored in more depth in
[56]. Crucially, these insights are only fully revealed with our local perspective of
information dynamics here.

6 Discussion: Relevance of Local Measures to Computational Neuroscience

In the previous section, we have demonstrated that local transfer entropy and the
associated measures of local information dynamics provide key insights into local
information processing in cellular automata that cannot be provided with traditional
average information-theoretic measures. We have gone on to use these local tech-
niques to provide similar insights in other systems, such as:
• visualising coherent waves of motion in flocks (or swarms) as information cas-
cades spreading across the flock (as previously conjectured, [12]) using local
transfer entropy [92];
• revealing coherent information transfer waves in modular robots [57];
• demonstrating information transfer as a key driver in the dynamics of network
synchronization processes, with local values dropping to zero (i.e. the synchro-
nized state has been “computed”) before it is otherwise apparent that a synchro-
nized state has been either reached or determined [9].
We can reasonably expect local information transfer and storage to provide
new insights in a computational neuroscience setting also. As described earlier,
avalanche behaviour (e.g. [80, 47, 75]) and coherent propagating wave-like struc-
tures (e.g. [27, 17]) are of particular interest in neuroscience, and particles and glid-
ers bear more than a passing resemblance to these coherent structures. Given that
local transfer entropy has been used to provide the first quantitative evidence that
similar propagating coherent structures in other domains are information transfer
entities (e.g. particles and gliders in cellular automata [59], above, motion in flocks
and swarms [92], and in modular robotics [57]), one expects that this measure will
be used to provide similar insights into these structures in neural systems.
Yet local transfer entropy will find much broader application than simply
identifying local coherent structure. It offers the opportunity to answer the question:
“Precisely when and where is information transferred between brain regions?”

The where is answerable with average transfer entropy, but the when is only pre-
cisely answerable with a local approach. This is a fundamentally important question
for us to have the opportunity to answer, because it will provide insight into the pre-
cise dynamics of how information is stored, transferred and modified in the brain
during neural computation.
For example, we have conducted a preliminary study applying this method to a
set of fMRI measurements where we could expect to see differences in local infor-
mation transfer between two conditions at specific time steps [50]. The fMRI data
set analyzed (from [86]) is a ‘Libet’-style experiment, which contains brain activity
recorded while subjects were asked to freely decide whether to push one of two but-
tons (with left or right index finger). Significant differences (at the group level) were
found in the local transfer entropy between left and right button presses from a sin-
gle source region (e.g. pre-SMA) into the left and right motor cortices respectively.
Furthermore, simple thresholding of these local transfer entropy values provides a
statistically significant prediction of which button was pressed.
These results are a strong demonstration that local transfer entropy can usefully
provide task-relevant insights into when and where information is transferred be-
tween brain regions. Once validation studies have been completed in this domain,
we expect that further utility will be found for these local information-theoretic mea-
sures in computational neuroscience. There are many studies in this domain which
will benefit from the ability to view local information storage, transfer and modifi-
cation operations on a local scale in space and time in the brain.

Acknowledgements. The author wishes to thank Michael Wibral for very helpful comments
on a draft paper and discussions on the topic, as well as Mikhail Prokopenko, Daniel Polani,
Ben Flecker and Paul Williams for useful discussions on these topics.

References
1. Ash, R.B.: Information Theory. Dover Publishers, Inc., New York (1965)
2. Ay, N., Polani, D.: Information Flows in Causal Networks. Advances in Complex Sys-
tems 11(1), 17–41 (2008)
3. Bandt, C., Pompe, B.: Permutation entropy: A natural complexity measure for time se-
ries. Physical Review Letters 88(17), 174102 (2002)
4. Barnett, L., Barrett, A.B., Seth, A.K.: Granger Causality and Transfer Entropy Are
Equivalent for Gaussian Variables. Physical Review Letters 103(23), 238701 (2009)
5. Barnett, L., Bossomaier, T.: Transfer Entropy as a Log-Likelihood Ratio. Physical Re-
view Letters 109, 138105 (2012)
6. Barnett, L., Buckley, C.L., Bullock, S.: Neural complexity and structural connectivity.
Physical Review E 79(5), 051914 (2009)
7. Boedecker, J., Obst, O., Lizier, J.T., Mayer, N.M., Asada, M.: Information processing in
echo state networks at the edge of chaos. Theory in Biosciences 131(3), 205–213 (2012)
8. Bressler, S.L., Tang, W., Sylvester, C.M., Shulman, G.L., Corbetta, M.: Top-Down Con-
trol of Human Visual Cortex by Frontal and Parietal Cortex in Anticipatory Visual Spatial
Attention. Journal of Neuroscience 28(40), 10056–10061 (2008)
9. Ceguerra, R.V., Lizier, J.T., Zomaya, A.Y.: Information storage and transfer in the syn-
chronization process in locally-connected networks. In: Proceedings of the 2011 IEEE
Symposium on Artificial Life (ALIFE), pp. 54–61. IEEE (2011)
10. Chávez, M., Martinerie, J., Le Van Quyen, M.: Statistical assessment of nonlinear causal-
ity: application to epileptic EEG signals. Journal of Neuroscience Methods 124(2), 113–
128 (2003)
11. Chicharro, D., Ledberg, A.: When Two Become One: The Limits of Causality Analysis
of Brain Dynamics. PLoS One 7(3), e32466 (2012)
12. Couzin, I.D., James, R., Croft, D.P., Krause, J.: Social Organization and Information
Transfer in Schooling Fishes. In: Brown, C., Laland, K.N., Krause, J. (eds.) Fish Cog-
nition and Behavior, Fish and Aquatic Resources, pp. 166–185. Blackwell Publishing
(2006)
13. Cover, T.M., Thomas, J.A.: Elements of Information Theory. Wiley-Interscience, New
York (1991)
14. Crutchfield, J.P., Feldman, D.P.: Regularities Unseen, Randomness Observed: Levels of
Entropy Convergence. Chaos 13(1), 25–54 (2003)
15. Crutchfield, J.P., Young, K.: Inferring statistical complexity. Physical Review Let-
ters 63(2), 105–108 (1989)
16. Dasan, J., Ramamohan, T.R., Singh, A., Nott, P.R.: Stress fluctuations in sheared Stoke-
sian suspensions. Physical Review E 66(2), 021409 (2002)
17. Derdikman, D., Hildesheim, R., Ahissar, E., Arieli, A., Grinvald, A.: Imaging spatiotem-
poral dynamics of surround inhibition in the barrels somatosensory cortex. The Journal
of Neuroscience 23(8), 3100–3105 (2003)
18. DeWeese, M.R., Meister, M.: How to measure the information gained from one symbol.
Network: Computation in Neural Systems 10, 325–340 (1999)
19. Effenberger, F.: A primer on information theory, with applications to neuroscience,
arXiv:1304.2333 (2013), http://arxiv.org/abs/1304.2333
20. Faes, L., Nollo, G., Porta, A.: Information-based detection of nonlinear Granger causality
in multivariate processes via a nonuniform embedding technique. Physical Review E 83,
051112 (2011)
21. Faes, L., Nollo, G., Porta, A.: Non-uniform multivariate embedding to assess the infor-
mation transfer in cardiovascular and cardiorespiratory variability series. Computers in
Biology and Medicine 42(3), 290–297 (2012)
22. Fano, R.M.: Transmission of information: a statistical theory of communications. MIT
Press, Cambridge (1961)
23. Flecker, B., Alford, W., Beggs, J.M., Williams, P.L., Beer, R.D.: Partial information de-
composition as a spatiotemporal filter. Chaos: An Interdisciplinary Journal of Nonlinear
Science 21(3), 037104 (2011)
24. Frenzel, S., Pompe, B.: Partial Mutual Information for Coupling Analysis of Multivariate
Time Series. Physical Review Letters 99(20), 204101 (2007)
25. Friston, K.J., Harrison, L., Penny, W.: Dynamic causal modelling. NeuroImage 19(4),
1273–1302 (2003)
26. Gomez-Herrero, G., Wu, W., Rutanen, K., Soriano, M.C., Pipa, G., Vicente, R.: Assess-
ing coupling dynamics from an ensemble of time series. arXiv:1008.0539 (2010),
http://arxiv.org/abs/1008.0539
27. Gong, P., van Leeuwen, C.: Distributed Dynamical Computation in Neural Circuits with
Propagating Coherent Activity Patterns. PLoS Computational Biology 5(12) (2009)
28. Granger, C.W.J.: Investigating causal relations by econometric models and cross-spectral
methods. Econometrica 37, 424–438 (1969)
29. Grassberger, P.: New mechanism for deterministic diffusion. Physical Review A 28(6),
3666 (1983)
30. Grassberger, P.: Long-range effects in an elementary cellular automaton. Journal of Sta-
tistical Physics 45(1-2), 27–39 (1986)
31. Grassberger, P.: Toward a quantitative theory of self-generated complexity. International
Journal of Theoretical Physics 25(9), 907–938 (1986)
32. Griffith, V., Koch, C.: Quantifying synergistic mutual information. In: Prokopenko, M.
(ed.) Guided Self-Organization: Inception, pp. 159–190. Springer, Heidelberg (2014)
33. Hanson, J.E., Crutchfield, J.P.: The Attractor-Basin Portrait of a Cellular Automaton.
Journal of Statistical Physics 66, 1415–1462 (1992)
34. Hanson, J.E., Crutchfield, J.P.: Computational mechanics of cellular automata: An ex-
ample. Physica D 103(1-4), 169–189 (1997)
35. Harder, M., Salge, C., Polani, D.: Bivariate Measure of Redundant Information. Physical
Review E 87, 012130 (2013)
36. Helvik, T., Lindgren, K., Nordahl, M.G.: Local information in one-dimensional cellu-
lar automata. In: Sloot, P.M.A., Chopard, B., Hoekstra, A.G. (eds.) ACRI 2004. LNCS,
vol. 3305, pp. 121–130. Springer, Heidelberg (2004)
37. Helvik, T., Lindgren, K., Nordahl, M.G.: Continuity of Information Transport in Surjec-
tive Cellular Automata. Communications in Mathematical Physics 272(1), 53–74 (2007)
38. Hinrichs, H., Heinze, H.J., Schoenfeld, M.A.: Causal visual interactions as revealed by
an information theoretic measure and fMRI. NeuroImage 31(3), 1051–1060 (2006)
39. Honey, C.J., Kotter, R., Breakspear, M., Sporns, O.: Network structure of cerebral cor-
tex shapes functional connectivity on multiple time scales. Proceedings of the National
Academy of Sciences 104(24), 10240–10245 (2007)
40. Ito, S., Hansen, M.E., Heiland, R., Lumsdaine, A., Litke, A.M., Beggs, J.M.: Extending
Transfer Entropy Improves Identification of Effective Connectivity in a Spiking Cortical
Network Model. PLoS One 6(11), e27431 (2011)
41. Kantz, H., Schreiber, T.: Nonlinear Time Series Analysis. Cambridge University Press,
Cambridge (1997)
42. Katare, S., West, D.H.: Optimal complex networks spontaneously emerge when infor-
mation transfer is maximized at least expense: A design perspective. Complexity 11(4),
26–35 (2006)
43. Kerr, C.C., Van Albada, S.J., Neymotin, S.A., Chadderdon, G.L., Robinson, P.A., Lyt-
ton, W.W.: Cortical information flow in parkinson’s disease: a composite network/field
model. Frontiers in Computational Neuroscience 7(39) (2013)
44. Kraskov, A.: Synchronization and Interdependence Measures and their Applications to
the Electroencephalogram of Epilepsy Patients and Clustering of Data. Publication Se-
ries of the John von Neumann Institute for Computing, vol. 24. John von Neumann In-
stitute for Computing, Jülich (2004)
45. Kraskov, A., Stögbauer, H., Grassberger, P.: Estimating mutual information. Physical
Review E 69(6), 066138 (2004)
46. Langton, C.G.: Computation at the edge of chaos: phase transitions and emergent com-
putation. Physica D 42(1-3), 12–37 (1990)
47. Levina, A., Herrmann, J.M., Geisel, T.: Dynamical synapses causing self-organized crit-
icality in neural networks. Nature Physics 3(12), 857–860 (2007)
48. Liang, H., Ding, M., Bressler, S.L.: Temporal dynamics of information flow in the cere-
bral cortex. Neurocomputing 38-40, 1429–1435 (2001)
49. Lindner, M., Vicente, R., Priesemann, V., Wibral, M.: TRENTOOL: A Matlab open
source toolbox to analyse information flow in time series data with transfer entropy.
BMC Neuroscience 12(1), 119 (2011)
50. Lizier, J., Heinzle, J., Soon, C., Haynes, J.D., Prokopenko, M.: Spatiotemporal infor-
mation transfer pattern differences in motor selection. BMC Neuroscience 12(Suppl. 1),
P261 (2011)
51. Lizier, J.T.: JIDT: An information-theoretic toolkit for studying the dynamics of complex
systems (2012),
https://code.google.com/p/information-dynamics-toolkit/
52. Lizier, J.T.: The Local Information Dynamics of Distributed Computation in Complex
Systems. Springer Theses. Springer, Heidelberg (2013)
53. Lizier, J.T., Flecker, B., Williams, P.L.: Towards a synergy-based approach to measuring
information modification. In: Proceedings of the 2013 IEEE Symposium on Artificial
Life (ALIFE), pp. 43–51. IEEE (2013)
54. Lizier, J.T., Heinzle, J., Horstmann, A., Haynes, J.D., Prokopenko, M.: Multivariate
information-theoretic measures reveal directed information structure and task relevant
changes in fMRI connectivity. Journal of Computational Neuroscience 30(1), 85–107
(2011)
55. Lizier, J.T., Pritam, S., Prokopenko, M.: Information dynamics in small-world Boolean
networks. Artificial Life 17(4), 293–314 (2011)
56. Lizier, J.T., Prokopenko, M.: Differentiating information transfer and causal effect. Eu-
ropean Physical Journal B 73(4), 605–615 (2010)
57. Lizier, J.T., Prokopenko, M., Tanev, I., Zomaya, A.Y.: Emergence of Glider-like Struc-
tures in a Modular Robotic System. In: Bullock, S., Noble, J., Watson, R., Bedau, M.A.
(eds.) Proceedings of the Eleventh International Conference on the Simulation and Syn-
thesis of Living Systems (ALife XI), Winchester, UK, pp. 366–373. MIT Press, Cam-
bridge (2008)
58. Lizier, J.T., Prokopenko, M., Zomaya, A.Y.: Detecting Non-trivial Computation in Com-
plex Dynamics. In: Almeida e Costa, F., Rocha, L.M., Costa, E., Harvey, I., Coutinho, A.
(eds.) ECAL 2007. LNCS (LNAI), vol. 4648, pp. 895–904. Springer, Heidelberg (2007)
59. Lizier, J.T., Prokopenko, M., Zomaya, A.Y.: Local information transfer as a spatiotem-
poral filter for complex systems. Physical Review E 77(2), 026110 (2008)
60. Lizier, J.T., Prokopenko, M., Zomaya, A.Y.: Information modification and particle colli-
sions in distributed computation. Chaos 20(3), 037109 (2010)
61. Lizier, J.T., Prokopenko, M., Zomaya, A.Y.: Coherent information structure in complex
computation. Theory in Biosciences 131(3), 193–203 (2012)
62. Lizier, J.T., Prokopenko, M., Zomaya, A.Y.: Local measures of information storage in
complex distributed computation. Information Sciences 208, 39–54 (2012)
63. Lizier, J.T., Rubinov, M.: Multivariate construction of effective computational networks
from observational data. Tech. Rep. Preprint 25/2012, Max Planck Institute for Mathe-
matics in the Sciences (2012)
64. Lungarella, M., Sporns, O.: Mapping Information Flow in Sensorimotor Networks. PLoS
Computational Biology 2(10), e144 (2006)
65. MacKay, D.J.C.: Information Theory, Inference, and Learning Algorithms. Cambridge
University Press, Cambridge (2003)
66. Mahoney, J.R., Ellison, C.J., James, R.G., Crutchfield, J.P.: How hidden are hidden pro-
cesses? A primer on crypticity and entropy convergence. Chaos 21(3), 037112 (2011)
67. Manchanda, K., Yadav, A.C., Ramaswamy, R.: Scaling behavior in probabilistic neuronal
cellular automata. Physical Review E 87, 012704 (2013)
68. Manning, C.D., Schütze, H.: Foundations of Statistical Natural Language Processing.
The MIT Press, Cambridge (1999)
69. Marinazzo, D., Wu, G., Pellicoro, M., Angelini, L., Stramaglia, S.: Information flow
in networks and the law of diminishing marginal returns: evidence from modeling and
human electroencephalographic recordings. PLoS One 7(9), e45026 (2012)
70. Mitchell, M.: Computation in Cellular Automata: A Selected Review. In: Gramss, T.,
Bornholdt, S., Gross, M., Mitchell, M., Pellizzari, T. (eds.) Non-Standard Computation,
pp. 95–140. VCH Verlagsgesellschaft, Weinheim (1998)
71. Mitchell, M., Crutchfield, J.P., Hraber, P.T.: Evolving Cellular Automata to Perform
Computations: Mechanisms and Impediments. Physica D 75, 361–391 (1994)
72. Nakajima, K., Li, T., Kang, R., Guglielmino, E., Caldwell, D.G., Pfeifer, R.: Local infor-
mation transfer in soft robotic arm. In: 2012 IEEE International Conference on Robotics
and Biomimetics (ROBIO), pp. 1273–1280. IEEE (2012)
73. Obst, O., Boedecker, J., Asada, M.: Improving Recurrent Neural Network Perfor-
mance Using Transfer Entropy. In: Wong, K.W., Mendis, B.S.U., Bouzerdoum, A. (eds.)
ICONIP 2010, Part II. LNCS, vol. 6444, pp. 193–200. Springer, Heidelberg (2010)
74. Pearl, J.: Causality: Models, Reasoning, and Inference. Cambridge University Press,
Cambridge (2000)
75. Priesemann, V., Munk, M., Wibral, M.: Subsampling effects in neuronal avalanche dis-
tributions recorded in vivo. BMC Neuroscience 10(1), 40 (2009)
76. Prokopenko, M., Boschietti, F., Ryan, A.J.: An Information-Theoretic Primer on Com-
plexity, Self-Organization, and Emergence. Complexity 15(1), 11–28 (2009)
77. Prokopenko, M., Gerasimov, V., Tanev, I.: Evolving Spatiotemporal Coordination in a
Modular Robotic System. In: Nolfi, S., Baldassarre, G., Calabretta, R., Hallam, J.C.T.,
Marocco, D., Meyer, J.-A., Miglino, O., Parisi, D. (eds.) SAB 2006. LNCS (LNAI),
vol. 4095, pp. 558–569. Springer, Heidelberg (2006)
78. Prokopenko, M., Lizier, J.T., Obst, O., Wang, X.R.: Relating Fisher information to order
parameters. Physical Review E 84, 041116 (2011)
79. Prokopenko, M., Lizier, J.T., Price, D.C.: On thermodynamic interpretation of transfer
entropy. Entropy 15(2), 524–543 (2013)
80. Rubinov, M., Lizier, J., Prokopenko, M., Breakspear, M.: Maximized directed informa-
tion transfer in critical neuronal networks. BMC Neuroscience 12(Suppl. 1), P18 (2011)
81. Schreiber, T.: Interdisciplinary application of nonlinear time series methods - the gener-
alized dimensions. Physics Reports 308, 1–64 (1999)
82. Schreiber, T.: Measuring Information Transfer. Physical Review Letters 85(2), 461–464
(2000)
83. Shalizi, C.R.: Causal Architecture, Complexity and Self-Organization in Time Series and
Cellular Automata. Ph.D. thesis, University of Wisconsin-Madison (2001)
84. Shalizi, C.R., Haslinger, R., Rouquier, J.B., Klinkner, K.L., Moore, C.: Automatic fil-
ters for the detection of coherent structure in spatiotemporal systems. Physical Review
E 73(3), 036104 (2006)
85. Shannon, C.E.: A mathematical theory of communication. Bell System Technical Jour-
nal 27, 379–423, 623–656 (1948)
86. Soon, C.S., Brass, M., Heinze, H.J., Haynes, J.D.: Unconscious determinants of free
decisions in the human brain. Nature Neuroscience 11(5), 543–545 (2008)
87. Staniek, M., Lehnertz, K.: Symbolic transfer entropy. Physical Review Letters 100(15),
158101 (2008)
88. Stramaglia, S., Wu, G.R., Pellicoro, M., Marinazzo, D.: Expanding the transfer entropy to
identify information subgraphs in complex systems. In: Proceedings of the 2012 Annual
International Conference of the IEEE Engineering in Medicine and Biology Society, pp.
3668–3671. IEEE (2012)
89. Ver Steeg, G., Galstyan, A.: Information-theoretic measures of influence based on con-
tent dynamics. In: Proceedings of the Sixth ACM International Conference on Web
Search and Data Mining, pp. 3–12 (2013)
90. Verdes, P.F.: Assessing causality from multivariate time series. Physical Review E 72(2),
026222 (2005)
91. Vicente, R., Wibral, M., Lindner, M., Pipa, G.: Transfer entropy–a model-free mea-
sure of effective connectivity for the neurosciences. Journal of Computational Neuro-
science 30(1), 45–67 (2011)
92. Wang, X.R., Miller, J.M., Lizier, J.T., Prokopenko, M., Rossi, L.F.: Quantifying and
Tracing Information Cascades in Swarms. PLoS One 7(7), e40084 (2012)
93. Wibral, M., Pampu, N., Priesemann, V., Siebenhühner, F., Seiwert, H., Lindner, M.,
Lizier, J.T., Vicente, R.: Measuring Information-Transfer delays. PLoS One 8(2), e55809
(2013)
94. Wibral, M., Rahm, B., Rieder, M., Lindner, M., Vicente, R., Kaiser, J.: Transfer entropy
in magnetoencephalographic data: quantifying information flow in cortical and cerebellar
networks. Progress in Biophysics and Molecular Biology 105(1-2), 80–97 (2011)
95. Williams, P.L., Beer, R.D.: Nonnegative Decomposition of Multivariate Information.
arXiv:1004.2515 (2010), http://arxiv.org/abs/1004.2515
96. Williams, P.L., Beer, R.D.: Generalized Measures of Information Transfer.
arXiv:1102.1507 (2011), http://arxiv.org/abs/1102.1507
97. Wolfram, S.: A New Kind of Science. Wolfram Media, Champaign (2002)
98. Wuensche, A.: Classifying cellular automata automatically: Finding gliders, filtering,
and relating space-time patterns, attractor basins, and the Z parameter. Complexity 4(3),
47–66 (1999)
Parametric and Non-parametric Criteria
for Causal Inference from Time-Series

Daniel Chicharro

Abstract. Granger causality constitutes a criterion for causal inference from time
series that has been largely applied to study causal interactions in the brain from
electrophysiological recordings. This criterion underlies the classical parametric im-
plementation in terms of linear autoregressive processes as well as Transfer entropy,
i.e. a non-parametric implementation in the framework of information theory. In the
spectral domain, partial directed coherence and the Geweke formulation are related
to Granger causality but rely on alternative criteria for causal inference which are in-
herently based on the parametric formulation in terms of autoregressive processes.
Here we clearly differentiate between criteria for causal inference and measures
used to test them. We compare the different criteria for causal inference from time-
series and we further introduce new criteria that complete a unified picture of how
the different approaches are related. Furthermore, we compare the different mea-
sures that implement these criteria in the information theory framework.

1 Introduction
The inference of causality in a system of interacting processes from recorded time-
series is a subject of interest in many fields. Particularly successful has been the
concept of Granger causality [29, 31], originally applied to economic time-series. In
recent years, measures of causal inference have also been widely applied to elec-
trophysiological signals, in particular to characterize causal interactions between
different brain areas (see [46, 28, 10] for a review of Granger causality measures
applied to neural data).
In the original formulation of Granger causality, causality from a process Y to a
process X was examined based on the reduction of the prediction error of X when
Daniel Chicharro
Center for Neuroscience and Cognitive Systems@UniTn, Istituto Italiano di Tecnologia,
Via Bettini 31, 38068 Rovereto (TN)
e-mail: chicharro31@yahoo.es

including the past of Y [60, 29]. However, this prediction error criterion generalizes
to a criterion of conditional independence on probability distributions [31] that is
generally applicable to stationary and non-stationary stochastic processes.
Here we consider the criterion of Granger causality together with related criteria
of causal inference, like Sims causality [55]. We also consider the criteria underly-
ing other measures that have been introduced to infer causality but for which the
underlying criterion has not been made explicit. This includes the Geweke spectral
measures of causality (GSC) [25, 26], and partial directed coherence (PDC) [5]. We
make a clear distinction between criteria for causal inference and measures imple-
menting them. Accordingly, by Granger causality we refer to the general criterion
of causal inference and not, as is often the case, to the measure implementing it
for linear processes. This means that we consider transfer entropy [54] as a partic-
ular measure to test for Granger causality in the information-theoretic framework
(e.g. [56, 1]).
This distinction between criteria and measures is important because in practice
one is usually not only interested in assessing the existence of a causal connection
but in evaluating its strength (e.g. [11, 9, 8, 52, 59]). Causal inference can be associ-
ated with the construction of a causal graph representing which connections exist in
the system [19]. However, quantifying the causal effects resulting from these con-
nections is a more difficult task. Recently [16] examined how the general notion of
causality developed by Pearl [45] can be applied to study the natural dynamics of
complex systems. This notion is based on the idea of externally manipulating the
system to evaluate causal effects. For example, if one is studying causal connec-
tivity in the brain, this manipulation could be the deactivation of some connections
between brain areas, or electrically stimulating a given area. It is clear that these
manipulations alter the normal dynamics of the brain, those which one wants to
analyze in order to understand neural computations. Accordingly, [16] pointed out
that if the main interest is not the effect of external perturbations, but how the causal
connections participate in the generation of the unperturbed dynamics of the system,
then only in some cases it is meaningful to characterize interactions between differ-
ent subsystems in terms of the effect of one subsystem over another. To identify
these cases the notion of natural causal effects between dynamics was introduced
and conditions for their existence were provided. Consequently, Granger causality
measures, and in particular transfer entropy, cannot be used in general as measures
of the strength of causal effects [4, 39]. Alternatively, a different approach was de-
veloped in [15]. Instead of examining the causal effects resulting from the causal
connections, a unifying multivariate framework to study the dynamic dependencies
between the subsystems that arise from the causal interactions was proposed.
Considering this, we here focus on the criteria for causal inference; the mea-
sures are only used as statistics to test these criteria. We closely follow [14] relating
the different formulations of Granger causality and the corresponding criteria of
causal inference, and integrating parametric and non-parametric formulations, as
well as time-domain and spectral formulations, for both bivariate and multivariate
systems. Furthermore, we do not discuss the fundamental assumptions that deter-
mine the valid applicability of the criterion of Granger causality. In particular we
assume that all the relevant processes are observed and well-defined. This is of
course a big idealization for real applications, but our purpose is to examine the
relation between the different criteria and measures that appear in the different for-
mulations of Granger causality. (For a detailed discussion of the limitations of these
criteria see [58, 16]). More generally, [45] offers a complete explanation of the lim-
itations of causal inference without intervening on the system.
This Chapter is organized as follows: In Section 2 we review the non-parametric
formulation of the criteria of Granger and Sims causality and the information-
theoretic measures, including transfer entropy, used to test them. In section 3 we
review the parametric autoregressive representation of the processes and the time
domain and spectral measures of Granger causality, in particular GSC and PDC. We
make explicit the parametric criteria of causal inference underlying these measures
and discuss their relation to the non-parametric criteria. Furthermore we introduce
related new criteria for causal inference that allow us to complete a consistent uni-
fying picture that integrates all the criteria and measures. This picture is presented
all together in Section 4.

2 Non-parametric Approach to Causal Inference from Time-Series

We here review Granger causality and Sims causality as non-parametric criteria to
infer causality from time-series as well as some measures used to test them. Al-
though both the criteria of Granger causality [29, 30] and Sims causality [55] were
originally introduced in combination with a linear formulation, we here consider
their general non-parametric expression [31, 12].

2.1 Non-parametric Criteria for Causal Inference


In [31] a general criterion was stated for causal inference from time-series based
on the comparison of two probability distributions. We consider first its bivariate
formulation. Assume that for the processes X and Y we record two time-series
{X} = {X1 , X2 , ..., XN } and {Y } = {Y1 ,Y2 , ...,YN }. Granger causality states that there
is no causality from Y to X if the equality

p(Xt+1 |X t ) = p(Xt+1 |X t ,Y t ) ∀X t ,Y t (1)

holds. Here X t = {Xt , Xt−1 , ...X1 } is the past of the process at time t. From now on
we will assume stationarity so that the results do not depend on the particular time.
Therefore we consider N → ∞ and select t such that X t accounts for the infinite past
of the process. See [56, 15] for a non-stationary formulation. According to Eq. 1
Granger causality indicates that there is no causality from Y to X when the future
Xt+1 is conditionally independent of the past Y t given the partialization on its own
past X t . That is, the past of Y has no dependence with the future of X that cannot be
accounted for by the past of X.

As an alternative criterion Sims causality [55] examines the equality

p(X t+1:N |X t ,Y t ) = p(X t+1:N |X t ,Y t ,Yt+1 ) ∀X t ,Y t ,Yt+1 . (2)

It states that there is no causality from Y to X if the whole future X t+1:N is condi-
tionally independent of Yt+1 given the past of the two processes. In fact, assuming
stationarity it is not necessary to condition on Y t so that like Granger causality the
criterion indicates that the future of X is completely determined by its own past (see
[37] for a detailed review of the relation between the two criteria).
While Granger causality and Sims causality are equivalent criteria for the bivari-
ate case [12], this is not true for multivariate processes. When other processes also
interact with X and Y it is necessary to distinguish a causal connection from Y to
X from other connections that also result in statistical dependencies incompatible
with the equality in Eq. 1. These other connections are indirect causal connections
Y → Z → X as well as the effect of common drivers, i.e. a common parent Z such
that Z → Y and Z → X. The formulation of Granger causality turns out to be easily
generalizable to account for these influences resulting in the equality

p(Xt+1 |X t , Zt ) = p(Xt+1 |X t ,Y t , Zt ) ∀X t ,Y t , Zt , (3)

where Zt refers to the past of any other process that interacts with X and Y . In fact, on
which processes it is needed to condition depends on the particular causal structure
of the system, which is exactly what one wants to infer. This renders the criterion
of Granger causality context dependent [31]. This means that if Z does not include
all the relevant processes a false positive can be obtained when testing for causality
from Eq. 3. The problem of hidden variables for causal inference is an issue not
specific for time-series that in general can only be addressed by an interventional
treatment of causality [45]. In practice, from observational data, some procedures
can help to optimize the selection of the variables on which to condition [22, 41].
In this Chapter we do not further deal with this problem and we assume that all the
relevant processes are observed.
In contrast to Granger causality, Sims causality cannot be generalized to the mul-
tivariate case as a criterion for causal inference. The reason is that, since in Eq. 2 the
whole future X t+1:N is considered jointly, there is no way to disentangle direct from
indirect causal connections from Y to X. This means that for multivariate processes
the criterion of Granger causality in Eq. 3 remains as the unique non-parametric
criterion for causal inference between the time series.

2.2 Measures to Test for Causality


In this Chapter we want to clearly differentiate between the criteria for causal infer-
ence and the particular measures used to test for causality according to these criteria.
This is why by Granger causality we refer to the general criterion proposed in [31]
(Eqs. 1 and 3) so that Granger causality measures include both the transfer entropy
and the linear Granger causality measure. The linear measure, which quantifies the
predictability improvement [60], implements for linear processes a test on the equal-
ity of the mean of the distributions appearing in Eq. 1. More generally, if one wants
to test for the equality between two probability distributions without examining spe-
cific moments of a given order, the Kullback-Leibler divergence (KL-divergence) [38]

KL(p*(x), p(x)) = Σ_x p*(x) log [ p*(x) / p(x) ]     (4)

is a non-negative measure that is zero if and only if the two distributions are iden-
tical. For a multivariate variable X, since it quantifies the divergence of the distri-
bution p(x) from p∗ (x), one can construct p(x) to reflect a specific null-hypothesis
about the dependence between the components of X. As particular applications of
the KL-divergence to quantify the interdependence between random variables one
has the conditional mutual information

I(X; Y | Z) = Σ_{x,y,z} p(x, y, z) log [ p(x|y, z) / p(x|z) ].     (5)

We can see that the form of the probability distributions in the argument of the
logarithm is the same as the ones in Eqs. 1-3. Accordingly, testing the equality of
Eq. 1 is equivalent to having a zero transfer entropy [54, 44]

TY →X = I(Xt+1 ;Y t |X t ) = 0. (6)

An analogous information-theoretic measure of Sims causality is obtained so that Eq. 2 leads to
SY →X = I(Yt+1 ; X t+1:N |Y t , X t ) = 0. (7)
For multivariate processes Eq. 3 leads to a zero conditional transfer entropy

TY →X|Z = I(Xt+1 ;Y t |X t , Zt ) = 0. (8)

[54] introduced the transfer entropy to test the equality of Eq. 1 further assum-
ing that the processes were Markovian with a finite order. A similar information-
theoretic quantity, the directed information, has been introduced in the context of
communication theory [42, 43, 36]. The directed information was originally formu-
lated for the non-stationary case and naturally appears in a causal decomposition of
the mutual information (e.g. [1]). Such a decomposition can also be expressed in
terms of transfer entropies, and is valid for both a non-stationary formulation of the
measures which is local in time and another that is cumulative on the whole time
series [15]. These two formulations converge for the stationary case resulting in

I(X N ;Y N ) = TY →X + TX→Y + TX·Y , (9)

where TX·Y is a measure of instantaneous causality. From this relation it can be
checked that for both the cumulative non-stationary formulation and for the station-
ary one, if there is no instantaneous causality

T_{Y→X} = H(X_{i+1} | X^i) − H(X_{i+1} | X^i, Y^i) = H(Y_{i+1} | X^i, Y^i) − H(Y^N | X^N) = S_{Y→X}.     (10)

This equality, restricted to the stationary linear case, is indicated already in Theorem
1(ii) of [25], where no instantaneous causality is enforced by a normalization of the
covariance matrix.
Notice that here we consider the measures as particular instantiations of the KL-
divergence used as a statistic for hypothesis testing [38]. This is important to keep in
mind because the KL-divergence can be interpreted as well in terms of code length
[17], and in particular the transfer entropy (directed information) determines the
error-free transmission rate when applied to specific communication channels with
feedback [36] (see also [47] for a discussion of different applications of trans-
fer entropy). Furthermore, any conditional mutual information can be evaluated as
a difference of two conditional entropies, and interpreted as a reduction of uncer-
tainty. To test for causality only the significance of nonzero values is of interest,
but it is common to use the values of TY →X to characterize the causal dependencies.
Alternatively, the value of SY →X could be used, giving a not necessarily equivalent
characterization if the conditions of Eq. 10 are not fulfilled or depending on the
particular estimation procedure.
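
In practice, significance is usually assessed by comparing the estimated transfer entropy to a null distribution built from surrogate data in which the source's temporal relation to the target is destroyed. The following is a minimal sketch under assumptions not fixed by the text (binary series, a plug-in estimator with history length 1, and circular-shift surrogates); it is meant only to show the logic of using T_{Y→X} as a test statistic, not as a recommended estimator or test.

```python
import numpy as np

def plug_in_te(x, y, base=2):
    """Plug-in transfer entropy T_{Y->X} = I(X_{t+1}; Y_t | X_t) for discrete series, in bits."""
    xt, yt, xn = x[:-1], y[:-1], x[1:]
    counts = np.zeros((base, base, base))
    np.add.at(counts, (xn, yt, xt), 1)                 # joint counts over (x_{t+1}, y_t, x_t)
    p = counts / counts.sum()
    p_xt = p.sum(axis=(0, 1))                          # p(x_t)
    p_xn_xt = p.sum(axis=1)                            # p(x_{t+1}, x_t)
    p_yt_xt = p.sum(axis=0)                            # p(y_t, x_t)
    te = 0.0
    for i, j, k in np.ndindex(base, base, base):
        if p[i, j, k] > 0:
            te += p[i, j, k] * np.log2(p[i, j, k] * p_xt[k]
                                       / (p_xn_xt[i, k] * p_yt_xt[j, k]))
    return te

rng = np.random.default_rng(2)
T = 5000
y = rng.integers(0, 2, T)
x = np.empty(T, dtype=int)
x[0] = 0
for t in range(T - 1):                                 # X copies Y with 10% flip noise
    x[t + 1] = int(y[t]) ^ int(rng.random() < 0.1)

te_obs = plug_in_te(x, y)
# Null distribution from circularly shifted source surrogates (temporal relation destroyed)
null = [plug_in_te(x, np.roll(y, rng.integers(100, T - 100))) for _ in range(200)]
p_value = (np.sum(np.array(null) >= te_obs) + 1) / (len(null) + 1)
print(f"T_Y->X = {te_obs:.3f} bits, surrogate p-value = {p_value:.3f}")
```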
More generally, the KL-divergence is not the only option to test the criteria of
causality above in a non-parametric way. Other measures have been proposed based
on the same criterion (e.g. [33, 2]) that are sensitive to higher-order moments of
the distributions. A natural alternative that also considers all the moments of the
distributions is to use the Fisher information

∂ ln p(Y |x) 2
F(Y ; x) = dY p(Y |x)( ) (11)
∂x
which, by means of the Cramér-Rao bound [17], is related to the accuracy of an
unbiased estimator of X from Y. For the particular equality of Eq. 1 this leads to the test

Eyt [F(Xt+1 ; yt |X t )] = 0. (12)

In the Appendix we examine in detail this expression for linear Gaussian autore-
gressive processes.

3 Parametric Approach to Causal Inference from Time-Series


The criteria of Section 2.1 do not assume any particular form of the processes. In
contrast, in the implementation originally introduced by [29], the processes are as-
sumed to have a linear autoregressive representation. Here by parametric we refer
specifically to the assumption of this representation. Notice that this is different from
a parametric approach in which not the processes but the probability distributions
are estimated parametrically, for example using generalized linear models [49].
We first review the autoregressive representation of stationary stochastic pro-
cesses for bivariate and multivariate systems, describing the projections used in the
different linear formulations of Granger causality. We then review these formula-
tions, in particular the Geweke formulation in the temporal and spectral domain
[25, 26] and partial directed coherence [5, 53]. Apart from stationarity we will as-
sume that there is no instantaneous causality, i. e. that the covariance matrices of
the innovations terms in the autoregressive representation are diagonal. This sub-
stantially simplifies the formulation avoiding a normalization step [25, 18]. Further-
more, strictly speaking, the existence of instantaneous causality is a signature of
time or spatial aggregation, or of the existence of hidden variables, that questions
the validity of the causal inference [30].

3.1 The Autoregressive Process Representation


Consider the system formed by the stationary stochastic processes X and Y . Two
projections are required to construct the bivariate linear measure of Granger causal-
ity from Y to X. First, the projection of X_{t+1} on its own past:

X_{t+1} = Σ_{s=0}^{∞} a^{(x)}_{x,s} X_{t−s} + ε^{(x)}_{x,t+1},     var(ε^{(x)}_x) = Σ^{(x)}_x,     (13)

second, its projection on the past of both X and Y:

X_{t+1} = Σ_{s=0}^{∞} ( a^{(xy)}_{xx,s} X_{t−s} + a^{(xy)}_{xy,s} Y_{t−s} ) + ε^{(xy)}_{x,t+1}
Y_{t+1} = Σ_{s=0}^{∞} ( a^{(xy)}_{yx,s} X_{t−s} + a^{(xy)}_{yy,s} Y_{t−s} ) + ε^{(xy)}_{y,t+1}     (14)

Σ^{(xy)} = ( Σ^{(xy)}_{xx}  Σ^{(xy)}_{xy} ; Σ^{(xy)}_{yx}  Σ^{(xy)}_{yy} ),     (15)

where Σ^{(xy)}_{xx} = var(ε^{(xy)}_x), Σ^{(xy)}_{yy} = var(ε^{(xy)}_y), Σ^{(xy)}_{xy} = cov(ε^{(xy)}_x, ε^{(xy)}_y), and Σ^{(xy)}_{yx} = (Σ^{(xy)}_{xy})^T. Notice that while the subindexes are used to refer to the corresponding vari-
able or to components of a matrix, the superindexes refer to the particular projection.
As we said above, we assume that Σ (xy) is diagonal.
[25] also proved the equality between Granger and Sims causality measures for
linear autoregressive processes. For that purpose the projection of Y_{t+1} on the
whole process X is also needed:

Y_{t+1} = Σ_{s=−∞}^{∞} b^{(xy)}_{x,s} X_{t−s} + η^{(xy)}_{y,t+1}.     (16)

For multivariate systems we consider the fully multivariate autoregressive representation of the system W = {X,Y, Z}:

X_{t+1} = Σ_{s=0}^{∞} ( a^{(xyz)}_{xx,s} X_{t−s} + a^{(xyz)}_{xy,s} Y_{t−s} + a^{(xyz)}_{xz,s} Z_{t−s} ) + ε^{(xyz)}_{x,t+1}
Y_{t+1} = Σ_{s=0}^{∞} ( a^{(xyz)}_{yx,s} X_{t−s} + a^{(xyz)}_{yy,s} Y_{t−s} + a^{(xyz)}_{yz,s} Z_{t−s} ) + ε^{(xyz)}_{y,t+1}     (17)
Z_{t+1} = Σ_{s=0}^{∞} ( a^{(xyz)}_{zx,s} X_{t−s} + a^{(xyz)}_{zy,s} Y_{t−s} + a^{(xyz)}_{zz,s} Z_{t−s} ) + ε^{(xyz)}_{z,t+1}

Σ^{(xyz)} = ( Σ^{(xyz)}_{xx}  Σ^{(xyz)}_{xy}  Σ^{(xyz)}_{xz} ; Σ^{(xyz)}_{yx}  Σ^{(xyz)}_{yy}  Σ^{(xyz)}_{yz} ; Σ^{(xyz)}_{zx}  Σ^{(xyz)}_{zy}  Σ^{(xyz)}_{zz} ).     (18)
As for the bivariate case, we assume that Σ^{(xyz)} is diagonal.
Apart from the joint autoregressive representation of W, to calculate the conditional GSC from Y to X the projection of X_{t+1} on only the past of X and Z is also needed:

X_{t+1} = Σ_{s=0}^{∞} ( a^{(xz)}_{xx,s} X_{t−s} + a^{(xz)}_{xz,s} Z_{t−s} ) + ε^{(xz)}_{x,t+1}
Z_{t+1} = Σ_{s=0}^{∞} ( a^{(xz)}_{zx,s} X_{t−s} + a^{(xz)}_{zz,s} Z_{t−s} ) + ε^{(xz)}_{z,t+1}     (19)

Σ^{(xz)} = ( Σ^{(xz)}_{xx}  Σ^{(xz)}_{xz} ; Σ^{(xz)}_{zx}  Σ^{(xz)}_{zz} ).     (20)

3.2 Parametric Measures of Causality


The autoregressive representations described in Section 3.1 have been used to define
quite many measures related to the criterion of Granger causality. We here focus on
the Geweke measures [25, 26], and partial directed coherence [5]. Other measures
introduce some variation or refinement of these measures to deal with estimation
problems or attenuate the influence of hidden variables (e.g. [13, 32, 52]). Fur-
thermore, directed transfer function [35] is another related measure [14] but only
equivalent to Geweke measure for bivariate systems [20].

3.2.1 The Geweke Measures of Granger Causality


The temporal formulation and the relation between linear Granger causality and
transfer entropy

Granger [29, 30] proposed to test for causality from Y to X by examining whether there is
an improvement in the predictability of Xt+1 when using the past of Y in addition to the
past of X for an optimal linear predictor. For a linear predictor h(X t ), using only
information from the past of X, the squared error is determined by

E^{(x)} = ∫ dX_{t+1} dX^t ( X_{t+1} − h(X^t) )² p(X_{t+1}, X^t),     (21)

and analogously for E^{(xy)} using information from the past of X and Y. Since the
optimal linear predictor is the conditional mean [40], we have that

E^{(x)} = ∫ dX^t p(X^t) ∫ dX_{t+1} ( X_{t+1} − E_{X_{t+1}}[X_{t+1} | X^t] )² p(X_{t+1} | X^t) = E_{X^t}[σ²(X_{t+1} | X^t)].     (22)

If the autoregressive representation of Eq. 13 is assumed to be valid the variance
σ²(X_{t+1} | X^t) does not depend on the value of X^t and we have

E^{(x)} = E_{X^t}[σ²(X_{t+1} | X^t)] = Σ^{(x)}_x.     (23)

An analogous equality is obtained for E^{(xy)}, so that the Geweke measure of Granger
causality is defined as:

G_{Y→X} = ln( Σ^{(x)}_x / Σ^{(xy)}_{xx} ),     (24)
using the autoregressive representation of Eqs. 13-15. This measure, as indicated in
[31], tests if there is causality from Y to X in mean, that is, the equality:

EXt+1 [Xt+1 |X t ] = EXt+1 [Xt+1 |X t ,Y t ] ∀X t ,Y t . (25)

Accordingly, given Eqs. 1 and 25, it is clear that

$$T_{Y\to X} = 0 \;\Rightarrow\; G_{Y\to X} = 0, \qquad (26)$$

since the latter tests only for a difference in the first-order moment, while the former compares the whole probability distributions. In principle, the opposite implication is not always true. However, since Eq. 25, as well as Eqs. 1-3, impose a stack of constraints (one for each value of the conditioning variables), we expect that, at least in general, an inequality in the higher-order moments is accompanied by one in the conditional means. Furthermore, when the autoregressive representations are assumed to be valid, testing for the equality in the mean or in the variance of the distributions is equivalent, given Eq. 23 and the fact that the conditional variance does not depend on the conditioning value. Notice that Gaussianity does not have to be assumed for this equality; in [25] it is only further assumed in order to derive the distribution of the measures under the null hypothesis of no causality.
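For readers who want to relate Eq. 24 to data, the following is a minimal sketch (not code from this chapter; all names such as `geweke_gc`, `lagged_design` and `residual_variance` are illustrative) of estimating the time-domain measure by fitting the two autoregressive projections with ordinary least squares and comparing residual variances. It assumes a finite model order p rather than the infinite-order representations used above.

```python
import numpy as np

def lagged_design(data, p):
    """Stack the past p values of each column of `data` as regressors."""
    T, k = data.shape
    return np.column_stack([data[p - s - 1:T - s - 1, j]
                            for j in range(k) for s in range(p)])

def residual_variance(target, regressors):
    """OLS fit of target on regressors; return the residual variance."""
    beta, *_ = np.linalg.lstsq(regressors, target, rcond=None)
    resid = target - regressors @ beta
    return np.var(resid, ddof=regressors.shape[1])

def geweke_gc(x, y, p=5):
    """G_{Y->X} = ln( var(eps_x^(x)) / var(eps_x^(xy)) ), cf. Eq. 24."""
    target = x[p:]
    s_x = residual_variance(target, lagged_design(x[:, None], p))
    s_xy = residual_variance(target, lagged_design(np.column_stack([x, y]), p))
    return np.log(s_x / s_xy)

# Toy example with hypothetical coefficients: X is driven by the past of Y.
rng = np.random.default_rng(0)
T = 5000
y = rng.standard_normal(T)
x = np.zeros(T)
for t in range(1, T):
    x[t] = 0.5 * x[t - 1] + 0.4 * y[t - 1] + rng.standard_normal()
print(geweke_gc(x, y, p=5))   # clearly positive
print(geweke_gc(y, x, p=5))   # close to zero
```

In practice the model order would be selected with an information criterion and the estimated value compared against its distribution under the null hypothesis of no causality [25].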
The explanation above further relates the distinction in [31] between causation in mean (Eq. 25) and causation prima facie (Eq. 1) to the equivalence between the Geweke linear measure of Granger causality $G_{Y\to X}$ and the transfer entropy for Gaussian processes. Since a Gaussian probability distribution is completely determined by its first two moments, and the conditional variance does not depend on the conditioning value, it is clear from the explanation above that for Gaussian variables causation in mean and causation prima facie have to be equivalent. In practice this can be seen [7] by taking into account that the entropy of an N-variate Gaussian distribution is completely determined by its covariance matrix $\Sigma$:

$$H(X_{\mathrm{Gaussian}}^{N}) = \frac{1}{2}\ln\big((2\pi e)^{N} |\Sigma|\big). \qquad (27)$$
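To make the step explicit, a short sketch (assuming stationary Gaussian processes, so that the conditional distributions of Eqs. 13 and 14 are Gaussian with the residual variances as their variances) is

$$T_{Y\to X} = H(X_{t+1}\mid X^t) - H(X_{t+1}\mid X^t, Y^t) = \tfrac{1}{2}\ln\big(2\pi e\, \Sigma_x^{(x)}\big) - \tfrac{1}{2}\ln\big(2\pi e\, \Sigma_{xx}^{(xy)}\big) = \tfrac{1}{2}\ln\frac{\Sigma_x^{(x)}}{\Sigma_{xx}^{(xy)}} = \tfrac{1}{2}\, G_{Y\to X}.$$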
Accordingly, the two measures are such that:

$$G_{Y\to X} = 2\, T_{Y\to X}. \qquad (28)$$

For multivariate processes the conditional GSC [26] is defined in the time domain analogously to $G_{Y\to X}$ in Eq. 24, but now using the autoregressive representations of Eqs. 17-20:

$$G_{Y\to X|Z} = \ln\frac{\Sigma_{xx}^{(xz)}}{\Sigma_{xx}^{(xyz)}}. \qquad (29)$$
It is straightforward to see that, given the form of the entropy for Gaussian variables (Eq. 27) and the definition of the conditional transfer entropy $T_{Y\to X|Z}$ (Eq. 8), the relation between Granger causality and transfer entropy also holds for the conditional measures for Gaussian variables:

$$G_{Y\to X|Z} = 2\, T_{Y\to X|Z}. \qquad (30)$$
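A corresponding sketch for the conditional measure of Eq. 29, assuming the illustrative helpers `lagged_design` and `residual_variance` of the previous listing are in scope (again not the chapter's own code), compares the residual variance of X in the projection on the past of (X, Z) with that in the projection on (X, Y, Z):

```python
import numpy as np

def geweke_gc_conditional(x, y, z, p=5):
    """G_{Y->X|Z} = ln( var(eps_x^(xz)) / var(eps_x^(xyz)) ), cf. Eq. 29."""
    target = x[p:]
    s_xz = residual_variance(target, lagged_design(np.column_stack([x, z]), p))
    s_xyz = residual_variance(target, lagged_design(np.column_stack([x, y, z]), p))
    return np.log(s_xz / s_xyz)
```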

The spectral formulation

Geweke [25] also proposed a spectral decomposition of the time domain Granger
causality measure (Eq. 24). Geweke derived the spectral measure of causality from
Y to X, gY →X (ω ), requiring the fulfillment of some properties:
1. The spectral measure should have an intuitive interpretation so that the spectral
decomposition is useful for empirical applications.
2. The measure has to be nonnegative.
3. The temporal and spectral measures have to be related so that

$$\frac{1}{2\pi}\int_{-\pi}^{\pi} g_{Y\to X}(\omega)\, d\omega = G_{Y\to X}. \qquad (31)$$

Conditions two and three imply that

GY →X = 0 ⇔ gY →X (ω ) = 0 ∀ω . (32)

The GSC is obtained from the spectral representation of the bivariate autoregres-
sive process as follows. Fourier transforming Eq. 14 leads to:
$$\begin{pmatrix} A_{xx}^{(xy)}(\omega) & A_{xy}^{(xy)}(\omega) \\ A_{yx}^{(xy)}(\omega) & A_{yy}^{(xy)}(\omega) \end{pmatrix} \begin{pmatrix} X(\omega) \\ Y(\omega) \end{pmatrix} = \begin{pmatrix} \varepsilon_x^{(xy)}(\omega) \\ \varepsilon_y^{(xy)}(\omega) \end{pmatrix}, \qquad (33)$$
where we have $A_{xx}^{(xy)}(\omega) = 1 - \sum_{s=1}^{\infty} a_{xxs}^{(xy)} e^{-i\omega s}$, as well as $A_{xy}^{(xy)}(\omega) = -\sum_{s=1}^{\infty} a_{xys}^{(xy)} e^{-i\omega s}$, and analogously for $A_{yy}^{(xy)}(\omega)$, $A_{yx}^{(xy)}(\omega)$. The coefficients matrix $A^{(xy)}(\omega)$ can be inverted into the transfer function $H^{(xy)}(\omega) = (A^{(xy)})^{-1}(\omega)$, so that

$$\begin{pmatrix} X(\omega) \\ Y(\omega) \end{pmatrix} = \begin{pmatrix} H_{xx}^{(xy)}(\omega) & H_{xy}^{(xy)}(\omega) \\ H_{yx}^{(xy)}(\omega) & H_{yy}^{(xy)}(\omega) \end{pmatrix} \begin{pmatrix} \varepsilon_x^{(xy)}(\omega) \\ \varepsilon_y^{(xy)}(\omega) \end{pmatrix}. \qquad (34)$$

Accordingly, the spectral matrix can be expressed as:

$$S^{(xy)}(\omega) = H^{(xy)}(\omega)\, \Sigma^{(xy)} (H^{(xy)})^{*}(\omega), \qquad (35)$$

where $*$ denotes complex conjugation and matrix transposition. Given the lack of instantaneous correlations,

$$S_{xx}^{(xy)}(\omega) = \Sigma_{xx}^{(xy)} |H_{xx}^{(xy)}(\omega)|^2 + \Sigma_{yy}^{(xy)} |H_{xy}^{(xy)}(\omega)|^2. \qquad (36)$$

The GSC from Y to X at frequency ω is defined as:

$$g_{Y\to X}(\omega) = \ln\frac{S_{xx}(\omega)}{\Sigma_{xx}^{(xy)} |H_{xx}^{(xy)}(\omega)|^2}. \qquad (37)$$

This definition fulfills the requirement of being nonnegative since, given Eq. 36, $S_{xx}(\omega)$ is never smaller than $\Sigma_{xx}^{(xy)} |H_{xx}^{(xy)}(\omega)|^2$. It also fulfills the requirement of being intuitive, since $g_{Y\to X}(\omega)$ compares the total power spectrum with the portion associated with the intrinsic innovation process of X. Furthermore, the third condition is also accomplished (see [25, 57, 14] for details). This can be seen considering that

$$g_{Y\to X}(\omega) = -\ln\big(1 - |C(X, \varepsilon_y^{(xy)})|^2\big), \qquad (38)$$

where $|C(X, \varepsilon_y^{(xy)})|^2$ is the squared coherence of X with the innovations $\varepsilon_y^{(xy)}$ of Eq. 14. Given the general relation of the mutual information rate with the squared coherence [24], we have that for Gaussian variables

$$T_{Y\to X} = I(X^N; \varepsilon_y^{(xy)N}) = \frac{-1}{4\pi}\int_{-\pi}^{\pi} \ln\big(1 - |C(X, \varepsilon_y^{(xy)})|^2\big)\, d\omega. \qquad (39)$$
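As an illustration of the recipe in Eqs. 33-37, the following sketch (hypothetical names; assuming a fitted bivariate VAR of finite order p in the standard convention $W_t = \sum_k A_k W_{t-k} + \varepsilon_t$, with X as the first variable and a diagonal residual covariance `Sigma`) computes $g_{Y\to X}(\omega)$ by Fourier transforming the coefficients, inverting to the transfer function and forming the spectral matrix:

```python
import numpy as np

def spectral_gc(var_coeffs, Sigma, nfreq=512):
    """Return frequencies in [0, pi] and g_{Y->X}(omega) for a bivariate VAR,
    with X the first variable and Y the second (cf. Eqs. 33-37)."""
    freqs = np.linspace(0, np.pi, nfreq)
    g = np.empty(nfreq)
    for i, w in enumerate(freqs):
        # Coefficient matrix A(omega) = I - sum_k A_k exp(-i omega k), cf. Eq. 33
        A = np.eye(2, dtype=complex)
        for k, Ak in enumerate(var_coeffs, start=1):
            A -= Ak * np.exp(-1j * w * k)
        H = np.linalg.inv(A)                     # transfer function, Eq. 34
        S = H @ Sigma @ H.conj().T               # spectral matrix, Eq. 35
        Sxx = S[0, 0].real
        g[i] = np.log(Sxx / (Sigma[0, 0] * np.abs(H[0, 0]) ** 2))  # Eq. 37
    return freqs, g

# Sanity check of Eq. 31 on the half-axis of a real-valued process:
# G_time_domain ~ np.trapz(g, freqs) / np.pi
```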

For the multivariate case, to derive the spectral representation of $G_{Y\to X|Z}$, for simplicity we assume again that there is no instantaneous causality and that $\Sigma^{(xyz)}$ and $\Sigma^{(xz)}$ are diagonal (see [18] for a detailed derivation when instantaneous correlations exist). We rewrite Eq. 19 after Fourier transforming as:

$$\begin{pmatrix} \varepsilon_x^{(xz)}(\omega) \\ \varepsilon_z^{(xz)}(\omega) \end{pmatrix} = \begin{pmatrix} A_{xx}^{(xz)}(\omega) & A_{xz}^{(xz)}(\omega) \\ A_{zx}^{(xz)}(\omega) & A_{zz}^{(xz)}(\omega) \end{pmatrix} \begin{pmatrix} X(\omega) \\ Z(\omega) \end{pmatrix}. \qquad (40)$$
Furthermore, we rewrite Eq. 17 using the transfer function $H^{(xyz)}$:

$$\begin{pmatrix} X(\omega) \\ Y(\omega) \\ Z(\omega) \end{pmatrix} = H^{(xyz)} \begin{pmatrix} \varepsilon_x^{(xyz)}(\omega) \\ \varepsilon_y^{(xyz)}(\omega) \\ \varepsilon_z^{(xyz)}(\omega) \end{pmatrix}. \qquad (41)$$

Geweke [26] showed that

$$G_{Y\to X|Z} = G_{Y\varepsilon_z^{(xz)} \to \varepsilon_x^{(xz)}}. \qquad (42)$$

Accordingly, Eqs. 40 and 41 are combined to express Y, $\varepsilon_z^{(xz)}$ and $\varepsilon_x^{(xz)}$ in terms of the innovations of the fully multivariate process:

$$\begin{pmatrix} \varepsilon_x^{(xz)}(\omega) \\ Y(\omega) \\ \varepsilon_z^{(xz)}(\omega) \end{pmatrix} = D H^{(xyz)} \begin{pmatrix} \varepsilon_x^{(xyz)}(\omega) \\ \varepsilon_y^{(xyz)}(\omega) \\ \varepsilon_z^{(xyz)}(\omega) \end{pmatrix}, \qquad (43)$$

where

$$D = \begin{pmatrix} A_{xx}^{(xz)}(\omega) & 0 & A_{xz}^{(xz)}(\omega) \\ 0 & 1 & 0 \\ A_{zx}^{(xz)}(\omega) & 0 & A_{zz}^{(xz)}(\omega) \end{pmatrix}. \qquad (44)$$

Considering $Q = D H^{(xyz)}$, the spectral matrix of Y, $\varepsilon_z^{(xz)}$ and $\varepsilon_x^{(xz)}$ is:

$$S(\omega) = Q(\omega)\, \Sigma^{(xyz)} Q^{*}(\omega), \qquad (45)$$

and in particular

$$S_{\varepsilon_x^{(xz)}\varepsilon_x^{(xz)}}(\omega) = |Q_{xx}(\omega)|^2\, \Sigma_{xx}^{(xyz)} + |Q_{xy}(\omega)|^2\, \Sigma_{yy}^{(xyz)} + |Q_{xz}(\omega)|^2\, \Sigma_{zz}^{(xyz)}. \qquad (46)$$

The conditional GSC from Y to X given Z is defined [26] as the portion of the power spectrum associated with $\varepsilon_x^{(xyz)}$, in analogy to Eq. 37:

$$g_{Y\to X|Z}(\omega) = g_{Y\varepsilon_z^{(xz)} \to \varepsilon_x^{(xz)}}(\omega) = \ln\frac{S_{\varepsilon_x^{(xz)}\varepsilon_x^{(xz)}}(\omega)}{|Q_{xx}(\omega)|^2\, \Sigma_{xx}^{(xyz)}}. \qquad (47)$$

This measure also fulfills the requirements that [25] imposed on the spectral measures. Furthermore, in analogy to Eq. 38, $g_{Y\to X|Z}(\omega)$ is related to a multiple coherence:

$$g_{Y\to X|Z}(\omega) = -\ln\big(1 - |C(\varepsilon_x^{(xz)}, \varepsilon_y^{(xyz)}\varepsilon_z^{(xyz)})|^2\big), \qquad (48)$$
where $|C(\varepsilon_x^{(xz)}, \varepsilon_y^{(xyz)}\varepsilon_z^{(xyz)})|^2$ is the squared multiple coherence [48]. This equality results from the direct application of the definition of the squared multiple coherence (see [14] for details).

Given the definition of $g_{Y\to X|Z}(\omega)$ in terms of the squared multiple coherence, it is clear that, analogously to $G_{Y\to X}$ (Eq. 39):

$$G_{Y\to X|Z} = 2\, I(\varepsilon_x^{(xz)N}; \varepsilon_y^{(xyz)N}\varepsilon_z^{(xyz)N}). \qquad (49)$$
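The conditional decomposition of Eqs. 40-47 can be sketched in the same style as the bivariate listing above (again hypothetical code, assuming VAR(p) fits of the reduced (X, Z) system and of the full (X, Y, Z) system, diagonal residual covariances, and variable order X, Y, Z):

```python
import numpy as np

def coeff_matrix(coeffs, w, dim):
    """A(omega) = I - sum_k A_k exp(-i omega k) for a fitted VAR(p)."""
    A = np.eye(dim, dtype=complex)
    for k, Ak in enumerate(coeffs, start=1):
        A -= Ak * np.exp(-1j * w * k)
    return A

def conditional_spectral_gc(coeffs_xz, coeffs_xyz, Sigma_xyz, nfreq=512):
    """Return frequencies and g_{Y->X|Z}(omega), cf. Eqs. 40-47."""
    freqs = np.linspace(0, np.pi, nfreq)
    g = np.empty(nfreq)
    for i, w in enumerate(freqs):
        A_xz = coeff_matrix(coeffs_xz, w, 2)                 # Eq. 40 (X, Z)
        H = np.linalg.inv(coeff_matrix(coeffs_xyz, w, 3))    # Eq. 41 (X, Y, Z)
        # D of Eq. 44 embeds the reduced-model coefficients in the full system.
        D = np.array([[A_xz[0, 0], 0, A_xz[0, 1]],
                      [0,          1, 0         ],
                      [A_xz[1, 0], 0, A_xz[1, 1]]], dtype=complex)
        Q = D @ H                                            # Eq. 43
        S = (Q @ Sigma_xyz @ Q.conj().T)[0, 0].real          # Eq. 46
        g[i] = np.log(S / (np.abs(Q[0, 0]) ** 2 * Sigma_xyz[0, 0]))  # Eq. 47
    return freqs, g
```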

3.2.2 Partial Directed Coherence


The other measure related to Granger causality that we review here is partial directed
coherence [6, 5], which is defined only in the spectral domain. In particular, the
information partial directed coherence (iPDC) from Y to X [57] is defined in the
bivariate case as:
$$i\pi_{xy}^{(xy)}(\omega) = C(\varepsilon_x^{(xy)}, \eta_y^{(xy)}) = \frac{A_{xy}^{(xy)}(\omega)\,\sqrt{S_{yy|X}}}{\sqrt{\Sigma_{xx}^{(xy)}}}, \qquad (50)$$

where $A^{(xy)}(\omega)$ is the spectral representation of the autoregressive coefficients matrix of Eq. 14 and $S_{yy|X}$ is the partial spectrum [48] of the Y process when partialized on the process X. Furthermore, $\eta_y^{(xy)}$ refers to the partialized process resulting from the Y process when partialized on X, as results from Eq. 16. As in the case of the GSC, a mutual information rate is associated with the iPDC [57], and it is further related to $S_{Y\to X}$ [14] for Gaussian variables:

$$S_{Y\to X} = I(\varepsilon_x^{(xy)N}; \eta_y^{(xy)N}) = \frac{-1}{4\pi}\int_{-\pi}^{\pi} \ln\big(1 - |i\pi_{xy}^{(xy)}(\omega)|^2\big)\, d\omega. \qquad (51)$$

In the multivariate case the information partial directed coherence (iPDC) from
Y to X [57] is:
$$i\pi_{xy}^{(xyz)}(\omega) = C(\varepsilon_x^{(xyz)}, \eta_y^{(xyz)}) = \frac{A_{xy}^{(xyz)}(\omega)\,\sqrt{S_{yy|W\setminus y}}}{\sqrt{\Sigma_{xx}^{(xyz)}}}, \qquad (52)$$

where $A^{(xyz)}(\omega)$ is the spectral representation of the autoregressive coefficients matrix of Eq. 17 and $S_{yy|W\setminus y}$ is the partial spectrum of the Y process when partialized on all the other processes in the multivariate process W. Furthermore, $\eta_y^{(xyz)}$ refers to the partialized process resulting from the Y process when partialized on all the others.

In the multivariate case, not even after integration across frequencies can the iPDC be expressed in terms of the variables of the observed processes X, Y, Z. The equality [57]

$$I(\varepsilon_x^{(xyz)N}; \eta_y^{(xyz)N}) = \frac{-1}{4\pi}\int_{-\pi}^{\pi} \ln\big(1 - |i\pi_{xy}^{(xyz)}(\omega)|^2\big)\, d\omega, \qquad (53)$$

analogous to the one of the bivariate case (Eq. 51), provides only an expression which involves the innovation processes $\varepsilon_x^{(xyz)}$ and $\eta_y^{(xyz)}$.

3.3 Parametric Criteria for Causal Inference


From the review above of the spectral Geweke measures of Granger causality and of the partial directed coherence, one can see that alternative criteria for causal inference, which involve the innovation processes intrinsic to the parametric autoregressive representation, are implicit in the mutual information terms. In particular, for the bivariate case, the spectral Geweke measure is related (Eq. 39) to the criterion

$$p(X^N) = p(X^N | \varepsilon_y^{(xy)N}) \quad \forall \varepsilon_y^{(xy)N}. \qquad (54)$$

The bivariate PDC is related (Eq. 51) to

$$p(\varepsilon_x^{(xy)N}) = p(\varepsilon_x^{(xy)N} | \eta_y^{(xy)N}) \quad \forall \eta_y^{(xy)N}. \qquad (55)$$

For the multivariate case the Geweke measure is related (Eq. 49) to

$$p(\varepsilon_x^{(xz)N}) = p(\varepsilon_x^{(xz)N} | \varepsilon_y^{(xyz)N}, \varepsilon_z^{(xyz)N}) \quad \forall \varepsilon_y^{(xyz)N}, \varepsilon_z^{(xyz)N}, \qquad (56)$$

while the PDC is related (Eq. 53) to

$$p(\varepsilon_x^{(xyz)N}) = p(\varepsilon_x^{(xyz)N} | \eta_y^{(xyz)N}) \quad \forall \eta_y^{(xyz)N}. \qquad (57)$$

Comparing the non-parametric criteria of Section 2.1 with these parametric criteria, we can see another main difference, apart from the fact that the parametric ones all involve some innovation process. This difference is that in Eqs. 54-57 a temporal separation between future and past is not required to state the criteria, while the non-parametric criteria all rely explicitly on temporal precedence. The lack of temporal separation is exactly what allows the construction of the spectral measures based on the criteria of Eqs. 54-57. In [14] it was shown, based on this difference with respect to temporal separation, that transfer entropy does not have a non-parametric spectral representation. This lack of a non-parametric spectral representation of the transfer entropy can be further understood by considering why a criterion without temporal separation that involves only the processes X, Y, and no innovation processes, cannot be used for causal inference. Consider $p(X^N) = p(X^N|Y^N)$ as a criterion to infer causality from Y to X, in contrast to the ones of Eqs. 1 and 54. Using the chain rule for the probability distributions, this equality implies checking $p(X_{t+1}|X^t) = p(X_{t+1}|X^t, Y^N)$. But this equality does not hold if there is a causal connection in the opposite direction, from X to Y, because of the conditioning on the whole process $Y^N$ instead of only on its past. By contrast,
$$p(X^N | \varepsilon_y^{(xy)N}) = \prod_{t=0}^{N-1} p(X_{t+1} | X^t, \varepsilon_y^{(xy)N}) = \prod_{t=0}^{N-1} p(X_{t+1} | X^t, \varepsilon_y^{(xy)t}) = \prod_{t=0}^{N-1} p(X_{t+1} | X^t, Y^t), \qquad (58)$$

since by construction there are no causal connections from the processes to the innovation processes. The last equality can be understood considering that the autoregressive projections described in Section 3.1 introduce a functional relation between the variables, such that, for example, given Eq. 14, $X_{t+1}$ is completely determined by $\varepsilon_x^{(xy)t+1}$, $\varepsilon_y^{(xy)t}$, and analogously for $Y_{t+1}$. Accordingly, it is equivalent to condition on $X^t, \varepsilon_y^{(xy)t}$ or on $X^t, Y^t$.
The probability distributions in Eq. 1 and Eq. 54 are still not the same, as is clear from Eq. 58. However, under the assumption of stationarity, it is the functional relations that completely determine the processes from the innovation processes (and conversely) that lead to the equality in Eq. 39 of the transfer entropy with the mutual information corresponding to the comparison of the probability distributions in Eq. 54, and analogously for Eqs. 49 and 51. Remarkably, the mutual information associated with Eq. 57, as noticed above Eq. 53, is not equal to a mutual information associated with a non-parametric criterion. As indicated in Eq. 51 (see [14] for details), for bivariate processes the PDC is related to Sims causality. However, for the multivariate case, while there is no extension of Sims causality, it is clear from the comparison of the definitions in Eqs. 50 and 52, as well as from the comparison of the criteria of Eqs. 55 and 57, that the multivariate formulation appears as a natural extension of the bivariate one. This stresses the role of the functional relations that are assumed to implicitly define the innovation processes. It is not only the causal structure between the variables but also the specific functional form in which they are related that guarantees the validity of the criteria in Eqs. 54-57. In general this functional form is not required to be linear, as long as it establishes that the processes and the innovation processes are mutually determined.
Another interesting aspect is revealed by the comparison of the bivariate and multivariate criteria associated with the GSC and PDC measures, respectively. While for the PDC the multivariate criterion is a straightforward extension of the bivariate one, this is not the case for the criteria associated with the GSC. This can also be noticed by comparing the autoregressive projections used for each measure. In particular, for the bivariate case, $g_{Y\to X}(\omega)$ is obtained directly from the bivariate autoregressive representation (Eq. 14), not by combining it with the univariate autoregressive representation of X (Eq. 13). By contrast, $g_{Y\to X|Z}(\omega)$ requires the combination of the full multivariate projection (Eq. 17) and the projection on the past of X, Z (Eq. 19). Below we show that in fact there is a natural counterpart for both the criteria of Eqs. 54 and 56, respectively.

3.4 Alternative Geweke Spectral Measures


Instead of constructing $g_{Y\to X}(\omega)$ just from the bivariate autoregressive representation, one could proceed alternatively, following the procedure used for the conditional case. This means combining Eq. 34 with the Fourier transform of Eq. 13,

$$\varepsilon_x^{(x)}(\omega) = a_{xx}^{(x)}(\omega)\, X(\omega). \qquad (59)$$

This is analogous to combining Eqs. 40 and 41 in the conditional case. Combining Eqs. 34 and 59 we get an expression analogous to Eq. 43:

$$\begin{pmatrix} \varepsilon_x^{(x)}(\omega) \\ Y(\omega) \end{pmatrix} = P H^{(xy)} \begin{pmatrix} \varepsilon_x^{(xy)}(\omega) \\ \varepsilon_y^{(xy)}(\omega) \end{pmatrix}, \qquad (60)$$

where

$$P = \begin{pmatrix} a_{xx}^{(x)}(\omega) & 0 \\ 0 & 1 \end{pmatrix}. \qquad (61)$$

Considering $\tilde{Q} = P H^{(xy)}$, the spectrum of $\varepsilon_x^{(x)}$ is

$$S_{\varepsilon_x^{(x)}\varepsilon_x^{(x)}}(\omega) = |\tilde{Q}_{xx}|^2\, \Sigma_{xx}^{(xy)} + |\tilde{Q}_{xy}|^2\, \Sigma_{yy}^{(xy)} = |a_{xx}^{(x)}(\omega)|^2\, S_{xx}(\omega), \qquad (62)$$

and comparing the total power to the portion related to $\varepsilon_x^{(xy)}$ one can define

$$\tilde{g}_{Y\to X}(\omega) = \ln\frac{S_{\varepsilon_x^{(x)}\varepsilon_x^{(x)}}(\omega)}{|\tilde{Q}_{xx}|^2\, \Sigma_{xx}^{(xy)}} = \ln\frac{S_{xx}(\omega)}{|H_{xx}^{(xy)}|^2\, \Sigma_{xx}^{(xy)}} = g_{Y\to X}(\omega). \qquad (63)$$

This shows that

$$\tilde{g}_{Y\to X} = -\ln\big(1 - |C(\varepsilon_x^{(x)}, \varepsilon_y^{(xy)})|^2\big) = -\ln\big(1 - |C(X, \varepsilon_y^{(xy)})|^2\big) \qquad (64)$$

and

$$T_{Y\to X} = I(\varepsilon_x^{(x)N}; \varepsilon_y^{(xy)N}) = I(X^N; \varepsilon_y^{(xy)N}). \qquad (65)$$
This equality indicates that, although the procedure used for the multivariate case is apparently not reducible to the bivariate case for $Z = \emptyset$, the resulting spectral decomposition $\tilde{g}_{Y\to X}(\omega)$ is the same. The criterion for causal inference that results from straightforwardly reducing the one of Eq. 56 to the bivariate case is

$$p(\varepsilon_x^{(x)N}) = p(\varepsilon_x^{(x)N} | \varepsilon_y^{(xy)N}) \quad \forall \varepsilon_y^{(xy)N}. \qquad (66)$$

Again, the particular functional relation between the processes and the innovation processes determines that $X^N$ and $\varepsilon_x^{(x)N}$ share the same information with $\varepsilon_y^{(xy)N}$, given that they are mutually determined in Eq. 13.
Analogously, we want to find the criterion that results from a straightforward extension of the one in Eq. 54. An alternative way to construct $g_{Y\to X|Z}(\omega)$ is suggested by the relation between the bivariate and the conditional measures stated by Geweke [26]:

$$G_{Y\to X|Z} = G_{YZ\to X} - G_{Z\to X}, \qquad (67)$$

which is just an application of the chain rule for the mutual information [17]. In analogy to Eq. 37,

$$g_{YZ\to X}(\omega) = \ln\frac{S_{xx}}{|H_{xx}^{(xyz)}|^2\, \Sigma_{xx}^{(xyz)}} \qquad (68)$$

and

$$g_{Z\to X}(\omega) = \ln\frac{S_{xx}}{|H_{xx}^{(xz)}|^2\, \Sigma_{xx}^{(xz)}}, \qquad (69)$$

where $H^{(xz)}$ is the inverse of the coefficients matrix of Eq. 40. This leads to

$$\tilde{g}_{Y\to X|Z}(\omega) = \ln\frac{|H_{xx}^{(xz)}|^2\, \Sigma_{xx}^{(xz)}}{|H_{xx}^{(xyz)}|^2\, \Sigma_{xx}^{(xyz)}}. \qquad (70)$$

Notice that while $\tilde{g}_{Y\to X}(\omega) = g_{Y\to X}(\omega)$, the two measures are different in the conditional case. This means that two alternative spectral decompositions are possible, although their integration is equivalent. This can be seen considering that the integral of the logarithm terms including $|H_{xx}^{(xz)}|^2$ and $|H_{xx}^{(xyz)}|^2$ is zero, based on Theorem 4.2 of Rozanov [51] (see [14] for details). Accordingly,

$$T_{Y\to X|Z} = I(X^N; \varepsilon_y^{(xyz)N}\varepsilon_z^{(xyz)N}) - I(X^N; \varepsilon_z^{(xz)N}), \qquad (71)$$

and the natural extension of the criterion in Eq. 54 is

$$p(X^N | \varepsilon_z^{(xz)N}) = p(X^N | \varepsilon_y^{(xyz)N}, \varepsilon_z^{(xyz)N}) \quad \forall \varepsilon_y^{(xyz)N}, \varepsilon_z^{(xyz)N}. \qquad (72)$$

The fact that the conditioning variable on the left-hand side is not preserved among the conditioning variables on the right-hand side determines that the information-theoretic statistic used to test this equality is not a single KL divergence (in particular a mutual information) but a difference of two.

We now examine whether the alternative spectral measures fulfill the three conditions imposed by Geweke that were described in Section 3.2.1. In the bivariate case the measure is equal to the original one, so it clearly does. In the multivariate case the measure has an intuitive interpretation and fulfills the relation with the time-domain measure under integration. However, nonnegativity is not guaranteed for every frequency, since the measure is related to a difference of mutual informations.

3.5 Alternative Parametric Criteria Based on Innovations Partial Dependence
Above we have shown that the different criteria underlying the bivariate and multivariate GSC can be reduced or extended, respectively, to the other case. We indicated that the parametric criteria rely not only on the causal structure but also on the functional relations assumed between the processes and the innovation processes. This is particularly clear in the multivariate criteria (Eqs. 56, 57 and 72), because the criteria combine innovations from different projections. This prevents considering the autoregressive models as actual generative models whose structure can be mapped to a causal graph. Here we introduce an alternative type of parametric criteria which relies on a single projection, which can be considered as the model from which the processes are generated.

In the bivariate case the criterion is

$$p(\varepsilon_x^{(xy)N} | X^N) = p(\varepsilon_x^{(xy)N} | X^N, \varepsilon_y^{(xy)N}) \quad \forall X^N, \varepsilon_y^{(xy)N}, \qquad (73)$$

which can be tested with the mutual information

$$I(\varepsilon_x^{(xy)N}; \varepsilon_y^{(xy)N} | X^N) = 0. \qquad (74)$$

The innovations $\varepsilon_x^{(xy)N}$ and $\varepsilon_y^{(xy)N}$ are assumed to be independent (or are rendered independent after the normalization step in [25]) when there is no conditioning. The logic of the criterion is that if conditioning on $X^N$ introduces some dependence, this can only be because both innovation processes have a dependence with the process X (this is the effect of conditioning on a common child). Since by construction $\varepsilon_x^{(xy)N}$ is associated with $X^N$, this effect occurs if and only if $\varepsilon_y^{(xy)N}$ has an influence on $X^N$, which can only be through an existing connection from Y to X. In the multivariate case the criterion is straightforwardly extended to

$$p(\varepsilon_x^{(xyz)N} | X^N, Z^N) = p(\varepsilon_x^{(xyz)N} | X^N, Z^N, \varepsilon_y^{(xyz)N}) \quad \forall X^N, Z^N, \varepsilon_y^{(xyz)N}, \qquad (75)$$

and can be tested with the mutual information

$$I(\varepsilon_x^{(xyz)N}; \varepsilon_y^{(xyz)N} | X^N, Z^N) = 0. \qquad (76)$$

Here the conditioning on $Z^N$ is required so that the connection from $\varepsilon_y^{(xyz)N}$ to $X^N$ is not indirect through Z.

These criteria have the advantage that they rely on a unique autoregressive representation. They are also useful to illustrate the difference between using the information-theoretic measures as statistics to test for causality and using them as measures to quantify the dependencies. In particular, the mutual informations of Eqs. 74 and 76 are either infinite or zero, depending on whether there is or there is not causality from Y to X. This is clear when they are expressed in terms of the squared coherence; for example, $|C(X\varepsilon_x^{(xy)}, \varepsilon_y^{(xy)})|^2$ is associated with Eq. 74, and it is 1 when there is causality. This is because, since the two innovation processes completely determine X, inversely the innovations $\varepsilon_y^{(xy)}$ can be known from the process X and its innovations. The same occurs in the multivariate case. In principle, this renders these mutual informations very powerful to test for causality and useless to quantify in some way the strength of the dependence. In practice, the actual value estimated would also reflect how valid the chosen autoregressive model is.

4 Comparison of Non-parametric and Parametric Criteria for Causal Inference from Time-Series
We have reviewed different criteria for causal inference and introduced some related ones that together form a consistent framework for causal inference from time-series. Here we briefly summarize them, further highlighting their relations. In Tables 1 and 2 we collect all the criteria for causal inference, organized according to whether they are parametric or non-parametric, and bivariate or multivariate. We see that all the bivariate criteria have their multivariate counterpart except criterion 2, associated with Sims causality. In Table 3 we display the corresponding information-theoretic measures to test the criteria. We group the measures according to which of them are equal given the functional form that determines the processes from the innovation processes. In the bivariate case the measures in 1 and 2 are equivalent only when there is no instantaneous causality. All these measures can be used to test for causality from Y to X, but when a nonzero value is obtained they provide alternative characterizations of the dependencies.

Finally, in Table 4 we use the set $W = \{X, Y, Z\}$ to reexpress the criteria of Tables 1 and 2 in a synthetic form that integrates the bivariate and multivariate notation used so far, making their link more transparent. For example, $\{W\setminus Y\}$ refers to all the processes except Y. Furthermore, for innovation processes, $\varepsilon_{\{W\setminus X,Y\}}^{(\{W\setminus Y\})}$ refers to, given the projection $(\{W\setminus Y\})$, which includes all the processes except Y, all the innovation processes $\{W\setminus X, Y\}$, that is, all the ones in the projection except the ones associated with X and Y.

Table 1 Bivariate criteria for causal inference

Non-parametric
1 $p(X_{t+1}|X^t) = p(X_{t+1}|X^t, Y^t)$
2 $p(X^{t+1:N}|X^t, Y^t) = p(X^{t+1:N}|X^t, Y^t, Y_{t+1})$

Parametric
3 $p(\varepsilon_x^{(x)N}) = p(\varepsilon_x^{(x)N}|\varepsilon_y^{(xy)N})$
4 $p(X^N) = p(X^N|\varepsilon_y^{(xy)N})$
5 $p(\varepsilon_x^{(xy)N}) = p(\varepsilon_x^{(xy)N}|\eta_y^{(xy)N})$
6 $p(\varepsilon_x^{(xy)N}|X^N) = p(\varepsilon_x^{(xy)N}|X^N, \varepsilon_y^{(xy)N})$

Table 2 Multivariate criteria for causal inference

Non-parametric
1 $p(X_{t+1}|X^t, Z^t) = p(X_{t+1}|X^t, Y^t, Z^t)$
2 −

Parametric
3 $p(\varepsilon_x^{(xz)N}) = p(\varepsilon_x^{(xz)N}|\varepsilon_y^{(xyz)N}, \varepsilon_z^{(xyz)N})$
4 $p(X^N|\varepsilon_z^{(xz)N}) = p(X^N|\varepsilon_y^{(xyz)N}, \varepsilon_z^{(xyz)N})$
5 $p(\varepsilon_x^{(xyz)N}) = p(\varepsilon_x^{(xyz)N}|\eta_y^{(xyz)N})$
6 $p(\varepsilon_x^{(xyz)N}|X^N, Z^N) = p(\varepsilon_x^{(xyz)N}|X^N, Z^N, \varepsilon_y^{(xyz)N})$

Table 3 Mutual information measures to test for causality

Bivariate
1 $I(X_{t+1}; Y^t|X^t) = I(X^N; \varepsilon_y^{(xy)N}) = I(\varepsilon_x^{(x)N}; \varepsilon_y^{(xy)N})$
2 $I(Y_{t+1}; X^{t+1:N}|Y^t, X^t) = I(\varepsilon_x^{(xy)N}; \eta_y^{(xy)N})$
3 $I(\varepsilon_x^{(xy)N}; \varepsilon_y^{(xy)N}|X^N)$

Multivariate
4 $I(X_{t+1}; Y^t|X^t, Z^t) = I(\varepsilon_x^{(xz)N}; \varepsilon_y^{(xyz)N}, \varepsilon_z^{(xyz)N}) = I(X^N; \varepsilon_y^{(xyz)N}, \varepsilon_z^{(xyz)N}) - I(X^N; \varepsilon_z^{(xz)N})$
5 $I(\varepsilon_x^{(xyz)N}; \eta_y^{(xyz)N})$
6 $I(\varepsilon_x^{(xyz)N}; \varepsilon_y^{(xyz)N}|X^N, Z^N)$

Table 4 Criteria for causal inference

Non-parametric
1 $p(X_{t+1}|\{W\setminus Y\}^t) = p(X_{t+1}|\{W\}^t)$

Parametric
2 $p(\varepsilon_x^{(\{W\setminus Y\})N}) = p(\varepsilon_x^{(\{W\setminus Y\})N}|\varepsilon_{\{W\setminus X\}}^{(\{W\})N})$
3 $p(X^N|\varepsilon_{\{W\setminus X,Y\}}^{(\{W\setminus Y\})N}) = p(X^N|\varepsilon_{\{W\setminus X\}}^{(\{W\})N})$
4 $p(\varepsilon_x^{(\{W\})N}) = p(\varepsilon_x^{(\{W\})N}|\eta_y^{(\{W\})N})$
5 $p(\varepsilon_x^{(\{W\})N}|\{W\setminus Y\}^N) = p(\varepsilon_x^{(\{W\})N}|\{W\setminus Y\}^N, \varepsilon_y^{(\{W\})N})$

5 Conclusion
We have reviewed criteria for causal inference related to Granger causality and proposed some new ones in order to complete a unified framework of criteria and measures to test for causality in a parametric and non-parametric way, in the time or spectral domain, and for bivariate or multivariate processes. These criteria and measures are summarized in Tables 1-4. This offers an integrating picture comprising the measures proposed by Geweke [25, 26] and partial directed coherence [5]. The contributions of this Chapter are complementary to the work in [57] and [14]. The distinction between parametric and non-parametric criteria further emphasizes the necessity to check the validity of the autoregressive representation when applying a measure which inherently relies on the definition of the innovation processes. The distinction between criteria and measures stresses that causal inference and the characterization of the dynamic dependencies resulting from it should be addressed by different approaches [16, 15].

Finally, we notice again that we have here focused on the formal relation between the different criteria and measures. For practical applications, problems like the influence of hidden variables [21] or temporal aggregation [23] constitute serious challenges for the successful application of these criteria. For example, in the case of brain causal analysis it is now clear that a successful characterization can only be obtained if the application of these criteria is combined with a biologically plausible reconstruction of how the recorded data are generated by the neural activity [50, 58, 23]. Even at a more practical level, estimating the information-theoretic measures used to test for causality from small data sets is complicated [34]. Most often stationarity is assumed for simplification, but event-related estimation is also possible [3, 27]. We believe that a clear understanding of the underlying criteria for causal inference and of their relation to the measures can also help to better interpret and address these practical problems.

6 Appendix: Fisher Information Measure of Granger Causality for Linear Autoregressive Gaussian Processes
In Eq. 12 we showed how the criterion of Granger causality of Eq. 1 can be tested using the Fisher information. For linear Gaussian autoregressive processes, considering the definition of the Fisher information (Eq. 9), we have

$$E_{y_t}[F(X_{t+1}; y_t | X^t)] = \int dy_t\, p(y_t) \int dX^t\, p(X^t|y_t) \int dX_{t+1}\, p(X_{t+1}|X^t, y_t) \left(\frac{\partial \log p(X_{t+1}|X^t, y_t)}{\partial y_t}\right)^2. \qquad (77)$$

We start by considering the term $F(X_{t+1}; y_t|x^t)$ corresponding to the integral over $X_{t+1}$. For a Gaussian process, $p(X_{t+1}|x^t, y_t) = N(\mu(X_{t+1}|x^t, y_t), \sigma(X_{t+1}|x^t, y_t))$ is Gaussian. Therefore
$$F(X_{t+1}; y_t|x^t) = \int N\big(\mu(X_{t+1}|x^t, y_t), \sigma(X_{t+1}|x^t, y_t)\big) \left[\left(\frac{\partial \log \sqrt{2\pi}\,\sigma(X_{t+1}|x^t, y_t)}{\partial y_t}\right)^2 + \left(\frac{\partial}{\partial y_t}\, \frac{1}{2}\left(\frac{x_{t+1} - \sum_{s=0}^{\infty}\big(a_{xxs}^{(xy)} X_{t-s} + a_{xys}^{(xy)} Y_{t-s}\big)}{\sigma(X_{t+1}|x^t, y_t)}\right)^2\right)^2\right] dX_{t+1}. \qquad (78)$$

The first summand inside the integral is zero because the term on which the derivative acts does not depend on $y_t$. For the second summand, since it is linear, we consider for simplification just the partial derivative with respect to a single variable $y_t$. We get

$$F(X_{t+1}; y_t|x^t) = \int N\big(\mu(X_{t+1}|x^t, y_t), \sigma(X_{t+1}|x^t, y_t)\big) \left(\frac{x_{t+1} - \sum_{s=0}^{\infty} a_{xxs}^{(xy)} X_{t-s} - a_{xyt}^{(xy)} y_t}{\sigma(X_{t+1}|x^t, y_t)}\; \frac{a_{xyt}}{\sigma(X_{t+1}|x^t, y_t)}\right)^2 dX_{t+1} = \frac{a_{xyt}^2}{\sigma^2(X_{t+1}|x^t, y_t)}. \qquad (79)$$

This term does not depend on $x^t$ or $y_t$, so that the other two integrations in Eq. 77 can be done straightforwardly. We have

$$E_{y_t}[F(X_{t+1}; y_t|X^t)] = \frac{a_{xyt}^2}{\sigma^2(X_{t+1}|x^t, y_t)}, \qquad (80)$$

so that each coefficient in the autoregressive representation can be given a meaning in terms of the Fisher information. This relation further illuminates the relation between the coefficients and $G_{Y\to X}$ [55, 40]:

$$G_{Y\to X} = 0 \;\Leftrightarrow\; a_{xys}^{(xy)} = 0 \;\;\forall s. \qquad (81)$$

References
1. Amblard, P.O., Michel, O.: On directed information theory and Granger causality graphs.
J. Comput. Neurosci. 30, 7–16 (2011)
2. Ancona, N., Marinazzo, D., Stramaglia, S.: Radial basis function approach to nonlinear
Granger causality of time series. Phys. Rev. E 70(5), 056221 (2004)
3. Andrzejak, R.G., Ledberg, A., Deco, G.: Detection of event-related time-dependent di-
rectional couplings. New. J. Phys. 8, 6 (2006)
4. Ay, N., Polani, D.: Information flows in causal networks. Advances in Complex Sys-
tems 11, 17–41 (2008)
5. Baccala, L., Sameshima, K.: Partial directed coherence: a new concept in neural structure
determination. Biol. Cybern. 84(1), 463–474 (2001)
6. Baccala, L., Sameshima, K., Ballester, G., Do Valle, A., Timo-Iaria, C.: Studying the
interaction between brain structures via directed coherence and Granger causality. Appl.
Sig. Process. 5, 40–48 (1999)
7. Barnett, L., Barrett, A.B., Seth, A.K.: Granger causality and transfer entropy are equiva-
lent for Gaussian variables. Phys. Rev. Lett. 103(23), 238701 (2009)

8. Besserve, M., Schoelkopf, B., Logothetis, N.K., Panzeri, S.: Causal relationships be-
tween frequency bands of extracellular signals in visual cortex revealed by an informa-
tion theoretic analysis. J. Comput. Neurosci. 29(3), 547–566 (2010)
9. Bressler, S.L., Richter, C.G., Chen, Y., Ding, M.: Cortical functional network organiza-
tion from autoregressive modeling of local field potential oscillations. Stat. Med. 26(21),
3875–3885 (2007)
10. Bressler, S.L., Seth, A.K.: Wiener-Granger causality: A well established methodology.
Neuroimage 58(2), 323–329 (2011)
11. Brovelli, A., Ding, M., Ledberg, A., Chen, Y., Nakamura, R., Bressler, S.L.: Beta oscil-
lations in a large-scale sensorimotor cortical network: Directional influences revealed by
Granger causality. P Natl. Acad. Sci. USA 101, 9849–9854 (2004)
12. Chamberlain, G.: The general equivalence of Granger and Sims causality. Economet-
rica 50(3), 569–581 (1982)
13. Chen, Y., Bressler, S., Ding, M.: Frequency decomposition of conditional Granger
causality and application to multivariate neural field potential data. J. Neurosci.
Meth. 150(2), 228–237 (2006)
14. Chicharro, D.: On the spectral formulation of Granger causality. Biol. Cybern. 105(5-6),
331–347 (2011)
15. Chicharro, D., Ledberg, A.: Framework to study dynamic dependencies in networks of
interacting processes. Phys. Rev. E 86, 41901 (2012)
16. Chicharro, D., Ledberg, A.: When two become one: The limits of causality analysis of
brain dynamics. PLoS One 7(3), e32466 (2012)
17. Cover, T.M., Thomas, J.A.: Elements of Information Theory, 2nd edn. John Wiley and
Sons (2006)
18. Ding, M., Chen, Y., Bressler, S.L.: Granger causality: Basic theory and application to
neuroscience. In: Handbook of Time Series Analysis: Recent Theoretical Developments
and Applications, pp. 437–460. Wiley-VCH Verlag (2006)
19. Eichler, M.: A graphical approach for evaluating effective connectivity in neural systems.
Phil. Trans. R Soc. B 360, 953–967 (2005)
20. Eichler, M.: On the evaluation of information flow in multivariate systems by the directed
transfer function. Biol. Cybern. 94(6), 469–482 (2006)
21. Eichler, M.: Granger causality and path diagrams for multivariate time series. J. Econo-
metrics 137, 334–353 (2007)
22. Faes, L., Nollo, G., Porta, A.: Information-based detection of nonlinear Granger causality
in multivariate processes via a nonuniform embedding technique. Phys. Rev. E 83(5),
051112 (2011)
23. Friston, K.J.: Functional and effective connectivity: A review. Brain Connectivity 1(1),
13–36 (2012)
24. Gelfand, I., Yaglom, A.: Calculation of the amount of information about a random func-
tion contained in another such function. Am. Math. Soc. Transl. Ser. 2(12), 199–246
(1959)
25. Geweke, J.F.: Measurement of linear dependence and feedback between multiple time
series. J. Am. Stat. Assoc. 77(378), 304–313 (1982)
26. Geweke, J.F.: Measures of conditional linear dependence and feedback between time
series. J. Am. Stat. Assoc. 79(388), 907–915 (1984)
27. Gómez-Herrero, G., Wu, W., Rutanen, K., Soriano, M.C., Pipa, G., Vicente, R.: Assess-
ing coupling dynamics from an ensemble of time series. arXiv:1008.0539v1 (2010)
28. Gourevitch, B., Le Bouquin-Jeannes, R., Faucon, G.: Linear and nonlinear causality
between signals: methods, examples and neurophysiological applications. Biol. Cy-
bern. 95(4), 349–369 (2006)

29. Granger, C.W.J.: Economic processes involving feedback. Information and Control 6,
28–48 (1963)
30. Granger, C.W.J.: Investigating causal relations by econometric models and cross-spectral
methods. Econometrica 37(3), 424–438 (1969)
31. Granger, C.W.J.: Testing for causality: A personal viewpoint. J. Econ. Dynamics and
Control 2(1), 329–352 (1980)
32. Guo, S., Seth, A.K., Kendrick, K.M., Zhou, C., Feng, J.: Partial Granger causality - elim-
inating exogenous inputs and latent variables. J. Neurosci. Meth. 172(1), 79–93 (2008)
33. Hiemstra, C., Jones, J.D.: Testing for linear and nonlinear Granger causality in the stock
price-volume relation. J. Financ. 49(5), 1639–1664 (1994)
34. Hlaváčkova-Schindler, K., Paluš, M., Vejmelka, M., Bhattacharya, J.: Causality detection
based on information-theoretic approaches in time-series analysis. Phys. Rep. 441, 1–46
(2007)
35. Kaminski, M., Blinowska, K.: A new method of the description of the information flow
in the brain structures. Biol. Cybern. 65(3), 203–210 (1991)
36. Kramer, G.: Directed information for channels with feedback. PhD dissertation, Swiss
Federal Institute of Technology, Zurich (1998)
37. Kuersteiner, G.: Granger-Sims causality, 2nd edn. The New Palgrave Dictionary of Eco-
nomics (2008)
38. Kullback, S.: Information Theory and Statistics. Dover, Mineola (1959)
39. Lizier, J.T., Prokopenko, M., Zomaya, A.Y.: Local information transfer as a spatiotem-
poral filter for complex systems. Phys. Rev. E 77, 26110 (2008)
40. Lütkepohl, H.: New introduction to multiple time series analysis. Springer, Berlin (2006)
41. Marinazzo, D., Pellicoro, M., Stramaglia, S.: Causal information approach to partial con-
ditioning in multivariate data sets. Comput. Math. Meth. Med., 303601 (2012)
42. Marko, H.: Bidirectional communication theory - generalization of information-theory.
IEEE T. Commun. 12, 1345–1351 (1973)
43. Massey, J.: Causality, feedback and directed information. In: Proc. Intl. Symp. Info. Th.
Appli., Waikiki, Hawaii, USA (1990)
44. Paluš, M., Komárek, V., Hrnčı́ř, Z., Štěrbová, K.: Synchronization as adjustment of in-
formation rates: Detection from bivariate time series. Phys. Rev. E 63, 046211 (2001)
45. Pearl, J.: Causality: Models, Reasoning, Inference, 2nd edn. Cambridge University Press,
New York (2009)
46. Pereda, E., Quian Quiroga, R., Bhattacharya, J.: Nonlinear multivariate analysis of neu-
rophysiological signals. Prog. Neurobiol. 77, 1–37 (2005)
47. Permuter, H., Kim, Y., Weissman, T.: Interpretations of directed information in portfolio
theory, data compression, and hypothesis testing. IEEE Trans. Inf. Theory 57(3), 3248–
3259 (2009)
48. Priestley, M.: Spectral analysis and time series. Academic Press Inc., San Diego (1981)
49. Quinn, C.J., Coleman, T.P., Kiyavash, N., Hatsopoulos, N.G.: Estimating the directed
information to infer causal relationships in ensemble neural spike train recordings. J.
Comput. Neurosci. 30, 17–44 (2011)
50. Roebroeck, A., Formisano, E., Goebel, R.: The identification of interacting networks in
the brain using fmri: Model selection, causality and deconvolution. NeuroImage 58(2),
296–302 (2011)
51. Rozanov, Y.: Stationary random processes. Holden-Day, San Francisco (1967)
52. Schelter, B., Timmer, J., Eichler, M.: Assessing the strength of directed influences among
neural signals using renormalized partial directed coherence. J. Neurosci. Meth. 179(1),
121–130 (2009)

53. Schelter, B., Winterhalder, M., Eichler, M., Peifer, M., Hellwig, B., Guschlbauer, B.,
Lucking, C., Dahlhaus, R., Timmer, J.: Testing for directed influences among neural
signals using partial directed coherence. J. Neurosci. Meth. 152(1-2), 210–219 (2006)
54. Schreiber, T.: Measuring information transfer. Phys. Rev. Lett. 85, 461–464 (2000)
55. Sims, C.: Money, income, and causality. American Economic Rev. 62(4), 540–552
(1972)
56. Solo, V.: On causality and mutual information. In: Proceedings of the 47th IEEE Con-
ference on Decision and Control, pp. 4639–4944 (2008)
57. Takahashi, D.Y., Baccala, L.A., Sameshima, K.: Information theoretic interpretation of
frequency domain connectivity measures. Biol. Cybern. 103(6), 463–469 (2010)
58. Valdes-Sosa, P., Roebroeck, A., Daunizeau, J., Friston, K.: Effective connectivity: Influ-
ence, causality and biophysical modeling. Neuroimage 58(2), 339–361 (2011)
59. Vicente, R., Wibral, M., Lindner, M., Pipa, G.: Transfer entropy: A model-free measure
of effective connectivity for the neurosciences. J. Comput. Neurosci. 30, 45–67 (2010)
60. Wiener, N.: The theory of prediction. In: Modern Mathematics for Engineers, pp. 165–
190. McGraw-Hill, New York (1956)