Stratonovich

Theory of Information and its Value

Edited by Roman V. Belavkin, Panos M. Pardalos, Jose C. Principe
Editors

Roman V. Belavkin, Faculty of Science and Technology, Middlesex University, London, UK
Panos M. Pardalos, Industrial and Systems Engineering, University of Florida, Gainesville, FL, USA
Jose C. Principe, Electrical & Computer Engineering, University of Florida, Gainesville, FL, USA
Author
Ruslan L. Stratonovich (Deceased)
This Springer imprint is published by the registered company Springer Nature Switzerland AG.
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Foreword
It would be impossible for us to start this book without mentioning the main achieve-
ments of its remarkable author, Professor Ruslan Leontievich Stratonovich (RLS or
Ruslan later). He was a brilliant mathematician, probabilist and theoretical physicist,
best known for the development of the symmetrized version of stochastic calculus
(an alternative to Itô calculus), with stochastic differential equations and integrals
now bearing his name. His unique and beautiful approach to stochastic processes
was invented in the 1950s during the time of his doctoral work on the solution to the
notorious nonlinear filtering problem. The importance of this work was immediately
recognized by the great Andrei Kolmogorov, who invited Ruslan, then a graduate
student, for a discussion of his first papers.
This work was so much ahead of its time that its initial reception in the Soviet
mathematical community was mixed, mainly due to misunderstandings of the differ-
ences between the Itô and Stratonovich approaches. These and perhaps other factors
related to the Cold War obscured some of the achievements of Stratonovich's
early papers on optimal nonlinear filtering, which, apart from the general solution
to the nonlinear filtering problem, contained also the equation for the Kalman–Bucy
filter as a special (linear) case as well as the forward–backward procedures for com-
puting posterior probabilities, which were later rediscovered in the hidden Markov
models theory. Nonetheless, the main papers were quickly translated into English,
and thanks to the remarkable hard work of RLS, by 1966 he had already published
two monographs—Topics in the Theory of Random Noise (see [54] for a recent
reprint) and Conditional Markov Processes [52]. These books were also promptly
translated and had a better reception in the West, with the latter book being edited by
Richard Bellman. In 1968, Professor W. Murray Wonham wrote in a letter to RLS
‘Perhaps you are the prophet, who is honored everywhere except in his own land’.
Despite the difficulties, the 1960s were very productive for Ruslan, who quickly
became recognized as one of the top scientists in his field in the world. He became
a Professor by 1969 (a post at the Department of Physics that he held for the rest
of his life). At that time, he managed to form a group
of young and talented graduate students including Grishanin, B. A., Sosulin, Yu. G.,
Kulman, N. K., Kolosov, G. E., Mamayev, D. D., Platonov, A. A. and Belavkin, V. P.
Fig. 1 Stratonovich R. L. (second left) with his group in front of Physics Department, Moscow
State University, 1971. From left to right: Kulman, N. K., Stratonovich, R. L., Mamayev, D. D.,
Sosulin, Yu. G., Kolosov, G. E., Grishanin, B. A. Picture taken by Slava (V. P.) Belavkin
(Figure 1). These students began working under his supervision in completely new
areas of information theory, stochastic and adaptive optimal control, cybernetics and
quantum information. Ruslan was young and had a somewhat legendary reputation
among students and colleagues at the university, so that many students aspired to
work with him, even though he was known to be a very hard working and demand-
ing supervisor. At the same time, he treated his students as equals and gave them a
lot of freedom. Many of his former students recall that Ruslan had an amazing gift
to predict the solution before it was derived. Sometimes, he surprised his colleagues
by giving an answer to some difficult problem, and when they asked him how he
obtained it, his answer was ‘just verify it yourself, and you will see it is correct’.
Ruslan and his students spent a lot of their spare time together, either playing tennis
during the summer time or skiing in winters, such that they developed long-lasting
friendships.
In the mid-1960s, Ruslan read several specialist courses on various topics, in-
cluding adaptive Bayesian inference, logic and information theory. The latter course
included a lot of original material, which emphasized the connection between infor-
mation theory with statistical thermodynamics and introduced the Value of Informa-
tion, which he pioneered in his 1965 paper [47] and later developed together with
his students (mainly Grishanin, B. A.). This course motivated Ruslan to write his
third book called ‘Theory of Information’. Its first draft was ready in 1967, and it
is remarkable to note that RLS even had some negotiations with Springer to pub-
lish the book in English, which unfortunately did not happen, perhaps due to some
bureaucratic difficulties in the Soviet Union. The monograph was quite large and
included parts on quantum information theory, which he developed together with
his new student Slava (V. P.) Belavkin. Although there was an agreement to publish
the book with a leading Soviet scientific publisher, the publication was delayed for
some unexplained reasons. In the end, an abridged version of the book was pub-
lished by a different publisher (Soviet Radio) in 1975 [53], which did not include
the quantum information parts!
Nonetheless, the book had become a classic even without having been translated
into English. Several anecdotal stories exist about how the book was used
in the West and discussed at seminars with the help of Russian-speaking graduate
students. For example, Professor Richard Bucy used translated parts of the book in
his seminars on information theory, and in the 1990s he even suggested that the book
be published in English. In fact, in 1994 and 1995 Ruslan visited his former student
and collaborator Slava Belavkin at the University of Nottingham, United Kingdom,
who worked there at the Department of Mathematics (Figure 2). They had a plan to
publish a second edition of the book together in English and to include the parts on
quantum information. Because quantum information theory had progressed in the
1970s and 1980s, it was necessary to update the quantum parts of the manuscript,
and this became the Achilles' heel of their plan. During Ruslan's visit, they
spent more time working on joint papers, which seemed a more urgent matter.
I also visited my father (V.P. Belavkin) in Nottingham during the summer of 1994,
and I remember very clearly how happy Ruslan was during that visit (Figure 3),
especially about being able to mow the lawn in the backyard—a completely new
experience for someone who had lived in a small Moscow flat all his life. Two years later, in
January 1997, Ruslan died after catching a flu during the winter examinations at the
Moscow State University. I went to his funeral at the Department of Physics, from
which I too had already graduated. It was a very big and sad event attended by a
crowd of students and colleagues. In the next couple of years, my father collabo-
rated with Valentina, Ruslan's wife, on an English translation of the book, the first
version of which was in fact finished. Valentina too passed away two years later, and
my father never finished this project.
Before telling the reader how the translation of this book eventually came about,
I would like to write a few words about this book from my personal experience, how
it became one of my favourite texts on information theory, and why I believe it is so
relevant today.
Having witnessed first-hand the development of quantum information and filter-
ing theory in the 1980s (my father’s study in our small Moscow flat was also my
bedroom), I decided that my career could do without non-commutative probability
and stochastics. So, although I graduated from the same department as my father
Fig. 2 Stratonovich with his wife during their visit to Nottingham, England, 1994. From left
to right: Robin Hudson, Slava Belavkin, Ruslan Stratonovich, Valentina Stratonovich, Nadezda
Belavkina
and Ruslan, I became interested in Artificial Intelligence (AI), and a couple of years
later I managed to get a scholarship to do a PhD in cognitive modelling of human
learning at the University of Nottingham. I was fortunate enough to be in the same
city with my parents, which allowed me to take a cycle ride through Wollaton Park
and visit them either for lunch or dinner. Although we often had scientific discus-
sions, I felt comfortable that my area of research was far away and independent of
my father’s territory. That, however, turned out to be a false impression.
During that time at the end of 1990s, I came across many heuristics and learning
algorithms using randomization in the form of the so-called soft-max rule, where
decisions were sampled from a Boltzmann distribution with a temperature param-
eter controlling how much randomization was necessary. And although using these
heuristics had clearly improved the performance of the algorithms and cognitive
models, I was puzzled by these links with statistical physics and thermodynamics.
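The soft-max rule mentioned above can be sketched as follows (an illustrative example added here; the action values and temperature figures are arbitrary):

```python
import math
import random

def softmax(values, temperature):
    """Boltzmann distribution over action values.

    High temperature -> near-uniform choice (more randomization);
    low temperature -> concentrated on the best action.
    """
    # Subtract the maximum before exponentiating, for numerical stability.
    m = max(values)
    weights = [math.exp((v - m) / temperature) for v in values]
    total = sum(weights)
    return [w / total for w in weights]

def sample_action(values, temperature, rng=random):
    """Sample an action index from the soft-max distribution."""
    probs = softmax(values, temperature)
    r = rng.random()
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i
    return len(values) - 1

values = [1.0, 2.0, 3.0]
print(softmax(values, temperature=100.0))  # nearly uniform
print(softmax(values, temperature=0.01))   # nearly deterministic
```

The temperature parameter is exactly the knob referred to in the text: it interpolates between fully random and fully greedy decisions.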
Fig. 3 Stratonovich with his wife at Belavkins’ home in Nottingham, England, 1994. From left
to right: Nadezda Belavkina, Slava Belavkin, Roman Belavkin, Ruslan Stratonovich and Valentina
Stratonovich
The fact that it was more than just a coincidence became clear when I saw that
performance of cognitive models could be improved by relating the temperature pa-
rameter dynamically to entropy. Of course, I could not help sharing these naive ideas
with my father, and to my surprise he did not criticize them. Instead, he went to his
study and brought an old copy of Ruslan’s Theory of Information. I spent the next
few days going through various chapters of the book, and I was immediately im-
pressed by the self-contained, and at the same time, very detailed and deep style of
the presentation. Ruslan managed to start each chapter with basic and fundamental
ideas, supported by very understandable examples, and then developed the material
to such depth and detail that no questions seemed to remain unanswered. However,
the main value of the book was in the ideas unifying theories of information, opti-
mization and statistical physics.
My main focus was on Chapters 3 and 9, which covered variational problems
leading to optimal solutions in the form of exponential family distributions (the
‘soft-max’), defined and developed the value of information theory and explored
many interesting examples. The value of information is an amalgamation of the-
ories of optimal statistical decisions and information, and its applications go far
beyond problems of information transmission. For example, the relation to machine
learning and cognitive modelling was immediately clear—learning, from a
mathematical point of view, was simply an optimization problem with information
constraints (otherwise, there is nothing to learn), and a solution to such a problem could
only be a randomized policy, where randomization was the consequence of incom-
plete information. Furthermore, the temperature parameter was simply the Lagrange
multiplier defined by the constraint, which also meant that an optimal temperature
could be derived (at least in theory) giving the solution to the notorious ‘exploration-
exploitation’ dilemma in reinforcement learning theory. A few years later, I applied
these ideas to evolutionary systems and derived optimal control strategies for mu-
tation rates in genetic algorithms (controlling randomization of DNA sequences).
Similar applications can be developed to control learning rates in artificial neural
networks and other data analysis algorithms.
This publication is a new translation of the 1975 book, which incorporates some
parts from the original translation by Ruslan’s wife, Valentina Stratonovich. The
publication has become possible, thanks to the initiatives of Professors Panos Parda-
los and Jose Principe. The collaboration was initiated at the ‘First International
Conference on Dynamics of Information Systems' organized by Panos at the University
of Florida in 2009 [22]. It is fair to say that at that time it was the only
conference dedicated not only to traditional information-theoretic aspects of data
and systems analysis, but also to the importance of analysing and understanding
the value of information. Another very important achievement of this conference
was the first attempt to develop a geometric approach to the value of information,
which is why one of the invited speakers to the conference was Professor Shun-
Ichi Amari. It was at this conference that the editors of this book first met together.
Panos, who by that time was the author and editor of dozens of books on global opti-
mization and data science, expressed his amazement at the unfortunate fact that this
book was still not available in English. Jose Principe, known for his pioneering
work on information-theoretic learning, had already recognized the importance and
relevance of this book to modern applications and was planning the translation of
specific chapters. It was clear that there was a huge interest in the topic of value
of information, and we began discussing the possibility of making the new English
translation of this classic book, and finishing the project, which unfortunately was
never completed by Ruslan Stratonovich and Slava Belavkin.
Panos suggested that Vladimir Stozhkov, one of his Russian-speaking PhD stu-
dents, should do the initial translation. Vladimir took on the bulk of this work. The
equations for each chapter were coordinated by Matt Emigh and entered in LaTeX
by students and visitors in the Department of Computational Neuro-Engineering
Laboratory (CNEL), University of Florida, during the Summer and Fall of 2016 as
follows: Sections 1.1–1.5 by Carlos Loza, 1.6–1.7 by Ryan Burt, 2.1–3.4 by Ying
Ma, 3.5–4.3 by Zheng Cao, 4.4–5.3, 6.1–6.5 and 8.6–8.8 by Isaac Sledge, 5.4–5.7
by Catia Silva, 5.8–5.11 by John Henning, 6.6–6.7 by Eder Santana, 7.3–7.5 by
Paulo Scalassara, 7.6–8.5 by Shulian Yu and Chapters 9–12 by Matt Emigh. This
translation and equations were then edited by Roman Belavkin, who also combined
it with the translation by Valentina Stratonovich in order to achieve a better reflec-
tion of the original text and terminology. In particular, the introductory paragraphs
of each chapter are largely based on Valentina’s translation.
We would like to take the opportunity to thank Springer, and specifically Razia
Amzad and Elizabeth Loew, for making the publication of this book possible. We
also acknowledge the help of the Laboratory of Algorithms and Technologies for
Network Analysis (LATNA) at the Higher School of Economics in Nizhny Nov-
gorod, Russia, for facilitating meetings and collaborations among the editors and
accommodating many fruitful discussions on the topic in this beautiful Russian city.
With the emergence of data-driven economy, progress in machine learning and
AI algorithms and increased computational resources, the need for a better under-
standing of information, its value and limitations is greater than ever. This is why we
believe this book is even more relevant today than when it was first published. The
vast number of examples pertaining to all kinds of stochastic processes and problems
makes it a treasure trove of information for any researcher working in the areas
of data science or machine learning. It is a great pleasure to be able to contribute a
little to this project and see that finally this amazing book will be open to the rest of
the world.

Preface
This book was inspired by the author’s lectures on information theory in the De-
partment of Physics at Moscow State University, 1963–1965. Initially, the book was
written in accordance with the content of those lectures. The plan was to organize
the book in order to reflect all the paramount achievements of Shannon’s informa-
tion theory. However, while working on the book the author ‘lost his way’ and used
a more familiar style, in which the development of his own ideas dominated over a
thorough recollection of existing results. That led to the inclusion of original material
in the book and to an original interpretation of many central constructs of the
theory. The original material crowded out some of the established results that were
to be included; for instance, the chapter devoted to well-known methods
of encoding and decoding in noisy channels was discarded.
The material included in the book is organized in three stages: the first, second,
third variational problems and the first, second and third asymptotic theorems,
respectively. This creates a clear panorama of the most fundamental content of Shannon's
information theory.
Every writing style has its own benefits and disadvantages. The disadvantage of
the chosen style is that the work of many scientists in the field remains
unreflected (or insufficiently reflected). This should not be regarded as an
indication of insufficient respect for them. As a rule, an assessment of the material's
originality and the attribution of the author’s ownership of the results are not given.
The only exception is made for a few explicit facts.
The book adopts ‘the principle of increasing complexity of material’, which
means that simpler and easily understood material is placed in the beginning of
a book (as well as in the beginning of a chapter). The reader is not required to be
familiar with more difficult and specific material situated towards the end of a chap-
ter/the book. This principle allows the inclusion of complicated material into the
book, while not making it difficult for a fairly wide range of readers. The hope is
that many readers will gain useful knowledge for themselves from the book.
While considering general questions, the author tried to lead statements with the
most generality possible. To achieve this, he often used the language of measure
theory. For example, he utilized the notation P(dx) for probability distribution. This
should not scare off those who did not master the specified language. The point is
that, omitting the details, a reader can always use a simple dictionary which con-
verts those ‘intimidating’ terms into those more familiar. For instance, ‘probability
measure P(dx)’ can be treated as probability P(x) in the case of a discrete random
variable or as product p(x) dx in the case of a continuous random variable, where
p(x) is a probability density function and dx = dx1 . . . dxr is a differential corre-
sponding to a space of dimensionality r.
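As a hypothetical illustration of this dictionary (an example added here, with an arbitrary distribution and density), the two readings of P(dx) can be computed side by side:

```python
import math

# Discrete case: the 'probability measure P(dx)' is just the probability P(x).
P = [0.5, 0.25, 0.25]                          # an arbitrary example distribution
H_discrete = -sum(p * math.log(p) for p in P)  # entropy in natural units

# Continuous case: P(dx) = p(x) dx, and the sum becomes -∫ p(x) ln p(x) dx.
# Illustrated for the uniform density p(x) = 1/2 on [0, 2], with the integral
# approximated by a midpoint Riemann sum.
a, b, n = 0.0, 2.0, 10000
dx = (b - a) / n
def p(x):
    return 0.5
H_continuous = -sum(p(a + (i + 0.5) * dx) * math.log(p(a + (i + 0.5) * dx)) * dx
                    for i in range(n))

print(H_discrete)    # 1.5 * ln 2 ≈ 1.04
print(H_continuous)  # ln 2 ≈ 0.693 (differential entropy of the uniform density)
```

The same formula is evaluated in both cases; only the 'dictionary entry' for the measure changes.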
With various readers in mind, the author did not attach much significance to consistency
of terminology. He thought that it did not matter whether we say 'expectation' or
‘mean value’, ‘probabilistic measure’ or ‘distribution’. If we employ the apparatus
of generalized functions, then there always exists a probability density and it can
be used. Often there is no need to distinguish between the minimization signs min and
inf. By this we mean that if the infimum is not attained within a considered region, then
we can always 'turn on' a standard ε-procedure and, as a matter of
fact, nothing essential will change as a result. In the book, we pass from a discrete
probabilistic space to a continuous probabilistic space in a free manner. The author
tried to spare the reader any concern of inessential details and distractions from the
main ideas.
The general constructs of the theory are illustrated in the book by numerous cases
and examples. Because the theory and the examples are of general importance, their
statement does not require special radiophysics terminology. If a reader is interested
in applying the stated material to the problem of message transmission through
radio channels, he or she should fill the abstract concepts with radiophysical content. For
instance, when considering noisy channels (Chapter 7), an input stochastic process
x = {xt } should be treated as a variable informational parameter of signal s(t, xt )
emitted by a radiotransmitter. Also, an output process y = {yt } should be treated
as a signal at a receiver input. A proper concretization of concepts is needed for
application of any mathematical theory.
The author expresses his gratitude to the first reader of this book, B. A. Grishanin,
who rendered significant assistance while the book was being prepared for
print, and to Professor B. I. Tikhonov for discussions involving a variety of subjects
and a number of valuable comments.

Introduction
The term ‘information’ mentioned in the title of the book is understood here not in
the broad sense in which the word is understood by people working in the press,
radio, media, but in the narrow scientific sense of Claude Shannon’s theory. In other
words, the subject of this book is the special mathematical discipline, Shannon’s
information theory, which can solve its own quite specific problems.
This discipline consists of abstractly formulated theorems and results, which can
have different specializations in various branches of knowledge. Information theory
has numerous applications in the theory of message transmission in the presence of
noise, the theory of recording and registering devices, mathematical linguistics and
other sciences including genetics.
Information theory, together with other mathematical disciplines, such as the the-
ory of optimal statistical decisions, the theory of optimal control, the theory of algo-
rithms and automata, game theory and so on, is a part of theoretical cybernetics—a
discipline dealing with problems of control. Each of the above disciplines is an in-
dependent field of science. However, this does not mean that they are completely
separated from each other and cannot be bridged. Undoubtedly, the emergence of
complex theories is possible and probable, where concepts and results from differ-
ent theories are combined and which interconnect distinct disciplines. The picture
resembles trees in a forest: at first the trees grow independently, their trunks standing
apart, but then their twigs and branches intertwine, forming a new common crown.
Of course, generally speaking, the statement about uniting different disciplines
is just an assertion, but, in fact, the merging of some initially disconnected fields of
science is now an actual fact. As is evident from a number of works and from this
book, the following three disciplines are inosculating:
1. statistical thermodynamics as a mathematical theory
2. Shannon’s information theory
3. the theory of optimal statistical decisions (together with its multi-step or sequen-
tial variations, such as optimal filtering and dynamic programming).
This book will demonstrate that the three disciplines mentioned above are ce-
mented by ‘thermodynamic’ methods with typical attributes such as ‘thermody-
namic’ parameters and potentials, Legendre transforms, extremum distributions, and
asymptotic nature of the most important theorems.
Statistical thermodynamics can be referred to as a cybernetic discipline only
conditionally. However, in some problems of statistical thermodynamics, its cybernetic
nature manifests itself quite clearly. It is sufficient to recall the second law of ther-
modynamics and ‘Maxwell’s demon’, which is a typical automaton converting in-
formation into physical entropy. Information is ‘fuel’ for perpetual motion of the
second kind. These points will be discussed in Chapter 12.
If we consider statistical thermodynamics as a cybernetic discipline, then L. Boltzmann
and J. C. Maxwell should be called the first outstanding cyberneticists. It is
important to bear in mind that the formula expressing entropy in terms of probabil-
ities was introduced by L. Boltzmann, who also introduced the probability distri-
bution that was the solution to the first variational problem (of course, it does not
matter what we call the function in this formula—energy or cost function).
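A small numerical sketch of this point (an illustration added here, with arbitrary made-up energies): among all distributions over three states with a fixed mean energy, the Boltzmann distribution p_i ∝ exp(−E_i/T) attains the maximum entropy.

```python
import math

E = [0.0, 1.0, 2.0]   # arbitrary 'energies' (cost-function values) of three states
T = 1.5               # arbitrary temperature parameter

# Boltzmann (extreme) distribution: p_i proportional to exp(-E_i / T).
w = [math.exp(-e / T) for e in E]
Z = sum(w)
boltz = [x / Z for x in w]
U = sum(p * e for p, e in zip(boltz, E))   # its mean energy, held fixed below

def entropy(p):
    """Entropy -sum p ln p in natural units."""
    return -sum(x * math.log(x) for x in p if x > 0)

# With normalization and fixed mean energy, distributions over three states
# form a one-parameter family: pick p2, then p1 = U - 2*p2 and p0 = 1 - p1 - p2.
best_h, best_p = -1.0, None
for k in range(1, 2000):
    p2 = (U / 2) * k / 2000
    p1 = U - 2 * p2
    p0 = 1 - p1 - p2
    if min(p0, p1, p2) <= 0:
        continue
    h = entropy((p0, p1, p2))
    if h > best_h:
        best_h, best_p = h, (p0, p1, p2)

# The grid maximizer coincides (to grid accuracy) with the Boltzmann distribution.
print(boltz)
print(best_p)
```

The same construction works with any cost function in place of energy, which is exactly the remark made in the text.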
During the emergence of Shannon’s information theory, the appearance of a well-
known notion of thermodynamics, namely entropy, was regarded by some scientists
as a curious coincidence, and little attention was given to the fact. It was thought that
this entropy had nothing to do with physical entropy (despite the work of Maxwell’s
demon). In this connection, we can recall a countless number of quotation marks
around the word ‘entropy’ in the first edition of the collection of Shannon’s pa-
pers translated into Russian (the collection under the editorship of A.N. Zheleznov,
Foreign Literature, 1953). I believe that now even terms such as ‘temperature’ in in-
formation theory can be written without quotation marks and understood merely as
a parameter incorporated in the expression for the extreme distribution. Similar laws
are valid both in information theory and statistical physics, and we can conditionally
call them ‘thermodynamics’.
At the beginning (from 1948 until 1959), only one ‘thermodynamic’ notion ap-
peared in Shannon’s information theory—entropy. There seemed to be no room for
energy and other analogous thermodynamic potentials in it. In that regard, the the-
ory looked feeble in comparison with statistical thermodynamics. This, however,
was short-lived. The situation changed when scientists realized that in applied infor-
mation theory, regarded as the theory of signal transmission, the cost function was
the analogue of energy, and risk or average cost was the analogue of average energy.
It became evident that a number of main concepts and relations between them are
similar in two disciplines. In particular, if the first variational problem is considered,
then we can speak about resemblance, ‘isomorphism’ of the two theories. Mathe-
matical relationships between the corresponding notions of both disciplines are the
same, and they are the contents of a mathematical theory that is considered in this
book.
The content of information theory is not limited to the specified relations. Be-
sides entropy, the theory contains other notions, such as Shannon's amount of
information. In addition to the first variational problem, related to an extremum of
entropy under fixed risk (i.e. energy), there are other possible variational problems.
The second variational problem is related to the famous result of Shannon about
asymptotic zero probability of error for the transmission of messages through noisy
channels. The third variational problem is connected with asymptotic equivalence
of the values of information of Shannon’s and Hartley’s type. The latter results are
a splendid example of unity of discrete and continuous worlds. They are also a
good example that when the complexity of a discrete system grows, it is convenient
to describe it by continuous mathematical objects. Finally, they are an example of
how a complex continuous system behaves asymptotically similarly to a complex
discrete system. It would be tempting to observe something similar, say, in a future
asymptotic theory of dynamical systems and automata.
Chapter 1
Definition of information and entropy in the absence of noise
In modern science, engineering and public life, a big role is played by information
and operations associated with it: information reception, information transmission,
information processing, storing information and so on. The significance of informa-
tion has seemingly outgrown the significance of the other important factor, which
used to play a dominant role in the previous century, namely, energy.
In the future, in view of the growing complexity of science, engineering, economics
and other fields, the significance of correct control in these areas will grow and,
therefore, the importance of information will increase as well.
What is information? Is a theory of information possible? Are there any general
laws for information independent of its content that can be quite diverse? Answers
to these questions are far from obvious. Information appears to be a more difficult
concept to formalize than, say, energy, which has a certain, long established place
in physics.
There are two sides of information: quantitative and qualitative. Sometimes it is
the total amount of information that is important, while other times it is its quality,
its specific content. Besides, a transformation of information from one format into
another is technically a more difficult problem than, say, transformation of energy
from one form into another. All this complicates the development of information
theory and its usage. It is quite possible that general information theory will
not bring any benefit to some practical problems, and these will have to be tackled by
independent engineering methods.
Nevertheless, general information theory exists, and so do standard situations
and problems, in which the laws of general information theory play the main role.
Therefore, information theory is important from a practical standpoint, as well as in
fundamental science, philosophy and expanding the horizons of a researcher.
From this introduction one can gauge how difficult it was to discover the laws
of information theory. In this regard, the most important milestone was the work of
Claude Shannon [44, 45] published in 1948–1949 (the respective English originals
are [38, 39]). His formulation of the problem and results were both perceived as a
surprise. However, on closer investigation one can see that the new theory extends
and develops former ideas, specifically, the ideas of statistical thermodynamics due
to Boltzmann. The deep mathematical similarities between these two directions are
not accidental. This is evidenced by the use of the same formulae (for instance, for
entropy of a discrete random variable). Besides that, a logarithmic measure for the
amount of information, which is fundamental in Shannon’s theory, was proposed
for problems of communication as early as 1928 in the work of R. Hartley [19] (the
English original is [18]).
In the present chapter, we introduce the logarithmic measure of the amount of
information and state a number of important properties of information, which follow
from that measure, such as the additivity property.
The notion of the amount of information is closely related to the notion of en-
tropy, which is a measure of uncertainty. Acquisition of information is accompanied
by a decrease in uncertainty, so that the amount of information can be measured by
the amount of uncertainty or entropy that has disappeared.
In the case of a discrete message, i.e. a discrete random variable, entropy is de-
fined by the Boltzmann formula
H_ξ = − ∑_ξ P(ξ) ln P(ξ),
In contrast to the case of discrete random variables, where it is sufficient to define entropy using one measure
or one probability distribution, in the general case it is necessary to introduce two
measures to do so. Therefore, entropy is now related to two measures instead of one,
and thus it characterizes the relationship between these measures. In our presenta-
tion, the general version of the formula for entropy is derived from the Boltzmann
formula by using as an example the condensation of the points representing random
variables.
There are several books on information theory, in particular, Goldman [16], Fein-
stein [12] and Fano [10] (the corresponding English originals are [11, 15] and [9]).
These books conveniently introduce readers to the most basic notions (also see the
book by Kullback [30] or its English original [29]).
one of the ‘simple’ (relative to the total system) subsystems. For two throws of dice,
the number of various pairs (ξ1, ξ2) (where ξ1 and ξ2 both take one out of six values)
equals 36 = 6². Generally, for n throws the number of equivalent outcomes is 6ⁿ.
Applying formula (1.1.1) to this number, we obtain the entropy f(6ⁿ). According to
the additivity principle, we find that
f(6ⁿ) = n f(6).
For other m > 1 the latter formula takes the form
f(mⁿ) = n f(m). (1.1.3)
S = Hph = k ln M. (1.1.7)
It is easy to see from the comparison of (1.1.5) and (1.1.6) that 1 nat is log2 e =
1/ln 2 ≈ 1.44 times greater than 1 bit.
In what follows, we shall use natural units (formula (1.1.5)), dropping subscript
‘nat’, unless otherwise stipulated.
Suppose the random variable ξ takes any of the M equiprobable values, say,
1, . . . , M. Then the probability of each individual value is equal to P(ξ ) = 1/M,
ξ = 1, . . . , M. Consequently, formula (1.1.5) can be rewritten as
H = − ln P(ξ ). (1.1.8)
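A small numeric sketch of formulae (1.1.5), (1.1.6) and the nat–bit conversion (the value M = 8 is an arbitrary illustration, not from the text):

```python
import math

def entropy_equiprobable(M, base=math.e):
    # H = log M for M equiprobable outcomes (formula (1.1.5) in the chosen units)
    return math.log(M, base)

H_nat = entropy_equiprobable(8)          # natural units (nats): ln 8
H_bit = entropy_equiprobable(8, base=2)  # binary units (bits): log2 8 = 3

# 1 nat = log2 e = 1/ln 2 ≈ 1.44 bits
assert abs(H_nat * math.log2(math.e) - H_bit) < 1e-12
```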
Suppose now that the probabilities of different outcomes are unequal. If, as earlier, the
number of outcomes equals M, then we can consider a random variable ξ, which
takes one of M values. Considering an index of the corresponding outcome as ξ , we
obtain that those values are nothing else but 1, . . . , M. Probabilities P(ξ ) of those
values are non-negative and satisfy the normalization constraint: ∑ξ P(ξ ) = 1.
If we formally apply equality (1.1.8) to this case, then each ξ should have its own
entropy
H(ξ ) = − ln P(ξ ). (1.2.1)
Thus, we attribute a certain value of entropy to each realization of the variable ξ .
Since ξ is a random variable, we can also regard this entropy as a random variable.
As in Section 1.1, the a posteriori entropy, which remains after the realization of
ξ becomes known, is equal to zero. That is why the information we obtain once the
realization is known is numerically equal to the initial entropy
Similar to entropy H(ξ ), information I depends on the actual realization (on the
value of ξ ), i.e., it is a random variable. One can see from the latter formula that
information and entropy are both large when a posteriori probability of the given
realization is small and vice versa. This observation is quite consistent with intuitive
ideas.
Example 1.1. Suppose we would like to know whether a certain student has passed
an exam or not. Let the probabilities of these two events be
One can see from these probabilities that the student is quite strong. If we were
informed that the student had passed the exam, then we could say: ‘Your message
has not given me a lot of information. I have already expected that the student passed
the exam’. According to formula (1.2.2) the information of this message is quanti-
tatively equal to
I(pass) = log2 (8/7) = 0.193 bits.
If we were informed that the student had failed, then we would say 'Really?' and
would feel that we have improved our knowledge to a greater extent. The amount of
information of such a message is equal to
I(fail) = log2 8 = 3 bits.
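The two messages of the example can be checked numerically. The probabilities P(pass) = 7/8 and P(fail) = 1/8 are not stated explicitly above; they are inferred here from I(pass) = log2(8/7) and should be read as an assumption of this sketch:

```python
import math

def information_bits(p):
    # random information I = -log2 P of a realization (cf. (1.2.2)), in bits
    return -math.log2(p)

I_pass = information_bits(7/8)  # ≈ 0.193 bits: an expected outcome is barely informative
I_fail = information_bits(1/8)  # 3.0 bits: a surprising outcome carries much more
```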
In the theory, however, the principal role is played not by the random entropy (random
information, respectively) (1.2.1), (1.2.2), but by the average entropy defined by the
formula
Hξ = E[H(ξ)] = − ∑ξ P(ξ) ln P(ξ). (1.2.3)
E[f(ζ)] ≤ f(E[ζ]) (1.2.4)
that is valid for every concave function f(x). (Function f(x) = ln x is concave for
x > 0, because f″(x) = −x⁻² < 0.) Indeed, denoting ζ = 1/P(ξ) we have
E[ζ] = E[1/P(ξ)] = ∑_{ξ=1}^{M} P(ξ) · (1/P(ξ)) = M, (1.2.5)
E[f(ζ)] = E[ln(1/P(ξ))] = E[H(ξ)] = Hξ. (1.2.6)
³ Boltzmann's entropy is commonly referred to as 'Shannon's entropy', or just 'entropy', within
the field of information theory.
Hξ ≤ ln M.
ln (ζ/E[ζ]) ≤ ζ/E[ζ] − 1, (1.2.7)
we obtain (1.2.4).
In the general case, it is convenient to consider the tangent line f(E[ζ]) +
f′(E[ζ])(ζ − E[ζ]) of the function f(ζ) at the point ζ = E[ζ] in order to prove (1.2.4).
Concavity implies
Proof. Suppose that ξ1 , ξ2 are two random variables such that the first one assumes
values 1, . . . , m1 and the second one—values 1, . . . , m2 . There are m1 m2 pairs of
ξ = (ξ1 , ξ2 ) with probabilities P(ξ1 , ξ2 ). Numbering the pairs in an arbitrary order
by indices ξ = 1, . . . , m1 m2 we have
P(ξk, …, ξn | ξ1, …, ξk−1) = P(ξ1, …, ξn) / P(ξ1, …, ξk−1)  (k ≤ n)
Let us introduce a special notation for the result of averaging (1.3.1) over ξk , . . . , ξn :
Fig. 1.1 Partitioning stages and the decision tree in the general case
If, in addition, we vary k and n, then we will form a large number of different en-
tropies, conditional and non-conditional, random and non-random. They are related
by identities that will be considered below.
Before we formulate the main hierarchical equality (1.3.4), we show how to in-
troduce a hierarchical set of random variables ξ1 , . . . , ξn , even if there was just one
random variable ξ initially.
Let ξ take one of M values with probabilities P(ξ ). The choice of one realization
will be made in several stages. At the first stage, we indicate which subset (from a
full ensemble of non-overlapping subsets E1 , . . . , Em1 ) the realization belongs to. Let
ξ1 be the index of such a subset. At the second stage, each subset is partitioned into
smaller subsets Eξ1 ξ2 . The second random variable ξ2 points to which smaller sub-
set the realization of the random variable belongs to. In turn, those smaller subsets
are further partitioned until we obtain subsets consisting of a single element. Appar-
ently, the number of nontrivial partitioning stages n cannot exceed M − 1. We can
juxtapose a fixed partitioning scheme with a ‘decision tree’ depicted on Figure 1.1.
Further considerations will be associated with a particular selected ‘tree’.
The probability of moving from the node ξ1 along the branch ξ2 is equal to the
conditional probability
The entropy, associated with a selection of one branch emanating from this node, is
precisely the conditional entropy of type (1.3.2) for n = 2, k = 2:
As in (1.3.3), averaging over all second stage nodes yields the full selection entropy
at the second stage:
As a matter of fact, the selection entropy at stage k in the node defined by values
ξ1 , . . . , ξk−1 is equal to
Hξk (| ξ1 , . . . , ξk−1 ).
At the same time the total entropy of stage k is equal to
For instance, on Figure 1.2 the first stage entropy is equal to Hξ1 = 1 bit. Node A
has entropy Hξ2(| ξ1 = 1) = 0, and node B has entropy Hξ2(| ξ1 = 2) = (2/3) log2(3/2) +
(1/3) log2 3 = log2 3 − 2/3 bits. The average entropy at the second stage is apparently equal
to Hξ2|ξ1 = (1/2) Hξ2(| 2) = (1/2) log2 3 − 1/3 bits. An important regularity is that the sum
of entropies of all stages is equal to the full entropy Hξ, which can be computed
without partitioning a selection into stages. For the above example
Hξ1ξ2 = (1/2) log2 2 + (1/3) log2 3 + (1/6) log2 6 = 2/3 + (1/2) log2 3 bits,
which does, indeed, coincide with the sum Hξ1 + Hξ2 |ξ1 . This observation is general.
Theorem 1.4. Entropy possesses the property of hierarchical additivity:
Hξ1 ,...,ξn = Hξ1 + Hξ2 |ξ1 + Hξ3 |ξ1 ξ2 + · · · + Hξn |ξ1 ,...,ξn−1 . (1.3.4)
Taking the logarithm of (1.3.5) and taking into account definition (1.3.1) of condi-
tional random entropy, we obtain
Averaging this equality according to (1.3.3) gives (1.3.4). This completes the proof.
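The hierarchical additivity (1.3.4) is easy to verify numerically for the two-stage example above (a sketch with the tree of Figure 1.2, leaf probabilities 1/2, 1/3, 1/6, hard-coded):

```python
import math

def H(probs):
    # entropy in bits of a discrete distribution
    return -sum(p * math.log2(p) for p in probs if p > 0)

H_full = H([1/2, 1/3, 1/6])      # full entropy Hξ1ξ2
H_stage1 = H([1/2, 1/2])         # Hξ1: choice between node A and node B
# stage 2: node A is deterministic (entropy 0), node B splits as 2/3 vs 1/3
H_stage2 = (1/2) * 0 + (1/2) * H([2/3, 1/3])

assert abs(H_full - (H_stage1 + H_stage2)) < 1e-12   # Hξ1ξ2 = Hξ1 + Hξ2|ξ1
```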
Theorem 1.5. Whatever the probability distributions P(ξ) and Q(ξ) are, the following
inequality holds:
∑ξ P(ξ) ln [P(ξ)/Q(ξ)] ≥ 0. (1.3.7)
Proof. The proof is similar to the proof of Theorem 1.2. It is based on inequal-
ity (1.2.4) for function f (x) = ln x. We set ζ = Q(ξ )/P(ξ ) and perform averaging
with weight P(ξ ).
Then
E[ζ] = ∑ξ P(ξ) · Q(ξ)/P(ξ) = ∑ξ Q(ξ) = 1
and
E[f(ζ)] = ∑ξ P(ξ) ln [Q(ξ)/P(ξ)].
Inequality (1.2.4) then gives
∑ξ P(ξ) ln [Q(ξ)/P(ξ)] ≤ ln 1 = 0,
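Inequality (1.3.7) can be spot-checked for arbitrary distribution pairs (the two distributions below are arbitrary illustrations, not from the text):

```python
import math

def relative_entropy(P, Q):
    # ∑ξ P(ξ) ln (P(ξ)/Q(ξ)), the left-hand side of (1.3.7)
    return sum(p * math.log(p / q) for p, q in zip(P, Q) if p > 0)

P = [0.5, 0.3, 0.2]
Q = [0.2, 0.5, 0.3]
assert relative_entropy(P, Q) >= 0           # inequality (1.3.7)
assert abs(relative_entropy(P, P)) < 1e-12   # equality when P = Q
```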
Hξ|η ≤ Hξ. (1.3.8)
Proof. Using Theorem 1.5 we substitute P(ξ ) and Q(ξ ) by P(ξ | η ) and P(ξ ),
respectively, therein. Then we will obtain
Averaging this inequality over η with weight P(η ) results in the inequality
Hξ |ηζ = Hξ |ζ . (1.3.9)
The idea that the general case of non-equiprobable outcomes can be asymptotically
reduced to the case of equiprobable outcomes is fundamental for information theory
in the absence of noise. This idea belongs to Ludwig Boltzmann who derived for-
mula (1.2.3) for entropy. Claude Shannon revived this idea and broadly used it for
derivation of new results.
In considering this question here, we shall not try to reach generality, since these
results form a particular case of more general results of Section 1.5. Consider the
set of independent realizations η = (ξ1 , . . . , ξn ) of a random variable ξ = ξ j , which
assumes one of two values 1 or 0 with probabilities P[ξ = 1] = p < 1/2; P[ξ =
0] = 1 − p = q. Evidently, the number of such different combinations (realizations)
is equal to 2n . Let realization ηn1 contain n1 ones and n − n1 = n0 zeros. Then its
probability is given by
P(ηn1 ) = pn1 qn−n1 . (1.4.1)
Of course, these probabilities are different for different n1. The ratio P(η0)/P(ηn) =
(q/p)ⁿ of the largest probability to the smallest one is big and increases fast with a
growth of n. What equiprobability can we talk about then? The thing is that due to
Let us find the variance Var[n1 ] = Var[ξ1 + · · · + ξn ] of the number of ones. Due to
independence of the summands we have
and
E[ξ²] = E[ξ] = p; E[ξ²] − (E[ξ])² = p − p² = pq.
Therefore,
Var(n1 ) = npq; Var(n1 /n) = pq/n. (1.4.2)
Hence, we have obtained that the mean deviation
Δn1 = n1 − np ∼ √(pqn)
increases with n, but more slowly than the mean value np and the length of the entire
range 0 ≤ n1 ≤ n grow. A typical relative deviation Δn1/n1 decreases according to
the law Δn1/n1 ∼ √(q/(np)). Within the bounds of the range |n1 − pn| ∼ √(pqn) the
difference between probabilities P(ηn1 ) is still quite substantial:
P(ηn1)/P(ηn1−Δn1) = (q/p)^{Δn1} ≈ (q/p)^{√(pqn)}
(as it follows from (1.4.1)) and increases with a growth of n. However, this increase
is relatively slow compared to that of the probability itself. For Δn1 ≈ √(pqn), the
corresponding inequality
ln [P(ηn1)/P(ηn1−Δn1)] ≪ ln [1/P(ηn1)]
becomes stronger and stronger with a growth of n. Now we state the foregoing in
more precise terms.
Theorem 1.7. All of the 2n realizations of η can be partitioned into two sets An and
Bn , so that
1. The total probability of realizations from the first set An vanishes:
P(An ) → 0 as n → ∞; (1.4.3)
2. Realizations from the second set Bn become relatively equiprobable in the fol-
lowing sense:
[ln P(η′) − ln P(η″)] / ln P(η″) → 0, η′ ∈ Bn; η″ ∈ Bn. (1.4.4)
Proof. Using Chebyshev's inequality (for instance, see Gnedenko [13] or its translation to English [14]), we obtain
P[|n1 − pn| ≥ ε] ≤ Var(n1)/ε².
Taking into account (1.4.2) and assigning ε = n^{3/4}, we obtain from here that
We include realizations ηn1, for which |n1 − pn| ≥ n^{3/4}, into set An and the rest of
them into set Bn. Then the left-hand side of (1.4.5) is nothing but P(An), and passing
to the limit n → ∞ in (1.4.5) proves (1.4.3).
For the realizations from the second set Bn, the inequality pn − n^{3/4} < n1 < pn + n^{3/4} holds, which, in view of (1.4.1), gives
−n(p ln p + q ln q) − n^{3/4} ln (q/p) < −ln P(ηn1) < −n(p ln p + q ln q) + n^{3/4} ln (q/p). (1.4.6)
Hence
|ln P(η′) − ln P(η″)| < 2n^{3/4} ln (q/p)  for η′ ∈ Bn, η″ ∈ Bn,
and
|ln P(η)| > −n(p ln p + q ln q) − n^{3/4} ln (q/p).
Consequently,
|ln P(η′) − ln P(η″)| / |ln P(η″)| < [2n^{−1/4} ln (q/p)] / [−p ln p − q ln q − n^{−1/4} ln (q/p)].
In order to obtain (1.4.4), one should pass to the limit as n → ∞. This ends the proof.
Inequality (1.4.6) also allows us to evaluate the number of elements of the set Bn .
Theorem 1.8. Let Bn be a set described in Theorem 1.7. Its cardinality M is such
that
(ln M)/n → −p ln p − q ln q ≡ Hξ1 as n → ∞. (1.4.7)
Theorem 1.8a. If we suppose that realizations of set An have zero probability, that realizations of set Bn are equiprobable, and compute the entropy H̃η by the simple formula
H̃η = ln M (see (1.1.5)), then the entropy rate H̃η/n in the limit will be equivalent
to the entropy determined by formula (1.2.3), i.e.
lim_{n→∞} H̃η/n = −p ln p − q ln q. (1.4.8)
Note that formula (1.2.3) can be obtained as a corollary of the simpler rela-
tion (1.1.5).
Proof. According to (1.4.5) the sum
∑_{η∈Bn} P(η) = P(Bn)
The indicated sum involves the elements of set Bn and has M summands. In consequence of (1.4.6) each summand can be bounded above:
P(η) < exp{n(p ln p + q ln q) + n^{3/4} ln (q/p)},
so that
M ≤ P(Bn)/min P(η) ≤ 1/min P(η) < exp{−n(p ln p + q ln q) + n^{3/4} ln (q/p)}. (1.4.10)
In the case when ξ1 takes one out of m values, there are mⁿ different realizations of process η = (ξ1, …, ξn) with independent components. According to the
aforesaid, only M = e^{nHξ1} of them (which we can consider as equiprobable) deserve
attention. When P(ξ1) = 1/m and Hξ1 = ln m, the numbers mⁿ and M are equal; otherwise (when Hξ1 < ln m), the fraction of realizations deserving attention, e^{nHξ1}/mⁿ,
unboundedly decreases with a growth of n. Therefore, the vast majority of realizations in this case is not essential and can be disregarded. This fact underlies coding
theory (see Chapter 2).
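The partition of Theorem 1.7 can be carried out exactly for moderate n by direct enumeration (a sketch; p = 0.3 and n = 14 are arbitrary choices, and for such small n the ratio ln M/n approaches Hξ1 only slowly):

```python
import math

p, q, n = 0.3, 0.7, 14
H1 = -p * math.log(p) - q * math.log(q)   # entropy per symbol, nats

# set Bn: realizations with |n1 - pn| < n^(3/4); count them and their total probability
threshold = n ** 0.75
M, P_Bn = 0, 0.0
for n1 in range(n + 1):
    if abs(n1 - p * n) < threshold:
        count = math.comb(n, n1)          # realizations with exactly n1 ones
        M += count
        P_Bn += count * p**n1 * q**(n - n1)

print(1 - P_Bn)             # P(An) is already small for this n
print(math.log(M) / n, H1)  # ln M / n slowly approaches Hξ1 (Theorem 1.8)
```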
The asymptotic equiprobability takes place under more general assumptions
in the case of ergodic stationary processes as well. Boltzmann called the above
equiprobable realizations ‘micro states’ contrasting them to ‘macro states’ formed
by an ensemble of ‘micro states’.
P(An ) → 0 as n → ∞; (1.5.2)
Proof. Putting, for instance, ε = 1/m and δ = 1/m, we obtain from (1.5.1) that
for n ≥ N(1/m, 1/m). Let m increase by ranging over consecutive integers.
We define sets An as the sets of realizations satisfying the inequality
when N(1/m, 1/m) ≤ n < N(1/(m + 1), 1/(m + 1)). For such a definition, property (1.5.2) apparently holds true due to (1.5.5). For the complementary set Bn we have
that entails
[ln P(η′ⁿ) − ln P(η″ⁿ)] / [−ln P(η″ⁿ)] ≤ (2Hηⁿ/m) / [(1 − 1/m)Hηⁿ] = (2/m)/(1 − 1/m)
for η′ⁿ ∈ Bn, η″ⁿ ∈ Bn, which proves property (1.5.3) (since n ≥ N(1/m, 1/m), the
passage n → ∞ entails m → ∞).
According to (1.5.7), the probabilities P(η n ) of all realizations from set Bn lie in
the range
e−(1+1/m)Hη n < P(η n ) < e−(1−1/m)Hη n .
At the same time, the total probability P(Bn ) = ∑Bn P(η n ) is enclosed between 1 −
1/m and 1. Hence we get the following range for the number of terms:
(on the left-hand side the least number is divided by the greatest number, and on the
right-hand side, vice versa). Therefore,
1 − 1/m + ln(1 − 1/m)/Hηⁿ < (ln Mn)/Hηⁿ < 1 + 1/m,
which entails (1.5.4). Note that the term ln(1 − 1/m)/Hηⁿ converges to zero because
entropy Hηⁿ does not decrease. The proof is complete.
The property of entropic stability, which plays a big role according to Theo-
rem 1.9, can be conveniently checked for different examples by determining the
variance of random entropy:
If this variance does not increase too fast with a growth of n, then applying Cheby-
shev’s inequality can prove (1.5.1), i.e. entropic stability. We now formulate three
theorems related to this question.
lim_{n→∞} Var(H(ηⁿ))/H²ηⁿ = 0, (1.5.8)
for every random variable with a finite variance and every ε > 0. Supposing here
that ξ = H(ηⁿ)/Hηⁿ and taking into account (1.5.8), we obtain
P[|H(ηⁿ)/Hηⁿ − 1| ≥ ε] → 0 as n → ∞
for every ε. Hence, (1.5.1) follows from here, i.e. entropic stability.
Theorem 1.11. If the entropy Hη n increases without bound and there exists a finite
upper limit
lim sup_{n→∞} Var(H(ηⁿ))/Hηⁿ < C, (1.5.9)
then the family of random variables is entropically stable.
Proof. To prove the theorem it suffices to note that the quantity Var[H(ηⁿ)]/H²ηⁿ,
which on the strength of (1.5.9) can be bounded as follows:
Var(H(ηⁿ))/H²ηⁿ ≤ (C + ε)/Hηⁿ  (for n > N(ε)),
tends to zero, because Hηⁿ increases (and ε is arbitrary), so that Theorem 1.10 can
be applied to this case.
which can be called the entropy rate and the variance rate, respectively. A number of
more or less general methods have been developed to calculate these rate quantities.
According to the theorem stated below, finiteness of these limits guarantees entropic
stability.
Theorem 1.12. If limits (1.5.10) exist and are finite, the first of them being different
from zero, then the corresponding family of random variables is entropically stable.
Proof. To prove the theorem we use the fact that formulae (1.5.10) imply
Hηⁿ = H1 n + o(n), (1.5.11)
Var(H(ηⁿ)) = D1 n + o(n). (1.5.12)
Here, as usual, o(n) means that o(n)/n → 0. Since H1 > 0, entropy Hηⁿ increases
without bound. Dividing expression (1.5.12) by (1.5.11), we obtain a finite limit
lim_{n→∞} Var(H(ηⁿ))/Hηⁿ = lim_{n→∞} (D1 + o(1))/(H1 + o(1)) = D1/H1.
Thus, the conditions of Theorem 1.11 are satisfied, which proves entropic stability.
Hξ j > C1 . (1.5.13)
In this case, the conditions of Theorem 1.11 are satisfied, and, thereby, entropic
stability of family {η n } follows from it.
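For a sequence of i.i.d. components the check of Theorems 1.10–1.12 is immediate, since H(ηⁿ) is a sum of n independent terms (a sketch with a Bernoulli(0.3) component, chosen arbitrarily and computed analytically rather than by simulation):

```python
import math

p, q = 0.3, 0.7
h1, h0 = -math.log(p), -math.log(q)   # random entropy of a single symbol
H1 = p * h1 + q * h0                  # entropy rate H1 (per symbol)
D1 = p * h1**2 + q * h0**2 - H1**2    # variance rate D1 (per symbol)

for n in (10, 100, 1000):
    # Var[H(η^n)] / H²_{η^n} = D1 n / (H1 n)² = (D1/H1²)/n → 0, condition (1.5.8)
    ratio = D1 * n / (H1 * n) ** 2
    print(n, ratio)
```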
This potential is similar to the potentials that will be considered further (Sections 4.1
and 4.4). With the help of this potential it is convenient to investigate the rate of
convergence (1.5.2–1.5.4) in Theorem 1.9. This subject is covered in the following
theorem.
Theorem 1.13. Let potential (1.5.15) be defined and differentiable on the interval
s1 < α < s2 (s1 < 0; s2 > 0), and let the equation
dμ0(s)/ds = (1 + ε)Hη  (ε > 0) (1.5.16)
have a root s ∈ [0, s2]. Then the subset A of realizations of η defined by the constraint
H(η)/Hη − 1 > ε, (1.5.17)
Proof. The proof is analogous to that of Theorem 1.9 in many aspects. Rela-
tion (1.5.19) follows from (1.5.17). Inequality (1.5.17) is equivalent to the inequality
∑ P(η ) = 1 − P(A),
B
we find that the number of terms in the above sum (i.e. the number of realizations
in B) satisfies the inequality
Therefore,
We can obtain a number of simple approximate relations from the formulae provided in the previous theorem if we use the condition that ε is small. For ε = 0,
equation (1.5.16) has the null root s = 0, because dμ0(0)/ds = Hη. For small values
of ε, the value of s is small as well. Thus, the right-hand side of equation (1.5.16)
can be expanded into a Maclaurin series
Furthermore, we write down the Maclaurin expansion for the expression in the exponent of (1.5.18):
sμ0′(s) − μ0(s) = (1/2)μ0″(0)s² + O(s³).
Substituting (1.5.21) into the above expression, we obtain
P(A) ≤ exp{−ε²Hη²/(2μ0″(0))}[1 + O(ε³)]. (1.5.22)
Up to now we have assumed that a random variable ξ , with entropy Hξ , can take
values from some discrete space consisting of either a finite or a countable number
of elements, for instance, messages, symbols, etc. However, continuous variables
are also widespread in engineering, i.e. variables (scalar or vector), which can take
values from a continuous space X, most often from the space of real numbers. Such a
random variable ξ is described by the probability density function p(ξ ) that assigns
the probability
ΔP = ∫_{ξ∈ΔX} p(ξ) dξ ≈ p(A)ΔV  (A ∈ ΔX)
P(Ai) = ωi.
p(ξ) = ∑i ωi δ(ξ − Ai), ξ ∈ X. (1.6.3)
P(Ai) ≈ ΔP/ΔN for Ai ∈ ΔX. (1.6.4a)
Then, for the sum over points lying within ΔX, we have
−∑_{Ai∈ΔX} P(Ai) ln P(Ai) ≈ −ΔP ln (ΔP/ΔN).
Summing with respect to all regions ΔX, we see that the entropy (1.6.4) assumes
the form
Hξ ≈ −∑ ΔP ln (ΔP/ΔN). (1.6.5)
If we introduce a measure ν0(ξ) specifying the density of points Ai, and such that
by integrating ν0(ξ) we calculate the number of points
ΔN = ∫_{ξ∈ΔX} ν0(ξ) dξ,
ν0(ξ) = ∑i δ(ξ − Ai),
p̃(ξ)/ν̃0(ξ) ≈ p(ξ)/ν0(ξ).
This formula can also be derived from (1.6.5), since ΔP ≈ pΔV, ΔN ≈ ν0ΔV when
the 'radius' r0 is significantly smaller than the sizes of regions ΔX. However, if the
'smoothing radius' r0 is significantly larger than the mean distance between points
Ai, then the smoothed functions (1.6.7) will have a simple (smooth) representation,
which was assumed in, say, formula (1.6.2).
Discarding the signs ∼ in (1.6.8) for both densities, instead of (1.6.2), we obtain
the following formula for entropy Hξ :
Hξ = −∫_X p(ξ) ln [p(ξ)/ν0(ξ)] dξ. (1.6.9)
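Formula (1.6.9) is easy to evaluate numerically. In the sketch below the choices p(ξ) = 2 on [0, 1/2] and a constant point density ν0 = 100 are hypothetical; for them the integral equals ln 50, the logarithm of the effective number of points carrying the probability mass:

```python
import math

def H_generalized(p, nu0, a, b, steps=100_000):
    # midpoint-rule evaluation of H = -∫ p(ξ) ln(p(ξ)/ν0(ξ)) dξ, formula (1.6.9)
    dx = (b - a) / steps
    total = 0.0
    for i in range(steps):
        x = a + (i + 0.5) * dx
        px = p(x)
        if px > 0:
            total -= px * math.log(px / nu0(x)) * dx
    return total

H = H_generalized(lambda x: 2.0, lambda x: 100.0, 0.0, 0.5)
assert abs(H - math.log(50)) < 1e-9   # -ln(2/100) = ln 50
```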
ν (A) should be given such that the measure P is absolutely continuous with respect
to ν .
A measure P is called absolutely continuous with respect to measure ν , if for
every set A from F, such that ν (A) = 0, the equality P(A) = 0 holds. According to
the well-known Radon–Nikodym theorem, it follows from the absolute continuity
of measure P with respect to measure ν that there exists an F-measurable function
f (x), denoted dP/d ν and called the Radon–Nikodym derivative, which generalizes
the notion of probability density. It is defined for all points from space X excluding,
perhaps, some subset Λ , for which ν (Λ ) = 0 and therefore P(Λ ) = 0.
Thus, if the condition of absolute continuity is satisfied, then the entropy Hξ is
defined with the help of the Radon–Nikodym derivative by the formula
Hξ = −∫_{X−Λ−Λ0} ln [(dP/dν)(ξ)] P(dξ). (1.6.13)
The subset Λ , for which function dP/d ν is not defined, has no effect on the result
of integration since it has null measure P(Λ ) = ν (Λ ) = 0. Also, there is one more
inessential subset Λ0 , namely, a subset on which function dP/d ν is defined but equal
to zero, because
P(Λ0) = ∫_{Λ0} (dP/dν) ν(dξ) = ∫_{Λ0} 0 · ν(dξ) = 0
H(ξ) = −ln [(dP/dν)(ξ)] (1.6.14)
plays the role of random entropy analogous to random entropy (1.2.2). It is defined
almost everywhere in X, i.e. in the entire space excluding, perhaps, sets Λ + Λ0 of
zero probability P.
By analogy with (1.6.11), if N = ν (X) < ∞, then instead of ν (A) we can intro-
duce a normalized (i.e. probability) measure
where
ds² = 2H^{P/(P±δP)} = 2 ∫ ln [P(dζ)/(P(dζ) ± δP(dζ))] P(dζ). (1.6.19)
By expanding the function −ln(1 ± δP/P) in powers of δP/P, it is not difficult
to verify that this metric can also be given by the equivalent formula
ds² = ∫ [δP(dζ)]²/P(dζ) = ∫ [δ ln P(dζ)]² P(dζ). (1.6.20)
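For a discrete distribution the equivalence of (1.6.19) and (1.6.20) can be checked directly: for a small perturbation δP with ∑δP = 0 the two expressions agree up to higher-order terms (the numbers below are arbitrary illustrations):

```python
import math

P  = [0.5, 0.3, 0.2]
dP = [1e-4, -0.5e-4, -0.5e-4]   # small perturbation, sums to zero

# discrete analogue of (1.6.19): ds² = 2 ∑ P ln(P/(P + δP))
ds2_exact = 2 * sum(p * math.log(p / (p + d)) for p, d in zip(P, dP))
# discrete analogue of (1.6.20): ds² = ∑ (δP)²/P
ds2_quad = sum(d * d / p for p, d in zip(P, dP))

assert abs(ds2_exact - ds2_quad) < 1e-3 * ds2_quad
```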
Here and further we assume that differentiability conditions are satisfied. It follows
from (1.6.22) that
(ds/dλ)² = −∫ [∂² ln Pλ(dζ)/∂λ²] Pλ(dζ). (1.6.23)
∫ [(∂ ln Pλ(dζ)/∂λ)² + ∂² ln Pλ(dζ)/∂λ²] Pλ(dζ) = (∂/∂λ) ∫ [∂ ln Pλ(dζ)/∂λ] Pλ(dζ)
We obtain
−∂² ln Pλ(dζ)/∂λ² = d²Γ(λ)/dλ².
Therefore, due to (1.6.23) we get
(ds/dλ)² = d²Γ/dλ²,
and also
−∂² ln Pλ(dζ)/∂λ² = (ds/dλ)². (1.6.26)
Moreover, it is not difficult to make certain that expression (1.6.17) can be rewritten
in the following integral form:
H^{P/Q} = −∫₀¹ dλ ∫ [∂ ln Pλ(dζ)/∂λ] P0(dζ) = −∫₀¹ dλ ∫₀^λ dλ′ ∫ [∂² ln Pλ′(dζ)/∂λ′²] P0(dζ).
To derive the last formula we have taken into account that ∫ [∂ ln Pλ(dζ)/∂λ] P0(dζ) = 0
for λ = 0. Then we take into consideration (1.6.26) and obtain
H^{P/Q} = ∫₀¹ dλ ∫₀^λ (ds/dλ′)² dλ′ = ∫₀¹ (1 − λ)(ds/dλ)² dλ. (1.6.27)
Entropy (1.6.13), (1.6.16) defined in the previous section possesses a set of prop-
erties, which are analogous to the properties of an entropy of a discrete random
variable considered earlier. Such an analogy is quite natural if we take into account
the interpretation of entropy (1.6.13) (provided in Section 1.6) as an asymptotic case
(for large N) of entropy (1.6.1) of a discrete random variable.
The non-negativity property of entropy, which was discussed in Theorem 1.1, is
not always satisfied for entropy (1.6.13), (1.6.16) but holds true for sufficiently large
N. The constraint
Hξ^{P/Q} ≤ ln N
results in non-negativity of entropy Hξ.
Now we move on to Theorem 1.2, which considered the maximum value of en-
tropy. In the case of entropy (1.6.13), when comparing different distributions P we
need to keep measure ν fixed. As it was mentioned, quantity (1.6.17) is non-negative
and, thus, (1.6.16) entails the inequality
Hξ ≤ ln N.
Hξ = ln N.
Theorem 1.15. Entropy (1.6.13) attains its maximum value equal to ln N, when
measure P is proportional to measure ν .
This result is rather natural in the light of the discrete interpretation of for-
mula (1.6.13) given in Section 1.6. Indeed, a proportionality of measures P and ν
means exactly a uniform probability distribution on discrete points Ai and, thereby,
Theorem 1.15 becomes a paraphrase of Theorem 1.2. The following statement is an
analog of Theorems 1.2 and 1.15 for entropy Hξ^{P/Q}: entropy Hξ^{P/Q} attains a
minimum value equal to zero when distribution P coincides with Q.
According to Theorem 1.15, it is reasonable to interpret entropy Hξ^{P/Q}, defined
by formula (1.6.17), as a deficit of entropy, i.e. as the lack of this quantity needed to
attain its maximum value.
So far we assumed that measure P is absolutely continuous with respect to measure Q or (which is the same for finite N) measure ν. This raises the question of how to
define entropy Hξ or Hξ^{P/Q} when there is no such absolute continuity. The answer
to this question can be obtained if formula (1.6.16) is considered as an asymptotic
case (for very large N) of the discrete version (1.6.1). If in condensing the points
Ai (introduced in Section 1.6) we regard, contrary to formula P(Ai) ≈ ΔP/(NΔQ)
(see (1.6.4a)), the probabilities P(Ai) of some points as finite: P(Ai) > c > 0 (c is
independent of N), then measure P, as N → ∞, will not be absolutely continuous
with respect to measure Q. In this case, the deficiency Hξ^{P/Q} will increase without
bound as N → ∞. This allows us to assume
Hξ^{P/Q} = ∞
if measure P is not absolutely continuous with respect to Q (i.e. singular with re-
spect to Q). However, the foregoing does not define the entropy Hξ in the absence
of absolute continuity, since we have indeterminacy of the type ∞ − ∞ according
to (1.6.16). In order to eventually define it, we require a more detailed analysis of
the passage to the limit N → ∞ related to condensing points Ai .
Other properties of the discrete version of entropy, mentioned in Theorems 1.3,
1.4, 1.6, are related to entropy of many random variables and conditional entropy.
With a proper definition of the latter notions, the given properties will take place for
the generalized version, based on definition (1.6.13), as well.
Consider two random variables ξ , η . According to (1.6.13) their joint entropy is
of the form
Hξ,η = −∫ ln [P(dξ, dη)/ν(dξ, dη)] P(dξ, dη). (1.7.1)
At the same time, applying formula (1.6.13) to a single random variable ξ or η we
obtain
Hξ = −∫ ln [P(dξ)/ν1(dξ)] P(dξ),
Hη = −∫ ln [P(dη)/ν2(dη)] P(dη).
Here, ν1 , ν2 are some measures; their relation to ν will be clarified later. We define
conditional entropy Hξ |η as the difference
Hξ |η = Hξ η − Hη , (1.7.2)
Hξ η = Hη + Hξ |η . (1.7.3)
Taking into account (1.6.13) and (1.7.2), it is easy to see that for Hξ |η we will have
the formula
Hξ|η = −∫ ln [P(dξ | η)/ν(dξ | η)] P(dξ, dη), (1.7.4)
where P(dξ | η), ν(dξ | η) are conditional measures defined as the Radon–Nikodym
derivatives with the help of the standard relations
P(ξ ∈ A, η ∈ B) = ∫_{η∈B} P(ξ ∈ A | η) P(dη),
ν(ξ ∈ A, η ∈ B) = ∫_{η∈B} ν(ξ ∈ A | η) ν2(dη)
(sets A, B are arbitrary). The definition in (1.7.4) uses the following random entropy:
H(ξ | η) = −ln [P(dξ | η)/ν(dξ | η)].
ν(dξ, dη) = ν1(dξ)ν2(dη), (1.7.5)
Its non-negativity is seen from here due to condition (1.7.5). Therefore, the differ-
ence (1.7.6) is non-negative. The proof is complete.
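The definition (1.7.2)–(1.7.3) mirrors the discrete identity Hξη = Hη + Hξ|η, which can be verified on a toy joint distribution (the 2×2 table below is an arbitrary illustration):

```python
import math

def H(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

# hypothetical joint distribution P(ξ, η)
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
P_eta = {e: sum(p for (x, ee), p in joint.items() if ee == e) for e in (0, 1)}

H_joint = H(joint.values())
H_eta = H(P_eta.values())
# conditional entropy computed directly: -∑ P(ξ, η) ln P(ξ | η)
H_cond = -sum(p * math.log(p / P_eta[e]) for (x, e), p in joint.items())

assert abs(H_joint - (H_eta + H_cond)) < 1e-12   # Hξη = Hη + Hξ|η
```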
H(ξk) = −ln [P(dξk)/νk(dξk)], (1.7.10)
Hξk = E[H(ξk)], (1.7.11)
H(ξk | ξ1, …, ξk−1) = −ln [P(dξk | ξ1, …, ξk−1)/νk(dξk)], (1.7.12)
Hξk|ξ1,…,ξk−1 = E[H(ξk | ξ1, …, ξk−1)] = −E[ln [P(dξk | ξ1, …, ξk−1)/νk(dξk)]]. (1.7.13)
where
N = ∫ ν(dξ, dη), N1 = ∫ ν1(dξ), N2 = ∫ ν2(dη),
so that
Hξ|η^{P/Q} = ln N1 − Hξ|η. (1.7.17a)
Hξ ≥ Hξη − Hη.
Using (1.7.14), in view of (1.7.15), (1.7.16), we obtain from the last formula that
Hξη^{P/Q} − Hη^{P/Q} ≥ Hξ^{P/Q}.
Comparing it with (1.7.4b), it is easy to see that the sign ≤ has been replaced
with the opposite one (≥) for entropy H^{P/Q}. This is convincing evidence that entropies Hξ^{P/Q}, Hξ|η^{P/Q} cannot be regarded as measures of uncertainty, in contrast to
entropies (1.6.1) or (1.6.13).
In the case of many random variables, it is expedient to impose constraints of
type (1.7.15), (1.7.16) for many variables and use the hierarchical additivity property
Hξ1,…,ξn^{P/Q} = Hξ1^{P/Q} + Hξ2|ξ1^{P/Q} + Hξ3|ξ1,ξ2^{P/Q} + ⋯ + Hξn|ξ1,…,ξn−1^{P/Q}. (1.7.20)
Typically in online processing, the quantitative equality between the amounts of information to be encoded and information that has been encoded is maintained only
on average. Over time, a random time lag accumulates. For a message sequence of
fixed length, the length of its code sequence will have a random spread increasing with
time; and vice versa: for a fixed record length, the number of elementary messages
transmitted will have an increasing random spread.
Another approach can be called ‘block’ or ‘batch’ encoding. A finite collection (a
block) of elementary messages is encoded. If there are several blocks, then different
blocks are encoded and decoded independently. In this approach, there is no increase
in random time lag, but there is a loss of some realizations of the message. A small
portion of message realizations cannot be encoded and is lost, because there are not
enough code realizations. If the block is entropically stable, then the probability of
such a loss is quite small. Therefore, when we use the block approach, we should
investigate problems related to entropic stability.
Following tradition, we mostly consider online encoding in the present chapter.
In the last section, we study the errors of the aforementioned type of encoding,
which occur in the case of a finite length of an encoding sequence.
Let us confirm the validity and efficiency of the definitions of entropy and the
amount of information given earlier by considering a transformation of information
from one form to another. Such a transformation is called encoding and is employed
in the transmission and storage of information.
Suppose that the message to be transmitted and recorded is represented as a se-
quence ξ1 , ξ2 , ξ3 , . . . of elementary messages or random variables. Let each elemen-
tary message be represented in the form of a discrete random variable ξ , which takes
one out of m values (say, 1, . . . , m) with probability P(ξ ). It is required to transform
the message into a sequence η1 , η2 , . . . of letters of some alphabet A = (A, B, . . . , Z).
The number of letters in this alphabet is denoted by D. We treat a space, punctuation
marks and so on as letters of the alphabet, so that a message is represented in one
word. It is required to record a message in such a way that the message itself can be
recovered from the record without any losses or errors. The respective theory has to
show which conditions must be satisfied and how to do so.
Since there is no fixed constraint between the number of elementary messages
and the number of letters in the alphabet, the trivial ‘encoding’ of one message by
one special letter may not be the best strategy in general. We will associate each
separate message, i.e. each realization ξ = i of a random variable, with the corresponding 'word' V(i) = (η1^(i), …, η_{li}^(i)) in the alphabet A (li is the length of that word).
The full set of such words (their number equals m) forms a code. Having defined
the code, we can recover the message in letters of alphabet A from its realization
ξ1 , ξ2 , . . . , i.e. it will take the form
2.1 Main principles of encoding discrete information 37
nHξ = L ln D. (2.1.1)
This relationship tells us that every letter is used to its ‘full power’; it is an indication
of an optimal code.
Further, we consider a particular realization of the message, ξ = j. According to (1.2.1) it contains random information H(ξ) = −ln P(j). But for an optimal code every letter of the alphabet carries information ln D due to (2.1.2). It follows that the code word V(ξ) of this message (which also carries the information H(ξ) = −ln P(ξ) in the optimal code) must consist of l(ξ) = −ln P(ξ)/ln D letters.
For non-optimal encoding, every letter carries less information than ln D. That is
why the length of the code word l(ξ ) (which carries information − ln P(ξ ) = H(ξ ))
must be greater. Therefore, for each code

l(ξ) ≥ H(ξ)/ln D,

and, after averaging,

lav ≥ Hξ / ln D.    (2.1.3)
Example 2.1. Let ξ be a random variable with the following values and probabili-
ties:
ξ       1     2     3     4     5      6      7      8
P(ξ)   1/4   1/4   1/8   1/8   1/16   1/16   1/16   1/16
The message is recorded in the binary alphabet (A, B), so that D = 2. We choose the code words
In order to tell whether this code is good or not, we compare it with an optimal
code, for which relation (2.1.1) is satisfied. Computing entropy of an elementary
message by formula (1.2.3), we obtain
Hξ = ln 2 + (3/4) ln 2 + ln 2 = 2.75 ln 2 nat = 2.75 bits.
There are three letters per elementary message in code (2.1.4), whereas the same
message may require L/n = Hξ / ln 2 = 2.75 letters in the optimal code according
to (2.1.1). Consequently, it is possible to compress the record by 8.4%. As an opti-
mal code we can choose the code
Hη1 = 1 bit;  Hη2|η1 = 1 bit;  Hη3|η1,η2 = 1 bit.

Thus, every letter carries the same information of 1 bit, independently of its position.
Under this condition random information of an elementary message ξ is equal to
the length of the corresponding code word:
H (ξ ) = l (ξ ) bit. (2.1.6)
Therefore, the average length of the word is equal to information of the elementary
message:
Hξ = lav bit = lav ln 2 nat.
This complies with (2.1.1) so that the code is optimal. Relation P(ξ ) = (1/2)l(ξ ) ,
which is equivalent to equality (2.1.6), is a consequence of independence of letter
distribution in a literal record of a message.
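The relation P(ξ) = (1/2)^{l(ξ)} and identity (2.1.6) are easy to verify numerically for the dyadic distribution of Example 2.1. The following Python sketch is our own illustration (only the distribution is taken from the example):

```python
import math

# distribution of Example 2.1
P = [1/4, 1/4, 1/8, 1/8, 1/16, 1/16, 1/16, 1/16]

# random information H(ξ) = −log2 P(ξ) of each value, in bits
info_bits = [-math.log2(p) for p in P]

# for a dyadic distribution these are integers and can serve as word lengths l(ξ)
lengths = [int(i) for i in info_bits]

# entropy Hξ and average word length lav of the code with l(ξ) = H(ξ) bits
H = sum(p * i for p, i in zip(P, info_bits))
l_av = sum(p * l for p, l in zip(P, lengths))
```

Both H and l_av come out as 2.75, in agreement with (2.1.1) and (2.1.6).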
Relation (2.1.3) with the equality sign is valid under the assumption that n is
large, when we can neglect probability P(An ) of non-essential realizations of the
sequence ξ1 , . . . , ξn (see Theorems 1.7, 1.9). In this sense it is proper asymptotically.
We cannot establish a similar relation for a finite n. Indeed, if we require precise
reproduction of all literal messages of a finite length n (their number equals mn ),
then length L of the record will be determined from the formula DL = mn . It will
be impossible to shorten it (for instance, to the value nHξ / ln D) without loss of
some messages (the probability of which is finite for finite n). On the other hand,
supposing that length L = L(ξ1 , . . . , ξn ) is not constant for finite n, we can encode
messages in such a way that (2.1.3) transforms into the reverse inequality lav < Hξ / ln D.
We demonstrate the validity of the last fact in the case of n = 1. Let there be
given three possibilities ξ = 1, 2, 3 with probabilities 1/2, 1/4, 1/4. By selecting a
code
ξ       1     2     3
V(ξ)   (A)   (B)   (AB)        (2.1.7)
corresponding to D = 2, we obtain lav = 1·(1/2) + 1·(1/4) + 2·(1/4) = 1.25, whereas Hξ / ln 2 = 1.5. Therefore, for the given single message inequality (2.1.3) is violated.
40 2 Encoding of discrete information in the absence of noise and penalties
Let us now consider the case of an unbounded (from one side) sequence ξ1 , ξ2 ,
ξ3 , . . . of identically distributed independent messages. It is required to encode the
sequence. In order to apply the reasoning of the previous paragraph, one has to
decompose this sequence into intervals, called blocks, which contain n elementary
messages each. Then we can employ the aforementioned general asymptotic (‘ther-
modynamic’) relations for large n.
However, there also exists a different approach to studying encoding in the absence of noise: one that avoids decomposing the sequence into blocks and discarding inessential realizations. The corresponding theory is presented in this section.
Code (2.1.7) is suitable for transmitting (or recording) one single message, but not for transmitting a sequence of such messages. For instance, the record
ABAB can be simultaneously interpreted as the record V (1)V (2)V (3) of the mes-
sage (ξ1 , ξ2 , ξ3 ) = (1 2 3) or the record V (3)V (1)V (2) of the message (3 1 2) (when
n = 3), let alone the messages (ξ1 , ξ2 ) = (3 3); (ξ1 , ξ2 , ξ3 , ξ4 ) = (1 2 1 2), which
correspond to different n. The code in question does not make it possible to unam-
biguously decode a message and, thereby, we have to reject it if we want to transmit
a sequence of messages.
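The ambiguity can be checked mechanically by listing every decomposition of a record into words of code (2.1.7). A small Python sketch (the recursive search is our illustration, not part of the original text):

```python
def parses(record, codebook):
    """Return every way of splitting `record` into code words of `codebook`."""
    if not record:
        return [[]]
    result = []
    for message, word in codebook.items():
        if record.startswith(word):
            for rest in parses(record[len(word):], codebook):
                result.append([message] + rest)
    return result

code = {1: "A", 2: "B", 3: "AB"}        # code (2.1.7)
decodings = parses("ABAB", code)        # four distinct messages fit this record
```

The four parses are (1 2 1 2), (1 2 3), (3 1 2) and (3 3) — exactly the ambiguity described above.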
The requirement that every long sequence of code symbols can be uniquely decomposed into words restricts the class of feasible codes. Codes in which every semi-infinite sequence of code words is uniquely decomposed into words are called uniquely decodable. As can be proven (see, for instance, the book by Feinstein [12] or its English original [11]), the inequality

∑_{ξ=1}^{m} D^{−l(ξ)} ≤ 1    (2.2.1)

is satisfied for such codes.
In Kraft (prefix) codes, no code word can be a forepart ('prefix') of another word. In code (2.1.7) this condition is violated for word (A), because it is a prefix of word (AB). Codes can be drawn in the form of a 'tree' (Figure 2.1), similar to the 'trees' shown in Figures 1.1 and 1.2. However, if a code is considered separately from a message ξ, no probabilities are assigned to the 'branches' of a 'code tree'. A branch is chosen in stages by recording the next letter η of the word. The end of a word corresponds to a special end node, which we denote by a triangle in Figure 2.1. For each code word there is a code line running from the start node to an end node. The 'tree' corresponding to code (2.1.5) is depicted in Figure 2.1.
For Kraft's uniquely decodable codes, no end node belongs to another code line.
Theorem 2.1. Inequality (2.2.1) is a necessary and sufficient condition for existence
of a Kraft uniquely decodable code.
Proof. First, we prove that inequality (2.2.1) is satisfied for every Kraft code. From each end node of the code tree we emanate the maximum number (equal to D) of auxiliary lines (represented by wavy lines in Figure 2.2). We suppose that they multiply in a maximal way (each of them produces D offspring lines) at all subsequent stages. Next, we calculate the number of auxiliary lines at some high-order stage k. We assume that k is greater than the length of the longest word. The end node of a word of length l(ξ) will produce D^{k−l(ξ)} auxiliary lines at stage k. The total number of auxiliary lines is equal to
∑_ξ D^{k−l(ξ)}.    (2.2.2)
Now we emanate auxiliary lines not only from end nodes but also from every interim
node if the number of code lines coming out from it is less than D. In turn, let
those lines multiply in a maximal way at the other stages. At the k-th stage the number of auxiliary lines will increase compared to (2.2.2) and will become equal to D^k. Consequently,

D^k ≥ ∑_ξ D^{k−l(ξ)},

and dividing by D^k yields (2.2.1).

Conversely, suppose inequality (2.2.1) holds for a given set of lengths l(1), . . . , l(m); we show how to construct a Kraft code with these lengths. Let m1 denote the number of one-letter words. Keeping only the terms with l(ξ) = 1 in (2.2.1), we get D^{−1} m1 ≤ 1.
Hence, m1 ≤ D and the alphabet contains enough letters to fill out all one-letter
words. Further, we consider two-letter words. Besides the letters already used at the first position, there remain D − m1 letters that can occupy the first position, and any of the D letters can be placed after the first letter. In total, there are (D − m1)D possibilities. Each of the D − m1 non-end nodes of the first stage produces D
lines. The number of those two-letter combinations (lines) must be enough to fill
out all two-letter words. We denote their number by m2 . Indeed, keeping only terms
with l(ξ ) = 1 or l(ξ ) = 2 in the left-hand side of (2.2.1), we have
D^{−1} m1 + D^{−2} m2 ≤ 1,

i.e. m2 ≤ D² − D m1. This exactly means that the number of two-letter combinations
is enough for completing two-letter words. After such a completion there are D2 −
Dm1 − m2 two-letter combinations left available, which we can use for adding new
letters. The number of three-letter combinations, equal to (D² − D m1 − m2)D, is enough for completing three-letter words, and so forth. At every step, a further portion of the terms of the sum ∑_ξ D^{−l(ξ)} is used. This finishes the proof.
As is seen from the proof provided above, it is easy to actually construct a code (filling out words with letters) once an appropriate set of lengths l(1), . . . , l(m) is given.
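The construction indicated in the proof — filling out words in order of increasing length, each time taking the next unused letter combination — can be sketched in Python as follows (the function name and details are our own, assuming the lengths already satisfy (2.2.1)):

```python
def construct_prefix_code(lengths, D=2, alphabet="AB"):
    """Build a prefix (Kraft) code with the given word lengths, shortest words first."""
    if sum(D ** -l for l in lengths) > 1:
        raise ValueError("Kraft inequality (2.2.1) is violated")
    code = {}
    value, prev_len = 0, 0
    for index, l in sorted(enumerate(lengths), key=lambda pair: pair[1]):
        value *= D ** (l - prev_len)     # extend the counter to the new length
        digits, v = [], value
        for _ in range(l):               # write `value` in base D, padded to length l
            digits.append(alphabet[v % D])
            v //= D
        code[index] = "".join(reversed(digits))
        value += 1                       # move to the next unused combination
        prev_len = l
    return code

words = list(construct_prefix_code([2, 2, 3, 3, 4, 4, 4, 4]).values())
```

For the word lengths of Example 2.1 this produces eight words, none of which is a prefix of another.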
Next we move to the main theorems.
Theorem 2.2. The average word length lav cannot be less than Hξ / ln D for any
encoding.
Proof. We consider the difference

lav ln D − Hξ = E[l(ξ) ln D + ln P(ξ)] = −E ln [D^{−l(ξ)}/P(ξ)].
2.2 Main theorems for encoding without noise. I.i.d. messages 43
since, due to the inequality ln x ≤ x − 1,

−E ln [D^{−l(ξ)}/P(ξ)] ≥ E[1 − D^{−l(ξ)}/P(ξ)] = 1 − E[D^{−l(ξ)}/P(ξ)],

and

E[D^{−l(ξ)}/P(ξ)] = ∑_ξ P(ξ) · D^{−l(ξ)}/P(ξ) = ∑_ξ D^{−l(ξ)} ≤ 1

by the decipherability condition (2.2.1). Hence

lav ln D − Hξ ≥ 1 − ∑_ξ D^{−l(ξ)} ≥ 0,

which proves the theorem.
Proof. We choose lengths l(ξ ) of code words in such a way that they satisfy the
inequality
−ln P(ξ)/ln D ≤ l(ξ) < −ln P(ξ)/ln D + 1.    (2.2.5)
Such a choice is possible because the interval defined by the double inequality has unit length and, therefore, contains at least one integer.
It follows from the left-hand side inequality that P(ξ) ≥ D^{−l(ξ)} and, thus,

1 = ∑_ξ P(ξ) ≥ ∑_ξ D^{−l(ξ)},
i.e. the decipherability condition (2.2.1) is satisfied. As is seen from the arguments
just before Theorem 2.2, it is not difficult to practically construct code words with
selected lengths l(ξ ). Averaging out both sides of the right-hand side inequality
of (2.2.5), we obtain (2.2.4). This finishes the proof.
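The choice of lengths (2.2.5) is straightforward to carry out numerically. A Python sketch for D = 2 (the sample distribution is our own choice) picks l(ξ) = ⌈−log2 P(ξ)⌉ and checks both the decipherability condition (2.2.1) and the bound on the average length:

```python
import math

def shannon_lengths(probabilities):
    """Word lengths satisfying (2.2.5) for the binary alphabet (D = 2)."""
    return [math.ceil(-math.log2(p)) for p in probabilities]

probs = [0.4, 0.3, 0.2, 0.1]                       # an arbitrary illustrative distribution
lengths = shannon_lengths(probs)
H_bits = -sum(p * math.log2(p) for p in probs)     # entropy in bits
l_av = sum(p * l for p, l in zip(probs, lengths))  # average word length
kraft = sum(2 ** -l for l in lengths)              # left-hand side of (2.2.1)
```

The Kraft sum does not exceed 1, so a prefix code with these lengths exists, and the average length obeys Hξ/ln 2 ≤ lav < Hξ/ln 2 + 1.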
lav < Hξ / ln D + 1/n.

Increasing n, we can make the value 1/n as small as needed, which proves the theorem.
The provided results have been derived for the case of a stationary sequence of
independent random variables (messages). These results can be extended to the case
of dependent messages, the non-stationary case or others. In this case, it is essential
that a sequence satisfies certain properties of entropic stability (see Section 1.5).
The previous reasoning not only resolves the principal questions about the existence of asymptotically optimal codes, i.e. codes for which the average length of a message converges to the minimum value, but also provides practical methods of finding them. By
selecting n and determining lengths of code words with the help of the inequality
−ln P(ξ1, . . . , ξn)/ln D ≤ l(ξ1, . . . , ξn) < −ln P(ξ1, . . . , ξn)/ln D + 1,
we obtain a code that is asymptotically optimal according to Theorem 2.4. However,
such a code may not be quite optimal for fixed n. In other words, it may appear that
the message ζ = (ξ1 , . . . , ξn ) can be encoded in a better way, i.e. with a smaller
length lav .
Huffman [23] (see also Fano [10] or its original in English [9]) found a simple method of optimal encoding, which attains the minimum average length amongst all possible codes for a given message.
First, we consider the case D = 2 and some optimal code K for the message ζ. We investigate which properties the optimal code, or the respective 'tree', must possess.
2.3 Optimal encoding by Huffman. Examples 45
At every stage (except possibly the last one) the maximum number D of lines must emanate from each non-end node. Otherwise, an end line from the last stage could be moved to the 'incomplete' node, thereby reducing the length lav. There must be at least two lines at the last stage; otherwise, the last stage would be redundant and could be excluded, which would shorten the corresponding code line. Amongst the lines of the last stage there are two that correspond to the two least likely realizations ζ1, ζ2 of the message, chosen from the set of all its realizations ζ1, . . . , ζm. Otherwise, we would be able to exchange messages, assigning the less likely realization to the longer word, and thereby shorten the average word length.
But if two least likely messages have code words ending at the last stage, then we
can suppose that their lines emanate from the same node from the penultimate stage.
If it does not work for K, then by switching two messages we make it work and, thus,
we obtain another definite code having the same average length and, consequently,
being at least as good as the original code K.
We turn the node from the penultimate stage (from which the two aforemen-
tioned lines emanate) into an end node. Correspondingly, we treat two least likely
realizations ζ1 , ζ2 as one. Then the number of distinct realizations is equal to m − 1
and the respective probabilities are P(ζ1) + P(ζ2), P(ζ3), . . . , P(ζm). We can think of this as a new ('simplified') random variable, with a new code and a new ('truncated') 'tree'. The average word length of the original code is expressed in terms of the average length l′av of the truncated code by the formula

lav = l′av + P(ζ1) + P(ζ2).

Evidently, if the original code 'tree' minimizes lav, then the 'truncated' code 'tree' minimizes l′av, and vice versa. Therefore, the problem of finding an optimal code
‘tree’ is reduced to the problem of finding an optimal ‘truncated’ code ‘tree’ for
a message with m − 1 possibilities. All those considerations related to the initial
message above can be applied to this ‘simplified’ message. Thus, we obtain a ‘twice
simplified’ message with m−2 possibilities. Then the aforementioned consideration
is also applied to the latter message and so forth until a trivial message with two
possibilities is obtained. During the process of the specified simplification of a code
‘tree’ and a message, two branches of the code ‘tree’ merge into one at every step
and eventually its structure is understood completely.
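The repeated merging of the two least likely realizations is naturally implemented with a priority queue. The following Python sketch is our own illustration; it returns only the optimal word lengths, since each merge lengthens every word in the merged group by one letter:

```python
import heapq
import itertools

def huffman_lengths(probabilities):
    """Word lengths of an optimal binary (D = 2) Huffman code."""
    tie = itertools.count()            # tie-breaker so the heap never compares lists
    heap = [(p, next(tie), [i]) for i, p in enumerate(probabilities)]
    heapq.heapify(heap)
    lengths = [0] * len(probabilities)
    while len(heap) > 1:
        p1, _, group1 = heapq.heappop(heap)    # two least likely realizations
        p2, _, group2 = heapq.heappop(heap)
        for i in group1 + group2:              # the merge adds one letter to each word
            lengths[i] += 1
        heapq.heappush(heap, (p1 + p2, next(tie), group1 + group2))
    return lengths
```

For the distribution of Example 2.1 this reproduces the lengths 2, 2, 3, 3, 4, 4, 4, 4 (average 2.75 letters), and for the probabilities 1/2, 1/4, 1/4 it gives lengths 1, 2, 2.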
Similar reasoning can also be applied to the case D > 2. Suppose that all the
nodes of the code ‘tree’ are entirely filled, which occurs when m − 1 is divided by
D − 1 without a remainder. In fact, each node (except the terminal ones) adds D − 1
new lines (if we suppose that only one line comes to the first node from below),
and the quotient (m − 1)/(D − 1) is equal to the number of nodes. In this case every
‘simplification’ decreases the number of realizations of a random variable by D − 1.
Then the D least probable outcomes merge, and the D longest branches of the code
tree (among the remaining ones) are shortened by unity and are replaced by one
terminal node.
There is some difficulty in the case when quotient (m − 1)/(D − 1) is not integer.
This means that at least one of internal nodes of the code ‘tree’ must be incomplete.
But, as it was mentioned, incomplete nodes can belong only to the last stage in the
optimal tree, i.e. they are associated with the last choice. Without ‘impairing’ the
tree we can transpose messages related to words of the maximum length in such a
way that: 1) only one single node will be incomplete, 2) messages corresponding to
that node will have the least probability. We denote by r the remainder of the division of m − 1 by D − 1 (0 < r ≤ D − 2). Then the single incomplete node will have r + 1
branches. According to the aforesaid, the first ‘pruning’ of the ‘tree’ and the sim-
plification of the random variable will be the following: we select r + 1 least likely
probabilistic realizations out of ζ1 , . . ., ζm and replace them by a single realization
with the aggregate probability. The ‘pruned’ tree will have already filled internal
nodes. For further simplification we take D least likely realizations out of the ones
formed at the previous simplification and replace them by a single one with the ag-
gregate probability. The same operation is conducted further. We have just described
a method to construct an optimal Huffman’s code. The aforesaid considerations tell
us that the resultant code ‘tree’ may not coincide with the optimal ‘tree’ K but is not
worse than that, i.e. it has the same average length lav .
V (0) = A; V (1) = B.
Hξ / ln D = 0.544 (2.3.1)
and, thereby, lav = 0.544 is asymptotically attainable. The comparison of the latter number with 1 shows that we can achieve a significant improvement in conciseness if we move to more complicated types of encoding, with n > 1.
2. Suppose that n = 2. Then we will have the following possibilities and probabili-
ties:
ζ         1       2       3       4
ξ1ξ2      11      10      01      00
P(ζ)    0.015   0.110   0.110   0.765
Here ζ means an index of pair (ξ1 , ξ2 ).
At the first ‘simplification’ we can merge realizations ζ = 1 and ζ = 2 into
one having probability 0.015 + 0.110 = 0.125. Among realizations left after the
‘simplification’ and having probabilities 0.110, 0.125, 0.765 we merge two least
likely ones again. The scheme of such ‘simplifications’ and the respective code
‘tree’ are represented on Figures 2.3 and 2.4.
It just remains to position letters A, B along the branches in order to obtain the
following optimal code:
ζ            1      2      3     4
ξ1ξ2         11     10     01    00
V(ξ1ξ2)     AAA    AAB    AB     B
Its average word length is equal to
Fig. 2.3 The 'simplification' scheme for n = 2.
Fig. 2.4 The code 'tree' for the considered example with n = 2.
The resulting value is noticeably closer to the limit value lav = 0.544 (2.3.1) than lav = 0.68 (2.3.2). Increasing n and constructing optimal codes by the specified method, we can approach the value 0.544 as closely as desired. Of course, this is achieved at the cost of a more complicated coding system.
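The per-letter bookkeeping of this example can be rechecked directly from the pair probabilities in the table above; the optimal code found there has word lengths 3, 3, 2, 1. A short Python sketch (values as printed in the table, i.e. rounded):

```python
# pair probabilities from the table above (rounded values, p = P(ξ = 1) ≈ 1/8)
P_pair = {"11": 0.015, "10": 0.110, "01": 0.110, "00": 0.765}
word_length = {"11": 3, "10": 3, "01": 2, "00": 1}   # code AAA, AAB, AB, B

letters_per_pair = sum(P_pair[z] * word_length[z] for z in P_pair)
l_av = letters_per_pair / 2        # letters per elementary message
```

This gives l_av = 0.68 letters per message, already well below the one letter per message of the n = 1 code and on the way to the limit 0.544 of (2.3.1).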
The encoding methods described in Section 2.2 are such that the record length of a fixed number of messages is random. The provided theory gives estimates of the average length but says nothing about its deviations. Meanwhile, in practice a record length or a message transmission time can be bounded by technical specifications.
It may occur that a message record does not fit in permissible limits and, thus, the
given realization of the message cannot be recorded (or transmitted). This results
in certain losses of information and distortions (dependent on deviation of a record
length). The investigation of those phenomena is a special problem. As it will be
seen below, entropical stability of random variables
η n = (ξ1 , . . . ., ξn ) .
ln P (η n ) ln P (η n )
− l (η n ) < − +1
ln D ln D
2.4 Errors of encoding without noise in the case of a finite code sequence length 49
according to (2.2.5). Then the record of those messages, for which the constraint
−ln P(ηn)/ln D + 1 ≤ L    (2.4.1)
holds, will automatically be within fixed bounds. When decoding, those realizations
will be recovered with no errors. Records of some other realizations will not fit in; when decoding in such a case, we may settle on an arbitrary realization, which, as a rule, entails errors. There arises the question of how to estimate the probabilities of correct and erroneous decoding.
Evidently, the probability of decoding error will be at most the probability of the
inequality
−ln P(ηn)/ln D + 1 > L,    (2.4.2)
which is reverse to (2.4.1).
Inequality (2.4.2) can be represented in the form

H(ηn)/Hηn > ((L − 1)/Hηn) ln D.
But the ratio H(η n )/Hη n converges to one for entropically stable random variables.
Taking into account the definition of entropically stable variables (Section 1.5), we
obtain the following result.
Theorem 2.5. If random variables η n are entropically stable and the record length
L = Ln increases with a growth of n in such a way that the expression
((Ln − 1)/Hηn) ln D − 1
is kept larger than some positive constant ε , then probability Per of decoding error
converges to zero as n → ∞.
The condition of the theorem is satisfied, in particular, when Ln ln D > (1 + ε) Hηn, provided Hηn → ∞ as n → ∞.
The realizations that have not fitted in, and thereby have been decoded erroneously, pertain to the set An of non-essential realizations; recall that the set An is treated in Theorem 1.9.
Employing the results of Section 1.5, we can derive more detailed estimates of the error probability Per and determine its rate of decrease.
Theorem 2.6. For the fixed record length L > (Hη n / ln D) + 1 the probability of
decoding error satisfies the inequality
Per ≤ Var(H(ηn)) / [(L − 1) ln D − Hηn]².    (2.4.3)
Indeed, it suffices to apply Chebyshev's inequality

P[|ξ − E[ξ]| ≥ ε] ≤ Var(ξ)/ε²

to the random variable H(ηn)/Hηn with

ε = ((L − 1)/Hηn) ln D − 1.
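For i.i.d. messages both Hηn and Var(H(ηn)) grow linearly with n, so bound (2.4.3) is easy to evaluate. A Python sketch (the per-message distribution and the choice of record length are our own illustrative assumptions):

```python
import math

# per-message distribution (that of Example 2.1, as an illustration)
p = [1/4, 1/4, 1/8, 1/8, 1/16, 1/16, 1/16, 1/16]
H1 = -sum(pi * math.log(pi) for pi in p)                # entropy per message, nats
V1 = sum(pi * math.log(pi) ** 2 for pi in p) - H1 ** 2  # variance of the random entropy

def error_bound(n, L, D=2):
    """Chebyshev bound (2.4.3) on P_er for n i.i.d. messages and record length L."""
    H_n, V_n = n * H1, n * V1
    slack = (L - 1) * math.log(D) - H_n
    if slack <= 0:
        raise ValueError("the bound requires L > H/ln D + 1")
    return V_n / slack ** 2

# a record 10% longer than the entropy requires: the bound decays like 1/n
L100 = 1 + math.ceil(1.1 * 100 * H1 / math.log(2))
bound = error_bound(100, L100)
```

With the record length kept a fixed fraction above the entropy, the bound shrinks as n grows, illustrating Theorem 2.5.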
Using Theorem 1.13, it is easy to show a faster exponential law of decay for the
probability Per .
Theorem 2.7. For the fixed length L > Hη / ln D + 1 the probability of decoding error satisfies the inequality

Per ≤ e^{μ(s) − s μ′(s)},    (2.4.6)

where μ(α) is the potential defined by formula (1.5.15), and s is a positive root of the equation

μ′(s) = (L − 1) ln D,
if such a root lies both in the domain and in the differentiability region of the poten-
tial.
In order to prove the theorem we just need to use formula (1.5.18) from Theo-
rem 1.13 by writing
ε = ((L − 1)/Hη) ln D − 1.
For small ε we can replace inequality (2.4.6) with the following inequality:
Per ≤ exp{ −[(L − 1) ln D − Hη]² / (2 Var(H(η))) }.    (2.4.7)
It tells us about an exponential law of decay for the probability Per with a growth
of n.
Formulae (2.4.7) and (2.4.8) correspond to the case, in which the probability
distribution of entropy can be regarded as approximately Gaussian due to the central
limit theorem.
Chapter 3
Encoding in the presence of penalties. First variational problem
Symbols                    V_i        l(i)
Dot                        +−         2
Dash                       +++−       4        (3.1.1)
Spacing between letters    −−−        3
Spacing between words      −−−−−−     6
that allows us to find M(L) as a function of L. After M(L) has been found, it becomes
easy to determine information that can be transmitted by a recording of length L. As
in the previous cases, maximum information is attained when all of M(L) scenarios
are equiprobable. At the same time
H^L / L = ln M(L) / L,
3.1 Direct method of computing information capacity of a message for one example 55
where H^L = H_{v_{i1} ... v_{ik}} is the entropy of the recording. We take the limit of this relationship as L → ∞ and thereby obtain the entropy of the recording per unit length:
As is seen from this formula, there is no need to find an exact solution of equa-
tion (3.1.3) but it is sufficient to consider only asymptotic behaviour of ln M(L) for
large L. Equation (3.1.3) is linear and homogeneous. As for any such equation, we
can look for a solution in the form
Such a solution usually turns out to be possible only for certain (‘eigen’) val-
ues λ = λ1 , λ2 , . . .. With the help of the spectrum of these ‘eigenvalues’ a general
solution can be represented as follows:
∑_i W^{−l(i)} = 1    (3.1.9)
λ = lnW. (3.1.10)
Since the function M(L) can be neither complex nor of alternating sign, the dominant eigenvalue must be real: Im λm = 0. But (3.1.10) gives the only real eigenvalue. Consequently, the value ln W is the desired eigenvalue λm with the maximum real part. Formula (3.1.11) then
takes the form
M(L) ≈ Cm W^L,
and limit (3.1.4) appears to be equal to
H = λm = lnW (3.1.12)
that yields the solution of the problem in question. This solution was first found by Shannon [45] (the English original is [38]).
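Since the left-hand side of (3.1.9) decreases monotonically in W, the root and hence the capacity (3.1.12) are easy to find by bisection. A Python sketch (the bracketing interval is our own assumption):

```python
import math

def capacity_per_unit_length(lengths, hi=10.0, tol=1e-12):
    """Solve sum_i W**(-l_i) = 1 (eq. (3.1.9)) by bisection; return H = ln W."""
    f = lambda W: sum(W ** -l for l in lengths) - 1
    lo = 1.0 + 1e-9              # f(lo) > 0 for two or more symbols, f(hi) < 0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return math.log((lo + hi) / 2)

# symbol lengths 2, 4, 3, 6 of dot, dash, letter spacing, word spacing from (3.1.1)
H = capacity_per_unit_length([2, 4, 3, 6])
```

The returned H is the capacity (3.1.12) in nats per unit recording length; exp(H) is the root W of (3.1.9).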
The recording considered above or a transmission having a fixed length in the spec-
ified alphabet is an example of a noiseless discrete channel. Here we provide a more
general definition of this notion.
Let a variable y take values in a discrete set Y (not necessarily of finite cardinality). Further, some numerical function c(y) is given on Y. For reasons that will become clear later, we call it a cost function.
Let there be given some number a and the fixed condition

c(y) ≤ a.    (3.2.1)
that is, with the requirement that a recording length does not exceed the specified
value of L. The consideration of this example will be continued in Section 3.4 (Ex-
ample 3.4).
Variable Ma (the number of realizations) tells us about capabilities of the given
system (channel) to record or transmit information. The maximum amount of information that can be transmitted is evidently equal to ln Ma. This amount can be called the information capacity or channel capacity. However, a direct calculation of Ma sometimes involves difficulties, as is seen from the example covered in Section 3.1. Thus, it is more convenient to define the notion of channel capacity not as ln Ma but somewhat differently.
We introduce a probability distribution P(y) on Y and replace condition (3.2.1)
by an analogous condition for the mean
Here maximization is conducted over different distributions P(y), which are com-
patible with constraint (3.2.5).
Hence, channel capacity is defined as a solution of one variational problem.
As it will be seen from Section 4.3, there exists a direct asymptotic relationship
between value (3.2.6) and Ma . Speaking of applications of these ideas to statistical
physics, relationships (3.2.1) (taken with an equality sign) and (3.2.5) correspond to
microcanonical and canonical distributions, respectively. Asymptotic equivalence of
these distributions is well known in statistical physics.
The above definitions of a channel and its capacity can be modified a little, for
instance, by substituting inequalities (3.2.1), (3.2.6) with the following two-sided
inequalities

a1 ≤ c(y) ≤ a2;    a1 ≤ ∑_y c(y) P(y) ≤ a2    (3.2.7)
or with inequalities in the opposite direction (if the number of realizations remains
finite). Besides, in more complicated cases several numerical functions may be given, or a function taking values in some other space. These modifications usually do not entail fundamental changes, so we do not specifically address them here.
The cost function c(y) may have a different physical or technical meaning in different problems. It can characterize the 'costs' of certain symbols, reflecting the unequal expenditures incurred in recording or transmitting a symbol, for instance, different amounts of paint or electric power. It can also correspond to penalties placed on various adverse factors; for instance, it can penalize excessive height of letters. In particular, if the length of symbols is penalized, then (3.2.3), i.e. c(y) = l(i) for y = i, holds true.
It is essential that all three specified problem statements lead to the same solu-
tion if parameters a, I, β are coordinated properly. We call any of these mentioned
statements the first variational problem.
It is convenient to study the addressed questions with the help of thermodynamic
potentials introduced below.
P(y) ≥ 0,  y ∈ Y,    (3.3.2)

must be satisfied.
the set of constraints, since the solution of the problem with all other constraints
retained but without the given requirement turns out (as subsequent inspection will show) to satisfy it automatically. We introduce Lagrange multipliers β, γ and try to find an extremum of the expression
Here differentiation is carried out with respect to those and only those P(y) which are different from zero in the extremum distribution P(·). We assume that this distribution exists and is unique. The subset of Y whose elements have non-zero probabilities in the extremum distribution is denoted by Ỹ; we call Ỹ the 'active' domain. Hence, equation (3.3.4) is valid only for elements y belonging to Ỹ. From (3.3.4) we get
Theorem 3.1. When solving the problem of maximum entropy (3.2.6) under the constraint (3.2.8), the probabilities of all elements of Y at which the cost function c(y) takes finite values are different from zero in the extremum distribution. Therefore, if the function c(y) is finite for all elements, then the set Ỹ coincides with the entire set Y.
Proof. Assume the contrary, i.e. assume that some elements y ∈ Y have null probability P0(y) = 0 in the extremum distribution P0(·). Since the distribution P0 is extremal, for any other measure P1 (even non-normalized) with the same nullity set Y − Ỹ the following relationship is valid:
Now let P(y1) ≠ 0 for some y1 ∈ Y − Ỹ, and suppose that P(y) = 0 at the other points of the subset Y − Ỹ. We choose the remaining probabilities P(y), y ∈ Ỹ, in such a way that the constraints

∑_y P(y) = 1;   ∑_y c(y) P(y) = ∑_y c(y) P0(y)    (3.3.8)

are satisfied and the differences P(y) − P0(y) (y ∈ Ỹ) depend linearly on P(y1).
A non-zero probability P(y1) ≠ 0 leads to the emergence of the extra term −P(y1) ln P(y1) in the expression for the entropy Hy. Setting P1(y) = P(y) for y ∈ Ỹ and taking into account (3.3.7), we obtain that
For sufficiently small P(y1 ) (when − ln P(y1 ) − β c(y1 ) − γ + O(P(y1 )) > 0) the ex-
pression in the right-hand side of (3.3.9) is certainly positive. Thus, the entropy of
distribution P satisfying the same constraints (3.3.8) will exceed the entropy of the
extremum distribution P0 , which is impossible. So, element y1 having a zero proba-
bility in the extremum distribution does not exist. The proof is complete.
Since, by Theorem 3.1, Ỹ and Y coincide for finite values of c(y), henceforth we use formula (3.3.5) or (3.3.6) for the entire set Y.
3.3 Solution of the first variational problem. Thermodynamic parameters and potentials 61
∂²Hy / ∂P(y) ∂P(y′) = −(1/P(y)) δ_{yy′},    y, y′ ∈ Y,

where δ_{yy′} is the Kronecker symbol. Thus, taking into account the vanishing of the first-order derivatives and neglecting the third-order terms in the deviations ΔP(y), y ∈ Y, from the extremum distribution P, we obtain

ΔHy = Hy − C = −(1/2) ∑_{y∈Y} [ΔP(y)]² / P(y).    (3.3.10)
Since all the P(y) included here are positive, the difference Hy − C is negative, which proves the maximality of C.
It can also be proven that if β > 0, then the extremum distribution found above
corresponds to the minimum of average cost (3.2.9) with the fixed entropy (3.2.10).
In order to prove it, we need to take into account that analogously to (3.3.10) the
following relationship
ΔK = K − Kextr = −(1/2) ∑_{y∈Y} [ΔP(y)]² / P(y) < 0

is valid.
F (T ) = −T ln Z. (3.3.12)
With the help of the mentioned free energy we can compute entropy C and energy
R. We now prove a number of relations common in thermodynamics as separate
theorems.
Theorem 3.2. Free energy is connected with entropy and average energy by the
simple relationship
F = R − TC. (3.3.13)
Proof. It follows from the extremum distribution (3.3.6) that the random entropy equals

H(y) = (c(y) − F)/T.
Averaging of this equality with respect to y leads to relationship (3.3.13). The proof
is complete.
Theorem 3.3. Entropy can be computed via differentiating free energy with respect
to temperature
C = −dF/dT. (3.3.15)
−dF/dT = ln Z + T Z^{−1} dZ/dT = ln Z + T Z^{−1} ∑_y (c(y)/T²) e^{−c(y)/T},
−dF/dT = T^{−1}(−F + R) = C.

This ends the proof.
As is seen from formulae (3.3.13), (3.3.15), energy (average cost) can be ex-
pressed using free energy as follows:
R = F − T dF/dT.    (3.3.16)
It is simple to verify that this formula can be rewritten in the following more compact
form:
R = d(βF)/dβ = −d ln Z/dβ,    β = 1/T.    (3.3.17)
After calculation of functions C(T ), R(T ) it is straightforward to find the channel
capacity C = C(a) (3.2.6). Equation (3.2.8), i.e. the equation
R (T ) = a (3.3.18)
will determine T (a). So, the channel capacity will be equal to C(a) = C(T (a)).
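The chain Z → F → R, C given by (3.3.12), (3.3.13) and (3.3.15) can be traced numerically for any finite set of costs. A Python sketch (the two costs ±1 and the temperature are our own illustrative choices):

```python
import math

def potentials(costs, T):
    """Free energy, average cost and entropy of P(y) ∝ exp(−c(y)/T)."""
    Z = sum(math.exp(-c / T) for c in costs)        # partition function
    F = -T * math.log(Z)                             # free energy (3.3.12)
    P = [math.exp(-c / T) / Z for c in costs]        # extremum distribution
    R = sum(pr * c for pr, c in zip(P, costs))       # average cost (energy)
    C = -sum(pr * math.log(pr) for pr in P)          # entropy (channel capacity)
    return F, R, C

T = 0.7
F, R, C = potentials([1.0, -1.0], T)
```

One can confirm F = R − TC (Theorem 3.2) directly, and C = −dF/dT (Theorem 3.3) by a numerical derivative.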
Similarly, for problem (3.2.9), (3.2.10) we can find the minimum average cost with a given amount of information I. Equation (3.2.10) then takes the form

C(T) = I,

which determines the temperature T = T(I).
Theorem 3.4. If distribution (3.3.6) exists, i.e. the sum (3.3.11) converges, then the
following formula
dR/dC = T,    (3.3.19)
is valid, so that for T > 0 the average cost is an increasing function of entropy, and
for T < 0 it is decreasing.
Proof. Differentiating relationship (3.3.13), written as R = F + TC, we obtain dR = dF + C dT + T dC. Since dF = −C dT by (3.3.15), this reduces to
64 3 Encoding in the presence of penalties. First variational problem
dR = T dC. (3.3.20)
dH = dQ/T
where dQ is the amount of heat entering the system and augmenting its internal
energy by dR. H = C is the entropy.
Theorem 3.5. If distribution (3.3.6) exists and c(y) is not constant within Y , then
R turns out to be an increasing function of temperature. Also, channel capacity
(entropy) C is an increasing function of T for T > 0.
Proof. Differentiating (3.3.17) we obtain
dR/dT = −(1/T²) d²(βF)/dβ². (3.3.21)

Analogously, in view of (3.3.20),

dC/dT = −(1/T³) d²(βF)/dβ². (3.3.22)

Hence it suffices to verify that

−d²(βF)/dβ² > 0. (3.3.23)
Since βF = −ln Z, inequality (3.3.23) follows from the formulae

(1/Z) dZ/dβ = −∑_y c(y) e^{−βc(y)} / ∑_y e^{−βc(y)} = −E[c(y)], (3.3.25)

(1/Z) d²Z/dβ² = ∑_y c²(y) e^{−βc(y)} / ∑_y e^{−βc(y)} = E[c²(y)],

which give −d²(βF)/dβ² = d² ln Z/dβ² = E[c²(y)] − E²[c(y)], the variance of c(y), which is positive since c(y) is not constant within Y.
From the latter formulae, due to (3.3.23), we conclude that the function F(T) is concave for T > 0 and convex for T < 0.
The above-mentioned facts are a particular manifestation of the properties of thermodynamic potentials; representative examples of such potentials are F(T) and ln Z(β).
Hence, these potentials play a significant role not only in thermodynamics but also
in information theory. In the next chapter, we will touch upon subjects related to
asymptotic equivalence between constraints of types (3.2.1) and (3.2.5).
Example 3.1. For simplicity, let there be only two symbols initially: m = 2; y = 1, 2, which correspond to different costs, say c(1) = −a and c(2) = a (the origin of the cost scale, the analog of energy, can be chosen arbitrarily). In the example in question the optimal probability distribution has the form

P(1, 2) = e^{±a/T} / [2 cosh(a/T)] (3.4.2a)

according to (3.3.14). The functions (3.4.2) thus determined can be used when there exists a sequence y^L = (y₁, …, y_L) of length L consisting of the symbols described above. If the number of distinct elementary symbols equals 2, then the number of distinct sequences is evidently equal to m = 2^L. Next we assume that the
costs imposed on an entire sequence are the sum of the costs imposed on symbols,
which constitute this sequence. Hence,
c(y^L) = ∑_{i=1}^{L} c(y_i). (3.4.3)
Application of the above-derived formula (3.3.6) to the sequence shows that in this
case the optimal probability distribution for the sequence is decomposed into a prod-
uct of probabilities of different symbols. That is, thermodynamic functions F, R, H
related to the given sequence are equal to a sum of the corresponding thermody-
namic functions of constituent symbols. In the stationary case (when an identical
distribution of the costs corresponds to symbols situated at different places) we have
F = LF1 ; R = LR1 ; H = LH1 , where F1 , R1 , H1 are functions for a single symbol,
which have been found earlier [for instance, see (3.4.2)].
Example 3.2. Now we consider a more difficult example, for which the principle
of the cost additivity (3.4.3) does not hold. Suppose the choice of symbol y = 1 or y = 2 is cost-free, but a cost is incurred when switching symbols: if symbol 1 follows 1, or symbol 2 follows 2 in a sequence, there is no cost; but if 1 follows 2, or 2 follows 1, the cost 2d is incurred. It is easy to see that in this case the total cost of the entire sequence y^L can be written as follows:
3.4 Examples of application of general methods for computation of channel capacity 67
c(y^L) = 2d ∑_{j=1}^{L−1} [1 − δ_{y_j y_{j+1}}]. (3.4.4)
Further, it is required to find the conditional capacity of such a channel and its
optimal probability distribution. We begin with a calculation of the partition func-
tion (3.3.11):
Z = e^{−2βd(L−1)} ∑_{y₁,…,y_L} ∏_{j=1}^{L−1} e^{2βd δ_{y_j, y_{j+1}}}. (3.4.5)
Carrying out the summation over y_L, y_{L−1}, …, y₂ successively, each symbol contributes the factor e^{2βd} + 1, so that, with k = e^{−2βd},

Z = 2(1 + k)^{L−1} = 2^L e^{−βd(L−1)} cosh^{L−1}(βd).
Consequently, due to formula (3.3.12) we have
F = −LT ln 2 + (L − 1)d − (L − 1) T ln cosh(d/T).
With the help of these functions it is easy to find channel capacity and average cost
meant for one symbol in the asymptotic limit L → ∞:
C₁ = lim_{L→∞} C/L = ln 2 + ln cosh(d/T) − (d/T) tanh(d/T),

R₁ = d − d tanh(d/T).
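These closed-form expressions can be checked directly: for small L the partition function (3.4.5) can be summed over all 2^L sequences and compared with 2^L e^{−βd(L−1)} cosh^{L−1}(βd). A small Python check (the values of d and T are illustrative, the names ours):

```python
import math
from itertools import product

def Z_brute(L, d, T):
    """Sum e^{-c/T} over all 2^L binary sequences, with cost (3.4.4):
    2d for every switch between adjacent symbols."""
    beta = 1.0 / T
    total = 0.0
    for seq in product((0, 1), repeat=L):
        switches = sum(seq[j] != seq[j + 1] for j in range(L - 1))
        total += math.exp(-beta * 2 * d * switches)
    return total

def Z_closed(L, d, T):
    """Closed form Z = 2^L e^{-beta d (L-1)} cosh^{L-1}(beta d)."""
    beta = 1.0 / T
    return 2.0 ** L * math.exp(-beta * d * (L - 1)) * math.cosh(beta * d) ** (L - 1)
```

The per-symbol capacity C₁ = ln 2 + ln cosh(d/T) − (d/T) tanh(d/T) then lies, as expected, between 0 (rigid chain, d ≫ T) and ln 2 (free choice, d ≪ T).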
Example 3.3. Let us now consider the combined case when there are costs of
both types: additive costs (3.4.3) of the same type as in Example 3.1 and paired
costs (3.4.4). The total cost function has the form
c(y^L) = bL + a ∑_{j=1}^{L} (−1)^{y_j} + 2d ∑_{j=1}^{L−1} [1 − δ_{y_j y_{j+1}}],
which is a bit more complicated than (3.4.6). However, now the matrix has a more complicated form:

V = e^{−βb−βd} ( e^{βa+βd}   e^{−βa−βd}
                 e^{βa−βd}   e^{−βa+βd} ). (3.4.7)
Having taken the largest root of this equation and taking into account (3.4.8), we
find the limit free energy computed for one symbol
F₁ = −T ln λ_m = b − T ln[ cosh(a/T) + √( sinh²(a/T) + e^{−4d/T} ) ].
From here we can derive free energy for one symbol and respective channel capacity
in the usual way.
As in Example 3.2, an optimal probability distribution corresponds to a Markov
process. A respective transition probability P(y j+1 | y j ) is connected with ma-
trix (3.4.7) and differs from this matrix by normalization factors. Next we present
the resulting expressions
The statistical systems considered in the last two examples have been studied
in statistical physics under the name of ‘Ising model’ (for instance, see Hill [21]
(originally published in English [20]) and Stratonovich [46]).
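The expression for λ_m can be checked against a direct numerical computation of the largest eigenvalue of the positive 2×2 matrix (3.4.7), e.g. by power iteration (a sketch with illustrative parameter values; the function names are ours):

```python
import math

def lam_formula(a, b, d, T):
    """lambda_m recovered from F1 = -T ln(lambda_m):
    lambda_m = e^{-b/T} [cosh(a/T) + sqrt(sinh^2(a/T) + e^{-4d/T})]."""
    return math.exp(-b / T) * (math.cosh(a / T)
                               + math.sqrt(math.sinh(a / T) ** 2 + math.exp(-4 * d / T)))

def lam_power(a, b, d, T, iters=300):
    """Largest eigenvalue of the transfer matrix (3.4.7) by power iteration."""
    bt = 1.0 / T
    pre = math.exp(-bt * (b + d))
    V = [[pre * math.exp(bt * (a + d)), pre * math.exp(-bt * (a + d))],
         [pre * math.exp(bt * (a - d)), pre * math.exp(bt * (d - a))]]
    v = [1.0, 1.0]
    lam = 1.0
    for _ in range(iters):
        w = [V[0][0] * v[0] + V[0][1] * v[1],
             V[1][0] * v[0] + V[1][1] * v[1]]
        lam = max(w)          # all entries are positive, so w stays positive
        v = [w[0] / lam, w[1] / lam]
    return lam
```

Since the matrix is strictly positive, the iteration converges to the Perron eigenvalue λ_m, and the two computations agree.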
Example 3.4. Now we apply the methods of the general theory described above to the case of different symbol lengths investigated in Section 3.1. We suppose that y is an ensemble of variables k, V_{i₁}, …, V_{i_k}, where k is the number of consecutive symbols in a recording and V_{i_j} is the length of the symbol at the j-th place. If l(i) is the length of symbol i, then we should take function (3.2.4) as the cost function.
According to the general method we calculate the partition function for the case
in question
Z = ∑_k ∑_{i₁,…,i_k} exp[−βl(i₁) − ⋯ − βl(i_k)] = ∑_{k=0}^{∞} Z₁^k(β) = 1/[1 − Z₁(β)],

where Z₁(β) = ∑_i e^{−βl(i)}. (3.4.9)
Hence

R = (d/dβ) ln(1 − Z₁) = −(dZ₁/dβ)(1 − Z₁)^{−1},

C = −ln(1 − Z₁) − β(dZ₁/dβ)(1 − Z₁)^{−1}    (F = T ln(1 − Z₁)). (3.4.10)
Let L be a fixed recording length. Then condition (3.2.8), (3.3.18) will take the form
of the equation
−(dZ₁(β)/dβ) (1 − Z₁(β))^{−1} = L, (3.4.11)
from which β can be determined.
Formulae (3.4.10), (3.3.11), (3.4.9) provide the solution to the problem of channel capacity C(L) computation. It is also of interest to find the channel capacity rate

C₁ = lim_{L→∞} C(L)/L. (3.4.12)
−(1/Z₁) dZ₁/dβ = ∑_i l(i) P(i) = l_av
analogously to (3.3.25). That is why equation (3.4.11) can be reduced to the form
Z1 (β ) / (1 − Z1 (β )) = L/lav .
Z1 (β ) → 1 as L/lav → ∞. (3.4.13)
(1 − Z1 ) ln (1 − Z1 ) → 0 as Z1 → 1.
−βF/R = [1/(dZ₁/dβ)] (1 − Z₁) ln(1 − Z₁) → 0  as  L → ∞, (3.4.14)
provided that dZ1 /d β (i.e. lav ) tends to the finite value dZ1 (β0 )/d β . According
to (3.4.12) and (3.4.14) we have
C₁ = lim_{L→∞} C/L = β₀ (3.4.15)
in this case. Due to (3.4.13) the limit value β0 is determined from the equation
Z1 (β0 ) = 1. (3.4.16)
In consequence of (3.4.9) this equation is nothing but equations (3.1.7), (3.1.9) de-
rived earlier. At the same time formula (3.4.15) coincides with relationship (3.1.12).
So, we see that the general standardized method yields the same results as the special method applied earlier.
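For concrete symbol lengths, the equation Z₁(β₀) = 1 is solved numerically in a few lines; e.g. for two symbols of lengths 1 and 2 one recovers the classical capacity ln[(1 + √5)/2] of Section 3.1. A hedged sketch (bisection; names ours):

```python
import math

def beta0(lengths, lo=1e-9, hi=50.0, iters=200):
    """Solve Z1(beta) = sum_i e^{-beta * l(i)} = 1, equation (3.4.16);
    the root beta0 is the capacity per unit length, cf. (3.4.15)."""
    def Z1(b):
        return sum(math.exp(-b * l) for l in lengths)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if Z1(mid) > 1.0:
            lo = mid   # Z1 is decreasing, so the root lies to the right
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For lengths {1, 1} this gives β₀ = ln 2, the capacity of a free binary alphabet, as a consistency check.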
The method of potentials described in Sections 3.3 and 3.4 can be generalized to more complicated cases involving a larger number of external parameters; in this form the method resembles even more closely the methods usually used in statistical thermodynamics.
3.5 Methods of potentials in the case of a large number of parameters 71
Here we outline possible ways to realize the above generalization and postpone a
more elaborated analysis to the next chapter.
Let the cost function c(y) = c(y, a) now depend on a numeric parameter a and be differentiable with respect to this parameter. Then the free energy

F(T, a) = −T ln ∑_y exp[−c(y, a)/T] (3.5.1)

and other functions introduced in Section 3.4 will depend not only on the temperature T (or the parameter β = 1/T), but also on the value of a.
the optimal distribution (3.3.14) now having the form
P(y | a) = exp{[F(T, a) − c(y, a)]/T}. (3.5.2)
Certainly, formula (3.3.15) and other results from Section 3.3 will remain valid
if we account parameter a to be constant when varying parameter T , i.e. if regular
derivatives are replaced with partial ones.
Hence, entropy of distribution (3.5.2) is equal to
H_y = −∂F(T, a)/∂T. (3.5.3)
Now in addition to these results we can derive a formula of partial derivatives taken
with respect to the new parameter a.
We differentiate (3.5.1) with respect to a and find
∂F(T, a)/∂a = [ ∑_y exp(−c(y, a)/T) ]^{−1} ∑_y [∂c(y, a)/∂a] exp(−c(y, a)/T),
or, equivalently,
∂F(T, a)/∂a = ∑_y [∂c(y, a)/∂a] P(y | a) ≡ −E[B(y) | a], (3.5.4)
A = −∂F/∂a. (3.5.5a)
Internal parameter A defined by such a formula is called conjugate to parameter a
with respect to potential F.
In formula (3.5.5) Hy and A = E[B] come as functions of T and a, respectively.
However, they can be interpreted as independent variables. Then instead of F(T, a)
we consider another potential Φ (Hy , A) expressed in terms of F(T, a) via the Leg-
endre transform:
Φ(H_y, A) = F + H_y T + A a = F − T ∂F/∂T − a ∂F/∂a. (3.5.6)
Example 3.5. The goal is to encode a recording by symbols ρ . Each symbol has cost
c0 (ρ ) as an attribute. Besides this cost, it is required to take account of one more
additional expenditure η (ρ ), say, the amount of paint used for a given symbol. If
we introduce the cost of paint a, then the total cost will take the form
c(ρ ) = c0 (ρ ) + aη (ρ ). (3.5.8)
Let it be required to find encoding such that the given amount of information I
(meant per one symbol) is recorded (transmitted), at the same time the fixed amount
of paint K is spent (on average per one symbol) and, besides, average costs E[c0 (ρ )]
are minimized.
In order to solve this problem we introduce T and a as auxiliary parameters,
which are initially indefinite and then found from additional conditions. Paint con-
sumption η (ρ ) per symbol ρ is considered as the random variable η . As the second
variable ζ we choose a variable complementing η to ρ, so that ρ = (η, ζ). Thus, the
cost function (3.5.8) can be rewritten as
c(η , ζ ) = c0 (η , ζ ) + aη .
Now we can apply formulae (3.5.1)–(3.5.7) and other ones, for which B = −η , to
the problem in question. According to (3.5.1) free energy F(T, a) is determined by
the formula
F(T, a) = −T ln ∑_{η,ζ} exp[−βc₀(η, ζ) − βaη],   β = 1/T. (3.5.9)
For the final determination of probabilities, entropy and other variables it is left
to concretize parameters T and a employing conditions formulated above. Namely,
average entropy Hηζ and average paint consumption E[η ] are assumed to be fixed:
Using formulae (3.5.3) and (3.5.4) we obtain the system of two equations
−∂F(T, a)/∂T = I;   ∂F(T, a)/∂a = K (3.5.12)
for finding parameters T = T (I, K) and a = a(I, K).
The optimal distribution (3.5.10) minimizes the total average cost R = E[c(η , ζ )]
as well as the partial average cost
After having determined optimal probabilities P(ρ ) completely we can perform ac-
tual encoding by methods of Section 3.6.
Basically, the given variational problem is solved in the same way as it was
done in Section 3.3. But note that partial derivatives are replaced with variational
derivatives in this modified approach. After variational differentiation with respect
to P(dξ) instead of (3.3.4) we will have the extremum condition:
ln[P(dξ)/ν(dξ)] = βF − βc(ξ), (3.6.3)
where β F = −1 − γ .
From here we obtain the extremum distribution P(dξ) = exp{β[F − c(ξ)]} ν(dξ).
Averaging (3.6.3) and taking into account (3.6.1), (3.6.2) we obtain that
Hξ = β E[c(ξ )] − β F, C = β R − β F. (3.6.3b)
The latter formula coincides with equality (3.3.13) of the discrete version. As in
Section 3.3 we can introduce the following partition function (integral)
Z = ∫ e^{−c(ξ)/T} ν(dξ), (3.6.4)
3.6 Capacity of a noiseless channel with penalties in a generalized version 75
P(dξ | a) = exp{[F(T, a) − c(ξ, a)]/T} ν(dξ). (3.6.6)
Finally, resulting equalities (3.5.3), (3.5.5) and other ones remain intact.
2. As an example we consider the case when X is an r-dimensional real space, ξ = (x₁, …, x_r), and the function c(ξ) is a quadratic form

c(ξ) = c₀ + (1/2) ∑_{i,j=1}^{r} (x_i − b_i) c_{ij} (x_j − b_j),
c0 + rT /2 = a. (3.6.7)
In this case the extremum distribution is Gaussian and its entropy C(a) can be also
found with the help of formulae from Section 5.4.
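Relationship (3.6.7) is the equipartition property familiar from statistical physics: each quadratic degree of freedom contributes T/2 to the average cost. A one-dimensional numerical check by direct quadrature (the constants are illustrative; not from the text):

```python
import math

def avg_cost_quadratic(c0, c11, T, xmax=30.0, n=60001):
    """E[c] under the extremum (Gaussian) distribution P ~ e^{-c(x)/T} for the
    quadratic cost c(x) = c0 + c11*x^2/2; by (3.6.7) with r = 1 it equals c0 + T/2."""
    h = 2.0 * xmax / (n - 1)
    num = den = 0.0
    for i in range(n):
        x = -xmax + i * h
        c = c0 + 0.5 * c11 * x * x
        w = math.exp(-(c - c0) / T)   # the constant c0 cancels in the ratio
        num += c * w
        den += w
    return num / den
```

The grid sum approximates the integrals in E[c] = ∫ c e^{−c/T} dx / ∫ e^{−c/T} dx; the result reproduces c₀ + T/2 to high accuracy.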
Chapter 4
First asymptotic theorem and related results
In the previous chapter, for one particular example (see Sections 3.1 and 3.4) we
showed that in calculating the maximum entropy (i.e. the capacity of a noiseless
channel) the constraint c(y) ≤ a imposed on feasible realizations is equivalent, for a sufficiently long code sequence, to the constraint E[c(y)] ≤ a on the mean value
E[c(y)]. In this chapter we prove (Section 4.3) that under certain assumptions such
equivalence takes place in the general case; this is the assertion of the first asymp-
totic theorem. In what follows, we shall also consider the other two asymptotic
theorems (Chapters 7 and 11), which are the most profound results of information
theory. All of them have the following feature in common: ultimately all these theorems state that, for sufficiently large systems, the difference between the concepts of discreteness and continuity disappears, and that the characteristics of a large collection of discrete objects can be calculated using a continuous functional dependence
involving averaged quantities. For the first variational problem, this feature is ex-
pressed by the fact that the discrete function H = ln M of a, which exists under the
constraint c(y) ≤ a, is asymptotically replaced by a continuous function H(a) cal-
culated by solving the first variational problem. As far as the proof is concerned, the
first asymptotic theorem turns out to be related to the theorem on canonical distri-
bution stability (Section 4.2), which is very important in statistical thermodynam-
ics and which is actually proved there when the canonical distribution is derived
from the microcanonical one. Here we consider it in a more general and abstract
form. The relationship between the first asymptotic theorem and the theorem on the
canonical distribution once more underlines the intrinsic unity of the mathematical
apparatus of information theory and statistical thermodynamics.
Potential Γ (α ) and its properties are used in the process of proving the indicated
theorems. The material about this potential is presented in Section 4.1. It is related
to the content of Section 3.3. However, instead of regular physical free energy F we
consider dimensionless free energy, that is potential Γ = −F/T . Instead of parame-
ters T , a2 , a3 , . . . common in thermodynamics we introduce symmetrically defined
parameters α1 = −1/T , α2 = a2 /T , α3 = a3 /T , . . .. Under such choice the temper-
ature is an ordinary thermodynamic parameter along with the others.
H (ζ , a) = H0 (ζ ) − a2 F 2 (ζ ) − · · · − ar F r (ζ ). (4.1.2)
B1 = H0 ; B2 = F 2 ; ... ; Br = F r (4.1.5)
Γ (α ) = ln Z (4.1.7)
where
Z = ∫ exp[B₁α₁ + ⋯ + B_r α_r] dζ, (4.1.8)
where

Bα = B₁α₁ + ⋯ + B_r α_r;   ∫_{ΔB} e^{−Φ(B)} dB = ∫_{B(ζ)∈ΔB} dζ.
In turn, we call the latter distribution canonical.
In the case of the canonical distribution (4.1.6) it is easy to express the characteristic function

Θ(iu) = ∫ e^{iuB} P(dB | α). (4.1.10)

The logarithm

μ(s) = ln Θ(s) = ln ∫ e^{sB(ζ)} P(dζ | α) (4.1.11a)
μ (s) = Γ (α + s) − Γ (α ), (4.1.12a)
k_{j₁,…,j_m} = ∂^m Γ(α) / ∂α_{j₁} ⋯ ∂α_{j_m}. (4.1.13)
Hence we see that the potential Γ (α ) is the cumulant generating function for the
whole family of distributions P(dB | α ).
For m = 1 we have from (4.1.13) that
k_j ≡ A_j ≡ E[B_j] = ∂Γ(α)/∂α_j. (4.1.14)
A_l = ∂Γ(α)/∂α_l,   l = 2, …, r,
are equivalent to the equalities
A_l = −∂F/∂a_l,   l = 2, …, r,
of type (3.5.5a).
With the given definition of parameters, the relationship defining entropy has a
peculiar form when energy (average cost) and temperature have the appearance of
regular parameters. Substituting (4.1.1) to the formula for physical entropy
H = −E[ ln( P(dζ)/dζ ) ].
Plugging (4.1.2) in here and taking account of notations (4.1.4), (4.1.5), we obtain
H = Γ − α E[B]
k_{ij} = ∂²Γ(α) / ∂α_i ∂α_j. (4.1.15)
We also take into account that the correlation matrix ki j is positive semi-definite.
Therefore, the matrix of the second derivatives ∂ 2Γ /∂ αi ∂ α j is positive semi-
definite as well, which proves convexity of the potential. This ends the proof.
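The cumulant property (4.1.13) and the convexity just proven are easy to observe numerically for a discrete toy distribution: the first two derivatives of Γ(α) reproduce the mean and the (non-negative) variance of B. A sketch for r = 1 (the toy values and names are ours):

```python
import math

def Gamma(alpha, B, phi):
    """Potential Gamma(alpha) = ln sum_xi exp[alpha*B(xi) - phi(xi)], cf. (4.2.13)."""
    return math.log(sum(math.exp(alpha * b - p) for b, p in zip(B, phi)))

def check_cumulants(alpha, B, phi, h=1e-4):
    """Compare dGamma/dalpha with E[B] (4.1.14) and d^2Gamma/dalpha^2
    with Var[B] >= 0 (4.1.15) under the canonical distribution."""
    Z = sum(math.exp(alpha * b - p) for b, p in zip(B, phi))
    pr = [math.exp(alpha * b - p) / Z for b, p in zip(B, phi)]
    mean = sum(q * b for q, b in zip(pr, B))
    var = sum(q * b * b for q, b in zip(pr, B)) - mean ** 2
    d1 = (Gamma(alpha + h, B, phi) - Gamma(alpha - h, B, phi)) / (2 * h)
    d2 = (Gamma(alpha + h, B, phi) - 2 * Gamma(alpha, B, phi)
          + Gamma(alpha - h, B, phi)) / h ** 2
    return abs(d1 - mean), abs(d2 - var), var
```

The second difference is a variance and hence non-negative, which is exactly the convexity of Γ(α) asserted by the theorem.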
Corollary 4.1. In the presence of just one parameter α1 , r = 1, function H(A1 )
defined by the Legendre transform
H(A₁) = Γ − α₁(A₁) A₁,   A₁ = (dΓ/dα₁)(α₁), (4.1.16)
is concave.
Indeed, as follows from formula (4.1.16), the formulae of the inverse Legendre transform

Γ(α₁) = H + α₁ A₁(α₁),   α₁ = −(dH/dA₁)(A₁),

are valid.
Since by virtue of Theorem 4.1 the inequality d²Γ/dα₁² ≥ 0 holds true (when the differentiability condition holds), we deduce that dA₁/dα₁ ≥ 0 and thereby

d²H/dA₁² = −dα₁/dA₁ ≤ 0. (4.1.16a)
This statement can also be proven without using the differentiability condition.
In conclusion of this paragraph we provide formulae pertaining to the characteristic potential of entropy, which was defined earlier by a formula of type (1.5.15); the present formulae generalize the previous ones. The characteristic potential of the random entropy

H(ζ | α) = −ln[ P(dζ | α) / ν(dζ) ]

has the form
To prove this, it is sufficient to substitute (4.1.17) into (4.1.18) and take into consideration the formula Γ(α) = ln ∫ e^{αB(ζ)} ν(dζ), which is of type (4.1.10a) and defines Γ(α).
Differentiating (4.1.19) by s0 and equating s0 to zero (analogously to (4.1.12) for
m = 1), we can find a mean value of entropy that coincides with (4.1.14b). Repeated
differentiation will yield the expression for variance.
where T = −1/α is temperature. The theorem about stability of this distribution (i.e.
about the fact that it is formed by a ‘microcanonical’ distribution for a cumulative
system including a thermostat) is called Gibbs theorem.
Adhering to a general and formal exposition style adopted in this chapter, we
formulate the addressed theorem in abstract form.
Preliminarily, we introduce several additional notions. We call the conditional distribution
Pn (ξ1 , . . . , ξn | α ) (4.2.2)
an n-th degree of the distribution
P1 (ξ1 | α ), (4.2.3)
if
Pn (ξ1 , . . . , ξn | α ) = P1 (ξ1 | α ) · · · P1 (ξn | α ). (4.2.4)
Let the distribution (4.2.3) be canonical:
Then in consequence of (4.2.4) we obtain that the following equality holds true for
the joint distribution (4.2.2):
where
B_n(ξ₁, …, ξ_n) = ∑_{k=1}^{n} B₁(ξ_k),   ϕ_n(ξ₁, …, ξ_n) = ∑_{k=1}^{n} ϕ₁(ξ_k),
i.e. the condition of canonicity (4.2.1) is actually satisfied for the ‘small’ sys-
tem (4.2.3), and
B₁(ξ₁) = (1/n) B_n(ξ₁, …, ξ₁);   ϕ₁(ξ₁) = (1/n) ϕ_n(ξ₁, …, ξ₁).
The derivation of the canonical ‘small’ distribution from the canonical ‘large’
distribution is natural, of course. The following fact proven below is deeper: the
canonical ‘small’ distribution is approximately formed from a non-canonical ‘large’
distribution. Therefore, the canonical form of a distribution appears to be stable in
a sense that it is asymptotically formed from different ‘large’ distributions. In fact,
this explains the important role of the canonical distribution in theory, particularly,
in statistical physics.
Theorem 4.3. Let the functions B1 (ξ ), ϕ1 (ξ ) be given, for which the corresponding
canonical distribution is of the form
where the functions Ψn (A) are determined from the normalization constraint, and A
plays the role of a parameter. Then the distribution of a ‘small’ system
P_n(ξ₁, …, ξ_n | A) = (1/2π) ∫_{−∞}^{∞} exp{ −Ψ_n(A) − iκnA + ∑_{k=1}^{n} [iκB₁(ξ_k) − ϕ₁(ξ_k)] } dκ,
where
Γ(α) = ln ∑_{ξ_k} exp[αB₁(ξ_k) − ϕ₁(ξ_k)]. (4.2.13)
We apply the method of the steepest descent to the integral in (4.2.12) using the fact
that n takes large values.
Further, we determine a saddle point iκ = α0 from the extremum condition for
the expression situated in the exponent of (4.2.12), i.e. from the equation
(n − 1) (dΓ/dα)(α₀) = nA − B₁(ξ₁). (4.2.14)
Since this point turns out to depend on ξ₁, it is convenient to consider also the point α₁, independent of ξ₁ and defined by the equation
(dΓ/dα)(α₁) = A. (4.2.15)
Point α1 is close to α0 for large n.
It follows from (4.2.14), (4.2.15) that

Γ′(α₀) − Γ′(α₁) = Γ″(α₁)ε + (1/2)Γ‴(α₁)ε² + ⋯ = (A − B₁)/(n − 1)   (ε = α₀ − α₁).

From here we have

ε = (A − B₁)/[(n − 1)Γ″] − (Γ‴/2Γ″)ε² − ⋯
  = (A − B₁)/[(n − 1)Γ″] − Γ‴(A − B₁)²/[2(n − 1)²(Γ″)³] + O(n⁻³). (4.2.16)
holds true and, consequently, the direction of the steepest descent of the function (n − 1)Γ(α) − nαA + αB₁ at the point α₀ (and also at the point α₁) is orthogonal to the real axis. Indeed, if the difference α − α₀ = iy is imaginary, then

(n − 1)Γ(α) − nαA + αB₁ = (n − 1)Γ(α₀) − nα₀A + α₀B₁ − (1/2)(n − 1)Γ″(α₀)y² + O(y³).
Drawing the contour of integration through point α0 in the direction of the steep-
est descent, we use the equality that follows from the formula
∫ exp[ −(a/2)x² + (b/6)x³ + (c/24)x⁴ + ⋯ ] dx ≈ (2π/a)^{1/2} exp[c/8a² + ⋯], (4.2.16b)

where a > 0, b/a^{3/2} ≪ 1, c/a² ≪ 1, …. This equality is
I ≡ (1/2π) ∫ exp{ (n − 1)Γ(iκ) + iκ[B₁(ξ₁) − nA] } dκ

  = [2π(n − 1)Γ″(α₀)]^{−1/2} exp{ (n − 1)Γ(α₀) + α₀[B₁(ξ₁) − nA] + Γ⁗(α₀)/[8(n − 1)Γ″(α₀)²] + O(n⁻²) }. (4.2.16c)
Introducing the notations

α(A) = α₁ + A/[nΓ″(α₁)] + Γ‴(α₁)/[2nΓ″(α₁)²];   χ(A) = −1/[2Γ″(α₁)],

we obtain (4.2.10). The proof is complete.
4.2 Asymptotic results of statistical thermodynamics. Stability of the canonical distribution 87
It is not necessary to account for the first equality in (4.2.17) because function
ψ (A) is unambiguously determined by functions α (A), χ (A) due to the normaliza-
tion constraint.
Since a number of terms in (4.2.17) disappears in the limit n → ∞, the limit
expression (4.2.10) has the form
ϑ+ (x) − ϑ+ (x − c) = ϑ− (x − c) − ϑ− (x).
Such a generalization will require quite insignificant changes in the proof of the theorem. Expansion (4.2.11) needs to be replaced by expansion (4.2.20), whereupon the extra term ln θ(iκ) will appear in the exponent in formula (4.2.12) and others. This will lead to a minor complication of the final formulae.
Results of Theorem 4.3 also allow a generalization in a different direction. It
is not necessary to require that the factor e^{−∑_{k=1}^{n} ϕ₁(ξ_k)} (independent of A and
μ_n(β) → ∞,   μ_n(β)/μ_n(β′) → μ₀(β)/μ₀(β′)   (β, β′ ∈ Q),
where θ (·) is a function independent of n with spectrum θ (iκ). Then the summation
of distribution P(ξ1 | An ) over ηn and ζn transforms (4.2.22) into an expression
of type (4.2.10). In this expression, functions ψ (An ), α (An ), χ (An ) are determined
from the corresponding formulae like (4.2.17); as n → ∞, function α (An ) turns into
the function inverse to the function
4.3 Asymptotic equivalence of two types of constraints 89
An = μn (α ), (4.2.23)
μ_n(β) = ln ∑_{ζ_n} e^{βζ_n} P_n(ζ_n),
Proof. The proof is analogous to the proof of Theorem 4.3. The only difference is
that now there is an additional term ln θ (iκ) and the expression (n − 1)Γ1 (iκ) must
be replaced with μn (iκ). Instead of formula (4.2.12) now we have
P(ξ₁ | A_n) = e^{−Ψ_n(A) − ϕ₁(ξ₁)} ∫ exp[ iκB₁(ξ₁) − iκA_n + μ_n(iκ) + ln θ(iκ) ] dκ
after summation over ηn and ζn . The saddle point iκ = α0 is determined from the
equation
μ_n′(α₀) − A_n + B₁(ξ) + θ′(α₀)/θ(α₀) = 0.
The root of the latter equation is asymptotically close to root α of equation (4.2.23).
Namely,
α₀ − α = (μ_n″)^{−1} [B₁ + θ′/θ] + ⋯ = O(1/μ_n″).
Other changes do not require explanation.
The theorems presented in this paragraph characterize the role of canonical dis-
tributions just as the Central Limit Theorem characterizes the role of Gaussian dis-
tributions. As a matter of fact, this also explains the fundamental role of canonical
distributions in statistical thermodynamics.
Consider asymptotic results related to the content of Chapter 3. We will show that
when computing maximum entropy (capacity of a noiseless channel), constraints
imposed on mean values and constraints imposed on exact values are asymptoti-
cally equivalent to each other. These results are closely related to the theorem about
stability of the canonical distribution proven in Section 4.2.
We set out the material in a more general form than in Sections 3.2 and 3.3 by
using an auxiliary measure ν (dx) similarly to Section 3.6.
Let the space X with measure ν (dx) (not normalized to unity) be given. Entropy
will be defined by the formula
H = −∫ ln[P(dx)/ν(dx)] P(dx), (4.3.1)

B(x) ≤ A, (4.3.2)
where the symbol E means averaging with respect to the measure P. Averaging (4.3.2), (4.3.3) with respect to a measure P from G̃, we obtain (4.3.6); thereby the set G̃ of distributions defined by constraints (4.3.5) is embedded in the set G defined by constraint (4.3.6). Hence, H̃ ≤ H, if we introduce the entropy

H = sup_{P∈G} [ −∫ P(dx) ln(P(dx)/ν(dx)) ], (4.3.7)
H = ln ν(X). (4.3.10)
The ‘large system’ appears to be an n-th degree of the ‘small system’. The aforementioned formulae (4.3.1)–(4.3.11) can be applied to both systems. The entropies H̃_n, H_n can be defined both for the ‘small’ and for the ‘large’ system. For the ‘small’ system the values H̃₁, H₁ are essentially different, but for the ‘large’ system the values H̃_n, H_n are, according to the foregoing, relatively close to each other in the asymptotic sense:

H̃_n / H_n → 1,   n → ∞. (4.3.12)
The extremum distribution (the one yielding H̃_n) has the following form:

P(dx₁, …, dx_n | A_n) = N^{−1} ϑ( ∑_k B₁(x_k) − A_n ) ν₁(dx₁) ⋯ ν₁(dx_n), (4.3.14)

where N is a normalization constant quite simply associated with the entropy H̃:

H̃ = ln N = ln ∫ ϑ( ∑_k B₁(x_k) − A_n ) ν₁(dx₁) ⋯ ν₁(dx_n). (4.3.15)
Formula (4.3.14) is an evident analogy of both (4.2.19) and (4.2.22). It means that
the problem of entropy (4.3.15) calculation is related to the problem of calculation
of the partial distribution (4.2.9) considered in the previous paragraph. As there, the
conditions of the exact multiplicativity
are not necessary for the proof of the main result (convergence (4.3.12)). Next we
formulate the result directly in a general form by employing the notion (introduced
in Section 4.2) of a canonically stable sequence of random variables (ξn being un-
derstood as the set x1 , . . . , xn ).
(the measure ν_n(Σ_n) of the entire space of values of ξ_n is assumed to be finite). Then the entropy H̃_n can be computed by the asymptotic formula
where
Computation of the integral can be carried out with the help of the saddle-point
method (the method of the steepest descent), i.e. using formula (4.2.16b) with vary-
ing degrees of accuracy. The saddle point iκ = α0 is determined from the equation
the theorem is valid. Undoubtedly, Theorem 4.5 (and possibly Theorem 4.4 as well) can be extended to a more general case. This is indirectly evidenced by the fact that for the example covered in Sections 3.1 and 3.4 the condition of canonical stability is not satisfied but, as we have seen, the passage to the limit (4.3.12) occurs as L → ∞.
Along with other asymptotic theorems, which will be considered later on (Chap-
ters 7 and 11), Theorems 4.3–4.5 constitute the principal content of information
theory perceived as ‘thermodynamic’ in a broad sense, i.e. asymptotic theory. Many
important notions and relationships of this theory take on their special significance
during the passage to the limit pertaining to an expansion of systems in considera-
tion. Those notions and relationships appear to be asymptotic in the indicated sense.
Theorem 4.6. Suppose that the domain Q of the parameter implicated in (4.4.1)
contains the interval s1 < α < s2 (s1 < 0, s2 > 0), and the potential μ (α ) is differ-
entiable on this interval. Then the cumulative distribution function
where

μ′(s) = x, (4.4.5)

if the latter equation has a root s ∈ (s₁, 0].
and besides

∑_{B(ξ)<x} P(ξ | α) ≤ 1. (4.4.8)
holds true for any α ∈ (s1 , 0], including the case α = s, where s is a solution of
equation (4.4.5) if it exists. Apparently, (4.4.9) turns into (4.4.4) for this solution.
This ends the proof.
Since the value μ′(0) is nothing else but the mean value

A = ∑_ξ B(ξ) P(ξ),
Theorem 4.7. If the conditions of Theorem 4.6 are satisfied, with the only difference that equation (4.4.7) has a positive root s ∈ [0, s₂], then instead of (4.4.4) the
following inequality is valid:
1 − F(x) ≤ e^{−sμ′(s)+μ(s)}. (4.4.10)
The proof of this theorem is analogous to that of the previous one, and we shall
skip it here.
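For a concrete distribution, the content of Theorems 4.6 and 4.7 is the familiar exponential (Chernoff-type) bound on distribution tails. A sketch for B equal to a sum of n independent Bernoulli variables, so that μ(s) = n ln(pe^s + 1 − p) (the numbers are illustrative; names ours):

```python
import math
from math import comb

def tail_exact(n, p, k):
    """P(B >= k) for B ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(k, n + 1))

def tail_bound(n, p, x):
    """Bound (4.4.10): 1 - F(x) <= e^{-[s mu'(s) - mu(s)]}, where s > 0 solves
    mu'(s) = x for mu(s) = n ln(p e^s + 1 - p)."""
    x1 = x / n
    s = math.log(x1 * (1 - p) / (p * (1 - x1)))   # closed-form root of mu'(s) = x
    mu = n * math.log(p * math.exp(s) + 1 - p)
    return math.exp(-(s * x - mu))
```

The exact tail never exceeds the bound, and both decay exponentially in n at the rate γ(x)/n.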
When comparing (4.4.5) with (4.1.14) [see also (3.5.5a)] it is easy to see that x plays the role of a parameter conjugate to α with respect to the potential μ(α). The expression

γ(x) = s(x)x − μ(s(x)) = sμ′(s) − μ(s), (4.4.11)

considered as a function of x, is actually the Legendre transform of the potential μ(α).
With the help of the potential (4.4.11), formulae (4.4.4) and (4.4.10) are reduced to
Then there is the same expression e−γ permanently located in the indicated right-
hand sides of the formulae, where
γ(x₁, …, x_r) = ∑_i x_i s_i − μ(s),   x_i = ∂μ/∂s_i,   i = 1, …, r, (4.4.14)
[in particular, (4.4.1)] and not with p(ξ ). Then instead of (4.4.13), (4.4.14) we will
have
where
γ(x) = ∑_i α_i x_i − Γ(α),   ∂Γ(α)/∂α_j = x_j, (4.4.16)
is the Legendre transform of potential Γ (α ). In order to make sure that these formu-
lae are valid, we need to take into account (4.1.12a) and formulae (4.4.13), (4.4.14)
from the previous case.
3. Now we derive formulae that hold with an equality sign as opposed to (4.4.3),
(4.4.10), (4.4.13), (4.4.15) but are asymptotic, i.e. valid for large n and Γ .
4.4 Some theorems about the characteristic potential 97
Theorem 4.8. Suppose that the random variable B is a sum of identically distributed
independent random variables B1 (ξ ), . . . , Bn (ξ ) and the corresponding to it char-
acteristic potential μ (α ) = nμ1 (α ) is defined and differentiable (a sufficient number
of times) on the closed interval s₁ ≤ α ≤ s₂ (s₁ < 0, s₂ > 0). Then for values of x₁,
for which the equation
μ′(s) = nx₁ ≡ x (4.4.17)
has a root s ∈ (s1 , s2 ) and the inequality
holds true, the cumulative distribution function (4.4.3) of random variable B can be
found from the asymptotic formula:
for x < E[B], F(x)
= [2π μ (s)s2 ]−1/2 e−γ (x) [1 + O(n−1 )],
for x > E[B], 1 − F(x)
Proof. We choose values x < E[B] and x′ > E[B] such that the corresponding roots s and s′ of equation (4.4.17) lie in the interval (s₁, s₂). Then we apply to them the famous inversion formula (the Lévy formula):
F(x′) − F(x) = lim_{c→∞} (1/2π) ∫_{−c}^{c} [(e^{−itx} − e^{−itx′})/(it)] e^{μ(it)} dt. (4.4.20)
Here eμ (it) = enμ1 (it) is a characteristic function. We represent the limit in the right-
hand side of (4.4.20) as a limit of a difference of two integrals
(1/2πi) ∫_L e^{−zx+μ(z)−ln z} dz − (1/2πi) ∫_{L′} e^{−zx′+μ(z)−ln z} dz = I − I′. (4.4.21)
Expanding the exponent about the saddle point z₀, which is determined by the condition

μ′(z₀) − x − 1/z₀ = 0, (4.4.22)

and setting z = z₀ + iy, we obtain

−zx + μ(z) − ln z = −z₀x + μ(z₀) − ln z₀ − (1/2)μ″(z₀)y² − y²/(2z₀²) + (1/6)μ‴(z₀)(iy)³ − (iy)³/(3z₀³) + O(y⁴)
due to (4.2.16b) (note that the largest term of the residue O(n⁻¹) is given by the fourth derivative Γ⁗ and takes the form (1/8n) μ₁⁗ (μ₁″)⁻²). Comparing (4.4.22) with the equation μ₁′(s) − x₁ = 0 or, equivalently, with (4.4.17), we have
μ₁″(s)(z₀ − s) + (1/2)μ₁‴(s)(z₀ − s)² + ⋯ = 1/(nz₀);   z₀ − s = 1/[nz₀ μ₁″(s)] + O(n⁻²). (4.4.25)
In order to derive the desired relationship (4.4.19) it is sufficient to apply the formula
resulted from (4.4.24) and (4.4.25)
I = (1/s) e^{−sx+μ(s)} {2πnμ₁″(s)}^{−1/2} [1 + O(n⁻¹)] (4.4.26)
by virtue of (4.4.20) (it is taken into account here that $s < 0$; the roots $[\ldots]^{-1/2}$ are positive). The last equality determines $F(x)$ and $F(x')$ up to some additive constant $K(n)$.
In order to estimate the constant $K(n)$ we consider a point $s_*$ belonging to the interval $(s_1, s)$, a point $s^*$ from the interval $(s', s_2)$ and the values $x_*$, $x^*$ corresponding to them. We take into account the inequality $F(x_*) \leqslant 1 - F(x^*) + F(x_*)$ and substitute (4.4.28) into its right-hand side. Hence,
\[
F(x_*) = O\bigl(e^{-n\gamma_1(x_{*1})}\bigr) + O\bigl(e^{-n\gamma_1(x_1^*)}\bigr) \quad (4.4.30)
\]
\[
\bigl(\gamma_1(x_1) = \gamma(x)/n = x_1 s - \mu_1(s); \qquad x_{*1} = x_*/n,\; x_1^* = x^*/n\bigr).
\]
If $x_{*1}, s_*$ and $x_1^*, s^*$ do not depend on $n$ and are chosen in such a way that inequalities (4.4.31) hold, then the terms $O(e^{-n\gamma_1(x_{*1})})$, $O(e^{-n\gamma_1(x_1^*)})$ in (4.4.30) converge to zero as $n \to \infty$ faster than $e^{-n\gamma_1(x_1)}O(n^{-1})$, and we obtain the first equality of (4.4.19) from (4.4.30). The second equality follows from (4.4.29) when the inequalities (4.4.32) are satisfied.

Since $\gamma(x)$ is monotonic within the segments $\mu'(s_1) \leqslant x \leqslant \mu'(0)$ and $\mu'(0) \leqslant x \leqslant \mu'(s_2)$, points $x_{*1}$ and $x_1^*$ (for which inequalities (4.4.31), (4.4.32) would be valid) can always be found.
As can be seen from the proof provided, the requirement (one of the conditions of Theorem 4.8) that the random variable $B$ be equal to a sum of identically distributed independent random variables turns out not to be necessary. A mutual (proportional) growth of the resulting potential $\mu(s)$ is sufficient for terms similar to the term $\mu^{(4)}/(\mu'')^2$ in the right-hand side of (4.4.24) to be small. That is why formula (4.4.19) remains valid in a more general case if we replace $O(n^{-1})$ with $O(\mu^{-1})$ and understand this estimate only in the specified sense.
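The asymptotic formula (4.4.19) is easy to probe numerically. The sketch below is not from the text: it takes $B$ to be a sum of $n$ independent $N(m, \sigma^2)$ variables (an assumed example, chosen because the exact tail is then available through the error function), solves (4.4.17) for the root $s$, and evaluates the right-hand side of (4.4.19) for $x > \mathbb{E}[B]$:

```python
import math

def tail_asymptotic(n, m, sigma, x1):
    """Right-hand side of (4.4.19) for a sum of n iid N(m, sigma^2) variables:
    1 - F(x) ~ [2 pi mu''(s) s^2]^(-1/2) exp(-gamma(x)) for x = n*x1 > E[B]."""
    s = (x1 - m) / sigma ** 2                     # root of mu'(s) = n*x1, eq. (4.4.17)
    mu2 = n * sigma ** 2                          # mu''(s) for the Gaussian potential
    gamma = n * (x1 - m) ** 2 / (2 * sigma ** 2)  # Legendre transform s*x - mu(s)
    return math.exp(-gamma) / math.sqrt(2 * math.pi * mu2 * s ** 2)

def tail_exact(n, m, sigma, x1):
    """Exact P(B > n*x1) for Gaussian B."""
    z = math.sqrt(n) * (x1 - m) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2))

approx = tail_asymptotic(400, 0.0, 1.0, 0.2)
exact = tail_exact(400, 0.0, 1.0, 0.2)
```

For fixed $x_1$ the relative error decays roughly like $1/n$, in line with the factor $1 + O(n^{-1})$.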
If we apply formula (4.4.19) to a distribution $p(B \mid \alpha)$ dependent on a parameter instead of the distribution $p(B)$, then in consequence of (4.1.12a) we will have $\mu(s) = \Gamma(\alpha+s) - \Gamma(\alpha)$; $\mu''(s) = \Gamma''(\alpha+s)$; $s\mu'(s) - \mu(s) = s\Gamma'(\alpha+s) - \Gamma(\alpha+s) + \Gamma(\alpha) = \gamma(x) - \alpha x + \Gamma(\alpha)$, and thereby formula (4.4.19) becomes
\[
\left.\begin{array}{ll}
F(x), & x < \Gamma'(\alpha)\\
1 - F(x), & x > \Gamma'(\alpha)
\end{array}\right\}
= [2\pi\,\Gamma''(\alpha+s)s^2]^{-1/2}\, e^{-\Gamma(\alpha)+\alpha x-\gamma(x)}\, [1 + O(\Gamma^{-1})]. \quad (4.4.33)
\]
Finally, the last formula can be generalized to the case of a parametric distribution [correspondingly, (4.4.33) can be generalized to the multivariate case].

The aforementioned results attest to the important role of potentials and their images under the Legendre transform. The method of proof employed unites Theorem 4.8 with Theorems 4.3–4.5.
Chapter 5
Computation of entropy for special cases.
Entropy of stochastic processes
In the present chapter, we set out methods for computing the entropy of many random variables or of a stochastic process in discrete and continuous time.

From both fundamental and practical points of view, of particular interest are stationary stochastic processes and their information-theoretic characteristics, specifically their entropy. Such processes are relatively simple objects, particularly a discrete process, i.e. a stationary process with discrete states running in discrete time. This process is therefore a very good example for demonstrating the basic points of the theory, and so we shall start with its presentation.
Our main attention will be devoted to the definition of such an important characteristic of a stationary process as the entropy rate, that is, the entropy per unit of time or per step. In addition, we introduce the entropy $\Gamma$ of the ends of an interval. Together with the entropy rate $H_1$, this entropy defines the entropy of a long interval of length $T$ by the approximate formula
\[
H_T \approx H_1 T + 2\Gamma,
\]
which becomes more precise as $T$ grows. Both constants $H_1$ and $\Gamma$ are calculated for a discrete Markov process.
The generalized definition of entropy, given in Section 1.6, allows for the application of this notion to continuous random variables, as well as to the case when the set of these random variables is a continuum, i.e. to a stochastic process with a continuous parameter (time).

In what follows, we show that many results related to a discrete process can be extended both to the case of a continuous sample space and to continuous time. For instance, we can introduce the entropy rate (not per step but per unit of time) and the entropy of the ends of an interval for continuous-time stationary processes. The entropy of a stochastic process on an interval is represented approximately in the form of two terms, by analogy with the aforementioned formula.
For non-stationary continuous-time processes, instead of constant entropy rate,
one should consider entropy density, which, generally speaking, is not constant in
time.
Entropy and its density are calculated for various important cases of continuous-
time processes: Gaussian processes and diffusion Markov processes.
The entropy computations for stochastic processes carried out here will allow us to calculate the Shannon amount of information (this will be covered in Chapter 6) for stochastic processes.
5.1 Entropy of a segment of a stationary discrete process and entropy rate

If we introduce the conditional entropy $H_{\xi_k|\xi_{k-1}\xi_{k-2}}$, then applying Theorem 1.6a for $\xi = \xi_k$, $\eta = \xi_{k-1}$, $\zeta = \xi_{k-2}$ will yield the inequality
\[
H_{\xi_k} \geqslant H_{\xi_k|\xi_{k-1}} \geqslant H_{\xi_k|\xi_{k-1},\xi_{k-2}} \geqslant \cdots \geqslant H_{\xi_k|\xi_{k-1},\ldots,\xi_{k-l}} \geqslant \cdots \geqslant 0. \quad (5.1.2)
\]
Besides, all conditional entropies are non-negative, i.e. bounded below. Hence, there exists the non-negative limit
Theorem 5.1. If $\{\xi_k\}$ is a stationary discrete process such that $H_{\xi_k} < \infty$, then the limit
\[
\lim_{l\to\infty} H_{\xi_1,\ldots,\xi_l}/l
\]
exists and coincides with limit (5.1.3).

Proof. We use the decomposition
\[
H_{\xi_{m+n}\ldots\xi_1} = H_{\xi_m\ldots\xi_1} + H_{\xi_{m+1}|\xi_m\ldots\xi_1} + H_{\xi_{m+2}|\xi_{m+1}\ldots\xi_1} + \cdots + H_{\xi_{m+n}|\xi_{m+n-1}\ldots\xi_1}. \quad (5.1.4)
\]
Since
\[
H_{\xi_{m+l}|\xi_{m+l-1}\ldots\xi_1} = H_1 + o_{m+l}(1) = H_1 + o_m(1) \quad (l \geqslant 1)
\]
due to (5.1.3) (here $o_j(1) \to 0$ as $j \to \infty$), it follows from (5.1.4) (after dividing by $m+n$) that
\[
\frac{H_{\xi_{m+n}\ldots\xi_1}}{m+n} = \frac{H_{\xi_m\ldots\xi_1}}{m+n} + \frac{n}{m+n}\,H_1 + \frac{n}{m+n}\,o_m(1). \quad (5.1.5)
\]
Let $m$ and $n$ converge to infinity in such a way that $n/m \to \infty$. Then $n/(m+n)$ converges to 1, while the ratio $H_{\xi_m\ldots\xi_1}/(m+n)$, which can be estimated as
\[
\frac{1}{m+n}\,H_{\xi_m\ldots\xi_1} \leqslant \frac{m}{m+n}\,H_{\xi_1},
\]
clearly converges to 0.

Therefore, we obtain the statement of the theorem from equality (5.1.5). The proof is complete.
It is also easy to prove that, as $l$ grows, the ratio $H_{\xi_1\ldots\xi_l}/l$ changes monotonically, i.e. does not increase. To show this, we construct the difference
\[
\delta = \frac{1}{l}H_{\xi_1\ldots\xi_l} - \frac{1}{l+1}H_{\xi_1\ldots\xi_{l+1}} = \frac{1}{l}H_{\xi_1\ldots\xi_l} - \frac{1}{l+1}\bigl[H_{\xi_1\ldots\xi_l} + H_{\xi_{l+1}|\xi_1\ldots\xi_l}\bigr] = \frac{1}{l(l+1)}H_{\xi_1\ldots\xi_l} - \frac{1}{l+1}H_{\xi_{l+1}|\xi_1\ldots\xi_l}
\]
and reduce it to
\[
\delta = \frac{1}{l(l+1)}\bigl[H_{\xi_1\ldots\xi_l} - l\,H_{\xi_{l+1}|\xi_1\ldots\xi_l}\bigr] = \frac{1}{l(l+1)}\sum_{i=1}^{l}\bigl[H_{\xi_i|\xi_1\ldots\xi_{i-1}} - H_{\xi_i|\xi_{i-l}\ldots\xi_{i-1}}\bigr]. \quad (5.1.6)
\]
In consequence of inequalities (5.1.2), the summands on the right-hand side of (5.1.6) are non-negative. Thus, the non-negativity of the difference $\delta$ follows from here.
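Both facts, the convergence of $H_{\xi_1\ldots\xi_l}/l$ and its monotone decrease, are easy to verify numerically. The sketch below uses an arbitrarily chosen two-state stationary Markov chain (an assumption for illustration, not an example from the text) and computes the joint entropies by brute-force enumeration of all trajectories:

```python
import math
from itertools import product

mu_, nu_ = 0.2, 0.4                              # assumed transition parameters
pi = [[1 - mu_, mu_], [nu_, 1 - nu_]]
p_st = [nu_ / (mu_ + nu_), mu_ / (mu_ + nu_)]    # stationary distribution

def joint_entropy(l):
    """H_{xi_1...xi_l} = -sum P ln P over all 2^l trajectories of the chain."""
    h = 0.0
    for seq in product(range(2), repeat=l):
        p = p_st[seq[0]]
        for a, b in zip(seq, seq[1:]):
            p *= pi[a][b]
        h -= p * math.log(p)
    return h

# entropy rate of the chain (conditional entropy of one step)
H1 = -sum(p_st[a] * pi[a][b] * math.log(pi[a][b])
          for a in range(2) for b in range(2))
rates = [joint_entropy(l) / l for l in range(1, 11)]
# rates decreases monotonically and approaches H1 from above
```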
By virtue of Theorem 5.1 the following equality holds:
\[
H_{\xi_1\ldots\xi_m} + H_{\xi_1\ldots\xi_n} - H_{\xi_1\ldots\xi_{m+n}} = H_{\xi_1\ldots\xi_m} + H_{\xi_{m+1}\ldots\xi_{m+n}} - H_{\xi_1\ldots\xi_{m+n}}. \quad (5.1.8)
\]
We can switch the order of limits here due to the mentioned symmetry with respect to a transposition of $m$ and $n$. By virtue of the indicated monotonicity this limit (either finite or infinite) always exists. Passing from form (5.1.8) to form (5.1.9) and using the hierarchical relationship of type (1.3.4)
\[
H_{\xi_{m+1}\ldots\xi_{m+n}|\xi_1,\ldots,\xi_m} = \sum_{i=1}^{n} H_{\xi_{m+i}|\xi_1,\ldots,\xi_{m+i-1}},
\]
we perform the passage to the limit $m \to \infty$ and rewrite equality (5.1.10) as follows:
\[
H_{\xi_1\ldots\xi_m} + H_{\xi_1\ldots\xi_n} - H_{\xi_1\ldots\xi_{m+n}} = 2\Gamma + o_m(1) + o_n(1) + o_{m+n}(1).
\]
This complies with (5.1.10) and confirms the above statement about the increase of entropy by $2\Gamma$.
1. Let the discrete (not necessarily stationary) process $\{\xi_k\}$ be Markov. This means that the joint distribution laws of consecutive random variables can be decomposed into the product
\[
P(\xi_k, \xi_{k+1}, \ldots, \xi_{k+l}) = P(\xi_k)\,\pi_k(\xi_k, \xi_{k+1}) \cdots \pi_{k+l-1}(\xi_{k+l-1}, \xi_{k+l}), \quad (5.2.1)
\]
where the transition probabilities $\pi_j$ satisfy the normalization constraint
\[
\sum_{\xi'} \pi_j(\xi, \xi') = 1. \quad (5.2.2)
\]
Similarly, the decomposition
\[
H_{\xi_1\ldots\xi_n} = H_{\xi_1} + H_{\xi_2|\xi_1} + H_{\xi_3|\xi_2} + \cdots + H_{\xi_n|\xi_{n-1}} \quad (5.2.6)
\]
is satisfied.
The latter can easily be derived if we rewrite the joint distribution $P(\xi_k, \xi_{k+1}) = P_{st}(\xi_k)\,\pi(\xi_k, \xi_{k+1})$ according to (5.2.1) and sum it over $\xi_k = \xi$, which yields $P(\xi_{k+1})$.

Taking into account (5.2.3), it is easy to see that the entropy rate (5.1.3) for a stationary Markov process coincides with the entropy corresponding to the transition probabilities $\pi(\xi, \xi')$ with the stationary probability distribution:
\[
H_1 = -\sum_{\xi} P_{st}(\xi) \sum_{\xi'} \pi(\xi, \xi') \ln \pi(\xi, \xi'). \quad (5.2.8)
\]
holds true exactly. We can also derive the result (5.2.10) with the help of (5.1.12). In-
deed, because Hξ j+1 |ξ1 ...ξ j = H1 , as was noted earlier, there is only one non-zero term
left in the right-hand side of (5.1.13): 2Γ = Hξ1 − H1 , which coincides with (5.2.10).
2. So, given the transition probability matrix
\[
\pi = \|\pi(\xi, \xi')\|, \quad (5.2.12)
\]
in order to calculate the entropy one should find the stationary distribution and then apply formulae (5.2.8), (5.2.9). Equation (5.2.7) defines the stationary distribution $P_{st}(\xi)$ quite unambiguously if the Markov process is ergodic, i.e. if $\lambda = 1$ is a non-degenerate eigenvalue of matrix (5.2.12). According to the theorem about the decomposition of determinants, the equation $\det(\pi - 1) = 0$ entails (5.2.7) if we assume
\[
P_{st}(\xi) = A_{\xi\xi} \Big/ \sum_{\xi'} A_{\xi'\xi'}, \quad (5.2.13)
\]
where $A_{\xi\xi'}$ is the algebraic cofactor of the element
\[
a_{\xi\xi'} = \pi(\xi, \xi') - \delta_{\xi\xi'} \equiv \pi - 1.
\]
In the non-ergodic case the matrix can be written in block form, where
\[
\Pi^{00} = \begin{pmatrix} \pi^{00} & 0 & \cdots \\ 0 & 0 & \cdots \\ \vdots & \vdots & \ddots \end{pmatrix}, \qquad
\Pi^{01} = \begin{pmatrix} 0 & \pi^{01} & 0 & \cdots \\ 0 & 0 & 0 & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{pmatrix},
\]
and so on.
Here $\pi^{ij}$ denotes a matrix of dimension $r_i \times r_j$ that describes the transitions from subset $E_i$ containing $r_i$ states to subset $E_j$ containing $r_j$ states. The remaining cells of the matrix contain zeros. The sets $E_1, E_2, \ldots$ constitute ergodic classes. Transitions occur from the set $E_0$ into the ergodic classes $E_1, E_2, \ldots$, but there is no exit from any of these classes. Hence, its own stationary distribution is established within each class. This distribution can be found by a formula of type (5.2.13),
\[
P_{st}^{i}(\xi) = A^{ii}_{\xi\xi} \Big/ \sum_{\xi' \in E_i} A^{ii}_{\xi'\xi'}, \quad \xi \in E_i, \quad (5.2.15)
\]
with the only difference that now we consider the algebraic cofactors of the submatrix $\|\pi^{ll}_{\xi\xi'} - \delta_{\xi\xi'}\|$, $\xi, \xi' \in E_l$, which are not zeros. The probabilities $P_{st}^{l}(\xi)$ refer to $E_l$. The full stationary distribution appears to be a linear combination of these distributions [formula (5.2.16)]; they are mutually orthogonal:
\[
\sum_{\xi} P_{st}^{i}(\xi)\, P_{st}^{j}(\xi) = 0, \quad i \neq j.
\]
The weights $q_i$ of the classes are
\[
q_i = \sum_{\xi \in E_i} P_1(\xi) + \sum_{\xi, \xi'} P_1(\xi)\bigl[\Pi^{0i}_{\xi\xi'} + (\Pi^{00}\Pi^{0i})_{\xi\xi'} + (\Pi^{00}\Pi^{00}\Pi^{0i})_{\xi\xi'} + \cdots\bigr], \quad i \neq 0,
\]
i.e.
\[
q_i = \sum_{\xi \in E_i} P_1(\xi) + \sum_{\xi, \xi'} P_1(\xi)\bigl([1 - \Pi^{00}]^{-1}\Pi^{0i}\bigr)_{\xi\xi'}, \qquad q_0 = 0.
\]
The entropy rate
\[
H_1 = \sum_i q_i H_1^i \quad (5.2.17)
\]
and the analogous quantities are computed quite similarly to the ergodic case. The reason is that a non-ergodic process is a statistical mixture (with probabilities $q_i$) of ergodic processes having a smaller number $r_i$ of states.
Summation in (5.2.17), (5.2.18) is carried out only over the ergodic classes $E_1, E_2, \ldots$, i.e. the subspace $E_0$ has zero stationary probability. The union of all ergodic classes $E_1 + E_2 + \cdots = E_a$ (on which the stationary probability is concentrated) can be called the 'active' subspace. Distributions and transitions in the 'passive' space $E_0$ influence the entropy rate $H_1$ only to the extent that they influence the probabilities $q_i$. If the process is ergodic, i.e. there is just one ergodic class $E_1$ besides $E_0$, then $q_1 = 1$ and the passive space has no impact on the entropy rate. In this case the entropy rate of the Markov process on the space $E_0 + E_a$ coincides with the entropy rate of the Markov process taking place in the subspace $E_a = E_1$ and having the transition probability matrix $\pi^{11}$.
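The mixture formula $H_1 = \sum_i q_i H_1^i$ can be checked on a small invented example (the chain below is an assumption for illustration, not from the text): state 0 is transient, states 1 and 2 form an ergodic class $E_1$, and state 3 is an absorbing class $E_2$ with zero entropy rate.

```python
import math
from itertools import product

pi = [[0.2, 0.5, 0.0, 0.3],
      [0.0, 0.7, 0.3, 0.0],
      [0.0, 0.4, 0.6, 0.0],
      [0.0, 0.0, 0.0, 1.0]]
P1 = [1.0, 0.0, 0.0, 0.0]          # initial distribution concentrated on E0

# absorption probabilities q_i via the geometric series [1 - Pi^00]^{-1} Pi^{0i}
q1 = 0.5 / (1 - 0.2)
q2 = 0.3 / (1 - 0.2)

# entropy rate of class E1 by (5.2.8): stationary distribution of its 2x2 block
p_st = [0.4 / 0.7, 0.3 / 0.7]
H1_class1 = -sum(p_st[a] * pi[a + 1][b + 1] * math.log(pi[a + 1][b + 1])
                 for a in range(2) for b in range(2))
H1 = q1 * H1_class1 + q2 * 0.0     # mixture formula (5.2.17)

def joint_entropy(l):
    h = 0.0
    for seq in product(range(4), repeat=l):
        p = P1[seq[0]]
        for a, b in zip(seq, seq[1:]):
            p *= pi[a][b]
        if p > 0.0:
            h -= p * math.log(p)
    return h

# H_{xi_9|xi_1..xi_8} should already be close to the limiting rate H1
increment = joint_entropy(9) - joint_entropy(8)
```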
3. In order to illustrate the application of the formulae derived above, we consider in this paragraph several simple examples.

Example 5.1. At first we consider the simplest discrete Markov process: the process with two states, i.e. matrix (5.2.12) is a $2 \times 2$ matrix. In consequence of the normalization constraint (5.2.2) its elements are not independent. There are just two independent parameters $\mu$ and $\nu$ that define the matrix $\pi$:
\[
\pi = \begin{pmatrix} 1-\mu & \mu \\ \nu & 1-\nu \end{pmatrix}.
\]
The stationary distribution is $P_{st} = \bigl(\nu/(\mu+\nu),\; \mu/(\mu+\nu)\bigr)$ (5.2.19), so that formula (5.2.8) yields the entropy rate
\[
H_1 = \frac{\nu h_2(\mu) + \mu h_2(\nu)}{\mu + \nu},
\]
where
\[
h_2(x) = -x \ln x - (1-x) \ln (1-x). \quad (5.2.20a)
\]
Next, one easily obtains the boundary entropy by formula (5.2.10) as follows:
\[
2\Gamma = h_2\Bigl(\frac{\mu}{\mu+\nu}\Bigr) - \frac{\nu h_2(\mu) + \mu h_2(\nu)}{\mu + \nu}.
\]
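For a stationary Markov chain the approximate relation $H_T \approx H_1 T + 2\Gamma$ from the chapter introduction in fact holds exactly. A quick check of the two-state case with arbitrarily chosen $\mu$, $\nu$ (a sketch, not values from the text):

```python
import math
from itertools import product

def h2(x):
    """Binary entropy (5.2.20a), in nats."""
    return -x * math.log(x) - (1 - x) * math.log(1 - x)

mu_, nu_ = 0.15, 0.35
pi = [[1 - mu_, mu_], [nu_, 1 - nu_]]
p_st = [nu_ / (mu_ + nu_), mu_ / (mu_ + nu_)]    # stationary distribution (5.2.19)

H1 = (nu_ * h2(mu_) + mu_ * h2(nu_)) / (mu_ + nu_)   # entropy rate
two_gamma = h2(mu_ / (mu_ + nu_)) - H1               # boundary entropy 2*Gamma

def joint_entropy(l):
    """Brute-force H_{xi_1...xi_l} of the stationary chain."""
    h = 0.0
    for seq in product(range(2), repeat=l):
        p = p_st[seq[0]]
        for a, b in zip(seq, seq[1:]):
            p *= pi[a][b]
        h -= p * math.log(p)
    return h
```

Indeed, `joint_entropy(l)` equals `l * H1 + two_gamma` for every `l`, since each conditional entropy after the first step already equals $H_1$.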
Example 5.2. Now suppose there is given a process with three states that has the transition probability matrix
\[
\pi = \begin{pmatrix} 1-\mu & \mu' & \mu'' \\ \nu' & 1-\nu & \nu'' \\ \lambda' & \lambda'' & 1-\lambda \end{pmatrix} \quad (\mu = \mu' + \mu''\ \text{etc.}),
\]
where the algebraic cofactors
\[
A_{11} = \begin{vmatrix} -\nu & \nu'' \\ \lambda'' & -\lambda \end{vmatrix} = \lambda\nu - \lambda''\nu'' = \lambda'\nu' + \lambda'\nu'' + \lambda''\nu'; \quad
A_{21} = \mu'\lambda' + \mu'\lambda'' + \mu''\lambda''; \quad
A_{31} = \mu'\nu'' + \mu''\nu' + \mu''\nu'' \quad (5.2.21)
\]
determine the stationary distribution by formula (5.2.13), so that (5.2.8) yields
\[
H_1 = \frac{A_{11}\, h_3(\mu', \mu'') + A_{21}\, h_3(\nu', \nu'') + A_{31}\, h_3(\lambda', \lambda'')}{A_{11} + A_{21} + A_{31}},
\]
where
\[
h_3(\mu', \mu'') = -\mu' \ln \mu' - \mu'' \ln \mu'' - (1 - \mu' - \mu'') \ln (1 - \mu' - \mu''). \quad (5.2.22)
\]
The given process with three states appears to be non-ergodic if, for instance, $\lambda' = \lambda'' = 0$, $\mu'' = \nu'' = 0$, so that the transition probability matrix has a 'block' type
\[
\pi = \begin{pmatrix} 1-\mu & \mu & 0 \\ \nu & 1-\nu & 0 \\ 0 & 0 & 1 \end{pmatrix}.
\]
For such a matrix, the third state remains constant, and transitions occur only between the first and second states. The algebraic cofactors (5.2.21) vanish. As is easy to see, the following distributions are stationary:
\[
P_{st}^{1}(\xi) = \begin{cases} \nu/(\mu+\nu), & \xi = 1;\\ \mu/(\mu+\nu), & \xi = 2;\\ 0, & \xi = 3; \end{cases} \qquad P_{st}^{2}(\xi) = \delta_{\xi 3}.
\]
The first of them coincides with (5.2.19). The functions $P_{st}^{1}(\xi)$ and $P_{st}^{2}(\xi)$ are orthogonal. Using the given initial distribution $P_1(\xi)$, we find the resultant stationary distribution by formula (5.2.16); then, due to (5.2.17), it is easy to write down the corresponding entropy rate.
5.3 Entropy rate of components of a discrete and conditional Markov process

of the Markov process by the conditional entropy $H_{x_1\ldots x_n|y_1\ldots y_n}$, called the entropy of the conditional Markov process $\{x_k\}$ (for fixed $\{y_k\}$).
Along with the entropy rate $h_{xy} = h_\xi$ of a stationary Markov process, we introduce the entropy rate of the initial y-process,
\[
h_y = \lim_{n\to\infty} \frac{1}{n} H_{y_1\ldots y_n}. \quad (5.3.2)
\]
One may also consider (in the general case) the x-process as a non-Markov a priori process, and the y-process with fixed x as a conditional process. Their entropy rates will be, respectively,
\[
h_x = \lim_{n\to\infty} \frac{1}{n} H_{x_1\ldots x_n}; \qquad h_{y|x} = \lim_{n\to\infty} \frac{1}{n} H_{y_1\ldots y_n|x_1\ldots x_n}. \quad (5.3.5)
\]
Since we already know how to find the entropy of a Markov process, it suffices to learn how to calculate just one of the quantities $h_y$ or $h_{x|y}$; the second one is then found with the help of (5.3.4). Due to symmetry, the corresponding quantity out of $h_x$, $h_{y|x}$ can be computed in the same way, and the second one from (5.3.7).

The method described below (clause 2) can be employed for calculating the entropy of a conditional Markov process in both the stationary and non-stationary cases. As a rule, however, stationary probability distributions and the limits (5.3.2), (5.3.3), (5.3.5) exist only in the stationary case.
The conditional entropy $H_{x_1\ldots x_n|y_1\ldots y_n}$ can be represented in the form of the sum
\[
H_{x_1\ldots x_n|y_1\ldots y_n} = H_{x_1|y_1\ldots y_n} + H_{x_2|x_1 y_1\ldots y_n} + H_{x_3|x_1 x_2 y_1\ldots y_n} + \cdots + H_{x_n|x_1\ldots x_{n-1} y_1\ldots y_n},
\]
which leads to the limit
\[
h_{x|y} = \lim_{k\to\infty,\; n-k\to\infty} H_{x_k|x_1\ldots x_{k-1} y_1\ldots y_n} = H_{x_k|\ldots\xi_{k-2}\xi_{k-1} y_k y_{k+1}\ldots}. \quad (5.3.8)
\]

Theorem 5.2. For a stationary process $\{\xi_k\} = \{x_k, y_k\}$ the limits (5.3.3) and (5.3.8) are equal.

Proof. By virtue of
\[
\lim_{n-k\to\infty} H_{x_k|x_1\ldots x_{k-1} y_1\ldots y_n} = H_{x_k|\xi_1\ldots\xi_{k-1} y_k y_{k+1}\ldots},
\]
\[
\lim_{k\to\infty} H_{x_k|\xi_1\ldots\xi_{k-1} y_k y_{k+1}\ldots} = H_{x_k|\ldots\xi_{k-2}\xi_{k-1} y_k y_{k+1}\ldots},
\]
we have
\[
H_{x_k|x_1\ldots x_{k-1} y_1\ldots y_n} = H_{x_k|\ldots\xi_{k-2}\xi_{k-1} y_k y_{k+1}\ldots} + o_k(1) + o_{n-k}(1). \quad (5.3.9)
\]
Using the decomposition
\[
H_{x_1\ldots x_n|y_1\ldots y_n} = H_{x_1\ldots x_m|y_1\ldots y_n} + H_{x_{m+1}|\xi_1\ldots\xi_m y_{m+1}\ldots y_n} + \cdots + H_{x_{m+r}|\xi_1\ldots\xi_{m+r-1} y_{m+r}\ldots y_n} + H_{x_{m+r+1}\ldots x_n|\xi_1\ldots\xi_{m+r} y_{m+r+1}\ldots y_n},
\]
we obtain that
\[
\frac{1}{n} H_{x_1\ldots x_n|y_1\ldots y_n} = \frac{1}{m+r+s} H_{x_1\ldots x_m|y_1\ldots y_n} + \frac{r}{m+r+s} H_{x_k|\ldots\xi_{k-1} y_k\ldots} + \frac{r}{m+r+s}\bigl[o_m(1) + o_s(1)\bigr] + \frac{1}{m+r+s} H_{x_{m+r+1}\ldots x_n|\xi_1\ldots\xi_{m+r} y_{m+r+1}\ldots y_n}. \quad (5.3.10)
\]
Here we take into account that
\[
H_{x_1\ldots x_m|y_1\ldots y_n} = \sum_{k=1}^{m} H_{x_k|x_1\ldots x_{k-1} y_1\ldots y_n} \leqslant m H_{x_k}
\]
and
\[
H_{x_{m+r+1}\ldots x_n|\xi_1\ldots\xi_{m+r} y_{m+r+1}\ldots y_n} \leqslant s H_{x_k},
\]
because a conditional entropy is less than or equal to an unconditional one. Thus, if we make the passage to the limit $m \to \infty$, $r \to \infty$, $s \to \infty$ in (5.3.10) in such a way that $r/m \to \infty$ and $r/s \to \infty$, then only the term $H_{x_k|\ldots,\xi_{k-1},y_k\ldots}$ will be left in that expression. This proves the theorem.
As can be seen from the above proof, Theorem 5.2 is valid not only in the case of a Markov joint process $\{x_k, y_k\}$. Furthermore, in consequence of the Markov condition we have
\[
H_{x_k|x_1\ldots x_{k-1} y_1\ldots y_n} = H_{x_k|\xi_{k-1} y_k\ldots y_n}
\]
in the case of a Markov process. Consequently, formula (5.3.8) takes a simpler form.
2. Let us now calculate the entropies of the y-process and of the conditional process induced by a Markov joint process. In order to do this, we consider the conditional entropy
\[
H_{y_k|y_1\ldots y_{k-1}} = -\mathbb{E}\bigl[\ln P(y_k \mid y_1 \ldots y_{k-1})\bigr]. \quad (5.3.11)
\]
We use the Markov condition (5.2.1) and write down the multivariate probability distribution, denoting $\sum_{x'} \pi(x, y; x', y') = \pi(x, y; y')$. Then formula (5.3.11) takes the form
\[
H_{y_k|y_1\ldots y_{k-1}} = -\mathbb{E}\Bigl[\ln \sum_x W_{k-1}(x)\, \pi(x, y_{k-1}; y_k)\Bigr].
\]
These are the main results. We see that in order to calculate the conditional entropy of some components of a Markov process we need to investigate the posterior probabilities $\{W_{k-1}(\cdot)\}$ as a stochastic process in its own right. This process is studied in the theory of conditional Markov processes. It is well known that, as $k$ increases, the corresponding probabilities are transformed by certain recurrence relations.
In order to introduce them, let us write down the equality analogous to (5.3.14), replacing $k-1$ with $k$. Substituting (5.3.14) into the last formula, we obtain the recurrence relations
\[
W_k(x_k) = \frac{\sum_{x_{k-1}} W_{k-1}(x_{k-1})\, \pi(x_{k-1}, y_{k-1}; x_k, y_k)}{\sum_{x_{k-1}, x_k} W_{k-1}(x_{k-1})\, \pi(x_{k-1}, y_{k-1}; x_k, y_k)}. \quad (5.3.20)
\]
The process $\{W_{k-1}(\cdot)\}$ (considered as a stochastic process by itself) is called a secondary a posteriori W-process. As is known from the theory of conditional Markov processes, this process is Markov. Let us consider its transition probabilities.

Transformation (5.3.20) can be represented as
\[
W_k(\xi) = \frac{\delta_{y y_k} \sum_{\xi'} W_{k-1}(\xi')\, \pi(\xi', \xi)}{\sum_{\xi'} W_{k-1}(\xi')\, \pi(\xi', y_k)}, \quad (5.3.21)
\]
so that the transition probability of the W-process equals
\[
\pi^W(W_{k-1}, W_k) = \sum_{y_k} \delta\biggl(W_k(\xi) - \frac{\delta_{y y_k} \sum_{\xi'} W_{k-1}(\xi')\, \pi(\xi', \xi)}{\sum_{\xi'} W_{k-1}(\xi')\, \pi(\xi', y_k)}\biggr) \sum_{\xi'} W_{k-1}(\xi')\, \pi(\xi', y_k). \quad (5.3.22)
\]
Theorem 5.3. The entropy $H_{y_2\ldots y_l|y_1}$ of the non-Markov y-process coincides with the analogous entropy of the corresponding secondary a posteriori (Markov) process.

Since the corresponding relations are valid together with formula (5.3.12) and the analogous formula for $\{W_k\}$, it is sufficient to prove the equality of the conditional entropies. Here $H_{W_k|W_1\ldots W_{k-1}} = H_{W_k|W_{k-1}}$ due to the Markov property.

Let $S_0$ be some initial point of the sample space of $W(\cdot)$. According to (5.3.21), transitions from this point to other points $S(y_k)$ may occur depending on the value of $y_k$ ($S(y_k)$ being the point with coordinates (5.3.21)). These points are different for different values of $y_k$. Indeed, suppose that two of the points $S(y_k)$, say $S' = S(y')$ and $S'' = S(y'')$, coincide.
Example 5.3. Suppose that a non-Markov process {yk } is a process with two states,
i.e. yk can take one out of two values, say 1 or 2. Further, suppose that this process
can be turned into a Markov process by adding variable x and thus decomposing
state y = 2 into two states: ξ = 2 and ξ = 3. Namely, ξ = 2 corresponds to x = 1,
y = 2; ξ = 3 corresponds to x = 2, y = 2. State y = 1 can be related to ξ = 1, x = 1,
for instance.
The joint process $\{\xi_k\} = \{x_k, y_k\}$ is stationary Markov and is described by the transition probability matrix
\[
\|\pi_{\xi\xi'}\| = \begin{pmatrix} \pi_{11} & \pi_{12} & \pi_{13} \\ \pi_{21} & \pi_{22} & \pi_{23} \\ \pi_{31} & \pi_{32} & \pi_{33} \end{pmatrix}.
\]
For $y_k = 2$ the recurrence (5.3.21) gives
\[
W_k(1) = 0; \qquad
W_k(2) = \frac{W_{k-1}(1)\pi_{12} + W_{k-1}(2)\pi_{22} + W_{k-1}(3)\pi_{32}}{W_{k-1}(1)(\pi_{12}+\pi_{13}) + W_{k-1}(2)(\pi_{22}+\pi_{23}) + W_{k-1}(3)(\pi_{32}+\pi_{33})}; \qquad
W_k(3) = 1 - W_k(2). \quad (5.3.28)
\]
We denote by $S_0$ the point $(1, 0, 0)$ of the sample space $(W(1), W(2), W(3))$ corresponding to distribution (5.3.27). Further, we investigate possible transitions from that point. Substituting the value $W_{k-1} = (1, 0, 0)$ into (5.3.28), we obtain the transition to the point
\[
W(1) = 0; \qquad W(2) = \frac{\pi_{12}}{\pi_{12} + \pi_{13}}; \qquad W(3) = \frac{\pi_{13}}{\pi_{12} + \pi_{13}}, \quad (5.3.29)
\]
which we denote by $S_1$. In consequence of (5.3.18) such a transition occurs with probability
\[
p_1 = \pi_{12} + \pi_{13}. \quad (5.3.30)
\]
The process stays at point $S_0$ with probability $1 - p_1 = \pi_{11}$.
Now we consider transitions from point $S_1$. Substituting $W_{k-1}$ by the values (5.3.29) in formula (5.3.28) with $y_k = 2$, we obtain the coordinates
\[
W(1) = 0; \qquad W(2) = \frac{\pi_{12}\pi_{22} + \pi_{13}\pi_{32}}{\pi_{12}(\pi_{22}+\pi_{23}) + \pi_{13}(\pi_{32}+\pi_{33})}; \qquad W(3) = 1 - W(2) \quad (5.3.31)
\]
of the next point $S_2$; the probability of this transition is obtained from (5.3.17a). The return to point $S_0$ occurs with probability $1 - p_2$. Similarly, substituting the values (5.3.31) into (5.3.33), we obtain the probability
\[
p_3 = \frac{(\pi_{12}\pi_{22} + \pi_{13}\pi_{32})(\pi_{22}+\pi_{23}) + (\pi_{12}\pi_{23} + \pi_{13}\pi_{33})(\pi_{32}+\pi_{33})}{\pi_{12}(\pi_{22}+\pi_{23}) + \pi_{13}(\pi_{32}+\pi_{33})} \quad (5.3.34)
\]
of the transition from $S_2$ to the next point $S_3$, and so forth. The transition probabilities $p_k$ to the successive points are calculated consecutively as described. Each time, a return to point $S_0$ occurs with probability $1 - p_k$. The probability that there has been no return to point $S_0$ up to time $k$ is evidently equal to $p_1 p_2 \cdots p_k$. If the consecutive values $p_k$ do not converge to 1, then the indicated probability converges to zero as $k \to \infty$. Therefore, a return to point $S_0$ eventually occurs with probability 1. If we had chosen a different point as the initial one, then a walk over some other sequence of points would have been observed, but eventually the process would have returned to point $S_0$. After such a transition, the aforementioned walk over the points $S_0, S_1, S_2, \ldots$ (with the transition probabilities already calculated) will be observed.
The indicated scheme of transitions of the secondary Markov process allows us to find the stationary probability distribution easily. It is concentrated on the points $S_0, S_1, S_2, \ldots$ For these points the transition probability matrix has the form
\[
\pi^w = \begin{pmatrix}
1-p_1 & p_1 & 0 & 0 & \cdots \\
1-p_2 & 0 & p_2 & 0 & \cdots \\
1-p_3 & 0 & 0 & p_3 & \cdots \\
\vdots & \vdots & \vdots & \vdots & \ddots
\end{pmatrix}, \quad (5.3.35)
\]
and the stationary probabilities are
\[
P_{st}(S_0) = \frac{1}{1 + p_1 + p_1 p_2 + \cdots}; \qquad P_{st}(S_k) = \frac{p_1 \cdots p_k}{1 + p_1 + p_1 p_2 + \cdots}, \quad k = 1, 2, \ldots \quad (5.3.36)
\]
The entropy rate then results from (5.3.23), or from Theorem 5.3 and formula (5.2.8) applied to the transition probabilities (5.3.35).
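The walk over $S_0, S_1, S_2, \ldots$ is easy to carry out exactly. The sketch below (the $3\times 3$ matrix is an invented example in the spirit of Example 5.3, with $y = 1$ for $\xi = 1$ and $y = 2$ for $\xi \in \{2, 3\}$) computes the probabilities $p_k$, the stationary weights (5.3.36) and the resulting entropy rate of the y-process, and then verifies it against a brute-force computation over y-sequences:

```python
import math

pi = [[0.5, 0.3, 0.2],
      [0.2, 0.5, 0.3],
      [0.3, 0.1, 0.6]]     # assumed transition matrix of the joint xi-process

def h2(x):
    return -x * math.log(x) - (1 - x) * math.log(1 - x)

# Walk of the secondary a posteriori W-process: from S_k the observation y = 2
# leads to S_{k+1} with probability p_{k+1}, while y = 1 returns it to S0.
W, p = [1.0, 0.0, 0.0], []
for _ in range(60):
    stay = sum(W[i] * (pi[i][1] + pi[i][2]) for i in range(3))
    p.append(stay)
    W = [0.0,
         sum(W[i] * pi[i][1] for i in range(3)) / stay,
         sum(W[i] * pi[i][2] for i in range(3)) / stay]

weights = [1.0]                      # stationary weights (5.3.36): 1, p1, p1*p2, ...
for pk in p[:-1]:
    weights.append(weights[-1] * pk)
hy = sum(w * h2(pk) for w, pk in zip(weights, p)) / sum(weights)

# Independent check: conditional entropy of the stationary y-process by the
# forward algorithm, summed over all 2^l binary y-sequences.
p_st = [1 / 3] * 3
for _ in range(300):                 # power iteration for the stationary law of pi
    p_st = [sum(p_st[i] * pi[i][j] for i in range(3)) for j in range(3)]

def joint_y_entropy(l):
    h = 0.0
    for code in range(2 ** l):
        ys = [(code >> t) & 1 for t in range(l)]          # 0 means y=1, 1 means y=2
        alpha = [p_st[x] if (x > 0) == (ys[0] == 1) else 0.0 for x in range(3)]
        for yk in ys[1:]:
            alpha = [sum(alpha[i] * pi[i][j] for i in range(3))
                     if (j > 0) == (yk == 1) else 0.0 for j in range(3)]
        prob = sum(alpha)
        if prob > 0.0:
            h -= prob * math.log(prob)
    return h

increment = joint_y_entropy(13) - joint_y_entropy(12)     # ~ H(y_13 | y_1..y_12)
```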
In the particular case when the value of x does not affect transitions from one
value of y to another one we have
That is why formula (5.3.40) coincides with (5.2.20), and this is natural, because,
when condition (5.3.38) is satisfied, process {yk } becomes Markov itself.
Ri j = E [(ξi − mi ) (ξ j − m j )] . (5.4.1)
As is known, such random variables have the joint probability density function
\[
p(\xi_1, \ldots, \xi_l) = (2\pi)^{-l/2}\, \bigl(\det \|R_{ij}\|\bigr)^{-1/2} \exp\Bigl\{-\frac{1}{2}\sum_{i,j=1}^{l} (\xi_i - m_i)\, a_{ij}\, (\xi_j - m_j)\Bigr\}, \quad (5.4.2)
\]
where $\|a_{ij}\| = \|R_{ij}\|^{-1}$. If the measure $\nu$ is multiplicative with the constant density $v(d\xi_1, \ldots, d\xi_l)/d\xi_1 \cdots d\xi_l = v_1^l$, then the random entropy equals
\[
H(\xi_1, \ldots, \xi_l) = \frac{l}{2}\ln 2\pi + l \ln v_1 + \frac{1}{2}\ln\det\|R_{ij}\| + \frac{1}{2}\sum_{i,j=1}^{l} (\xi_i - m_i)\, a_{ij}\, (\xi_j - m_j). \quad (5.4.5)
\]
To calculate the entropy (1.6.13), one only needs to average the latter expression. Taking into account (5.4.1), we obtain
\[
H_{\xi_1,\ldots,\xi_l} = \frac{l}{2}\ln 2\pi + l\ln v_1 + \frac{1}{2}\ln\det\|R_{ij}\| + \frac{1}{2}\sum_{i,j=1}^{l} a_{ij} R_{ij}
= \frac{l}{2}\ln 2\pi + l\ln v_1 + \frac{1}{2}\ln\det\|R_{ij}\| + \frac{l}{2}. \quad (5.4.6)
\]
For the choice (5.4.4), the result has the simple form
\[
H_{\xi_1,\ldots,\xi_l} = \frac{1}{2}\ln\det\|R_{ij}\|. \quad (5.4.6a)
\]
The matrix $R = \|R_{ij}\|$ is symmetric. Therefore, as is known, there exists a unitary transformation $U$ that diagonalizes this matrix:
\[
\sum_{i,j} U^{*}_{ir}\, R_{ij}\, U_{js} = \lambda_r\, \delta_{rs}.
\]
Here $\lambda_r$ are the eigenvalues of the correlation matrix; they satisfy the equation $\det\|R_{ij} - \lambda\delta_{ij}\| = 0$. With the help of these eigenvalues we can rewrite the entropy $H_{\xi_1\ldots\xi_l}$ as follows:
\[
H_{\xi_1,\ldots,\xi_l} = \frac{1}{2}\sum_{r=1}^{l} \ln\lambda_r. \quad (5.4.8)
\]
Each of the two entropies on the right-hand side of this equation can be determined by formula (5.4.8). This leads to the relationship
\[
H_{\xi_k|\xi_1,\ldots,\xi_{k-1}} = \frac{1}{2}\operatorname{tr}\ln r^{(k)} - \frac{1}{2}\operatorname{tr}\ln r^{(k-1)}, \quad (5.4.9)
\]
where
\[
r^{(k)} = \begin{pmatrix} R_{11} & \ldots & R_{1k} \\ \vdots & \ddots & \vdots \\ R_{k1} & \ldots & R_{kk} \end{pmatrix}, \qquad
r^{(k-1)} = \begin{pmatrix} R_{11} & \ldots & R_{1,k-1} \\ \vdots & \ddots & \vdots \\ R_{k-1,1} & \ldots & R_{k-1,k-1} \end{pmatrix}.
\]
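Since $\operatorname{tr}\ln r^{(k)} = \ln\det r^{(k)}$, relation (5.4.9) needs nothing beyond determinants. In the sketch below the correlation function $R_{ij} = \rho^{|i-j|}$ is an assumed example (not from the text); for it $\det r^{(k)} = (1-\rho^2)^{k-1}$, so the conditional entropy is the constant $\frac12\ln(1-\rho^2)$ for every $k \geqslant 2$:

```python
import math

def det(mat):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in mat]
    n, d = len(a), 1.0
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(a[r][i]))
        if p != i:
            a[i], a[p] = a[p], a[i]
            d = -d
        d *= a[i][i]
        for r in range(i + 1, n):
            f = a[r][i] / a[i][i]
            for c in range(i, n):
                a[r][c] -= f * a[i][c]
    return d

rho = 0.6

def r_matrix(k):
    # correlation matrix R_ij = rho^{|i-j|} of an assumed stationary sequence
    return [[rho ** abs(i - j) for j in range(k)] for i in range(k)]

def cond_entropy(k):
    """H_{xi_k | xi_1..xi_{k-1}} = 1/2 ln(det r^(k) / det r^(k-1)), eq. (5.4.9),
    with the entropies taken in the form (5.4.6a)."""
    return 0.5 * math.log(det(r_matrix(k)) / det(r_matrix(k - 1)))
```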
Here we assume the multiplicativity condition (1.7.9). In this case, the random entropy (1.6.14) turns out to be equal to
\[
H(\xi_1, \ldots, \xi_l) = \ln N + \frac{1}{2}\ln\det\|R_{ij}\| - \frac{1}{2}\sum_k \ln\tilde\lambda_k - \frac{1}{2}\sum_k \frac{(\xi_k - \tilde m_k)^2}{\tilde\lambda_k} + \frac{1}{2}\sum_{i,j} (\xi_i - m_i)\, a_{ij}\, (\xi_j - m_j), \quad (5.4.10)
\]
where $N = \prod_k N_k$.
Now averaging this entropy entails the relationship
\[
H_{\xi_1,\ldots,\xi_l} = \ln N + \frac{1}{2}\ln\det\|R_{ij}\| - \frac{1}{2}\sum_k \ln\tilde\lambda_k - \frac{1}{2}\sum_k \frac{R_{kk} + (m_k - \tilde m_k)^2}{\tilde\lambda_k} + \frac{l}{2}. \quad (5.4.11)
\]
Introducing the matrix $\tilde R = \|\tilde\lambda_k \delta_{kr}\|$ and employing a matrix form, we can reduce equality (5.4.11) to
\[
H_{\xi_1,\ldots,\xi_l} = \ln N + \frac{1}{2}\operatorname{tr}\ln(\tilde R^{-1}R) - \frac{1}{2}\operatorname{tr}\bigl[\tilde R^{-1}(R - \tilde R)\bigr] - \frac{1}{2}(m - \tilde m)^{T} \tilde R^{-1} (m - \tilde m). \quad (5.4.12)
\]
Here we have taken into account that $\ln\det R - \ln\det\tilde R = \ln\det(\tilde R^{-1}R)$ and $\operatorname{tr}\tilde R^{-1}\tilde R = \operatorname{tr}\mathbf{1} = l$; $m - \tilde m$ is a column matrix, and $T$ denotes transposition. Comparing (5.4.12) with (1.6.16), it is easy to see that we have thereby found the entropy $H^{P/Q}_{\xi_1\ldots\xi_l}$ of distribution $P$ with respect to distribution $Q$. That entropy turns out to be equal to
\[
H^{P/Q}_{\xi_1,\ldots,\xi_l} = \operatorname{tr} G(\tilde R^{-1}R) + \frac{1}{2}(m - \tilde m)^{T} \tilde R^{-1} (m - \tilde m), \quad (5.4.13)
\]
where $G(x) = (x - 1 - \ln x)/2$.
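Formula (5.4.13) is the relative entropy (Kullback–Leibler divergence) between two Gaussian distributions written through the $G$-function. The sketch below (an invented two-dimensional example, not from the text) evaluates (5.4.13) via the eigenvalues of $\tilde R^{-1}R$ and checks it against a Monte Carlo average of $\ln(p/q)$ over samples from $P$:

```python
import math, random

# Invented two-dimensional example: P = N(m, R), Q = N(mt, Rt)
m, R = [0.5, -0.3], [[1.0, 0.4], [0.4, 0.8]]
mt, Rt = [0.0, 0.0], [[1.5, 0.0], [0.0, 0.5]]

def inv2(a):
    """Inverse and determinant of a 2x2 matrix."""
    d = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / d, -a[0][1] / d], [-a[1][0] / d, a[0][0] / d]], d

def G(x):
    return (x - 1 - math.log(x)) / 2

Rt_inv, det_Rt = inv2(Rt)
R_inv, det_R = inv2(R)
M = [[sum(Rt_inv[i][k] * R[k][j] for k in range(2)) for j in range(2)]
     for i in range(2)]
t, d = M[0][0] + M[1][1], M[0][0] * M[1][1] - M[0][1] * M[1][0]
lam = [(t + math.sqrt(t * t - 4 * d)) / 2, (t - math.sqrt(t * t - 4 * d)) / 2]
dm = [m[0] - mt[0], m[1] - mt[1]]
quad = sum(dm[i] * Rt_inv[i][j] * dm[j] for i in range(2) for j in range(2))
H_pq = sum(G(x) for x in lam) + quad / 2            # formula (5.4.13)

def logpdf(x, mean, cov_inv, det_cov):
    dx = [x[0] - mean[0], x[1] - mean[1]]
    q = sum(dx[i] * cov_inv[i][j] * dx[j] for i in range(2) for j in range(2))
    return -math.log(2 * math.pi) - 0.5 * math.log(det_cov) - 0.5 * q

# Monte Carlo average of ln(p/q) over samples from P (via Cholesky of R)
c11 = math.sqrt(R[0][0]); c21 = R[1][0] / c11
c22 = math.sqrt(R[1][1] - c21 * c21)
random.seed(1)
acc, n = 0.0, 100000
for _ in range(n):
    g1, g2 = random.gauss(0, 1), random.gauss(0, 1)
    x = [m[0] + c11 * g1, m[1] + c21 * g1 + c22 * g2]
    acc += logpdf(x, m, R_inv, det_R) - logpdf(x, mt, Rt_inv, det_Rt)
mc = acc / n
```

Here $\operatorname{tr} G$ is evaluated on the two eigenvalues of $\tilde R^{-1}R$; the Monte Carlo estimate agrees with (5.4.13) to within sampling error.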
The latter formula has been derived under the assumption of multiplicativity (1.7.9) of the measure $\nu(d\xi_1, \ldots, d\xi_l)$ and, consequently, multiplicativity of $Q(d\xi_1, \ldots, d\xi_l)$. However, it can easily be extended to a more general case. Let the measure $Q$ be Gaussian and be defined by the vector $\tilde m_k = \mathbb{E}_Q[\xi_k]$ and the correlation matrix $\tilde R_{kr}$, which is not necessarily diagonal. The entropy $H^{P/Q}$ is invariant with respect to orthogonal transformations (and, more generally, with respect to non-singular linear transformations) of the $l$-dimensional real space $\mathbb{R}^l$. By performing a rotation, we can achieve a diagonalization of the matrix $\tilde R$, and afterwards we can apply formula (5.4.13). However, formula (5.4.13) is itself invariant with respect to linear non-singular transformations, and thereby it is valid not only in the case of a diagonal matrix $\tilde R$, but also in the case of a non-diagonal matrix $\tilde R$. Indeed, for the linear transformation $\xi'_k = \sum_r C_{kr}\xi_r$ (i.e. $\xi' = C\xi$) the following transformations take place: $m' - \tilde m' = C(m - \tilde m)$, $R' = CRC^{T}$, $\tilde R' = C\tilde R C^{T}$. Consequently, $(\tilde R')^{-1} = (C^{T})^{-1}\tilde R^{-1}C^{-1}$ and $\tilde R'^{-1}R' = (C^{T})^{-1}\tilde R^{-1}R\, C^{T}$ hold true as well, so that both terms of (5.4.13) remain unchanged.
The random deviation of the entropy from its mean equals
\[
H(\xi_1, \ldots, \xi_l) - H_{\xi_1,\ldots,\xi_l} = \frac{1}{2}\sum_{j,k} a_{jk}(\eta_j \eta_k - R_{jk}) = \frac{1}{2}\eta^{T} a \eta - \frac{1}{2}\operatorname{tr}\mathbf{1} = \frac{1}{2}\eta^{T} a \eta - \frac{l}{2}, \quad (5.4.14)
\]
where we have denoted $\eta_j = \xi_j - m_j$. The mean square of this random deviation coincides with the desired variance. When averaging the square of the given expression, one needs to take into account the Gaussian moment relations (see (1.5.15), (4.1.18)). Thus, substituting (5.4.5) into the last formula and taking into consideration the form (5.4.2) of the probability density function, we obtain
\[
\mu_0(s) = \ln \int (2\pi)^{-l/2} (\det R)^{-1/2} \exp\Bigl\{ s H_{\xi_1,\ldots,\xi_l} - \frac{sl}{2} \Bigr\} \exp\Bigl\{ -\frac{1-s}{2}\sum_{i,j} \eta_i a_{ij} \eta_j \Bigr\}\, d\eta_1 \cdots d\eta_l
= s H_{\xi_1,\ldots,\xi_l} - \frac{sl}{2} + \ln\det{}^{-1/2}[(1-s)a] + \ln\det{}^{-1/2} R \quad (5.4.18)
\]
$(\eta_i = \xi_i - m_i)$, i.e.
\[
\mu_0(s) = -\frac{l}{2}\ln(1-s) - \frac{sl}{2} + s H_{\xi_1,\ldots,\xi_l},
\]
which holds true for $s < 1$. In particular, this result can be used to derive formula (5.4.16).
Theorem 5.1 is also valid in the generalized version, and it can be proven in the same way. Now it means the equality
\[
H_1 = -\lim_{l\to\infty} \frac{1}{l} \int \cdots \int \ln \frac{P(d\xi_1, \ldots, d\xi_l)}{v_1(d\xi_1) \ldots v_l(d\xi_l)}\, P(d\xi_1, \ldots, d\xi_l). \quad (5.5.3)
\]
Also, the equality
\[
2\Gamma = \lim_{n\to\infty} \bigl[H_{\xi_1,\ldots,\xi_n} - nH_1\bigr] = \sum_{j=0}^{\infty} \bigl[H_{\xi_{j+1}|\xi_j\ldots\xi_1} - H_1\bigr] \quad (5.5.4)
\]
holds true. This quantity can be interpreted as the entropy of the ends of the segment of the sequence in consideration.
Besides the aforementioned quantities and relationships based on the definition of entropy (1.6.13), we can also consider analogous quantities and relationships based on definition (1.6.17). Namely, similarly to (5.5.2), (5.5.4) we can introduce
\[
H_1^{P/Q} = H^{P/Q}_{\xi_k|\xi_{k-1}\xi_{k-2}\ldots} = \int \ln \frac{P(d\xi_k \mid \xi_{k-1}, \xi_{k-2}, \ldots)}{Q(d\xi_k)}\, P(d\xi_k \mid \xi_{k-1}, \xi_{k-2}, \ldots) \quad (5.5.6)
\]
and
\[
2\Gamma^{P/Q} = \sum_{j=0}^{\infty} \Bigl[H^{P/Q}_{\xi_{j+1}|\xi_j\ldots\xi_1} - H_1^{P/Q}\Bigr]. \quad (5.5.7)
\]
An inequality opposite to (5.5.5) takes place for the entropy $H^{P/Q}$. Therefore, the 'entropy of the ends' $\Gamma^{P/Q}$ has to be non-positive.
If we use expressions of types (1.7.13), (5.5.2) for the conditional entropies, then it is easy to see that the difference $H_{\xi_{j+1}|\xi_1,\ldots,\xi_j} - H_1$ turns out to be independent of $\nu_k$. Consequently, the boundary entropy $\Gamma$ does not depend on $\nu_k$. Analogously, $\Gamma^{P/Q}$ appears to be independent of $Q$ (if the multiplicativity condition is satisfied), and the equation $\Gamma^{P/Q} = -\Gamma$ is valid. This is useful to keep in mind when writing down formula (5.1.13) for both entropies. That formula takes the form
\[
H_{\xi_1,\ldots,\xi_l} = lH_1 + 2\Gamma + o_l(1), \qquad H^{P/Q}_{\xi_1\ldots\xi_l} = lH_1^{P/Q} - 2\Gamma + o_l(1). \quad (5.5.7a)
\]
For a stationary sequence whose correlation matrix $R_{j-k}$ is periodic, the diagonalizing transformation can be written out explicitly:
\[
U_{jr} = \frac{1}{\sqrt{l}}\, e^{2\pi i jr/l}, \quad (5.5.8)
\]
\[
\lambda_r = \sum_{s=0}^{l-1} R_s\, e^{-2\pi i s r/l}, \quad r = 0, 1, \ldots, l-1. \quad (5.5.9)
\]
Indeed, the eigenvalue relation
\[
\sum_k R_{j-k}\, U_{kr} = \lambda_r U_{jr}
\]
is satisfied due to (5.5.9). So, (5.5.8) defines the transformation that diagonalizes the correlation matrix $R_{j-k}$. It is easy to check its unitarity. Indeed, the Hermitian conjugate operator is
\[
U^{+} = \|U^{*}_{jr}\| = \Bigl\|\frac{1}{\sqrt{l}}\, e^{-2\pi i jr/l}\Bigr\|,
\]
and
\[
\sum_r U_{jr} U^{*}_{kr} = \frac{1}{l}\sum_r e^{2\pi i (j-k)r/l} = \frac{1}{l}\sum_{r=0}^{l-1}\varepsilon^r = \frac{1}{l}\,\frac{1-\varepsilon^l}{1-\varepsilon} = \delta_{jk} \qquad (\varepsilon = e^{2\pi i (j-k)/l},\ j \neq k;\ \text{for } j = k \text{ the sum is evidently } 1).
\]
After the computation of the eigenvalues (5.5.9), one can apply formula (5.4.8) to obtain the entropy $H_{\xi_1,\ldots,\xi_l}$.
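For a periodic (circulant) correlation matrix, the eigenvalues (5.5.9) are just a discrete Fourier transform of the correlation sequence, so (5.4.8) reduces $\frac12\ln\det R$ to a sum of logarithms. A self-contained check on an invented correlation sequence (assumed, not from the text):

```python
import cmath, math

l = 8
Rs = [1.0, 0.3, 0.1, 0.0, 0.0, 0.0, 0.1, 0.3]    # assumed periodic correlation R_s

# eigenvalues via the DFT formula (5.5.9); they are real by the symmetry of Rs
lam = [sum(Rs[s] * cmath.exp(-2j * math.pi * s * r / l) for s in range(l)).real
       for r in range(l)]

# independent check: (5.4.8) says (1/2) sum ln lambda_r = (1/2) ln det ||R_{j-k}||
R = [[Rs[(j - k) % l] for k in range(l)] for j in range(l)]

def det(mat):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in mat]
    d = 1.0
    for i in range(l):
        piv = max(range(i, l), key=lambda q: abs(a[q][i]))
        if piv != i:
            a[i], a[piv] = a[piv], a[i]
            d = -d
        d *= a[i][i]
        for q in range(i + 1, l):
            f = a[q][i] / a[i][i]
            for c in range(i, l):
                a[q][c] -= f * a[i][c]
    return d

H_eigen = 0.5 * sum(math.log(x) for x in lam)
H_det = 0.5 * math.log(det(R))
```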
In the considered case of invariance with respect to rotations it is easy to also calculate entropy (5.4.13). Certainly, it is assumed that not only measure $P$ but also measure $Q$ possesses the described property of symmetry ('circular stationarity'). Consequently, the correlation matrix $\tilde R_{jk}$ of the latter measure has the same properties that $R_{jk}$ has. Besides, the mean values $m_k$, $\tilde m_k$ are constant for both measures (they are equal to $m$ and $\tilde m$, respectively).

The unitary transformation $U = \|U_{jr}\|$ diagonalizes not only the matrix $R$, but also the matrix $\tilde R$ (even if the multiplicativity condition does not hold). Furthermore, similarly to (5.5.9), the eigenvalues of $\tilde R$ have the form
\[
\tilde\lambda_r = \sum_{s=0}^{l-1} \tilde R_s\, e^{-2\pi i s r/l}, \quad r = 0, 1, \ldots, l-1. \quad (5.5.10)
\]
In consequence, the entropy rate and the relative entropy rate equal
\[
H_1 = \frac{1}{2l}\sum_{r=0}^{l-1} \ln\lambda_r, \quad (5.5.12)
\]
\[
H_1^{P/Q} = \frac{1}{l}\sum_{r=0}^{l-1} G\Bigl(\frac{\lambda_r}{\tilde\lambda_r}\Bigr) + \frac{(m - \tilde m)^2}{2\tilde\lambda_0}. \quad (5.5.13)
\]
which already possesses the periodicity property. If $R_s$ is appreciably different from zero only for $s \ll l$, then the supplementary terms in (5.5.14) will visibly affect only a small number of elements of the correlation matrix $R_{jk}$, situated in the corners where $j \ll l$, $l - k \ll l$ or $l - j \ll l$, $k \ll l$.

After the transition to the correlation matrix (5.5.14) (and, if needed, after the analogous transition for the second matrix $\tilde R$), we can use the formulae (5.5.12), (5.5.13), (5.5.9), (5.5.10) derived before.
Taking into account (5.5.9) we will have
\[
\bar\lambda_r = \sum_{n=-\infty}^{\infty}\sum_{s=0}^{l-1} R_{s+nl}\, e^{-2\pi i s r/l} = \sum_{\sigma=-\infty}^{\infty} R_\sigma\, e^{-2\pi i\sigma r/l} \equiv \varphi\Bigl(\frac{r}{l}\Bigr).
\]
Analogously,
\[
\bar{\tilde\lambda}_r = \tilde\varphi\Bigl(\frac{r}{l}\Bigr) = \sum_{\sigma=-\infty}^{\infty} e^{-2\pi i\sigma r/l}\, \tilde R_\sigma.
\]
Hence,
\[
\bar H_1 = \frac{1}{2l}\sum_{r=0}^{l-1} \ln\varphi\Bigl(\frac{r}{l}\Bigr), \quad (5.5.16)
\]
\[
\bar H_1^{P/Q} = \frac{1}{l}\sum_{r=0}^{l-1} G\biggl(\frac{\varphi(r/l)}{\tilde\varphi(r/l)}\biggr) + \frac{(m - \tilde m)^2}{2\tilde\varphi(0)}.
\]
As $l \to \infty$ the sum turns into an integral:
\[
H_1^{P/Q} = \int_0^1 G\biggl(\frac{\varphi(\mu)}{\tilde\varphi(\mu)}\biggr)\, d\mu + \frac{(m - \tilde m)^2}{2\tilde\varphi(0)}
= \int_{-1/2}^{1/2} G\biggl(\frac{\varphi(\mu)}{\tilde\varphi(\mu)}\biggr)\, d\mu + \frac{(m - \tilde m)^2}{2\tilde\varphi(0)}. \quad (5.5.18)
\]
Here, when changing the limits of integration, we account for the property $\varphi(\mu + 1) = \varphi(\mu)$ that follows from (5.5.15).
For large $l$ the ends of the segment make a relatively small contribution in comparison with the large full entropy, which has order $lH_1$. The passage (5.5.14) to the correlation function $\bar R_s$ changes the entropy $H_{\xi_1,\ldots,\xi_l}$ by some amount that does not increase as $l$ grows. Thus, the following limits are equal:
\[
\lim_{l\to\infty} \frac{1}{l} \bar H_{\xi_1,\ldots,\xi_l} = \lim_{l\to\infty} \frac{1}{l} H_{\xi_1,\ldots,\xi_l}.
\]
Formula (5.5.18) is valid not only when the multiplicativity condition (5.5.1) is satisfied. For the stationary Gaussian case this condition means that the matrix $\tilde R_{jk}$ is a multiple of the unit matrix: $\tilde R_{jk} = \tilde\varphi\,\delta_{jk}$, $\tilde\varphi(\mu) = \tilde\varphi = \text{const}$. Then formula (5.5.18) yields
\[
H_1^{P/Q} = \int_0^1 G\biggl(\frac{\varphi(\mu)}{\tilde\varphi}\biggr)\, d\mu + \frac{(m - \tilde m)^2}{2\tilde\varphi}.
\]
The provided results can also be generalized to the case when there is not one random sequence {..., ξ_1, ξ_2, ...} but several (r) stationary and stationarily associated sequences {..., ξ_1^α, ξ_2^α, ...}, α = 1, ..., r, described by the correlation matrix

‖R_{j-k}^{α,β}‖,   α, β = 1, ..., r,

or by the matrix of spectral functions ϕ(μ) = ‖ϕ^{αβ}(μ)‖, ϕ^{αβ}(μ) = ∑_{σ=-∞}^{∞} R_σ^{αβ} e^{-2πiμσ}, and the column vector (by index α) of mean values m = ‖m^α‖.
Now formula (5.5.17) is replaced by the matrix generalization

H_1 = (1/2) ∫_{-1/2}^{1/2} tr [ln ϕ(μ)] dμ = (1/2) ∫_{-1/2}^{1/2} ln [det ϕ(μ)] dμ.   (5.5.19)
Certainly, the represented results follow from formulae (5.4.6a), (5.4.13). In form they are a synthesis of (5.4.6a), (5.4.13) and (5.5.17), (5.5.18).
3. The obtained results allow us to draw a conclusion about the entropic stability (see Section 1.5) of a family of random variables {ξ^l}, where ξ^l = {ξ_1, ..., ξ_l} is a segment of a stationary Gaussian sequence. Entropy H_{ξ_1,...,ξ_l} increases approximately linearly with the growth of l. According to (5.4.16) the variance of entropy also grows linearly. That is why the ratio Var[H_{ξ_1,...,ξ_l}]/H²_{ξ_1,...,ξ_l} converges to zero, so that the condition of entropic stability (1.5.8) for entropy (5.4.5) turns out to be satisfied.
Further we move to entropy (5.4.10). The conditions of entropic stability will be satisfied for it if variance Var[H_{ξ_1,...,ξ_l}] = Var[H^{P/Q}(ξ_1, ..., ξ_l)] (defined by formula (5.4.17)) increases approximately linearly with the growth of l, i.e. if there exists the finite limit

D_1 = lim_{l→∞} (1/l) Var H^{P/Q}(ξ_1, ..., ξ_l).   (5.5.21)
Furthermore, according to the contents of Section 1.7 (see (1.7.17)) we can introduce the conditional entropy

H^{P/Q}_{ξ_α^β | ξ_γ^δ} = ∫ ln [ P(dξ_α^β | ξ_γ^δ) / Q(dξ_α^β) ] P(dξ_α^β dξ_γ^δ).   (5.6.2)
The introduced entropies obey the regular relationships met in the discrete version. For instance, they obey the additivity condition

H^{P/Q}_{ξ_α^δ} = H^{P/Q}_{ξ_α^β} + H^{P/Q}_{ξ_β^δ | ξ_α^β},   α < β < δ.   (5.6.3)
When writing formulae (5.6.2), (5.6.3) it is assumed that measures Q, ν satisfy the multiplicativity condition

Q(dξ_α^β dξ_γ^δ) = Q(dξ_α^β) Q(dξ_γ^δ),   ν(dξ_α^β dξ_γ^δ) = ν_1(dξ_α^β) ν_2(dξ_γ^δ)   (5.6.4)

([α, β] does not overlap with [γ, δ]), which is analogous to (1.7.8). The indicated multiplicativity condition for measure Q means that the auxiliary process {η_t} is such that its values ξ_α^β and ξ_γ^δ for non-overlapping intervals [α, β], [γ, δ] must be independent. The multiplicativity condition for measure ν means in addition that the constants

N_α^β = ∫ ν(dξ_α^β)
Taking into account (5.6.5), regular entropy of type (1.6.16) and conditional entropy of type (1.7.4) can be found by the formulae

H_{ξ_α^β} = F(β) − F(α) − H^{P/Q}_{ξ_α^β},   (5.6.6)

H_{ξ_α^β | ξ_γ^δ} = F(β) − F(α) − H^{P/Q}_{ξ_α^β | ξ_γ^δ},   (5.6.7)
where the right-hand side variables are defined via relations (5.6.1), (5.6.2).
2. Further we consider a stationary process {ξt } defined for all t. In this case it is
natural to choose the auxiliary process {ηt } to be stationary as well.
In view of the fact that the entropy in a generalized version possesses the same
properties as the entropy in a discrete version when the multiplicativity condition is
satisfied, the considerations and the results related to a stationary process in discrete
time and stated in Sections 5.1 and 5.5 can be extended to the continuous-time case.
Due to ordinary general properties of entropy, conditional entropy H_{ξ_0^τ | ξ_{-σ}^0} (σ > 0) does not increase with the growth of σ. This fact entails the existence of the limit

lim_{σ→∞} H_{ξ_0^τ | ξ_{-σ}^0} = H_{ξ_0^τ | ξ_{-∞}^0}   (5.6.8)

which we denote by H_{ξ_0^τ | ξ_{-∞}^0}.
H_{ξ_{τ_1}^{τ_1+τ_2} | ξ_{-σ}^{τ_1}} = H_{ξ_0^{τ_2} | ξ_{-τ_1-σ}^0},

H_{ξ_0^{τ_1+τ_2} | ξ_{-∞}^0} = H_{ξ_0^{τ_1} | ξ_{-∞}^0} + H_{ξ_0^{τ_2} | ξ_{-∞}^0}.
Theorem 5.4. If entropy H_{ξ_0^t} is finite, then entropy rate (5.6.10) can be determined by the limit

h = lim_{t→∞} (1/t) H_{ξ_0^t}.   (5.6.11)
Proof. The proof exploits the same method as the proof of Theorem 5.1 does. We use the additivity property (1.7.4a) and thereby represent entropy H_{ξ_0^t} in the form
we observe that the term (1/(σ+n)) H_{ξ_0^σ} converges to 0; the term (1/(σ+n)) o_σ(1) also goes to 0. At the same time (n/(σ+n)) h converges to h because n/(σ + n) → 1. Therefore, the
The statements from Section 5.1 related to boundary entropy Γ can be generalized to the continuous-time case. Similarly to (5.1.10), that entropy can be defined by the formula

2Γ = lim_{σ→∞, τ→∞} [ H_{ξ_0^σ} + H_{ξ_0^τ} − H_{ξ_0^{σ+τ}} ]   (5.6.15)

analogous to (5.1.12).
With the help of the variables h, Γ the entropy H_{ξ_0^t} of a finite segment of a stationary process can be expressed as follows:

H_{ξ_0^t} = th + 2Γ + o_t(1).   (5.6.17)
If we take into account (5.6.1), then we can easily see from definition (5.6.15) of boundary entropy Γ that this entropy is independent of the choice of measure Q (or ν), similarly to Section 5.4.
If the multiplicativity conditions are satisfied, then the formula

H^{P/Q}_{ξ_0^t} = t h^{P/Q} − 2Γ + o_t(1)   (5.6.18)

will be valid for entropy H^{P/Q} in analogy with (5.6.17). In the last formula Γ is the same variable as the one in (5.6.17).
of vector x(t). As is well known, all the main results of the theory of finite-dimensional vectors can be used in this case. In so doing, it is required to make trivial changes in the formulae, such as replacing a sum by an integral, etc. The methods of calculating entropy given in Section 5.4 can be extended to the case of continuous time if we implement the above-mentioned changes. The resulting matrix formulae (5.4.8a), (5.4.13) retain their meaning with a new understanding of matrices and vectors.
Certainly, the indicated expressions need not be finite now. The condition of their finiteness is connected with the condition of absolute continuity of measure P with respect to measure ν or Q.
If we understand vectors and matrices in a generalized sense, then formu-
lae (5.4.8a), (5.4.13) are valid for both finite and infinite domain intervals [a, b] of a
process in stationary and non-stationary cases. Henceforth, we shall only consider
stationary processes and determine their entropy rates h, hP/Q .
For this purpose we can apply the approach employed in clause 2 of Section 5.5. This approach uses the passage to a periodic stationary process. While considering a process on interval [0, T] with correlation function R(t, t′) = R(t − t′), one can construct the new correlation function

R̄(τ) = ∑_{n=-∞}^{∞} R(τ + nT)   (5.7.1)

which, apart from stationarity, also possesses the periodicity property. Formula (5.7.1) is analogous to formula (5.5.14). The process

ξ̄(t) = (1/√N) ∑_{j=0}^{N-1} ξ(t + jT)
from the latter formula, where S(ω) denotes the spectral density

S(ω) = ∫_{-∞}^{∞} e^{-iωτ} R(τ) dτ = S(−ω)   (5.7.3)

of process ξ(t).
Substituting (5.7.2) into (5.4.8a) we obtain

5.7 Entropy of a Gaussian process in continuous time 139

H̄_{ξ_0^T} = (1/2) tr ln R̄ = (1/2) ∑_r ln λ̄_r = (1/2) ∑_r ln S(2πr/T).   (5.7.4)
Now we move to entropy H^{P/Q}_{ξ_0^T} that is determined by formula (5.4.13). It is convenient to apply this formula after diagonalization of the matrices R(t − t′) and R̃(t − t′). Then

tr G(R̃^{-1}R) = ∑_r G(λ_r/λ̃_r)   (5.7.5)

holds true, where
λ̃_r = S̃(2πr/T),   S̃(ω) = ∫_{-∞}^{∞} e^{-iωτ} R̃(τ) dτ   (5.7.6)
in analogy with (5.7.2), (5.7.3). After diagonalization the second term on the right-hand side of (5.4.13) takes the form

(1/2)(m − m̃)^T R̃^{-1}(m − m̃) = (1/2) c^+ U^{-1} R̃^{-1} U c = (1/2) ∑_r λ̃_r^{-1} |c_r|²   (5.7.7)

(c = U^+(m − m̃)).
Because

c_r = ∫_0^T U_{tr}^+ [m(t) − m̃(t)] dt = { √T (m − m̃)  for r = 0;   0  for r ≠ 0 }
are satisfied, entropy (5.7.8) turns out to be finite, which indicates absolute continuity of measure P with respect to measure Q.
2. Now we move to finding the entropy rate for a stationary process defined on a continuous-time scale. Since substitution (5.7.1) of the correlation function significantly affects only boundary effects, entropies (5.7.4), (5.7.8) differ from the entropies corresponding to correlation function R(τ) by terms of order 1, and not of order T. That is why the derived expressions (5.7.4), (5.7.8) can be used to determine the entropy rates

h = lim_{T→∞} (1/T) H_{ξ_0^T} = lim_{T→∞} (1/T) H̄_{ξ_0^T},

h^{P/Q} = lim_{T→∞} (1/T) H^{P/Q}_{ξ_0^T} = lim_{T→∞} (1/T) H̄^{P/Q}_{ξ_0^T}.
Thus, in the obtained formulae, we should pass to the limit T → ∞ with the summations over r becoming integrals. From (5.7.4) we have

h = (1/4π) lim_{T→∞} ∑_{r=-∞}^{∞} (2π/T) ln S(2πr/T) = (1/4π) ∫_{-∞}^{∞} ln S(ω) dω.   (5.7.10)
These results can also be obtained from formulae (5.5.17), (5.5.18), which define the entropy rate of stationary sequences, as a limiting result of an unlimited concentration of points on the time scale. If we pick points t_k = kΔ from a stationary Gaussian process in continuous time t, then we obtain a stationary Gaussian sequence {ξ(t_k)} having the correlation matrix

R_jk = R((j − k)Δ)

and the same values m. Comparing formula (5.5.15) with (5.7.3) it is easy to see that

ϕ(ωΔ/2π) Δ = ∑_{σ=-∞}^{∞} R(σΔ) e^{-iωσΔ} Δ = ∫_{-∞}^{∞} R(τ) e^{-iωτ} dτ + o_Δ(1).   (5.7.12)
Considering entropy with respect to a unit of time, rather than a single element of the sequence, we have

h = lim_{Δ→0} H_1/Δ,   h^{P/Q} = lim_{Δ→0} H_1^{P/Q}/Δ.
Substituting (5.5.17), (5.5.18) here and accounting for equality (5.7.12) and the analogous equality for ϕ̃, S̃, we obtain that

h = (1/4π) lim_{Δ→0} ∫_{-π/Δ}^{π/Δ} [−ln Δ + ln S(ω)] dω,   (5.7.13)

h^{P/Q} = (1/2π) ∫_{-∞}^{∞} G(S(ω)/S̃(ω)) dω + (m − m̃)²/(2S̃(0)).   (5.7.14)
Formula (5.7.14) coincides with (5.7.11). In turn, (5.7.13) differs from (5.7.10) by the term −ln Δ, which can be attributed to the appropriately selected measure ν. Using the freedom of choice of measure ν we can represent formulae (5.7.10), (5.7.13) in a different, more convenient form. We assume that the spectral density S(ω) converges to a finite non-zero limit as ω → ∞:
We take into account that ϕ(μ)Δ = S(2πμ/Δ) + o_Δ(1) and thereby obtain that the value of ϕ(μ)Δ converges to S(∞) when μ is fixed and Δ → 0. It is expedient to single out this limit value and rewrite formula (5.5.16). We choose

ν(dξ(t_j))/dξ(t_j) = (2πe)^{-1/2} (S(∞)/Δ)^{-1/2}   (5.7.16)

instead of formula (5.4.4). Then the first term in (5.7.15) will be related to measure ν, and instead of (5.7.15) we will get

H̄_1 = (1/2l) ∑_{r=0}^{l-1} ln [ϕ(r/l)Δ/S(∞)].   (5.7.17)
where c is some number. Hence, we obtain a finite value for entropy rate h if constraints (5.7.19) are satisfied and all the singularities of the spectral density (points at which it turns into zero or infinity) are logarithmically integrable, as, for instance, zeros and poles of the type
If these conditions are satisfied, then measure P is absolutely continuous with respect to measure ν constructed in this special way and defined by (5.7.16) and the multiplicativity condition.
The convergence condition of the other integral (5.7.11) in the upper limit has the form

∫_c^∞ [S(ω)/S̃(ω) − 1]² dω < ∞,   (5.7.19a)
analogous to (5.7.9). This condition is necessary in order for measure P to be abso-
lutely continuous with respect to Q.
If the multiplicativity condition is satisfied for measure Q, then (in a stationary case) its correlation matrix has to be a multiple of an identity matrix and its spectral density has to be constant: S̃(ω) = S̃. The equality S̃(∞) = S(∞) is necessary for absolute continuity. Consequently, S̃(ω) = S(∞). Substituting the latter value into (5.7.11) we obtain the formula

h^{P/Q} = (1/2π) ∫_{-∞}^{∞} G(S(ω)/S(∞)) dω   (G(x) = (x − 1 − ln x)/2)   (5.7.20)

for equivalent means. The similarity of the last formula to (5.7.18) is evident; the only difference between them is the choice of the integrand.
Example 5.5. Consider the spectral density

S(ω) = S_0 (ω² + γ²)/(ω² + β²).

We would like to find its entropy rates. Since S(∞) = S_0, according to (5.7.18) we have

h = (1/2π) ∫_0^∞ ln [(ω² + γ²)/(ω² + β²)] dω = (1/2)(γ − β).
Further we apply formula (5.7.20) and conclude that

h^{P/Q} = (1/2π) ∫_0^∞ [(ω² + γ²)/(ω² + β²) − 1] dω − (1/2π) ∫_0^∞ ln [(ω² + γ²)/(ω² + β²)] dω
       = (γ² − β²)/(4β) − h = (1/4β)(γ − β)².   (5.7.21)
Let now

S̃(ω) = S_0/ω².   (5.7.22)

Then the integral in (5.7.14) will be reduced to the integrals involved in (5.7.21) (for γ = 0) and we will obtain

h^{P/Q} = β/4.   (5.7.23)
When choosing the spectral density (5.7.22), the measure Q (it describes the
stochastic process η (t)) does not satisfy the multiplicativity condition. However,
this condition will hold for derivative η̇ (t) = d η (t)/dt because it has a uniform
spectral density S0 . When passing to the derivative η̇ (t), the process ξ (t) also
needs to be replaced with its derivative ξ̇ (t). That is why we can consider that re-
sult (5.7.23) complies with the multiplicativity condition.
D_0 = (1/2π) ∫_0^∞ [S(ω)/S̃(ω) − 1]² dω + (m − m̃)² S(0)/(4S̃²(0)).   (5.7.24)
of the event that the first point τ_1 from the set of n consecutive random points τ_1 < ··· < τ_n lies within interval [t_1, t_1 + Δ_1], the second one within interval [t_2, t_2 + Δ_2], and so forth, the last one within interval [t_n, t_n + Δ_n], and also there are no other random points lying in other places of [a, b]. The simplest system of random points is a Poisson system, for which

P_n = (1/n!) [∫_a^b β(t) dt]^n exp{−∫_a^b β(t) dt},

p_n(τ_1, ..., τ_n) = n! β(τ_1) ··· β(τ_n) / [∫_a^b β(t) dt]^n.   (5.8.5)
Taking β out of the sign of the logarithm, this expression can apparently be written as follows:

H^{P/Q}_{ξ_a^b} = (β − 1)T − (ln β) E[n] + H^{P/Q_1}_{ξ_a^b}   (T = b − a).   (5.8.7)
Then

H^{P/Q_1}_{ξ_a^b} = ∑_{n=0}^{∞} ∫···∫_{a<t_1<···<t_n<b} P_n p_n(t_1, ..., t_n) ln[e^T P_n p_n(t_1, ..., t_n)] dt_1 ··· dt_n   (5.8.8)
where

H^{P/Q_1}_{ξ_a^b|n} = ∑_{n=0}^{∞} P_n ∫···∫_{a<t_1<···<t_n<b} p_n(t_1, ..., t_n) ln[e^T p_n(t_1, ..., t_n)] dt_1 ··· dt_n.   (5.8.10)
H^{P/Q_1}_{ξ_a^b} = E[ln(e^T e^{-γT} γ^n)] = γT ln γ + (1 − γ)T   (5.8.11)

since E[n] = γT.
Next we obtain from (5.8.7) that

H^{P/Q}_{ξ_a^b} = (β − γ)(b − a) + γ(b − a) ln(γ/β).   (5.8.12)

We divide the last result by b − a to get the entropy rate of one Poisson measure with respect to another:

h^{P/Q} = β − γ + γ ln(γ/β) = γ [β/γ − 1 − ln(β/γ)] ≥ 0.
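Since, given the number n of points, a homogeneous Poisson process distributes them as independent uniform points under both measures, the entropy (5.8.12) reduces to the divergence between the two Poisson counting distributions with means γT and βT. A quick numerical check (Python sketch; the rates and T are illustrative):

```python
import math

gamma_, beta_, T = 0.7, 1.3, 5.0     # rates of measures P and Q, interval length
lam1, lam2 = gamma_ * T, beta_ * T   # counting means under P and Q

# H^{P/Q} = sum_n P_n ln(P_n/Q_n); for Poisson counts
# ln(P_n/Q_n) = n ln(lam1/lam2) - lam1 + lam2.
H = 0.0
for n in range(200):
    p = math.exp(n * math.log(lam1) - lam1 - math.lgamma(n + 1))
    H += p * (n * math.log(lam1 / lam2) - lam1 + lam2)

h = H / T
print(h, beta_ - gamma_ + gamma_ * math.log(gamma_ / beta_))   # compare with (5.8.12)/T
```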
Entropies (5.8.11), (5.8.12) are proportional to b − a due to the fact that Poisson processes are processes with independent increments. Entropies (5.8.9), (5.8.10) depend on the interval length in a more complicated way. If we introduce the function

5.8 Entropy of a stochastic point process 147

S(x) = e^{-x} ∑_{n=0}^{∞} (x^n/n!) ln n!
The probability that an elementary interval is hit by more than one random point is assumed to be a value of order Δ², i.e. of a higher order of smallness. Then, for sufficiently large N, the number n of random points lying within [a, b] will, with probability close to one, not differ from the number of points n′ = ∑_{k=0}^{N-1} ξ(t_k) hitting Z.
A point process on Z is a discrete process having 2^N distinct realizations in total. Event n′ = 0 is realized by only one scenario. Event n′ = 1 can be realized by N different scenarios, i.e. it constitutes N different realizations. Event n′ = 2 consists of N(N − 1)/2 different realizations, and so forth. Similarly to Section 1.6 we introduce a measure conveying information about the number of different realizations to obtain
Note that

∑_{n′=0}^{∞} ν(n′) = 2^N.
It is also easy to compute the number of realizations of the event that τ_1 hits interval [t_1, t_1 + Δ_1), at the same time τ_2 hits interval [t_2, t_2 + Δ_2), and so on, τ_n hits interval [t_n, t_n + Δ_n). Assuming Δ_k ≫ Δ and n ≪ N, the number of such realizations equals

Δν ≈ (Δ_1/Δ) ··· (Δ_n/Δ).
Next we recall that the probability of such a set is given by formula (5.8.4). Since

ΔP/Δν ≈ P_n p_n(t_1, ..., t_n) Δ^n,

we employ formulae (1.6.5), (1.6.6) to obtain
Factoring out Δ from the sign of the logarithm, we can express the derived result in terms of entropy (5.8.8):

H_ξ ≈ (ln 1/Δ) E[n] + (b − a) − H^{P/Q_1}_{ξ_a^b}.   (5.8.13)
This substantiates the introduction of entropy H^{P/Q_1}_ξ of measure P with respect to Poisson measure Q_1. If we partition interval [a, b] in a non-uniform fashion, then, due to the relationship t_{m+1} − t_m = β(t_m)Δ, entropy H_ξ will be expressed (in analogy with (5.8.13)) in terms of entropy H^{P/Q}_ξ of measure P with respect to the Poisson measure Q having non-uniform density β(t).
3. In the case of a stationary point process we can consider the average entropy rate per unit of time. Here we provide two alternative approaches.
1) According to Theorem 5.4, we can calculate this rate by formula (5.6.11). Here we need to investigate the behaviour of entropies (5.8.8), (5.8.10) for large T.
In the case of a stationary process the mean number of points E[n] = n̄ is proportional to the interval length b − a = T. Besides, for an ergodic process the dependence of its variance Var[n] on T approaches, for large T, a linear law:

Var(n) = D_0 T + O(1).
Also, random variable n obeys the Central Limit Theorem. For large T the probability of inequality n_0 ≤ n ≤ n_0 + ΔN can be found with the help of the Gaussian distribution:

ΔP ≈ (2π Var(n))^{-1/2} e^{-(n_0 - n̄)²/(2 Var(n))} ΔN   (ΔN² ≪ Var(n)),   (5.8.14)

where ΔN has the same meaning as in (1.6.4a). This allows us to compute H_n approximately as the entropy of a Gaussian variable. Thus, we obtain from (5.8.14) that

H(n) = −ln P_n ≈ −ln(ΔP/ΔN) ≈ (1/2) ln(2π Var(n)) + (n − n̄)²/(2 Var(n)).   (5.8.15)
This approximation is a particular case of formula (5.4.5). Averaging (5.8.15) we find that

H_n ≈ (1/2) ln(2πe Var(n)) ≈ (1/2) ln(2πe D_0 T).
It follows from the obtained dependence that, in particular, the limit

lim_{T→∞} (1/T) H_n = 0

vanishes. This means that the contribution of entropy H_n to entropy rate (5.6.11) reduces to zero. Consequently, only entropy (5.8.10) influences the entropy rate:

h^{P/Q_1} = lim_{T→∞} (1/T) H^{P/Q_1}_{ξ_0^T|n}.   (5.8.16)
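For a Poisson count both claims are easy to verify numerically (Python sketch; γ and the horizons T are illustrative assumptions): the exact Shannon entropy of the counting distribution tracks the Gaussian value (1/2) ln(2πe D_0 T), and H_n/T shrinks toward zero.

```python
import math

def poisson_entropy(lam, nmax):
    # Exact Shannon entropy -sum_n p_n ln p_n of a Poisson distribution with mean lam
    H = 0.0
    for n in range(nmax):
        logp = n * math.log(lam) - lam - math.lgamma(n + 1)
        H -= math.exp(logp) * logp
    return H

gamma_ = 1.0
for T in (100.0, 1000.0):
    H_n = poisson_entropy(gamma_ * T, int(10 * gamma_ * T))
    gauss = 0.5 * math.log(2 * math.pi * math.e * gamma_ * T)   # (1/2) ln(2 pi e D0 T)
    print(T, H_n, gauss, H_n / T)
```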
Using the notation

Φ(n) = H^{P/Q_1}_{ξ_0^T}(| n) = ∫···∫_{0<t_1<···<t_n<T} p_n(t_1, ..., t_n) ×   (5.8.17)

Φ′(n) = Φ(n + 1) − Φ(n),
Φ″(n) = Φ(n + 1) − 2Φ(n) + Φ(n − 1)
if

lim_{T→∞} (Var(n)/T) Φ″(n̄) = 0.   (5.8.20)
It has been already indicated that ratio Var[n]/T usually converges to a finite limit
D0 as T → ∞. Therefore, condition (5.8.20) is satisfied if
Φ″(n̄) → 0 as T → ∞.   (5.8.21)

As the investigation shows, condition (5.8.21), which states that the dependence Φ(n) approaches a direct proportionality as n ≈ cT grows, is satisfied for a large number of practical examples.
Thus, in compliance with formula (5.8.19), the entropy rate can be computed if we suppose that the number of random points sampled from a large interval [0, T] is not random but known in advance and equal to n = E[n].
Example 5.6. In this example we calculate the entropy of a Poisson process by the specified method. The number of points on the entire interval is assumed to be non-random and equal to n = γT. At the same time p_n(t_1, ..., t_n) = n!T^{-n}. We substitute the latter function into (5.8.17) and find that

Φ(n) = T + ln n! − n ln T.
Using Stirling's formula for ln n!, we obtain that

Φ(n) = T + n ln n − n + (1/2) ln(2πn) − n ln T.
Condition (5.8.21) actually takes place because

Φ″(n̄) ≈ 1/n̄ = 1/(γT).
We substitute the derived expression for Φ(n) into (5.8.19) with n = γT and pass to the limit T → ∞. Thus, we obtain

h^{P/Q_1} = 1 + γ ln γ − γ,   (5.8.22)

which, of course, coincides with the entropy rate determined from (5.8.11).
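The convergence Φ(n̄)/T → 1 + γ ln γ − γ is easy to watch numerically (Python sketch; γ and the values of T are illustrative, and ln n! is evaluated via lgamma):

```python
import math

gamma_ = 2.0
for T in (10.0, 100.0, 1000.0, 10000.0):
    n = gamma_ * T
    # Phi(n) = T + ln n! - n ln T, with ln n! = lgamma(n + 1)
    Phi = T + math.lgamma(n + 1) - n * math.log(T)
    print(T, Phi / T)

print(1 + gamma_ * math.log(gamma_) - gamma_)   # the limit (5.8.22)
```

The residual (1/2) ln(2πn)/T of Stirling's formula decays like ln T / T, which is visible in the printed sequence.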
2) The other method to calculate the entropy rate is based on its definition (5.6.10) as a conditional entropy. Fixation of process ξ_{-∞}^0 means fixation of all random points ..., τ_{-2}, τ_{-1}, τ_0 < 0 that occurred before moment t = 0. Therefore, (5.6.10) can be rewritten as

h^{P/Q_1} = (1/τ) H^{P/Q_1}_{ξ_0^τ | ..., τ_{-1}, τ_0} = (1/τ) E[H^{P/Q_1}_{ξ_0^τ}(| ..., τ_{-1}, τ_0)].   (5.8.23)

Here entropy H^{P/Q_1}_{ξ_0^τ}(| ..., τ_{-1}, τ_0) can be defined by formula (5.8.8) with probabilities P_n and densities p_n(τ_1, ..., τ_n) replaced with the respective conditional probabilities and densities
It is desirable to consider a small length τ of interval [0, τ], because in this case the probability of two or more points hitting the interval is negligibly small, and those probabilities can be disregarded. Then we can retain only two terms in expression (5.8.24), supposing p_1(t_1 | ..., τ_{-1}, τ_0) = 1/τ. We obtain

H^{P/Q_1}_{ξ_0^τ}(| ..., τ_{-1}, τ_0) = [1 − p(0 | ..., τ_{-1}, τ_0)τ] ln{e^τ [1 − p(0 | ..., τ_{-1}, τ_0)τ]}
    + p(0 | ..., τ_{-1}, τ_0)τ ln[e^τ p(0 | ..., τ_{-1}, τ_0)] + O(τ²)

or, equivalently,

H^{P/Q_1}_{ξ_0^τ}(| ..., τ_{-1}, τ_0) = τ − p(0 | ..., τ_{-1}, τ_0)τ + p(0 | ..., τ_{-1}, τ_0)τ ln p(0 | ..., τ_{-1}, τ_0) + O(τ²).
We substitute the last expression into (5.8.23), make a passage to the limit τ → 0
and thereby obtain the desired entropy rate
Example 5.7. We apply the derived formula to a stationary point process with a limited aftereffect. We suppose that the intervals σ = t_{m+1} − t_m between adjacent random points are independent random variables having an identical probability density function w(σ). Then, evidently,

P(n = 1 | ..., τ_{-1}, τ_0) = w(−τ_0)τ / ∫_{-τ_0}^{∞} w(σ) dσ + O(τ²),

so that

p(0 | ..., τ_{-1}, τ_0) = w(−τ_0) / ∫_{-τ_0}^{∞} w(σ) dσ.
Example 5.8. Further, we consider a more complicated example. Let the increments
τm+1 − τm of a given stationary point process be mutually independent as before
but their probability density functions alternate between w1 (σ ) and w2 (σ ), which
are different from each other. In other words, if τm+1 − τm is distributed as w1 , then
τm+2 − τm+1 is distributed as w2 ; furthermore, increment τm+3 − τm+2 is distributed
as w_1 and τ_{m+4} − τ_{m+3} is distributed as w_2, and so forth. Such a point process is equivalent to a stationary process with two states A_1 and A_2, where the times of staying in each state are mutually independent and have distributions w_1 (for the time in state A_1) and w_2 (for the time in state A_2), respectively. Moreover, the random points can be classified by the additional parameter ϑ, supposing that ϑ_m = 1 if the transition from A_1 to A_2 occurs at point τ_m and ϑ_m = 2 if the reverse transition from A_2 to A_1 is observed.
In the described case the probability density p(0 | ..., τ_{-1}, τ_0) depends not only on time τ_0 of the occurrence of the last random point but also on its type ϑ_0. Namely,

p(0 | ..., τ_{-1}, τ_0) = p(0 | τ_0, ϑ_0) = w_{ϑ_0}(−τ_0) / ∫_{-τ_0}^{∞} w_{ϑ_0}(ρ) dρ.   (5.8.29a)
We still need to calculate P(ϑ, dσ). A priori, the probability that the elementary interval [−σ, −σ + dσ] is hit by a random point of either of the two specified types is the same and equals

5.9 Entropy of a discrete Markov process in continuous time 153

dP = dσ/(σ̄_1 + σ̄_2),   σ̄_ϑ = ∫_0^∞ ρ w_ϑ(ρ) dρ,

because the mean density of points of each type is equal to (σ̄_1 + σ̄_2)^{-1}. If a point with ϑ_0 = 1 comes out, then this point will be the last one with probability ∫_σ^∞ w_1(ρ) dρ, i.e. other points will not hit interval [−σ + dσ, 0]. If ϑ_0 = 2, then point τ_0 will be the last one with probability ∫_σ^∞ w_2(ρ) dρ. Therefore, formula (5.8.30) can be rewritten as
h^{P/Q_1} = 1 + (1/(σ̄_1 + σ̄_2)) ∑_{ϑ=1}^{2} ∫_0^∞ [ ln w_ϑ(σ) − ln ∫_σ^∞ w_ϑ(ρ) dρ − 1 ] w_ϑ(σ) dσ.   (5.8.31)
Consider the exponential densities

w_ϑ(σ) = μ_ϑ e^{-μ_ϑ σ}.

Then

σ̄_ϑ = 1/μ_ϑ,   ∫_σ^∞ w_ϑ(ρ) dρ = e^{-μ_ϑ σ},

h^{P/Q_1} = 1 + (μ_1μ_2/(μ_1 + μ_2)) ∑_{ϑ=1}^{2} [ln μ_ϑ − 1]
         = 1 − 2μ_1μ_2/(μ_1 + μ_2) + (μ_1μ_2/(μ_1 + μ_2))(ln μ_1 + ln μ_2).   (5.8.32)
If, in particular, μ1 = μ2 = γ , then formula (5.8.32) coincides with (5.8.22) since
under this condition the point process in consideration turns into a Poisson process.
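For the exponential case the inner integral of (5.8.31) can also be evaluated by brute-force quadrature and compared with the closed form (5.8.32) (Python sketch; the rates μ_1, μ_2 and the midpoint grid are illustrative numerical choices):

```python
import math

mu1, mu2 = 0.5, 2.0

def inner(mu, N=100_000):
    # int_0^inf [ln w(s) - ln int_s^inf w(r) dr - 1] w(s) ds
    # for w(s) = mu e^{-mu s}, via the midpoint rule on [0, 50/mu]
    L = 50.0 / mu
    ds = L / N
    total = 0.0
    for k in range(N):
        s = (k + 0.5) * ds
        w = mu * math.exp(-mu * s)
        tail = math.exp(-mu * s)          # int_s^inf w(rho) drho
        total += (math.log(w) - math.log(tail) - 1.0) * w * ds
    return total

sbar = 1 / mu1 + 1 / mu2                  # sigma-bar_1 + sigma-bar_2
h = 1 + (inner(mu1) + inner(mu2)) / sbar  # formula (5.8.31)
closed = (1 - 2 * mu1 * mu2 / (mu1 + mu2)
          + mu1 * mu2 / (mu1 + mu2) * (math.log(mu1) + math.log(mu2)))
print(h, closed)                          # formula (5.8.32)
```

Setting mu1 = mu2 = γ in the same computation reproduces the Poisson value 1 + γ ln γ − γ of (5.8.22).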
Consider a Markov process ξ(t) with a discrete state space, i.e. a process having either a finite or a countable number of possible states. In contrast to the process considered in Section 5.2, it now evolves in continuous time. Let its transition probabilities be defined by the differential transition probability π_t(x, x′) according to the formula
However, if we want to calculate the entropy of the continuous set of values ξ0T ,
then we have to apply the formulae of the generalized version, i.e. the theory of
Sections 1.6 and 5.6.
Denote by {τ j } points on the time axis (i.e. time moments), at which we observe
jumps from one state of the process to another. We define the auxiliary measure
Q1 (ξ0T ) as a Poisson measure (with a unit density) for the system of random points
{τ j }. We pick out the initial entropy (5.9.1) by the formula
H_{ξ_0^T} = H_{ξ(0)} + H_{ξ_{0+0}^T | ξ(0)}

and define conditional entropy H_{ξ_{0+0}^T | ξ(0)} (or, more succinctly, H_{ξ_0^T | ξ(0)}) according to formulae (1.7.17a), (1.7.17), (5.6.1) from the generalized version. Then we will have

H^{P/Q_1}_{ξ_0^T} = −H_{ξ(0)} + H^{P/Q_1}_{ξ_0^T | ξ(0)}.   (5.9.1a)
Due to the Markovian property of the process, the latter formula can be rewritten as follows:

H^{P/Q_1}_{ξ_α^γ | ξ(α)} = H^{P/Q_1}_{ξ_α^β | ξ(α)} + H^{P/Q_1}_{ξ_β^γ | ξ(β)},

so that

H^{P/Q_1}_{ξ_0^T} = −H_{ξ(0)} + H^{P/Q_1}_{ξ_0^{t_1} | ξ(0)} + H^{P/Q_1}_{ξ_{t_1}^{t_2} | ξ(t_1)} + ··· + H^{P/Q_1}_{ξ_{t_N}^T | ξ(t_N)}.   (5.9.1b)

In the stationary case

H^{P/Q_1}_{ξ_0^τ | ξ(0)} = τ h^{P/Q_1}.   (5.9.2)
Taking this into account, we derive from (5.9.1a) and (5.9.2) that

H^{P/Q_1}_{ξ_0^T} = −H_{ξ(0)} + h^{P/Q_1} T.

We compare the last formula with (5.6.18) and conclude that in this case o_t(1) = 0 and

2Γ = H_{ξ(0)} = −∑_{ξ(0)} P(ξ(0)) ln P(ξ(0)).   (5.9.3)
It is convenient to use (5.9.2) for small τ in order to determine entropy rate h^{P/Q_1}, similarly to the approach used in Section 5.8 (see (5.8.23) and (5.8.24a)).
For small τ we can neglect the probability of more than one transition point hitting (0, τ]. Then the possible scenarios are the following: either there is no transition, with probability

1 + π(ξ(0), ξ(0))τ + O(τ²),

or the transition to a state x′ ≠ ξ(0) occurs, with probability π(ξ(0), x′)τ + O(τ²). Analogously to (5.8.25) we write down the entropy of these events as

H^{P/Q_1}_{ξ_0^τ}(| ξ(0)) = [1 + π(ξ(0), ξ(0))τ] ln{e^τ [1 + π(ξ(0), ξ(0))τ]}
Averaging with respect to ξ(0) and making the passage to the limit, we get

h^{P/Q_1} = lim_{τ→0} (1/τ) H^{P/Q_1}_{ξ_0^τ | ξ(0)}.

In the stationary case the probabilities P(ξ) satisfy

dP(ξ)/dt = ∑_{ξ′} P(ξ′) π(ξ′, ξ) = 0.   (5.9.5)
Example 5.9. Consider a process with two states characterized by the differential transition probabilities

‖ π(1,1)  π(1,2) ‖   ‖ −μ   μ ‖
‖ π(2,1)  π(2,2) ‖ = ‖  ν  −ν ‖.   (5.9.6)

Then equation (5.9.5) is reduced to

−μP(1) + νP(2) = 0,
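In matrix form, the stationarity condition (5.9.5) for this example is the statement that the row vector (P(1), P(2)) annihilates the matrix (5.9.6); a minimal check (Python with numpy; the values of μ, ν are illustrative):

```python
import numpy as np

mu, nu = 0.4, 0.6
Pi = np.array([[-mu,  mu],
               [ nu, -nu]])          # differential transition probabilities (5.9.6)

# The stationary distribution solves sum_{xi'} P(xi') pi(xi', xi) = 0 together
# with sum P = 1, giving P(1) = nu/(mu+nu), P(2) = mu/(mu+nu).
P = np.array([nu, mu]) / (mu + nu)
print(P @ Pi)                        # both components vanish
```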
which generalizes (5.9.5) (A is arbitrary). In the given case formula (5.9.1) may become invalid, and thereby we may need to employ measure Q in order to define H_{ξ(0)}.
The provided formulae (5.9.4), (5.9.8) can be applied to calculate the average entropy rate not only in a stationary case. Those formulae are also appropriate in a non-stationary case due to the Markovian properties of the process; namely, they

5.10 Entropy of diffusion Markov processes 157

define the entropy density h(t) per unit of time, which may depend on time t. Therefore, the respective averaging should be conducted with respect to the non-stationary distribution P(dξ).
ṗ_t(x) = −(∂/∂x)[a p_t(x)] + (1/2)(∂²/∂x²)[b p_t(x)],   (5.10.1)

where

p_t(x) dx = P{x_t ∈ [x, x + dx]}.
In order to determine the entropy H_{x_0^T} for the given process, we need to select a measure Q such that measure P is absolutely continuous with respect to it. It is also desirable for such a measure Q to be as simple as possible.
It is known (for instance, see [48]) that one of the desired measures corresponds to a diffusion process with the same local variance b(x, t) but with a null drift, i.e. a measure for which (5.10.1) is replaced with the equation

q̇_t(x) = (1/2)(∂²/∂x²)[b q_t(x)],   q_t(x) dx = Q{x_t ∈ [x, x + dx]}.
That is, the Radon–Nikodym derivative takes the form

P(dx_0^T)/Q(dx_0^T) = [p(x(0))/q(x(0))] exp{ ∫_0^T [a(x(t), t)/b(x(t), t)] [d*x(t) − (1/2) a(x(t), t) dt] },   (5.10.2)
where the stochastic integral ∫ ... d*x(t) is understood in the Itô sense. Further, we try to satisfy the multiplicativity property (5.6.4) in order for the theory of Section 5.6 to be applicable. For this purpose we represent process {x(t), t ∈ [0, T]} as process {ξ(0), ξ(t), t ∈ [0, T]}, where ξ(0) = x(0), ξ(t) = ẋ(t), t > 0, and require that b(x, t) = b(t) does not depend on x. Then measure Q will correspond to the Gaussian delta-correlated process:

∫ ξ(t) dQ = 0,   ∫ ξ(t)ξ(t′) dQ = b(t) δ(t − t′).
Therefore, it is convenient to compute entropy H^{P/Q}_{x_t^{t+τ} | x(t)} approximately, taking into consideration the smallness of τ. Then we pass to the limit

h^{P/Q} = lim_{τ→0} (1/τ) H^{P/Q}_{x_t^{t+τ} | x(t)}.   (5.10.3)

We have

H^{P/Q}_{x_t^{t+τ} | x(t)} = E{ (a(x(t))/b) ( E[x(t + τ) − x(t) | x(t)] − (1/2) a(x(t))τ ) } + o(τ).
Taking into account that, according to the definition of the drift a, the relationship

E[x(t + τ) − x(t) | x(t)] = a(x(t))τ + o(τ)

is valid, we have

(1/τ) H^{P/Q}_{x_t^{t+τ} | x(t)} = (1/2) E[a(x(t))²/b] + o(1).

Substituting the last expression into (5.10.3) we find the entropy density

h^{P/Q} = (1/2b) E[a(x(t))²],   (5.10.4)
which is independent of t in a stationary case.
It was assumed above that a and b do not depend on time t according to the stationarity condition. If that condition were not satisfied, we would obtain the entropy density

h^{P/Q}(t) = (1/2) E[a(x(t), t)²/b(x(t), t)]   (5.10.5)
dependent on time via the described method (here the condition of independence of b from x may also fail to be satisfied). In terms of the indicated density, the full entropy obtained from (5.10.2) is

H^{P/Q}_{x_0^T} = E ln [P(dx_0^T)/Q(dx_0^T)]
Then, in a stationary case with b = const, the entropy density can be found by formula (5.10.4) employing the stationary probability density function p_st(x), which satisfies the stationary Fokker–Planck equation

−(d/dx)[a(x) p_st(x)] + (b/2)(d²/dx²) p_st(x) = 0.
If we introduce the potential function

U(x) = −∫_c^x a(x′) dx′,   (5.10.7)
Applying integration by parts (also accounting for the fact that density p_st(x) vanishes on the boundaries, for instance, as x → ±∞), the latter formula can be reduced to the form

h^{P/Q} = (1/4N) ∫ e^{-2U(x)/b} (d²U(x)/dx²) dx = −(1/4) E[da(x)/dx].   (5.10.9)
Example 5.10. Let function a(x) be linear: a(x) = −βx + γ. In order for the process to be stationary, positiveness β > 0 is necessary. Function (5.10.7) is given by

U(x) = (1/2)βx² − γx,

and distribution (5.10.8) appears to be Gaussian:

p_st(x) = √(β/πb) exp{ −(β/b)(x − γ/β)² },   N = √(πb/β).
As is known, the given process is a Gaussian process having the spectral density
S(ω ) = 2b/(ω 2 + β 2 ). That is why it coincides with the process considered in Sec-
tion 5.7 (Example 5.5). Certainly, result (5.10.10) is equivalent to the corresponding
result (5.7.23), the derivation of which is based on the theory of Gaussian processes.
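The value β/4 can be re-derived from (5.10.4) by Monte Carlo sampling of the stationary Gaussian density of Example 5.10, whose variance is b/(2β) (Python sketch with numpy; the parameter values, seed and sample size are illustrative assumptions):

```python
import numpy as np

beta, gamma, b = 1.5, 0.7, 2.0
rng = np.random.default_rng(0)

# Stationary density of Example 5.10: Gaussian with mean gamma/beta and
# variance b/(2 beta). Sample it and average a(x)^2 / (2b)  (formula (5.10.4)).
x = rng.normal(gamma / beta, np.sqrt(b / (2 * beta)), 1_000_000)
a = -beta * x + gamma
h_mc = np.mean(a ** 2) / (2 * b)

print(h_mc, beta / 4)   # also equals -(1/4) E[da/dx] by (5.10.9), since da/dx = -beta
```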
The results provided in this paragraph may be generalized to the multivariate case, when there is a multicomponent Markov process {x(t)} = {x_1(t), ..., x_l(t)}. Suppose it is characterized by drifts a_ρ(x, t), ρ = 1, ..., l, and the matrix of local variances (diffusion parameters) b_ρσ(x, t), ρ, σ = 1, ..., l. Then, selecting measure Q described in the formulation of Theorem 4.1 of Stratonovich's monograph [48], we obtain the Radon–Nikodym derivative

P(dx_0^T | x(0))/Q(dx_0^T | x(0)) = exp{ ∫_0^T ∑_{ρ,σ} a_ρ b^{-1}_{ρσ} [d*x_σ − (1/2) a_σ dt] }.   (5.10.11)
Applying Green's formula and taking into account the fact that the integral over the boundary vanishes, we obtain that

h^{P/Q}(t) = −(1/4) ∑_{ρ,σ,π} ∫ b_σπ p_st (∂/∂x_π)[a_ρ b^{-1}_{ρσ}] dx_1 ··· dx_l
          = −(1/4) E[ ∑_{ρ,σ,π} b_σπ (∂/∂x_π)(a_ρ b^{-1}_{ρσ}) ].
If matrix b_ρσ is non-singular and independent of x, then the following simple formula takes place:

h^{P/Q} = −(1/4) ∑_{ρ,σ,π} E[ b_σπ b^{-1}_{σρ} ∂a_ρ/∂x_π ]
       = −(1/4) E[ ∑_ρ ∂a_ρ/∂x_ρ ] ≡ −(1/4) E[div a].
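For a linear drift a(x) = −Bx with unit diffusion matrix b = I (an assumed multicomponent Ornstein-Uhlenbeck example, not taken from the text), the stationary covariance Σ solves the Lyapunov equation BΣ + ΣBᵀ = I, and the multivariate analogue of (5.10.4), (1/2)E[aᵀ b⁻¹ a], can be compared with −(1/4)E[div a] = (1/4) tr B:

```python
import numpy as np

B = np.array([[2.0, 0.5],
              [0.5, 1.0]])   # symmetric positive definite drift matrix (assumed)
l = B.shape[0]

# Stationary covariance: B Sigma + Sigma B^T = I, solved by vectorization
# (with row-major vec, kron(I, B) acts as X -> X B^T and kron(B, I) as X -> B X).
K = np.kron(B, np.eye(l)) + np.kron(np.eye(l), B)
Sigma = np.linalg.solve(K, np.eye(l).ravel()).reshape(l, l)

h_quadratic = 0.5 * np.trace(B.T @ B @ Sigma)   # (1/2) E[a^T b^{-1} a]
h_div = 0.25 * np.trace(B)                      # -(1/4) E[div a]
print(h_quadratic, h_div)
```

For symmetric B the solution is Σ = B⁻¹/2, which makes the agreement of the two expressions exact; the scalar case B = β reproduces β/4.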
1. The results and the methods stated in Section 5.3 can be generalized to the continuous-time case. It is assumed that the joint process {ξ(t)} = {x(t), y(t)} is Markov. The theory of Markov processes allows us to calculate entropy H^{P/Q}_{ξ_a^T} = H^{P/Q}_{x_a^T y_a^T}, where Q is a probability measure such that the measure P is absolutely continuous with respect to it. It is convenient to select measure Q in such a way that processes {x(t)} and {y(t)} are independent for the selected measure:
will be valid. Note that they are analogous to the relationships (5.3.1), (5.3.6), (5.2.5)
from the respective discrete version.
In the case of continuous time it is convenient to introduce the entropy densities

$$h^{P/Q}_{xy}(t) = \lim_{\tau\to 0} \frac{1}{\tau}\, H^{P/Q}_{x_t^{t+\tau} y_t^{t+\tau} \mid x_a^t y_a^t}, \qquad (5.11.4)$$

$$h^{P/Q_2}_{y}(t) = \lim_{\tau\to 0} \frac{1}{\tau}\, H^{P/Q_2}_{y_t^{t+\tau} \mid y_a^t}, \qquad (5.11.5)$$

$$h^{P/Q_1}_{x}(t) = \lim_{\tau\to 0} \frac{1}{\tau}\, H^{P/Q_1}_{x_t^{t+\tau} \mid x_a^t}, \qquad (5.11.6)$$

$$h^{P/Q_2}_{y|x}(t) = \lim_{\tau\to 0} \frac{1}{\tau}\left[H^{P/Q_2}_{y_a^{t+\tau} \mid x_a^{t+\tau}} - H^{P/Q_2}_{y_a^{t} \mid x_a^{t}}\right]. \qquad (5.11.7)$$
As is seen from (5.11.4), in a stationary case density h^{P/Q}_{xy} does not depend on t, and it is reasonable to determine the other entropy densities with the help of an additional passage to the limit a → −∞. In such a case the entropies become strictly proportional to τ, and the passage to the limit τ → 0 becomes redundant. Moreover, formulae (5.11.5), (5.11.6) are replaced with the following:
5.11 Entropy of a composite, conditional Markov process and its components 163
$$h^{P/Q_2}_{y}(t) = \frac{1}{\tau}\, H^{P/Q_2}_{y_t^{t+\tau} \mid y_{-\infty}^t}, \qquad h^{P/Q_1}_{x|y}(t) = \lim_{a\to-\infty} \frac{1}{\tau}\left[H^{P/Q_1}_{x_a^{t+\tau} \mid y_a^{t+\tau}} - H^{P/Q_1}_{x_a^{t} \mid y_a^{t}}\right]$$
(similarly for the other pair h^{P/Q_1}_x, h^{P/Q_2}_{y|x}). At the same time, relationships (5.11.8), (5.11.9) retain their significance. All these entropy densities are constant in a stationary case. Furthermore, we can prove that they are equivalent to the entropy rates; an analogous statement is valid for density h^{P/Q_1}_{x|y} (as well as for h^{P/Q_2}_{y|x}) in a stationary case. All these statements extend the respective results proven in Section 5.3 to the continuous-time case.
Generalizing the methods of Section 5.3, we can compute the entropy H^{P/Q_2}_{y_0^T} or its density h^{P/Q_2}_y for part of the components {y} of the Markov process ξ, and the entropy H^{P/Q_1}_{x_0^T | y_0^T} or density h^{P/Q_1}_{x|y} of the conditional Markov process {x(t)}, as well as the analogous entropies H^{P/Q_1}_{x_0^T}, h^{P/Q_1}_x and H^{P/Q_2}_{y_0^T | x_0^T}, h^{P/Q_2}_{y|x}.
Now consider entropy H^{P/Q_2}_{y_t^{t+\tau} \mid y_0^t}. It can be represented in the form

$$H^{P/Q_2}_{y_t^{t+\tau} \mid y_0^t} = E\left[\ln \frac{P[dy_t^{t+\tau} \mid y_0^t]}{Q_2[dy_t^{t+\tau} \mid y(t)]}\right]. \qquad (5.11.10)$$
Then we have

$$\frac{P[dy_t^{t+\tau} \mid y_0^t]}{Q_2[dy_t^{t+\tau} \mid y(t)]} = E\left[\frac{P[dy_t^{t+\tau} \mid x(t), y_0^t]}{Q_2[dy_t^{t+\tau} \mid y(t)]} \,\middle|\, y_0^t\right] \qquad (5.11.11)$$
or, equivalently,

$$H^{P/Q_2}_{y_t^{t+\tau} \mid y_0^t} = E\int\left[\int P(dy_t^{t+\tau} \mid \xi)\, W(d\xi)\right] \ln \frac{\int P(dy_t^{t+\tau} \mid \xi)\, W(d\xi)}{Q_2(dy_t^{t+\tau} \mid y)}, \qquad (5.11.12a)$$
where averaging E is taken only over random variables W (·). Note that these vari-
ables W (·) constitute the secondary a posteriori W -process, which is a Markov pro-
cess with known transition probabilities.
Formulae (5.11.12), (5.11.12a) appear to be a generalization of formulae (5.3.15), (5.3.16) of the discrete version. They are valid for all τ, but it is more convenient to use them for small τ if we want to determine h^{P/Q_2}_y. The particular examples considered below confirm this statement.
2. Now let {x(t)} be a Markov process with a discrete set of states, similar to the process considered in Section 5.9. It is characterized by the differential transition matrix π_t(x, x′) (introduced in Section 5.9) defining the transition probabilities P(ξ(t + Δ) | ξ(t)). Selecting a Poisson measure for transition points as Q₁, we obtain density h^{P/Q_1}_x(t) of entropy H^{P/Q_1}_{x_0^T}:

$$h^{P/Q_1}_x(t) = \sum_{x_t} P(x(t))\left\{1 + \sum_{x' \neq x_t}\left[\ln \pi_t(x(t), x') - 1\right]\pi_t(x(t), x')\right\}. \qquad (5.11.13)$$
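Formula (5.11.13) can be illustrated numerically. The sketch below (not part of the original text; plain Python) evaluates it for a hypothetical two-state process with transition rates μ (for 1 → 2) and ν (for 2 → 1) in its stationary regime, and compares the result with the closed form 1 + [μν/(μ+ν)] ln(μν/e²) that appears below as (5.11.19):

```python
import math

def entropy_density_discrete(P, pi):
    """Formula (5.11.13): h = sum_x P(x) [1 + sum_{x'!=x} (ln pi(x,x') - 1) pi(x,x')]."""
    h = 0.0
    for x, Px in enumerate(P):
        inner = 1.0
        for xp, rate in enumerate(pi[x]):
            if xp != x:
                inner += (math.log(rate) - 1.0) * rate
        h += Px * inner
    return h

mu, nu = 0.7, 1.3                      # illustrative transition rates 1->2 and 2->1
P = [nu / (mu + nu), mu / (mu + nu)]   # stationary probabilities, cf. (5.9.7)
pi = [[-mu, mu], [nu, -nu]]            # differential transition matrix
h = entropy_density_discrete(P, pi)
closed = 1.0 + mu * nu / (mu + nu) * math.log(mu * nu / math.e**2)
```

The two values agree to machine precision, which is a consistency check between (5.11.13) and the two-state closed form.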
Process {y(t)} = {y₁(t), ..., y_l(t)} is constructed in the following way. Given are the vector of drifts a_ρ(x(t), y(t), t) (dependent not only on y(t) but also on x(t)) and the matrix of local variances b_{ρσ}(y(t), t); this matrix is assumed to be non-singular and independent of x(t).
Then process y(t) is the diffusion process considered in Section 5.10 for a fixed realization of {x(t)}. Applying the results derived therein, we can find the density h^{P/Q_2}_{y}(t | x_0^T) of the entropy H^{P/Q_2}_{y_0^T | x_0^T}. As Q₂ we choose the measure of a diffusion process with zero drift and the same matrix of local variances b_{ρσ}(y, t). According to (5.10.12) we get

$$h^{P/Q_2}_{y}(t \mid x_0^T) = \frac{1}{2}\sum_{\rho,\sigma} E\left[a_\rho(x(t), y(t), t)\, b^{-1}_{\rho\sigma}(y(t), t)\, a_\sigma(x(t), y(t), t) \,\middle|\, x_0^T\right]. \qquad (5.11.14)$$
In order to find density h^{P/Q_2}_{y|x} of entropy H^{P/Q_2}_{y_0^T | x_0^T} it remains to carry out extra averaging with respect to x_0^T in (5.11.14):

$$h^{P/Q_2}_{y|x}(t) = \frac{1}{2}\sum_{\rho,\sigma} E\left[a_\rho(x(t), y(t), t)\, b^{-1}_{\rho\sigma}(y(t), t)\, a_\sigma(x(t), y(t), t)\right]. \qquad (5.11.15)$$
Thus, due to additivity (5.11.2), (5.11.9), we have found entropy density h^{P/Q}_{xy} of the joint (combined) Markov process {ξ(t)} = {x(t), y(t)}: we should substitute expressions (5.11.13), (5.11.15) into (5.11.9). According to (5.11.1), the other entropy densities h^{P/Q_2}_y and h^{P/Q_1}_{x|y} are connected by the analogous relationship (5.11.8). Hence, in order to complete the computation of all densities, it only remains to compute one of them.
The above-described process {y(t)} (without a fixed realization of {x(t)}) is a
non-Markov process. We determine the corresponding entropy by simply employing
formula (5.11.12).
For small τ it follows from (5.10.11) that

$$\frac{P[dy_t^{t+\tau} \mid x(t), y(t)]}{Q_2[dy_t^{t+\tau} \mid y(t)]} = \exp\left\{\sum_{\rho,\sigma} a_\rho\, b^{-1}_{\rho\sigma}\left[y_\sigma(t+\tau) - y_\sigma(t) - \frac{\tau}{2}\, a_\sigma\right]\right\} + o(\tau)$$
$$= 1 + \sum_{\rho,\sigma} a_\rho\, b^{-1}_{\rho\sigma}\left[y_\sigma(t+\tau) - y_\sigma(t) - \frac{\tau}{2}\, a_\sigma\right] + \frac{1}{2}\sum_{\rho,\sigma,\pi,\nu} a_\rho\, b^{-1}_{\rho\sigma}\left[y_\sigma(t+\tau) - y_\sigma(t)\right]\left[y_\pi(t+\tau) - y_\pi(t)\right] b^{-1}_{\pi\nu}\, a_\nu + o(\tau).$$
Substituting the latter expression into (5.11.11) and denoting $\sum_{x(t)} \ldots W_t(x_t) = E_{\mathrm{ps}}[\ldots]$ for brevity, we obtain

$$\frac{P[dy_t^{t+\tau} \mid y_0^t]}{Q_2[dy_t^{t+\tau} \mid y(t)]} = 1 + \sum_{\rho,\sigma} E_{\mathrm{ps}}[a_\rho]\, b^{-1}_{\rho\sigma}\left[y_\sigma(t+\tau) - y_\sigma(t)\right]$$
$$+ \frac{1}{2}\sum_{\rho,\sigma,\pi,\nu}\left(b^{-1}_{\rho\sigma}\left[y_\sigma(t+\tau) - y_\sigma(t)\right]\left[y_\pi(t+\tau) - y_\pi(t)\right] b^{-1}_{\pi\nu} - \tau\, b^{-1}_{\rho\nu}\right) E_{\mathrm{ps}}[a_\nu a_\rho] + o(\tau).$$
166 5 Computation of entropy for special cases. Entropy of stochastic processes
Due to (5.11.10), (5.11.12) we should take the logarithm of the last expression. When taking the indicated logarithm we need to keep only the following terms:

$$\ln \frac{P[dy_t^{t+\tau} \mid y_0^t]}{Q_2[dy_t^{t+\tau} \mid y(t)]} = \sum_{\rho,\sigma} E_{\mathrm{ps}}[a_\rho]\, b^{-1}_{\rho\sigma}\left[y_\sigma(t+\tau) - y_\sigma(t)\right]$$
$$+ \frac{1}{2}\sum_{\rho,\sigma,\pi,\nu}\left(b^{-1}_{\rho\sigma}\left[y_\sigma(t+\tau) - y_\sigma(t)\right]\left[y_\pi(t+\tau) - y_\pi(t)\right] b^{-1}_{\pi\nu} - \tau\, b^{-1}_{\rho\nu}\right) E_{\mathrm{ps}}[a_\nu a_\rho]$$
$$- \frac{1}{2}\left(\sum_{\rho,\sigma} E_{\mathrm{ps}}[a_\rho]\, b^{-1}_{\rho\sigma}\left[y_\sigma(t+\tau) - y_\sigma(t)\right]\right)^2 + o(\tau). \qquad (5.11.16)$$
The corresponding averaging will be carried out in several steps: first, over y_t^{t+τ} with fixed x(t), y_0^t; second, over x(t) (with weight W_t(x(t))) with fixed y_0^t; and, finally, over y_0^t or, equivalently, over W. At the first stage of averaging we need to account for the formulae

$$E\left[y_\sigma(t+\tau) - y_\sigma(t) \mid x(t), y(t)\right] = a_\sigma \tau + o(\tau), \qquad E\left[\left(y_\sigma(t+\tau) - y_\sigma(t)\right)\left(y_\pi(t+\tau) - y_\pi(t)\right) \mid x(t), y(t)\right] = b_{\sigma\pi}\tau + o(\tau),$$

so that
$$\frac{1}{\tau}\, E\left[\ln \frac{P[dy_t^{t+\tau} \mid y_0^t]}{Q_2[dy_t^{t+\tau} \mid y(t)]} \,\middle|\, x(t), y_0^t\right] = \sum_{\rho,\sigma} E_{\mathrm{ps}}[a_\rho]\, b^{-1}_{\rho\sigma}\, a_\sigma - \frac{1}{2}\sum_{\rho,\sigma} E_{\mathrm{ps}}[a_\rho]\, b^{-1}_{\rho\sigma}\, E_{\mathrm{ps}}[a_\sigma] + o(1).$$
Further averaging results in

$$\frac{1}{\tau}\, H^{P/Q_2}_{y_t^{t+\tau} \mid y_0^t} = \frac{1}{2}\, E\left[\sum_{\rho,\sigma} E_{\mathrm{ps}}[a_\rho]\, b^{-1}_{\rho\sigma}\, E_{\mathrm{ps}}[a_\sigma]\right] + o(1).$$
Next, passing to the limit τ → 0, we find that

$$h^{P/Q_2}_y(t) = \frac{1}{2}\, E\left[\sum_{\rho,\sigma} E_{\mathrm{ps}}[a_\rho]\, b^{-1}_{\rho\sigma}\, E_{\mathrm{ps}}[a_\sigma]\right]$$
$$= \frac{1}{2}\int P(dW_t, dy(t)) \sum_{x,x'} \sum_{\rho,\sigma} W_t(x)\, a_\rho(x, y(t), t)\, b^{-1}_{\rho\sigma}(y(t), t)\, a_\sigma(x', y(t), t)\, W_t(x'). \qquad (5.11.17)$$
The following fact helps to find the distribution P(dW_t, dy(t)) involved in the latter formula: the variables {W_t, y_t} constitute a Markov process, called a secondary a posteriori W-process in Stratonovich's monograph [48]. The probability density function p_t(W, y) of process {W_t, y(t)} satisfies a particular Fokker–Planck equation that was also derived in that monograph. It takes the form
$$\frac{\partial p_t}{\partial t} = -\sum_{x,x'} \frac{\partial}{\partial W(x')}\left[\pi_{xx'}\, W(x)\, p_t\right] - \sum_\rho \frac{\partial}{\partial y_\rho}\left[E_{\mathrm{ps}}[a_\rho]\, p_t\right]$$
$$+ \frac{1}{2}\sum_{x,x'}\sum_{\rho,\sigma} \frac{\partial^2}{\partial W(x)\,\partial W(x')}\left[W(x)\left(a_\rho(x) - E_{\mathrm{ps}}[a_\rho]\right) b^{-1}_{\rho\sigma}\left(a_\sigma(x') - E_{\mathrm{ps}}[a_\sigma]\right) W(x')\, p_t\right]$$
$$+ \sum_{x,\rho} \frac{\partial^2}{\partial W(x)\,\partial y_\rho}\left[W(x)\left(a_\rho(x) - E_{\mathrm{ps}}[a_\rho]\right) p_t\right] + \frac{1}{2}\sum_{\rho,\sigma} \frac{\partial^2}{\partial y_\rho\, \partial y_\sigma}\left[b_{\rho\sigma}\, p_t\right]. \qquad (5.11.18)$$
Example 5.11. Let {x(t)} be a process with two states and transition matrix (5.9.6). In Section 5.9 we found the corresponding entropy density

$$h^{P/Q_1}_x = 1 + \frac{\mu\nu}{\mu+\nu} \ln \frac{\mu\nu}{e^2}. \qquad (5.11.19)$$

In the given case formula (5.11.15) yields

$$h^{P/Q_2}_{y|x} = \frac{1}{2b}\left[P(x=1)\, a^2(1) + P(x=2)\, a^2(2)\right]$$

or, equivalently,

$$h^{P/Q_2}_{y|x} = \frac{\nu\, a^2(1) + \mu\, a^2(2)}{2b(\mu+\nu)} \qquad (5.11.20)$$

if we take into account (5.9.7).
The sum of expressions (5.11.19), (5.11.20) yields entropy rate h^{P/Q}_{xy} of the combined process.
Now we proceed to the computation of entropy rate h^{P/Q_2}_y for the example in question. If we introduce the variable z_t = W_t(1) − W_t(2), then the mean drift E_ps[a] can be represented as

$$E_{\mathrm{ps}}[a] = \frac{1}{2}\left[a(1) + a(2)\right] + \frac{1}{2}\left[a(1) - a(2)\right] z_t,$$

so that (5.11.17) gives

$$h^{P/Q_2}_y = \frac{1}{8b}\left[a(1) + a(2)\right]^2 + \frac{1}{4b}\, E\left[\left[a^2(1) - a^2(2)\right] z + \frac{1}{2}\left[a(1) - a(2)\right]^2 z^2\right]. \qquad (5.11.21)$$
It is easy to see that averaging of the a posteriori probabilities W_t(x) = P[x_t = x | y_0^t] yields the a priori probabilities P[x_t = x], which have form (5.9.7) in a stationary case. Therefore,

$$E[z] = E[W_t(1) - W_t(2)] = P[x=1] - P[x=2] = \frac{\nu - \mu}{\mu + \nu},$$
and formula (5.11.21) can be reduced to

$$h^{P/Q_2}_y = \frac{\nu\, a^2(1) + \mu\, a^2(2)}{2b(\mu+\nu)} - \frac{\left[a(1) - a(2)\right]^2}{8b}\int_{-1}^{1} (1 - z^2)\, p_{\mathrm{st}}(z)\, dz. \qquad (5.11.22)$$
In the given case process {z_t} appears to be Markov itself. The corresponding Fokker–Planck equation

$$\dot p(z) = -\frac{\partial}{\partial z}\left\{\left[\nu - \mu - (\mu+\nu)\, z\right] p(z)\right\} + \frac{1}{2b}\, \frac{\partial^2}{\partial z^2}\left[(1 - z^2)^2\, p(z)\right]$$

follows from (5.11.18).
Equating derivative ṗ_st(z) to zero and integrating the resulting equation, we obtain the stationary probability density function

$$p_{\mathrm{st}}(z) = \frac{\mathrm{const}}{(1-z^2)^2}\, \exp\left\{2b \int_0^z \frac{\nu - \mu - (\mu+\nu)\, x}{(1-x^2)^2}\, dx\right\}. \qquad (5.11.23)$$
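An internal consistency check of (5.11.23), not present in the original text, can be carried out by quadrature. The sketch below (Python with NumPy, illustrative parameter values) builds the density numerically; it must reproduce the a priori mean E[z] = (ν − μ)/(μ + ν) noted above, and it yields the integral entering (5.11.22):

```python
import numpy as np

def trapz(y, x):
    """Plain trapezoidal rule (avoids dependence on a particular NumPy version)."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x) / 2))

b, mu, nu = 1.0, 0.8, 1.5            # illustrative values
z = np.linspace(-0.9999, 0.9999, 200001)
f = 2 * b * (nu - mu - (mu + nu) * z) / (1 - z**2) ** 2   # integrand of the exponent

# cumulative trapezoid of f, shifted so the integral is taken from z = 0
cum = np.concatenate(([0.0], np.cumsum((f[1:] + f[:-1]) * np.diff(z) / 2)))
cum -= cum[len(z) // 2]

p = np.exp(cum - cum.max()) / (1 - z**2) ** 2   # unnormalized (5.11.23)
p /= trapz(p, z)                                # normalize the density

mean_z = trapz(z * p, z)                 # should equal (nu - mu)/(mu + nu)
integral = trapz((1 - z**2) * p, z)      # the integral entering (5.11.22)
```

The density vanishes rapidly at z = ±1, so truncating the grid slightly inside the endpoints introduces no appreciable error.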
The integral with this probability density function involved in expression (5.11.22) has been computed in [48] (translated to English [52]) and turned out to be equal to

$$\int_{-1}^{1} (1 - z^2)\, p_{\mathrm{st}}(z)\, dz = 4 K_q\!\left(b\sqrt{\mu\nu}\right)\left[2 K_q\!\left(b\sqrt{\mu\nu}\right) + \sqrt{\frac{\nu}{\mu}}\, K_{q+1}\!\left(b\sqrt{\mu\nu}\right) + \sqrt{\frac{\mu}{\nu}}\, K_{q-1}\!\left(b\sqrt{\mu\nu}\right)\right]^{-1} \qquad (5.11.24)$$

where q = 2b(ν − μ) and K_q(z) is McDonald's function (see Ryzhik and Gradstein [36]; the corresponding English translation is [37]).
The substitution of (5.11.24) into (5.11.22) solves the problem of determining entropy rate h^{P/Q_2}_y of the stationary non-Markov process y. Combining (5.11.19), (5.11.20), (5.11.22), it is easy to find entropy h^{P/Q_1}_{x|y} = h^{P/Q_1}_x + h^{P/Q_2}_{y|x} − h^{P/Q_2}_y of the conditional Markov process x(t).
3. The aforementioned derivation of formula (5.11.17) is also applicable in other cases, for instance, when process {x(t)} is a diffusion Markov process or constitutes a portion of the components of a combined diffusion Markov process {ξ(t)} = {x(t), y(t)}. We consider the last case in more detail. Suppose the combined Markov process has components x = (x₁, ..., x_m) and y = (y_{m+1}, ..., y_{m+l}). The a posteriori probability density ω_t(x) then satisfies the following equation derived in the indicated monograph [48] (translated to English [52]):

$$\frac{\partial \omega_t(x)}{\partial t} = \frac{1}{2}\sum_{\alpha,\beta=1}^{m} \frac{\partial^2}{\partial x_\alpha\, \partial x_\beta}\left[b_{\alpha\beta}\, \omega_t\right] - \sum_{\alpha=1}^{m} \frac{\partial}{\partial x_\alpha}\left[a_\alpha\, \omega_t + \sum_{\sigma,\rho=m+1}^{m+l} b_{\alpha\rho}\, b^{-1}_{\rho\sigma}\, \omega_t\left(\frac{d^{*} y_\sigma}{dt} - E_{\mathrm{ps}}[a_\sigma]\right)\right]$$
$$+ \sum_{\rho,\sigma=m+1}^{m+l}\left(a_\rho - E_{\mathrm{ps}}[a_\rho]\right) b^{-1}_{\rho\sigma}\, \omega_t\left(\frac{d^{*} y_\sigma}{dt} - E_{\mathrm{ps}}[a_\sigma]\right),$$

where the asterisk denotes the symmetrized differential.
In the given case the entropies of process {x(t)} and of the combined process {x(t), y(t)} can be found by applying formula (5.9.4). Calculation of entropy H^{P/Q_2}_{y_0^T} and its density h^{P/Q_2}_y(t) requires a special approach, because process {y(t)} itself is not Markov.
Fundamentally, the respective calculation can be carried out in the same way as in clause 2. We use formula (5.11.12), selecting as measure Q₂ a Poisson measure (for the transition time moments of process y(t)) with a unit density. For small τ we have

$$\frac{P[dy_t^{t+\tau} \mid x(t), y(t)]}{Q_2[dy_t^{t+\tau} \mid y(t)]} =
\begin{cases}
e^{\tau}\left[1 + \pi_t(x(t), y(t), y(t))\,\tau\right] + O(\tau^2) & \text{for } y(t+\tau) = y(t),\\
e^{\tau}\,\pi_t(x(t), y(t), y(t+\tau))\,\tau + O(\tau^2) & \text{for } y(t+\tau) \neq y(t).
\end{cases} \qquad (5.11.27)$$
τ
First of all, here we implement averaging E [. . . | x(t), yt0 (t)] over yt+
t ; then, av-
eraging E [. . . | y0 ] over x(t) with weight Wt ; and then, finally, averaging over other
t
variables.
Next, (5.11.27) entails the following formula for the entropy density:

$$h^{P/Q_2}_y(t) = E\left\{1 + \sum_{y' \neq y(t)} E_{\mathrm{ps}}\left[\pi(x(t), y(t), y')\right]\left(\ln E_{\mathrm{ps}}\left[\pi(x(t), y(t), y')\right] - 1\right)\right\}. \qquad (5.11.28)$$
The only difference between the latter formula and (5.9.4) (i.e. formula (5.9.4) taken for ξ = y) is that the π(y, y′) are replaced with the a posteriori means

$$E_{\mathrm{ps}}\left[\pi(x(t), y(t), y')\right] = \sum_x W_t(x)\, \pi(x, y(t), y') \qquad (5.11.29)$$

and that averaging with respect to {W_t, y(t)} with weight P[dW_t dy(t)] is considered instead of averaging with respect to y(t) with weight P(dy). Process {W_t, y(t)} appears to be a secondary a posteriori Markov process, and thereby it is not difficult to find its transition probabilities, which define, in particular, the stationary distribution P_st[dW_t dy(t)].
It is interesting to compare formula (5.11.28) with the conditional entropy density

$$h^{P/Q_2}_{y|x}(t) = E\left\{1 + \sum_{y' \neq y(t)} \pi(x(t), y(t), y')\left[\ln \pi(x(t), y(t), y') - 1\right]\right\}. \qquad (5.11.30)$$

We have written down the last expression similarly to (5.11.29), first supposing that x(t) is fixed and then averaging with respect to it.
The provided method of computing entropy density h_y can be extended to the case when process {x(t)} is not Markov itself but the combined process {x(t), y(t)} is a Markov process (with discrete states). Suppose it is described by the differential transition matrix π_t(x, y, x′, y′). In this case the form of the resultant formula (5.11.28) remains unchanged. Density h^{P/Q}_x can be determined analogously.
It is seen from the aforesaid that the described method of computing entropy for a portion of the components of a Markov process has a wide range of applications. The most difficult stage is finding the distribution P(dW_t, dy(t)) of the secondary a posteriori W-process.
Chapter 6
Information in the presence of noise. Shannon’s
amount of information
$$H_x = H_\eta. $$

$$H_x = H_\eta + H_{\zeta|\eta}, \qquad H_{\zeta|\eta} = \sum_\eta P(\eta)\, H_\zeta(\mid \eta), \qquad H_\zeta(\mid \eta) = -\sum_\zeta P(\zeta \mid \eta) \ln P(\zeta \mid \eta).$$
$$P(y \in G_l \mid x \in E_k) = 0 \quad \text{if } l \neq k. \qquad (6.1.4)$$

(b) Transitions from all points of region E_k occur with equal probabilities and lead to y ∈ G_k:

$$P(y \mid x \in E_k) = P(y \mid k). \qquad (6.1.5)$$
It is easy to see that it follows from (a) that

$$H_{l|k} = 0 \qquad (6.1.6)$$

or, equivalently,

$$H_k = H_{kl} \qquad (6.1.7)$$

(k is the index of region E_k, l is the index of region G_l).
Theorem 6.2. From an information theoretic point of view, simple noise is equiva-
lent to a non-random degenerate transformation k = k(x) (where k(x) is the index
of subset Ek containing x), i.e. Hx|y = Hx|k .
$$P(x \mid y) = \frac{P(x)\, P(y \mid x)}{\sum_{x'} P(x')\, P(y \mid x')} = \frac{P(x)\, P(y \mid l)}{\sum_{x' \in E_l} P(x')\, P(y \mid l)}, \qquad \text{where } y \in G_l$$

(l = k), and reducing the resulting fraction by the common factor P(y | l), we obtain

$$P(x \mid y) = \frac{P(x)}{P(E_l)} \ \text{ if } x \in E_l, \qquad P(x \mid y) = 0 \ \text{ if } x \notin E_l,$$

i.e.

$$P(x \mid y) = p(x \mid l) \qquad (y \in G_l). \qquad (6.1.9)$$
When such an equality holds true, we say that variable l is a sufficient variable, or a sufficient statistic, replacing y. Hence, we have obtained that index l of a set serves as a sufficient statistic in the given case. It follows from equality (6.1.9) that H_x(| y) = H_x(| l) and (after averaging this equality) also H_{x|y} = H_{x|l}.
One can see from the definition of simple noise and Theorem 6.2 that the notion of simple noise is reversible: the noise corresponding to the inverse transition with probabilities P(x | y) is simple if the noise of the original transition with probabilities P(y | x) is simple.
Indeed, substituting (6.1.4) into (6.1.3), it is easy to verify that the only probabilities P(x, y) different from zero are those for which x and y hit regions E_k, G_k with the same index k. Probabilities P(x | y) are null if the indices k and l of the regions E_k ∋ x, G_l ∋ y are not equal. Therefore, property (6.1.4) is reversible. In addition to (6.1.6), (6.1.7) the relations

$$H_{k|l} = 0, \qquad H_l = H_{kl} = H_k \qquad (6.1.11)$$

are valid. Furthermore, equality (6.1.9) is evidently an inverse of equality (6.1.5). Thus, the indicated reversibility has been proven: besides the degenerate transformation k = k(x) we can also consider a non-randomized degenerate transformation l = l(y), where l is the index of the region G_l that point y belongs to.
6.1 Information losses under degenerate transformations and simple noise 177
Theorem 6.2 entails that noise destroys information, because such destruction takes place for the degenerate transformation k = k(x) according to Theorem 6.1.
3. In order to accurately convey information in the case of either a degenerate transformation or simple noise, we need to connect information not with variable x (which is distorted under the transformation) but with variable η = k (which remains constant, because l = k). Consequently, the amount of transmitted information will be equal to

$$I = H_k. \qquad (6.1.12)$$
Next we reduce this relation to another form by employing

$$H_{kx} = H_x + H_{k|x} = H_k + H_{x|k}, \qquad (6.1.13)$$

which follows from the definition of conditional entropies (Section 1.3). For fixed x the number k(x) of the region E_k ∋ x is completely determined. Hence,

$$H_{k|x} = 0 \qquad (6.1.14)$$

and, consequently, I = H_k = H_x − H_{x|k}. Taking into account the sufficiency of statistic l = k, i.e. H_{x|k} = H_{x|y}, we obtain

$$I = H_x - H_{x|y}. \qquad (6.1.15)$$
Furthermore, in analogy with (6.1.14) we have H_{l|y} = 0 or H_{k|y} = 0 (which is the same, because l = k with probability 1). Therefore,

$$I = H_k = H_k - H_{k|y}. \qquad (6.1.16)$$
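The chain of identities (6.1.12)–(6.1.16) can be checked on a small simple-noise channel. The sketch below is not part of the original text (plain Python; the regions and probabilities are invented for illustration): E₁ = {0, 1} and E₂ = {2, 3} map only into the disjoint groups G₁ = {0, 1}, G₂ = {2, 3}, with transition probabilities depending only on the region index, and the information I = H_x − H_{x|y} coincides with H_k:

```python
import math

def H(p):
    """Entropy (nats) of a distribution given as a list of probabilities."""
    return -sum(q * math.log(q) for q in p if q > 0)

# x in {0,1,2,3}; regions E1 = {0,1}, E2 = {2,3} with indices k = 0, 1
Px = [0.1, 0.3, 0.2, 0.4]
region = [0, 0, 1, 1]
# simple noise: P(y | x) depends only on the region of x and stays inside G_k
Py_given_k = [[0.5, 0.5, 0.0, 0.0],    # from E1 transitions land in G1 = {0,1}
              [0.0, 0.0, 0.25, 0.75]]  # from E2 transitions land in G2 = {2,3}

Pxy = [[Px[x] * Py_given_k[region[x]][y] for y in range(4)] for x in range(4)]
Py = [sum(Pxy[x][y] for x in range(4)) for y in range(4)]

Hx = H(Px)
Hx_given_y = H([Pxy[x][y] for x in range(4) for y in range(4)]) - H(Py)
I = Hx - Hx_given_y                     # formula (6.1.15)

Pk = [Px[0] + Px[1], Px[2] + Px[3]]
Hk = H(Pk)                              # formula (6.1.12): I = H_k
```

The two quantities agree to machine precision, as the sufficiency of the region index requires.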
The stated results refer to the case of simple noise; however, they can be implicitly (in the asymptotic sense) extended to the case of arbitrary noise, as can be seen from the following (Section 7.6), where approximate relations (7.6.19) are derived instead of the exact relations (6.1.17). In order to do so we need to consider not single random variables x and y, but sequences x₁, ..., x_n and y₁, ..., y_n of such variables as n → ∞. Just as the case of arbitrary probabilities is asymptotically reduced to the case of equally probable possibilities (refer to Sections 1.4 and 1.5), the case of arbitrary noise is asymptotically (as n → ∞) reduced to the case of simple noise. That is why, if we connect information with a corresponding (approximate) sufficient statistic k(x) = 1, ..., M (whose number of values is bounded by ln M ⩽ H_k − H_{k|l}), then in the limit n → ∞ we can avoid the errors invoked by distortions. This fact is related to Shannon's famous theorem and is one of its possible interpretations (see Chapter 7).
The mutual information I_xy = H_x − H_{x|y} (6.2.1) reflects the average amount of uncertainty that has disappeared due to the reception of information. Such an interpretation of the amount of information was introduced earlier in Section 1.1 (equation (1.1.2)).
Applying the regular relationship H_{x|y} = H_{xy} − H_y (refer to (1.3.4)), we can represent formula (6.2.1) in the form

$$I_{xy} = H_x + H_y - H_{xy}. \qquad (6.2.2)$$

The symmetry of this value is seen from the last formula: it remains unchanged if x and y switch roles. Hence, the same amount of uncertainty will disappear on average if x is treated as the observed quantity and y is treated as an independent variable:

$$I_{xy} = H_y - H_{y|x}. \qquad (6.2.3)$$
The relations among the values H_x, H_y, H_{xy}, H_{x|y}, H_{y|x}, I_{xy} are depicted graphically in Figure 6.1. Since conditional entropy does not exceed regular entropy (Theorem 1.6), information I_xy is non-negative.
6.2 Mutual information for discrete random variables 179
Fig. 6.1 Relationships between the information characteristics of two random variables
$$I(x, y) = H(x) + H(y) - H(x, y) = \ln \frac{P(x, y)}{P(x)\, P(y)} = \ln \frac{P(x \mid y)}{P(x)}. \qquad (6.2.5)$$

Further, (6.2.5) entails the following formula for the joint distribution: P(x, y) = P(x) P(y) e^{I(x,y)}. Equivalently, the random information can be written as

$$I(x, y) = -\ln \frac{P(x)}{P(x \mid y)}. \qquad (6.2.9)$$
Employing the inequality

$$\ln \frac{P(x)}{P(x \mid y)} \leqslant \frac{P(x)}{P(x \mid y)} - 1$$

and averaging over x with weight P(x | y), we obtain Σ_x P(x | y)[−I(x, y)] ⩽ Σ_x P(x) − 1 = 0, i.e.

$$E[I(x, y) \mid y] \geqslant 0.$$

Switching x and y in the previous argument, we obtain the second desired inequality: E[I(x, y) | x] ⩾ 0.
It is interesting to note that the reverse inequalities

$$\sum_x I(x, y)\, P(x) \leqslant 0, \qquad \sum_y I(x, y)\, P(y) \leqslant 0$$

take place if we implement averaging with weight P(x) (or P(y)) instead of P(x | y).
Indeed, rewriting the right-hand side of (6.2.9) or (6.2.5) in the form ln[P(y | x)/P(y)] and employing the inequality

$$\ln \frac{P(y \mid x)}{P(y)} \leqslant \frac{P(y \mid x)}{P(y)} - 1,$$

we obtain

$$I(x, y) \leqslant \frac{P(y \mid x)}{P(y)} - 1$$

instead of (6.2.10). Further, we average the latter inequality with weight P(x) and obtain

$$\sum_x I(x, y)\, P(x) \leqslant \frac{1}{P(y)} \sum_x P(y \mid x)\, P(x) - 1 = 0,$$

because

$$\sum_x P(y \mid x)\, P(x) = P(y).$$
Hence, random mutual information I(x, y) assumes plenty of negative values. This is one of the reasons why we treat the corresponding mean value I_xy, and not I(x, y), as the amount of information.
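As a numerical illustration of the preceding argument (not in the original; plain Python, with an invented two-by-two joint distribution), the random information I(x, y) takes negative values, while its mean I_xy and the conditional means E[I | y] remain non-negative:

```python
import math

# a small joint distribution with correlated binary variables (illustrative)
Pxy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
Px = {x: sum(p for (a, _), p in Pxy.items() if a == x) for x in (0, 1)}
Py = {y: sum(p for (_, b), p in Pxy.items() if b == y) for y in (0, 1)}

# random mutual information I(x, y) = ln [P(x, y) / (P(x) P(y))]
I = {(x, y): math.log(Pxy[(x, y)] / (Px[x] * Py[y])) for (x, y) in Pxy}

neg = [v for v in I.values() if v < 0]          # negative values do occur
mean_I = sum(Pxy[k] * I[k] for k in Pxy)        # but the mean I_xy is >= 0
cond = {y: sum(Pxy[(x, y)] / Py[y] * I[(x, y)] for x in (0, 1)) for y in (0, 1)}
# each conditional mean E[I | y] is >= 0, in agreement with the inequality above
```

Here the off-diagonal outcomes give I(x, y) = ln 0.4 &lt; 0, yet every average with weight P(x | y) comes out non-negative.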
$$I(x, y \mid z) = \ln \frac{P(x \mid yz)}{P(x \mid z)} \qquad (6.3.1)$$

or, equivalently,

$$I(x, y \mid z) = H(x \mid z) - H(x \mid y, z).$$

Now we see that conditional or regular mutual information can be expressed in terms of the corresponding conditional or regular entropies, respectively.
Since entropy possesses the property of hierarchical additivity (Section 1.3), mutual information possesses an analogous property as well. Let x from formula (6.2.1) consist of several components, x = (x₁, ..., x_n). Then the formula of hierarchical additivity (1.3.4) can be applied to the entropies H_x, H_{x|y}. According to (6.2.1) we take the difference H_x − H_{x|y}, group terms in pairs and thereby obtain

$$I_{(x_1 \ldots x_n) y} = \left(H_{x_1} - H_{x_1|y}\right) + \left(H_{x_2|x_1} - H_{x_2|x_1 y}\right) + \cdots + \left(H_{x_n|x_1 \ldots x_{n-1}} - H_{x_n|x_1 \ldots x_{n-1} y}\right).$$

But due to (6.3.4) every difference H_{x_k|x_1...x_{k−1}} − H_{x_k|x_1...x_{k−1}y} is nothing else but the conditional mutual information I_{x_k y|x_1...x_{k−1}}. Therefore,

$$I_{(x_1 \ldots x_n) y} = I_{x_1 y} + I_{x_2 y|x_1} + I_{x_3 y|x_1 x_2} + \cdots + I_{x_n y|x_1 \ldots x_{n-1}}. \qquad (6.3.5)$$
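Hierarchical additivity (6.3.5) for n = 2 can be verified numerically. The sketch below is not from the original text (plain Python, with a randomly generated joint distribution of three binary variables); both sides are computed through the entropy identity for conditional mutual information:

```python
import itertools
import math
import random

random.seed(1)
# a random joint distribution of three binary variables (x1, x2, y)
P = {k: random.random() for k in itertools.product((0, 1), repeat=3)}
Z = sum(P.values())
P = {k: v / Z for k, v in P.items()}

def H(vars_):
    """Joint entropy (nats) of the coordinates listed in vars_ (indices 0, 1, 2)."""
    m = {}
    for k, p in P.items():
        key = tuple(k[i] for i in vars_)
        m[key] = m.get(key, 0.0) + p
    return -sum(p * math.log(p) for p in m.values() if p > 0)

def I(a, b, cond=()):
    """Conditional mutual information I_{ab|cond} = H(a,c) + H(b,c) - H(a,b,c) - H(c)."""
    return H(a + cond) + H(b + cond) - H(a + b + cond) - H(cond)

# indices: x1 -> 0, x2 -> 1, y -> 2
lhs = I((0, 1), (2,))                      # I_{(x1 x2) y}
rhs = I((0,), (2,)) + I((1,), (2,), (0,))  # I_{x1 y} + I_{x2 y | x1}
```

The decomposition holds to machine precision for any joint distribution, which is the point of the identity.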
It is not difficult to understand that the same formula is valid for conditional mutual information as well:

$$I_{(x_1 \ldots x_n) y|z} = \sum_{k=1}^{n} I_{x_k y|x_1 \ldots x_{k-1} z}. \qquad (6.3.6)$$

Now we assume that the second random variable is also compound: y = (y₁, ..., y_r). Then, applying formula (6.3.6) to every mutual information I_{x_k y|x_1...x_{k−1}}, we will have

$$I_{x_k (y_1 \ldots y_r)|x_1 \ldots x_{k-1}} = \sum_{l=1}^{r} I_{x_k y_l \mid x_1 \ldots x_{k-1} y_1 \ldots y_{l-1}}.$$

Here, I_{x₂y₂|x₁y₁} = I_{x₂y₂} if P(x₂, y₂ | x₁, y₁) = P(x₂, y₂), i.e. if the pair x₂, y₂ is independent of x₁, y₁.
Just as the property of hierarchical additivity (1.3.4) is valid not only for average entropies but also for random entropies (1.3.6), relations analogous to (6.3.5)–(6.3.8) can be written for random mutual information. For instance,

$$I(x_1, \ldots, x_n;\; y_1, \ldots, y_r) = \sum_{k=1}^{n} \sum_{l=1}^{r} I(x_k, y_l \mid x_1, \ldots, x_{k-1}, y_1, \ldots, y_{l-1}).$$

The reasoning is quite parallel to the previous one if we use (1.3.6) instead of (1.3.4).
2. Conditional mutual information (6.3.3) is non-negative, which can be derived, for example, from formulae (6.3.4) by taking into account Theorems 1.6, 1.6a. This fact allows us to obtain various inequalities from the formulae of hierarchical additivity (6.3.5), (6.3.7): mutual information I_{(x₁...x_n)y} or I_{(x₁...x_n)(y₁...y_r)} on the left-hand side is at least the sum of any portion of the terms on the right-hand side. We present a simple example to illustrate this. Consider the mutual information I_{(x₁x₂)y} between the pair of random variables x₁, x₂ and variable y. Formula (6.3.5) yields I_{(x₁x₂)y} = I_{x₁y} + I_{x₂y|x₁}. Since I_{x₂y|x₁} ⩾ 0, the inequality

$$I_{(x_1 x_2) y} \geqslant I_{x_1 y} \qquad (6.3.9)$$

follows. The equality sign,

$$I_{(x_1 x_2) y} = I_{x_1 y}, \qquad (6.3.10)$$

takes place if and only if I_{x₂y|x₁} = 0, i.e. P(y | x₁, x₂) = P(y | x₁). The last equality is exactly the condition of a Markov connection of the triplet x₂, x₁, y.
We can also derive an analogous relationship from (6.3.7). Thus, we see that the mutual information of the given random variables is greater than or equal to the mutual information of a portion of the indicated variables. This is analogous to the inequality H_{x₁} ⩽ H_{x₁x₂} for entropy (valid because H_{x₂|x₁} ⩾ 0). Meanwhile, the relation H_{x|z} ⩽ H_x does not have a counterpart for mutual information: the inequality I_{xy|z} ⩽ I_{xy} is not valid in the general case.
Example 6.1. Let x, y, z be random variables with two feasible values each, specified by certain probabilities. Then

$$H_x = h(3/8), \qquad H_{x|y} = h(1/4),$$

so that

$$I_{xy} = h(3/8) - h(1/4). \qquad (6.3.14)$$

Finally, the following difference can be found from (6.3.13), (6.3.14):

$$I_{xy} - I_{xy|z} = h(3/8) - \tfrac{1}{2}\, h(1/4) - \tfrac{1}{2}\, h(1/2) = 0.183 \ \text{bits} > 0. \qquad (6.3.15)$$
Example 6.2. Now we assume that random variables x, y, z with two feasible values are described by the probabilities

$$P(z_1) = P(z_2) = \frac{1}{2},$$
$$\left\|P(y_i, x_j \mid z_1)\right\| = \begin{pmatrix} 3/8 & 1/8 \\ 1/8 & 3/8 \end{pmatrix}, \qquad \left\|P(y_i, x_j \mid z_2)\right\| = \begin{pmatrix} 1/3 & 1/6 \\ 1/6 & 1/3 \end{pmatrix}.$$

Then

$$H_x(\mid z_1) = H_x(\mid z_2) = H_x = h(1/2),$$
$$I_{xy}(\mid z_1) = h(1/2) - h(1/4), \qquad I_{xy}(\mid z_2) = h(1/2) - h(1/3),$$
$$I_{xy|z} = h(1/2) - \tfrac{1}{2}\, h(1/4) - \tfrac{1}{2}\, h(1/3).$$

Furthermore, since

$$\left\|P(y_i, x_j)\right\| = \begin{pmatrix} 17/48 & 7/48 \\ 7/48 & 17/48 \end{pmatrix}, \qquad \left\|P(x_i \mid y_j)\right\| = \begin{pmatrix} 17/24 & 7/24 \\ 7/24 & 17/24 \end{pmatrix},$$

we obtain that

$$I_{xy} = h(1/2) - h(7/24).$$

Therefore,

$$I_{xy} - I_{xy|z} = \tfrac{1}{2}\, h(1/4) + \tfrac{1}{2}\, h(1/3) - h(7/24) = -0.04 \ \text{bits} < 0. \qquad (6.3.16)$$
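The sketch below (not part of the original; plain Python) reproduces Example 6.2 from the closed forms just stated and confirms the sign of the difference I_xy − I_{xy|z}:

```python
import math

def h(p):
    """Binary entropy in bits."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# conditional informations given z, from the tables of Example 6.2
Ixy_given_z1 = h(1/2) - h(1/4)
Ixy_given_z2 = h(1/2) - h(1/3)
Ixy_z = 0.5 * Ixy_given_z1 + 0.5 * Ixy_given_z2   # I_{xy|z}

# unconditional case: P(y_i, x_j) has 17/48 on and 7/48 off the diagonal
Ixy = h(1/2) - h(7/24)

diff = Ixy - Ixy_z     # comes out negative: here I_{xy|z} exceeds I_{xy}
```

The computed difference is indeed negative, illustrating that conditioning may increase mutual information.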
The sign of the difference I_xy − I_{xy|z} in these examples has been influenced by the concavity of the function h(p) = −p ln p − (1 − p) ln(1 − p). Indeed, for a concave function we have

$$h(E[\xi]) - E[h(\xi)] \geqslant 0. \qquad (6.3.17)$$

Furthermore, the values 3/8 and 7/24 are nothing else but the means

$$\tfrac{3}{8} = \tfrac{1}{2}\cdot\tfrac{1}{4} + \tfrac{1}{2}\cdot\tfrac{1}{2}, \qquad \tfrac{7}{24} = \tfrac{1}{2}\cdot\tfrac{1}{4} + \tfrac{1}{2}\cdot\tfrac{1}{3}.$$
Taking into account that H(x, y, z) = −ln P(x, y, z), we obtain the following formula for the three-fold distribution law from (6.3.21):

$$P(x, y, z) = \exp\left\{-H(x) - H(y) - H(z) + I(x, y) + I(y, z) + I(z, x) - I(x, y, z)\right\}.$$

In Examples 6.1 and 6.2 the mean I_{xyz} equals 0.183 bits and −0.04 bits, respectively, according to (6.3.15), (6.3.16). Thus, non-negativity of the triple information is not necessary.
Mutual information of a larger number of random variables is constructed in analogy with formula (6.3.19). In the general case the n-fold mutual information is defined by the formula

$$I_{x_1 \ldots x_n} = \sum_{i=1}^{n} H_{x_i} - \sum H_{x_i x_j} + \sum H_{x_i x_j x_k} - \cdots - (-1)^n H_{x_1 \ldots x_n}, \qquad (6.3.22)$$

where the summations are taken over all possible distinct pairs (n(n − 1)/2! terms), triplets (n(n − 1)(n − 2)/3! terms) and other combinations of the indices 1, ..., n.
We can prove that mutual information (6.3.22) is represented, analogously to (6.3.18), as the difference between regular and conditional mutual information of a smaller multiplicity:

$$I_{x_1 \ldots x_{n+1}} = I_{x_1 \ldots x_n} - I_{x_1 \ldots x_n \mid x_{n+1}}. \qquad (6.3.23)$$

It is natural to use induction in order to prove the equivalence of (6.3.22) and (6.3.23). We subtract from the right-hand side of (6.3.22) the expression for the conditional mutual information

$$I_{x_1 \ldots x_n \mid x_{n+1}} = \sum H_{x_i \mid x_{n+1}} - \sum H_{x_i x_k \mid x_{n+1}} + \cdots - (-1)^n H_{x_1 \ldots x_n \mid x_{n+1}}$$
$$= \sum H_{x_i x_{n+1}} - n\, H_{x_{n+1}} - \sum H_{x_i x_k x_{n+1}} + \frac{n(n-1)}{2}\, H_{x_{n+1}} + \cdots - (-1)^n H_{x_1 \ldots x_n x_{n+1}} + (-1)^n H_{x_{n+1}}.$$
By analyzing the terms of the resulting expression and taking into account the equation

$$-n + \frac{n(n-1)}{2} - \cdots + (-1)^n = (1-1)^n - 1 = -1,$$

we ascertain that the latter expression is equivalent to the sum on the right-hand side of formula (6.3.22) with n replaced by n + 1 therein. Equality (6.3.22) is valid for n = 2 and n = 3; thereby it is valid for all larger values of n.
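The equivalence of (6.3.22) and (6.3.23) can be spot-checked for n = 3. The sketch below is not from the original text (plain Python, random distribution of three binary variables); it verifies that the n = 3 instance of (6.3.22) equals I_{x₁x₂} − I_{x₁x₂|x₃}:

```python
import itertools
import math
import random

random.seed(2)
P = {k: random.random() for k in itertools.product((0, 1), repeat=3)}
Z = sum(P.values())
P = {k: v / Z for k, v in P.items()}

def H(vars_):
    """Joint entropy (nats) of the coordinates listed in vars_ (empty tuple -> 0)."""
    m = {}
    for k, p in P.items():
        key = tuple(k[i] for i in vars_)
        m[key] = m.get(key, 0.0) + p
    return -sum(p * math.log(p) for p in m.values() if p > 0)

# n = 3 instance of (6.3.22): sum H_i - sum H_ij + H_123
I3 = (H((0,)) + H((1,)) + H((2,))
      - H((0, 1)) - H((0, 2)) - H((1, 2)) + H((0, 1, 2)))

# right-hand side of (6.3.23) with n = 2: I_{x1x2} - I_{x1x2|x3}
I12 = H((0,)) + H((1,)) - H((0, 1))
I12_3 = H((0, 2)) + H((1, 2)) - H((0, 1, 2)) - H((2,))
```

Both sides agree to machine precision, mirroring the inductive step of the proof.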
In the case of higher multiplicities (starting with the triple information considered above) we cannot assert the non-negativity of mutual information, which held in the case of pairwise mutual information.
6.4 Mutual information in the general case 187
1. In the preceding sections it was assumed that the variables under consideration are discrete, i.e. they take values from a finite or countable sample space; however, only general properties of entropy were essentially used. It stands to reason that the aforementioned formulae and statements related to mutual information can be extended to the case of arbitrary continuous or combined random variables. Indeed, it was shown in Sections 1.6 and 1.7 how to define the notion of entropy (possessing all the properties of the entropy of the discrete version) for arbitrary random variables; we should just use the relations provided therein. The validity of the regular properties of entropy in the generalized version ensures the validity of the same relations (as in the discrete version) for mutual information in the generalized version.
Further we consider two arbitrary random variables x, y. In the abstract language of measure theory this means that there are given a probability space (Ω, F, P) and two Borel subfields F₁ ⊂ F, F₂ ⊂ F. The first, F₁, is generated by constraints imposed on x(ω), ω ∈ Ω (i.e. by constraints of the type x(ω) < c); the second, F₂, is generated by constraints imposed on y(ω), ω ∈ Ω. It is also assumed that, besides probability measure P, there are given:

1. a measure ν on the united Borel field F₁₂ = σ(F₁ ∪ F₂);
2. a measure ν₁ on field F₁;
3. a measure ν₂ on field F₂.
These measures are such that the multiplicativity constraint (6.4.1) is satisfied. The last equality can be treated as an original definition of mutual information, independent of the notion of entropy, which is quite convenient in the general case. Such a definition allows one not to introduce the auxiliary measures ν₁, ν₂, ν.
The following expression for the random information corresponds to formula (6.4.4):

$$I(x, y) = \ln \frac{P(dx \mid y)}{P(dx)}, \qquad (6.4.5)$$

or

$$I(x, y) = \ln \frac{P(dx\, dy)}{P(dx)\, P(dy)} = \ln \frac{P(dy \mid x)}{P(dy)}, \qquad (6.4.6)$$

with

$$I_{xy} = E[I(x, y)]. \qquad (6.4.7)$$
These relations are analogous to formulae (6.2.4), (6.2.5) from the discrete version.
In the generalized case formulae (6.2.8), (6.2.7) take the form:
The other formulae from Sections 6.2 and 6.3 can be duplicated with no trouble.
Henceforth, we write down the respective formulae in the required forms.
Satisfaction of the multiplicativity condition (6.4.1) for the normalized measure Q(dx dy) = ν(dx dy)/N = ν₁(dx) ν₂(dy)/N entails the properties

$$\mu(0) = 0, \qquad (6.4.11)$$
$$\mu(1) = 0, \qquad (6.4.12)$$

the latter because ∫∫ P(dx) P(dy) = ∫ P(dx) ∫ P(dy) = 1. The important results of the theory of optimal coding in the presence of noise are expressed in terms of the characteristic potential (6.4.10) (Theorems 7.2, 7.3).
$$\cdots + \frac{1}{2}\sum_{\rho,\sigma=r+1}^{r+s}\left(x_\rho - m_\rho\right) S^{-1}_{\rho\sigma}\left(x_\sigma - m_\sigma\right). \qquad (6.5.2)$$

In order to find the average mutual information I_xy it is reasonable to use the simple formula (5.4.6a), which yields

$$I_{xy} = -\frac{1}{2}\ln\det K + \frac{1}{2}\ln\det R + \frac{1}{2}\ln\det S. \qquad (6.5.3)$$
This result can be represented in a different way. Since

$$\ln\det A = \operatorname{tr}\ln A, \qquad (6.5.4)$$

then, accounting for (6.5.1) and multiplying the matrices, we have

$$I_{xy} = -\frac{1}{2}\operatorname{tr}\ln \begin{pmatrix} 1 & U S^{-1} \\ U^T R^{-1} & 1 \end{pmatrix} = -\frac{1}{2}\ln\det \begin{pmatrix} 1 & U S^{-1} \\ U^T R^{-1} & 1 \end{pmatrix},$$

or, equivalently,

$$I_{xy} = -\frac{1}{2}\operatorname{tr}\ln\left(1 - U^T R^{-1} U S^{-1}\right). \qquad (6.5.7)$$
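Formula (6.5.7) can be checked against (6.5.3) numerically. The following sketch (not in the original; Python with NumPy, a randomly generated positive-definite joint covariance) computes the Gaussian mutual information both ways:

```python
import numpy as np

rng = np.random.default_rng(0)
r, s = 3, 2
# random positive-definite joint covariance K = [[R, U], [U^T, S]]
M = rng.normal(size=(r + s, r + s))
K = M @ M.T + (r + s) * np.eye(r + s)
R, U, S = K[:r, :r], K[:r, r:], K[r:, r:]

# formula (6.5.3): I = -1/2 ln det K + 1/2 ln det R + 1/2 ln det S
I_dets = (-0.5 * np.linalg.slogdet(K)[1]
          + 0.5 * np.linalg.slogdet(R)[1]
          + 0.5 * np.linalg.slogdet(S)[1])

# formula (6.5.7): I = -1/2 ln det (1 - U^T R^{-1} U S^{-1})
B = U.T @ np.linalg.inv(R) @ U @ np.linalg.inv(S)
I_block = -0.5 * np.linalg.slogdet(np.eye(s) - B)[1]
```

Both expressions agree, and the common value is non-negative (Fischer's inequality det K ⩽ det R · det S).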
If we now employ the series expansion of the logarithmic function

$$\ln(1 - z) = -\sum_{k=1}^{\infty} \frac{z^k}{k}, \qquad (6.5.8)$$

we obtain the corresponding series for I_xy. In the simplest particular case x and y are scalar; R₁ is their correlation coefficient, and all the matrices R, S, U consist of one element each, namely R = (σ₁²), S = (σ₂²), U = (σ₁σ₂R₁). Then

$$U S^{-1} U^T R^{-1} = (\sigma_1 \sigma_2 R_1)\left(\sigma_2^2\right)^{-1}(\sigma_1 \sigma_2 R_1)\left(\sigma_1^2\right)^{-1} = R_1^2. \qquad (6.5.10)$$
Because

$$S^{-1} = \frac{1}{\sigma_2^2 \sigma_3^2\left(1 - R_1^2\right)} \begin{pmatrix} \sigma_3^2 & -\sigma_2 \sigma_3 R_1 \\ -\sigma_2 \sigma_3 R_1 & \sigma_2^2 \end{pmatrix}, \qquad U S^{-1} U^T = \sigma_1^2\, \frac{R_2^2 + R_3^2 - 2 R_1 R_2 R_3}{1 - R_1^2},$$

we have

$$U S^{-1} U^T R^{-1} = \frac{R_2^2 + R_3^2 - 2 R_1 R_2 R_3}{1 - R_1^2}.$$

The latter matrix contains just one element. We use formula (6.5.6) to obtain

$$I_{x_1, (x_2, x_3)} = -\frac{1}{2}\ln\left[1 - \frac{R_2^2 + R_3^2 - 2 R_1 R_2 R_3}{1 - R_1^2}\right]. \qquad (6.5.13)$$
This result can also be derived from formula (6.5.3) by calculating the determinants

$$\det K = \sigma_1^2 \sigma_2^2 \sigma_3^2\left(1 - R_1^2 - R_2^2 - R_3^2 + 2 R_1 R_2 R_3\right), \qquad \det S = \sigma_2^2 \sigma_3^2\left(1 - R_1^2\right), \qquad \det R = \sigma_1^2. \qquad (6.5.14)$$

Therefore, (6.5.3) yields

$$I_{xy} = \frac{1}{2}\ln\left(1 - R_1^2\right) - \frac{1}{2}\ln\left(1 - R_1^2 - R_2^2 - R_3^2 + 2 R_1 R_2 R_3\right),$$

which is equivalent to (6.5.13).
In addition to the preceding, we determine the triple mutual information (6.3.19) among three Gaussian random variables. Without loss of generality we can take their correlation matrix in the form (6.5.12). Applying (5.4.6a), (6.5.14) we obtain

$$H_{x_1 x_2 x_3} - H_{x_1} - H_{x_2} - H_{x_3} = \frac{1}{2}\ln\det K - \frac{1}{2}\ln \sigma_1^2 - \frac{1}{2}\ln \sigma_2^2 - \frac{1}{2}\ln \sigma_3^2 = \frac{1}{2}\ln\left(1 - R_1^2 - R_2^2 - R_3^2 + 2 R_1 R_2 R_3\right).$$

According to (6.3.20), adding hereto the pairwise mutual informations I_{x₁x₂}, I_{x₂x₃}, I_{x₃x₁} of the form (6.5.11), we derive

$$I_{x_1 x_2 x_3} = \frac{1}{2}\ln\left(1 - R_1^2 - R_2^2 - R_3^2 + 2 R_1 R_2 R_3\right) - \frac{1}{2}\ln\left(1 - R_1^2\right) - \frac{1}{2}\ln\left(1 - R_2^2\right) - \frac{1}{2}\ln\left(1 - R_3^2\right).$$

If we expand the obtained expression in powers of R₁, R₂, R₃, then, as is easy to see, we get the following formula for the triple mutual information:

$$I_{x_1 x_2 x_3} = R_1 R_2 R_3 + O(R^4).$$
6.5 Mutual information for Gaussian variables 193
It is useful to compare the last formula with the analogous formula for the pairwise mutual information
$$ I_{x_1x_2} = \frac{1}{2}R_1^2 + O\bigl(R^4\bigr) $$
resulting from (6.5.11).
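The determinant identities above are easy to check numerically. The following sketch (with illustrative correlation values, not taken from the text) verifies that (6.5.13) agrees with the determinant formula (6.5.3) and that the triple information is approximated by $R_1R_2R_3$:

```python
import numpy as np

# Illustrative correlation coefficients (hypothetical values):
# R1 = corr(x2, x3), R2 = corr(x1, x3), R3 = corr(x1, x2); unit variances.
R1, R2, R3 = 0.2, 0.3, 0.4
K = np.array([[1.0, R3, R2],
              [R3, 1.0, R1],
              [R2, R1, 1.0]])

# I_{x1,(x2,x3)} by (6.5.3): (1/2)[ln det R + ln det S - ln det K]; det R = 1 here.
detS = 1 - R1**2
I_det = 0.5 * (np.log(detS) - np.log(np.linalg.det(K)))

# The same quantity by the explicit formula (6.5.13)
I_explicit = -0.5 * np.log(1 - (R3**2 + R2**2 - 2*R1*R2*R3) / (1 - R1**2))

# Triple mutual information and its leading term R1*R2*R3
I_triple = (0.5*np.log(1 - R1**2 - R2**2 - R3**2 + 2*R1*R2*R3)
            - 0.5*np.log(1 - R1**2) - 0.5*np.log(1 - R2**2) - 0.5*np.log(1 - R3**2))
print(I_det, I_explicit, I_triple, R1*R2*R3)
```

For these values the two expressions for $I_{x_1,(x_2,x_3)}$ agree to machine precision, while the triple information differs from $R_1R_2R_3$ only by the $O(R^4)$ remainder.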
3. Particular case 3. Here we consider the case of additive independent noises
(disturbances) when the variables of the second group y1 , y2 , . . . , yr (s = r) are the
sums of the variables of the first group x1 , . . . , xr with independent Gaussian random
variables ξ1 , . . . , ξr :
yα = xα + ξα , α = 1, . . . , r. (6.5.15)
In this case the numbers of the first group variables and the second group variables
are equal (s = r). Let R and N be the correlation matrices of variables x1 , . . . , xr
and additive noises ξ1 , . . . , ξr , respectively. It follows from the condition of inde-
pendence between noises ξ and x that the correlation matrices for sum (6.5.15) are
expressed as
$$ S = R + N, \qquad U = U^{T} = R. \qquad (6.5.16) $$
In order to apply formula (6.5.7) we calculate $U^{T}R^{-1}US^{-1}$. Here we have
$$ U^{T}R^{-1}US^{-1} = R\,(R+N)^{-1} $$
and thereby
$$ 1 - U^{T}R^{-1}US^{-1} = \bigl(RN^{-1} + 1\bigr)^{-1}. $$
Consequently, formula (6.5.7) yields
$$ I_{xy} = \frac{1}{2}\operatorname{tr}\ln\bigl(1 + RN^{-1}\bigr) = \frac{1}{2}\ln\det\bigl(1 + RN^{-1}\bigr). \qquad (6.5.17) $$
Suppose that $\lambda_k$ are the eigenvalues of matrix $RN^{-1}$. Then it is apparent that
$$ I_{xy} = \frac{1}{2}\sum_{k}\ln\bigl(1 + \lambda_k\bigr). $$
Hence, we can use formula (6.5.19) after the indicated transformation in the given
case too.
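A small numerical sketch of the additive-noise case: for hypothetical positive definite matrices $R$ and $N$ (generated at random here), formula (6.5.17) computed through the eigenvalues of $RN^{-1}$ agrees with the determinant form and with the general formula (6.5.3) under condition (6.5.16):

```python
import numpy as np

rng = np.random.default_rng(0)
r = 4
# Hypothetical positive definite signal and noise covariance matrices
A = rng.normal(size=(r, r)); R = A @ A.T + r*np.eye(r)
B = rng.normal(size=(r, r)); N = B @ B.T + r*np.eye(r)

M = R @ np.linalg.inv(N)
I_det = 0.5 * np.log(np.linalg.det(np.eye(r) + M))   # (6.5.17)
lam = np.linalg.eigvals(M).real                      # eigenvalues of R N^{-1}
I_eig = 0.5 * np.sum(np.log(1 + lam))

# Cross-check against (6.5.3) with S = R + N and U = R, per (6.5.16)
S = R + N
Kfull = np.block([[R, R], [R, S]])
I_gen = 0.5 * (np.log(np.linalg.det(R)) + np.log(np.linalg.det(S))
               - np.log(np.linalg.det(Kfull)))
print(I_det, I_eig, I_gen)
```

All three evaluations coincide because $\det K = \det R\,\det N$ for the block matrix with $U = R$.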
4. Let us find the characteristic potential (6.4.10) of the random mutual infor-
mation for Gaussian variables. For this purpose we represent the random mutual
information (6.4.6), (6.5.2) as follows:
$$ I(x_1,\dots,x_r;\,y_1,\dots,y_s) = I_{xy} - \frac{1}{2}\sum_{i,j=1}^{r+s}(x_i - m_i)\,A_{ij}\,(x_j - m_j), \qquad (6.5.20a) $$
where we denote
$$ \|A_{ij}\| = A = K^{-1} - \begin{pmatrix} R^{-1} & 0 \\ 0 & S^{-1} \end{pmatrix} = \begin{pmatrix} R & U \\ U^{T} & S \end{pmatrix}^{-1} - \begin{pmatrix} R^{-1} & 0 \\ 0 & S^{-1} \end{pmatrix}. $$
We take into account (6.5.3) and represent the characteristic potential (6.5.21) as
$$ \mu(s) = \frac{s}{2}\ln\det K + \frac{1-s}{2}\ln\det R + \frac{1-s}{2}\ln\det S - \frac{1}{2}\ln\det\begin{pmatrix} R & sU \\ sU^{T} & S \end{pmatrix}, \qquad (6.5.23) $$
or, equivalently,
$$ \mu(s) = -\frac{1}{2}\ln\det\bigl(1 - s^2 U^{T}R^{-1}US^{-1}\bigr) - s I_{xy} = -\frac{1}{2}\ln\det\bigl(1 - s^2 B\bigr) + \frac{s}{2}\ln\det\bigl(1 - B\bigr), \qquad B = U^{T}R^{-1}US^{-1}, \qquad (6.5.25) $$
which can be expanded as
$$ \mu(s) = \frac{s}{2}\sum_{k=1}^{\infty}\frac{1}{k}\bigl(s^{2k-1} - 1\bigr)\operatorname{tr}B^{k} \qquad (6.5.26) $$
in analogy with (6.5.9). In particular, it is easy to find the variance of the random mutual information of Gaussian variables. In order to do so, we need to extract the coefficient of $\frac{1}{2}s^2$ in the expansion (6.5.26), which yields
$$ \operatorname{Var} I(x_1,\dots,x_r;\,y_1,\dots,y_s) = \operatorname{tr}B = \operatorname{tr}\tilde{B}. $$
Thus, we see that all statistical properties of the random mutual information of Gaussian variables are defined just by the single matrix $B = U^{T}R^{-1}US^{-1}$ or $\tilde{B} = US^{-1}U^{T}R^{-1}$.
For the aforementioned particular case 1 we have
$$ \mu(s) = -\frac{1}{2}\ln\bigl(1 - s^2 R_1^2\bigr) + \frac{s}{2}\ln\bigl(1 - R_1^2\bigr) $$
according to formulae (6.5.25), (6.5.9). It is not difficult to apply the derived formu-
lae to other particular cases.
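The statement $\operatorname{Var} I = \operatorname{tr} B$ can be checked by Monte Carlo in particular case 1, where $B = R_1^2$. The sketch below (with an illustrative correlation value) samples a correlated Gaussian pair and evaluates the random mutual information of (6.5.20a) directly:

```python
import numpy as np

rho = 0.6                      # illustrative correlation coefficient R1
rng = np.random.default_rng(1)
n = 1_000_000
x = rng.standard_normal(n)
y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)

# random mutual information i(x, y) = ln[p(x, y) / (p(x) p(y))]
i = (-0.5*np.log(1 - rho**2)
     - (x**2 - 2*rho*x*y + y**2) / (2*(1 - rho**2))
     + (x**2 + y**2) / 2)

print(i.mean(), -0.5*np.log(1 - rho**2))   # sample mean approaches I_xy
print(i.var(), rho**2)                     # sample variance approaches tr B = R1^2
```

The sample mean converges to $I_{xy} = -\tfrac12\ln(1-R_1^2)$ and the sample variance to $R_1^2$, as the expansion of $\mu(s)$ predicts.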
1. Now we suppose that both the first and the second groups of random variables are
stationary processes
$$ x_{-\infty}^{\infty} = \{x_t,\ -\infty < t < \infty\}, \qquad y_{-\infty}^{\infty} = \{y_t,\ -\infty < t < \infty\}, \qquad (6.6.1) $$
in discrete or continuous time $t$. These processes are assumed to be not only stationary but also stationary-connected, so that the combined process $z_{-\infty}^{\infty} = \{x_t, y_t,\ -\infty < t < \infty\}$ is stationary. For the interval $[0, T]$ we can consider the entropies $H_{x_0^T}$, $H_{y_0^T}$, $H_{z_0^T}$ defined according to the formulae and results of Chapter 5.
According to the general formula (6.2.2) these entropies allow the determination
of mutual information
$$ I_{x_0^T, y_0^T} = H_{x_0^T} + H_{y_0^T} - H_{x_0^T, y_0^T}. \qquad (6.6.2) $$
Let us define the mutual information rate of the processes $\{x_t\}$ and $\{y_t\}$ as the limit
$$ i_{xy} = \lim_{T\to\infty}\frac{1}{T}\, I_{x_0^T, y_0^T}. \qquad (6.6.3) $$
If we substitute (6.2.2) hereto, then, evidently,
6.6 Information rate of stationary and stationary-connected processes. Gaussian processes 197
$$ i_{xy} = \lim_{T\to\infty}\frac{1}{T}H_{x_0^T} + \lim_{T\to\infty}\frac{1}{T}H_{y_0^T} - \lim_{T\to\infty}\frac{1}{T}H_{x_0^T, y_0^T}. \qquad (6.6.4) $$
The limits in the right-hand side of this equality exist according to Theorems 5.1 and 5.4, and are equal to the entropy rates $h_x$, $h_y$, $h_{xy}$, respectively ($H_{\xi_0^1|\xi_{-\infty}^0}$ coincides with $H_{\xi_0|\dots\xi_{-2}\xi_{-1}} = H^1$ in the discrete time case when $t$ is integer; we use formula (5.6.10) with $\tau = 1$ in the continuous time case). Therefore, (6.6.4) takes the form
$$ i_{xy} = h_x + h_y - h_{xy} = H_x^1 + H_y^1 - H_{xy}^1. \qquad (6.6.6) $$
Certainly, mutual information (6.6.2) and information rate (6.6.6) can be finite not only in the case when $H_{x_0^T}$, $H_{y_0^T}$, $H_{x_0^Ty_0^T}$ or $h_x$, $h_y$, $h_{xy}$ are finite individually. That
is why, theoretically, one can compute information leaving out a computation of en-
tropies. However, in practical cases, when information is finite, we can always select
auxiliary measures ν , Q (involved in the definition of entropy from Section 1.6) in
such a way that all terms of (6.6.2), (6.6.6) are finite. Then the problem of comput-
ing the information will be reduced to the simpler problem of computing entropy
rates (considered in Chapter 5) at least for one (the most convenient) selection of
measure ν or Q. According to the aforesaid in Section 6.4 we need to be sure that
either one of the multiplicativity constraints (6.4.1) and (6.4.8) is satisfied. Note that
these multiplicativity constraints are expressed as
$$ \nu(dx_{-\infty}^{\infty}\,dy_{-\infty}^{\infty}) = \nu_1(dx_{-\infty}^{\infty})\,\nu_2(dy_{-\infty}^{\infty}) $$
and
$$ Q(dx_{-\infty}^{\infty}\,dy_{-\infty}^{\infty}) = Q_1(dx_{-\infty}^{\infty})\,Q_2(dy_{-\infty}^{\infty}) \qquad (6.6.7) $$
for processes (6.6.1), respectively. In this case, according to (6.4.9), formulae (6.6.2), (6.6.6) can be replaced with
$$ I_{x_0^Ty_0^T} = H^{P/Q}_{x_0^Ty_0^T} - H^{P/Q_1}_{x_0^T} - H^{P/Q_2}_{y_0^T}, \qquad i_{xy} = h^{P/Q}_{xy} - h^{P/Q_1}_{x} - h^{P/Q_2}_{y}, \qquad (6.6.8) $$
where
$$ \lim_{T\to\infty}\frac{1}{T}H^{P/Q}_{\xi_0^T} = H^{P/Q}_{\xi_0^1|\xi_{-\infty}^0}. $$
Besides the information rate we consider the information of the end of the interval, which is analogous to the entropy constant $\Gamma$ of the end of the interval involved in the formula
$$ H_{\xi_0^T} = hT + 2\Gamma + o_T(1) \qquad (6.6.9) $$
[see (5.6.17)].
198 6 Information in the presence of noise. Shannon’s amount of information
By comparing relation (5.6.15) with formula (6.6.2) it is easy to see that the constant $2\Gamma$ can be interpreted as the mutual information between the stochastic processes on the two half-lines $(-\infty, 0)$ and $(0, \infty)$:
$$ 2\Gamma = I_{\xi_{-\infty}^0,\,\xi_0^{\infty}}. \qquad (6.6.10) $$
Formulae (6.6.9), (6.6.10) are valid for each of processes {xt }, {yt }, {xt , yt }.
Substituting similar expressions for every entropy into (6.6.2) and taking into account (6.6.6), we obtain
$$ I_{x_0^Ty_0^T} = i_{xy}T + 2\Gamma_x + 2\Gamma_y - 2\Gamma_{xy} + o_T(1). $$
This equality allows us to calculate the mutual information on the finite segment $[0, T]$ more precisely than by the formula $I_{x_0^Ty_0^T} \approx i_{xy}T$ following from (6.6.3).
2. We apply the formulae provided above to calculate the information rate be-
tween two Gaussian stationary random sequences {xt , t = . . . , 1, 2, . . .}, {yt , t =
. . . , 1, 2, . . .}. Since the mean values of Gaussian variables do not affect the value of the mutual information [for instance, see (6.5.3)], without loss of generality we can suppose that the respective mean values are equal to zero:
$$ \mathrm{E}[x_t] = 0, \qquad \mathrm{E}[y_t] = 0. $$
Applying the results of Chapter 5 we have
$$ H^1_{xy} = \frac{1}{2}\int_{-1/2}^{1/2}\ln\det\|\varphi^{\alpha\beta}(\mu)\|\,d\mu = \frac{1}{2}\int_{-1/2}^{1/2}\ln\bigl[\varphi^{11}(\mu)\varphi^{22}(\mu) - \varphi^{12}(\mu)\varphi^{21}(\mu)\bigr]\,d\mu, $$
$$ H^1_{x} = \frac{1}{2}\int_{-1/2}^{1/2}\ln\varphi^{11}(\mu)\,d\mu, \qquad H^1_{y} = \frac{1}{2}\int_{-1/2}^{1/2}\ln\varphi^{22}(\mu)\,d\mu, $$
so that
$$ i_{xy} = -\frac{1}{2}\int_{-1/2}^{1/2}\ln\Bigl[1 - \frac{\varphi^{12}(\mu)\varphi^{21}(\mu)}{\varphi^{11}(\mu)\varphi^{22}(\mu)}\Bigr]\,d\mu. \qquad (6.6.12) $$
Because $|\varphi^{12}(\mu)|^2/[\varphi^{11}(\mu)\varphi^{22}(\mu)] = R_\mu^2$ is the square of the correlation coefficient for the spectral components corresponding to the value of $\mu$, the expression in the right-hand
side of equality (6.6.12) can be interpreted as the sum of mutual information of
distinct spectral components. In turn, each summand is defined by the simple for-
mula (6.5.11).
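As a sketch of formula (6.6.12), consider a hypothetical pair (not an example from the book): $x_t$ an AR(1) sequence and $y_t = x_t + n_t$ with white noise $n_t$. For this model the information rate also has a closed form obtained by spectral factorization, which gives an independent check of the numerical integral:

```python
import numpy as np

# Hypothetical pair: x_t is AR(1) with parameter a and unit innovation variance,
# y_t = x_t + n_t with white noise of spectral density N0. Then
# phi11 = 1/|1 - a e^{-2 pi i mu}|^2, phi12 = phi11, phi22 = phi11 + N0.
a, N0 = 0.7, 0.5
Ngrid = 200_000
mu = (np.arange(Ngrid) + 0.5) / Ngrid - 0.5          # midpoint grid on [-1/2, 1/2]
phi11 = 1.0 / np.abs(1 - a*np.exp(-2j*np.pi*mu))**2
R2 = phi11 / (phi11 + N0)                            # |phi12|^2 / (phi11 phi22)
i_num = -0.5 * np.mean(np.log(1 - R2))               # (6.6.12) by the midpoint rule

# Independent closed form: i = (1/2) ln A, where A solves the spectral
# factorization of 1 + phi11/N0 (our own computation, with c = 1/N0)
c = 1.0 / N0
A = ((1 + a**2 + c) + np.sqrt((1 + a**2 + c)**2 - 4*a**2)) / 2
i_closed = 0.5 * np.log(A)
print(i_num, i_closed)
```

Because the integrand is smooth and periodic, the midpoint rule converges very rapidly, and the two evaluations agree to high precision.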
Let us move on to a multivariate case. We find the information rate between a
group of stationary Gaussian sequences $\{x_t^{\alpha}\}$, $\alpha = 1,\dots,r$ and another group of sequences $\{y_t^{\rho-r}\} = \{x_t^{\rho}\}$, $\rho = r+1,\dots,r+s$. In aggregate these sequences are described by the correlation matrix $R_{t-q}^{ij}$, $i, j = 1,\dots,r+s$, or by the matrix of spectral densities
$$ \varphi(\mu) = \|\varphi^{ij}(\mu)\| = \begin{pmatrix} \varphi_x(\mu) & \varphi_{xy}(\mu) \\ \varphi_{yx}(\mu) & \varphi_y(\mu) \end{pmatrix}, \qquad (6.6.13) $$
where $\varphi_{yx}(\mu) = \varphi_{xy}^{+}(\mu)$ and $+$ denotes the Hermitian transpose. Here $\varphi_x(\mu) = \|\varphi^{\alpha\beta}(\mu)\|$ is the density matrix for the group of processes $\{x_t^{\alpha}\}$, $\varphi_y(\mu) = \|\varphi^{\rho\sigma}(\mu)\|$ is the matrix for the processes $\{y_t^{\rho-r}\}$ and, finally, $\varphi_{xy}(\mu) = \|\varphi^{\alpha\sigma}(\mu)\|$ is the matrix of mutual spectral functions.
Incorporating (5.5.19) into formula (6.6.6) we find the information rate
$$ i_{xy} = \frac{1}{2}\int_{-1/2}^{1/2}\bigl[\ln\det\varphi_x(\mu) + \ln\det\varphi_y(\mu) - \ln\det\varphi(\mu)\bigr]\,d\mu. \qquad (6.6.14) $$
We can apply to the integrand in question all those transformations which led from formula (6.5.3) to (6.5.6). After that, equality (6.6.14) takes the form
$$ i_{xy} = -\frac{1}{2}\int_{-1/2}^{1/2}\ln\det\bigl[1 - \varphi_{xy}(\mu)\varphi_y^{-1}(\mu)\varphi_{yx}(\mu)\varphi_x^{-1}(\mu)\bigr]\,d\mu. \qquad (6.6.15) $$
But it is nothing else but the sum of the respective expressions of $G(S_x^{-1}\tilde S_x)$ and $G(S_y^{-1}\tilde S_y)$. That is why only logarithmic terms will remain in the integrand:
$$ i_{xy}^{P/Q} = h_{xy}^{P/Q} - h_x^{P/Q_1} - h_y^{P/Q_2} = \frac{1}{4\pi}\int_{-\infty}^{\infty}\bigl[\operatorname{tr}\ln S_x + \operatorname{tr}\ln S_y - \operatorname{tr}\ln S - \operatorname{tr}\ln\tilde S_x - \operatorname{tr}\ln\tilde S_y + \operatorname{tr}\ln\tilde S\bigr]\,d\omega. \qquad (6.6.18) $$
Terms with the auxiliary spectral densities $\tilde S_x$, $\tilde S_y$ are completely discarded because of the absence of mutual correlations $\tilde S^{\alpha\sigma}(\omega)$ again, since
$$ \operatorname{tr}\ln\tilde S_x + \operatorname{tr}\ln\tilde S_y = \operatorname{tr}\ln\begin{pmatrix} \tilde S_x & 0 \\ 0 & 1 \end{pmatrix} + \operatorname{tr}\ln\begin{pmatrix} 1 & 0 \\ 0 & \tilde S_y \end{pmatrix} = \operatorname{tr}\ln\tilde S. $$
Here we have also used formula (6.5.4). The integrand is analogous to the expression
situated in the right-hand side of (6.5.3). Just like in Section 6.5 it can be reduced to
the form (6.5.6) or (6.5.7). At the same time the derived formula (6.6.18) takes the
form
$$ i_{xy} = -\frac{1}{4\pi}\int_{-\infty}^{\infty}\ln\det\bigl[1 - S_{xy}(\omega)S_y^{-1}(\omega)S_{yx}(\omega)S_x^{-1}(\omega)\bigr]\,d\omega. \qquad (6.6.19) $$
There is an evident analogy between this result and the corresponding formula (6.6.15) for stationary sequences. In the particular case when we consider the mutual information between a single process $\{x_t\}$ and another single process $\{y_t\}$, from (6.6.19) we have
$$ i_{xy} = -\frac{1}{4\pi}\int_{-\infty}^{\infty}\ln\Bigl[1 - \frac{|S_{xy}(\omega)|^2}{S_x(\omega)S_y(\omega)}\Bigr]\,d\omega. $$
so that μ (s) ≈ μ 1 (s)T . This quantity can be easily computed with the help of
formula (6.5.25) in the same way as information rate (6.6.19) can be determined
from (6.5.6). Taking into account the aforesaid, it is not difficult to find out what
form an expression of the rate potential will take in different cases. Thus, in the case
when formula (6.6.19) is valid the rate potential is represented as
$$ \mu_1(s) = -\frac{1}{4\pi}\int_{-\infty}^{\infty}\Bigl\{\ln\det\bigl[1 - s^2 S_{xy}(\omega)S_y^{-1}(\omega)S_{yx}(\omega)S_x^{-1}(\omega)\bigr] + s\ln\det\bigl[1 - S_{xy}(\omega)S_y^{-1}(\omega)S_{yx}(\omega)S_x^{-1}(\omega)\bigr]\Bigr\}\,d\omega. $$
Besides information rate (6.6.19) we can obtain several quantities from this result,
namely, the variance rate
$$ (\operatorname{Var} I_{xy})_1 = \frac{1}{2\pi}\int_{-\infty}^{\infty}\operatorname{tr}\bigl[S_{xy}(\omega)S_y^{-1}(\omega)S_{yx}(\omega)S_x^{-1}(\omega)\bigr]\,d\omega $$
1. Let there be given arbitrary (not necessarily stationary) stochastic processes $\{x_t\}$, $\{y_t\}$ in continuous or discrete time $t$. The mutual information of these processes on a fixed segment $a \leqslant t \leqslant b$ equals
$$ I_{x_a^b, y_a^b} = H^{P/Q}_{x_a^b y_a^b} - H^{P/Q_1}_{x_a^b} - H^{P/Q_2}_{y_a^b} \qquad (6.7.1) $$
for every $a$, $b$. We use the notations $Q_1(dx_a^b) = Q(dx_a^b)$, $Q_2(dy_a^b) = Q(dy_a^b)$, meaning that $Q_1$ and $Q_2$ are induced by the measure $Q(dx_a^b\,dy_a^b)$.
Taking the difference between the values of (6.7.1) with $b = t + \tau$ and $b = t$, we find the increment of mutual information
$$ I_{x,y}^{\tau} = H^{P/Q}_{x_a^{t+\tau}y_a^{t+\tau}} - H^{P/Q}_{x_a^t y_a^t} + H^{P/Q_1}_{x_a^t} - H^{P/Q_1}_{x_a^{t+\tau}} + H^{P/Q_2}_{y_a^t} - H^{P/Q_2}_{y_a^{t+\tau}}. $$
This expression can be represented with the help of conditional entropies as follows:
$$ I_{x,y}^{\tau} = H^{P/Q}_{y_t^{t+\tau}|x_a^t y_a^t} + H^{P/Q}_{x_t^{t+\tau}|x_a^t y_a^{t+\tau}} - H^{P/Q_1}_{x_t^{t+\tau}|x_a^t} - H^{P/Q_2}_{y_t^{t+\tau}|y_a^t}. \qquad (6.7.3) $$
6.7 Mutual information of components of a Markov process 203
Indeed,
$$ H^{P/Q}_{x_a^{t+\tau}y_a^{t+\tau}} - H^{P/Q}_{x_a^t y_a^t} = H^{P/Q}_{x_t^{t+\tau}y_t^{t+\tau}|x_a^t y_a^t} = H^{P/Q}_{y_t^{t+\tau}|x_a^t y_a^t} + H^{P/Q}_{x_t^{t+\tau}|x_a^t y_a^{t+\tau}}. $$
Here
$$ h^{P/Q}_{y|x_a^t y_a^t} = \lim_{\tau\to 0}\frac{1}{\tau}H^{P/Q}_{y_t^{t+\tau}|x_a^t y_a^t}, \qquad H^{P/Q}_{y_t^{t+\tau}|x_a^t y_a^t} = \mathrm{E}\int\ln\frac{P(dy_t^{t+\tau}\mid x_a^t, y_a^t)}{Q(dy_t^{t+\tau}\mid y_a^t)}\;P(dy_t^{t+\tau}\mid x_a^t, y_a^t), \qquad (6.7.5) $$
and
$$ H^{P/Q_2}_{y_t^{t+\tau}|y_a^t} = \mathrm{E}\int\ln\frac{P(dy_t^{t+\tau}\mid y_a^t)}{Q(dy_t^{t+\tau}\mid y_a^t)}\;P(dy_t^{t+\tau}\mid y_a^t). \qquad (6.7.6) $$
However, the probability $P(dy_t^{t+\tau}\mid y_a^t)$ can be represented as
$$ P(dy_t^{t+\tau}\mid y_a^t) = \mathrm{E}\bigl[P(dy_t^{t+\tau}\mid x_a^t, y_a^t)\mid y_a^t\bigr]. \qquad (6.7.7) $$
Next, we denote the conditional averaging over $x_a^t$ with weight $P(dx_a^t\mid y_a^t)$ by $E_1$, and the (non-conditional) averaging over $y_a^t$ with weight $P(dy_a^t)$ by $E_2$. Then (6.7.7), (6.7.6) can be rewritten as
$$ P(dy_t^{t+\tau}\mid y_a^t) = E_1\bigl[P(dy_t^{t+\tau}\mid x_a^t, y_a^t)\bigr] $$
and
$$ H^{P/Q_2}_{y_t^{t+\tau}|y_a^t} = E_2\int\ln\frac{P(dy_t^{t+\tau}\mid y_a^t)}{Q(dy_t^{t+\tau}\mid y_a^t)}\;P(dy_t^{t+\tau}\mid y_a^t). $$
The averagings in (6.7.5) can be represented as the consecutive averagings $E_2E_1$. Then the difference between entropies (6.7.5), (6.7.6) takes the form
$$ H^{P/Q}_{y_t^{t+\tau}|x_a^t y_a^t} - H^{P/Q_2}_{y_t^{t+\tau}|y_a^t} = E_2\Bigl\{ E_1\int\ln\frac{P(dy_t^{t+\tau}\mid x_a^t, y_a^t)}{Q(dy_t^{t+\tau}\mid y_a^t)}\;P(dy_t^{t+\tau}\mid x_a^t, y_a^t) $$
$$ \qquad - \int\ln\Bigl[E_1\frac{P(dy_t^{t+\tau}\mid x_a^t, y_a^t)}{Q(dy_t^{t+\tau}\mid y_a^t)}\Bigr]\,E_1\bigl[P(dy_t^{t+\tau}\mid x_a^t, y_a^t)\bigr]\Bigr\}. \qquad (6.7.8) $$
Analogously, we can work out the second difference $H^{P/Q}_{x_t^{t+\tau}|x_a^t y_a^{t+\tau}} - H^{P/Q_1}_{x_t^{t+\tau}|x_a^t}$ included in (6.7.3). Denoting the conditional averaging over $y_a^{t+\tau}$ with weight $P(dy_a^{t+\tau}\mid x_a^t)$ by $E_3$, and the averaging over $x_a^t$ by $E_4$, we will have
$$ H^{P/Q}_{x_t^{t+\tau}|x_a^t y_a^{t+\tau}} - H^{P/Q_1}_{x_t^{t+\tau}|x_a^t} = E_4\Bigl\{E_3\int\ln\frac{P(dx_t^{t+\tau}\mid x_a^t, y_a^{t+\tau})}{Q(dx_t^{t+\tau}\mid x_a^t)}\;P(dx_t^{t+\tau}\mid x_a^t, y_a^{t+\tau}) $$
$$ \qquad - \int\ln\Bigl[E_3\frac{P(dx_t^{t+\tau}\mid x_a^t, y_a^{t+\tau})}{Q(dx_t^{t+\tau}\mid x_a^t)}\Bigr]\,E_3\bigl[P(dx_t^{t+\tau}\mid x_a^t, y_a^{t+\tau})\bigr]\Bigr\}. \qquad (6.7.9) $$
The change of mutual information (6.7.3) is equal to the sum of the indicated expressions (6.7.8), (6.7.9). The subtrahend terms in the brackets differ from one another by the order (which is not the same for them) of the operations of averaging and non-linear transformation. It is not difficult to verify that measure $Q$ does not exert any essential influence on them.
It is convenient to use these expressions (derived without using Markovian
properties) for calculation of the mutual information between one part of compo-
nents of a Markov process and the other part of its components. The joint process
{xt , yt } = {ξt } is assumed to be Markov with respect to measure P. At the same
time process {yt } is assumed to be Markov with respect to measure Q. Then
$$ P(dy_t^{t+\tau}\mid x_a^t, y_a^t) = P(dy_t^{t+\tau}\mid x_t, y_t) \qquad (\tau > 0), $$
$$ Q(dy_t^{t+\tau}\mid x_a^t, y_a^t) = Q(dy_t^{t+\tau}\mid y_t). $$
This explains why the averaging $E_1$ in formula (6.7.8) reduces to an averaging over $dx_t\,dy_t$ performed with a weight $W_t(d\xi_t)$. We can see from here that measure $Q$ cancels out, and we obtain
$$ H^{P/Q}_{y_t^{t+\tau}|x_a^t y_a^t} - H^{P/Q_2}_{y_t^{t+\tau}|y_a^t} = E_2\int_{\xi_t}W_t(d\xi_t)\int_{y_t^{t+\tau}}P(dy_t^{t+\tau}\mid\xi_t)\,\ln\frac{P(dy_t^{t+\tau}\mid\xi_t)}{\int_{\xi_t}P(dy_t^{t+\tau}\mid\xi_t)\,W_t(d\xi_t)}. \qquad (6.7.11) $$
Let us consider the second expression of (6.7.8). The following relations are valid
for the joint Markov process {xt , yt } with respect to P and the Markov process {xt }
with respect to Q:
The sum of expressions (6.7.11), (6.7.13) gives the desired mutual information (6.7.3):
$$ I_{xy}^{\tau} = E_2\int_{\xi_t}W_t(d\xi_t)\int_{y_t^{t+\tau}}P(dy_t^{t+\tau}\mid\xi_t)\,\ln\frac{P(dy_t^{t+\tau}\mid\xi_t)}{\int_{\xi_t}P(dy_t^{t+\tau}\mid\xi_t)\,W_t(d\xi_t)} $$
$$ \qquad + E_4\int_{\xi_t}\tilde W_t(d\xi_t)\int_{x_t^{t+\tau}y_t^{t+\tau}}P(dx_t^{t+\tau}dy_t^{t+\tau}\mid\xi_t)\,\ln\frac{P(dx_t^{t+\tau}\mid x_t, y_t^{t+\tau})}{\int_{\xi_t}P(dx_t^{t+\tau}\mid\xi_t)\,\tilde W_t(d\xi_t)}. $$
For a Markov joint process the sum of the first two terms in (6.7.10), (6.7.12) can be represented in a simpler form. Taking this into account when summing up expressions (6.7.10), (6.7.12), we also obtain the result in the different form
$$ I_{xy}^{\tau} = \mathrm{E}\int_{\xi_t^{t+\tau}}\ln\frac{P(d\xi_t^{t+\tau}\mid\xi_t)}{Q(d\xi_t^{t+\tau}\mid\xi_t)}\,P(d\xi_t^{t+\tau}\mid\xi_t) $$
$$ \qquad - E_2\int_{y_t^{t+\tau}}\ln\frac{\int_{\xi_t}P(dy_t^{t+\tau}\mid\xi_t)\,W_t(d\xi_t)}{Q(dy_t^{t+\tau}\mid y_t)}\,\int_{\xi_t}P(dy_t^{t+\tau}\mid\xi_t)\,W_t(d\xi_t) $$
$$ \qquad - E_4\int_{x_t^{t+\tau}}\ln\frac{\int_{\xi_t}P(dx_t^{t+\tau}\mid\xi_t)\,\tilde W_t(d\xi_t)}{Q(dx_t^{t+\tau}\mid x_t)}\,\int_{\xi_t}P(dx_t^{t+\tau}\mid\xi_t)\,\tilde W_t(d\xi_t), \qquad (6.7.15) $$
2. Let us consider various particular cases and start with the case when $\{x_t, y_t\}$ is a discrete stationary Markov process in discrete time, i.e. a Markov chain. Then there is no need to introduce measure $Q$ in the formulae of the previous clause. This measure can be disregarded by substituting $P(dx_\alpha^\beta)/Q(dx_\alpha^\beta)$ and $H^{P/Q}$ with $P(x_\alpha^\beta)$ and $-H$, respectively. Also we can directly use the results of Sections 5.2 and 5.3.
Assume that the Markov chain is described by transition probabilities $\pi(x, y; x', y')$ as in Sections 5.2 and 5.3. Applying formula (5.2.8) we find the entropy rate of the combined process. The entropy rates of the components $x$ and $y$ taken separately are expressed via formula (5.3.23):
$$ h_x = -\int P_{st}(dW)\sum_{x,y,x',y'}W(x,y)\,\pi(x,y;x',y')\,\ln\sum_{\tilde x,\tilde y,y'}W(\tilde x,\tilde y)\,\pi(\tilde x,\tilde y;x',y'), $$
$$ h_y = -\int P_{st}(dW)\sum_{x,y,x',y'}W(x,y)\,\pi(x,y;x',y')\,\ln\sum_{\tilde x,\tilde y,x'}W(\tilde x,\tilde y)\,\pi(\tilde x,\tilde y;x',y'). $$
In the given formulae the stationary distribution $P_{st}(x, y)$ is a solution of the equation [see (5.2.7)], and the distribution $P_{st}(dW) = p_{st}(W)\prod_{\xi}dW(\xi)$ is a solution of the analogous equation corresponding to the secondary a posteriori Markov process having transition probabilities (5.3.22). The latter equation has the following form:
$$ \sum_{y_k}\int p_{st}(W(x,y))\prod_{x',y'}\delta\Bigl[W'(x',y') - \frac{\delta_{y',y_k}\sum_{x,y}W(x,y)\,\pi(x,y;x',y')}{\sum_{x',x,y}W(x,y)\,\pi(x,y;x',y_k)}\Bigr]\prod_{x,y}dW(x,y) = p_{st}(W'(x',y')). $$
Example 6.3. As an example, consider the Markov process $\{\xi_t\}$ with three states $\xi = a, b, c$ covered in Section 5.3. It can be represented as a combination $\{x_t, y_t\}$ of processes $\{x_t\}$ and $\{y_t\}$, each of which has two states $x, y = 1, 2$; the states $\xi = a$, $\xi = b$, $\xi = c$ are interpreted as combined states. This representation has a certain symmetry: it does not change if we swap the processes $\{x\}$, $\{y\}$ (i.e. under the substitution $x \leftrightarrow y$, $1 \leftrightarrow 2$). Therefore, we have
$$ h_x = h_y, \qquad (6.7.20a) $$
$$ -\sum_{x',y'}\pi(1,2;x',y')\,\ln\pi(1,2;x',y') = h_3(\upsilon,\upsilon) = -2\upsilon\ln\upsilon - (1-2\upsilon)\ln(1-2\upsilon). \qquad (6.7.21) $$
The stationary distribution $P_{st}(x, y)$ possesses the symmetry property $P_{st}(1,1) = P_{st}(2,2)$. Using it, equation (6.7.18) yields
$$ h_{xy} = \frac{2\upsilon}{\lambda + 2\upsilon}\,h_3(\lambda,\mu) + \frac{\lambda}{\lambda + 2\upsilon}\,h_3(\upsilon,\upsilon). \qquad (6.7.22) $$
In consequence of (6.7.20a) in order to determine mutual information (6.7.19) it
only remains to find entropy hx or hy . For the example in question it was found in
clause 3 of Section 5.3. According to (5.3.36), (5.3.37) it can be computed by the
formula
$$ h_y = \frac{h_2(p_1) + \sum_{k=1}^{\infty}p_1\cdots p_k\,h_2(p_{k+1})}{1 + \sum_{k=1}^{\infty}p_1\cdots p_k}, \qquad (6.7.23) $$
where $p_1, p_2, \dots$ are the values determined by relations (5.3.30), (5.3.32), (5.3.34) and so on. For the example (6.7.20) in question, these relations take the form
$$ p_1 = \lambda + \mu, \qquad p_2 = \frac{\lambda(1-\upsilon) + \mu(1-\mu)}{\lambda + \mu}, $$
$$ p_3 = \frac{[\lambda(1-2\upsilon) + \mu\upsilon](1-\upsilon) + [\lambda\upsilon + \mu(1-\lambda-\mu)](1-\mu)}{\lambda(1-\upsilon) + \mu(1-\mu)}, \qquad \dots $$
With the help of the results of Sections 5.9 and 5.11 or the formulae from clause 1
of the present paragraph we can calculate the density of the mutual information ixy ,
which coincides with the information rate in a stationary case.
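The scheme of this section — an exact entropy rate for the joint chain, plus filtering through the a posteriori probabilities $W$ for a component that is not Markov on its own — can be sketched numerically. The chain below is a hypothetical illustration (not the book's Example 6.3): $x_t$ is Markov, and $y_t$ is a noisy reading of $x_{t-1}$:

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical two-component chain: x_t is Markov with flip probability pf,
# y_t is a noisy reading of x_{t-1} with error probability pe.
# The pair (x_t, y_t) is Markov, while y_t alone is not.
pf, pe = 0.2, 0.1
T = 200_000
x = np.zeros(T, dtype=int); y = np.zeros(T, dtype=int)
for t in range(1, T):
    x[t] = x[t-1] ^ int(rng.random() < pf)
    y[t] = x[t-1] ^ int(rng.random() < pe)

# Joint transition matrix over states s = 2*x + y
P = np.zeros((4, 4))
for s in range(4):
    xs = s >> 1
    for s2 in range(4):
        x2, y2 = s2 >> 1, s2 & 1
        P[s, s2] = (pf if x2 != xs else 1 - pf) * (pe if y2 != xs else 1 - pe)

pi = np.full(4, 0.25)
for _ in range(500):                       # power iteration: stationary law
    pi = pi @ P
h_xy = -np.sum(pi[:, None] * P * np.log(P))   # entropy rate of the joint chain
h_x = -(pf*np.log(pf) + (1 - pf)*np.log(1 - pf))  # x alone is Markov: exact

# h_y by forward filtering: h_y = -lim (1/T) sum_t ln P(y_t | y_1 ... y_{t-1})
masks = [np.array([(s & 1) == v for s in range(4)]) for v in (0, 1)]
w = np.full(4, 0.25)                       # P(joint state | y-past)
ll = 0.0
for t in range(1, T):
    pred = w @ P                           # one-step prediction of joint state
    py = pred[masks[y[t]]].sum()           # probability of the observed y_t
    ll += np.log(py)
    w = pred * masks[y[t]] / py            # Bayes update
h_y = -ll / (T - 1)

i_xy = h_x + h_y - h_xy                    # information rate between components
print(h_x, h_y, h_xy, i_xy)
```

The filtering distribution plays the role of the a posteriori probabilities $W$ above; the Monte Carlo estimate of $h_y$ converges at rate $O(T^{-1/2})$, and $i_{xy}$ comes out strictly positive, as it must for dependent components.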
In order for the mutual information density $i_{xy}$ to be finite (see below) we suppose that the matrix $\pi(x, y; x', y')$ takes a special form [compare with (5.11.26)], where $m$ and $l$ are the numbers of states of the processes $x(t)$ and $y(t)$, respectively. This means that the combined (composite) measure $Q(x_0^T y_0^T) = Q_1(x_0^T)\,Q_2(y_0^T)$ is described by the matrix
$$ \pi^Q(x, y; x', y') = -2\delta_{xx'}\delta_{yy'} + \frac{1}{m-1}(1 - \delta_{xx'})\delta_{yy'} + \frac{1}{l-1}\delta_{xx'}(1 - \delta_{yy'}), \qquad (6.7.26) $$
which representation resembles (6.7.25).
Further we perform derivations using the formula
$$ i_{xy} = h_{xy}^{P/Q} - h_x^{P/Q_1} - h_y^{P/Q_2}, \qquad (6.7.27) $$
where
$$ h_y^{P/Q_2} = 1 + \mathrm{E}\sum_{y'\neq y}\pi_W(y,y')\,\ln\frac{(l-1)\,\pi_W(y,y')}{e}, \qquad (6.7.28) $$
$$ \pi_W(y,y') = \sum_x W(x)\sum_{x'}\pi(x,y;x',y'). \qquad (6.7.29) $$
Analogously,
$$ h_x^{P/Q_1} = 1 + \mathrm{E}\sum_{x'\neq x}\pi_{\tilde W}(x,x')\,\ln\frac{(m-1)\,\pi_{\tilde W}(x,x')}{e}, \qquad (6.7.30) $$
where
$$ \pi_{\tilde W}(x,x') = \sum_y \tilde W(y)\sum_{y'}\pi(x,y;x',y'). \qquad (6.7.31) $$
The density $h_{xy}^{P/Q}$ is calculated via the methods provided in Section 5.9. In order for this value to be finite we need the special form (6.7.25) of the combined transition probability matrix. With the help of formula (5.9.8) for matrices (6.7.25), (6.7.26) we obtain
$$ h_{xy}^{P/Q} = 2 + \mathrm{E}\sum_{x'\neq x}\pi_1(x,y;x')\bigl\{\ln[(m-1)\,\pi_1(x,y;x')] - 1\bigr\} + \mathrm{E}\sum_{y'\neq y}\pi_2(x,y;y')\bigl\{\ln[(l-1)\,\pi_2(x,y;y')] - 1\bigr\}. \qquad (6.7.32) $$
When substituting (6.7.28), (6.7.30), (6.7.32) into (6.7.27) we take into account that
$$ \mathrm{E}\,\pi_2(x,y;y') = \mathrm{E}\,\mathrm{E}\bigl[\pi_2(x,y;y')\mid y_0^t\bigr] = \mathrm{E}\sum_x \pi_2(x,y;y')\,P(x\mid y_0^t) = \mathrm{E}\,\pi_W(y,y') $$
and analogously
$$ \mathrm{E}\,\pi_1(x,y;x') = \mathrm{E}\,\pi_{\tilde W}(x,x'). $$
The latter equalities allow us to cancel out the terms linear with respect to $\pi_1$ or $\pi_2$ and rewrite the result as follows:
$$ i_{xy} = \sum_{x,y}P(x,y)\Bigl[\sum_{x'\neq x}\pi_1(x,y;x')\,\ln\pi_1(x,y;x') + \sum_{y'\neq y}\pi_2(x,y;y')\,\ln\pi_2(x,y;y')\Bigr] $$
$$ \qquad - \int P(d\tilde W\,dx)\sum_{x'\neq x}\pi_{\tilde W}(x,x')\,\ln\pi_{\tilde W}(x,x') - \int P(dW\,dy)\sum_{y'\neq y}\pi_W(y,y')\,\ln\pi_W(y,y'). \qquad (6.7.34) $$
It is easy to see from the latter formula that the obtained expression is non-negative, due to the positive definiteness of the matrix $\|b^{-1}_{\sigma\rho}\|$.
5. In conclusion of this paragraph we suppose that the process $\{x_t\}$ is Markov by itself, while the process $\{y_t\}$ under a fixed realization of $\{x_t\}$ is a multivariate diffusion process with parameters $a_\rho(x, y, t)$, $b_{\rho\sigma}(y, t)$, $\rho, \sigma = 1, \dots, s$. The matrix of local variances $b_{\rho\sigma}$ is assumed to be independent of $x$ and non-singular. In order to determine the mutual information $I_{xy}^{\tau}$ or $i_{xy}$ in this case we can apply formula (6.7.16), defining each of the two terms in its right-hand side by the corresponding formulae.
Therefore,
$$ a(1) - \sum_z a(z)W(z) = [a(1) - a(2)]\,W(2) = [a(1) - a(2)]\,\frac{1-z}{2}, $$
$$ a(2) - \sum_z a(z)W(z) = [a(2) - a(1)]\,W(1) = [a(2) - a(1)]\,\frac{1+z}{2}. $$
The probability density function $p_{st}(z)$ is specified in clause 2 of Section 5.11. The integral in (6.7.37) can be computed, and according to (5.11.24) we obtain
$$ i_{xy} = \frac{1}{2b}\,[a(1) - a(2)]^2\, K_q\bigl(b\sqrt{\mu\upsilon}\bigr)\Bigl\{2K_q\bigl(b\sqrt{\mu\upsilon}\bigr) + \sqrt{\frac{\upsilon}{\mu}}\,K_{q+1}\bigl(b\sqrt{\mu\upsilon}\bigr) + \sqrt{\frac{\mu}{\upsilon}}\,K_{q-1}\bigl(b\sqrt{\mu\upsilon}\bigr)\Bigr\}^{-1}, $$
$$ q = \frac{1}{2}\,b\,(\upsilon - \mu). $$
The determined mutual information is nothing else but the difference of entropies
(5.11.20), (5.11.22) found earlier.
In conclusion we note that result (6.7.36) is valid not only in the case when the process $\{x_t\}$ is Markov. For its validity it is only required that the conditional process $y_t$ described by the measure $P(dy_a^b\mid x_a^b)$ be diffusional and depend on $x_a^t$ (in a causally randomized sense), so that $H^{P/Q}_{x_t^{t+\tau}|x_a^t y_a^t} = H^{P/Q}_{x_t^{t+\tau}|x_a^t}$. This generalization can be easily derived from formula (6.7.12) (if we replace $x$ with $y$ therein), which is not constrained by the Markov condition for $x_t$. The theory developed in clause 1 (which does not employ Markovian properties) is also useful for the derivation of other results.
Chapter 7
Message transmission in the presence of noise.
Second asymptotic theorem and its various
formulations
In this chapter, we provide the most significant asymptotic results concerning the existence of optimal codes for noisy channels. It is proven that Shannon's amount of information is a bound on the Hartley amount of information that can be transmitted with asymptotically zero probability of error. This is the meaning of the second asymptotic theorem. Further, we provide formulae showing how quickly the probability of decoding error decreases as the block length increases. Contrary to the conventional approach, we represent the above results not in terms of channel capacity (i.e., we do not perform the maximization of the limit amount of information with respect to the probability density of the input variable), but in terms of Shannon's amount of information.
Theorems 7.1, 7.3, 7.5 successively strengthen one another. Such an ordering is chosen to facilitate studying this material. Certainly, each of these theorems makes the previous ones redundant as far as results are concerned. However, the strengthening of the results is achieved by complicating the proofs. There are certain grounds to believe that these complications are not justified by the importance of the strengthening (passing from Theorem 7.3 to the much more complicated Theorem 7.5 has very little influence on the behaviour of the coefficient $\alpha$ in the region where $R$ is close to $I_{xy}$); we present all these theorems for the reader to study any of them if they wish.
All the mentioned results can be extended from the case of blocks of independent
identically distributed random variables to the case of an arbitrary family of infor-
mationally stable random variables. This generalization is performed in a standard
way, and we shall use it only once (in Theorem 7.2).
In spite of the fact that the presentation of this chapter is given in the simplest
discrete version, the results obtained here are of general importance. Their extension
to the general case is concerned essentially only with changing the way of writing
the formulae.
Consider some communication channel. We denote its input variable (at a selected
time moment), which we call a transmitted character or letter, as x. It can assume
discrete values from some set X. It is convenient to suppose that probabilities P(x)
are also given. Then x will serve as a given random variable.
We consider a noisy channel. This means that, for a fixed value of $x$, the variable on the channel output (at a fixed time moment) is random, i.e. it is described by conditional probabilities $P(y \mid x)$. The random variable $y$ can be called a received character or letter. It is assumed that the process of transmitting letter $x$ and receiving letter $y$ can occur many times with the same probabilities $P(x)$, $P(y \mid x)$ (although a generalization to the case of varying probabilities is possible, see Theorem 7.2). Let $n$ letters
constitute a block or word, for instance,
sufficient to select code words at random. These questions and also methods of
decoding will be considered in the present chapter.
Assuming independence of processes of transmitting consecutive letters, it is
easy to write down the probabilities of words (7.1.1) in terms of the initial prob-
abilities
$$ P(\xi) = \prod_{i=1}^{n}P(x_i), \qquad P(\eta\mid\xi) = \prod_{i=1}^{n}P(y_i\mid x_i), \qquad P(\xi,\eta) = P(\xi)\,P(\eta\mid\xi) \qquad (7.1.2) $$
(in a similar case we say that channel [P(ξ ), P(η | ξ )] is an n-th degree of channel
[P(x), P(y | x)]).
We set a goal to transmit M messages consisting of n-character words, i.e. to
transmit the amount of information ln M. We may try to do it the following way. We
select M distinct words
out of all possible words of type ξ = (x1 , . . . , xn ). Their ensemble builds up the code,
which must be known at both ends (transmitting and receiving ones) of the channel.
Each of the $M$ messages is matched with one of the code words (7.1.3); say, the $k$-th message is transmitted by word $\xi_k$. At the same time, the word at the receiving end may be random and scattered because of the noise effect. The probabilities corresponding to it are $P(\eta\mid\xi)$. Having received word $\eta$, a recipient of information cannot yet say precisely which of the two words $\xi_k$ and $\xi_l$ ($l \neq k$) was transmitted, if the probabilities $P(\eta\mid\xi_k)$ and $P(\eta\mid\xi_l)$ are both non-zero. He/she can only speculate about the a posteriori probabilities of this or that code word. If we suppose that the a priori probabilities $P(k)$ of all $M$ messages (and thereby of all code words $\xi_k$) are equal (to $1/M$), then according to Bayes' formula we obtain the a posteriori probability of code word $\xi_k$:
$$ P[\xi_k\mid\eta] = \frac{\frac{1}{M}P(\eta\mid\xi_k)}{\sum_l \frac{1}{M}P(\eta\mid\xi_l)} = \frac{P(\eta\mid\xi_k)}{\sum_l P(\eta\mid\xi_l)}. \qquad (7.1.4) $$
If a recipient of information chooses code word ξk for fixed η , then he/she will re-
ceive a correct message with probability P(ξk | η ) or make an error with probability
1 − P(ξk | η ). In order to minimize the probability of error, the recipient apparently
has to select word ξk , which corresponds to the maximum (out of M possible ones)
probability (7.1.4) or, equivalently, the maximum likelihood function P(η | ξk ). In
this case, the average probability Per of error will be determined by averaging the
probability of error 1 − P(ξk | η ) with respect to η :
$$ P_{\mathrm{er}} = 1 - \mathrm{E}\max_k P(\xi_k\mid\eta) = 1 - \sum_{\eta}\max_k\frac{P(\eta\mid\xi_k)}{\sum_l P(\eta\mid\xi_l)}\,P(\eta) $$
220 7 Message transmission in the presence of noise. Second asymptotic theorem
due to (7.1.4), or
$$ P_{\mathrm{er}} = 1 - \frac{1}{M}\sum_{\eta}\max_k P(\eta\mid\xi_k), \qquad (7.1.5) $$
since
$$ P(\eta) = \sum_l P(\eta\mid\xi_l)\,\frac{1}{M}. $$
The choice of the maximum likelihood function among the functions $P(\eta\mid\xi_k)$ is equivalent to the choice of the minimum of the function
$$ D_0(\xi, \eta) = -I(\xi, \eta) = \ln\frac{P(\eta)}{P(\eta\mid\xi)}. \qquad (7.1.8) $$
Since $P(\eta\mid\xi_k) = \max_l P(\eta\mid\xi_l)$ inside region $G_k$ (by the definition of this region), we have
$$ P_{\mathrm{er}}(k) = 1 - \sum_{\eta\in G_k}\max_l P(\eta\mid\xi_l). \qquad (7.1.10) $$
7.2 Random code and the mean probability of error 221
Fig. 7.1 Schematic diagram of encoding and decoding for a noisy channel
If we average the probability of error
$$ P_{\mathrm{er}} = \frac{1}{M}\sum_{k=1}^{M}P_{\mathrm{er}}(k) \qquad (7.1.11) $$
over all the messages, each of which occurs with probability $1/M$, then the sum in (7.1.10) will extend to the entire sample space of $\eta$ and, consequently, we obtain formula (7.1.5).
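Formula (7.1.5) is easy to evaluate exactly for a toy channel. The sketch below (hypothetical code words and crossover probability, chosen only for illustration) computes the average error probability of maximum-likelihood decoding over a binary symmetric channel:

```python
import itertools

# Hypothetical setup: BSC with crossover p, block length n = 5, M = 4 code words.
p, n = 0.1, 5
codewords = [(0,0,0,0,0), (1,1,1,1,1), (0,0,1,1,1), (1,1,0,0,0)]
M = len(codewords)

def cond_prob(eta, xi):
    d = sum(a != b for a, b in zip(eta, xi))    # Hamming distance
    return p**d * (1 - p)**(n - d)              # P(eta | xi) for the BSC

# Average error probability of maximum-likelihood decoding, formula (7.1.5)
per = 1.0 - sum(max(cond_prob(eta, xi) for xi in codewords)
                for eta in itertools.product((0, 1), repeat=n)) / M
print(per)
```

For the BSC, maximizing $P(\eta\mid\xi_k)$ is the same as choosing the code word at the smallest Hamming distance from $\eta$, so the sum runs over nearest-codeword likelihoods.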
For a fixed decoding rule, such as the optimal rule described in the previous section,
the probability of error (i.e. the probability that a recipient selects a wrong word ξk
different from a word actually transmitted) depends on a chosen code. In order to
decrease the frequency of decoding errors caused by noise, it is desirable to select
code words that are 'dissimilar', lying, in some sense, as 'far' from one another as possible. Because we cannot increase the 'distance' between the code points $\xi_1, \dots, \xi_M$ without decreasing their number $M$, it is desirable to arrange
code points in the space X n of values ξ ‘as uniformly as possible’. The desired
‘uniformity’ is achieved due to the Laws of Large Numbers for large M (and n) if
we select the code points randomly and independently of each other.
The Shannon random code is constructed as follows. Code point $\xi_1$ is obtained by sampling a random variable $\xi$ with probabilities $P(\xi)$. The second point (and likewise the third and the others) is sampled independently of the other points by the same method. Consequently, the second point is an independent random variable with the same probabilities $P(\xi)$. In aggregate, all code points $\xi_1, \dots, \xi_M$ are described by the probability distribution $P(\xi_1)\cdots P(\xi_M)$.
For every fixed code $(\xi_1, \dots, \xi_M)$ obtained by the specified method and a fixed message $k$ there is some probability of decoding error. We denote that probability by $P_{\mathrm{er}}(\,\cdot\mid k, \xi_1, \dots, \xi_M)$. According to (7.1.9) it is equal to
$$ P_{\mathrm{er}}(\,\cdot\mid k, \xi_1,\dots,\xi_M) = 1 - \sum_{\eta:\ P(\eta|\xi_l) < P(\eta|\xi_k)\ \forall l\neq k} P(\eta\mid\xi_k). \qquad (7.2.2) $$
Averaging over the message number $k$ gives
$$ P_{\mathrm{er}}(\,\cdot\mid \xi_1,\dots,\xi_M) = \frac{1}{M}\sum_{k=1}^{M}P_{\mathrm{er}}(\,\cdot\mid k, \xi_1,\dots,\xi_M). $$
In other words,
$$ P_{\mathrm{er}} = \sum_{\xi_k,\eta}P(\xi_k,\eta)\, f\Bigl(\sum_{\xi:\ P(\eta|\xi) < P(\eta|\xi_k)}P(\xi)\Bigr), \qquad (7.2.6) $$
In formulae (7.2.2), (7.2.5), (7.2.6) the sign $<$ does not exclude the sign $=$. The matter is that there can exist 'questionable' points $\eta$ equidistant from several competing code points, say, $P(\eta\mid\xi_l) = P(\eta\mid\xi_k)$ ($l \neq k$). Such points can be attributed equally to region $E_l$ or region $E_k$. If we take into account the ambiguity related to such points, then, as is easy to see, we will have the following inequality instead of (7.2.6):
$$ \sum_{\xi_k,\eta}P(\xi_k,\eta)\,f\Bigl(\sum_{P(\eta|\xi) < P(\eta|\xi_k)}P(\xi)\Bigr) \leqslant P_{\mathrm{er}} \leqslant \sum_{\xi_k,\eta}P(\xi_k,\eta)\,f\Bigl(\sum_{P(\eta|\xi) \leqslant P(\eta|\xi_k)}P(\xi)\Bigr), \qquad (7.2.8) $$
where there is a strict inequality within the summation sign on the left. The contributions of all 'questionable' points are excluded from the left-hand side expression, whereas those contributions are counted multiple times in the right-hand side expression.
Further, we consider the expression $\sum_{P(\eta|\xi)\leqslant P(\eta|\xi_k)}P(\xi)$, which is the argument of the function $f$. The inequality $P(\eta\mid\xi) \leqslant P(\eta\mid\xi_k)$, or $P(\eta\mid\xi)/P(\eta) \leqslant P(\eta\mid\xi_k)/P(\eta)$, is equivalent to the inequalities
$$ e^{I(\xi,\eta)} \leqslant e^{I(\xi_k,\eta)}; \qquad \frac{P(\xi\mid\eta)}{P(\xi)} \leqslant e^{I(\xi_k,\eta)}, \qquad (7.2.9) $$
because
$$ \frac{P(\eta\mid\xi)}{P(\eta)} = \frac{P(\xi\mid\eta)}{P(\xi)} = e^{I(\xi,\eta)}. $$
From the inequality $P(\xi) \leqslant P(\xi\mid\eta)\,e^{-I(\xi_k,\eta)}$ (i.e. from the second inequality of (7.2.9)) we obtain via summation that
$$ \sum_{P(\eta|\xi)\leqslant P(\eta|\xi_k)}P(\xi) \;\leqslant\; e^{-I(\xi_k,\eta)}\sum_{P(\eta|\xi)\leqslant P(\eta|\xi_k)}P(\xi\mid\eta). $$
The given inequality can only be reinforced if the sum in the right-hand side is replaced with one:
$$ \sum_{P(\eta|\xi)\leqslant P(\eta|\xi_k)}P(\xi) \;\leqslant\; e^{-I(\xi_k,\eta)}. $$
Using the latter inequality in the right-hand side of (7.2.8) and taking into account the increasing nature of the function $f$, we find that
$$ P_{\mathrm{er}} \leqslant \sum_{\xi_k,\eta}P(\xi_k,\eta)\,f\bigl(e^{-I(\xi_k,\eta)}\bigr) = \mathrm{E}\,f\bigl(e^{-I(\xi_k,\eta)}\bigr) = \int f\bigl(e^{-I}\bigr)\,dF(I). \qquad (7.2.10) $$
Fig. 7.2 Function f (x) = 1 − (1 − x)M−1 and the majorizing polygonal line
Here we use F(I) = P{I(ξ , η ) < I} to denote the cumulative distribution function
of the information of communication I(ξ , η ) of variables ξ and η having joint dis-
tribution P(ξ , η ). Hence, the derived estimator is expressed only in terms of this
distribution function.
The behaviour of the function $f$ is shown in Figure 7.2. This function increases from 0 to 1 on the interval $0 \leqslant x \leqslant 1$, and its derivatives show that it is concave. Evidently, this function can be majorized by the polygonal curve
$$ f(x) \leqslant \min\,[(M-1)x,\ 1] \leqslant \min\,[Mx,\ 1]. \qquad (7.2.11) $$
After that, formula (7.2.10) takes the form
$$P_{er} \le \int \min\big[M e^{-I},\, 1\big]\, dF(I) \le M \int_{I \ge \ln M} e^{-I}\, dF(I) + \int_{I < \ln M} dF(I), \tag{7.2.12}$$
i.e.
$$P_{er} \le M \int_{\ln M}^{\infty} e^{-I}\, dF(I) + F(\ln M). \tag{7.2.13}$$
The breakpoint I = ln M in (7.2.12) may be substituted by any other point I = λ ;
doing that we can only make the inequality stronger. The derived inequalities will
be used later on.
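The chain of majorizations used above is easy to test numerically. The following sketch (the value of M and the grid are illustrative choices of mine, not values from the text) checks f(x) = 1 − (1 − x)^{M−1} against the polygonal line of (7.2.11):

```python
# Numerical check of the majorization (7.2.11): for the decoding-error
# function f(x) = 1 - (1 - x)^(M-1) one has f(x) <= min[(M-1)x, 1] <= min[Mx, 1]
# on 0 <= x <= 1.  M = 16 and the 101-point grid are illustrative choices.

def f(x: float, M: int) -> float:
    """Probability that at least one of M-1 rival code points ties or beats
    the transmitted one, when each does so independently with probability x."""
    return 1.0 - (1.0 - x) ** (M - 1)

M = 16
for k in range(101):
    x = k / 100.0
    assert f(x, M) <= min((M - 1) * x, 1.0) + 1e-12
    assert min((M - 1) * x, 1.0) <= min(M * x, 1.0) + 1e-12
print("majorization (7.2.11) holds on the grid")
```

The first inequality is just Bernoulli's inequality; the second replaces M − 1 by M.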
7.3 Asymptotic zero probability of decoding error. Shannon's theorem
One important and far from trivial fact is that the average decoding error can be
made as small as desired by an appropriate code selection when increasing the num-
ber of characters n, without diminishing the noise level in the channel and without
decreasing the amount of transmitted information per character. This result was
obtained by Shannon in 1948 [45] (the English original is [38]) and (being formu-
lated in terms of channel capacity) usually goes under the name of Shannon's theorem.
Theorem 7.1. Consider channel [P(y | x), P(x)] and channel [P(η | ξ), P(ξ)], which
is the n-th power of the former (as in Section 7.1, see (7.1.1), (7.1.2)). Further, sup-
pose that the amount ln M of transmitted information increases as n → ∞ according
to the law
$$\ln M = \ln\big[e^{nR}\big] \le nR \tag{7.3.1}$$
(the square brackets mean an integer part), where R is a value independent of n and
satisfying the inequality
$$R < I_{xy} < \infty. \tag{7.3.2}$$
Then there exists a sequence of codes K(n) such that Per → 0 as n → ∞.
Applying the Law of Large Numbers (Khinchin's theorem; see, for instance,
Gnedenko [13], also translated to English [14]) to (7.3.4), we obtain
$$P\Big\{\Big|\frac{1}{n}\, I(\xi, \eta) - I_{xy}\Big| < \varepsilon\Big\} \to 1$$
and thereby
$$P\big\{|I(\xi, \eta) - I_{\xi\eta}| \ge n\varepsilon\big\} \to 0 \quad\text{as } n \to \infty, \tag{7.3.5}$$
for whatever ε > 0.
Next we consider the average error probability with respect to the ensemble of
random codes described in Section 7.2. We suppose that ε = (I_xy − R)/2 = (I_{ξη} −
nR)/2n, so that n(R + ε) = I_{ξη} − nε. It is apparent that ε > 0 due to (7.3.2). Since
$$\min\big[M e^{-I},\, 1\big] \le \begin{cases} M e^{-I} & \text{for } I > I_{\xi\eta} - n\varepsilon,\\ 1 & \text{for } I \le I_{\xi\eta} - n\varepsilon,\end{cases}$$
we have
$$P_{er} \le M \int_{I > n(R+\varepsilon)} e^{-I}\, dF(I) + F(I_{\xi\eta} - n\varepsilon). \tag{7.3.6}$$
Further, we consider the first term in the right-hand side. The inequality
I > nR + nε entails
$$e^{-I} < e^{-nR - n\varepsilon}, \qquad \int_{I > nR + n\varepsilon} e^{-I}\, dF(I) < e^{-nR - n\varepsilon} \int_{I > nR + n\varepsilon} dF(I) \le e^{-nR - n\varepsilon}.$$
Therefore,
$$M \int_{I > n(R+\varepsilon)} e^{-I}\, dF(I) \le M e^{-nR - n\varepsilon} \le e^{-n\varepsilon},$$
since M ≤ e^{nR} by (7.3.1).
This expression converges to zero as n → ∞. The second term in the right-hand side
of (7.3.6) goes to zero because of (7.3.5). Consequently,
Per → 0 as n → ∞. (7.3.7)
Among the ensemble of random codes there must be a code (ξ1n , . . . , ξMn ) whose
error probability does not exceed the average probability.
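To see the mechanism of the bound at work, (7.2.13) can be evaluated for a concrete channel. The sketch below is an illustration of mine, not a construction from the text: it takes the n-th power of a binary symmetric channel with crossover probability p and uniform input, for which the information density I(ξ, η) is a sum of n i.i.d. terms equal to ln 2(1 − p) or ln 2p, so that F(I) is a rescaled binomial distribution.

```python
# Evaluate the random-coding bound (7.2.13) for the n-th power of a binary
# symmetric channel with crossover p and uniform input (illustrative choice).
from math import comb, log, exp

def per_bound(n: int, R: float, p: float) -> float:
    """M * int_{I >= ln M} e^{-I} dF(I) + F(ln M), with M = floor(e^{nR})."""
    M = int(exp(n * R))
    lnM = log(M)
    i_good, i_bad = log(2 * (1 - p)), log(2 * p)   # per-letter densities
    total = 0.0
    for k in range(n + 1):                         # k flipped letters
        I = (n - k) * i_good + k * i_bad
        pk = comb(n, k) * (1 - p) ** (n - k) * p ** k
        total += M * exp(-I) * pk if I >= lnM else pk
    return total

p, R = 0.05, 0.25       # R is below C = ln 2 - h(p), roughly 0.495 nats
bounds = [per_bound(n, R, p) for n in (25, 50, 100)]
print(bounds)           # the bound shrinks as n grows
```

The decay of the printed values with n is exactly the content of the theorem above.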
A. ∞ > I_{ξ^n η^n} → ∞ as n → ∞;
B. the ratio I(ξ^n, η^n)/I_{ξ^n η^n} converges in probability to 1 as n → ∞.
Theorem 7.2 (The general form of the second asymptotic theorem). Let ξ^n, η^n,
n = 1, 2, …, be a sequence of informationally stable random variables. Further, sup-
pose that the amount of transmitted information increases as n → ∞ according to
the law
$$\ln M = \ln\big[e^{(1-\mu) I_{\xi^n\eta^n}}\big],$$
where μ > 0 is independent of n. Then there exists a sequence of codes such that
Per → 0 as n → ∞.

In the proof we repeat the previous argument, now setting
$$\varepsilon = \mu/2. \tag{7.3.11}$$
The second term in the right-hand side converges to zero as n → ∞ due to (7.3.10).
At the same time we can apply (7.3.9) to estimate the corresponding first term as
follows:
$$M \int_{I > (1-\varepsilon) I_{\xi^n\eta^n}} e^{-I}\, dF(I) \;\le\; e^{(1-\mu) I_{\xi^n\eta^n}}\, e^{-(1-\varepsilon) I_{\xi^n\eta^n}} \int_{I > (1-\varepsilon) I_{\xi^n\eta^n}} dF(I) \;\le\; e^{(\varepsilon - \mu) I_{\xi^n\eta^n}}.$$
Therefore, it goes to zero due to (7.3.11) and property A from the definition of
informational stability. Considerations analogous to those in the previous theorem
finish the proof.
7.4 Asymptotic formula for the probability of error
In addition to the results of the previous section, we can obtain stronger results
related to the rate at which the error probability vanishes. It turns out that
the probability of error for satisfactory codes decreases mainly exponentially with the
growth of n:
$$P_{er} \le e^{a - \alpha n}, \tag{7.4.1}$$
where a is a value weakly dependent on n and α is a constant of main interest.
Rather general formulae can be derived for the latter quantity.
1.
Theorem 7.3. Under the conditions of Theorem 7.1 the following inequality is
valid:
$$P_{er} \le 2e^{-[s\mu'(s) - \mu(s)]n}, \tag{7.4.2}$$
where
$$\mu(t) = \ln \sum_{x,y} P^{1-t}(x, y)\, P^t(x)\, P^t(y) \tag{7.4.3}$$
[see (6.4.10), with argument s replaced by t] and s is a positive root of the equation
$$\mu'(s) = -R. \tag{7.4.4}$$
It is also assumed that R is relatively close to I_xy in order for the latter equation to
have a solution. Besides, the value of s is assumed to lie within a differentiability
interval of the potential μ(t).
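As a check on the two equivalent forms of the potential, the sketch below evaluates (7.4.3) and the moment-generating form ln E exp(−t I(x, y)) of (7.4.9) for a binary symmetric channel with uniform input (an illustrative choice of mine) and verifies that μ(0) = μ(1) = 0:

```python
# Check that the definition (7.4.3) of the potential mu(t) agrees with
# mu(t) = ln E exp(-t I(x,y)) of (7.4.9), on a BSC with uniform input.
from math import log, exp

p = 0.1
P = {(x, y): 0.5 * ((1 - p) if x == y else p) for x in (0, 1) for y in (0, 1)}
Px = {x: 0.5 for x in (0, 1)}
Py = {y: 0.5 for y in (0, 1)}

def mu_def(t):   # (7.4.3)
    return log(sum(P[x, y] ** (1 - t) * Px[x] ** t * Py[y] ** t
                   for (x, y) in P))

def mu_mgf(t):   # (7.4.9), with I(x,y) = ln P(x,y) - ln P(x) - ln P(y)
    return log(sum(P[x, y] * exp(-t * (log(P[x, y]) - log(Px[x]) - log(Py[y])))
                   for (x, y) in P))

for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    assert abs(mu_def(t) - mu_mgf(t)) < 1e-12
assert abs(mu_def(0.0)) < 1e-12 and abs(mu_def(1.0)) < 1e-12
print("potential checks passed")
```

The identity holds because P^{1−t}(x,y)P^t(x)P^t(y) = P(x,y) e^{−tI(x,y)}; μ(0) = μ(1) = 0 follows from normalization.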
or
$$P_{er} \le e^{nR} \int_{nR}^{\infty} e^{-I}\, dF(I) + F(nR) \tag{7.4.6}$$
in consequence of (7.3.1).
where
$$e^{n\mu(t)} = \int e^{-tI}\, F(dI), \tag{7.4.8}$$
i.e.
$$e^{n\mu(t)} = \mathrm{E}\, e^{-tI(\xi,\eta)} = \Big[\sum_{x,y} e^{-tI(x,y)}\, P(x, y)\Big]^n, \tag{7.4.9}$$
and
$$\tilde\mu'(\tilde s) = -R, \tag{7.4.13}$$
where $\tilde s < 0$ since $-R < \tilde\mu'(0)$ ($\tilde\mu'(0) = \mu'(1) > 0$).
Therefore,
$$\tilde\mu(\tilde s) = \mu(\tilde s + 1) - \mu(1). \tag{7.4.16}$$
That is why (7.4.13) takes the form
$$\mu'(\tilde s + 1) = -R. \tag{7.4.17}$$
so that $\tilde s + 1 = s$.
2. According to the provided theorem, the potential μ(s) defines the coefficient
$$\alpha = s\mu'(s) - \mu(s) \tag{7.4.19}$$
sitting in the exponent (7.4.1) as the Legendre transform of the characteristic potential.

Fig. 7.3 Typical behaviour of the coefficient $\alpha(R) = \lim_{n\to\infty} \dfrac{-\ln P_{er}}{n}$ appearing in the exponent of formula (7.4.1)
$$\frac{d^2\alpha}{dR^2} = \frac{1}{\mu''(s)}.$$
In particular, we have
$$\frac{d^2\alpha}{dR^2}(I_{xy}) = \frac{1}{\mu''(0)} \tag{7.4.22}$$
at the point R = I_xy, s = 0. As is easy to see from definition (7.4.9) of the function μ(s), here
μ''(0) coincides with the variance of the random information I(x, y).
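A numerical sketch of these relations, for an illustrative binary symmetric channel: solve μ′(s) = −R by bisection, form α = sμ′(s) − μ(s) as in (7.4.19), and check that μ″(0) coincides with the variance of I(x, y):

```python
# Error exponent alpha(R) = s mu'(s) - mu(s) with mu'(s) = -R, and the
# check mu''(0) = Var I(x,y); channel and rate are illustrative (BSC).
from math import log, exp

p = 0.1
pairs = [((x, y), 0.5 * ((1 - p) if x == y else p)) for x in (0, 1) for y in (0, 1)]
I = {xy: log(q) - log(0.5) - log(0.5) for xy, q in pairs}   # information density

def mu(t):
    return log(sum(q * exp(-t * I[xy]) for xy, q in pairs))

def dmu(t, h=1e-6):
    return (mu(t + h) - mu(t - h)) / (2 * h)

Ixy = sum(q * I[xy] for xy, q in pairs)            # mutual information, -mu'(0)
varI = sum(q * I[xy] ** 2 for xy, q in pairs) - Ixy ** 2
d2mu0 = (mu(1e-4) - 2 * mu(0.0) + mu(-1e-4)) / 1e-8
assert abs(d2mu0 - varI) < 1e-4                    # mu''(0) = Var I(x,y)

R = 0.6 * Ixy                                      # an illustrative rate below Ixy
lo, hi = 0.0, 1.0                                  # bisection for mu'(s) = -R
for _ in range(60):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if dmu(mid) < -R else (lo, mid)
s = 0.5 * (lo + hi)
alpha = s * dmu(s) - mu(s)
assert s > 0 and alpha > 0
print(round(alpha, 4))
```

The root exists because μ is convex with μ′(0) = −I_xy < −R; α > 0 then follows from strict convexity.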
These results can be extended to the more general [in comparison with (7.1.1)]
case of arbitrary informationally stable random variables, i.e. Theorem 7.2 can be
strengthened by accounting for the rate of convergence of the error probability.
We will not pursue this point but limit ourselves to indicating that the improve-
ment in question can be realized in a completely standard way. The coefficient α should
not be considered separately from n; instead, we need to operate with the combination
α_n = nα. Analogously, we should consider only the corresponding combinations of
the potentials with n. Only the way of writing the formulae will change in the text above.
Formulae which assess the behaviour of the error probability were derived in the
works of Shannon [42] (the English original is [40]) and Fano [10] (the English
original is [9]).
Theorem 7.4. Suppose that we have a channel [P(η | ξ), P(ξ)] (just as in Theo-
rem 7.1), which is the n-th power of a channel [P(y | x), P(x)]. Let the decoding be
performed on the basis of the minimum distance
$$D(\xi, \eta) = \sum_{j=1}^{n} d(x_j, y_j) \tag{7.5.1}$$
(R < I_xy is independent of n). Then there exists a sequence of codes having the
probability of decoding error
$$P_{er} \le 2e^{-n[s_0 \gamma'(s_0) - \gamma(s_0)]}. \tag{7.5.3}$$
Proof. As earlier, we will consider random codes and average the decoding error
over them.
First, we write down the inequalities for the average error, which are analogous
to (7.2.1)–(7.2.4) but with an arbitrarily assigned distance D(ξ , η ). Now we perform
averaging with respect to η in the last turn:
where
$$\upsilon(k, \xi_1, \dots, \xi_M, \eta) = \begin{cases} 0, & \text{if } D(\xi_l, \eta) > D(\xi_k, \eta) \text{ for all } l \ne k,\\ 1, & \text{if } D(\xi_l, \eta) \le D(\xi_k, \eta) \text{ for at least one } l \ne k, \end{cases} \tag{7.5.9}$$
and
$$F_\eta[\lambda] = \sum_{D(\xi_l, \eta) \le \lambda} P(\xi_l). \tag{7.5.11}$$
We have omitted the index k here because the expressions in (7.5.10), (7.5.12) turn out
to be independent of it. Selecting some boundary value nd (independent of η) and
using the inequality
$$\min\big\{M F_\eta[D(\xi, \eta)],\, 1\big\} \le \begin{cases} M F_\eta[D(\xi, \eta)] & \text{for } D(\xi, \eta) \le nd,\\ 1 & \text{for } D(\xi, \eta) > nd, \end{cases} \tag{7.5.13}$$
Further, we average the latter inequality over η and thereby obtain the estimator
for the average probability of error, where
$$P_2 = \sum_{D(\xi, \eta) > nd} P(\xi, \eta) \tag{7.5.16}$$
and where
$$\gamma'(s_0) = d, \quad s_0 > 0; \qquad n\gamma(s) = \ln \mathrm{E}\, e^{sD(\xi,\eta)}, \quad \gamma(s) = \ln \mathrm{E}\, e^{s\, d(x,y)}. \tag{7.5.19}$$
In order to derive an analogous estimator for it, we need to apply the multivariate
generalization of Theorem 4.6 or 4.7, i.e. formula (4.4.13). This yields
$$P_1 \le e^{-n[r_0 \varphi'_r + t_0 \varphi'_t - \varphi(r_0, t_0)]},$$
where
$$\varphi'_r = \frac{\partial \varphi(r_0, t_0)}{\partial r}, \qquad \varphi'_t = \frac{\partial \varphi(r_0, t_0)}{\partial t}, \tag{7.5.21}$$
$$r_0 < 0, \qquad t_0 < 0,$$
which constitutes the system of equations (7.5.4) together with other equations fol-
lowing from (7.5.19), (7.5.22). Simultaneously, inequality (7.5.24) turns into (7.5.3).
The proof is complete.
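The setting of Theorem 7.4 is easy to simulate. The following Monte Carlo sketch (all parameters are illustrative choices of mine) draws a random code, passes a codeword through a binary symmetric channel, and decodes by minimum Hamming distance D(ξ, η) = Σ_j d(x_j, y_j); for a rate well below capacity the empirical error rate is small:

```python
# Monte Carlo sketch of random coding with minimum-distance decoding
# over a BSC; ties are counted as errors.  Parameters are illustrative.
import random

def simulate(n=60, M=8, p=0.05, trials=400, seed=0):
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        code = [[rng.randint(0, 1) for _ in range(n)] for _ in range(M)]
        k = rng.randrange(M)                        # transmitted message
        y = [b ^ (rng.random() < p) for b in code[k]]
        dists = [sum(a != b for a, b in zip(w, y)) for w in code]
        if dists[k] != min(dists) or dists.count(min(dists)) > 1:
            errors += 1                             # k is not the unique minimizer
    return errors / trials

per = simulate()
print(per)
assert 0.0 <= per < 0.2
```

Here ln M / n is far below capacity, so the empirical error rate is essentially zero; raising M or p moves it up, in the exponential fashion described by (7.5.3).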
2. Now we turn our attention to the particular case when R (< I_xy) is so far from
I_xy that the root r_0 [see equations (7.5.4)] becomes positive. Then it is reasonable to
choose nd = ∞ in inequality (7.5.13), so that the inequality takes the form
$$\min\big\{M F_\eta[D(\xi, \eta)],\, 1\big\} \le M F_\eta[D(\xi, \eta)].$$
which is equal to nφ(0, t) according to (7.5.23). Due to Theorem 4.6 (while substi-
tuting M with e^{nR}, which reinforces the inequality) we obtain
$$P_{er} \le \exp\big\{nR - n\big[t^* \varphi'_t(0, t^*) - \varphi(0, t^*)\big]\big\} = e^{n[\varphi(0, t^*) + R]}. \tag{7.5.26}$$
Next, we address equations (7.5.4). We use R∗ to denote a certain value R, for which
the root r0 equals 0.
The other roots s0 , t0 corresponding to this value are denoted as s∗0 and t0∗ , respec-
tively. The second equation from (7.5.4) takes the form
Comparing (7.5.28) with (7.5.27), we see that t ∗ = t0∗ . So, the third equation
from (7.5.4) can be rewritten as follows:
This result is valid when r_0 > 0 and formula (7.5.3) cannot be used. Taking
into account the character of change of the potentials φ(r, t), γ(s), we can make cer-
tain that the constraint r_0 > 0 is equivalent to the constraint R < R* or s_0 > s_0*.
After introducing the Legendre convex conjugate α(R) of the function γ(s) by equali-
ties (7.4.19), (7.4.20), formulae (7.5.3), (7.5.30) can be represented as follows:
7.5 Enhanced estimators for optimal decoding
$$P_{er} \le \begin{cases} 2e^{-n\alpha(R)} & \text{for } R^* < R < I_{xy},\\ e^{-n[\alpha(R^*) + R^* - R]} & \text{for } R < R^*. \end{cases} \tag{7.5.31}$$
where f(y) is a function which will be specified below. The corresponding distance
D(ξ, η) = Σ_i d(x_i, y_i) is more general than (7.1.8). By introducing the notation
$$\gamma_y(\beta) = \ln \sum_x P^\beta(y \mid x)\, P(x), \tag{7.5.33}$$
$$1 + t - r = -t, \qquad r = 1 + 2t, \tag{7.5.38}$$
in particular r_0 = 1 + 2t_0, since in this case every term of the latter sum turns into
zero. According to (7.5.35), (7.5.36) the first equation of (7.5.4) takes the form
$$e^{-\gamma(s_0)} \sum_y e^{\gamma_y(1-s_0) + s_0 f(y)}\big[f(y) - \gamma'_y(1-s_0)\big] = e^{-\varphi(1+2t_0,\, t_0)} \sum_y e^{2\gamma_y(-t_0) + (1+2t_0) f(y)}\big[f(y) - \gamma'_y(-t_0)\big] \tag{7.5.39}$$
in the given case. In order to satisfy the latter equation we suppose that
$$1 - s_0 = -t_0, \tag{7.5.40}$$
$$\gamma_y(1-s_0) + s_0 f(y) = 2\gamma_y(-t_0) + (1+2t_0)\, f(y). \tag{7.5.41}$$
But the last equation is satisfied by virtue of the same relations (7.5.38), (7.5.40),
(7.5.42), as can be easily verified by substituting them into (7.5.34).
Hence, both equations of system (7.5.4) are satisfied. In consequence of (7.5.38),
(7.5.40), (7.5.43) the remaining equation can be reduced to
$$(1 - s_0)\,\gamma'(s_0) + R = 0. \tag{7.5.44}$$
Differentiating this expression and taking into account (7.5.35), we obtain that equa-
tion (7.5.44) can be rewritten as
$$R = e^{-\gamma(s_0)} \sum_y e^{\frac{\gamma_y(1-s_0)}{1-s_0}}\; \frac{1}{1-s_0}\,\Big[(1-s_0)\,\gamma'_y(1-s_0) - \gamma_y(1-s_0)\Big]. \tag{7.5.46}$$
Since
$$\gamma'_y(1) = \sum_x \frac{P(x, y)}{P(y)} \ln P(y \mid x), \qquad \gamma_y(1) = \ln P(y),$$
so that
$$\gamma'_y(1) - \gamma_y(1) = \frac{1}{P(y)} \sum_x P(x, y) \ln P(y \mid x) - \ln P(y),$$
it follows that
$$\sum_y e^{\gamma_y(1)}\big[\gamma'_y(1) - \gamma_y(1)\big] = H_y - H_{y \mid x} = I_{xy}. \tag{7.5.47a}$$
Analyzing expression (7.5.46), we can see that s_0 > 0 if R < I_xy. Due to
(7.5.38), (7.5.40) the other roots are equal to
$$r_0 = 2s_0 - 1, \qquad t_0 = s_0 - 1. \tag{7.5.48}$$
Apparently, they are negative if 0 < s_0 < 1/2, i.e. if R is sufficiently close to I_xy. If
s_0 exceeds 1/2, so that r_0 becomes positive because of the first equality of (7.5.48),
then we should use formula (7.5.30) instead of (7.5.3), as was said in the previous
clause. Due to (7.5.48), the values s_0*, t_0* obtained from the condition r_0 = 0 turn out to
be the following:
$$s_0^* = 1/2, \qquad t_0^* = -1/2.$$
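A numerical sketch of the critical rate, assuming γ_y(β) as in (7.5.33), the potential γ(s) = ln Σ_y exp[γ_y(1 − s)/(1 − s)] of (7.5.51), and the parametric relation R = −(1 − s_0)γ′(s_0) implied by (7.5.44); for a binary symmetric channel with uniform input one then expects 0 < R* < I_xy, with I_xy recovered at s_0 = 0:

```python
# Critical rate R* = -(1/2) gamma'(1/2) for a BSC with uniform input,
# under the stated parametric relation; derivatives are numerical.
from math import log, exp

p = 0.1
Pyx = {(y, x): (1 - p) if x == y else p for x in (0, 1) for y in (0, 1)}

def gamma_y(y, beta):
    return log(sum(Pyx[y, x] ** beta * 0.5 for x in (0, 1)))

def gamma(s):
    return log(sum(exp(gamma_y(y, 1 - s) / (1 - s)) for y in (0, 1)))

def dgamma(s, h=1e-6):
    return (gamma(s + h) - gamma(s - h)) / (2 * h)

Ixy = log(2) + (1 - p) * log(1 - p) + p * log(p)   # ln 2 - h(p), nats
R_star = -0.5 * dgamma(0.5)                        # R at s0 = 1/2
assert abs(-dgamma(0.0) - Ixy) < 1e-5              # R(s0 = 0) recovers Ixy
assert 0 < R_star < Ixy
print(round(R_star, 4), round(Ixy, 4))
```

The self-check at s_0 = 0 mirrors the remark after (7.5.47a) that the right-hand side of (7.5.46) tends to I_xy.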
The ‘critical’ value R* is obtained from equation (7.5.46) by substituting s_0 = 1/2,
i.e. it turns out to be equal to
$$R^* = -\tfrac{1}{2}\,\gamma'\big(\tfrac{1}{2}\big) = e^{-\gamma(1/2)} \sum_y e^{2\gamma_y(1/2)}\Big[\gamma'_y\big(\tfrac{1}{2}\big) - 2\gamma_y\big(\tfrac{1}{2}\big)\Big]$$
or, equivalently,
$$R^* = \frac{\sum_y e^{2\gamma_y(1/2)}\big[\gamma'_y(\tfrac{1}{2}) - 2\gamma_y(\tfrac{1}{2})\big]}{\sum_y e^{2\gamma_y(1/2)}}, \tag{7.5.49}$$
where we take into account (7.5.35), (7.5.42).
The results derived above can be formulated in the form of a theorem.

Theorem 7.5. Under the conditions of Theorems 7.1, 7.3 there exists a sequence of
codes such that the probability of decoding error satisfies the inequality
$$P_{er} \le \begin{cases} 2e^{n[s_0 R/(1-s_0) + \gamma(s_0)]} & \text{for } R^* \le R < I_{xy},\\ e^{n[\gamma(1/2) + R]} & \text{for } R < R^*, \end{cases} \tag{7.5.50}$$
where
$$\gamma(s_0) = \ln \sum_y e^{\gamma_y(1-s_0)/(1-s_0)} \equiv \ln \sum_y \Big[\sum_x P(x)\, P^{1-s_0}(y \mid x)\Big]^{1/(1-s_0)}. \tag{7.5.51}$$
$$\gamma(s) = \gamma_{10}\, s + \gamma_{20}\, s^2 + \gamma_{11}\, s s_0 + \gamma_{30}\, s^3 + \gamma_{21}\, s^2 s_0 + \gamma_{12}\, s s_0^2 + \cdots. \tag{7.5.52}$$
In this case,
$$\gamma_{10} = -I_{xy} \tag{7.5.53}$$
due to (7.5.47), (7.5.47a).
Substituting (7.5.52) into (7.5.54), we obtain an expansion in powers of s_0.
Taking into account (7.5.53), it follows from the latter formula that s_0 is of the order
of I_xy − R. The substitution of this expression into (7.5.54) allows us to find the value α_0
to within terms of the order of magnitude (I_xy − R)^3.
Coefficients γ_ik can be computed with the help of (7.5.45). For convenience of
computation we transform the latter expression to a somewhat different form by
introducing the conditional characteristic potential of the random information,
μ(t | y) = ln E[e^{−tI(x,y)} | y], so that
$$\gamma_y(1-t) = \mu(t \mid y) + (1-t) \ln P(y). \tag{7.5.58}$$
Here
$$\mu(0 \mid y) = 0, \qquad \mu'(0 \mid y) = -\mathrm{E}[I(x, y) \mid y] = -m,$$
$$\mu''(0 \mid y) = \mathrm{E}[I^2(x, y) \mid y] - \{\mathrm{E}[I(x, y) \mid y]\}^2 = \mathrm{Var}[I(x, y) \mid y] = D,$$
$$\mu'''(0 \mid y) = -k, \qquad \cdots \tag{7.5.59}$$
Taking into account (7.5.59), we can represent the expression situated in the ex-
ponent as a power series in s and s_0. Consequently,
$$\exp\Big[\mu(s \mid y) + \frac{s}{1-s_0}\,\mu(s_0 \mid y)\Big]$$
$$= 1 - ms + \frac{1}{2}m^2 s^2 - \frac{1}{6}m^3 s^3 + \frac{1}{2}D s^2 + \cdots - \frac{1}{6}k s^3 + \cdots - \frac{1}{2}mD s^3 + m^2 s^2 s_0 - m s s_0 + \cdots + \Big(\frac{D}{2} - m\Big) s s_0^2 + \cdots$$
$$= 1 - ms + \frac{1}{2}(D + m^2)\,s^2 - m s s_0 - \frac{1}{6}(k + 3Dm + m^3)\,s^3 + m^2 s^2 s_0 + \Big(\frac{D}{2} - m\Big) s s_0^2 + \cdots.$$
After averaging the latter expression over y according to (7.5.60) and denoting a
mean value by an overline, we will have
$$\gamma(s) = \ln\Big[1 - \overline{m}\, s + \frac{1}{2}\overline{(D + m^2)}\, s^2 - \overline{m}\, s s_0 - \frac{1}{6}\overline{(k + 3Dm + m^3)}\, s^3 + \overline{m^2}\, s^2 s_0 + \overline{\Big(\frac{D}{2} - m\Big)}\, s s_0^2 + \cdots\Big].$$
$$\gamma(s) = -\overline{m}\, s + \frac{1}{2}\big[\overline{D} + \overline{m^2} - (\overline{m})^2\big] s^2 - \overline{m}\, s s_0 - \frac{1}{6}\big[\overline{k} + 3\overline{Dm} + \overline{m^3} + 2(\overline{m})^3 - 3\overline{(D + m^2)}\,\overline{m}\big] s^3 + \big[\overline{m^2} - (\overline{m})^2\big] s^2 s_0 + \overline{\Big(\frac{D}{2} - m\Big)}\, s s_0^2 + \cdots. \tag{7.5.61}$$
equivalent to (7.4.3). That is why the terms with s, s², s³, … in (7.5.61) are automat-
ically proportional to the full cumulants of the information I(x, y).
Further, we substitute this equality into formula (7.5.54), which then takes the form
$$\alpha_0 = \frac{1}{2}\mu''(0)\, s_0^2 + \Big[\frac{1}{2}\mu''(0)\,\overline{m} - \frac{2}{3}\, I_{xy}\Big] s_0^3 + \cdots.$$
Here we have taken into account that, according to the third equality of (7.5.59), it
holds true that
$$\mu''(0) - \overline{D} = \mathrm{E}\,I^2(x, y) - I_{xy}^2 - \overline{D} = \overline{m^2} - I_{xy}^2.$$
Moreover, we compare the last result with formula (7.4.24), which is valid in the
same approximation sense. At the same time we take into account that
$$\overline{D} \le \mu''(0),$$
since the conditional variance D averaged over y does not exceed the regular (non-conditional)
variance μ''(0).
In comparison with (7.4.24), equation (7.5.63) contains additional positive terms,
because of which α0 > α . Therefore, inequality (7.5.50) is stronger than inequal-
ity (7.4.2) (at least for values of R, which are sufficiently close to Ixy ). Thus, Theo-
rem 7.5 is stronger than Theorem 7.3.
A number of other results giving an improved estimation of behaviour of the
probability of decoding error are provided in the book by Fano [10] (the English
original is [9]).
2. Let us now compare the amount of information I_{kη}(| ξ_1 … ξ_M) with I_{ξη}. The former
amount of information is defined by the formula
$$I_{k\eta}(\mid \xi_1, \dots, \xi_M) = H_\eta(\mid \xi_1, \dots, \xi_M) - H_{\eta \mid k}(\mid \xi_1, \dots, \xi_M) = \sum_\eta f\Big(\sum_k P(k)\, P(\eta \mid \xi_k)\Big) - \sum_k P(k)\, H_\eta(\mid \xi_k); \tag{7.6.4}$$
$$f(z) = -z \ln z. \tag{7.6.5}$$
In turn, the information I_{ξη} can be rewritten as
$$I_{\xi\eta} = H_\eta - H_{\eta \mid \xi} = \sum_\eta f\Big(\sum_\xi P(\xi)\, P(\eta \mid \xi)\Big) - \sum_\xi P(\xi)\, H_\eta(\mid \xi). \tag{7.6.6}$$
It is easy to make sure that after averaging (7.6.4) over ξ_1, …, ξ_M the second (sub-
trahend) term coincides with the second term of formula (7.6.6). Indeed, the corresponding expec-
tation equals
$$\sum_{\xi_k} P(\xi_k)\, H_\eta(\mid \xi_k).$$
For the first term we use the concavity of the function f, i.e. the inequality
$$\mathrm{E}[f(\zeta)] \le f(\mathrm{E}[\zeta]).$$
But E[P(η | ξ_k)] = Σ_{ξ_k} P(ξ_k) P(η | ξ_k) does not depend on k and coincides with the
argument of the function f from the first term of relation (7.6.7). Hence, for every η
$$f\Big(\sum_\xi P(\xi)\, P(\eta \mid \xi)\Big) - \mathrm{E}\, f\Big(\sum_k P(k)\, P(\eta \mid \xi_k)\Big) \ge 0.$$
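The Jensen step E f(ζ) ≤ f(E ζ) relies only on the concavity of f(z) = −z ln z from (7.6.5); a minimal numerical illustration (the distribution of ζ is an arbitrary choice of mine):

```python
# Jensen's inequality E f(zeta) <= f(E zeta) for the concave f(z) = -z ln z.
from math import log

def f(z):
    return -z * log(z) if z > 0 else 0.0

zeta = [0.05, 0.2, 0.4, 0.7]          # possible values of zeta
w = [0.1, 0.3, 0.4, 0.2]              # their probabilities
lhs = sum(wi * f(zi) for wi, zi in zip(w, zeta))    # E f(zeta)
rhs = f(sum(wi * zi for wi, zi in zip(w, zeta)))    # f(E zeta)
assert lhs <= rhs + 1e-12
print(round(lhs, 4), round(rhs, 4))
```

Concavity follows from f″(z) = −1/z < 0 on (0, 1].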
where Per(k, ξ_1, …, ξ_M) is the probability of decoding error under the condition that
message k has been transmitted (a code is fixed). At the second stage, if l ≠ k,
then we should point out which of the remaining messages has been received. The
corresponding uncertainty cannot be larger than Per ln(M − 1). Therefore,
Further, we can perform averaging over an ensemble of random codes and anal-
ogously, using (7.6.10) one more time, obtain
Because Ikl|ξ1 ,...,ξM = Hl|ξ1 ,...,ξM − Hl|k,ξ1 ,...,ξM , it follows from (7.6.11) that
The same reasoning is applicable if we switch k and l. Then, in analogy with (7.6.12)
and since
$$e^{nR} - 1 < M \le e^{nR},$$
we will have
$$1 + \frac{\ln(1 - e^{-nR})}{nR} - P_{er} - \frac{\ln 2}{nR} \le \frac{I_{kl \mid \xi_1, \dots, \xi_M}}{nR}.$$
Taking into account (7.6.9) and the relationship I_{ξη} = n I_{xy}, we obtain the resultant
inequality
$$P_{er} \ge 1 - \frac{I_{xy}}{R} - \frac{\ln 2}{nR} + \frac{1}{nR}\,\ln\big(1 - e^{-nR}\big), \tag{7.6.15}$$
which defines a lower bound for the probability of decoding error. As nR → ∞,
inequality (7.6.15) turns into the asymptotic formula
$$P_{er} \gtrsim 1 - \frac{I_{xy}}{R}.$$
It makes sense to use the latter formula when R > I_xy (if R < I_xy, then the inequality
becomes trivial). According to the last formula, errorless decoding certainly
does not take place when R > I_xy, so that the boundary I_xy for R is essential.
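A quick numerical look at the converse bound (7.6.15), with illustrative values of I_xy and a rate R > I_xy; as n grows, the bound approaches 1 − I_xy/R:

```python
# Evaluate the converse (Fano-type) lower bound (7.6.15) on the decoding
# error for rates above Ixy.  Ixy and R are illustrative values.
from math import log, exp

def per_lower_bound(Ixy, R, n):
    """1 - Ixy/R - ln2/(nR) + ln(1 - e^{-nR})/(nR), the bound (7.6.15)."""
    return 1.0 - Ixy / R - log(2) / (n * R) + log(1.0 - exp(-n * R)) / (n * R)

Ixy = 0.37                        # nats per letter, e.g. a BSC with p near 0.1
for n in (50, 200, 1000):
    print(n, round(per_lower_bound(Ixy, R=0.5, n=n), 4))
```

The n-dependent correction terms are negative and vanish like 1/n, so the bound increases towards its limit 1 − I_xy/R.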
4. Uniting formulae (7.6.9), (7.6.14) and replacing the factor at Per with ln M, we
will obtain the result.
It can be concluded from the previous theorems in this chapter that one can
increase I and perform encoding and decoding in such a way that I_{ξη}/I → 1
as n → ∞ and, at the same time, Per → 0. Then, apparently, the length of the interval
$$\Big[1 - P_{er} - \frac{\ln 2}{I},\ \frac{I_{\xi\eta}}{I}\Big]$$
tends to zero. This means that with increasing n the following approximations are valid with a
greater degree of accuracy:
$$1 \approx I_{kl}/I \approx I_{k\eta}/I \approx I_{\xi\eta}/I$$
or
$$I/n \approx I_{kl}/n \approx I_{k\eta}/n \approx I_{\xi\eta}/n. \tag{7.6.19}$$
These asymptotic relations generalize equalities (6.1.17) concerned with simple
noise (disturbances). Thus, arbitrary noise can be regarded as asymptotically equiv-
alent to simple noise. Index l of code region Gl is an asymptotically sufficient coor-
dinate (see Section 6.1).
Just as in the case of simple noise (Section 6.1), where the use of Shannon's
amount of information
$$I_{xy} = H_x - H_{x \mid y} \tag{7.6.20}$$
was justified (according to (6.1.17)) by its ability to be reduced to the simpler 'Boltz-
mann' amount of information
$$H_k = -\sum_k P(k) \ln P(k),$$
in the case of arbitrary noise the use of the information amount (7.6.20) is most con-
vincingly justified by the asymptotic equality I_{ξη}/ln M ≈ 1.
Chapter 8
Channel capacity. Important particular cases of channels
This chapter is devoted to the second variational problem, in which we try to find
an extremum of the Shannon’s amount of information with respect to different input
distributions. We assume that the channel, i.e. a conditional distribution on its output
with a fixed input signal, is known. The maximum amount of information between
the input and output signals is called channel capacity. Contrary to the conventional
presentation, from the very beginning we introduce an additional constraint con-
cerning the mean value of some function of input variables, i.e. we consider a con-
ditional variational problem. Results for the case without the constraint are obtained
as a particular case of the provided general results.
Following the presentation style adopted in this book, we introduce potentials, in
terms of which a conditional channel capacity is expressed. We consider more thoroughly
a number of important particular cases of channels, for which it is possible
to derive explicit results. For instance, in the case of Gaussian channels, general
formulae are obtained in matrix form using matrix techniques.
In this chapter, the presentation concerns mainly the case of discrete random
variables x, y. However, many considerations and results can be generalized directly
by changing notation (for example, via substituting P(y | x), P(x) by P(dy | x), P(dx)
and so on).
8.1 Definition of channel capacity

In the previous chapter, it was assumed that not only the noise inside a channel (described by conditional probabilities P(y | x)) is statistically defined, but also the signals
on a channel’s input, which are described by a priori probabilities P(x). That is
why the system characterized by the ensemble of distributions [P(y | x), P(x)] (or,
equivalently, by the joint distribution P(x, y)) was considered as a communication
channel.
Usually, the distribution P(x) is not an inherent part of a real communication
channel as distinct from the conditional distribution P(y | x). Sometimes it makes
© Springer Nature Switzerland AG 2020
R. V. Belavkin et al. (eds.), Theory of Information and its Value,
https://doi.org/10.1007/978-3-030-22833-0_8
sense not to fix the distribution P(x) a priori but just to fix some technically impor-
tant requirements, say of the form
$$a_1 \le \sum_x c(x)\, P(x) \le a_2. \tag{8.1.1}$$
In this case the channel capacity is defined as
$$C = \sup_{P(x)} I_{xy}, \tag{8.1.3}$$
where the maximization is considered over all P(x) compliant with condition (8.1.1)
or (8.1.2).
As a result of the specified maximization, we can find the optimal distribution
P0 (x) for which
C = Ixy (8.1.4)
or at least an ε-optimal P_ε(x) for which
$$0 \le C - I_{xy} < \varepsilon,$$
where ε is arbitrarily small. After that we can consider the system [P(y | x), P0 (x)] or
[P(y | x), Pε (x)] and apply the results of the previous chapter to it. Thus, Theorem 7.1
induces the following statement.
Theorem 8.1. Suppose there is a stationary channel, which is the n-th power of the
channel [P(y | x), c(x)]. Suppose that the amount ln M of transmitted information
grows with n → ∞ according to the law
$$\ln M = \ln\big[e^{nR}\big], \qquad R < C, \tag{8.1.5}$$
where C < ∞ is the capacity of the channel [P(y | x), c(x)]. Then there exists a sequence
of codes such that
Per → 0 as n → ∞.
In order to derive this theorem from Theorem 7.1, evidently, it suffices to select
a distribution P(x) consistent with condition (8.1.1) or (8.1.2) in such a way
that the inequality R < I_xy ≤ C holds. This can be done in view of (8.1.3), (8.1.5).
In a similar way, the other results of the previous chapter obtained for the chan-
nels [P(y|x), P(x)] can be extended to the channels [P(y|x), c(x)]. We shall not dwell
on this any longer.
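In the simplest unconstrained case the maximization in (8.1.3) can be done by brute force. The sketch below (a binary symmetric channel with an illustrative crossover eps) searches over input distributions on a grid; the maximum sits at the uniform input and equals ln 2 − h(eps) nats:

```python
# Brute-force maximization of Ixy over input distributions for a BSC.
from math import log

eps = 0.1

def h(q):
    return 0.0 if q in (0.0, 1.0) else -q * log(q) - (1 - q) * log(1 - q)

def Ixy(px0):
    """Mutual information for the input distribution (px0, 1 - px0)."""
    py0 = px0 * (1 - eps) + (1 - px0) * eps
    return h(py0) - h(eps)            # H(y) - H(y|x), and H(y|x) = h(eps)

best = max((Ixy(k / 1000.0), k / 1000.0) for k in range(1001))
C_exact = log(2) - h(eps)
assert abs(best[0] - C_exact) < 1e-4
assert abs(best[1] - 0.5) < 1e-3      # maximizer is the uniform input
print(round(best[0], 4))
```

For symmetric channels the optimal input is uniform; the general case is the second variational problem treated next.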
According to definition (8.1.3) of the capacity of a noisy channel, its calculation
reduces to solving a certain extremum problem. An analogous situation was en-
countered in Sections 3.2, 3.3 and 3.6, where we considered the capacity of a noiseless
channel. The difference between the two cases is that entropy is maximized in
the first one, whereas the Shannon's amount of information is maximized in the
other. In spite of this difference, the two extremum problems have a lot in common.
In order to distinguish them, the latter problem will be called
the second extremum problem of information theory.
The extremum (8.1.3) is usually achieved at the boundary of the feasible range (8.1.1)
of average costs. Thereby, condition (8.1.1) can be replaced with a one-sided in-
equality of type (8.1.2) or even with the equality
E[c(x)] = a (8.1.6)
in the generalized version (see Section 6.4). Therefore, we compare distinct distri-
butions P(·) on F1 , which satisfy a condition of type (8.1.1), (8.1.2) or (8.1.6).
8.2 Solution of the second variational problem. Relations for channel capacity and potential

1. We use X to denote the space of values which an input variable x can take. For
the extremum distribution P_0(dx) corresponding to capacity (8.1.3), the probability
can be concentrated only in a part of the indicated space. Furthermore, let us denote
by X̃ the minimal subset X̃ ⊂ X for which P_0(X̃) = 1 (i.e. P_0(X − X̃) = 0). We shall
call it an 'active domain'.
When solving the extremum problem we suppose, for convenience, that x is a discrete
variable. Then we can consider the probabilities P(x) of individual points x and
take partial derivatives with respect to them. Otherwise, we would have to introduce
variational derivatives, which is associated with some complications, though not
ones of a fundamental nature.
We try to find a conditional extremum with respect to P(x) of the expression
$$I_{xy} = \sum_{x,y} P(x)\, P(y \mid x) \ln \frac{P(y \mid x)}{\sum_x P(x)\, P(y \mid x)}$$
under the conditions
$$\sum_x c(x)\, P(x) = a, \tag{8.2.1}$$
$$\sum_x P(x) = 1. \tag{8.2.2}$$
We need not impose the non-negativity constraint on the probabilities P(x) for now;
its satisfaction can be checked after the problem has been solved.
Introducing the indefinite Lagrange multipliers −β, 1 + βφ, which will be deter-
mined later from constraints (8.2.1), (8.2.2), we construct the expression
$$K = \sum_{x,y} P(x)\, P(y \mid x) \ln \frac{P(y \mid x)}{\sum_x P(x)\, P(y \mid x)} - \beta \sum_x c(x)\, P(x) + (1 + \beta\varphi) \sum_x P(x). \tag{8.2.3}$$
We will seek its extremum by varying the values P(x), x ∈ X̃, corresponding to the active
domain X̃. Equating the partial derivative of (8.2.3) with respect to P(x) to zero, we obtain the
equation
$$\sum_y P(y \mid x) \ln \frac{P(y \mid x)}{P(y)} - \beta c(x) + \beta\varphi = 0 \quad\text{for } x \in \widetilde X, \tag{8.2.4}$$
where
$$P(y) = \sum_x P(x)\, P(y \mid x). \tag{8.2.5}$$
Multiplying (8.2.4) by P(x) and summing over x with account of (8.2.1), (8.2.2),
we have Ixy = β(a − φ), i.e.
$$C = \beta(a - \varphi). \tag{8.2.6}$$
This relation allows us to exclude φ from equation (8.2.4) and thereby rewrite it in
the form
$$\sum_y P(y \mid x) \ln \frac{P(y \mid x)}{P(y)} = C - \beta a + \beta c(x) \quad\text{for } x \in \widetilde X. \tag{8.2.7}$$
Formulae (8.2.7), (8.2.1), (8.2.2) constitute the system of equations that serves
for a joint determination of the variables C, β, P(x) (x ∈ X̃), if the region X̃ is already
selected. For a proper selection of this region, solving the specified equations will
yield positive probabilities P(x), x ∈ X̃.
We can multiply the main equation (8.2.4) or (8.2.7) by P(x) and thus rewrite it
in the form
$$\sum_y P(x)\, P(y \mid x) \ln \frac{P(y \mid x)}{P(y)} = \big[C - \beta a + \beta c(x)\big]\, P(x), \tag{8.2.8}$$
which is convenient because it is valid for all values x ∈ X and not only for x ∈ X̃.
Equation (8.2.7) itself does not necessarily hold true beyond the region X̃.
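The extremum condition can be verified directly in the unconstrained case β = 0, where (8.2.7) reduces to requiring Σ_y P(y|x) ln[P(y|x)/P(y)] = C for every active x. A minimal check for a binary symmetric channel with uniform input (an illustrative symmetric example of mine):

```python
# Check the extremum condition: at the capacity-achieving input of a BSC,
# the divergence sum_y P(y|x) ln[P(y|x)/P(y)] is the same for both x
# and equals the capacity C = ln 2 - h(p) in nats.
from math import log

p = 0.2
Pyx = {(y, x): (1 - p) if y == x else p for x in (0, 1) for y in (0, 1)}
Px = {0: 0.5, 1: 0.5}
Py = {y: sum(Pyx[y, x] * Px[x] for x in (0, 1)) for y in (0, 1)}

div = {x: sum(Pyx[y, x] * log(Pyx[y, x] / Py[y]) for y in (0, 1))
       for x in (0, 1)}
C = log(2) + (1 - p) * log(1 - p) + p * log(p)   # ln 2 - h(p)

assert abs(div[0] - div[1]) < 1e-12
assert abs(div[0] - C) < 1e-12
print(round(C, 4))
```

Equality of the divergences across active inputs is precisely the content of (8.2.7) with β = 0.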
It is not difficult to write down a generalization of the provided equations for the
case when the random variables are not discrete. Instead of (8.2.7), (8.2.5) we will have
$$\int P(dy \mid x) \ln \frac{P(dy \mid x)}{P(dy)} = C - \beta a + \beta c(x), \quad x \in \widetilde X, \tag{8.2.9}$$
$$P(dy) = \int P(dy \mid x)\, P(dx).$$
In this case, the desired distribution turns out to be P(dx) = p(x) dx, x ∈ X̃.
2. Coming back to the discrete version, we prove the following statement.
Theorem 8.2. A solution to equations (8.2.7), (8.2.1), (8.2.2) corresponds to the
maximum of the information I_xy with respect to variations of the distribution P(x) that leave
the active domain X̃ invariant.

where β, φ do not vary. Indeed,
$$-\sum_{x,x'} \sum_y \frac{P(y \mid x)\, P(y \mid x')}{P(y)}\, f(x)\, f(x') = -\sum_y \frac{1}{P(y)} \Big[\sum_x P(y \mid x)\, f(x)\Big]^2 \le 0$$
(since P(y) ≥ 0), so this matrix is negative semi-definite and K is a concave function
of P(x), x ∈ X̃.
C = −d ϕ (T )/dT. (8.2.11)
Proof. We will vary the parameter β = 1/T or, equivalently, the parameter a inside con-
straint (8.2.1). This variation is accompanied by variations of the parameter φ and of the dis-
tribution P(x). We take an arbitrary point x of the active domain X̃. If the variation
da is not too large, then the condition P(x) + dP(x) > 0 will remain valid af-
ter the variation, i.e. x will belong to the varied active domain. An equality of
type (8.2.4) holds true at the point x before and after the variation. We differentiate it and
obtain the equation for variations
$$-\sum_{y \in Y} P(y \mid x)\, \frac{dP(y)}{P(y)} - \big[c(x) - \varphi\big]\, d\beta + \beta\, d\varphi = 0. \tag{8.2.12}$$
Here the summation is carried out over the region Y where P(y) > 0. Further, we
consecutively multiply the latter equality by P(x) and sum over x. Taking into ac-
count (8.2.5), (8.2.1) and keeping the normalization constraint
$$\sum_y dP(y) = d \sum_y P(y) = d1 = 0,$$
we obtain
$$C = \beta^2 \frac{d\varphi}{d\beta}, \tag{8.2.13}$$
$$-T \frac{d\varphi}{dT}(T) + \varphi(T) = a, \quad\text{i.e.}\quad R = a, \tag{8.2.15}$$
serving to determine the quantity T.
It is convenient to consider formula (8.2.14) as the Legendre transform of the func-
tion φ(T):
$$R(S) = TS + \varphi(T(S)), \qquad S = -\frac{\partial\varphi}{\partial T}.$$
Then, according to (8.2.15), the capacity C will be a root of the equation
$$R(C) = a. \tag{8.2.16}$$
$$\frac{d\Gamma}{d\beta}(\beta) = -a. \tag{8.2.18}$$
$$\frac{d^2 C(a)}{da^2} \le 0. \tag{8.2.21}$$
Proof. Let the variation da correspond to variations dP(x) and dP(y) = Σ_x P(y | x) dP(x).
We multiply (8.2.12) by dP(x) and sum it over x ∈ X̃, taking into account that
$$\sum_x dP(x) = 0, \qquad \sum_x c(x)\, dP(x) = da.$$
This gives
$$\sum_{y \in Y} \frac{[dP(y)]^2}{P(y)} + da\, d\beta = 0. \tag{8.2.22}$$
Apparently, the first term herein cannot be negative. That is why da dβ ≤ 0. Divid-
ing this inequality by the positive quantity (da)², we have
$$d\beta/da \le 0. \tag{8.2.23}$$
8.2 Solution of the second variational problem. Relations for channel capacity and potential 257
The desired inequality (8.2.21) follows from here if we take into account that
β = dC/da according to (8.2.20). The proof is complete.
As is seen from (8.2.22), the equality sign in (8.2.21), (8.2.23) relates to the case
when all dP(y)/da = 0 within region Y .
A typical behaviour of the curve C(a) is represented in Figure 8.1. In consequence
of relation (8.2.20), which can be written as dC/da = β, the maximum point of
the function C(a) corresponds to the value β = 0. For this particular value of β, equa-
tion (8.2.7) takes the form
$$\sum_{y \in Y} P(y \mid x) \ln \frac{P(y \mid x)}{P(y)} = C. \tag{8.2.24}$$
Here c(x) and a are completely absent. Equation (8.2.24) corresponds to the channel
P(y | x) defined without accounting for conditions (8.1.1), (8.1.2), (8.1.6). Indeed,
solving the variational problem with no account of condition (8.1.6) leads exactly
to equation (8.2.24), which needs to be complemented with constraint (8.2.2). We
denote this maximum value of C by C_max.
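For an arbitrary discrete channel the unconstrained maximum C_max can be computed iteratively. Below is a minimal sketch of the standard Blahut–Arimoto scheme, which is not described in the text; the channel matrix is an illustrative asymmetric example of mine, and the result is in nats:

```python
# Minimal Blahut-Arimoto iteration for the capacity of a discrete channel
# given by rows W[x][y] = P(y|x); the channel matrix is illustrative.
from math import log, exp

W = [[0.8, 0.1, 0.1],       # P(y|x) for x = 0
     [0.05, 0.15, 0.8]]     # P(y|x) for x = 1

def blahut_arimoto(W, iters=500):
    nx, ny = len(W), len(W[0])
    p = [1.0 / nx] * nx                    # start from the uniform input
    for _ in range(iters):
        q = [sum(p[x] * W[x][y] for x in range(nx)) for y in range(ny)]
        # D(x) = sum_y P(y|x) ln [P(y|x)/P(y)], then reweight p by exp(D)
        D = [sum(w * log(w / qy) for w, qy in zip(W[x], q) if w > 0)
             for x in range(nx)]
        p = [p[x] * exp(D[x]) for x in range(nx)]
        s = sum(p)
        p = [v / s for v in p]
    q = [sum(p[x] * W[x][y] for x in range(nx)) for y in range(ny)]
    C = sum(p[x] * sum(w * log(w / qy) for w, qy in zip(W[x], q) if w > 0)
            for x in range(nx))
    return C, p

C, p_opt = blahut_arimoto(W)
assert 0.0 < C < log(2)
print(round(C, 4))
```

Each iteration reweights P(x) by the exponential of the divergence D(x); the fixed point satisfies exactly the condition (8.2.24) on the active domain.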
Now we discuss corollaries of Theorem 8.4. The function a = R(C) is the inverse
function of C(a). Therefore, it is defined only for C ≤ C_max and is two-
valued (at least on some interval adjacent to C_max) if the sign < takes place in (8.2.21)
for β = 0. It holds for one branch that
$$\frac{dR(C)}{dC} > 0, \quad\text{i.e.}\quad T > 0 \text{ or } \beta > 0.$$
We call this branch normal. For the other, anomalous branch we have
$$\frac{dR(C)}{dC} < 0, \qquad \beta < 0.$$
It is easy to comprehend that the normal branch is convex:
$$\frac{d^2 R(C)}{dC^2} \equiv \frac{dT}{dC} > 0 \qquad (C < C_{\max}).$$
In its turn, the anomalous branch is concave:
$$\frac{d^2 R(C)}{dC^2} \equiv \frac{dT}{dC} < 0 \qquad (C < C_{\max}),$$
as follows from the concavity of the function C(a).
If we consider the function φ(T), which is the Legendre transform of R(C):
$$\varphi(T) = -CT + R(C(T)), \qquad T = \frac{dR}{dC}$$
[see (8.2.15)], then its normal and anomalous branches will be convex and concave,
respectively, since
$$\frac{d\varphi}{dT} = -C, \qquad \frac{d^2\varphi}{dT^2} = -\frac{dC}{dT}.$$
1. If
$$a_1 \le R(C_{\max}) \le a_2,$$
then, evidently, the maximum value of channel capacity
$$C = C_{\max}$$
is feasible. In this case fixing the constraint (8.1.1) does not result in a decrease of
the channel capacity.
2. If
$$a_2 \le R(C_{\max}),$$
then the interval [a_1, a_2] is related to the normal branch of the dependence be-
tween C and a. Here the function C(a) is non-decreasing and, consequently, the ca-
pacity equals
$$C = C(a_2).$$
3. If
$$R(C_{\max}) < a_1,$$
then we need to consider the anomalous branch. Since the function C(a) is non-
increasing for a > R(C_max), the maximum value of C(a) is attained at a = a_1, i.e.
$$C = C(a_1).$$
8.3 The type of optimal distribution and the partition function

The presentation in this section will be less general than the results of the previous
section. It is restricted by the existence condition of the inverse linear transformation
L^{-1} described below.
As is seen from the form of equations (8.2.8), (8.2.9), their solution can be easily
written if we can find the transformation L^{-1} inverse to the transformation
$$L f = \sum_y P(y \mid x)\, f(y) \quad\text{or}\quad L f = \int p(y \mid x)\, f(y)\, dy. \tag{8.3.2}$$
The inverse transformation (8.3.1) can be expressed with the help of the kernel
Q(y, dx) = q(y, x) dx as follows:
$$L^{-1} g = \int Q(y, dx)\, g(x)$$
or
$$L^{-1} g = \sum_x Q(y, x)\, g(x), \qquad L^{-1} g = \int q(y, x)\, g(x)\, dx. \tag{8.3.3}$$
For simplicity, let us stop at the discrete version. Then ‖Q(y, x)‖ = ‖P(y | x)‖⁻¹ will be a matrix, which is the inverse of the matrix ‖P(y | x)‖, y ∈ Y, x ∈ X.
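This matrix inversion, and the equality ∑_x Q(y, x) = 1 used later in this section, can be checked numerically. A minimal sketch (the 2×2 transition probabilities are illustrative, not from the text): since each row of ‖P(y | x)‖ sums to one, the rows of its inverse must also sum to one.

```python
# Invert a 2x2 row-stochastic channel matrix P(y|x) by hand and verify
# that the rows of Q = P^{-1} also sum to one (P 1 = 1 implies P^{-1} 1 = 1).

P = [[0.9, 0.1],   # P(y | x = 1); illustrative values
     [0.2, 0.8]]   # P(y | x = 2)

det = P[0][0] * P[1][1] - P[0][1] * P[1][0]
Q = [[ P[1][1] / det, -P[0][1] / det],
     [-P[1][0] / det,  P[0][0] / det]]

row_sums = [sum(row) for row in Q]   # both should equal 1
```

The same property holds in any dimension, which is what justifies treating g(x) = 1 and f(y) = 1 as corresponding functions under L and L⁻¹.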
With the help of the same (transposed) matrix we can also write down the trans-
formation
G(x) = F(y)L⁻¹ = ∑_y F(y)Q(y, x), (8.3.4)
which is an inverse of
or
∑_y P(y | x)[ln P(y) + C − βa] = −βc(x) − H_y(| x),   x ∈ X, (8.3.5)
i.e.

P(y) = exp{βa − C − ∑_{x∈X} Q(y, x)[βc(x) + H_y(| x)]}. (8.3.6)

The left-hand side can be represented in the form ∑_{x∈X} P(x)P(y | x). Consequently,

P(x)L = exp{βa − C − ∑_{x′} Q(y, x′)[βc(x′) + H_y(| x′)]},

i.e.

P(x) = exp{βa − C − ∑_{x′} Q(y, x′)[βc(x′) + H_y(| x′)]} L⁻¹. (8.3.7)
The unknown parameters C, β are determined with the help of (8.2.1), (8.2.2). Summing (8.3.7) over x and taking into account that the left-hand side turns into one, we obtain the equation

∑_y exp{−∑_{x′} Q(y, x′)[βc(x′) + H_y(| x′)]} = e^{C−βa}, (8.3.8)

which is equivalent to (8.2.2). Note that we have used the constraint ∑_x Q(y, x) = 1, because due to (8.3.2), (8.3.3) g(x) = 1 corresponds to the function f(y) = 1 and vice versa. Moreover, substituting (8.3.7) into (8.2.1), we have

∑_y ∑_{x∈X} exp{βa − C − ∑_{x′} Q(y, x′)[βc(x′) + H_y(| x′)]} Q(y, x)c(x) = a. (8.3.9)
Then expression (8.3.8) takes the form of the usual partition function defined by formulae (3.3.11), (3.6.4) earlier. Here b(y) and ν(y) are analogs of energy and 'degree of degeneracy', respectively. Equation (8.3.11) can be rewritten as follows:

a = ∑_y Z⁻¹ e^{−βb(y)} b(y)ν(y) = −d ln Z/dβ. (8.3.12)
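Relation (8.3.12) can be checked numerically. A small sketch (the 'energies' b(y) and 'degeneracies' ν(y) are arbitrary illustrative numbers): the ensemble average of b must agree with the finite-difference derivative −Δ ln Z/Δβ.

```python
import math

b  = [0.5, 1.0, 2.0]   # illustrative 'energies' b(y)
nu = [1.0, 2.0, 1.0]   # illustrative 'degrees of degeneracy' nu(y)

def log_z(beta):
    # ln Z(beta) = ln sum_y nu(y) exp(-beta b(y))
    return math.log(sum(n * math.exp(-beta * e) for e, n in zip(b, nu)))

beta = 1.3
Z = math.exp(log_z(beta))
# Ensemble average of b, as in (8.3.12)
a = sum(e * n * math.exp(-beta * e) for e, n in zip(b, nu)) / Z

# Central finite difference of -d ln Z / d beta
h = 1e-6
a_fd = -(log_z(beta + h) - log_z(beta - h)) / (2 * h)
```

The agreement is the usual statistical-mechanics identity: differentiating the log-partition function produces the mean energy.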
Γ(β₀) − β₀ (dΓ/dβ)(β₀) = 0,   a₀ = −(dΓ/dβ)(β₀),   a = −(dΓ/dβ)(β). (8.3.14)

That is,

β₀ = (dC/da)(a₀),   β = (dC/da)(a).
Proof. For H_y(| x) = H_{y|x} it follows from (8.3.10), (8.3.11), in consequence of the equality ∑_x Q(y, x) = 1 (already mentioned before), that

Γ(β) = −H_{y|x} + Γ₀(β), (8.3.15)

where

Γ₀(β) = ln ∑_y e^{−βb(y)}.

Its Legendre transform

S₀(a) = Γ₀(β) + βa,   a = −dΓ₀(β)/dβ, (8.3.16)

is a concave function. If we take into account (8.3.15), we find from (8.2.18), (8.2.19) that

C + H_{y|x} = Γ₀ − β dΓ₀/dβ,   dΓ₀/dβ = −a, (8.3.17)
and according to (8.3.16)
C + Hy|x = S0 (a). (8.3.18)
In consequence of the indicated concavity, the ratio [S₀(a) − S₀(a₀)]/(a − a₀) belongs to the interval between (dS₀/da)(a) = β and (dS₀/da)(a₀) = β₀. Selecting a₀ from the condition S₀(a₀) = H_{y|x}, we obtain the statement of the theorem due to (8.3.18), (8.2.19). The proof is complete.
We point out one corollary of the given theorem. If a ≥ a₀ (and thereby β ≤ β₀), then we obtain

β(a − a₀) ≤ C ≤ β₀(a − a₀). (8.3.19)

In particular,

C ≤ β₀ a (8.3.20)

if also β₀a₀ > 0.
H ≡ −∑_j p_j ln p_j.
8.4 Symmetric channels
It is not difficult to understand that (8.4.2) coincides with H_{y|x}. Therefore, formula (8.3.5) can be represented as

∑_y P(y | x) ln[∑_x P(x)P(y | x)] + C − βa + H_{y|x} = −βc(x)   (x = x₁, …, x_r),

or, equivalently,

∑_y P(y | x) ln[∑_x P(x)P(y | x)] + C + H_{x|y} = 0   (x = x₁, …, x_r).
Ixy = Hy − H.
∑_x P(y | x) = ∑_j p_j
C = ln r − H. (8.4.7)
Since

r = ∑_{j=1}^{r} p_j · (1/p_j) = E[1/p_j], (8.4.8)

where

r = ∫ [Q(dy)/P(dy)] P(dy) = ∫ Q(dy).

Instead of (8.4.9) we have

C = ln E[Q(dy)/P(dy)] − E[ln(Q(dy)/P(dy))].
8.5 Binary channels

According to (8.4.5), (8.4.6), the capacity of such a channel is attained at the uniform distributions

P(x) = 1/2,   P(y) = 1/2. (8.5.2)

Due to (8.4.7) it is equal to

C = ln 2 − h₂(p), (8.5.3)

where h₂(p) = −p ln p − (1 − p) ln(1 − p).
α = −sR − ln[p^{1−s} + (1 − p)^{1−s}] + s ln 2
  = −(1 − s) [p^{1−s} ln p + (1 − p)^{1−s} ln(1 − p)] / [p^{1−s} + (1 − p)^{1−s}] − ln[p^{1−s} + (1 − p)^{1−s}] + ln 2 − R. (8.5.7)
α = ρ ln ρ + (1 − ρ ) ln(1 − ρ ) + ln 2 − R = ln 2 − h2 (ρ ) − R. (8.5.8)
ρ ln p + (1 − ρ ) ln(1 − p) = R − ln 2. (8.5.9)
Accounting for (8.5.9), (8.5.3) it is easy to see that the condition R < C means the
inequality
ρ ln p + (1 − ρ ) ln(1 − p) < p ln p + (1 − p) ln(1 − p)
or

(ρ − p) ln[p/(1 − p)] < 0,

i.e. ρ > p, since p/(1 − p) < 1 (because p < 1/2). As is seen from (8.5.9), the value

ρ = ln[4p(1 − p)] / ln[(1 − p)/p]
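Equation (8.5.9) determines ρ from the rate R. A numerical sketch (p and R are illustrative values): solve ρ ln p + (1 − ρ) ln(1 − p) = R − ln 2 by bisection and check that ρ > p whenever R < C = ln 2 − h₂(p), as argued above.

```python
import math

def h2(q):
    return -q * math.log(q) - (1 - q) * math.log(1 - q)

p = 0.1                       # crossover probability, p < 1/2; illustrative
C = math.log(2) - h2(p)       # capacity (8.5.3)
R = 0.5 * C                   # some rate below capacity; illustrative

def f(rho):
    # f(rho) = 0 is equation (8.5.9); f is decreasing in rho since ln p < ln(1-p)
    return rho * math.log(p) + (1 - rho) * math.log(1 - p) - (R - math.log(2))

# f(p) = C - R > 0 and f(rho) -> ln(2p) - R < 0 as rho -> 1, so bisect on [p, 1)
lo, hi = p, 1.0 - 1e-9
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if f(mid) > 0:
        lo = mid
    else:
        hi = mid
rho = 0.5 * (lo + hi)
```

For p = 0.1 and R = C/2 this gives ρ slightly above p, consistent with the inequality derived in the text.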
H_y(| 1) = h₂(α),   H_y(| 2) = h₂(α′).

That is why

L⁻¹[βc(x) + H_y(| x)] = (1/D) ( (1 − α′)[βc₁ + h₂(α)] − α[βc₂ + h₂(α′)],   −α′[βc₁ + h₂(α)] + (1 − α)[βc₂ + h₂(α′)] ),

where D = 1 − α − α′ and

ξ₁ = exp{−(1/D)[βc₁ + h₂(α)]},   ξ₂ = exp{−(1/D)[βc₂ + h₂(α′)]}.

Then

a = −(α′c₁ + αc₂)/D + (1/D)(c₁ξ₁ + c₂ξ₂)/(ξ₁ + ξ₂), (8.5.12)

C = [α′h₂(α) + αh₂(α′)]/D + ln(ξ₁ + ξ₂) + (β/D)(c₁ξ₁ + c₂ξ₂)/(ξ₁ + ξ₂). (8.5.13)

In particular, for β = 0

C = [α′h₂(α) + αh₂(α′)]/(1 − α − α′) + ln[e^{−h₂(α)/(1−α−α′)} + e^{−h₂(α′)/(1−α−α′)}].

In the other particular case, when α = α′, c₁ = c₂ (symmetric channel), it follows from (8.5.13) that

C = 2αh₂(α)/(1 − 2α) + ln 2 − [βc₁ + h₂(α)]/(1 − 2α) + βc₁/(1 − 2α) = ln 2 − h₂(α),

which naturally coincides with (8.5.3).
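The symmetric-channel result can be checked independently of the closed formulas. A sketch: for the binary symmetric channel with crossover probability α, compare ln 2 − h₂(α) against a direct grid maximization of the mutual information over the input distribution (values illustrative).

```python
import math

def h2(q):
    return -q * math.log(q) - (1 - q) * math.log(1 - q)

alpha = 0.11                      # crossover probability; illustrative

def mutual_info(px1, a):
    # I(x; y) for the binary symmetric channel with input P(x=1) = px1
    P = [[1 - a, a], [a, 1 - a]]
    Px = [px1, 1 - px1]
    Py = [sum(Px[x] * P[x][y] for x in range(2)) for y in range(2)]
    return sum(Px[x] * P[x][y] * math.log(P[x][y] / Py[y])
               for x in range(2) for y in range(2))

# Grid search over the input distribution; the maximum is at px1 = 1/2
best = max(mutual_info(q, alpha) for q in [i / 1000 for i in range(1, 1000)])
C = math.log(2) - h2(alpha)       # closed form (8.5.3)
```

The grid contains the optimal point q = 1/2 exactly, so the two numbers agree to machine precision.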
c(x) = c^{(0)} + ∑_k c_k^{(1)} x_k + (1/2) ∑_{k,l} c_{kl} x_k x_l. (8.6.3)

Obviously, we can choose the origin in the spaces X and Y (making the substitution x_k + ∑_l (c⁻¹)_{kl} c_l^{(1)} → x_k, y_i − m_i → y_i) so that the terms in (8.6.3) and in the exponent (8.6.1), which are linear in x, y, vanish. Furthermore, the constant summand c^{(0)} in (8.6.3) is negligible and can be omitted. Therefore, without loss of generality, we can keep only the bilinear terms in (8.6.1), (8.6.3). Using matrix notation, let us write (8.6.1), (8.6.3) as follows:
p(y | x) = det^{1/2}(A/2π) exp{−(1/2)(y^T − x^T d^T)A(y − dx)}, (8.6.4)

c(x) = (1/2) x^T c x. (8.6.5)
Here we imply a matrix product of two adjacent matrices. The character T denotes transposition; x, y are column matrices and x^T, y^T are row matrices, correspondingly:

x^T = (x₁, …, x_r),   y^T = (y₁, …, y_s).

Certainly, matrix A is a non-singular positive definite matrix, the inverse of the correlation matrix K = A⁻¹. Matrix c is likewise assumed to be non-singular and positive definite.
As is seen from (8.6.1), actions of disturbances in a channel are reduced to the
addition
y = dx + z (8.6.6)
of noises zT = (z1 , . . . , zs ) having a Gaussian distribution with zero mean vector and
correlation matrix K:
E[z] = 0, E[zzT ] = K. (8.6.7)
(a matrix representation is applied in the latter formula as well).
2. Let us turn to the computation of capacity C and the probability densities p(x), p(y) for the channel in consideration. To this end, consider equation (8.2.9), which (as was mentioned in Section 8.2) is valid in the subspace X̃ where the probabilities p(x)Δx are non-zero. In our case, X̃ will be a Euclidean subspace of the original r-dimensional Euclidean space X. In that subspace, of course, matrix c (a scalar product can be defined with its help) will also be non-singular and positive definite.
We shall seek the probability density function p(x) in the Gaussian form:
p(x) = (2π)^{−r/2} det^{−1/2} K_x exp{−(1/2) x^T K_x⁻¹ x},   x ∈ X (8.6.8)

(so that E[x] = 0, E[xx^T] = K_x). (8.6.9)
Taking into account (8.6.6), it follows from Gaussian nature of random variables
x and z that y are Gaussian random variables as well. Therefore, averaging out (8.6.6)
and accounting for (8.6.7), (8.6.9), it is easy to find their mean value and correlation
matrix
E[y] = 0, E[yyT ] = Ky = K + dKx d T . (8.6.10)
Therefore,

p(y) = det^{−1/2}(2πK_y) exp{−(1/2) y^T K_y⁻¹ y}
     = det^{−1/2}[2π(K + dK_x d^T)] exp{−(1/2) y^T (K + dK_x d^T)⁻¹ y}. (8.6.11)
−(1/2) ln[det(A) det(K_y)] + (1/2) E[(y^T − x^T d^T)A(y − dx) − y^T K_y⁻¹ y | x] = βa − (β/2) x^T c x − C,   x ∈ X. (8.6.12)

When taking the conditional expectation we account for

x^T d^T K_y⁻¹ dx = β x^T c x,   x ∈ X. (8.6.15)
From this moment on we treat the operators c, d^T K_y⁻¹ d and others as operators acting on vectors x from the subspace X and transforming them into vectors from the same subspace (that is, these operators are understood as the corresponding projections of the initial operators). Then, due to the freedom of selection of x from X, equality (8.6.15) yields

d^T K_y⁻¹ d = βc. (8.6.16)
Substituting the equality

is non-singular as well. It is not difficult to conclude from here that each of the operators (1_x + ÃK_x)⁻¹, Ã is non-singular. Indeed, the determinant of the matrix product (1_x + ÃK_x)⁻¹ Ã, which is equal to the product of the determinants of the respective matrices, could not be different from zero if at least one factor-determinant were equal to zero. Non-singularity of A follows from the inequality det A ≠ 0. Thus, there exists an inverse operator Ã⁻¹ = (d^T A d)⁻¹.
a = (1/2) E[x^T c x] = (1/2) tr(cK_x) (8.6.24)

or, if we substitute (8.6.22) hereto,

a = (1/2) tr[(1/β) 1_x − cÃ⁻¹] = rT/2 − (1/2) tr(cÃ⁻¹). (8.6.25)
Here we have taken into account that the trace of the identity x-operator is equal to the dimension r of space X̃, i.e. 'the number of active degrees of freedom' of random variable x. The corresponding 'thermal capacity' equals

da/dT = r/2 (8.6.26)

(if r does not change under the variation dT). Thus, according to the laws of classical statistical thermodynamics, there is average energy T/2 per every degree of freedom.
In order to determine the capacity, we can use formulae (8.6.14), (8.6.25) [when applying formulae (8.6.16), (8.6.21)] or the regular formulae for the information of communication between Gaussian variables, which yield

C = (1/2) ln det(K_y A) = (1/2) tr ln(K_y A).
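The equality of the two expressions rests on the identity ln det M = tr ln M. A numerical sketch for a 2×2 symmetric positive definite matrix (entries illustrative), computing tr ln M from the spectrum:

```python
import math

# A symmetric positive definite example standing in for Ky A
M = [[2.0, 0.3],
     [0.3, 1.0]]

det_M = M[0][0] * M[1][1] - M[0][1] * M[1][0]

# Eigenvalues of a symmetric 2x2 matrix
tr = M[0][0] + M[1][1]
disc = math.sqrt((M[0][0] - M[1][1]) ** 2 + 4 * M[0][1] ** 2)
lam1, lam2 = (tr + disc) / 2, (tr - disc) / 2

log_det = math.log(det_M)                  # ln det M
tr_log = math.log(lam1) + math.log(lam2)   # tr ln M via the spectrum
```

Since the determinant is the product of eigenvalues, the two quantities coincide, which is what allows the capacity to be written either way.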
Further, we substitute (8.6.23) hereto. It is easy to prove (moving d from left to right with the help of formula (8.6.41)) that

C = (1/2) tr ln[1 − d(d^T A d)⁻¹ d^T A + T dc⁻¹ d^T A] = (1/2) tr ln(T Ãc⁻¹)   (Ã = d^T A d).
2
Otherwise,
C=
1 −1 ) + ln(T ) tr 1x
tr ln(Ac
2 2
1 −1
r
= tr ln(Ac ) + ln T. (8.6.27)
2 2
The provided logarithmic dependence of C from temperature T can be already
derived from formula (8.6.25) if we take into account the general thermodynamic
relationship (8.2.20). In the given case it takes the form
dC = da/T =
rdT /2T
These functions, as was pointed out in Section 8.2, simplify the use of conditions (8.1.1), (8.1.2). Quantity C increases with a growth of a; according to the terminology of Section 8.2, the dependence between C and a is normal in the given case. If the condition is of type (8.1.1), then the channel capacity is determined from (8.6.29) by the substitution a = a₂.
4. Examples. The simplest particular case is when matrix Ãc⁻¹ is a multiple of the identity matrix:

Ãc⁻¹ = (1/2N) 1_x. (8.6.30)

This takes place in the case when, say,

c = 2 · 1_x,   K = N · 1_y,   d_il = 1 for l = i = 1, …, r and d_il = 0 for l > r, (8.6.31)

since

tr(cÃ⁻¹) = 2N tr 1_x = 2Nr,   tr ln(Ãc⁻¹) = −r ln(2N).

According to formulae (8.6.22), (8.6.25) the correlation matrix at the input is represented as

K_x = (T/2 − N) 1_x = (a/r) 1_x.
Next, we consider a somewhat more difficult example. We suppose that spaces X, Y coincide with each other, matrices (1/2)c, d are identity matrices and matrix K is diagonal but not a multiple of the identity matrix:

K = ‖N_i δ_ij‖.

Then

(Ãc⁻¹)_ij = δ_ij/(2N_i),   i ∈ L, j ∈ L.

Let us give a number of other relations with the help of the introduced set X̃. For the example in consideration, formula (8.6.22) takes the form

(K_x)_ij = (T/2 − N_i) δ_ij. (8.6.34)
These formulae solve the problem completely if we know the set L of indices i corresponding to non-zero components of vectors x from X̃. Let us demonstrate what considerations define that set.
In the case of Gaussian distributions, probability densities p(x), p(y), of course, cannot be negative, and therefore subspace X̃ is not determined from the positiveness condition of probabilities (see Section 8.2). However, the condition of positive definiteness of matrix K_x [which is given by formula (8.6.22)] may be violated, and it must be verified. Besides, the condition of non-degeneracy of operator (8.6.20) must also be satisfied. In the given example, we do not need to care much about the latter condition because d = 1. But the condition of positive definiteness of matrix K_x, i.e. the constraint

N_i < T/2, (8.6.36)

due to (8.6.34), is quite essential. For each fixed T the set of indices L is determined from constraint (8.6.36). Hence, we can replace i ∈ L with N_i < T/2 under the summation signs in formulae (8.6.35).
The derived relations can be conveniently tracked on Figure 8.2. We draw indices
i and variances of disturbances Ni (stepwise line) on abscissa and ordinate axes,
respectively. A horizontal line corresponds to a fixed temperature T /2. Below it
(between the horizontal line and the stepwise line) there are intervals (Kx )ii located
according to (8.6.34). Intersection points of the two specified lines define boundaries
of set L. The area sitting between the horizontal and the stepwise lines is equal to
the total useful energy a. In turn, the shaded area located between the stepwise line
and the abscissa axis is equal to the total energy of disturbances ∑i∈L Ni .
Space X̃ (as a function of T) is determined by analogous methods for more complicated cases.
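The geometric construction just described amounts to the familiar 'water-filling' recipe. A sketch for the example with (1/2)c = d = 1, so that by (8.6.34)–(8.6.36) the set L = {i : N_i < T/2} and (K_x)_ii = T/2 − N_i: given a total useful energy a, find the level T/2 by bisection and evaluate C = (1/2) ∑_{i∈L} ln(T/2N_i). The noise levels are illustrative.

```python
import math

N = [0.2, 0.5, 1.0, 3.0]        # noise variances N_i; illustrative
a_target = 1.5                  # total useful energy a

def energy(h):
    # water above the noise profile at level h = T/2
    return sum(h - n for n in N if n < h)

# energy(h) is increasing, so bisect for the level h with energy(h) = a_target
lo, hi = min(N), min(N) + a_target + max(N)
for _ in range(100):
    h = 0.5 * (lo + hi)
    if energy(h) < a_target:
        lo = h
    else:
        hi = h

L = [n for n in N if n < h]                 # active components
C = 0.5 * sum(math.log(h / n) for n in L)   # capacity in nats
a_check = energy(h)
```

With these numbers the level settles at h = 16/15, the component with N_i = 3.0 stays unused, and the filled area between the level line and the noise profile equals a, exactly as in Figure 8.2.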
Fig. 8.2 Determination of channel capacity in the case of independent (spectral) components of a useful signal and additive noise with identically distributed components

5. Now we compute the thermodynamic potential (7.4.3), defining estimation (7.4.2) of the probability of decoding error, for Gaussian channels. The information

I(x, y) = ln[p(y | x)/p(y)] = (1/2) ln det(K_y A) − (1/2)[(y^T − x^T d^T)A(y − dx) − y^T K_y⁻¹ y]

has already appeared in formula (8.6.12). Substituting it into the equality

e^{μ(s)} = ∫ e^{−sI(x,y)} p(x) p(y | x) dx dy,

we see that the latter integral can be easily calculated with the help of the well-known matrix formula (5.4.19). It turns out to be equal to
8.6 Gaussian channels
where

B = [ K_x⁻¹ + (1 − s)d^T A d    −(1 − s)d^T A ;   −(1 − s)Ad    (1 − s)A + sK_y⁻¹ ]. (8.6.38)

In order to compute det B we apply the formula

det[ b  c ; c^T  d ] = det d · det(b − c d⁻¹ c^T), (8.6.39)

which follows from formula (A.2.4) of the Appendix. This application yields
Taking the logarithm of the last expression and accounting for formula (6.5.4), we find

μ(s) = −(s/2) tr ln(K_y A) − (1/2) tr ln(1 − s + sKK_y⁻¹) − (1/2) tr ln[1 + (1 − s)sK_x d^T K_y⁻¹ (1 − s + sKK_y⁻¹)⁻¹ d]. (8.6.40)
2
We have simplified the latter term here by taking into account that
s(1 − β cA−1 )
μ (s) = −C + tr .
−1
1 − s2 + s2 β cA
1 − β cA−1 1
a = s2 tr −1 )
+ tr ln(1 − s2 + s2 β cA
1 − s + s β cA
2 2 −1 2
−1 )
s(1 − β cA
tr = C − R.
1 − s2 + s2 β cA
Using the equality
1 − β cA−1 1
s2 = −1
1 − s + s β cA
2 2 −1 −1
1 − s + s2 β cA
2
1 (C − R)2
α= +··· . (8.6.48)
−1
r − β tr cA
2
1. Stationary channels are invariant under a translation with respect to the index (time):

It is convenient to reduce matrices (8.7.1) to a diagonal form via the unitary transformation

x̄ = U⁺x, (8.7.3)

where

U = ‖U_jl‖,   U_jl = (1/√m) e^{2πijl/m},   j, l = 1, …, m (8.7.4)
[refer to (5.5.8)]. Its unitarity can be easily verified. As was shown in Section 5.4, the Hermitian conjugate operator is

U⁺ = ‖U⁺_lk‖ = ‖(1/√m) e^{−2πikl/m}‖. (8.7.5)

(U⁺cU)_jk = (1/m) ∑_{l,l′} e^{−2πi(jl′−kl)/m} c_{l′−l}
         = (1/m) ∑_{l′} e^{−2πi(j−k)l′/m} ∑_l e^{−2πikl/m} c_l
         = δ_jk c̄_k.
In this case,

c̄_j = ∑_{l=1}^{m} c_l e^{−2πijl/m},   Ā_j = ∑_{l=1}^{m} e^{−2πijl/m} a_l,   d̄_j = ∑_{l=1}^{m} e^{−2πijl/m} d_l. (8.7.7)
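That the unitary matrix (8.7.4) diagonalizes a translation-invariant (circulant) matrix, with eigenvalues given by the Fourier sums (8.7.7), can be checked directly. A sketch with an illustrative 4-periodic symmetric sequence c_l:

```python
import cmath

m = 4
c = [2.0, 0.5, 0.1, 0.5]   # c_l with c_{m-l} = c_l; illustrative

# Circulant matrix c_{jk} = c_{(j-k) mod m}
Cmat = [[c[(j - k) % m] for k in range(m)] for j in range(m)]

# U_{jl} = exp(2 pi i j l / m) / sqrt(m), as in (8.7.4)
U = [[cmath.exp(2j * cmath.pi * j * l / m) / m ** 0.5 for l in range(m)]
     for j in range(m)]
Uh = [[U[l][j].conjugate() for l in range(m)] for j in range(m)]  # U^+

def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

D = mat_mul(mat_mul(Uh, Cmat), U)          # should be diagonal

# Expected eigenvalues: the Fourier sums of (8.7.7)
eig = [sum(c[l] * cmath.exp(-2j * cmath.pi * j * l / m) for l in range(m))
       for j in range(m)]
```

The off-diagonal entries of D vanish to rounding error, and the diagonal reproduces the c̄_j, which is the content of the computation above.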
considered.
Matrix (8.6.22) can be represented as

(K_x)_j δ_jl = (T/c̄_j − N_j/|d̄_j|²) δ_jl.

C = (1/2) ∑_{c̄_j N_j < |d̄_j|² T} ln[ |d̄_j|² T / (c̄_j N_j) ], (8.7.8)

a = (1/2) ∑_{c̄_j N_j < |d̄_j|² T} [ T − c̄_j N_j / |d̄_j|² ]. (8.7.9)
and analogously for Ā_j, d̄_j. As Δ → 0, the expression in the right-hand side converges to the limit

c̄(2πj/T₀) = ∫₀^{T₀} e^{−2πijt/T₀} c(t) dt = ∫_{−T₀/2}^{T₀/2} e^{−2πijt/T₀} c(t) dt. (8.7.12)
Similarly,

Ā_j Δ → Ā(ω_j),   d̄_j Δ → d̄(ω_j),

where

Ā(ω) = ∫₀^{T₀} e^{−iωt} A(t) dt ≡ 1/N(ω),   d̄(ω) = ∫₀^{T₀} e^{−iωt} d(t) dt. (8.7.14)

Moreover,

|d̄_j|² / (c̄_j N_j) → |d̄(ω_j)|² / (c̄(ω_j)N(ω_j)). (8.7.15)
Taking into account (8.7.15), we obtain

C = (1/2) ∑_{j=0}^{∞} ln[ T|d̄(ω_j)|² / (c̄(ω_j)N(ω_j)) ], (8.7.16)

where the summation is carried out over the region c̄(ω_j)N(ω_j) < T|d̄(ω_j)|². These formulae could have been derived if we had disregarded the passage to the limit and considered the following integral unitary transformation:

x̄(ω_j) = (1/√T₀) ∫₀^{T₀} e^{−iω_j t} x(t) dt,   j = 0, 1, 2, … . (8.7.18)
These values constitute Fourier components of the initial function and turn out to be
independent in the stationary case.
3. The other example of a finite-dimensional space is the case when x = (…, x₁, x₂, …) represents a discrete-time process on an infinite interval. Then periodicity constraints (8.7.2) vanish. Corresponding results can be obtained from the formulae of clause 1 via the passage to the limit m → ∞.
8.7 Stationary Gaussian channels

Now it is convenient to represent transformation (8.7.4) in the form

x̄̄(λ_j) = (1/√(2π)) ∑_l e^{−2πijl/m} x_l = (1/√(2π)) ∑_l e^{−iλ_j l} x_l,   λ_j = 2πj/m, (8.7.19)

and write

c̄̄(λ_r) δ(λ_r − λ_k), …

instead of (8.7.6), supposing c̄_j = c̄̄(λ_j), …. In this case, formula (8.7.7) is reduced to the form

c̄̄(λ_j) = ∑_{l=1}^{m} e^{−iλ_j l} c_l, … . (8.7.20)
C = (m/4π) ∑_j ln[ T|d̄̄(λ_j)|² / (N̄̄(λ_j)c̄̄(λ_j)) ] (λ_{j+1} − λ_j).

Dividing both sides of the equality by m and passing to the limit, we will have

C₁ = lim_{m→∞} C/m = (1/4π) ∫_{N̄̄c̄̄ < |d̄̄|²T} ln[ T|d̄̄(λ)|² / (N̄̄(λ)c̄̄(λ)) ] dλ (8.7.21)

and, analogously,

a₁ = lim_{m→∞} a/m = (1/4π) ∫_{N̄̄c̄̄ < |d̄̄|²T} [ T − N̄̄(λ)c̄̄(λ)/|d̄̄(λ)|² ] dλ. (8.7.22)
C₀ = lim_{T₀→∞} C/T₀ = lim_{T₀→∞} (1/4π) ∑_j ln[ T|d̄(ω_j)|² / (N(ω_j)c̄(ω_j)) ] (ω_{j+1} − ω_j)
   = (1/4π) ∫_{c̄N < T|d̄|²} ln[ T|d̄(ω)|² / (N(ω)c̄(ω)) ] dω. (8.7.23)

Analogously,

a₀ = lim_{T₀→∞} a/T₀ = (1/4π) ∫_{c̄N < T|d̄|²} [ T − N(ω)c̄(ω)/|d̄(ω)|² ] dω.
Here, according to (8.7.12), (8.7.14), c̄(ω) = ∫_{−∞}^{∞} e^{−iωt} c(t) dt and d̄(ω) are the spectra of the functions c(t − t′), d(t − t′), and N(ω) is the spectral density of the disturbances (noises):

N(ω) = ∫_{−∞}^{∞} e^{−iωτ} K(τ) dτ. (8.7.23a)
Similarly, we find the rate (per unit of time) function μ₀(s) from (8.7.10):

μ₀(s) = lim_{T₀→∞} μ(s)/T₀ = −(1/4π) ∫_{N̄c̄ < T|d̄|²} { s ln[ T|d̄(ω)|² / (N(ω)c̄(ω)) ] + ln[ 1 − s² + s² N(ω)c̄(ω) / (T|d̄(ω)|²) ] } dω. (8.7.24)
t_{k+1} − t_k = τ₀ (8.7.25)

ω = mλ/T₀ = λ/τ₀ (8.7.26)
8.8 Additive channels

1. Let X and Y be identical linear spaces. The channel [p(y | x), c(x)] is called additive if X̃ = X, and if the probability density p(y | x) depends only on the difference of the arguments: p(y | x) = p₀(y − x). This means that y = x + z is obtained by adding to x a random variable z having probability density p₀(z). Assuming X̃ = X, let us consider those simplifications that the additivity assumption introduces into the theory presented in Sections 8.2 and 8.3.

Now equation (8.3.5) can be written in the form

∫ p₀(y − x) ln p(y) dy = βa − βc(x) − C − H_z, (8.8.1)

Comparing (8.8.4) with (8.3.2), we observe that the operator L in this case is of the following form:

L = e^{μ(d/dx)}. (8.8.5)

With the help of this operator, the problem of finding the channel capacity and the extreme distribution is solved according to the formulae of Section 8.3. The operator transposed with respect to (8.8.5) is

L^T = e^{μ(−d/dx)}. (8.8.6)

Therefore,

F(y)L⁻¹ = (L⁻¹)^T F(y) = e^{−μ(−d/dx)} F(y).
With the help of operators (8.8.5), (8.8.6), formula (8.3.7) can be rewritten as

We also construct some other formulae of Section 8.3. Relations (8.3.10) take the form b(y) = [e^{−μ(d/dx)} c(x)]_{y=x}, ν(y) = e^{−H_z}. That is why we obtain the following potentials from (8.3.11):

ϕ(T) = T H_z − T ln ∫ exp{−(1/T) e^{−μ(d/dx)} c(x)} dx,
Γ(β) = −H_z + ln ∫ exp{−β e^{−μ(d/dx)} c(x)} dx. (8.8.8)
Example 8.1. Let the noise be Gaussian:

p₀(z) = (1/(√(2π) σ)) e^{−z²/2σ²}.

Then

e^{μ(s)} = e^{σ²s²/2}
and, thereby,

e^{−μ(d/dy)} c(y) = e^{−(σ²/2) d²/dy²} y⁴ = ∑_{k=0}^{∞} [(−1)^k σ^{2k} / (k! 2^k)] (d^{2k}/dy^{2k}) y⁴ = y⁴ − 6σ²y² + 3σ⁴, (8.8.10)

because the only derivatives that are different from zero are

(d²/dy²) y⁴ = 12y²,   (d⁴/dy⁴) y⁴ = 24.
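Identity (8.8.10) can be verified through the probabilistic meaning of the operator: e^{+(σ²/2) d²/dy²} f(y) = E[f(y + z)] with z ~ N(0, σ²), so applying that smoothing to y⁴ − 6σ²y² + 3σ⁴ must return y⁴. A sketch using the exact Gaussian moments E z² = σ², E z⁴ = 3σ⁴:

```python
# Check E[(y+z)^4 - 6 s2 (y+z)^2 + 3 s2^2] = y^4 for z ~ N(0, s2),
# using E z = E z^3 = 0, E z^2 = s2, E z^4 = 3 s2^2.

def smoothed(y, s2):
    e4 = y ** 4 + 6 * y ** 2 * s2 + 3 * s2 ** 2   # E (y+z)^4
    e2 = y ** 2 + s2                              # E (y+z)^2
    return e4 - 6 * s2 * e2 + 3 * s2 ** 2

vals = [(y, s2) for y in (-2.0, 0.0, 1.5) for s2 in (0.3, 2.0)]
checks = [abs(smoothed(y, s2) - y ** 4) for y, s2 in vals]
```

The polynomial y⁴ − 6σ²y² + 3σ⁴ is, up to normalization, the fourth Hermite polynomial, which is exactly what makes it invariant in this sense under Gaussian smoothing.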
Accounting for (8.8.10), formulae (8.3.11), (8.8.8) yield

Z = e^{−βϕ} = e^{−H_z − 3βσ⁴} ∫_{−∞}^{∞} e^{−βy⁴ + 6βσ²y²} dy
  = 2β^{−1/4} e^{−H_z − 3βσ⁴} ∫₀^{∞} e^{−t⁴ + 6√β σ²t²} dt
  = σ√(3/2) e^{−H_z} e^{(3/2)βσ⁴} [ K_{1/4}((9/2)βσ⁴) + π√2 I_{1/4}((9/2)βσ⁴) ], (8.8.11)

Γ = −H_z + (1/2) ln(3σ²/2) + (3/2)βσ⁴ + ln[ K_{1/4}((9/2)βσ⁴) + π√2 I_{1/4}((9/2)βσ⁴) ].

Hence, denoting F(λ) = K_{1/4}(λ) + π√2 I_{1/4}(λ), λ = (9/2)βσ⁴, we have

a = −(3/2)σ⁴ − (9/2)σ⁴ F′(λ)/F(λ),

C = (1/2) ln[3/(4πe)] − λ F′(λ)/F(λ) + ln F(λ).
Example 8.2. Now suppose that

θ(s) = e^{μ(s)}.

Since

θ(d/dx) e^{±αx} = θ(±α) e^{±αx},

we now have

θ⁻¹(d/dx) (1/2)(e^{αx} + e^{−αx}) = (1/2)θ⁻¹(α) e^{αx} + (1/2)θ⁻¹(−α) e^{−αx} = θ⁻¹(α) cosh αx

in consequence of the mentioned symmetry. For this case, we compute the partition function:

Z = e^Γ = ∫_{−∞}^{∞} exp{−H_z − (β/θ(α)) cosh αy} dy = (2/α) K₀(β/θ(α)) e^{−H_z}.

Hence, denoting β̃ = β/θ(α),

a = K₁(β̃)/[θ(α) K₀(β̃)],

C = −H_z + ln(2/α) + ln K₀(β̃) + β̃ K₁(β̃)/K₀(β̃). (8.8.12)
c(x) = (1/2) x^T c x,   μ(s) = (1/2) s^T K s

(a matrix form), and the transformation e^{−μ(d/dx)} c(x) is reduced to the following:

e^{−μ(d/dx)} c(x) = ∑_{n=0}^{∞} [(−1)^n / (n! 2^n)] [(∂/∂x)^T K (∂/∂x)]^n (1/2) x^T c x = (1/2) x^T c x − (1/2) tr(Kc).

Indeed, K (d/dx)(1/2) x^T c x = Kcx, so that

(d/dx)^T K (d/dx) (1/2) x^T c x = tr(Kc),

and all higher-order derivatives vanish. Potentials (8.8.8) are easily obtainable in this case using formula (5.3.19), which yields

Γ = −H_z + (β/2) tr(Kc) − (1/2) tr ln(βc/2π). (8.8.13)
Because

H_z = (1/2) tr ln(2πK) + (1/2) tr 1,

formulae (8.2.18), (8.2.19) lead to the following result:

a = −(1/2) tr(Kc) + (T/2) tr 1,
C = −(1/2) tr ln(βc/2π) − (1/2) tr ln(2πK) = −(1/2) tr ln(cK) − (1/2) ln β · tr 1. (8.8.14)
The concept of the value of information, introduced in this chapter, connects Shannon's information theory with statistical decision theory. In the latter theory, the most basic notion is that of average cost or risk, which characterizes the quality of decisions being made. The value of information can be described as the maximum benefit that can be gained in the process of minimizing the average cost with the help of a given amount of information. Such a definition of the value of information turns out to be related to the formulation and the solution of certain conditional variational problems.
The notion of the value of information can be defined in three related ways, based either on the amount of Hartley's information, Boltzmann's information or Shannon's mutual information. Choosing Shannon's mutual information necessitates solving the third variational problem. There exists a certain relationship between these definitions, and one concept can be conveniently substituted for the other. All of these concepts characterize a particular object—the Bayesian system—which, along with the communication channel, is a major object of study in information theory.
The theory of the value of information is an independent branch of informa-
tion theory, but is rooted in communication theory. Some of its elements and re-
sults emerged from the traditional theory studying communication channels. Claude
Shannon [45] (originally published in English [38]) considered the third variational problem in 1948, posing it as an entropy minimization problem under a constraint on the level of costs or, using Shannon's terminology, under a given rate of distortion. This terminology is quite far from that of statistical decision theory, but it certainly does not change the mathematical essence. Later Kolmogorov [26]
(translated to English [27]) introduced the notion of ε-entropy based on this variational problem and obtained some related results. Instead of the term ε-entropy, we shall use the term α-information, because we shall use Shannon's mutual information rather than entropy.
Recently, this theory (in the original Shannon interpretation) has been developed significantly in the research papers of American scientists (especially in Berger's monograph [3]) following Shannon's papers [42, 45] (original papers in English [38, 40]). However, we adhere to a different interpretation and terminology.
We emphasize that the class of problems associated with the third variational
problem is equivalent in significance to that associated with the second and the first
variational problems. (Of course, this does not preclude a unified approach; see, for
example, the assertion of the generalized Shannon’s theorem in Section 11.5.)
The utility of information is that it allows one to reduce the losses associated with the average cost. It is assumed that a cost function is defined, which imposes different costs for different actions and decisions. More successful actions yield smaller costs and bigger rewards in comparison with less successful actions. Our goal is to minimize the average cost. The available information allows us to achieve a lower level of the average cost.
Before proceeding to the mathematical formulation of the above, let us first consider in this introductory section a simpler problem (of the same type as the first variational problem). This problem illustrates the fact that a high level of uncertainty in the system (neg-information) does indeed increase the level of losses.
Assume there is a system with discrete states. The system takes one of those possible states at a time. The random variable ξ describing a certain state assumes a fixed value. Also, assume a cost function c(ξ) is given, which was chosen according to the purpose of the system. For instance, if one desires that the system be positioned near the null state ξ = 0 (a stabilization problem), then one may choose the cost function c(ξ) = |ξ|.
Suppose that, for whatever reason, the system cannot reach the ideal equality
ξ = 0. For example, inevitable fluctuations in the component parts of the system
entail statistical scattering, i.e. there is uncertainty or neg-information. In this case,
the value of the variable ξ will be random and described by some probabilities P(ξ ).
As is well known, entropy is the measure of uncertainty:

H_ξ = −∑_ξ P(ξ) ln P(ξ). (9.1.1)

Assume that the amount of uncertainty H_ξ is fixed and consider the expected value of possible costs E[c(ξ)].
of possible costs E[c(ξ )]. There exists some lower limit for these costs that can be
found via the methods mentioned in Sections 3.2, 3.3 and 3.6. In fact, the problem
of finding the extremum of average costs given constant entropy has already been
solved. We revisit the solution of that problem now. The optimal distribution of
probabilities has the form
P(ξ ) = eβ F0 −β c(ξ ) , (9.1.2)
where

F₀ = −T ln ∑_ξ e^{−c(ξ)/T} (9.1.3)

9.1 Reduction of average cost under uncertainty reduction

[see equation (3.3.5)]. The parameter β = 1/T is determined from the constraint of fixed entropy (9.1.1). Further, it follows from (3.3.15) that

−dF₀(T)/dT = H_ξ. (9.1.4)

After determining the parameter β or T, we compute the minimum average cost by the formula

R₀(H_ξ) = d(βF₀)/dβ = F₀ − T ∂F₀/∂T. (9.1.5)
These formulas show how the minimum average cost depends on the uncertainty H_ξ in the system. According to Theorem 3.4, the average cost R₀(H_ξ) for T > 0 gets bigger if entropy H_ξ increases. Now assume that there is an inflow of information I that reduces entropy according to (1.1.2). If the system contained uncertainty (neg-information) H_ξ initially and that uncertainty decreased to the value H_ξ − I = H_ps because of the inflow of information, then, obviously, this has led to the cost reduction
This difference indicates the benefit brought about by the information I. It is a quan-
titative measure of the value of information.
Assume I = ΔH_ξ is very small. Then it follows from (9.1.6) that

ΔR₀ ≈ (dR₀/dH_ξ) I = T ΔH_ξ.
Hence, the derivative dR0 /dHξ = T may be regarded as the differential value of
entropy reduction (differential value of information).
Example 9.1. Let ξ be integer-valued and the cost function be given by c(ξ ) = |ξ |.
Denoting e−β = z, we obtain the partition function for this problem
e^{−βF₀} = ∑_{ξ=−∞}^{∞} e^{−β|ξ|} = 1 + 2 ∑_{ξ=1}^{∞} z^ξ = 1 + 2z/(1 − z) = (1 + z)/(1 − z).

Therefore,

H_ξ = −2z ln z/(1 − z²) + ln[(1 + z)/(1 − z)]. (9.1.9)
9 Definition of the value of information
The behaviour of function R0 (Hξ ) and the differential value of information T (Hξ )
is represented on Figure 9.1.
Fig. 9.1 Average cost and differential value of information as functions of entropy (Example 9.1)
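Formula (9.1.9), together with the minimum cost R₀ = E|ξ| = 2z/(1 − z²) (obtained by differentiating the partition function; this closed form is not displayed in the text), can be checked against direct summation over P(ξ) ∝ z^{|ξ|}. The truncation at |ξ| ≤ 200 is a numerical convenience:

```python
import math

z = 0.5
Z = (1 + z) / (1 - z)                 # partition function e^{-beta F0}
xs = range(-200, 201)                 # truncation; the tail is negligible
P = [z ** abs(x) / Z for x in xs]

H_direct = -sum(p * math.log(p) for p in P)
R_direct = sum(abs(x) * p for x, p in zip(xs, P))

H_formula = -2 * z * math.log(z) / (1 - z ** 2) + math.log((1 + z) / (1 - z))
R_formula = 2 * z / (1 - z ** 2)
```

The pair (H_formula, R_formula), traced as z varies from 0 to 1, is exactly the curve R₀(H_ξ) of Figure 9.1.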
Example 9.2. Assume that ξ may only assume the eight values −3, −2, −1, 0, 1, 2,
3, 4. The cost function remains the same as in the previous example, i.e. c(ξ ) = |ξ |.
Then

H_ξ = −2z ln z (1 + z² − 2z³)/(1 − z⁴) + ln[(1 + z)(1 − z⁴)/(1 − z)]. (9.1.12)
With the growth of temperature T, the entropy H_ξ and the average cost R₀ monotonically increase. In the limiting case of T → ∞, β → 0, z → 1 we have the maximum possible entropy

H_ξ = lim_{z→1} ln[(1 + z)(1 − z⁴)/(1 − z)] = ln 8 = 3 ln 2 = 3 bits

and average cost

R₀ = lim_{z→1} 2z(1 + z² − 2z³)/(1 − z⁴) = 2,

which correspond to the uniform distribution with P(ξ) = 1/8.
As noted above, a decrease in uncertainty in the system may be achieved by gaining information. The amount of information was conceived simply as the difference between two entropies H_ξ for a single random variable. Meanwhile, according to the discussion in Chapter 6, the amount of information I = I_xy is a more complex concept that presupposes the existence of two random variables x and y (rather than a single ξ). There has to be an unknown random variable x, about which the information is communicated, and a random variable y, which carries that information. This leads us to a complication in the reasoning and the need to turn from the simpler (the first) variational problem to a more sophisticated one, which we shall designate as the third variational problem of information theory.
1. Consider the following example similar to the examples from the previous section. Let x be an internal coordinate of a system assuming values −3, −2, −1, 0, 1, 2, 3, 4. Let u be an estimating variable selected from the same values, similar to the variable ξ from Example 9.2.

Assume that the eight points lie on the circumference (see Figure 9.3). It is desirable that the variable u be located as close as possible to the internal variable x. For example, this desire may be described by introducing the following cost function:
Fig. 9.3 The graph of the cost function for the considered example
Also, we can optimally partition the eight points into two sets, i.e. minimize the
average cost
E[min_u E[c(x, u) | y]].
It can be easily checked that the best way to partition the circumference containing
the eight points is to split it into two equal semicircles, each of which consists of
four points. It is reasonable to choose any point within each of those semicircles.
The proposed choice leads to the following average cost:
E[c(x, u)] = 0·(1/4) + 2·1·(1/4) + 1·2·(1/4) = 1.
Thus, in the optimal case, 1 bit of information yields a benefit that reduces the
average cost from 2 to 1. Consequently, the value of 1 bit of information is equal to
one: V (1) = 1, where we denote the described value as V .
A similar analysis can be performed for the case of 2 bits of information. In this
particular case, the set E is partitioned into four parts—ideally each part is a pair of
two adjacent points. Having determined which pair x belongs to, we choose u to be
any point from it. Then u either equals x with probability 1/2 or differs from x by 1
with probability 1/2. The average cost turns out to be equal to 1/2. Thus, 2 bits of
information reduces the losses from 2 to 1/2. The value of 2 bits of information is
equal to 3/2: V (2) = 3/2.
Having received 3 bits of information, we can determine the exact value of x and
assign u = x, thus nullifying the cost. Therefore, the value of 3 bits of information
in the problem under consideration equals to 2: V (3) = 2. The values indicated
above (V (1), V (2) and V (3)) correspond to the points A, B and C on Figure 9.4,
respectively.
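The values V(1), V(2), V(3) quoted above can be recovered by brute force, assuming the cost c(x, u) is the circular distance between the points (consistent with the numbers computed in the text). A sketch: for each number of cells M, enumerate all assignments of the eight points to cells and take the best partition.

```python
from functools import lru_cache
from itertools import product

n = 8
def cost(x, u):
    # circular distance between points x and u on the 8-point ring
    d = abs(x - u) % n
    return min(d, n - d)

@lru_cache(maxsize=None)
def cell_cost(cell):
    # total cost of a cell under its best representative u
    return min(sum(cost(x, u) for x in cell) for u in range(n))

# No information: best single estimate, R0 = 2
R0 = min(sum(cost(x, u) for x in range(n)) for u in range(n)) / n

def best_benefit(M):
    best = float('inf')
    for assign in product(range(M), repeat=n):
        cells = [[] for _ in range(M)]
        for x, k in enumerate(assign):
            cells[k].append(x)
        total = sum(cell_cost(tuple(c)) for c in cells if c)
        best = min(best, total / n)
    return R0 - best

V = {M: best_benefit(M) for M in (1, 2, 3, 4)}
```

The enumeration reproduces the benefits 0, 1, 1.375 and 3/2 for M = 1, 2, 3, 4, i.e. V(1 bit) = 1, V(ln 3) = 1.375 and V(2 bits) = 3/2.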
If M = e^I = 2^{I_bits} equals 3, then we should partition the eight points into three
domains: E1 , E2 and E3 . We need to compute the benefit
2 − E[min_u E[c(x, u) | E_k]]
for every partition (k = 1, 2, 3) and select the optimal one. Calculation shows that the
optimal partition is the following: E1 consists of two points, E2 and E3 both consist
of three points. The corresponding benefit is 1.375. The resulting point (ln 3, 1.375) is displayed on the plane (I_bits, V) (the point D on Figure 9.4). Points corresponding
to M = 5, 6, 7 can be constructed in the same way. We enumerate optimal partitions
and their corresponding coordinates in the table below:
Each value y = yk can be associated with one optimal value u = uk . Thus, the func-
tion u = d(y(x)) takes at most M values, similar to y(x).
9.2 Value of Hartley’s information amount. An example 297
Averaging the conditional average cost (9.2.2), we obtain the total average cost

R = E[ min_u E[c(x, u) | y] ].   (9.2.3)
If we do not have any information about the value of the unknown random vari-
able x, then we have only one way to choose the optimal estimator u. We minimize
the average cost E[c(x, u)] = ∑x c(x, u)P(x) by u, which gives the following level of
losses:
R_0 = min_u ∑_x c(x, u) P(x) = min_u E[c(x, u)].   (9.2.4)
Note that E[c(x, u)] = E[c(x, u) | u], i.e. we do not average over u. Naturally, the ben-
efit yielded by the received information is related to the difference in losses (9.2.3)
and (9.2.4).
Define the value of Hartley's amount of information I = ln M as the maximum benefit that can be obtained from it:

V(ln M) = min_u E[c(x, u)] − inf_{y(x)} E[ min_u E[c(x, u) | y] ].   (9.2.5)
Here we do not just minimize by u—we also minimize over all possible functions
y(x) taking M values.
Theorem 9.1. The minimization (9.2.5) over all M-valued functions y(x) can be
restricted only to the set of deterministic (non-randomized) dependencies of y on
x. In other words, taking into account randomized dependencies y(x) (when y is
random for a fixed x) does not alter the extremum.
Proof. Assume that y = yr (x) depends on x in a randomized way and ranges over
values y1 , . . . , yM . This means that the variable y is random for a fixed x and is
described by some probability distribution Pr (y | x). Let dr (y) be an optimal solution
for the dependency in question determined from (9.2.2). Express its losses (9.2.3)
as follows:
Thus,
Denote the right-hand side of the last formula by Rn . It follows from the inequali-
ties (9.2.8) and (9.2.9) that
i.e. the non-randomized dependency yn (x) is not worse than the randomized depen-
dency yr (x) with respect to average cost.
Fig. 9.5 System of transmission of the most valuable information. Channel without noise. MD—
measuring device
3. One simple application of the concept of the value of Hartley's information immediately follows from the aforementioned definition: the construction of a measuring-transmitting system subject to an informational restriction.
Let there be given a measuring device (MD) (see Figure 9.5) with an output
signal x equal to a measurable quantity, say continuous. However, suppose the exact
value of x cannot be conveyed to the device’s user due to either a (noiseless) channel
with limited capacity or a recording device with limited information capacity (the
variable y can assume only M = [eI ] different values). The goal is to receive values
of u ‘the closest’ to x with respect to the cost function c(x, u). To achieve this goal,
we must construct blocks 1 and 2 shown in Figure 9.5 such that the average cost is
minimized. Since the total amount of information is limited, we need to transmit the
most valuable information through the channel.
Taking into account the definition of the value of the Hartley’s amount of in-
formation, it becomes clear how to solve this problem. Block 1 must partition the
feasible space of values x into optimal domains E1 , . . . , EM mentioned in (9.2.6)
and deliver the index of a certain domain, i.e. y = k. After receiving one of the
possible signals k, block 2 must output the value uk that minimizes the conditional
expectation E[c(x, u) | Ek ].
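For finite alphabets, the two blocks can be sketched directly (an illustrative construction, not the book's notation; the partition E_1, ..., E_M, the prior P and the cost c are assumed given):

```python
def make_blocks(points, P, cost, partition):
    # Block 2's outputs: for each domain E_k, the value u_k minimizing E[c(x, u) | E_k]
    reps = []
    for cell in partition:
        w = sum(P[x] for x in cell)
        reps.append(min(points,
                        key=lambda u: sum(P[x] * cost(x, u) for x in cell) / w))

    def block1(x):
        # partition the space of values x and deliver the index of x's domain
        return next(k for k, cell in enumerate(partition) if x in cell)

    def block2(k):
        # after receiving the signal k, output the optimal value u_k
        return reps[k]

    return block1, block2

# Usage on the eight-point example: two semicircles, circular-distance cost
points = range(8)
P = {x: 1 / 8 for x in points}
cost = lambda x, u: min(abs(x - u) % 8, 8 - abs(x - u) % 8)
b1, b2 = make_blocks(points, P, cost, [[0, 1, 2, 3], [4, 5, 6, 7]])
avg = sum(P[x] * cost(x, b2(b1(x))) for x in points)
print(avg)  # -> 1.0, matching the one-bit figure of Section 9.2
```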
The above is generally valid in the case of a noisy channel, that is, when the channel's output signal does not necessarily coincide with its input signal y. This case
can be asymptotically reduced to the previous one if the system is forced to work
repeatedly, where we substitute x by ξ = (x1 , . . . , xn ), u by ζ = (u1 , . . . , un ) (n is
relatively large), and apply Shannon’s noisy-channel coding theorem. Thus, as is
shown on Figure 9.6, we need to install a channel encoder (CE) at the entrance of
the channel and a channel decoder (CD) at the exit. We assume that both the encoder
and the decoder function according to optimal encoding and decoding theory (see
Chapter 7). The structure of the blocks 1 and 2 is the same as on Figure 9.5.
Fig. 9.6 System of transmission of the most valuable information. Channel with noise. CE—
channel encoder, CD—channel decoder, ν = (y1 , . . . , yn )
The above method of defining the value of information V(I) has several drawbacks. First of all, it defines the value of information only for integer numbers M = [e^I]. The question remains: what is the benefit of the fractional part of e^I? Secondly, if in the above example we change the number of points by taking, say, 9 points, 10 points, and so forth, then the value of information will fluctuate irregularly. Thirdly, Hartley's amount of information I = ln M, that is, the argument of the function V(I), is not characterized by the difference of two entropies as in Section 9.1. Therefore, the definition given here is not consistent with the ideas of Section 9.1. It can be made more consistent if we take Shannon's mutual information I_xy as an argument, because I_xy is equal to the difference of entropies H_x − H_{x|y}, where H_x is the entropy before signal receipt, and H_{x|y} is the entropy after it.
The definition of the value of information proposed below has features of both
aforementioned definitions (Sections 9.1 and 9.2).
I_xu = I,   (9.3.3)

where I is an arbitrarily chosen number. Further analysis shows that this same extremum distribution results in an extremum (to be more precise, a minimum) of the information I_xu under the fixed average cost:

I_xu = min,   ∫∫ c(x, u) P(du | x) P(dx) = α = fixed.   (9.3.4)

Along with R(I) we can also consider the inverse dependence I_xu(α). The value I(α) is called the information corresponding to the level of losses R = α or, succinctly, α-information. As can be seen from Theorem 9.6 below, the function I_xu(α)
9.3 Definition of the value of Shannon’s information amount and α -information 301
is convex (see Figure 9.7). Therefore, the function R(I) is, in general, two-valued. In the general case, the function I(R) takes a minimum value equal to zero on some interval

R_0 ≤ R ≤ R^0.   (9.3.6)

Call the function R(I) = R_+(I), inverse to I(R) for R ≤ R_0, the normal branch. Also, call the function R(I) = R_−(I), inverse to I(R) for R ≥ R^0, the anomalous branch. We can define the value of Shannon's information for the normal branch as

V(I) = R_0 − R_+(I).   (9.3.7)

Further, define the value of Shannon's information for the anomalous branch by the formula

V(I) = R_−(I) − R^0.   (9.3.8)
With this definition, the value of information is always non-negative. In certain
cases, one of the branches may be thrown to infinity, i.e. may be absent.
In order to clarify the meaning of definitions (9.3.7) and (9.3.8), we first consider the notions of R_0 and R^0. The range (9.3.6) corresponds to null information I_xy = 0. This means that the distribution P(du | x) does not depend on x in (9.3.5). Thus,

R(0) = ∫ P(du) E[c(x, u) | u],

where E[c(x, u) | u] = ∫ c(x, u) P(dx). Hence, we can obtain the range of variation of R(0) over all possible P(du).

Further, it will be seen from Theorem 9.2 (Section 9.4, paragraph 2) that the normal branch corresponds to the minimum cost under the condition (9.3.3). Thus, the formula (9.3.7) can be expressed as

V(I) = min_u E[c(x, u)] − inf_{P(du|x)} E[c(x, u)],   (9.3.12)

where the infimum is taken over transition probabilities P(du | x) subject to

I_xu ≤ I.   (9.3.13)
Comparing (9.3.13) with (9.2.5) we notice an analogy between these two defini-
tions of the value of information. In both cases V (I) has the meaning of the largest
possible reduction of average cost (under the condition of a fixed amount of I).
Furthermore, the formula (9.3.8) on account of (9.3.9) and (9.3.11) takes the form
In this case, the function c(x, u) should be interpreted not as a cost, but as a reward.
The value of information then has the meaning of the largest possible average reward
yielded by a given amount of information I. Of course, it is not hard to come up with a version of Hartley's value of information corresponding to this case. Instead of the formula (9.2.5) we get
Accordingly, the value function is located in the interval 0 < V(I) < R_0 − R_L (normal branch) and in the interval 0 < V(I) < R_U − R^0 (anomalous branch).
Very important corollaries that connect the theory of the value of information
with the theory of optimal statistical solutions follow directly from the defini-
tion (9.3.12), (9.3.13) of the Shannon’s value of information. We provide the fol-
lowing easily proven results.
Theorem 9.2. Assume that a Bayesian system [c(x, u), P(x)] and an observed vari-
able y(x) with conditional probability distribution P(y | x) are given. Irrespective
of the specific decision rule u = d(y) (randomized or non-randomized), the level of average cost satisfies the inequality

E[c(x, u)] ≥ R_+(I_xy).   (9.3.17)
Proof. It is not hard to see that whatever algorithm u = d(y) we choose, the amount of information about the unknown value x cannot increase, i.e. the following inequality holds:

I_xu ≤ I_xy.   (9.3.18)
It follows from (9.3.12) and (9.3.13) that
belongs to the set of distributions enumerated while minimizing minP(u|x) E[c(x, u)].
As discussed above, the dependence V (I) is non-decreasing. Therefore, (9.3.17)
follows from (9.3.18) and (9.3.19). The proof is complete.
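Inequality (9.3.18) is the data-processing inequality for the chain x → y → u, which is easy to confirm numerically on random discrete distributions (the alphabet sizes below are chosen arbitrarily):

```python
import random, math

random.seed(0)

def rand_dist(n):
    # a random probability vector of length n
    w = [random.random() for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

def mutual_info(pxy):
    # I(x;y) = sum p(x,y) ln [ p(x,y) / (p(x) p(y)) ]
    px = [sum(row) for row in pxy]
    py = [sum(col) for col in zip(*pxy)]
    return sum(p * math.log(p / (px[i] * py[j]))
               for i, row in enumerate(pxy)
               for j, p in enumerate(row) if p > 0)

Px = rand_dist(4)                          # prior P(x)
Pyx = [rand_dist(3) for _ in range(4)]     # observation channel P(y|x)
Puy = [rand_dist(5) for _ in range(3)]     # randomized decision rule P(u|y)

Pxy = [[Px[i] * Pyx[i][j] for j in range(3)] for i in range(4)]
Pxu = [[sum(Pxy[i][j] * Puy[j][k] for j in range(3)) for k in range(5)]
       for i in range(4)]

assert mutual_info(Pxu) <= mutual_info(Pxy) + 1e-12   # I_xu <= I_xy, (9.3.18)
print("data-processing inequality holds")
```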
This theorem is evidence of the fruitfulness of the concept of the value of infor-
mation. The question of how to actually attain extremely small average cost, which
is studied by the theory of the value of information, will be covered in Chapter 11.
The idea of information corresponding to a predetermined level of losses was first introduced by Shannon [45] (the English original is [38]) and by Kolmogorov [26] (translated to English [27]). Shannon named it the rate of generating messages, whereas Kolmogorov called it W-entropy or ε-entropy. The notion of the value of information was introduced by Stratonovich [47].
The quantities of Hartley's and Shannon's values of information do not coincide; denoting Hartley's value by V_H(I), the inequality V_H(I) ≤ V(I) follows from their definitions. An important result of information theory is that V_H(I) ≈ V(I) asymptotically (Chapter 11). This result is as profound and significant as the results considered in Chapter 7 about asymptotically errorless communication through a noisy channel.
1. We call the problem, introduced in the previous section, of finding the extremum distribution for the dependence R(I) or V(I) the third variational problem of information theory. As with the second variational problem, we assume for simplicity that x and u are discrete random variables. We then extend the obtained results to the general case.
We vary the probabilities P(x, u) of the joint distribution of x and u in order to make the average cost attain an extremum:

E[c(x, u)] = ∑_{x,u} c(x, u) P(x, u) = extr   (9.4.1)

under a fixed amount of information

I_xu = I   (9.4.2)

[see (9.3.3)]. Since the a priori distribution P(x) is not supposed to change, it is necessary to add the constraint

∑_u P(x, u) = P(x).   (9.4.3)

The normalization constraint

∑_{x,u} P(x, u) = 1   (9.4.4)

can be ignored, since it is a consequence of (9.4.3). We can also omit the non-negativity constraint P(x, u) ≥ 0, because the solution obtained for the problem without it satisfies this constraint (see below).

9.4 Solution of the third variational problem. The corresponding potentials. 305

We solve the constrained extremum problem (9.4.1)–(9.4.3) by the method of Lagrange multipliers 1/β, γ(x), finding the
extremum of the expression
Denote by Z the set of pairs (x, u) (i.e. Z is a subset of the space X × U) having positive probabilities P(x, u) > 0 in an extreme value distribution. The partial derivative of (9.4.5) with respect to P(x, u) must be equal to zero. We do not differentiate ln P(x) because of (9.4.3). After differentiating K with respect to P(x, u) we obtain (which is a necessary condition for an extremum) the equality

ln [P(x, u) / (P(x) P(u))] + β c(x, u) + γ(x) = 0,   (9.4.6)

or, equivalently,

P(x, u) = P(x) P(u) e^{−γ(x) − β c(x, u)}.   (9.4.7)

Here we denote

∑_x P(x, u) = P(u).   (9.4.8)
Values of β, γ(x) can be determined from (9.4.2) and (9.4.3). Multiplying (9.4.6) by P(x, u) and summing over x, u, we obtain

β R + Γ = −I_xu,   (9.4.9)

where

Γ = ∑_x γ(x) P(x)   (9.4.10)

and

∑_{u∈Z_x} P(u) e^{−β c(x, u)} = e^{γ(x)}.   (9.4.13)
Proof. Differentiating the function (9.4.5) twice with respect to the variables P(x, u), (x, u) ∈ Z, we find the matrix of all second-order partial derivatives

∂²K / ∂P(x, u) ∂P(x′, u′) = ∂²I / ∂P(x, u) ∂P(x′, u′) = δ_xx′ δ_uu′ / P(x, u) − δ_uu′ / P(u)   (9.4.14)
for (x, u) ∈ Z, (x′, u′) ∈ Z. Further, let us prove that this matrix is positive semi-definite. It suffices to show that the x-matrix

L_xx′ = δ_xx′ / P(x, u) − 1 / P(u)

(u is fixed) is positive semi-definite, i.e. that the matrix δ_xx′/a_x − 1/∑_x a_x is positive semi-definite for arbitrary a_x > 0. We construct the quadratic form

∑_{x,x′} L_xx′ v_x v_x′ = ∑_x v_x² / P(x, u) − (∑_x v_x)² / P(u).

Substituting v_x = w_x P(x, u), we rewrite this form as

P(u) { ∑_x w_x² P(x | u) − [∑_x w_x P(x | u)]² }.
But the expression in the braces above is actually the variance of the variable w_x with respect to the distribution P(x | u). Its non-negativity entails the positive semi-definiteness of the matrix L_xx′ and, consequently, of the matrix (9.4.14).
As a result, K and I_xu both turn out to be convex functions of the arguments P(x, u), (x, u) ∈ Z. We expand these functions into Taylor series at the point corresponding to the extreme value distribution (9.4.7). Then we take into account the fact that the linear terms of this expansion vanish for the distribution in question. Finally, we use the above-mentioned positive semi-definiteness to obtain

dK ≥ 0;   dI_xu ≥ 0.   (9.4.15)
Since these relationships are valid for arbitrary variations of the variables P(x, u),
(x, u) ∈ Z, they are also valid for variations compatible with additional constraints (9.4.3)
and others. If the condition E[c(x, u)] = a holds, we have dI_xu ≥ 0 according
to (9.4.15), which proves the first assertion of the theorem. In order to prove the
Together with the conditions I_xu = I and (9.4.3), it yields β d(E[c(x, u)]) ≥ 0, i.e. d(E[c]) ≥ 0 when β > 0, and d(E[c]) ≤ 0 when β < 0. The proof is complete.
Theorem 9.4. The 'active' domain Z, where the extremum probabilities are non-zero (P(x, u) > 0), is cylindrical:

Z = X × U.   (9.4.16)

Here X and U are the sets where P(x) > 0 and P(u) > 0, respectively.
Proof. Suppose that Z = Z_1 does not coincide with X × U. Then clearly Z_1 must be a subset of the region X × U. According to (9.4.7), an extreme value distribution for the region Z_1 can be expressed as follows:

P_1(x, u) = P(x) P_1(u) e^{−γ_1(x) − β c(x, u)},   (x, u) ∈ Z_1,   (9.4.17)

where γ_1(x), P_1(u) satisfy all the necessary conditions. Employing (9.4.17), we construct an auxiliary distribution P_2(x, u) with probabilities that are non-zero in the broader region Z_2 = X × U.
Further, put

P_2(x, u) = P(x) P_2(u | x),   (x, u) ∈ Z_2,   (9.4.18)

where

P_2(u | x) = P(u) e^{−γ_1(x) − β c(x, u)} / ∑_{u′∈U} P(u′) e^{−γ_1(x) − β c(x, u′)} ≡ P(u) e^{−γ_2(x) − β c(x, u)}.   (9.4.19)
Taking account of the equality (9.4.13), we obtain

β R_2 + I_2 ≤ β R_1 + I.   (9.4.20)
Earlier the parameter β was assumed to be fixed. Now we consider the whole
family Z(β ) of active domains dependent on β and conduct the previously described
extension Z_1(β) → Z_2(β) for every β. Since the inequality (9.4.20) holds for each β, it follows that I_2(R) ≤ I_1(R), or R_2(I) ≤ R_1(I) for β > 0. Consequently, considering only the cylindrical 'active' domain (9.4.16), we do not lose optimality. This completes the proof.
Both the right-hand and left-hand sides of (9.4.7) are equal to zero outside of the domain X × U. That is why the equality

P(x, u) = P(x) P(u) e^{−γ(x) − β c(x, u)}   (9.4.21)

holds everywhere for an extreme value distribution. Due to Theorem 9.4, the equalities (9.4.12) and (9.4.13) can be expressed as follows:

∑_x P(x) e^{−γ(x) − β c(x, u)} = 1,   u ∈ U,   (9.4.22)

∑_{u∈U} P(u) e^{−β c(x, u)} = e^{γ(x)}.   (9.4.23)

We can also use the entire space of x and u for summation in the previous two formulae.
The equations (9.4.22) and (9.4.23) under the corresponding constraint (9.4.2)
and a fixed domain U allow us to find the optimal distribution (9.4.21). Hence,
we can represent an extremum dependence R(I) for each domain U that satisfies a
number of conditions. For a complete solution, the problem of how to choose an
active domain U from the set of feasible domains D remains. It is natural to use the
condition of extremum here:

R(I) = extr_{U∈D} R_U(I),   (9.4.24)

which results from (9.4.1); here R_U(I) denotes the dependence obtained for a fixed domain U. If the set D allows for continuous changes δU of the domain U then, as a rule, condition (9.4.24) can be replaced by the stationarity condition

δR = 0,   (9.4.25)

where the variation δR corresponds to the variation δU of the domain U while the information I remains constant:

δI = 0.   (9.4.26)
Despite the vanishing of the variations (9.4.25) and (9.4.26), the variation δU can be accompanied by non-zero variations δβ and δΓ. Using (9.4.25) and (9.4.26), we derive the following condition from (9.4.9):

R δβ = −δΓ,   (9.4.27)

which will be used in the future. As will be shown later, this condition can be expressed similarly to (10.1.8), which means extremality of the potential Γ for a fixed β.
Let us introduce the function

F = R + T I,   (9.4.28)

which is an analog of the free energy, where β = 1/T. Indeed, (9.4.28) resembles the famous relation from thermodynamics F = U − T H (U is internal energy, H is entropy); these relations differ only by the sign of the term T I.
We proceed to the derivation of other formulae that resemble the usual relations
from thermodynamics. We begin with the one involving the potential Γ (β ) instead
of the free energy F(T ) = −T Γ (1/T ).
Theorem 9.5. The following equations hold for the third variational problem:

dΓ/dβ = −R,   (9.4.29)

β dΓ/dβ − Γ = I,   (9.4.30)

dR/dI = −1/β,   (9.4.31)

by analogy with the first two variational problems.
Proof. We vary the parameter β in equation (9.4.22). This variation is accompanied, in general, by variations of the function γ(x) and of the active domain U. Express the variations dγ(x) and dΓ as sums of two variations:

dγ(x) = d_1 γ(x) + δγ(x),   dΓ = d_1 Γ + δΓ.   (9.4.32)

Variations d_1 γ(x) and d_1 Γ correspond to the variation d_1 β of the parameter β for a constant domain U, while variations δγ(x), δΓ, δβ = dβ − d_1 β correspond to the variation δU of the domain U.

We differentiate (9.4.22) for a constant domain U and obtain

∑_x [d_1 γ(x)/d_1 β + c(x, u)] P(x) e^{−γ(x) − β c(x, u)} = 0,   u ∈ U,
i.e. we obtain

d_1 Γ / d_1 β = −R   (9.4.33)
by (9.3.5) and (9.4.10). Formulae (9.4.33) and (9.4.11) clearly imply the next relation:

β d_1 Γ / d_1 β − Γ = I.   (9.4.34)
We can simply obtain the required relations (9.4.29), (9.4.30) if we take into account the fact that

dΓ/dβ = (d_1 Γ + δΓ) / (d_1 β + δβ) = −R

in consequence of (9.4.27), (9.4.33). For the derivation of (9.4.31) it suffices to take the differential of both sides of the equality (9.4.30), which gives

(dΓ/dβ) dβ + β d(dΓ/dβ) − dΓ = dI.

Since dΓ = (dΓ/dβ) dβ, the first and third terms cancel, so that dI = β d(dΓ/dβ) = −β dR by (9.4.29), which yields (9.4.31).
Analogous relations hold for the function F(T):

F − T dF/dT = R,   (9.4.35)

dF/dT = I,   (9.4.36)

dR/dI = −T.   (9.4.37)
From these relations, it is clear that the dependence R(I) turns out to be a Legendre transform

−R(I) = −F(T(I)) + T(I) I,   I = dF(T)/dT,   (9.4.38)

of the function F(T). Further, the function I(R) is essentially the following Legendre transform

I(R) = −β(R) R − Γ(β(R)),   R = −(dΓ/dβ)(β(R)),   (9.4.39)
of the function Γ(β). The derivative

dV(I)/dI ≡ v(I),   (9.4.40)

or, equivalently, ∓dR_±(I)/dI,
Theorem 9.6. Functions I(R), Γ (β ) are convex. Further, the normal branch of the
function R(I) is also convex, but the anomalous branch of R(I) is concave.
Proof. First, we prove that I(R) is convex. For this we consider two values R′ and R″ from the feasible region (9.3.16). Assume that they correspond to extreme value distributions P′(x, u), P″(x, u) and values of information I(R′), I(R″), respectively. Consider an intermediate point

R_λ = λ R′ + (1 − λ) R″   (9.4.41)

and the corresponding distribution

P_λ(x, u) = λ P′(x, u) + (1 − λ) P″(x, u),   0 ≤ λ ≤ 1.   (9.4.42)
Let I_xu[P_λ] denote the information I_xu for the distribution (9.4.42). While proving Theorem 9.3, we showed that the expression I_xu = I_xu[P] is a convex function with respect to the probabilities P(x, u). Thus,

I_xu[P_λ] ≤ λ I_xu[P′] + (1 − λ) I_xu[P″] = λ I(R′) + (1 − λ) I(R″).   (9.4.43)

Let us now compare I_xu[P_λ] with I(R_λ), which is the solution of the extremum problem (9.3.4). The distribution (9.4.42) is simply one of the distributions searched while minimizing. Thus, I(R_λ) ≤ I_xu[P_λ]. Comparing this inequality with (9.4.41), we obtain

I(λ R′ + (1 − λ) R″) ≤ λ I(R′) + (1 − λ) I(R″),
which proves convexity of the function I(R). Further, recall that Γ(β) and I(R) are connected by the Legendre transform (9.4.39). Therefore, convexity of the function Γ(β) follows from convexity of I(R), if we take into account the fact that the Legendre transform preserves convexity and concavity. The latter fact is most easily shown for the case of twice differentiable functions. Differentiating (9.4.29) we obtain

d²Γ/dβ² = −dR/dβ.   (9.4.44)
Differentiating the relation

dI/dR = −β,   (9.4.45)

which follows from (9.4.31), we also find

d²I/dR² = −dβ/dR.   (9.4.46)

Comparing (9.4.44) and (9.4.46), we obtain

d²Γ/dβ² = 1 / (d²I/dR²).
This shows that the convexity condition d²I/dR² ≥ 0 entails the convexity condition d²Γ/dβ² ≥ 0.
The normal branches of R(I), F(T ) are characterized by the positivity of param-
eters β , T . Due to (9.4.31), the derivative dI/dR is negative for the normal branch,
and convexity of the inverse function R(I) follows from convexity of the function
I(R). If the derivative dI/dR is positive, then the inverse function R(I) becomes
concave. This completes the proof.
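On the eight-point example, the convexity of Γ(β) asserted by the theorem can be spot-checked via second differences (with Γ(β) = ln Z(β) − ln 8 under the uniform prior, as before; the circular-distance cost is assumed):

```python
import math

N = 8
dists = [min(d, N - d) for d in range(N)]

def Gamma(beta):
    # potential for the eight-point example with the uniform prior
    return math.log(sum(math.exp(-beta * d) for d in dists)) - math.log(N)

# second differences of a convex function are non-negative
h = 0.05
for b in [0.2, 0.5, 1.0, 2.0, 4.0]:
    assert Gamma(b + h) - 2 * Gamma(b) + Gamma(b - h) >= 0
print("Gamma(beta) is convex on the sampled grid")
```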
In addition to the function (9.4.38) and the value function V(I) = R(0) − R_+(I), we can introduce the corresponding random functions, i.e. functions dependent not only on I but also on the random variable x. Taking into account the formula (9.4.10), written in the form

F(T) = −T ∑_x γ(1/T, x) P(x),

it is not hard to see that the Legendre transform (9.4.38) may be performed before averaging over x. That is, we perform the Legendre transform for the function F(T, x) = −T γ(1/T, x).
We call the result of the corresponding transform

V(x, I) = −F(T, x) + T I,   I = ∂F(T, x)/∂T,

the random value function. The derivative

I = ∂F(T, x)/∂T = −γ(β, x) + β ∂γ(β, x)/∂β,

on account of (9.4.23), (9.4.21), is nothing but the random information

I(x) = ∑_u P(u | x) ln [P(x, u) / (P(x) P(u))]   (for dP(u)/dβ = 0).
9.5 Solution of a variational problem under several additional assumptions. 313
1. Let us show the solution to the third variational problem in a more explicit form for two special cases. Denoting

Q(x) = P(x) e^{−γ(x)},   (9.5.1)

we write equation (9.4.22) in the form

∑_x Q(x) e^{−β c(x, u)} = 1,   u ∈ U.   (9.5.2)
First, suppose that this equation can be solved for the unknown function Q(x) just as a system of linear equations or a linear integral equation. In other words, there exists the inverse matrix (kernel)

(b_xu) = (e^{−β c(x, u)})^{−1}.   (9.5.3)

Then, from (9.5.2), we have

Q(x) = ∑_{u∈U} b_xu.
Averaging this equation with weight P(x) gives the expression for the potential

Γ = −H_x − ∑_x P(x) ln ∑_{u∈U} b_xu.   (9.5.4)
The subtracted part in the right-hand side of the last formula is evidently the condi-
tional entropy Hx|u .
If we introduce the function

−Γ_0(β) = β F_0 = ∑_x P(x) ln ∑_{u∈U} b_xu,

then the results (9.5.5), (9.5.6) can be expressed in the following compact form:

I = H_x − H_0(R).   (9.5.7)
2. Let us now make another assumption, namely that the function Q(x) = Q given by (9.5.1) is constant within the region X. From (9.5.2), we move Q outside of the summation and observe that

∑_{x∈X} e^{−β c(x, u)} = 1/Q
in this case. The sum on the left-hand side of the equality resembles (3.3.11), (3.6.4)
introduced to solve the first variational problem. Under the given assumptions, the
sum should not depend on u. By analogy with (9.1.3), let us denote
e^{−β F_0(β)} = ∑_{x∈X} e^{−β c(x, u)} = 1/Q(β).   (9.5.8)
Then the extremal conditional distribution takes the form

P(x | u) = e^{β F_0 − β c(x, u)}.
Example 9.3. We are given the Bayesian system described in the beginning of Sec-
tion 9.2. We have already calculated the value of Hartley’s information amount.
Now, for comparison, let us find the value of Shannon’s information amount.
We write the equation (9.5.2) for the current example taking into account that
the function c(x, u) (9.2.1) depends only on the difference x − u. Thus, this equation
takes the form

∑_x Q(x) e^{−β c(x−u)} = 1.
This equation has a solution when the entire region U consisting of 8 points is cho-
sen to be the active domain. The solution corresponds to a trivial constant function
Q(x) = Q. The partition function (9.5.8)
is independent of u and coincides with the sum (9.1.10). The dependence R0 (H)
is represented parametrically by expressions (9.1.11) and (9.1.12). It is depicted on
Figure 9.2.
The uniform distribution P(x) = 1/8 corresponds to the entropy H_x = 3 ln 2. Therefore, it follows from formula (9.5.14) that R(I) = R_0(3 ln 2 − I), and the value of Shannon's information amount has the form V(I) = R_0(3 ln 2) − R_0(3 ln 2 − I). This function is represented graphically on Figure 9.4.
This curve is simply the inversion of the curve from Figure 9.2.
Those magnitudes of the value of Hartley's information amount that were found in Section 9.2 are depicted by the stepped line on Figure 9.4. It lies below the main curve V(I). It can be seen from the graph that the value V(I) of one bit of information equals 1.35, which exceeds the value of 1 obtained in Section 9.2. Similarly, the value V(I) of two bits of information equals 1.78, which is greater than the previously determined value of 1.5.
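These figures can be reproduced with a few lines: on the extremal family P(x | u) proportional to e^{−βd}, one bisects on β until the mutual information I = ln 8 − H(x | u) reaches the prescribed amount (the circular-distance cost is assumed, as above):

```python
import math

N = 8
dists = [min(d, N - d) for d in range(N)]   # circular distances from any fixed u

def R_and_I(beta):
    w = [math.exp(-beta * d) for d in dists]
    Z = sum(w)
    R = sum(wi * d for wi, d in zip(w, dists)) / Z    # average cost
    return R, math.log(N) - (beta * R + math.log(Z))  # mutual information

def shannon_value(I_target):
    lo, hi = 1e-9, 50.0
    for _ in range(200):       # bisect on beta: I grows monotonically with beta
        mid = (lo + hi) / 2
        if R_and_I(mid)[1] < I_target:
            lo = mid
        else:
            hi = mid
    return 2.0 - R_and_I((lo + hi) / 2)[0]            # V(I) = R(0) - R(I)

print(shannon_value(math.log(2)))      # one bit:  about 1.356, cf. 1.35 above
print(shannon_value(2 * math.log(2)))  # two bits: about 1.786, cf. 1.78 above
```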
If any other number of points located on a circle is considered, then the value of
Shannon’s information amount can be found in precisely the same way, in which
case we shall observe a monotonic dependency without any irregular jumps.
This example illustrates clearly the difference between the values of Hartley’s
and Shannon’s amounts of information. As will be shown in Chapter 11 this differ-
ence vanishes in some more complex cases. This is the case, for instance, if instead
of the random variables x and u described above, one considers sequences x1 , . . . ,
xN and u1 , . . . , uN of random variables for reasonably large N.
Example 9.4. Let us consider a space containing an infinite number of points.
Assume that x and u can take any integer value . . ., −1, 0, 1, 2, . . . similar to the
variable ξ in Example 9.1 of Section 9.1, with P(x) = [(1 − ν)/(1 + ν)] ν^{|x|}, 0 < ν < 1. Let us take the simple cost function

c(x, u) = |x − u|.

In this case

e^{γ(x)} = P(x)/Q = e^{−β F_0} [(1 − ν)/(1 + ν)] ν^{|x|},

which can be rewritten as follows:

∑_u e^{−β |x−u|} P(u) = [(1 + z)/(1 − z)] [(1 − ν)/(1 + ν)] ν^{|x|},   z = e^{−β}.
Luckily, the latter equation has an exact solution. After multiplying both sides by τ^x (τ = e^{iλ}) and summing over x, we obtain

∑_σ e^{−β |σ|} τ^σ ∑_u τ^u P(u) = [(1 + z)/(1 − z)] [(1 − ν)/(1 + ν)] ∑_x ν^{|x|} τ^x.

However,

∑_{σ=−∞}^{∞} e^{−β |σ|} τ^σ = ∑_σ z^{|σ|} τ^σ = 1/(1 − zτ) + 1/(1 − z τ^{−1}) − 1 = (1 − z²)/(1 − 2z cos λ + z²),

∑_x ν^{|x|} τ^x = (1 − ν²)/(1 − 2ν cos λ + ν²),   (9.5.17)

and, consequently,

∑_u τ^u P(u) = [(1 − ν)/(1 − z)]² (1 + z² − 2z cos λ)/(1 + ν² − 2ν cos λ).
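The two-sided geometric sums used above are easily confirmed numerically (arbitrary test values of z, ν and λ; the series is truncated):

```python
import cmath, math

z, nu, lam = 0.4, 0.3, 0.9
tau = cmath.exp(1j * lam)
K = 200  # truncation order; z**K and nu**K are negligible

# sum_{sigma} z^{|sigma|} tau^sigma  vs  (1 - z^2)/(1 - 2 z cos(lam) + z^2)
series = sum(z ** abs(s) * tau ** s for s in range(-K, K + 1))
closed = (1 - z * z) / (1 - 2 * z * math.cos(lam) + z * z)
assert abs(series - closed) < 1e-9

# the same identity (9.5.17) with nu in place of z
series2 = sum(nu ** abs(s) * tau ** s for s in range(-K, K + 1))
closed2 = (1 - nu * nu) / (1 - 2 * nu * math.cos(lam) + nu * nu)
assert abs(series2 - closed2) < 1e-9
print("generating-function identities verified")
```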
We apply the inverse transformation to recover P(u).
Now, according to (9.4.21), we can readily compute the joint probability distribution
[see (1.2.3)].
9.6 Value of Boltzmann's information amount 319

The latter amount of information (9.6.1) differs from Hartley's amount, namely,

H_y < ln M,

if the non-zero probabilities P(y) are not all equal to each other. Whatever the case, the amount of information (9.6.1) assumes an intermediate place:

ln M ≥ H_y ≥ I_yz.   (9.6.2)
It is based on the entropy of the partition ∑ E_k; the number of regions can be arbitrary. As for the rest, the defining formula for V_B(I) will coincide with (9.2.6):

V_B(I) = min_u E[c(x, u)] − inf_{∑ E_k = X} E[ min_u E[c(x, u) | E_k] ].   (9.6.5)

The three value functions satisfy

V_H(I) ≤ V_B(I) ≤ V(I),   (9.6.6)

where V_H(I), V_B(I) and V(I) denote the values of Hartley's, Boltzmann's and Shannon's amounts of information, respectively.
Proof. Let u_k be the value achieved when minimizing the second term of (9.6.3). It follows from (9.6.4) that H_{u_k} ≤ I and, consequently, I_{x u_k} ≤ I. Therefore, the transition probabilities P(u_k | x) are included in the set G of transition probabilities searched during the minimization in definition (9.3.12). Therefore,

V_B(I) ≤ V(I),

where V_B(I) denotes the value of Boltzmann's information and V(I) that of Shannon's. On the other hand, during the minimization (9.2.6) one searches through partitions E_1 + ··· + E_M = X. The condition (9.6.4) is satisfied for each of these partitions. Thus, the class of partitions searched in (9.6.5) is broader than that in (9.2.6). It follows that

V_H(I) ≤ V_B(I),

where V_H(I) is the value of Hartley's information. The proof is complete.
Example 9.5. In order to illustrate the above, we revisit the example considered in paragraph 1 of Section 9.2 and Example 9.3 in Section 9.5. We search through all possible partitions of the set of eight points into connected regions E_1, ..., E_M (partitions whose parts are not formed of adjacent points are not extremal). We want to find the value of Boltzmann's information (9.6.1) and the difference (9.6.3) for each partition. Take into account that P(y_k) = P(E_k), and that the first term in (9.6.3) equals two in this particular case. We plot the points (I, Δ) found in the previous step in the plane
(I, V). Further, we draw a stepwise function V_B(I) in such a way that it occupies the lowest and the right-most position subject to the condition that no points lie above or to the left of it. This corresponds to the minimization with respect to the partitions appearing in (9.6.5). Points lying on the stepwise line correspond to the extreme partitions; the other points are disregarded as non-extreme. The graph of V_B(I), the value of Boltzmann's information, is depicted in Figure 9.4 (it is dashed where it does not coincide with the Shannon value V(I)). We now provide the extreme partitions for our case and the corresponding coordinates:
Content of partitions:  (1,7)   (2,6)   (3,5)   (1,2,5)   (1,3,4)   (2,3,3)
I_bits:                 0.537   0.819   0.954   1.299     1.406     1.562
V(I):                   0.5     0.75    1       1.125     1.25      1.375

Content of partitions:  (1,1,3,3)   (1,1,1,2,3)   (1,1,1,1,1,3)   (1,1,1,1,1,1,2)
I_bits:                 1.81        2.16          2.41            2.75
V(I):                   1.5         1.625         1.75            1.875

Here the upper rows present, in brackets, the numbers of adjacent points in each optimal partition E_1 + ··· + E_M (the points belong to E_1, ..., E_M, respectively).
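The table can be reproduced by scoring contiguous (arc) partitions directly (a sketch assuming the circular-distance cost; the recomputed entropies may differ from the printed figures in the last digits):

```python
import math

N = 8
def cost(x, u):
    d = abs(x - u) % N
    return min(d, N - d)

def score(sizes):
    # build one contiguous (arc) partition with the given arc sizes
    cells, start = [], 0
    for s in sizes:
        cells.append([(start + j) % N for j in range(s)])
        start += s
    H = -sum(s / N * math.log2(s / N) for s in sizes)   # I in bits, cf. (9.6.1)
    R = sum(len(c) / N *
            min(sum(cost(x, u) for x in c) for u in range(N)) / len(c)
            for c in cells)
    return H, 2.0 - R                                   # the point (I_bits, V)

for sizes in [(1, 7), (3, 5), (2, 3, 3), (1, 1, 3, 3)]:
    print(sizes, score(sizes))
```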
It can be seen on Figure 9.4 that the line of Boltzmann's value V_B(I) is situated in between the lines of Hartley's value V_H(I) and Shannon's value V(I), coinciding with V(I) in some regions. This is fully consistent with the inequalities (9.6.6).
To conclude this section, we present a simply proven but important fact about all three value-of-information functions.
9.7 Another approach to defining the value of Shannon’s information 321
Theorem 9.8. The value of information functions V_H(I), V_B(I), V(I) (of Hartley's, Boltzmann's and Shannon's amounts of information) for a Bayesian system [P(dx), c(x, u)] are invariant with respect to the following transformation of the cost function:

c′(x, u) = c_0(x) + c(x, u),   (9.6.7)

where c_0(x) is an arbitrary measurable function with a finite mean value.
Proof. In order to prove the invariance of the functions V_H(I), V_B(I), we need to take into account that

E[ min_u E[c′(x, u) | E_k] ] = E[ min_u E[c_0(x) | E_k] ] + E[ min_u E[c(x, u) | E_k] ]
= E[c_0(x)] + E[ min_u E[c(x, u) | E_k] ]   (9.6.8)

and

min_u E[c′(x, u)] = E[c_0(x)] + min_u E[c(x, u)],   (9.6.9)

whence the terms E[c_0(x)] cancel out in (9.6.3). In order to prove the invariance of V(I), it suffices to rewrite the second term of (9.3.12) using the equality E[c′(x, u)] = E[c_0(x)] + E[c(x, u)] together with (9.6.9).
Also, the optimal partition ∑_k E_k or the optimal transition probabilities P(du | x) remain unchanged when performing the substitution (9.6.7). All of the above can be extended to the case in which the function c(x, u) takes on the meaning of a reward instead of a cost, and the minimizations in (9.6.3) and other formulae are substituted with maximizations.
1. We shall now try formally to bring the definition of the value of Shannon's information amount closer to the definitions (9.2.5) and (9.2.6), which, in contrast to (9.3.13), involve minimization over u in the second term. Let us introduce a modified value function V̄(I) of Shannon's information amount which, as will be seen later, sometimes coincides with V(I). That is,
Let [P(dx), c(x, u)] be a given Bayesian system. We introduce an auxiliary ran-
dom variable z taking values from some sample space Z and associated with x via
transition probabilities P(dz | x). We treat z as an observed variable carrying infor-
mation about x. After observing z, an optimal non-randomized estimator u = d(z) is
determined by the minimization
Proof. Assume that for a given Bayesian system [P(dx), c(x, u)] the extremum of
the third variational problem under the constraint
Ixu ≤ I (9.7.6)
is achieved for a certain distribution P(du | x), which we will denote by Pe (du | x).
This extremum distribution specifies a two-dimensional distribution
x −→ u −→ π . (9.7.7)
Here, the last term corresponds to the distribution Pe (dz | x) specified above. The
conditional expectation E[c(x, u) | z = Pe (dx | u)] is taken with probabilities
P(du | z) (where z = Pe (dx | u)), since
c(x, u) = −γ (x)/β − (1/β ) ln [Pe (dx | u)/P(dx)] (9.7.10)
and so
inf_{u′} E[c(x, u′) | z] = −(1/β ) E[γ (x) | z] + inf_{u′} E[ −(1/β ) ln [Pe (dx | u′)/P(dx)] | z ] (9.7.11)
for any u′. Consider u′ that takes the value of the inverse image of the point z
according to the transformation u → z specified above. In other words, u′ is one of
the points that satisfy the equality Pe (dx | u′) = z = Pe (dx | u).
Then the inequality (9.7.12) takes the following form:
inf_{u′} E[ −(1/β ) ln [Pe (dx | u′)/P(dx)] | z = Pe (dx | u) ]
  ≤ E[ −(1/β ) ln [Pe (dx | u)/P(dx)] | z = Pe (dx | u) ]. (9.7.13)
Here Pe (dx | u′) on the right-hand side is substituted with Pe (dx | u). Further, we
plug (9.7.13) into (9.7.11) and average over z to obtain
E inf_{u′} E[c(x, u′) | z] ≤ E[ −γ (x)/β ] + E[ −(1/β ) ln [Pe (dx | u)/P(dx)] ] = ∫ c(x, u)Pe (dx, du)
(the formula (9.7.10) has been used again). This relation allows us to transform
(9.7.8) to the form
V̄ (I) ≥ inf_u E c(x, u) − ∫ c(x, u)Pe (dx, du).
The right-hand side expression is nothing more than the value V (I) of the Shan-
non's information. Comparison of the obtained inequality V̄ (I) ≥ V (I) with (9.7.5)
yields (9.7.1). The proof is complete.
where P(dπ ) = ∫_X P(dx)P(dπ | x); P(dx | π ) = P(dx)P(dπ | x)/P(dπ ); π ≡ π (dx)
is a point that belongs to the space of distributions Π . At the same time the set G of
conditional distributions is constrained by the inequality Ixπ ≤ I (9.7.3).
We now show that the value functions V (I), V (I) can be defined by the for-
mula (9.7.14), but only if we replace the feasible set G with some narrower sets G,
G, respectively, such that G ⊃ G ⊃ G.
For a fixed partition ∑k Ek = X included in definitions (9.2.6), (9.6.5) it is appro-
priate to put
P(dπ | x) = ∑_k P(Ek | x) δ (dπ , P(dx | Ek )), (9.7.15)
As seen from the right-hand side of (9.7.15), the event x ∈ Ek implies the event
π (dx) = P(dx | Ek ) with probability 1 and vice versa. Therefore
For the set of distributions of the type (9.7.15) the integral taken over π in (9.7.14)
becomes a sum according to (9.7.17). Hence, we obtain
V = inf_u ∫ c(x, u)P(dx) − inf_{P(dπ |x)} ∑_k P(Ek ) inf_u ∫ c(x, u)P(dx | π = P(dx | Ek )).
At the same time minimization over P(d π | x) can be reduced to minimization over
partitions ∑k Ek = X for the set of distributions of the type (9.7.15). The resulting
expression coincides with the expression situated on the right-hand side of the for-
mulae (9.2.6a), (9.6.5). Consequently, the values of information V (I), V (I) can be
also determined by the formula (9.7.14), if minimization is taken over distributions
of the type (9.7.15). The set G corresponding to the definition of the value func-
tion V (I) is indeed the set of those distributions of the type (9.7.15), for which the
inequality (9.6.4) holds.
As a result of the equivalence of the events x ∈ Ek and π (dx) = P(dx | Ek ), the
transformation x → π can be reduced to the transformation x → k. In other words,
the formula
Ixπ = Hk = − ∑_k P(Ek ) ln P(Ek )
holds in this case. Therefore, (9.6.4) implies the condition Ixπ ≤ I, i.e. (9.7.3). Thus,
G belongs to the set G. Finally, the inclusion G ⊂ G results from the fact that the
constraint (9.6.4) is weaker than the upper bound constraint on the number of
regions E1 , . . . , EM , where M = e^I .
Chapter 10
Value of Shannon’s information for the most
important Bayesian systems
In this chapter, the general theory concerning the value of Shannon’s information,
covered in the previous chapter, will be applied to a number of important practical
cases of Bayesian systems. For these systems, we derive explicit expressions for the
potential Γ (β ), which allows us to find a dependency in a parametric form between
losses (risk) R and the amount of information I and then, eventually, to find the value
function V (I).
First, we consider those Bayesian systems, for which the space X turns out to be
especially easy, namely those for which X consists of two points. In this case, the
third variational problem can be solved by reduction to a simple algebraic equation
with just one variable. In the case of systems with a homogeneous cost function
(Section 10.2), the Fourier transform method, or an equivalent operator method, can
be applied to obtain a solution.
Other specific (matrix) methods are employed in the important case of Gaussian
Bayesian systems characterized by the Gaussian prior distribution and the bilinear
cost function. They allow us to obtain a solution to the problem and study the depen-
dency of the active subspace U on the parameter β or I. Special attention is paid to
various (finite dimensional and infinite dimensional) stationary Gaussian systems,
for which the value of information function is expressed in parametric form.
1. Consider the simple case of the sample space X of a random variable such that
X contains exactly two points x1 and x2 . Probabilities P(x1 ) = p, P(x2 ) = 1 − p are
assumed to be known. The space U of all possible estimators u is assumed to be
more complicated. For instance, we consider the real axis as U. In this case, the cost
function c(x, u) can be reduced to two functions of u:
The equation (9.5.2) for the given Bayesian system takes the following form:
∑_{i=1}^{2} Qi e^{−β ci (u)} = 1,   u ∈ U. (10.1.1)
that is
where
δΓ = (dΓ /dβ ) δ β + δ0Γ = −R δ β + δ0Γ , (10.1.7)
dβ
where the variation δ0Γ corresponds to the variation of the region U taken without
respect to the variation of the parameter β . Plugging (10.1.7) into (9.4.27) we obtain
the condition
δ0Γ = 0. (10.1.8)
Since the variations δ u1 , δ u2 are independent, the condition (10.1.8) can be broken
down into two equations
∂Γ /∂ u1 = 0;   ∂Γ /∂ u2 = 0. (10.1.9)
Assuming that the Jacobian
∂ (d1 , d2 )/∂ (u1 , u2 ) = det ‖∂ di /∂ u j ‖   (i, j = 1, 2)
of the transformation (10.1.4) does not equal zero, the equations (10.1.9) can be
replaced by the equations
∂Γ /∂ d1 = 0;   ∂Γ /∂ d2 = 0. (10.1.10)
Differentiating (10.1.4) we can express them in an explicit way
(1 − p) coth β d1 − coth β (d1 + d2 ) + ∂ M/∂ d1 = 0,
p coth β d2 − coth β (d1 + d2 ) + ∂ M/∂ d2 = 0. (10.1.11)
2. Consider quadratic cost functions
c1 (ũ) = (ũ − (a1 − a2 )/2)^2 + b1 ;   c2 (ũ) = (ũ + (a1 − a2 )/2)^2 + b2 ,
which, up to terms independent of ũ, reduce to
c1 (ũ) = ũ^2 − (a1 − a2 )ũ;   c2 (ũ) = ũ^2 + (a1 − a2 )ũ (10.1.13)
according to (10.1.4).
The values ũ1 , ũ2 must be determined from the equations (10.1.9), (10.1.10).
As a result of the symmetry of the functions (10.1.13) we can look for symmetric
roots ũ2 = −ũ1 , which drastically simplifies the equations. Hence, as a consequence
of (10.1.14) the function (10.1.13) takes the form
R = −d1 tanh β d1 + M = −2d1^2 /(a1 − a2 )^2 + M = −ũ1^2 ,
I = h2 (p) − ln(2 cosh β d1 ) + 2β ũ1^2 .
I = h2 (p) − ln 2 + (1/2)(1 + ϑ ) ln(1 + ϑ ) + (1/2)(1 − ϑ ) ln(1 − ϑ )
  = h2 (p) − h2 ((1 + ϑ )/2). (10.1.19)
V (I) = R(0) − R(I) = ((a1 − a2 )^2 /4) [ϑ ^2 − (2p − 1)^2 ]. (10.1.20)
We eliminate parameter ϑ from (10.1.19), (10.1.20) and so obtain the depen-
dency between the value V and the information amount I that takes the form of the
equation
I = h2 (p) − h2 ( 1/2 + √( V /(a1 − a2 )^2 + (p − 1/2)^2 ) ). (10.1.21)
While the amount of information I changes gradually from 0 to h2 (p), the value
V ranges from 0 to (a1 − a2 )^2 [1/4 − (p − 1/2)^2 ] = (a1 − a2 )^2 p(1 − p), and ϑ takes
values from 2p − 1 to 1, respectively.
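The parametric dependence above can be traced numerically. The sketch below (with assumed illustrative values of p, a1, a2) sweeps ϑ over its range, evaluates (10.1.19) and (10.1.20), and checks that the closed form (10.1.21) reproduces the same curve and the stated endpoints.

```python
import numpy as np

p, a1, a2 = 0.7, 1.0, -1.0      # illustrative values (assumed)

def h2(q):
    """Binary entropy (natural logarithm), as in the text."""
    return -q*np.log(q) - (1 - q)*np.log(1 - q)

theta = np.linspace(2*p - 1 + 1e-9, 1 - 1e-9, 400)   # parameter range
I = h2(p) - h2((1 + theta)/2)                        # (10.1.19)
V = (a1 - a2)**2/4*(theta**2 - (2*p - 1)**2)         # (10.1.20)

# Closed form (10.1.21) recovered from V:
I_check = h2(p) - h2(0.5 + np.sqrt(V/(a1 - a2)**2 + (p - 0.5)**2))
print(np.max(np.abs(I - I_check)))     # the two forms agree
print(I[0], V[0])                      # at theta = 2p-1: I = 0, V = 0
print(V[-1], (a1 - a2)**2*p*(1 - p))   # at theta -> 1: V -> (a1-a2)^2 p(1-p)
```

The check works because V/(a1 − a2)^2 + (p − 1/2)^2 collapses to ϑ^2/4, so the argument of h2 in (10.1.21) is exactly (1 + ϑ)/2.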
1. In this section we assume that the cost function c(x, u) of the Bayesian system
[P(x), c(x, u)] is translation invariant, i.e. it depends on x and u only via their differ-
ence: c(x, u) = c(x − u) = c(z). In other words, c(x, u) is invariant under translation
x → x + a, u → u + a. At the same time it is implied that x, u, z = x − u, a all take
values from the same space X, so that a translation keeps this space invariant. Con-
sequently, the space X must not have boundaries, but it may have a ‘finite’ volume
(for instance, it can be periodically closed).
We apply the above translation to the equality (9.5.2)
which yields
∑_x Q(x + a) e^{−β c(x−u)} = 1. (10.2.2)
Assuming that the translation leaves the active subspace Ũ invariant, we conclude
from a comparison of (10.2.1) and (10.2.2) that Q(x) = Q(x + a), that is, Q(x) =
Q is constant. Therefore, in this case we can apply the theory covered in paragraph 2
of Section 9.5 and use the formulae (9.5.8), (9.5.10), (9.5.15). Knowing the potential
Pe (x | u) = eβ F0 −β c(x−u) (10.2.5)
does not depend on the a priori distribution P(x). The variable H = −dF0 /dT =
Hx − I in the formulae (10.2.3) is nothing but the entropy Hx|u of this conditional
distribution: H = −E[ln P(x | u)].
The a priori distribution P(x) influences only the distribution Pe (u) in (10.2.4).
The equation (9.4.23) that defines it can be rewritten as
and obtain
θe (s) = e^{−ψ (β ,s) − β F0} θ (s)
from the equation (10.2.6). The last equation can be further transformed to the form
where
e^{ψ (β ,s)} = ∑_z e^{−β c(z) + sz} , (10.2.8)
and also
ψ (β , 0) = −β F0 (1/β ). (10.2.9)
By virtue of (10.2.8), the relation (10.2.6) can be rewritten in operator form
d
exp ψ β , − ψ (β , 0) Pe (u) = P(x) (10.2.10)
dx
10.2 Systems with translation invariant cost function 333
and, using
ψ (β , d/dx) − ψ (β , 0) = ∑_{l=1}^{∞} (1/l!) (∂ ^l ψ /∂ s^l )(β , 0) d^l /dx^l ,
it is possible to rewrite (10.2.11) as follows:
Pe (u) = P(x) + ∑_{n=1}^{∞} ∑_{l1 ,...,ln =1}^{∞} ((−1)^n /(n! l1 ! · · · ln !)) k_{l1} · · · k_{ln} (d^{l1 +···+ln} /dx^{l1 +···+ln}) P(x)
  = P(x) − k1 dP(x)/dx + (1/2)(k1^2 − k2 ) d^2 P(x)/dx^2 + · · · . (10.2.12)
The derivatives kl = (∂ ^l ψ /∂ s^l )(β , 0) here coincide, according to (10.2.5), (10.2.8),
with the cumulants of the distribution P(z) = exp [β F0 − β c(z)]. The obtained for-
mula (10.2.12) is convenient when the a priori distribution P(x) is much
broader than the conditional distribution (10.2.5): under these conditions the
expansion terms decrease rapidly.
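The rapid decrease of the terms in (10.2.12) can be seen numerically. The sketch below is an assumed test case, not from the text: for a zero-mean Gaussian kernel (k1 = 0, k2 = ε²) and a broad Gaussian prior N(0, Δ²), the exact Pe is the deconvolved N(0, Δ² − ε²) density, and the two-term truncation p − (k2/2)p″ should approach it with an O(ε⁴/Δ⁴) error.

```python
import numpy as np

D = 1.0   # width of the broad prior p(x) ~ N(0, D^2) (assumed)

def error_at_zero(eps):
    """Gap at x = 0 between exact deconvolution and the truncation of (10.2.12)."""
    p0 = 1/np.sqrt(2*np.pi*D**2)          # p(0)
    ppp0 = -p0/D**2                       # p''(0) for the Gaussian prior
    truncated = p0 - (eps**2/2)*ppp0      # two terms of (10.2.12), k1 = 0
    exact = 1/np.sqrt(2*np.pi*(D**2 - eps**2))
    return abs(exact - truncated)

e1, e2 = error_at_zero(0.1), error_at_zero(0.2)
print(e1, e2, e2/e1)   # the error grows roughly as eps^4: ratio near 16
```

Doubling ε multiplies the residual by about 2⁴ = 16, consistent with the error estimate in (10.2.21) below.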
The simplest case takes place when the distribution P(x) is uniform:
P(x) = e−Hx .
Then it follows from (10.2.6) that Pe (u) has the same form:
The relations given in this section hold true both for discrete and continuous
spaces X. Of course sums should be substituted by integrals in the continuous case.
It may happen that in the case of a continuous or unbounded space some terms
in the above expressions do not have any independent meaning (for example, Hx ).
However, they occur in combination with other terms and functions in such a way
that those combinations have meaning (for instance, the sum Hx + dF0 /dT ).
2. Examples. The examples covered in Sections 9.2 and 9.5, where the variable x
was discrete, belong precisely to this class of systems with a translation invariant
cost function.
In this section, we consider examples, in which variable x assumes continuous
values and is described by the probability density function p(x).
Let the cost function c(z) be expressed in the following way:
c(z) = 0, if z ∈ Z0 ;   c(z) = ∞, if z ∉ Z0 . (10.2.14)
we obtain
β F0 = − ln Ω ,
where Ω = ∫_{Z0} dz is the ‘volume’ of the region Z0 , where c(z) = 0.
Fig. 10.1 Dependency of the average cost on the amount of information for the cost func-
tion (10.2.14)
R = (d/dβ )(β F) = 0;   I = Hx − ln Ω . (10.2.15)
It can be seen from here that in order to keep losses at zero, the amount of informa-
tion must be equal to
Icr = Hx − ln Ω = − ∫ ln[Ω p(x)] p(x) dx. (10.2.16)
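The identity in (10.2.16) is easy to confirm numerically. The sketch below uses an assumed Gaussian prior p(x) ~ N(0, 1) and an assumed window ‘volume’ Ω = 0.1, and checks that Hx − ln Ω equals the integral on the right-hand side.

```python
import numpy as np

Omega = 0.1                              # 'volume' of Z0 (assumed)
x = np.linspace(-10, 10, 200001)         # integration grid
dx = x[1] - x[0]
p = np.exp(-x**2/2)/np.sqrt(2*np.pi)     # assumed prior N(0, 1)

Hx = -np.sum(p*np.log(p))*dx             # differential entropy of p(x)
Icr = -np.sum(p*np.log(Omega*p))*dx      # right-hand side of (10.2.16)
print(Hx - np.log(Omega), Icr)           # the two expressions agree
```

For N(0, 1) the entropy is Hx = (1/2) ln(2πe) ≈ 1.4189, so here Icr ≈ 1.4189 + ln 10 ≈ 3.72 nats.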
Fig. 10.2 The value of information for quadratic cost function (10.2.20)
If ε ^2 = E[z_i^2 ] is the fidelity for each coordinate, then the average cost R = E[c(z)]
can be set to rε ^2 due to (10.2.18). Hence, formulae (10.2.19) will yield
I = r ln ( 1/(ε √(2π e)) ) + Hx ,
which resembles (10.2.17). Eliminating parameter T from (10.2.19) and solving for
V (I) = R(0) − R(I), we obtain the value of information function
V (I) = (r/2π ) e^{2Hx /r − 1} ( 1 − e^{−2I/r} ), (10.2.20)
which is shown on Figure 10.2.
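The shape of (10.2.20) can be checked directly. The sketch below assumes a Gaussian prior N(0, Δ²) in each of the r coordinates, for which Hx = (r/2) ln(2πeΔ²); the formula then gives V(0) = 0 and a saturation level equal to the no-information risk R(0) = rΔ².

```python
import numpy as np

r, Delta = 3, 2.0                            # illustrative values (assumed)
Hx = r/2*np.log(2*np.pi*np.e*Delta**2)       # entropy of the assumed prior

def V(I):
    """Value of information (10.2.20) for the quadratic cost."""
    return r/(2*np.pi)*np.exp(2*Hx/r - 1)*(1 - np.exp(-2*I/r))

print(V(0.0))                  # no information gives no value
print(V(1e9), r*Delta**2)      # saturation: V -> R(0) = r*Delta^2
```

The factor e^{2Hx/r − 1} collapses to 2πΔ² for this prior, which is why the curve saturates exactly at rΔ².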
Let us also write the probability density for distribution Pe (u), defined by for-
mula (10.2.11). Distribution (10.2.5) in this case is Gaussian
Pe (x | u) = (β /π )^{r/2} e^{−β ∑_i (xi − ui )^2 },
If this ratio is small, then one can leave only the first few expansion terms, for
instance,
pe (u) = p(x) − (T /4) ∑_i (∂ ^2 /∂ x_i^2 ) p(x) + O(ε ^4 /Δ ^4 ). (10.2.21)
An analogous formula takes place also for the previous example (10.2.14), when
pe (x | u) = 1/Ω , if x − u ∈ Z0 ;   pe (x | u) = 0, if x − u ∉ Z0 .
Namely, if
(k1 )_i = ∫_{Z0} z_i dz = 0,
then
pe (u) = p(x) − (1/2) ∑_{i, j} B_{i j} (∂ ^2 /∂ x_i ∂ x_j ) p(x) + · · · ,
where
B_{i j} = (1/Ω ) ∫_{Z0} z_i z_j dz   (B_{i j} ∼ ε ^2 ).
The above results are valid under the assumptions that probabilities (10.2.11)
are non-negative, and entropy Hx of the a priori distribution is not smaller than the
conditional entropy Hx|u = Hz of distribution (10.2.5) obtained as a solution of the
extremum problem.
4. To conclude this section, let us consider the function
which, as will be seen in Chapter 11, plays a significant role in addressing the issue
of how the value functions for Shannon's and Hartley's information amounts differ
from each other.
In the case of a translation invariant cost function the expression
In the simple case of a uniform distribution Pe (u) (see (10.2.13)), formula (10.2.23)
yields
or
μx (t) = −Hx + tF0 (−1/t) = Γ (−t), (10.2.25)
if we consider (10.2.9), (9.5.10). Thus, function μx (t) does not depend on x.
1. Let x and u be points from Euclidean spaces Rr and Rs having dimensions r and s,
respectively. A Bayesian system [p(x), c(x, u)] is called Gaussian, if the probability
density p(x) is Gaussian, and c(x, u) is the sum of linear and quadratic forms of x
and u. Choosing an appropriate origin in Rs we therefore have
1
c(x, u) = c0 (x) + xT gu + uT hu, (10.3.1)
2
(where c0 (x) = c0 +cT1 x+ 12 xT c2 x, which, however, is not mandatory). Here a matrix
form is used:
x = (x1 , . . . , xr )^T ;   x^T = (x1 . . . xr );   u = (u1 , . . . , us )^T ;   u^T = (u1 . . . us );
The matrix a is non-singular and positive definite. If the distribution p(x) turns out to
be singular, i.e. it is concentrated in some subspace X ⊂ Rr , then we can restrict our
consideration to the space X = Rr̃ substituting r by r̃(r̃ < r). Thus, without loss of
generality, the correlation matrix kx = a−1 of the distribution (10.3.2) can be treated
as non-singular.
For Gaussian systems we can represent the function q(x) (see (9.5.1)) in the
following Gaussian form:
p(x) e^{−γ (x) − β c^0 (x)} = q(x) e^{−β c^0 (x)} = exp ( σ + x^T ρ − (1/2) x^T κ x ). (10.3.3)
and be concentrated in some subspace U = Rs̃ (dim U = s̃) of the space Rs . Specif-
ically, there may be a coincidence Rs̃ = Rr . Henceforth, we restrict ourselves by
considering the space Rs̃ and consider matrices h and g to be of sizes s̃ × s̃ and
r̃ × s̃, respectively. Thus, we consider spaces Rr̃ , Rs̃ of x, u so that matrices kx = a−1 ,
ku = au^{−1} are non-singular. The variables σ , ρ = (ρ1 , . . . , ρr̃ )^T , κ = ‖κi j ‖, m and au
appear in what follows.
Assuming that the matrix κ is positive definite and, consequently, that det κ ≠ 0, we
take that integral with the help of
∫ exp ( −(1/2) x^T Ax + x^T y ) dx = [det(A/2π )]^{−1/2} exp ( (1/2) y^T A^{−1} y ), (10.3.5)
where A is a positive definite matrix (the latter formula can be simply derived
from (5.4.19)). This leads to the result
[det(κ /2π )]^{−1/2} exp [ (1/2)(ρ ^T − β u^T g^T ) κ ^{−1} (ρ − β gu) ] = exp ( −σ + (β /2) u^T hu ).
We take the logarithm of this equality and equate separately quadratic, linear and
constant terms (with respect to u). This operation yields the following equations:
β g^T κ ^{−1} g = h, (10.3.6)
−g^T κ ^{−1} ρ = 0, (10.3.7)
σ = (1/2) ln det(κ /2π ) − (1/2) ρ ^T κ ^{−1} ρ . (10.3.8)
In order to obtain the other necessary relations, we turn to the second equa-
tion (9.4.23), which, by multiplying by p(x), can be rewritten as follows:
∫_U p(u) p(x) e^{−γ (x) − β c(x,u)} du = p(x),   x ∈ X.
which, as can be seen from the following, completely determines the matrix au . In-
troducing the unknown matrix
and taking into account that a = kx^{−1} we rewrite (10.3.12) in the following form:
we transform (10.3.14) into
β k̃x [1u + β ^2 B k̃x ]^{−1} = h, (10.3.16)
where k̃x = g^T kx g and 1u is the unit operator in Ũ = Rs̃ .
It is not hard to write the solution of equation (10.3.16):
1u + β ^2 B k̃x = β h^{−1} k̃x ,
and, therefore,
β B = h^{−1} − (β k̃x )^{−1} . (10.3.17)
We have assumed here that the matrices h, k̃x are non-singular (this condition charac-
terizes the subspace Ũ). As a result of (10.3.13), (10.3.10) it follows from (10.3.17)
that
au = β [h^{−1} − (β k̃x )^{−1} ]^{−1} − β h, (10.3.18)
κ = a + β g[h^{−1} − (β k̃x )^{−1} ]g^T = kx^{−1} − g k̃x^{−1} g^T + β g h^{−1} g^T . (10.3.19)
Plugging the latter formula into the equality (10.3.7) we find that
g^T κ ^{−1} g [h^{−1} − (β k̃x )^{−1} ] au m = 0,
10.3 Gaussian Bayesian systems 341
ρ = 0. (10.3.22)
Γ (β ) = −σ + (1/2) E[x^T κ x − x^T ax] + (1/2) ln det(a/2π ) − β E[c^0 (x)]
  = −(1/2) ln det(κ kx ) + (1/2) E[x^T (κ − a)x] − β E[c^0 (x)]. (10.3.24)
Here we take into account (10.3.22) and (10.3.23). Because
E[(κ − a)xx^T ] = (κ − a)kx = κ kx − 1x
which is valid if f (0) = 0 (see (A.1.5) and (A.1.6)). Substituting A = g(β h^{−1} − k̃x^{−1} ),
B = g^T kx into equation (10.3.27) we get
tr f ( g(β h^{−1} − k̃x^{−1} ) g^T kx ) = tr f ( g^T kx g (β h^{−1} − k̃x^{−1} ) ) = tr f ( β k̃x h^{−1} − 1 ).
Applying the last equality and taking into account (10.3.26), we obtain from (10.3.25),
for the functions f (z) = ln (1 + z), f (z) = z,
Γ (β ) = −(1/2) tr ln(β k̃x h^{−1} ) + (1/2) tr(β k̃x h^{−1} − 1u ) − β E[c^0 (x)]. (10.3.28)
Hence, the dependency of Γ on β can be represented by the formula
Γ (β ) = −(s̃/2) ln β − (1/2) tr ln(k̃x h^{−1} ) − s̃/2 − β M, (10.3.29)
where s̃ = tr(1u ) is the dimension of the space Ũ(β ) and also
M = E[c^0 (x)] − (1/2) tr(k̃x h^{−1} ).
To calculate R and I according to (9.4.29), (9.4.30) it remains to differentiate
the potential (10.3.28), (10.3.29). It follows from the general theory concerning the
third variational problem (see the proof of Theorem 9.5) that we can equally well
vary the active domain Ũ(β ) or assume it is constant. Following the latter, simpler
approach, we obtain from (10.3.28)
R = (1/2) tr ( 1u /β − k̃x h^{−1} ) + E[c^0 (x)];
I = (1/2) tr ln(β k̃x h^{−1} ) = (s̃/2) ln β + (1/2) tr ln(k̃x h^{−1} ). (10.3.30)
The value of information function (9.3.7) can be obtained by forming the difference
V (I) = (1/2) tr ( 1u /β − k̃x h^{−1} ) |_{I=0} − (1/2) tr ( 1u /β − k̃x h^{−1} ). (10.3.31)
The first term, taken at I = 0, vanishes, which gives
V = (1/2) tr ( k̃x h^{−1} − 1u /β ). (10.3.32)
The last relation together with the second formula (10.3.30) gives a parametric rep-
resentation of the dependence V (I).
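The parametric pair (10.3.30), (10.3.32) is easy to evaluate through the eigenvalues λ_j of the matrix k̃x h⁻¹. The sketch below (with an assumed random positive spectrum, and assuming β large enough that the whole space is active, i.e. βλ_j > 1 for every j) traces the curve: I grows without bound while V saturates.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 3))
lam = np.linalg.eigvalsh(A @ A.T) + 0.5   # positive eigenvalues of k~x h^-1 (assumed)

def IV(beta):
    I = 0.5*np.sum(np.log(beta*lam))      # second formula in (10.3.30)
    V = 0.5*np.sum(lam - 1/beta)          # (10.3.32)
    return I, V

beta_min = 1/lam.min()                    # all modes active for beta > beta_min
for beta in [2*beta_min, 5*beta_min, 50*beta_min]:
    print(IV(beta))  # I increases with beta; V saturates at (1/2) tr(k~x h^-1)
```

As β → ∞ the term 1/β disappears and V approaches (1/2) tr(k̃x h⁻¹), the full value of complete information about the active modes.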
3. Let us characterize the space Ũ(β ). We remind that for the theory covered
in this section to be valid (as has been mentioned before), it is necessary
that the matrices κ and B^{−1} = β h + au be positive definite and, consequently, non-
singular; and also that the matrices h and k̃x be non-singular as well. It is easy to
see that the matrix k̃x = g^T kx g is positive semi-definite, so that its non-singularity
is equivalent to positive definiteness.
Further, taking into account (10.3.17), the requirement of positive definiteness of
the matrices β B = h^{−1} − (β k̃x )^{−1} and k̃x entails positive definiteness and, consequently,
non-singularity of the matrix h^{−1} = β B + (β k̃x )^{−1} for β > 0. Finally, taking into ac-
count (10.3.19) we conclude that the matrix κ = a + β ^2 gBg^T is positive definite,
since the matrix gBg^T is positive semi-definite. Thus, in order to fulfil all the neces-
sary requirements it is sufficient to comply only with two requirements:
1. The matrix k̃x = g^T kx g must be positive definite;
2. The difference h^{−1} − (β k̃x )^{−1} must be positive definite:
V (I) = (g^2 kx /2h)(1 − e^{−2I} ) (10.3.34)
(the matrices coincide with scalars).
Let us turn to a two-dimensional case, when there are two independent Gaus-
sian random variables with variances k1 and k2 , respectively, and zero means. Let
matrices g, h have the diagonal form:
g = diag(g1 , g2 );   h = diag(h1 , h2 ).
For clarity, let us assume that k1 g1^2 /h1 > k2 g2^2 /h2 . According to conditions 1
and 2, the space Ũ(β ) will either consist of points (u1 , u2 ) = (u1 , 0) belonging to
the line u2 = 0 for h1 /k1 g1^2 < β < h2 /k2 g2^2 , or will coincide with the entire two-
dimensional space U = R2 for β > h2 /k2 g2^2 .
In the first case, when Ũ(β ) has one dimension, we have
g = (g1 , 0)^T ;   g^T = (g1 0);   k̃x = g^T kx g = (k1 g1^2 ).
Taking into account (10.3.10) we write formulae (10.3.30) for the given example as
follows:
I = (1/2) ln β + (1/2) ln(k1 g1^2 /h1 ),   for h1 /k1 g1^2 < β < h2 /k2 g2^2 ;
I = ln β + (1/2) ln(k1 g1^2 /h1 ) + (1/2) ln(k2 g2^2 /h2 ),   for β > h2 /k2 g2^2 ;
R = 1/2β − k1 g1^2 /2h1 + E[c^0 (x)],   for h1 /k1 g1^2 < β < h2 /k2 g2^2 ;
R = 1/β − k1 g1^2 /2h1 − k2 g2^2 /2h2 + E[c^0 (x)],   for β > h2 /k2 g2^2 . (10.3.35)
d^2V /dI^2 = d(1/β )/dI = dT /dI (10.3.36)
undergoes a jump. It equals d^2V /dI^2 (I2 − 0) = −2T2 = −2k2 g2^2 /h2 to the left of the
point I2 and d^2V /dI^2 (I2 + 0) = −T2 = −k2 g2^2 /h2 to the right of it.
The obtained dependency is depicted in Figure 10.3. The dimension s̃ of
the space Ũ(β ) may be interpreted as the number of active degrees of freedom,
which may vary with changes in temperature. This leads to a discontinuous jump
of the second derivative (10.3.36), which is analogous to an abrupt change in heat
capacity in thermodynamics (a second-order phase transition).
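The jump can be exhibited numerically from the piecewise formulas (10.3.35). The sketch below (all parameters illustrative, assumed; E[c0] dropped) evaluates d²V/dI² = dT/dI by finite differences just below and just above the transition point β2 = h2/(k2 g2²).

```python
import numpy as np

k1, g1, h1 = 2.0, 1.0, 1.0            # k1*g1^2/h1 = 2.0 (assumed)
k2, g2, h2 = 1.0, 1.0, 1.0            # k2*g2^2/h2 = 1.0 (smaller, as in the text)
b2 = h2/(k2*g2**2)                    # threshold beta_2 where u2 becomes active

def I_of(beta):
    """First formula in (10.3.35), continuous across beta_2."""
    if beta < b2:
        return 0.5*np.log(beta) + 0.5*np.log(k1*g1**2/h1)
    return np.log(beta) + 0.5*np.log(k1*g1**2/h1) + 0.5*np.log(k2*g2**2/h2)

def d2V_dI2(beta, db=1e-7):
    """dT/dI with T = 1/beta, cf. (10.3.36)."""
    return (1/(beta + db) - 1/beta)/(I_of(beta + db) - I_of(beta))

T2 = 1/b2
print(d2V_dI2(b2*(1 - 1e-4)), -2*T2)  # left limit:  -2*k2*g2^2/h2
print(d2V_dI2(b2*(1 + 1e-4)), -T2)    # right limit: -k2*g2^2/h2
```

Below β2 we have I = (1/2) ln β + const, so dT/dI = −2T; above it I = ln β + const, so dT/dI = −T: the second derivative halves in magnitude at the transition.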
5. In conclusion, let us compute function (10.2.22) for Gaussian systems. Taking
into account (10.3.1), (10.3.4), (10.3.21) we get
μx (t) = (1/2) ln det(au /2π ) + t c^0 (x) + ln ∫_U exp ( t x^T gu + (t/2) u^T hu − (1/2) u^T au u ) du.
Averaging over x, we obtain
E[μx (t)] = t E[c^0 (x)] − (1/2) tr ln(1 − t h au^{−1} ) + (1/2) t^2 tr [ au^{−1} (1 − t h au^{−1} )^{−1} k̃x ] (10.3.38)
and the derivatives
Fig. 10.3 The value of information for a Gaussian system with a ‘phase transition’ corresponding
to formula (10.3.35)
μx′ (t) = c^0 (x) + (1/2) tr [ h au^{−1} (1 − t h au^{−1} )^{−1} ] + (1/2) t x^T g au^{−1} (2 − t h au^{−1} )(1 − t h au^{−1} )^{−2} g^T x, (10.3.39)
E[μx″ (t)] = (1/2) tr [ (h au^{−1} )^2 (1 − t h au^{−1} )^{−2} ] + tr [ au^{−1} (1 − t h au^{−1} )^{−3} k̃x ]. (10.3.40)
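The consistency of (10.3.38) and (10.3.40) can be sanity-checked in the scalar case, where the traces reduce to ordinary fractions. The sketch below (parameter values assumed, purely illustrative) differentiates the scalar form of (10.3.38) by finite differences and compares with the scalar form of (10.3.40).

```python
import numpy as np

c0bar, au, h, kx = 0.3, 2.0, 0.5, 1.7   # illustrative scalar parameters (assumed)

def Emu(t):
    """Scalar form of (10.3.38)."""
    q = t*h/au
    return t*c0bar - 0.5*np.log(1 - q) + 0.5*t**2*(1/au)*kx/(1 - q)

def Emu2(t):
    """Scalar form of (10.3.40): E[mu_x''(t)]."""
    q = t*h/au
    return 0.5*(h/au)**2/(1 - q)**2 + (1/au)*kx/(1 - q)**3

t, dt = 0.7, 1e-4                       # need t*h/au < 1 for convergence
second_fd = (Emu(t + dt) - 2*Emu(t) + Emu(t - dt))/dt**2
print(second_fd, Emu2(t))               # the two values agree to O(dt^2)
```

The agreement reflects the algebraic identity (1 − tq)⁻¹ + 2tq(1 − tq)⁻² + t²q²(1 − tq)⁻³ = (1 − tq)⁻³, which is exactly what the second differentiation of (10.3.38) produces.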
To ensure this, the stochastic process x must be stationary, i.e. the density p(x) has
to satisfy the condition
and the cost function must also satisfy the stationarity condition
When applied to Gaussian systems, the conditions of stationarity lead to the re-
quirement that matrices
hi j , gi j , ai j , (10.4.1)
in the expressions (10.3.1), (10.3.2) must be stationary, i.e. their elements have to
depend only on the difference of indices:
At first, let us consider the case when spaces have a finite dimensionality r. Ma-
trices (10.4.1) can be transformed to the diagonal form by a unitary transformation
x̃ = U^+ x (10.4.2)
where
h̃l = ∑_{k=1}^{r} e^{−2π ilk/r} hk (10.4.5)
10.4 Stationary Gaussian systems 347
The result of the unitary transformation (10.4.2) is that the new variables
(x̃1 , . . . , x̃r )^T = U^+ (x1 , . . . , xr )^T
are such that
U^+ hU = ‖h̃(ωl ) δlm ‖,
where
h̃(ωl ) = ∫_0^{T0} e^{−iωl τ } h(τ ) dτ = ∫_{−T0 /2}^{T0 /2} e^{iωl τ } h(τ ) dτ , (10.4.9)
and so on for the other matrices. Formulae (10.3.30) in this case will have a form
similar to (10.4.6):
I0 = I/T0 = (1/2T0 ) ∑_{l∈L} ln [ β k̃(ωl ) |g̃(ωl )|^2 / h̃(ωl ) ],
R0 = R/T0 = −(1/2T0 ) ∑_{l∈L} [ k̃(ωl ) |g̃(ωl )|^2 / h̃(ωl ) − 1/β ] + (1/T0 ) E[c^0 (x)], (10.4.10)
with the only difference that index l now can range over all possible integer values
. . ., −1, 0, 1, 2, . . ., which satisfy the conditions
Then for β < 1/Φmax there are no indices l for which the condition (10.4.11) is
satisfied, the summations in (10.4.10) are absent, so that I0 equals zero
and also
(R0 )_{I0 =0} = (1/T0 ) E[c^0 (x)]. (10.4.12)
I0 will take a non-zero value as soon as β attains the value 1/Φmax and surpasses it.
Taking into account (10.4.12) by the usual formula
The dependency V0 (I0 ) has been obtained via (10.4.10), (10.4.13) in a parametric
form.
3. Let us consider the case when x = {. . . , x−1 , x0 , x1 , x2 , . . .} is an infinite station-
ary sequence, and the elements of the matrices
h = ‖hi− j ‖,   g = ‖gi− j ‖
The integration is carried over the subinterval L(β ) of the interval (−π , π ), for which
β k̃(λ )|g̃(λ )|^2 > h̃(λ ). The obtained formulae determine the rates I1 , R1 correspond-
ing on average to one element from the sequences {. . . , x1 , x2 , . . .}, {. . . , u1 , u2 , . . .}. De-
noting the length of the subinterval L(β ) by |L|, we obtain from (10.4.16) that
4π R1 /|L| = −Φ1 + Φ2 e^{−4π I1 /|L|} + const, (10.4.17)
where
Φ1 = (1/|L|) ∫_L Φ (λ ) dλ ;   Φ2 = exp [ (1/|L|) ∫_L ln Φ (λ ) dλ ]
are the mean values (over L(β )) of the functions Φ (λ ) = k̃(λ )|g̃(λ )|^2 /h̃(λ ) and ln Φ (λ ).
4. Finally, suppose there is a stationary process on the infinite continuous-time
axis. The functions h, g depend only on the differences in time, similarly to (10.4.8).
This case can be considered as a limiting case of the systems considered in paragraphs 2
and 3. Thus, in the formulae of paragraph 2 it is required to take the limit T0 → ∞, in
which the points ωl = 2π l/T0 fill the ω axis ever more densely (Δ ω = 2π /T0 ),
The functions h̄(ω ), ḡ(ω ) included in the previous formulae are defined by the
equality (10.4.9):
h̄(ω ) = ∫_{−∞}^{∞} e^{−iωτ } h(τ ) dτ , . . . . (10.4.19)
The integration in (10.4.18) is carried over the region L(β ) belonging to the axis ω ,
where
β k(ω )|g(ω )|2 > h(ω ). (10.4.20)
Denote the total length of that region by |L|. Then, similarly to (10.4.17), formu-
lae (10.4.18) entail the following:
4π V0 /|L| = Φ1 − Φ2 e^{−4π I0 /|L|} ,
where
Φ1 = (1/|L|) ∫_L Φ (ω ) dω ;   Φ2 = exp [ (1/|L|) ∫_L ln Φ (ω ) dω ],
Φ (ω ) = k̄(ω )|ḡ(ω )|^2 /h̄(ω ). (10.4.21)
k(τ ) = σ 2 e−γ |τ | .
Fig. 10.4 The rate of the value of information for the example with the cost function (10.4.23)
When the matrices h, g (as can be seen from comparison with (10.3.1)) have the
form h = ‖δ (t − t′ )‖, g = −‖δ (t − t′ )‖, then
Φ (ω ) = 2γσ ^2 /(γ ^2 + ω ^2 ).
The condition (10.4.20) now reads 2β γσ ^2 > γ ^2 + ω ^2 and, consequently, for a fixed
value β > γ /2σ ^2 the region L(β ) is the interval
−√(2β γσ ^2 − γ ^2 ) < ω < √(2β γσ ^2 − γ ^2 ). Instead of the parameter β , we consider now
Because of (10.4.23) and the stationarity of the process, the doubled rate of loss
2R0 , corresponding to a unit of time, coincides with the mean square error 2R0 =
E[x(t) − u(t)]^2 . Furthermore, the doubled value 2V0 (I0 ) = σ ^2 − 2R0 (I0 ) represents
the maximum decrease of the mean square error that is possible for a given
rate of information amount I0 .
The graph of the value function, corresponding to formulae (10.4.25), is shown
on Figure 10.4. These formulae can also be used to obtain approximate formulae for
the dependency V0 (I0 ) under small and large values of parameter y (or, equivalently,
of the ratio I0 /y).
For small y ≪ 1, let us use the expansions arctan y = y − y^3 /3 + y^5 /5 − · · ·
and y/(1 + y^2 ) = y − y^3 + y^5 − · · · . Substituting them into (10.4.25) we obtain
I0 = (γ /π )( y^3 /3 − y^5 /5 + · · · ),   V0 = (σ ^2 /π )( (2/3)y^3 − (4/5)y^5 + · · · ).
For large y, we similarly obtain
π I0 /γ = y − π /2 + y^{−1} + · · · ,
π V0 /σ ^2 = π /2 − 2y^{−1} + (4/3)y^{−3} − · · · .
The fact of the asymptotic equivalence of the values of various types of information
(Hartley's, Boltzmann's or Shannon's information amounts) should be regarded as
the main asymptotic result concerning the value of information; it holds true un-
der very broad assumptions, such as the requirement of information stability. This
fact cannot be reduced to the fact of asymptotically errorless information transmis-
sion through a noisy channel stated by Shannon's theorem (Chapter 7); it is
an independent and no less significant fact.
The combination of these two facts leads to a generalized result, namely the gen-
eralized Shannon's theorem (Section 11.5). The latter concerns a general performance
criterion determined by an arbitrary cost function and the risk corresponding to it.
Historically, the fact of asymptotic equivalence of different values of information
was first proven (1959) precisely in this composite and implicit form, combined
with the second fact (asymptotically errorless transmission). Initially, this fact was
not regarded as an independent one, and was, in essence, incorporated into the gen-
eralized Shannon’s theorem.
In this chapter we follow another way of presentation and treat the fact of asymp-
totic equivalence of the values of different types of information as a special, com-
pletely independent fact that is more elementary than the generalized Shannon's the-
orem. We consider this way of presentation preferable both from the fundamental
and from the pedagogical points of view. At the same time we can clearly observe the
symmetry of information theory and the equal importance of the second and the third
variational problems.
Apart from the fact of asymptotic equivalence of the values of information, it
is also interesting and important to study the magnitude of their difference. Sec-
tions 11.3 and 11.4 give the first terms of the asymptotic expansion for this dif-
ference, which were found by the author. These terms are exact for a chosen
random encoding, and they give an idea of the rate of decrease of the differ-
ence (as in any asymptotic semiconvergent expansion), even though the sum of
all the remaining terms of the expansion is not evaluated. We pay a special at-
tention to the question about invariance of the results with respect to a transfor-
mation c(ξ , ζ ) → c(ξ , ζ ) + f (ξ ) of the cost function, which does not influence
354 11 Asymptotic results about the value of information. Third asymptotic theorem
Let [P(dx), c(x, u)] be a Bayesian system. That is, there is a random variable x
from probability space (X, F, P), and F × G-measurable cost function c(x, u), where
u (from a measurable space (U, G)) is an estimator. In Chapter 9, we defined
for such a system the value functions of different types of information (Hartley,
Boltzmann, Shannon). These functions correspond to the minimum average cost
E[c(x, u)] attained for the specified amounts of information received. Hartley's
information consists of indexes pointing to which region Ek of the optimal partition
E1 + · · · + EM = X (Ek ∈ F) the value of x belongs. The minimum losses
R(I) = inf_{∑k Ek} E [ inf_u E [c(x, u) | Ek ] ] (11.1.1)
are determined by minimization over both the estimators, u, and different partitions.
The upper bound I on Hartley’s information amount corresponds to the upper bound
M eI on the number of the indicated regions.
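For a small finite X, the minimization (11.1.1) can be carried out by brute force over all partitions into at most M regions. The sketch below (prior and cost matrix are assumed, illustrative values) shows how the Hartley-type risk decreases as M grows.

```python
from itertools import product
import numpy as np

P = np.array([0.1, 0.2, 0.3, 0.4])       # prior on X = {0,1,2,3} (assumed)
c = np.array([[0., 1.], [1., 0.],        # cost matrix c(x, u), u in {0,1}
              [0.3, 0.8], [0.9, 0.2]])

def hartley_risk(M):
    """Brute-force version of (11.1.1): best risk over partitions into <= M regions."""
    best = np.inf
    for labels in product(range(M), repeat=len(P)):   # assignment x -> region index
        risk = 0.0
        for k in range(M):
            E = [x for x in range(len(P)) if labels[x] == k]
            if E:
                # sum_{x in E_k} P(x) c(x,u), minimized over u,
                # equals P(E_k) * min_u E[c(x,u) | E_k]
                risk += (P[E, None]*c[E]).sum(axis=0).min()
        if risk < best:
            best = risk
    return best

for M in [1, 2, 4]:
    print(M, hartley_risk(M))   # risk is non-increasing in M
```

In this example two regions already reach the full-information risk, since U contains only two estimators; with larger U the decrease continues up to M = |X|.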
When bounding the Shannon’s amount of information, the following minimum
costs are considered:
R(I) = inf_{P(du|x)} ∫ c(x, u) P(dx) P(du | x) (11.1.2)
R(I) ≤ R(I) ≤ R(I). (11.1.3)
Note that the definitions of the functions R(I), R(I), R(I) imply only the inequal-
ity (11.1.3). A question that emerges here is how much the functions R(I) and
R(I) differ from each other. If they do not differ much, then, instead of the function
R(I), which is difficult to compute, one may consider the computationally much easier
function R(I), which in addition has other convenient properties, such as differen-
tiability. The study of the differences between the functions R(I), R(I) and, thus,
between the value functions V (I), V (I), V (I), is the subject of the present chapter.
It turns out that for Bayesian systems of a particular kind, namely Bayesian sys-
tems possessing the property of ‘information stability’, there is asymptotic equiva-
lence of the above-mentioned functions. This profound asymptotic result (the third
asymptotic theorem) is comparable in its depth and significance with the respective
asymptotic results for the first and the second theorems.
Before giving the definition of informationally stable Bayesian systems, let us
consider composite systems, of which the informationally stable systems are a
generalization. We call a Bayesian system $[P_n(d\xi), c(\xi,\zeta)]$ an $n$-th power (or degree) of
a system [P1 (dx), c1 (x, u)], if the random variable ξ is a tuple (x1 , . . . , xn ) consisting
of n independent identically distributed (i.i.d.) random variables that are copies of
x, i.e.
Pn (d ξ ) = P1 (dx1 ) · · · P1 (dxn ). (11.1.4)
The estimator, $\zeta$, is a tuple of identical $u_1, \ldots, u_n$, while the cost function is the
following sum:
\[ c(\xi,\zeta) = \sum_{i=1}^{n} c_1(x_i, u_i). \tag{11.1.5} \]
Formulae (11.1.1), (11.1.2) can be applied to the system [P1 , c1 ] as well as to its
n-th power. Naturally, the amount of information In = nI1 for a composite system
should be $n$ times the amount of the original system. Then the optimal distribution
$P(d\zeta \mid \xi)$ (corresponding to formula (11.1.2) for a composite system) is factorized
according to (9.4.21) into the product
\[ P_{I_n}(d\zeta \mid \xi) = P_{I_1}(du_1 \mid x_1) \cdots P_{I_1}(du_n \mid x_n), \tag{11.1.6} \]
where $P_{I_1}(du \mid x)$ is the analogous optimal distribution for the original system. Fol-
lowing (11.1.2) we have
\[ R_1(I_1) = \int c_1(x,u)\,P_1(dx)\,P_{I_1}(du \mid x). \tag{11.1.7} \]
Thus, for the $n$-th power system, as a consequence of (11.1.5), (11.1.7), we obtain
\[ R_n(nI_1) = nR_1(I_1). \tag{11.1.8} \]
The case of function (11.1.1) is more complicated. The functions $\tilde R_n(nI_1)$ and $\tilde R_1(I_1)$
of the composite and the original systems are no longer related so simply. The num-
ber of partition regions for the composite system can be assumed to be $M = [e^{nI_1}]$,
whereas for the original system it is $m = [e^{I_1}]$ (the brackets $[\,\cdot\,]$ here denote the integer part).
Hence,
\[ \tilde R_n(nI_1) \le n\tilde R_1(I_1). \tag{11.1.11} \]
However, besides partitions (11.1.9) there is a large number of feasible partitions
of other kinds. Thus, it is reasonable to expect that $n\tilde R_1(I_1)$ is substantially larger
than $\tilde R_n(nI_1)$. One can expect that for some systems the rates $\tilde R_n(nI_1)/n$ of aver-
age costs [clearly exceeding $R_1(I_1)$ on account of (11.1.3), (11.1.8)] decrease as $n$
increases and approach their plausible minimum, which turns out to coincide pre-
cisely with $R_1(I_1)$, i.e. $\tilde R_n(nI_1)/n \to R_1(I_1)$. It is this fact that is the essence of the main result
(the third asymptotic theorem). Its derivation also yields another important result,
namely, there emerges a method of finding a partition $\sum_k G_k$ close (in the asymptotic
sense) to the optimal one. As it turns out, a procedure analogous to decoding via a ran-
dom Shannon code is suitable here (see Section 7.2). One takes $M$ code points $\zeta_1$,
$\ldots$, $\zeta_M$ (we recall that each of them represents a block $(u_1, \ldots, u_n)$). These points
are the result of $M$-fold random sampling with probabilities
\[ P_{I_n}(d\zeta) = \int_{X^n} P(d\xi)\,P_{I_n}(d\zeta \mid \xi) = P_{I_1}(du_1) \cdots P_{I_1}(du_n). \tag{11.1.12} \]
The region $G_k$ contains those points $\xi$ which are 'closer' to the point $\zeta_k$ than to the other
points (equidistant points may be assigned to any of the competing regions by default).
If in the specified partition (11.1.13) the estimator is chosen to be the point $\zeta_k$ used to
construct the region $G_k$, instead of the point $\zeta$ minimizing the expression $\mathbf E[c(\xi,\zeta) \mid G_k]$,
then some optimality is lost, that is,
\[ \mathbf E\Bigl[\inf_{\zeta} \mathbf E[c(\xi,\zeta) \mid G_k]\Bigr] \le \mathbf E\bigl[\mathbf E[c(\xi,\zeta_k) \mid G_k]\bigr]. \tag{11.1.14} \]
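The random-code construction just described is easy to simulate. In the sketch below (an illustration with invented parameters, not the book's own example) the elementary system is a symmetric binary source with Hamming cost; $M = [e^{nI_1}]$ code blocks are drawn i.i.d. from the product marginal, and each source block $\xi$ is assigned to the nearest code point, as the measuring decoder would do. The resulting per-letter cost is an empirical upper bound of the type discussed above, and it stays above the Shannon limit $R_1(I_1)$, in agreement with (11.1.3), (11.1.8).

```python
import math
import random

random.seed(1)
n = 12                      # block length of the composite system (invented)
I1 = 0.3                    # information rate per letter, in nats (invented)
M = int(math.exp(n * I1))   # number of code points, M = [e^{n I_1}]

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

trials, total = 2000, 0
for _ in range(trials):
    xi = [random.randint(0, 1) for _ in range(n)]         # source block
    code = [[random.randint(0, 1) for _ in range(n)]      # random codebook
            for _ in range(M)]
    total += min(hamming(xi, z) for z in code)            # nearest code point

rate = total / (trials * n)   # empirical cost rate, an estimate of R~_n(nI_1)/n

# Shannon limit for the binary symmetric source with Hamming cost:
# R_1(I_1) = D solving ln 2 - H(D) = I_1, H being binary entropy in nats.
def H(d):
    return -d * math.log(d) - (1 - d) * math.log(1 - d)
D = min((d / 10000 for d in range(1, 5000)),
        key=lambda d: abs(math.log(2) - H(d) - I1))
print(rate, D)
```

At this small block length the empirical rate (about 0.2 per letter here) is still visibly above the limiting value $R_1(I_1) \approx 0.13$; the third asymptotic theorem asserts that the gap closes as $n \to \infty$.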
11.2 Theorem about asymptotic equivalence of the value functions of different. . . 357
or via
\[ \gamma_1(x) = \ln \int e^{-\beta c_1(x,u)}\,P_{I_1}(du), \tag{11.1.17} \]
with the functions $\gamma_1(x)$, $\gamma_n(\xi) = \gamma_1(x_1) + \cdots + \gamma_1(x_n)$. Averaging them gives the poten-
tials
\[ \Gamma_1(\beta) = \mathbf E[\gamma_1(x)] = \int P(dx) \ln \int e^{-\beta c_1(x,u)}\,P_{I_1}(du), \tag{11.1.18} \]
\[ \Gamma_n(\beta) = \mathbf E[\gamma_n(\xi)] = n\Gamma_1(\beta), \tag{11.1.19} \]
\[ R_n = -n\,\frac{d\Gamma_1}{d\beta}(\beta), \qquad R_1 = -\frac{d\Gamma_1}{d\beta}(\beta), \tag{11.1.20} \]
which allow us to find upper bounds for the minimum losses $\tilde R_n(nI_1)$ corresponding to
Hartley's information amount. We use inequality (11.1.15), which is valid for every
set of code points $\zeta_1, \ldots, \zeta_M$. It remains true if we average with respect to a
statistical ensemble of code points described by the probabilities $P_{I_n}(d\zeta_1) \cdots P_{I_n}(d\zeta_M)$,
$M = [e^{nI_1}]$. At the same time we have
\[ \tilde R_n(nI_1) \le \int \mathbf E\bigl[\mathbf E[c(\xi,\zeta_k) \mid G_k] \,\big|\, \zeta_1, \ldots, \zeta_M\bigr]\, P_{I_n}(d\zeta_1) \cdots P_{I_n}(d\zeta_M) \equiv L. \tag{11.2.1} \]
Let us write down the expression $L$ in the right-hand side in more detail. Since
$G_k$ is the region of points $\xi$ for which the 'distance' $c(\xi,\zeta_k)$ is at most the 'distance'
$c(\xi,\zeta_i)$ to any point $\zeta_i$ from $\zeta_1, \ldots, \zeta_M$, we conclude that
\[ \mathbf E[c(\xi,\zeta_k) \mid G_k] = \frac{1}{P(G_k)} \int_{\substack{c(\xi,\zeta_k)\le c(\xi,\zeta_1) \\ \cdots \\ c(\xi,\zeta_k)\le c(\xi,\zeta_M)}} c(\xi,\zeta_k)\,P(d\xi). \tag{11.2.2} \]
For each k-th term, it is convenient first of all to make an integration with respect
to points ζi that do not coincide with ζk . If we introduce the cumulative distribution
function
\[ 1 - F_\xi(\lambda) = \begin{cases} \displaystyle\int_{c(\xi,\zeta)\ge\lambda} P_{I_n}(d\zeta) & \text{for } \lambda \ge 0, \\[6pt] \displaystyle\int_{c(\xi,\zeta)>\lambda} P_{I_n}(d\zeta) & \text{for } \lambda < 0, \end{cases} \tag{11.2.3} \]
then, after an (M − 1)-fold integration by points ζi not coinciding with ζk , equa-
tion (11.2.2) results in
\[ L \le \sum_{k=1}^{M} \int_\xi \int_{\zeta_k} c(\xi,\zeta_k)\,[1-F_\xi(c(\xi,\zeta_k))]^{M-1}\,P(d\xi)\,P_{I_n}(d\zeta_k) \]
\[ = M \int_\xi \int_\zeta c(\xi,\zeta)\,[1-F_\xi(c(\xi,\zeta))]^{M-1}\,P(d\xi)\,P_{I_n}(d\zeta). \tag{11.2.4} \]
The inequality sign has emerged because for $c(\xi,\zeta_k) \ge 0$ we slightly expanded the
regions $G_i$ ($i \ne k$) by adding to them all 'questionable' points $\xi$ for which $c(\xi,\zeta_k) =
c(\xi,\zeta_i)$, while for $c(\xi,\zeta_k) < 0$ we shrank those regions by dropping out all such
points.
It is not hard to see that (11.2.4) can be rewritten as
\[ L \le \int P(d\xi) \int_{-\infty}^{\infty} \lambda\, d\bigl\{1 - [1-F_\xi(\lambda)]^M\bigr\}. \tag{11.2.5} \]
It is evident that
\[ 1 - F_1 = [1-F_\xi]^M \le 1 - F_\xi, \quad \text{i.e.} \quad F_1(\lambda) \ge F_\xi(\lambda). \tag{11.2.7} \]
Since
\[ 1 - F_\xi \le e^{-F_\xi}, \]
we get
\[ (1-F_\xi)^M \le e^{-MF_\xi}, \tag{11.2.10} \]
which is equivalent to (11.2.9).
Employing (11.2.7), (11.2.9) we obtain that function (11.2.8) does not surpass
function (11.2.6):
\[ F_2(\lambda) \le F_1(\lambda). \tag{11.2.11} \]
It follows from the last inequality that
\[ \int \lambda\, dF_1(\lambda) \le \int \lambda\, dF_2(\lambda). \tag{11.2.12} \]
In order to ascertain this, we can take into account that (11.2.11) entails the
inequality $\lambda_1(F) \le \lambda_2(F)$ for the inverse functions, due to which the difference
$\int\lambda\,dF_2 - \int\lambda\,dF_1$ can be written as $\int_0^1 [\lambda_2(F) - \lambda_1(F)]\,dF$ and, consequently, turns
out to be non-negative.
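The chain of bounds (11.2.7)-(11.2.12) is easy to check numerically. In the sketch below (an independent illustration; the exponential distance distribution is invented), $F_\xi$ is the distribution of the 'distance' to a single random code point, $F_1 = 1-(1-F_\xi)^M$ is the exact distribution of the minimum over $M$ points, and $F_2$ uses the bound $e^{-MF_\xi}$; the mean computed from $F_2$ indeed dominates the exact mean $\int \lambda\, dF_1$.

```python
import math

M = 8
F  = lambda lam: 1.0 - math.exp(-lam)            # invented F_xi: Exp(1) distance
F1 = lambda lam: 1.0 - (1.0 - F(lam)) ** M       # exact CDF of the minimum
F2 = lambda lam: max(F(lam), 1.0 - math.exp(-M * F(lam)))  # bounding CDF

# Pointwise inequality (11.2.11): F2 <= F1 everywhere.
grid = [k * 0.001 for k in range(20000)]
assert all(F2(lam) <= F1(lam) + 1e-12 for lam in grid)

# Means via E = int (1 - F) dlambda for nonnegative distances:
h = 0.001
E1 = sum((1.0 - F1(lam)) * h for lam in grid)
E2 = sum((1.0 - F2(lam)) * h for lam in grid)
print(E1, E2)
```

Here $E_1$ comes out close to the exact value $1/M$ for the minimum of $M$ exponential variables, and $E_2 \ge E_1$, exactly as (11.2.12) requires.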
So, inequality (11.2.5) will only become stronger if we substitute $\int\lambda\,dF_1$ in the
right-hand side by $\int\lambda\,dF_2$. Therefore, we shall have
\[ \tilde R_n(nI_1) \le \int P(d\xi) \int \lambda\, dF_2(\lambda). \tag{11.2.13} \]
Then the cost rates corresponding to different types of information coincide in the
limit:
\[ \frac{1}{n}\tilde R_n(nI_1) \to R_1(I_1) \tag{11.2.15} \]
as n → ∞. In other words, there is asymptotic equivalence
representation:
\[ P(\Gamma) = \int P(\Gamma \mid \xi)\,P(d\xi), \]
split this integral into two parts and keep just one subintegral:
\[ P(\Gamma) = \int_{\Xi} P(\Gamma \mid \xi)\,P(d\xi) + \int_{\bar\Xi} P(\Gamma \mid \xi)\,P(d\xi) \ge \int_{\Xi} P(\Gamma \mid \xi)\,P(d\xi). \tag{11.2.22b} \]
Within the complementary set $\bar\Xi$ the inequality opposite to (11.2.21) holds true:
$P(Z_\xi \mid \xi) \equiv P(\Gamma \mid \xi) \le \delta$ when $\xi \in \bar\Xi$, i.e. $P(\Gamma \mid \xi) > \delta$ when $\xi \in \Xi$. Substituting this esti-
mate in (11.2.22b) and taking into account (11.2.22a) we find that
\[ P(\Gamma) > \delta \int_{\Xi} P(d\xi) = \delta\,P(\Xi) \ge \delta^2. \]
\[ \ln\frac{P(\zeta \mid \xi)}{P(\zeta)} < n(I_1 + \varepsilon), \]
that is,
\[ P(\zeta) > e^{-n(I_1+\varepsilon)}\,P(\zeta \mid \xi). \]
Summing over ζ ∈ Zξ we obtain from here that
2. Now we apply formula (11.2.5). Let us take the following cumulative distribution
function:
\[ F_3(\lambda) = \begin{cases} 0 & \text{for } \lambda < nR_1(I_1) + n\varepsilon, \\ F_1(nR_1(I_1)+n\varepsilon) & \text{for } nR_1(I_1)+n\varepsilon < \lambda < nK_1, \\ 1 & \text{for } \lambda \ge nK_1. \end{cases} \tag{11.2.24} \]
Because of the boundedness condition (11.2.14) of the cost function, the prob-
ability of inequality c(ξ , ζ ) > nK1 being valid is equal to zero, and it follows
from (11.2.3) that Fξ (nK1 ) = 1. As a result of (11.2.6) we have F1 (nK1 ) = 1.
Hence, the functions $F_1(\lambda)$ and $F_3(\lambda)$ coincide within the interval $\lambda \ge nK_1$. Within
the interval $nR_1(I_1)+n\varepsilon < \lambda < nK_1$ we have $F_3(\lambda) \le F_1(\lambda)$, whence
\[ \int \lambda\, dF_1(\lambda) \le \int \lambda\, dF_3(\lambda) \]
in the same way as (11.2.12) has followed from (11.2.11). That is why for-
mula (11.2.5) yields
\[ \tilde R_n(nI_1) \le \int P(d\xi) \int \lambda\, dF_3(\lambda), \tag{11.2.25} \]
but
\[ \int \lambda\, dF_3(\lambda) = [nR_1(I_1)+n\varepsilon]\,F_1(nR_1(I_1)+n\varepsilon) + nK_1\bigl[1 - F_1(nR_1(I_1)+n\varepsilon)\bigr] \]
featuring in (11.2.23). For the values of ζ within the set Zξ inequalities (11.2.18)
and (11.2.19) hold true due to the definition given earlier. Therefore, the domain
of integration Zξ in (11.2.26b) constitutes a subset of the domain of integration
in (11.2.26a) and, consequently,
Let us split the integral in (11.2.26) into two parts: an integral over the set $\Xi$ and an
integral over the complementary set $\bar\Xi$. We apply (11.2.27) to the first integral and
replace the exponent by one in the second. This yields
\[ \int P(d\xi)\,\exp[-MF_\xi(nR_1(I_1)+n\varepsilon)] \le \int_{\Xi} P(d\xi)\,\exp[-Me^{-nI_1-n\varepsilon}(1-\delta)] + 1 - P(\Xi) \]
\[ \le \exp[-Me^{-nI_1-n\varepsilon}(1-\delta)] + \delta. \tag{11.2.28} \]
Since $M = [e^{nI_1}] \ge e^{nI_1} - 1$, we obtain
\[ \frac{1}{n}\tilde R_n(nI_1) \le R_1(I_1-2\varepsilon) + \varepsilon + 2K_1\delta + 2K_1\exp\{-e^{n\varepsilon}(1-\delta)(1-e^{-nI_1})\}. \tag{11.2.29} \]
\[ \limsup_{n\to\infty} \frac{1}{n}\tilde R_n(nI_1) \le R_1(I_1-2\varepsilon) + \varepsilon + 2K_1\delta + \lim_{n\to\infty} 2K_1\exp\{-e^{n\varepsilon}(1-\delta)(1-e^{-nI_1})\}. \]
However,
\[ \lim_{n\to\infty} \exp\{-e^{n\varepsilon}(1-\delta)(1-e^{-nI_1})\} = 0, \]
so that
\[ \limsup_{n\to\infty} \frac{1}{n}\tilde R_n(nI_1) - R_1(I_1) \le R_1(I_1-2\varepsilon) - R_1(I_1) + \varepsilon + 2K_1\delta. \tag{11.2.30} \]
Because the function R1 (I1 ) is continuous, the expression in the right-hand side
of (11.2.30) can be made arbitrarily small by considering sufficiently small ε and
δ . Therefore,
\[ \limsup_{n\to\infty} \frac{1}{n}\tilde R_n(nI_1) \le R_1(I_1). \]
The above formula, together with the inequality
\[ \liminf_{n\to\infty} \frac{1}{n}\tilde R_n(nI_1) \ge R_1(I_1), \tag{11.2.30a} \]
which follows from (11.1.3) and (11.1.8), proves the relation
\[ \limsup_{n\to\infty} \frac{1}{n}\tilde R_n(nI_1) = \liminf_{n\to\infty} \frac{1}{n}\tilde R_n(nI_1) = R_1(I_1), \]
that is, equation (11.2.15). The proof is complete.
3. Theorem 11.1 proven above allows for a natural generalization to those cases
when the system under consideration is not an $n$-th degree of some elementary system,
but instead some other, more general conditions are satisfied. The corresponding
generalization is analogous to the one made during the transition from
Theorem 7.1 to Theorem 7.2, where the requirement for a channel to be an $n$-th
degree of an elementary channel was replaced by a requirement of informational
stability. Therefore, let us impose such conditions on the Bayesian systems in question
that no significant changes to the aforementioned proof are required. Following
a common trick, we exclude from the proof the number $n$ and the rates $I_1$,
$R_1, \ldots$, and instead consider only the combinations $I = nI_1$, $R = nR_1$ and so on. Let us
require that the sequence of random variables $\xi$, $\zeta$ (dependent on $n$ or some other pa-
rameter) be informationally stable in terms of the definition given in Section 7.3.
Besides let us require the following convergence in probability:
It is easy to see that under such conditions the inequalities of the type (11.2.17) will
hold. Those inequalities now take the form
for $n > n(\varepsilon_1, \varepsilon_2, \delta)$, where $\varepsilon_1 V = \varepsilon_2 I = n\varepsilon$. Instead of the previous relation $\tilde I_1 =
I_1 + 2\varepsilon$ we now have
\[ \tilde I = I + 2\varepsilon_2 I. \]
Let us take the boundedness condition (11.2.14) in the following form:
\[ \tilde R(\tilde I) \le R(I - 2\varepsilon_2 I) + \varepsilon_1 V(I) + 2KV(I)\delta + 2KV(I)\exp\{-e^{\varepsilon_2 I}(1-\delta)(1-e^{-\tilde I})\} \]
or
for all values of $\varepsilon_1$, $\varepsilon_2$, $\delta$ that are independent of $n$, and for $n > n(\varepsilon_1, \varepsilon_2, \delta)$. In conse-
quence of condition A from the definition of informational stability (see Section 7.3),
for every $\varepsilon_2 > 0$ the expression $\exp\{-e^{\varepsilon_2 I}(1-\delta)(1-e^{-\tilde I})\}$, where $\tilde I = (1+2\varepsilon_2)I$,
converges to zero as $n \to \infty$ if $\delta < 1$. Therefore, a passage to the limit $n \to \infty$
from (11.2.33) results in
\[ \liminf_{n\to\infty} \frac{\tilde V(\tilde I)}{V(\tilde I)} \ge \liminf_{n\to\infty} \frac{V(\tilde I - 2\varepsilon_2 I)}{V(\tilde I)} - \varepsilon_1 - 2K\delta. \tag{11.2.33a} \]
Because $\varepsilon_1$, $\varepsilon_2$, $\delta$ are arbitrary and the function $\varphi(y)$ is continuous, we obtain that
\[ \liminf_{n\to\infty} \frac{\tilde V(I)}{V(I)} \ge 1. \tag{11.2.34a} \]
Together with the opposite inequality
\[ \limsup_{n\to\infty} \frac{\tilde V(I)}{V(I)} \le 1, \]
this proves the asymptotic equivalence
\[ \frac{\tilde V(I)}{V(I)} \to 1. \tag{11.2.35} \]
4. We noted in Section 9.6 that the value of information functions $V(I)$, $\tilde V(I)$ are
invariant under the following transformation of the cost function:
\[ c'(\xi,\zeta) = c(\xi,\zeta) + f(\xi) \tag{11.2.36} \]
(see Theorem 9.8). At the same time the difference of risks $\tilde R - R$ and also the
regions $G_k$ defined in Section 11.1 remain invariant if the code points $\zeta_k$ and
the distribution $P(d\zeta)$ from which they are sampled do not change. Meanwhile,
conditions (11.2.31), (11.2.32) are not invariant under the transformation defined
by (11.2.36). Thus, (11.2.31) turns into
\[ \bigl\{c'(\xi,\zeta) - \mathbf E[c'(\xi,\zeta)] - f(\xi) + \mathbf E[f(\xi)]\bigr\}/V(I) \to 0. \]
Taking this into account, one may take advantage of arbitrary function f (ξ ) in trans-
formation (11.2.36) and select it in such a way that conditions (11.2.31), (11.2.32)
are satisfied, if they were not satisfied initially. This broadens the set of cases, for
which the convergence (11.2.34) can be proven, and relaxes the conditions required
for asymptotic equivalence of the value functions in the aforementioned theory.
In fact, using a particular choice of function f (ξ ) in (11.2.36) one can eliminate
the need for condition (11.2.31) altogether. Indeed, as one can see from (9.4.21), the
following equation holds for the case of extremum distribution P(ζ | ξ ):
unless, of course, the ratio $I/\beta V$ grows to infinity. We say that a sequence
of Bayesian systems $[P(d\xi), c(\xi,\zeta)]$ is informationally stable if the sequence of
random variables $\xi$, $\zeta$ is informationally stable for the extremum distributions. Hence,
the next result follows from the above.
to the level
\[ R(I) = \min_u \mathbf E[c(x,u)] - V(I), \]
Fig. 11.1 Block diagram of a system subject to an information constraint. The channel capacity is
n times greater than that in Figure 9.5. MD—measuring device
been proven. In other words, the system (see Figure 11.1), in which block 1 classifies
the input signal ξ according to its membership in regions Gk and outputs the index
of the region containing ξ , is asymptotically optimal. It is readily seen that the
specified work of block 1 is absolutely analogous to the work of the decoder at the
output of a noisy channel mentioned in Chapter 7. The only difference is that instead
of the ‘distance’ defined by (7.1.8) we now consider ‘distance’ c(ξ , ζ ) and also ξ , η
are substituted by ζ , ξ . This analogy allows us to call block 1 a measuring decoder,
while block 2 acts as a block of optimal estimation.
The described information system and more general systems were studied in
the works of Stratonovich [47], Stratonovich and Grishanin [55], Grishanin and
Stratonovich [17].
The information constraints of the type considered before (Figures 9.6 and 11.1)
can be taken into account in various systems of optimal filtering, automatic con-
trol, dynamic programming and even game theory. Sometimes these constraints are
related to boundedness of information inflow, sometimes to memory limitations,
sometimes to a bounded complexity of an automaton or a controller. The consid-
11.3 Rate of convergence between the values of Shannon’s and Hartley’s information 369
The aforementioned Theorem 11.1 establishes the asymptotic equiva-
lence of the values of Shannon's and Hartley's information amounts. It is of interest
how quickly the difference between these values vanishes. We remind the reader
that Chapter 7, right after Theorems 7.1 and 7.2, which established the fact of
asymptotic vanishing of the probability of decoding error, contains theorems in which
the rate of vanishing of that probability was studied.
Undoubtedly, we can obtain a large number of results of various complexity and
importance concerned with the rate of vanishing of the difference $V(I) -
\tilde V(I)$ (as in the problem of asymptotically vanishing probability of error). Various methods
can be used in solving this problem. Here we give some comparatively simple
results concerning it. We shall calculate the first terms of the asymptotic
expansion of the difference $V(I) - \tilde V(I)$ in powers of the small parameter $n^{-1}$. In
so doing, we shall determine that the boundedness condition (11.2.14) of the cost
function, stipulated in the proof of Theorem 11.1, is inessential for the asymptotic
equivalence of the values of different kinds of information and is dictated only by
the adopted proof method.
Consider formula (11.2.13), which can be rewritten using the notation $S =
\int \lambda\, dF_2(\lambda)$ as follows:
\[ \tilde R_n(nI_1) \le \int S\,P(d\xi). \tag{11.3.1} \]
Hereinafter, we identify $\tilde I$ with $I$, $\tilde I_1$ with $I_1$ and $\tilde M$ with $M$, because it is now unnec-
essary to distinguish between them.
Let us perform an asymptotic calculation of the expression situated in the right-
hand side of the above inequality.
1. In view of (11.2.8) we have
S = S1 + S2 + S3 , (11.3.2)
where
\[ S_1 = -\int_{\lambda=-\infty}^{\bar c} \lambda\, d e^{-MF_\xi(\lambda)}, \qquad S_2 = -\int_{\lambda=\bar c}^{\infty} \lambda\, d e^{-MF_\xi(\lambda)}, \tag{11.3.3} \]
\[ S_3 = \int_{F_\xi(\lambda) > 1 - e^{-MF_\xi(\lambda)}} \lambda\, d\bigl[F_\xi(\lambda) - 1 + e^{-MF_\xi(\lambda)}\bigr]. \tag{11.3.4} \]
\[ F_\xi(\lambda) = \mathbf P[c(\xi,\zeta) \le \lambda] \]
for $\lambda < \int c(\xi,\zeta)\,P_{I_n}(d\zeta) = \bar c$, and also
\[ F_\xi(\lambda) = 1 - \bigl[2\pi\mu_\xi''(s)s^2\bigr]^{-1/2}\, e^{-s\mu_\xi'(s)+\mu_\xi(s)}\,\bigl[1 + O(\mu_\xi''^{-1})\bigr], \tag{11.3.6} \]
where
\[ \mu_\xi(t) = \ln \int e^{t c(\xi,\zeta)}\,P_{I_n}(d\zeta), \tag{11.3.8} \]
where integration is conducted with respect to $s$, and the lower bound of integration
$s_*$ is determined by the formula $\lim_{s\to s_*} \mu_\xi(s) = -\infty$.
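The potential $\mu_\xi(t)$ in (11.3.8) is a cumulant generating function of the random cost, so its first two derivatives at $t=0$ are the mean and the variance of $c(\xi,\zeta)$. A minimal finite-difference check on an invented three-point distribution (illustration only, not tied to any specific system in the book):

```python
import math

costs = [0.0, 1.0, 3.0]          # invented cost values c(xi, zeta)
probs = [0.5, 0.3, 0.2]          # invented P_{I_n}(zeta)

def mu(t):
    """Cumulant generating function mu(t) = ln E[exp(t c)], as in (11.3.8)."""
    return math.log(sum(p * math.exp(t * c) for p, c in zip(probs, costs)))

mean = sum(p * c for p, c in zip(probs, costs))                  # E[c] = 0.9
var = sum(p * c * c for p, c in zip(probs, costs)) - mean ** 2   # Var[c] = 1.29

h = 1e-3
d1 = (mu(h) - mu(-h)) / (2 * h)                 # ~ mu'(0) = mean
d2 = (mu(h) - 2 * mu(0.0) + mu(-h)) / (h * h)   # ~ mu''(0) = variance
print(d1, d2)
```

The same derivatives taken at a nonzero saddle point $s$ are what enter the tail approximation (11.3.6).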
\[ S_1 = \cdots - \frac{1}{2}\,\mu_1''(r) \int_{x=s_*-r}^{-r} x^2\, d\exp\bigl\{-e^{n\alpha x + n\beta x^2 + \cdots}\bigr\}\bigl[1 + O(n^{-1}) + \cdots\bigr], \qquad (x = s - r), \tag{11.3.11} \]
where the coefficients $\alpha$, $\beta$ are determined
with the required precision. Further, we choose a negative root of equation (11.3.10),
\[ r < 0 \tag{11.3.13} \]
(if one exists), so that $\alpha > 0$. Making the change of variable $e^{n\alpha x} = z$ we trans-
form (11.3.11) into the form
\[ S_1 = \mu_1(r)\bigl[1 - e^{-MF_\xi(\bar c)}\bigr] - \frac{\mu_1'(r)}{n\alpha} \int_{z=z_*}^{e^{-n\alpha r}} \ln(z)\, d\exp\bigl\{-e^{\ln z + (\beta/n\alpha^2)\ln^2 z + \cdots}\bigr\}\bigl[1 + O(n^{-1})\bigr] \]
\[ - \frac{1}{2}\,\frac{\mu_1''(r)}{n^2\alpha^2} \int_{z=z_*}^{e^{-n\alpha r}} \ln^2(z)\, d\exp\bigl\{-e^{\ln z + (\beta/n\alpha^2)\ln^2 z + \cdots}\bigr\}\bigl[1 + O(n^{-1})\bigr] - \cdots, \qquad (z = e^{n\alpha(s-r)}). \tag{11.3.14} \]
The above equation allows one to appreciate the magnitude of the various terms for
large $n$. Expanding the exponent
\[ \exp\bigl\{-e^{\ln z + (\beta/n\alpha^2)\ln^2 z + \cdots}\bigr\} \equiv \exp\bigl\{-e^{\ln z + \varepsilon}\bigr\}, \]
plugging (11.3.15) into (11.3.14), and retaining only the terms of orders $1$, $n^{-1}$, $n^{-2}$,
we obtain
\[ S_1 = \mu_1(r)\bigl[1 - e^{-MF_\xi(\bar c)}\bigr] - \frac{\mu_1'(r)}{n\alpha} \int_{z=z_*}^{e^{-n\alpha r}} \ln(z)\, d\Bigl[e^{-z} - \frac{\beta}{n\alpha^2}\,z\ln^2 z\; e^{-z}\Bigr] - \frac{1}{2}\,\frac{\mu_1''(r)}{n^2\alpha^2} \int_{z=z_*}^{e^{-n\alpha r}} \ln^2(z)\, d e^{-z} + n^{-3}\cdots \tag{11.3.16} \]
Here we have neglected the term $-\mu_1(r)e^{-MF_\xi(\bar c)}$, which vanishes with the growth
of $n$ and $M = [e^{nI_1}]$ much faster than $n^{-2}$. Also we have neglected the integrals
\[ \frac{\mu_1'(r)}{n\alpha} \int_{e^{-n\alpha r}}^{\infty} \ln(z)\,e^{-z}\,dz, \qquad \frac{\mu_1'(r)}{n\alpha} \int_{0}^{z_*} \ln(z)\,e^{-z}\,dz, \]
which decrease very quickly with the growth of $n$, because $e^{-n\alpha r}$ and $z_* = e^{n\alpha(s_*-r)}$
converge exponentially to $\infty$ and $0$, respectively (recall that $r < 0$, $s_* < r$). One can
easily assess the values of these integrals, for instance, by omitting the factor $\ln z$ in
the first integral and $e^{-z}$ in the second, which have relatively small influence.
However, we will not dwell on this any further.
The integral in (11.3.16) is equal to the Euler constant $C = 0.577\ldots$. Indeed,
expressing this integral as the limit
\[ \lim_{\alpha\to\infty} \int_0^\alpha \ln(z)\, d\bigl(e^{-z} - 1\bigr) \]
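This claim can be verified directly: $\int_0^\infty \ln z\, d(e^{-z}-1) = -\int_0^\infty e^{-z}\ln z\,dz = C$. A quick numerical check (a sketch independent of the book's derivation):

```python
import math

# Midpoint rule for -int_0^inf e^{-z} ln z dz; the ln z singularity at 0 is
# integrable, and the integrand is negligible beyond z = 50.
h = 1e-4
val = -sum(math.exp(-(k + 0.5) * h) * math.log((k + 0.5) * h) * h
           for k in range(int(50 / h)))
print(val)   # ~0.5772, the Euler constant C
```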
and the condition for the existence of the negative root (11.3.13) is equivalent (for
large $n$) to the condition of the existence of the root
\[ q_\xi < 0. \tag{11.3.20} \]
By differentiation, one can verify that the function $s\mu_\xi'(s) - \mu_\xi(s)$ attains its
minimum value, equal to zero, at the point $s = 0$. Therefore, equation (11.3.18), at
least for sufficiently small $I_1$, has two roots, one positive and one negative. Thus, it is possible
to choose the negative root $q_\xi$.
According to (11.3.19), formula (11.3.17) can be simplified to the form
\[ S_1 = \mu_1'(q_\xi) - \frac{1}{2nq_\xi}\ln\bigl[2\pi n\mu_1''(q_\xi)q_\xi^2\bigr] + \frac{C}{nq_\xi} + n^{-2}\ln^2 n \cdots \tag{11.3.21} \]
2. Let us now turn to the estimation of integral (11.3.4). Denote by $\lambda_\Gamma$ the bound-
ary point determined by the equation
which confirms the small value of $\varepsilon$. Within the domain of integration $\lambda > \lambda_\Gamma$ in
equation (11.3.4) the derivative
\[ \frac{d}{d\lambda}\bigl[F_\xi(\lambda) - 1 + e^{-MF_\xi(\lambda)}\bigr] = \bigl[1 - Me^{-MF_\xi(\lambda)}\bigr]\frac{dF_\xi(\lambda)}{d\lambda} \]
is positive for $M \gg 1$, which can be checked by taking into account that
where
\[ \rho = s_\Gamma \mu_1''(s_\Gamma) > 0, \qquad n\mu_1'(s_\Gamma) = \lambda_\Gamma, \tag{11.3.25} \]
and the dots correspond to other terms, the form of which is not important to us.
Substituting (11.3.24) into (11.3.23) we obtain
\[ S_3 \le \varepsilon \int_{x=0}^{\infty} |\lambda_\Gamma + x|\, d\bigl(1 - e^{-n\rho x + \cdots}\bigr) \le \varepsilon \int_{x=0}^{\infty} \bigl[|\lambda_\Gamma| + x\bigr]\, d\bigl(1 - e^{-n\rho x + \cdots}\bigr) = \varepsilon\Bigl[|\lambda_\Gamma| + \frac{1}{n\rho}\Bigr] + \cdots. \]
The right-hand side of the above equation can be written using (11.3.6) as follows:
Evidently,
\[ S_2 \le b\bigl[e^{-MF_\xi(\bar c)} - e^{-MF_\xi(b)}\bigr] \le b\,e^{-MF_\xi(\bar c)}. \tag{11.3.28} \]
To estimate the integral $S_2$ we use (11.3.6) and a formula of the type (11.3.24):
\[ S_2 = -\int_{b}^{\infty} \lambda\, d\exp\bigl\{-M\bigl[1 - (1 - F_\xi(b))\,e^{-n\rho_b(\lambda-b)+\cdots}\bigr]\bigr\}. \]
Let us compute the second term in (11.3.29). By analytic continuation, from the for-
mula
\[ \int_0^1 e^{-\rho z}\ln z\,dz = -\frac{1}{\rho}(C + \ln\rho) + \frac{1}{\rho}\,\mathrm{Ei}(-\rho) \]
(see formulae (3.711.2) and (3.711.3) in Ryzhik and Gradstein [36]; the correspond-
ing English translation is [37]) we obtain
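This tabulated integral can be sanity-checked numerically, using the standard series $\mathrm{Ei}(-\rho) = C + \ln\rho + \sum_{k\ge1}(-\rho)^k/(k\cdot k!)$. The check below with $\rho = 2$ is an illustration only, not part of the book's derivation:

```python
import math

rho = 2.0
C = 0.5772156649015329

# Left side: numerical integral of e^{-rho z} ln z over (0, 1), midpoint rule
# (the ln z singularity at 0 is integrable).
N = 200000
lhs = sum(math.exp(-rho * (k + 0.5) / N) * math.log((k + 0.5) / N) / N
          for k in range(N))

# Right side: -(1/rho)(C + ln rho) + (1/rho) Ei(-rho), with Ei from its series.
ei = C + math.log(rho) + sum((-rho) ** k / (k * math.factorial(k))
                             for k in range(1, 60))
rhs = -(C + math.log(rho)) / rho + ei / rho
print(lhs, rhs)   # both ~ -0.6596
```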
\[ \int_0^1 e^{Nz}\ln z\,dz \cong -\frac{1}{N}(C + \ln N) - \frac{1}{N}E_1(N), \]
whereas
\[ \mathrm{Ei}(N) = \frac{e^N}{N}\Bigl(1 + \frac{1}{N} + \frac{2}{N^2} + \cdots\Bigr) \]
(see p. 48 in Jahnke and Emde [25]; the corresponding book in English is [24]).
Consequently, the main dependency of the second term $T_2$ in (11.3.29) on $M$ is
determined by the exponential factor $e^{-M+N} = e^{-MF_\xi(b)}$.
Thus, all three terms (11.3.28), (11.3.30) and $T_2$ constituting $S_2$ decrease quite rapidly
with the growth of $n$. Just as $S_3$, they do not influence an asymptotic expan-
sion of the type (11.3.21) in powers of the small parameter $n^{-1}$ (in combi-
nation with logarithms $\ln n$). Therefore, we need to account only for one term in
formula (11.3.2), so that due to (11.3.21) we have
\[ S = \mu_\xi'(q_\xi) - \frac{1}{2q_\xi}\ln\bigl[2\pi\mu_\xi''(q_\xi)q_\xi^2\bigr] + \frac{C}{q_\xi} + O\bigl(n^{-1}\ln^2 n\bigr). \tag{11.3.31} \]
4. Let us average over $\xi$ in compliance with the latter formula. This averaging
becomes easier because of the fact that function (11.3.8) is a sum of
identically distributed random summands $\ln\int e^{tc_1(x_i,u)}\,P_{I_1}(du)$, while $\mu_1 = \mu_\xi/n$ is
their arithmetic mean:
\[ \mu_1(t) = \frac{1}{n}\sum_{i=1}^{n} \ln\int e^{t c_1(x_i,u)}\,P_{I_1}(du). \tag{11.3.33} \]
By virtue of the Law of Large Numbers this mean converges to the expectation
\[ \nu_1(-t,\beta) \equiv \mathbf E[\mu_1(t)] = \int P(dx)\,\ln\int e^{t c_1(x,u)}\,P_{I_1}(du). \tag{11.3.34} \]
For each fixed $t$, according to the above law, we have the following convergence in
probability:
\[ \mu_1(t) \to \nu_1(-t), \tag{11.3.36} \]
and also for the derivatives
\[ \mu_1^{(k)}(t) \to (-1)^k \nu_1^{(k)}(-t), \tag{11.3.37} \]
if they exist.
Because of the convergence in (11.3.36), (11.3.37), the root $q_\xi$ of equation
(11.3.18) converges in probability to the root $q = -\beta$ of the limiting equation, i.e.
\[ q_\xi \to -\beta \quad \text{as } n \to \infty. \tag{11.3.39} \]
β = −dI/dR,
and the parameter β is positive for the normal branch R+ (9.3.10) (see Section 9.3).
Hence we obtain the inequality qξ < 0 that complies with condition (11.3.20).
Because of the convergence in (11.3.37), (11.3.39) and equalities (11.3.35),
(11.1.20), the first term $\mathbf E[\mu_1'(q_\xi)]$ in (11.3.32) tends to $R_1(I_1)$. That proves (if we
also take into account the vanishing of the other terms) the convergence
\[ \frac{1}{n}\tilde R(nI_1) \to R_1(I_1) \quad (n \to \infty), \]
which was discussed in Theorem 11.1.
In order to study the rate of convergence, let us consider the deviation of the
random variables occurring in the averaged expression (11.3.32) from their non-
random limits. We introduce the random deviation $\delta\mu_1 = \mu_1(-\beta) - \nu_1(\beta)$. Just as
$\delta\mu_1$, the random deviation $\delta\mu_1' = \mu_1'(-\beta) + \nu_1'(\beta)$, due to (11.3.33), has a
null expected value and a variance of the order $n^{-1}$:
and neglecting the terms of order higher than quadratic we obtain, consequently, that
\[ 0 \le \frac{1}{n}\tilde R_n(nI_1) - R_1(I_1) \le \frac{1}{2n\beta}\ln\bigl[2\pi n\,\nu_1''(\beta)\beta^2\bigr] - \frac{C}{n\beta} + \frac{\mathbf E\bigl[(\beta\,\delta\mu_1 + \delta\mu_1')^2\bigr]}{2\beta^3\nu_1''(\beta)} + o(n^{-1}). \tag{11.3.40} \]
11.4 Alternative forms of the main result. Generalizations and special cases 379
Applying the same methods and conducting calculations with a greater level of
accuracy, we can also find the terms of higher order in this asymptotic expansion.
This way one can corroborate those points of the above-stated derivation, which
may appear insufficiently grounded.
In the exposition above we assumed that the domain $(s_1, s_2)$ of definition and
differentiability of the potential $\mu_\xi(t)$, which was mentioned in Theorem 4.8, is
sufficiently large. However, only the vicinity of the point $s = -\beta$ is actually essential. The
other parts of the straight line $s$ influence only the exponential terms of the
types (11.3.26), (11.3.28), (11.3.30), and they do not influence the asymptotic ex-
pansion (11.3.40). Of course, anomalous behaviour of the function $\mu_1(s)$ on
these other parts of $s$ would complicate the proof.
For formula (11.3.40) to be valid, condition (11.2.14) of boundedness of the
cost function is not really necessary. However, the derivation of this formula can
be somewhat simplified if we impose this condition. Then elaborate estimations of
integrals S2 , S3 and the integral over domain Ξ will not be required. Instead, we
can confine ourselves to proving that the probability of the appropriate regions of
integration vanishes rapidly (exponentially). In this case, the value of the constant
K will be inessential, since it will not appear in the final result.
As one can see from the above derivation, the terms of the asymptotic expan-
sion (11.3.40) are exact for the given random encoding. Higher-order terms can be
found, but the terms already written cannot be improved unless the given encoding proce-
dure is abandoned. The problem of interest is how close the estimate (11.3.40) is
to the actual value of the difference $(1/n)\tilde R(nI_1) - R_1(I_1)$, and to what extent it can
be refined if more elaborate encoding techniques are used.
1. In the previous section we introduced the function μ1 (t) = (1/n)μξ (t) instead of
function (11.3.8). It was done essentially for convenience and illustration reasons
in order to emphasize the relative magnitude of terms. Almost nothing will change
if the rate quantities μ1 , ν1 , R1 , I1 and others are used only as factors of n, that
is if we do not introduce the rate quantities at all. Thus, instead of the main result
formula (11.3.40), after multiplication by n we obtain
\[ 0 \le \tilde R(I) - R(I) = V(I) - \tilde V(I) \le \frac{1}{2\beta}\ln\Bigl[\frac{2\pi}{\gamma^2}\,\nu''(\beta)\beta^2\Bigr] + \frac{\mathbf E\bigl[(\beta\,\delta\mu + \delta\mu')^2\bigr]}{2\beta^3\nu''(\beta)} + o(1) \tag{11.4.1} \]
(the index $n$ is omitted, the term $C/\beta$ is moved under the logarithm; $\gamma = e^C = 1.781\ldots$).
With these substitutions, just as in paragraph 3 in Section 11.2, it becomes unnec-
essary for the Bayesian system to be the n-th degree of some elementary Bayesian
system.
Differentiating function (11.4.2) twice at the point $t = -\beta$ and taking into ac-
count (11.1.16) we obtain that
\[ \nu''(\beta) = \int P(d\xi)\biggl[\int c^2(\xi,\zeta)\,e^{-\gamma(\xi)-\beta c(\xi,\zeta)}\,P_I(d\zeta) - \Bigl(\int c(\xi,\zeta)\,e^{-\gamma(\xi)-\beta c(\xi,\zeta)}\,P_I(d\zeta)\Bigr)^2\biggr]. \]
Because of (9.4.21), the above integrals over $\zeta$ are conditional
expectations taken with respect to the conditional probabilities $P(d\zeta \mid \xi)$. Therefore,
\[ \nu''(\beta) = \mathbf E\bigl[\mathbf E[c^2(\xi,\zeta) \mid \xi] - (\mathbf E[c(\xi,\zeta) \mid \xi])^2\bigr] \equiv \mathbf E\bigl[\operatorname{Var}[c(\xi,\zeta) \mid \xi]\bigr]. \tag{11.4.4} \]
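Relation (11.4.4) can be confirmed numerically on a small discrete system (an invented example, purely illustrative): the second derivative of the averaged potential $\nu(\beta) = \int P(d\xi)\ln\int e^{-\beta c}\,P_I(d\zeta)$, taken by finite differences, matches the averaged conditional variance of the cost under the tilted conditional distribution $P(d\zeta\mid\xi) \propto e^{-\beta c}\,P_I(d\zeta)$.

```python
import math

p = [0.5, 0.3, 0.2]                   # invented P(xi)
q = [0.25, 0.25, 0.5]                 # invented P_I(zeta)
c = [[0.0, 1.0, 2.0],
     [1.5, 0.2, 0.7],
     [2.0, 1.0, 0.1]]                 # invented cost matrix c(xi, zeta)
beta = 0.7

def nu(b):
    """Averaged potential nu(b) = sum_xi p(xi) ln sum_zeta q(zeta) e^{-b c}."""
    return sum(px * math.log(sum(qz * math.exp(-b * cz)
                                 for qz, cz in zip(q, row)))
               for px, row in zip(p, c))

h = 1e-3
d2 = (nu(beta + h) - 2 * nu(beta) + nu(beta - h)) / (h * h)

# E[Var[c | xi]] under the tilted conditional P(zeta|xi) ~ q(zeta) e^{-beta c}
total = 0.0
for px, row in zip(p, c):
    w = [qz * math.exp(-beta * cz) for qz, cz in zip(q, row)]
    s = sum(w)
    pi = [wi / s for wi in w]
    m = sum(pi_z * cz for pi_z, cz in zip(pi, row))
    total += px * (sum(pi_z * cz * cz for pi_z, cz in zip(pi, row)) - m * m)
print(d2, total)   # the two agree: nu''(beta) = E[Var[c | xi]]
```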
Let us now consider the mean square term $\mathbf E[(\beta\,\delta\mu + \delta\mu')^2]$ in (11.4.1),
which, due to (11.4.3), coincides with the variance
\[ \mathbf E\bigl[(\beta\,\delta\mu + \delta\mu')^2\bigr] = \operatorname{Var}\bigl[\beta\mu_\xi(-\beta) + \mu_\xi'(-\beta)\bigr]. \tag{11.4.6} \]
In view of (9.4.21), this integral is nothing but the integral averaging with respect
to conditional distribution P(d ζ | ξ ), i.e.
Thus, we see that expression (11.4.6) is exactly the variance of the partially aver-
aged random information:
\[ \mathbf E\bigl[(\beta\,\delta\mu + \delta\mu')^2\bigr] = \operatorname{Var}\bigl[\beta\mu_\xi(-\beta) + \mu_\xi'(-\beta)\bigr] = \operatorname{Var}\bigl[I_\zeta(\,\cdot \mid \xi)\bigr]. \tag{11.4.9} \]
Following from (11.4.5), (11.4.9), the main formula (11.4.1) takes the form
\[ 0 \le 2\beta\bigl[V(I) - \tilde V(I)\bigr] \le \ln\Bigl[\frac{2\pi}{\gamma^2}\,\mathbf E\bigl[\operatorname{Var}[I(\xi,\zeta) \mid \xi]\bigr]\Bigr] + \operatorname{Var}\bigl[I_\zeta(\,\cdot \mid \xi)\bigr]\Big/\mathbf E\bigl[\operatorname{Var}[I(\xi,\zeta) \mid \xi]\bigr] + o(1). \tag{11.4.10} \]
By averaging this equality over ξ and taking into account (9.4.10), (11.4.2) we ob-
tain
ν (−t) = Γ (−t) ∀t. (11.4.12)
That is why we can replace ν (β ) by Γ (β ) in formula (11.4.1). Further, it is useful
to recall that due to (9.4.31) the value of β is related to the derivatives of functions
R(I), V (I):
\[ \frac{1}{\beta} = -\frac{dR}{dI} = V'(I), \tag{11.4.13} \]
so that
\[ \beta = 1/V'(I). \tag{11.4.14} \]
The second derivative $\Gamma''(\beta)$ can also be expressed in terms of the function $I(R)$ (or
$V(I)$), because these functions are related by the Legendre transform [see (9.4.30)
and (9.4.29)].
and (9.4.29)]. Differentiating (9.4.30) we have
\[ \beta\,\Gamma''(\beta) = \frac{dI}{d\beta}. \]
This implies
\[ \frac{1}{\beta^2\Gamma''(\beta)} = \frac{1}{\beta}\,\frac{d\beta}{dI}, \]
or, equivalently, if we differentiate (11.4.14) and take into account (11.4.13),
On the strength of (11.4.14), (11.4.15), (11.4.9) the main formula (11.4.1) can be
written as follows:
\[ 0 \le V(I) - \tilde V(I) \le \frac{1}{2}V'(I)\,\ln\Bigl[-\frac{2\pi}{\gamma^2}\,\frac{V'(I)}{V''(I)}\Bigr] - \frac{1}{2}V''(I)\,\operatorname{Var}\bigl[I_\zeta(\,\cdot \mid \xi)\bigr] + o(1). \tag{11.4.16} \]
Besides, the function (11.4.11) sometimes turns out to be independent of $\xi$. We
encountered such a phenomenon in Section 10.2, where we derived formula (10.2.25)
for one particular case. According to the latter we have $\mu_\xi(t) = \Gamma(-t)$, so that
$\mu_\xi(t) = \nu(-t) = \Gamma(-t)$, $\delta\mu(t) = 0$, and averaging over $\xi$ becomes redundant.
In this case, the variance (11.4.6), (11.4.9) vanishes and formula (11.4.16) can be
somewhat simplified, taking the form
\[ V(I) \ge \tilde V(I) \ge V(I) - \frac{1}{2}V'(I)\,\ln\Bigl[-\frac{2\pi}{\gamma^2}\,\frac{V'(I)}{V''(I)}\Bigr] + o(1). \tag{11.4.17} \]
At the same time the analysis carried out in item 4 of the previous section becomes
redundant.
3. In some important cases the sequence of values of $I$ and the sequence of
Bayesian systems $[P(d\xi), c(\xi,\zeta)]$ (dependent on $n$ or another parameter) are such
that for an extremum distribution:

A. \[ I = I_{\xi\zeta} \to \infty. \tag{11.4.18} \]

B. There exist finite non-zero limits
\[ \lim \frac{\mathbf E\bigl[\operatorname{Var}[I(\xi,\zeta) \mid \xi]\bigr]}{I}, \qquad \lim \frac{\operatorname{Var}\bigl[I_\zeta(\,\cdot \mid \xi)\bigr]}{I}, \qquad \lim \frac{V(yI)}{I}, \qquad \lim \frac{dV(I)}{dI} \tag{11.4.19} \]
($y$ is arbitrary and independent of $I$).
It is easy to see that the sum
\[ \mathbf E\bigl[\operatorname{Var}[I(\xi,\zeta) \mid \xi]\bigr] + \operatorname{Var}\bigl[I_\zeta(\,\cdot \mid \xi)\bigr] = \mathbf E\bigl[\mathbf E[I^2(\xi,\zeta) \mid \xi] - (\mathbf E[I(\xi,\zeta) \mid \xi])^2\bigr] + \mathbf E\bigl[(\mathbf E[I(\xi,\zeta) \mid \xi])^2\bigr] - \bigl(\mathbf E\bigl[\mathbf E[I(\xi,\zeta) \mid \xi]\bigr]\bigr)^2 \]
coincides with the total variance $\operatorname{Var}[I(\xi,\zeta)]$. Consequently, the existence of the
first two limits in (11.4.19) implies the existence of the finite limit
\[ \lim \frac{1}{I}\operatorname{Var}[I(\xi,\zeta)]. \]
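The identity used here is the usual decomposition of a total variance into within-group and between-group parts. A direct check for the random information $I(\xi,\zeta) = \ln[p(\xi,\zeta)/(p(\xi)p(\zeta))]$ on an invented $3\times3$ joint distribution (illustration only):

```python
import math

P = [[0.10, 0.05, 0.15],
     [0.05, 0.20, 0.05],
     [0.15, 0.05, 0.20]]              # invented joint p(xi, zeta), sums to 1
px = [sum(row) for row in P]
pz = [sum(P[i][j] for i in range(3)) for j in range(3)]

def info(i, j):                        # random information I(xi, zeta)
    return math.log(P[i][j] / (px[i] * pz[j]))

EI = sum(P[i][j] * info(i, j) for i in range(3) for j in range(3))
VarI = sum(P[i][j] * (info(i, j) - EI) ** 2
           for i in range(3) for j in range(3))

e_var = 0.0                            # E[ Var[I | xi] ]
var_e = 0.0                            # Var[ E[I | xi] ]
for i in range(3):
    cond = [P[i][j] / px[i] for j in range(3)]
    m = sum(cond[j] * info(i, j) for j in range(3))
    v = sum(cond[j] * (info(i, j) - m) ** 2 for j in range(3))
    e_var += px[i] * v
    var_e += px[i] * (m - EI) ** 2
print(e_var + var_e, VarI)   # the sum equals the total variance
```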
Therefore, on the one hand, A and B imply the conditions of informational stability of $\xi$, $\zeta$
mentioned in Section 7.3 (this can be derived in a standard way by using Chebyshev's
inequality). Also, B implies (11.2.34). Thus, if in addition the boundedness condi-
tion (11.2.39) and the continuity of function (11.2.34) with respect to $y$ are satisfied,
then, according to Theorem 11.2, convergence (11.2.35) will take place.
other hand, it follows from (11.4.18) together with the finiteness of limits (11.4.19)
and equation (11.4.14) that
\[ \frac{1}{2\beta V(I)}\,\frac{\operatorname{Var}\bigl[I_\zeta(\,\cdot \mid \xi)\bigr]}{\mathbf E\bigl[\operatorname{Var}[I(\xi,\zeta) \mid \xi]\bigr]} = \frac{1}{2I}\,\frac{dV}{dI}\,\frac{I}{V(I)}\,\frac{\operatorname{Var}\bigl[I_\zeta(\,\cdot \mid \xi)\bigr]}{I}\,\frac{I}{\mathbf E\bigl[\operatorname{Var}[I(\xi,\zeta) \mid \xi]\bigr]} \to 0. \tag{11.4.20} \]
Furthermore,
\[ \frac{1}{I}\ln\Bigl[\frac{2\pi}{\gamma^2}\,\mathbf E\bigl[\operatorname{Var}[I(\xi,\zeta) \mid \xi]\bigr]\Bigr] = \frac{\ln I}{I} + \frac{1}{I}\ln\Bigl[\frac{2\pi}{\gamma^2}\,\frac{\mathbf E\bigl[\operatorname{Var}[I(\xi,\zeta) \mid \xi]\bigr]}{I}\Bigr] \to 0. \]
Further, in order to compute the variance in (11.4.9) we should use formula (10.3.43).
The variance of expressions quadratic in Gaussian variables was calculated earlier
in Section 5.4. It is easy to see that by applying the same computational method
(since $g^T k_x g = \tilde k_x$).
After the substitution of (11.4.22), (11.4.23) into (11.4.1), (11.4.6) we will have
\[ 0 \le 2\beta\bigl[V(I) - \tilde V(I)\bigr] \le \ln\Bigl[\frac{\pi}{\gamma^2}\operatorname{tr}\Bigl(1_u - \frac{1}{\beta}h k_x^{-1}\Bigr)^2\Bigr] + \operatorname{tr}\Bigl(1_u - \frac{1}{\beta}h \tilde k_x^{-1}\Bigr)^2 \Big/ \operatorname{tr}\Bigl(1_u - \frac{1}{\beta}h k_x^{-1}\Bigr)^2 + o(1). \tag{11.4.24} \]
The requirement of existence of the limit lim (dV /dI) appearing in condition B
(11.4.19) is equivalent, in view of (11.4.14), to the requirement of the existence of
the limit
lim β = β0 . (11.4.26)
The first two limits in (11.4.19) can be rewritten as
Finally, the condition of the existence of the limit V(yI)/I, in view of (10.3.32),
takes the form

lim tr(kx h⁻¹ − βy⁻¹ 1u) / tr ln(β kx h⁻¹),    (11.4.28)

where βy is determined from the condition

tr ln(βy kx h⁻¹) = 2yI = y tr ln(β kx h⁻¹).    (11.4.29)
(ξ , ζ are discrete random variables) turn into the standard results, can be naturally
named the generalized Shannon’s theorem.
The direction of research so defined is closely related to the third variational problem
and the material covered in Chapters 9–11, because it involves the introduction
of the cost function c(ξ, ζ) and the assumption that the distribution P(dξ) is given a
priori. Results of different strength and detail can be obtained in this direction. We
present here a single theorem, which follows almost immediately from the standard
Shannon theorem and the results of Section 11.2, and consequently does not require a
new proof.
Let ξ , ζ be random variables defined by the joint distribution P(d ξ d ζ ). Let also
[P(d η̃ | η )] be a channel such that its output variable (or variables) η̃ is connected
with its input variable η by the conditional probabilities P(d η̃ | η ). The goal is to
conduct encoding ξ → η and decoding η̃ → ζ̃ by selecting probabilities P(d η | ξ )
and P(d ζ̃ | η̃ ), respectively, in such a way that the distribution P(d ξ , d ζ̃ ) induced by
the distributions P(d ξ ), P(d η | ξ ), P(d η̃ | η ), P(d ζ̃ | η̃ ) coincides with the initial
distribution P(d ξ d ζ ). One can see that the variables ξ , η , η̃ , ζ̃ connected by the
scheme
ξ → η → η̃ → ζ̃,    (11.5.2)
form a Markov chain with transition probabilities mentioned above. We apply for-
mula (6.3.7) and obtain that
At the same time using the same formula we can derive that
According to the Markov property, the future does not depend on the past, if the
present is fixed. Consequently, mutual information between the past (ξ ) and the
future (η̃ ) with the fixed present (η ) equals zero: Iξ η̃ |η = 0. By the same reasoning
Equating (11.5.4) with (11.5.3) and taking into account (11.5.5) we get
Iξ ζ̃ ≤ Iη η̃ , or Iξ ζ̃ ≤ C.    (11.5.7)
It can be seen from here that the distribution P(dξ dζ̃) can copy the initial distribution
P(dξ dζ) only under the following necessary condition:

Iξ ζ ≤ C.    (11.5.8)
11.5 Generalized Shannon’s theorem 387
Otherwise, no methods of encoding and decoding exist, i.e. no P(dη | ξ), P(dζ̃ | η̃),
such that P(dξ dζ̃) coincides with P(dξ dζ).
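The chain (11.5.2) and the resulting inequality (11.5.7) can be checked numerically. The sketch below (all distributions and channel matrices are arbitrary illustrative choices, not taken from the text) builds a discrete Markov chain ξ → η → η̃ → ζ̃ from random stochastic matrices and verifies the data-processing inequality I_{ξζ̃} ≤ I_{ηη̃}:

```python
import numpy as np

rng = np.random.default_rng(0)

def mutual_info(joint):
    """Mutual information (in nats) of a discrete joint distribution."""
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float((joint[mask] * np.log(joint[mask] / (px @ py)[mask])).sum())

def random_channel(n_in, n_out):
    """Random stochastic matrix: rows are conditional distributions."""
    m = rng.random((n_in, n_out))
    return m / m.sum(axis=1, keepdims=True)

p_xi = rng.random(4); p_xi /= p_xi.sum()   # P(ξ)
enc = random_channel(4, 3)                 # P(η | ξ)   (encoding)
chan = random_channel(3, 3)                # P(η̃ | η)   (the channel)
dec = random_channel(3, 4)                 # P(ζ̃ | η̃)   (decoding)

# Joint distributions along the Markov chain ξ → η → η̃ → ζ̃
p_xi_eta = p_xi[:, None] * enc             # P(ξ, η)
p_eta = p_xi_eta.sum(axis=0)
p_eta_etat = p_eta[:, None] * chan         # P(η, η̃)
p_xi_etat = p_xi_eta @ chan                # P(ξ, η̃)
p_xi_zeta = p_xi_etat @ dec                # P(ξ, ζ̃)

I_xz = mutual_info(p_xi_zeta)
I_hh = mutual_info(p_eta_etat)
assert 0 <= I_xz <= I_hh + 1e-12           # (11.5.7): I_{ξζ̃} ≤ I_{ηη̃}
```

Whatever stochastic matrices are chosen, the assertion holds, since ξ and ζ̃ communicate only through the pair η, η̃.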
As we can see, this conclusion follows from the most general properties of the
considered concepts. The next fact is less trivial: condition (11.5.8) or, more pre-
cisely, the condition
lim sup Iξ ζ /C < 1, (11.5.9)
is sufficient in an asymptotic sense, if not for the coincidence of P(dξ dζ̃) with P(dξ dζ),
then at least for a relatively good (in some sense) quality of this distribution. In order
to formulate the specified fact, we need to introduce a quality criterion—the cost
function c(ξ , ζ ). The condition of equivalence between P(d ξ d ζ) and P(d ξ d ζ ) can
be replaced by a weaker condition
|E[c(ξ, ζ)]|⁻¹ ϑ+( E[c(ξ, ζ̃)] − E[c(ξ, ζ)] ) → 0    (11.5.10)

(2ϑ+(z) = z + |z|),
which points to the fact that the quality of distribution P(d ξ d ζ̃ ) is asymptotically
not worse than the quality of P(d ξ d ζ ). The primary statement is formulated for a
sequence of schemes (11.5.2) dependent on n.
Theorem 11.3. Suppose that
1. the sequence of pairs of random variables ξ , ζ is informationally stable (see
Section 7.3);
2. the convergence

(c(ξ, ζ) − E[c(ξ, ζ)]) / E[c(ξ, ζ)] → 0    (11.5.11)
in probability P(d ξ d ζ ) takes place [compare with (11.2.31)];
3. the sequence of cost functions satisfies the boundedness condition
in exactly the same way that we used to prove (11.2.33a) on the basis of conditions
of informational stability and conditions (11.2.31), (11.2.32) in Section 11.2 [see
also the derivation of (11.2.30)]. In (11.5.13) ε, δ are arbitrarily small positive
values independent of n, and R(Iξ ζ + 2ε₂ Iξ ζ) is the average cost:

R(Iξ ζ + 2ε₂ Iξ ζ) = E[ inf_ζ E[c(ξ, ζ) | Ek] ],    (11.5.14)
Further, we convey the message about the index of the region Ek containing ξ through
the channel P(dη̃ | η).
In consequence of (11.5.9) we can always choose ε₂ such that the inequality
[Iξζ + ε₂(2Iξζ + C)]/C < 1 holds true for all n > N, for some fixed N. Since M =
[exp(Iξζ + 2ε₂Iξζ)], the last statement means that the inequality ln M/C < 1 − ε₂
is valid, i.e. (8.1.5) is satisfied. The latter together with requirement (4) of Theo-
rem 11.3 assures that Theorem 8.1 (generalized similarly to Theorem 7.2) can be
applied here. According to Theorem 8.1 the probability of error for a message re-
ception through a channel can be made infinitely small with a growth of n:
P(dξ dζ̃) = ∑_{k,l} P(dξ) ϑ_{Ek}(ξ) P(l | k) δ(ζ̃ − ζl) dζ̃,    (11.5.17)
= ∑_k P(Ek) E[c(ξ, ζk) | Ek] +
+ ∑_{k≠l} P(Ek) P(l | k) {E[c(ξ, ζl) | Ek] − E[c(ξ, ζk) | Ek]}.
Averaging over the ensemble of random codes and taking into account (11.5.16) we
obtain that

E[ ∑_k P(Ek) Pow(· | k) ] = E[Pow(· | k)] = Pow < ε₄.
From here we can conclude that there exists some random code which is not worse
than the first in the sense of the inequality
We assumed above that the initial distribution P(dξ dζ) is given. The provided
consideration can easily be extended to the cases in which there is a set of such
distributions, or the Bayesian system [P(dξ), c(ξ, ζ)] with a fixed level of cost
E[c(ξ, ζ)] ≤ a.
condition (11.5.12) is replaced by (11.2.39), while conditions (1) and (2) of The-
orem 11.3 need to be replaced by the requirements of information stability of the
Bayesian system (Section 11.2, paragraph 4). After these substitutions one can apply
Theorem 11.2.
In conclusion, let us confirm that the regular formulation of Shannon's theorem
follows from the stated results. To this end, let ξ and ζ be identical discrete
random variables taking M values. Let us choose the distribution P(ξ, ζ) as
follows:

P(ξ, ζ) = (1/M) δξζ .
In this case we evidently have

Iξ ζ = Hξ = Hζ = ln M.    (11.5.20)

However,

∑_{ξ ≠ ζ̃} P(ξ, ζ̃) = (1/M) ∑_ξ P(ζ̃ ≠ ξ | ξ)
is nothing else but the mean probability of error (7.1.11). Hence, convergence
(11.5.21) coincides with (7.3.7). Thus we have obtained that in this particular case,
Theorem 11.3 actually coincides with Theorem 7.2.
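This special case can be illustrated numerically. In the sketch below the joint distribution with a symmetric per-symbol error rate ε is an illustrative choice; at ε = 0 it collapses to P(ξ, ζ) = δξζ/M and the mutual information equals ln M, as in (11.5.20), while the off-diagonal mass reproduces the mean error probability:

```python
import numpy as np

M, eps = 8, 0.05                 # alphabet size and per-symbol error rate (illustrative)

# Joint distribution: P(ξ, ζ̃) = (1/M)[(1-ε) δ_{ξζ̃} + ε/(M-1)(1-δ_{ξζ̃})]
P = np.full((M, M), eps / (M - 1) / M)
np.fill_diagonal(P, (1 - eps) / M)

mean_error = P.sum() - np.trace(P)    # Σ_{ξ≠ζ̃} P(ξ, ζ̃), cf. (7.1.11)
assert abs(mean_error - eps) < 1e-12

# At ε = 0 the joint becomes P(ξ, ζ) = δ_{ξζ}/M; both marginals are uniform 1/M,
# so p/(p_ξ p_ζ) = M on the diagonal and the mutual information is ln M.
P0 = np.eye(M) / M
mask = P0 > 0
I0 = (P0[mask] * np.log(P0[mask] * M * M)).sum()
assert abs(I0 - np.log(M)) < 1e-12    # I_{ξζ} = ln M, cf. (11.5.20)
```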
Chapter 12
Information theory and the second law of
thermodynamics
In this chapter, we discuss a relation between the concept of the amount of informa-
tion and that of physical entropy. As is well known, the latter allows us to express
quantitatively the second law of thermodynamics, which forbids, in an isolated system,
the existence of processes accompanied by a decrease of entropy. If there
exists an influx of information dI about the system, i.e. if the physical system is
isolated only thermally, but not informationally, then the above law should be gen-
eralized by substituting inequality dH ≥ 0 with inequality dH + dI ≥ 0. Therefore,
if there is an influx of information, then the thermal energy of the system can be con-
verted (without the help of a refrigerator) into mechanical energy. In other words,
the existence of perpetual motion of the second kind powered by information be-
comes possible.
In Sections 12.1 and 12.2, the process of transformation of thermal energy into
mechanical energy using information is analysed quantitatively. Furthermore, a par-
ticular mechanism allowing such conversion is described. This mechanism consists
in installing impenetrable walls and in moving them in a special way inside the
physical system. Thereby, the well-known semi-qualitative arguments related to this
question and, for instance, contained in the book of Brillouin [4] (the corresponding
book in English is [5]) acquire an exact quantitative confirmation.
The generalization of the second law of thermodynamics by no means cancels
its initial formulation. Hence, in Section 12.3 the conclusion is made that, in prac-
tice, expenditure of energy is necessary for measuring the coordinates of a physical
system and for recording this information. If the system is at temperature T, then
in order to receive and record the amount of information I about this system, it is
necessary to spend energy of at least TI. Otherwise, the combination of an automatic
measuring device and an information converter of thermal energy into mechanical
one would result in perpetual motion of the second kind. The above general rule
is corroborated for a particular model of measuring device, which is described in
Section 12.3.
The conclusion about the necessity of minimal energy expenditure is also ex-
tended to noisy physical channels corresponding to a given temperature T (Sec-
tion 12.5). Hence, the second law of thermodynamics imposes some constraints on
© Springer Nature Switzerland AG 2020 391
R. V. Belavkin et al. (eds.), Theory of Information and its Value,
https://doi.org/10.1007/978-3-030-22833-0 12
where

F = −T ln ∫ e^{−E(x)/T} dx    (12.1.2)
is the free energy of the system. The temperature T is measured in energy units, for
which the Boltzmann constant is equal to 1.
It is convenient to assume that we have a thermostat at temperature T , and the dis-
tribution mentioned above reaches its steady state as a result of a protracted contact
with the thermostat.
Within the framework of the general theory of the value of information, distribution (12.1.1)
is a special case of the probability distribution appearing in the definition of a Bayesian
system. Certainly, the general results obtained for arbitrary Bayesian systems in
Chapters 9 and 10 can be extended naturally to this case. Besides, some special phe-
nomena related to the Second law of thermodynamics can be investigated, because
the system under consideration is a physical system. Here we are interested in a pos-
sibility of transforming thermal energy into mechanical energy, which is facilitated
by the inflow of information about the coordinate x.
In defining the values of Hartley’s and Boltzmann’s information amounts (Sec-
tions 9.2 and 9.6) we assumed that incoming information about the value of x has a
simple form. It indicates what region Ek from the specified partition ∑k Ek = X of the
sample space X point x belongs to. This information is equivalent to an indication
of the index of region Ek . Let us show that such information does indeed facilitate
the transfer of thermal energy into mechanical energy. When specifying the region
Ek the a priori distribution (12.1.1) is transformed into a posteriori distribution
12.1 The generalized second law of thermodynamics 393
p(x | Ek) = exp{[F(Ek) − E(x)]/T}   for x ∈ Ek,
p(x | Ek) = 0                       for x ∉ Ek,    (12.1.3)
where

F(Ek) = −T ln ∫_{Ek} e^{−E(x)/T} dx    (12.1.4)
is a conditional free energy. Because it is known that x lies within the region Ek ,
this region can be surrounded by impenetrable walls, and the energy function E(x)
is replaced by the function
E(x | k) = E(x)   if x ∈ Ek,
E(x | k) = ∞      if x ∉ Ek.    (12.1.5)
Because the walls are moved apart slowly, the energy transfer occurs without
changing the temperature of the system. This is the result of the influx of thermal
energy from the thermostat, the contact with which must not be interrupted. The
source of the mechanical energy leaving the system will then be the thermal energy
of the thermostat, which is converted into mechanical work. In order to calculate
the total work Ak, it is necessary to sum
the differentials (12.1.6). When the walls are moved to infinity, the region Ek coin-
cides with entire space X, and also the free energy F(Ek ) coincides with (12.1.2).
Therefore, the total mechanical energy is equal to the difference between the free
energies (12.1.2) and (12.1.4)
Ak = F(Ek ) − F. (12.1.7)
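Since (12.1.2) and (12.1.4) give F(Ek) − F = −T ln P(Ek), the mean of the works (12.1.7) equals T times the Boltzmann entropy of the partition. The following numerical sketch checks this (the energy profile E(x), the temperature, and the partition are arbitrary illustrative choices):

```python
import numpy as np

T = 1.3                                     # temperature in energy units (k_B = 1)
E = 2.0 * np.sin(3 * np.linspace(0, 1, 10000)) ** 2   # illustrative energy levels E(x)
w = np.exp(-E / T)                          # Boltzmann weights

Z = w.sum()
F = -T * np.log(Z)                          # free energy, discrete analogue of (12.1.2)

M = 4                                       # number of regions E_k
regions = np.array_split(np.arange(E.size), M)
P = np.array([w[r].sum() / Z for r in regions])            # a priori P(E_k)
Fk = np.array([-T * np.log(w[r].sum()) for r in regions])  # conditional free energy (12.1.4)
A = Fk - F                                  # work extracted for each region, (12.1.7)

H_E = -(P * np.log(P)).sum()                # Boltzmann amount of information H_{E_k}
assert abs((P * A).sum() - T * H_E) < 1e-9  # mean work = T * H_{E_k}
```

The assertion holds for any energy profile and any partition, since A_k = −T ln P(E_k) identically.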
Supplementing the formula derived above with an inequality sign, to allow for a
non-equilibrium process (one occurring not infinitely slowly), we have

A ≤ T HEk .    (12.1.10)
Thus, we have obtained that the maximum amount of thermal energy turning into
work is equal to the product of the absolute temperature and the Boltzmann amount
of incoming information. The influx of information about the physical system fa-
cilitates the conversion of thermal energy into work without transferring part of the
energy to the refrigerator. The assertion of the Second law of thermodynamics about
the impossibility of such a process is valid only in the absence of information in-
flux. If there is an information influx dI, then the standard form of the second law
permitting only those processes, for which the total entropy of an isolated system
does not decrease:
dH ≥ 0,    (12.1.11)
becomes insufficient. Constraint (12.1.11) has to be replaced by the condition that
the sum of entropy and information does not decrease:
dH + dI ≥ 0.    (12.1.12)
In the above process of converting heat into work, there was the information inflow
ΔI = HEk. The entropy of the thermostat decreased by A/T, i.e. ΔH = −A/T, while
the energy of the system did not change. Consequently, condition (12.1.12)
for the specified process is valid with the equality sign. The equality sign is based on
the ideal nature of the process, which was specially constructed. If the walls encom-
passed a larger region instead of Ek or if their motion apart were not infinitely slow,
then there would be an inequality sign in condition (12.1.12). Also, the obtained
amount of work would be less than (12.1.9a). In view of (12.1.12), it is impossible
to produce from heat an amount of work larger than (12.1.9a).
The idea about the generalization of the Second law of thermodynamics to the
case of systems with an influx of information appeared long ago in connection with
12.2 Influx of Shannon’s information and transformation of heat into work 395
dA = dQ − dU, (12.1.14)
where U = E[E(x)] is the internal energy of the system, which is related to the free
energy F via the famous equation U = F + T Hx . Differentiating the latter, we have
dF = dU − T dHx . (12.1.15)
In this case, the second law (12.1.11) has the form dHT + dHx ≥ 0, i.e. T dHx − dQ ≥ 0,
which, on the basis of (12.1.14), (12.1.15), is equivalent to

dA ≤ −dF.    (12.1.16)
Taking this relation with the equality sign (which corresponds to an ideal process),
we obtain the first relation (12.1.6).
All that has been said above about converting thermal energy into mechanical en-
ergy due to an influx of information, can be applied to the case, in which we have
information that is more complex than just specifying the region Ek , to which x
belongs. We shall have such a case if we make errors in specifying the region Ek .
Suppose that k̃ is the index of the region referred to with a possible error, and k is
the true number of the region containing x. In this case, the amount of incoming
information is determined by Shannon's formula I = HEk − HEk|k̃ . This amount
of information is less than the entropy HEk considered in the previous section.
Further, in a more general case, information about the value of x can come not
in the form of an index of a region, but in the form of some other random vari-
able y connected with x statistically. In this case, the amount of information is also
determined by Shannon’s formula [see (6.2.1)]:
I = Hx − Hx|y . (12.2.1)
The posterior distribution p(x | y) will now have a more complicated form
than (12.1.3). Nevertheless, the generalized second law of thermodynamics will
have the same representation (12.1.12), if I is understood as Shannon’s amount of
information (12.2.1). Now formula (12.1.10) can be replaced by the following
A ≤ T I.    (12.2.2)
In order to verify this, one should consider an infinitely slow isothermal transition
from the state corresponding to the a posteriori distribution p(x | y), having
entropy Hx (| y), to the initial (a priori) state with the given distribution p(x). This
transition has to be carried out in compliance with the second law of thermodynam-
ics (12.1.11), i.e. according to formulae (12.1.13), (12.1.16). Summing up elemen-
tary works (12.1.16) we obtain that every found value y corresponds to the work
Ay ≤ −F + F(y).    (12.2.3)
A ≤ T Hx − T Hx|y ,    (12.2.5)
Assume that we have received a message that y = 1, i.e. x is situated in the left half:
x ∈ [0, V/2]. This corresponds to the following posterior probability density:

p(x | y = 1) = 2(1 − p)/V   for 0 ≤ x ≤ V/2,
p(x | y = 1) = 2p/V         for V/2 < x ≤ V.    (12.2.6)
We install a wall at the point x = z0 = V /2, which we then move slowly. In order to
find the forces acting on the wall we calculate the free energy for every location z of
the wall. Since E(x) ≡ 0, the calculation of free energy is reduced to a computation
of entropy. If the wall has been moved from point x = V /2 to point x = z, then
probability density (12.2.6) should be replaced by the probability density

p(x | z, 1) = (1 − p)/z    for 0 ≤ x < z,
p(x | z, 1) = p/(V − z)    for z < x ≤ V,    (12.2.7)

with entropy

Hx (| 1, z) = (1 − p) ln[z/(1 − p)] + p ln[(V − z)/p]

and free energy

F(1, z) = −T (1 − p) ln[z/(1 − p)] − T p ln[(V − z)/p].    (12.2.8)
Differentiating with respect to z, we find the force acting on the wall
−∂F(1, z)/∂z = (1 − p) T/z − p T/(V − z).    (12.2.9)
If the coordinate of x were located on the left, then the acting force would be equal
to T /z (by analogy with the formula for the pressure of an ideal gas, z plays the
role of a volume); if x were on the right, then the force would be −T /(V − z).
Formula (12.2.9) gives the posterior expectation of these forces, because 1 − p is
the posterior probability of the inequality x < V /2.
The work of the force in (12.2.9) on the interval [z0 , z1 ] can be calculated by
taking the difference of potentials (12.2.8):
The initial position of the wall, as was mentioned above, is in the middle (z0 = V /2).
The final position is such that the probability density (12.2.7) becomes equilibrium.
This yields
A similar result takes place for the second message y = 2. In this case we should
move the wall to the other part of the interval. The mean work A is determined by
the same expression (12.2.11). A comparison of this formula with (12.2.5a) shows
that relation (12.2.2) is valid with the equality sign.
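The equality A = T I in this example can be replayed numerically: the work computed as the difference of potentials (12.2.8) between the initial wall position z₀ = V/2 and the final position z₁ = (1 − p)V (the position at which the density (12.2.7) becomes uniform, i.e. equilibrium) coincides with T times the Shannon information of the binary message. A sketch, with illustrative parameter values:

```python
import numpy as np

T, V, p = 2.0, 1.0, 0.1        # temperature, box size, error probability (illustrative)

def h(p):
    """Binary entropy in nats."""
    return -p * np.log(p) - (1 - p) * np.log(1 - p)

def F(z):
    """Free energy (12.2.8) for the message y = 1, wall at position z."""
    return -T * ((1 - p) * np.log(z / (1 - p)) + p * np.log((V - z) / p))

z0 = V / 2                     # initial wall position
z1 = (1 - p) * V               # final position: density (12.2.7) becomes uniform
A = F(z0) - F(z1)              # work of the force (12.2.9) = difference of potentials

I = np.log(2) - h(p)           # Shannon information of the binary message, cf. (12.2.1)
assert abs(A - T * I) < 1e-12  # relation (12.2.2) holds with the equality sign
```

For p → 0 (error-free message) the work approaches T ln 2, the familiar Szilard-engine value; for p = 1/2 the message carries no information and no work is extracted.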
In conclusion, we have to point out that the condition of equilibrium of the initial
probability density p(x) [see (12.1.1)] is optional. If the initial state p(x) of a
physical system is non-equilibrium, then we should instantly (while the coordinates
x have not yet changed) ‘take into action’ some new Hamiltonian Ẽ(x), which differs
from the original E(x) in such a way that the corresponding probability density is
equilibrium. After that, we can apply all of our previous reasoning. When conversion of
thermal energy into work is finished, we must instantly ‘turn off’ Ẽ(x) by proceed-
ing to E(x). Expenditure of energy for ‘turning on’ and ‘turning off’ the Hamiltonian
Ẽ(x) compensate each other. Thus, the informational generalization of the Second
law of thermodynamics does not depend on the condition that the initial state be
equilibrium. This generalization can be stated as follows:
12.3 Energy costs of creating and recording information. An example 399
Here thermal isolation means that the thermal energy that can be converted into
work is taken from the system S itself, i.e. the thermostat is included into S.
As is well known, the second law of thermodynamics is asymptotic and not quite
exact. It is violated for processes related to thermal fluctuations. In view of this, we
can give an adjusted (relaxed) formulation of this law: in a heat-insulated system
we cannot observe processes for which the entropy increment is

ΔH ≪ −1.    (12.2.12)
In the previous sections the random variable y carrying information about the coor-
dinate (or coordinates) x of a physical system was assumed to be known beforehand
and related statistically (correlated) with x. The problem of physical realization of
such an information carrying random variable must be considered specially. It is
natural to treat this variable as one (or several) coordinate (coordinates) out of a set
of coordinates (i.e. dynamic variables) of some other physical system S0 , which we
shall refer to as the ‘measuring device’. The Second law of thermodynamics implies
some assertions about the physical procedure for creating an information signal
y statistically related to x. These assertions concern the energy costs necessary to
create y. They are, in some way, the converse of the statements given in Sections 12.1
and 12.2.
The fact is that the physical system S (with a thermostat) discussed above, which
‘converts information and heat into work’, and the measuring instrument S0 (acting
automatically) that creates the information can be combined into one system, and then
the Second law of thermodynamics can be applied to it. Information will no longer
be entering the combined system, and thereby inequality (12.1.12) will turn into its
usual form (12.1.11) for it. If the measuring instrument or a thermostat (with which
the instrument is in contact) has the same temperature T as the physical system with
coordinate x, then the overall conversion of heat into work will be absent according
to the usual Second law of thermodynamics. This means that mechanical energy of
type (12.1.10) or (12.2.5) must be converted into the heat released in the measuring
instrument. It follows that every measuring instrument at temperature T must convert
into heat an amount of energy no less than T I in order to create the amount of
information I about the coordinate of a physical system.
Let us first check this inherent property of any physical instrument on one simple
model of a real measuring instrument. Let us construct a meter such that measur-
ing the coordinate x of a physical system does not influence the behaviour of this
system. It is convenient to consider a two-dimensional or even a one-dimensional
model. Let x be the coordinate (or coordinates) of the centre of a small metal ball
that moves without friction inside a tube made of insulating material (or between
parallel plates). Pairs of electrodes (metal plates across which a potential difference
is applied) are embedded flush into the ground surface of the insulator. When the ball
takes a certain position, it connects a pair of electrodes and thus closes the circuit
(Figure 12.1). Watching the current, we can determine the position of the ball. Thus,
the space of values of x is divided into regions E1 , . . . , EM corresponding to the size
of the electrodes. If the number of electrode pairs is equal to M = 2ⁿ, i.e. an
integer power of 2, then it is convenient to select n current sources and to connect
the electrodes in such a way that the presence of current (or its absence)
from one source gives one bit of information about the index of the regions E1, . . . , EM.
An example of such a connection for n = 3 is shown in Figures 12.1 and 12.2. When
the circuit corresponding to one bit of information is closed, the resulting current
reverses the magnetization of the magnetic core that is placed inside the induction coil
and plays the role of a memory cell. If the current is absent, the magnetization of
the core is not changed. As a result, the index of the region Ek is recorded on
three magnetic cores in binary code.
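The recording scheme just described can be sketched in a few lines (a minimal illustration; the function name is ours): each of the n current sources writes one binary digit of the region index onto its magnetic core, current present meaning 1 (magnetization reversed) and current absent meaning 0.

```python
n = 3
M = 2 ** n                     # number of regions E_1, ..., E_M

def record(k):
    """n-bit binary record of the region index k (0-based), most significant bit first."""
    return [(k >> j) & 1 for j in reversed(range(n))]

# Every region index has a unique n-bit record, and the record determines the index.
for k in range(M):
    bits = record(k)
    assert len(bits) == n
    assert sum(b << j for j, b in zip(reversed(range(n)), bits)) == k

print(record(5))               # region E_6 (index 5) -> [1, 0, 1]
```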
The sizes of the regions Ek (i.e. of the plates) can be selected for optimality
considerations according to the theory of the value of information. If the goal is to
produce maximum work (12.1.9a), then the regions Ek have to be selected such that
their probabilities are equal
In this case, Boltzmann’s and Hartley’s information amounts will coincide, HEk =
ln M, and formula (12.1.9a) gives the maximum mechanical energy equal to T ln M.
The logarithm ln M of the number of regions can be called the limit information
capacity of the measuring instrument.
In reality, the amount of information produced by the measuring instrument turns
out to be smaller than ln M as a result of errors arising in the instrument. A source of
these errors is thermal fluctuations, in particular fluctuations of the current flowing
through the coils Lr . We suppose that the temperature T of the measuring device is
given (T is measured in energy units). According to the laws of statistical physics,
the mean energy of fluctuation current ifl flowing through the induction coil L is
determined by the formula
(1/2) L E[i²fl] = (1/2) T.    (12.3.2)
Thus, it is proportional to the absolute temperature T . The useful current ius from
the emf source Er (Figures 12.1 and 12.2) is added to the fluctuational current ifl .
It is the average energy of the useful current Li2us /2 that constitutes in this case the
energy costs that have been mentioned above and are inherent in any measuring
instrument. Let us find the connection between the energy costs and the amount of
information, taking into account the fluctuation current ifl.
For a given useful current ius, the total current i = ifl + ius has a Gaussian distribution
with the variance found from (12.3.2), that is,

p(i | ius) = √(L/2πT) e^{−L(i−ius)²/2T}.    (12.3.3)
For M = 2ⁿ, the useful current [if (12.3.1) is satisfied] is equal to 0 with probability
1/2, or to some value i1 with probability 1/2. Hence,

p(i) = (1/2) √(L/2πT) [e^{−Li²/2T} + e^{−L(i−i1)²/2T}].    (12.3.4)
Let us calculate the mutual information Ii,ius . Taking into account (12.3.3), (12.3.4)
we have
ln[p(i | 0)/p(i)] = −ln{ (1/2) [1 + e^{(Lii1/T) − (Li1²/2T)}] },
ln[p(i | i1)/p(i)] = −ln{ (1/2) [e^{−(Lii1/T) + (Li1²/2T)} + 1] }.
The first expression has to be averaged with the weight p(i | 0), and the second with
the weight p(i | i1). Introducing the variable ξ = (i1/2 − i)√(L/T) (and ξ = (i −
i1/2)√(L/T) for the second integral), we obtain the same expression for both integrals:

∫ p(i | 0) ln[p(i | 0)/p(i)] di = ∫ p(i | i1) ln[p(i | i1)/p(i)] di =
= −(1/√(2π)) ∫_{−∞}^{∞} e^{−(ξ−η)²/2} ln[(1/2) + (1/2) e^{−2ξη}] dξ,

where η = (i1/2)√(L/T).
Therefore, the information in question is equal to the same expression. Let us
bring it into the following form:

Ii,ius = η² − (1/√(2π)) ∫_{−∞}^{∞} e^{−(ξ−η)²/2} ln cosh(ξη) dξ.

The second term is obviously negative, because cosh ξη ≥ 1, ln cosh ξη ≥ 0 and,
consequently,

Ii,ius ≤ η² = L i1²/4T.    (12.3.5)
However, the value L i1²/4 is nothing but the mean energy of the useful current:

E[(1/2) L i²us] = (1/2) · (1/2) L i1² + (1/2) · 0 = (1/4) L i1².
Thus, we have just proven that in order to obtain the information Ii,ius , the mean
energy expended on the useful current must be no less than T Ii,ius . All that has been said above
refers only to one coil. Summing over different circuits shows that to obtain total
information I = ∑nr=1 (Ii,ius )r , it is necessary to spend, on the average, the amount of
energy that is not less than T I. Consequently, for the specified measuring instrument
the above statement about incurred energy costs necessary for receiving information
is confirmed. Taking into account thermal fluctuations in the other elements of
the instrument makes the inequality A > T I even stronger.
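The bound (12.3.5) can be checked by evaluating the integral above numerically. The sketch below uses illustrative values of L, T, i1; the mutual information comes out strictly positive and strictly below the energy bound, reflecting the loss caused by the thermal fluctuation current:

```python
import numpy as np

L, T, i1 = 1.0, 1.0, 2.0            # inductance, temperature, useful current (illustrative)
eta = (i1 / 2) * np.sqrt(L / T)

# I_{i,i_us} = eta^2 - E[ln cosh(xi * eta)], where xi ~ N(eta, 1)
xi = np.linspace(eta - 10, eta + 10, 200001)
gauss = np.exp(-0.5 * (xi - eta) ** 2) / np.sqrt(2 * np.pi)
dxi = xi[1] - xi[0]
I = eta**2 - (gauss * np.log(np.cosh(xi * eta))).sum() * dxi

E_us = L * i1**2 / 4                # mean energy of the useful current
assert 0 < I < E_us / T             # bound (12.3.5): I_{i,i_us} <= L i1^2 / (4T)
```

Increasing i1 (spending more useful-current energy) raises the information, but never above E_us/T, in agreement with the general statement A ≥ T I.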
12.4 Energy costs of creating and recording information. General formulation 403
In the example considered above, the information coordinate y was given by the
currents i flowing through the coils L1 , L2 , . . . (Figures 12.1 and 12.2). In order to
deal with more stable informational variables, better preserved in time, it is reasonable
to place magnets inside the coils, which become magnetized when the current flows.
Then the variables y will be represented by the corresponding magnetizations m1,
m2, . . . . Because the magnetization mr is a function of the initial magnetization mr0
(independent of x) and the current i (m = f(m0, i)), it is clear that the amount
of information Imx will not exceed the amount Iix + Im0x = Iix, that is, the amount Iix
considered earlier. Thus, the inequality Iix ≤ A/T (12.3.5) can only become stronger
under the transition to Imx.
The case of the informational signal y represented by magnetization degrees m1 ,
m2 , . . . of recording magnetic cores is quite similar to the case of recording of the
information onto a magnetic tape, where the number of ‘elementary magnets’ is
rather large. One can see from the example above that the process of information
‘creation’ by a physical measuring device is inseparable from the process of physically
recording the information. The inequality IT ≤ A that was checked above for
a particular example can be proven from general thermodynamic considerations, if
a set of general and precise definitions is introduced.
Let x be a subset of variables ξ of a dynamical system S, and y be a subset of
coordinates η of system S0 , referring to the same instant of time. We call a physical
process associated with systems S and S0 interacting with each other and, perhaps,
with other systems, a normal physical recording of information, if its initial state,
characterized by a multiplicative joint distribution p1(ξ)p2(η), is transformed into a
final state with joint distribution p(ξ, η) having the same marginal distributions p1(ξ) and
p2(η). Prior to and after recording the information the systems S, S0 are assumed to
be non-interacting.
We can now state a general formulation for the main assertion.
Theorem 12.2. If a normal physical recording of information is carried out in con-
tact with a thermostat at temperature T , then the following energy consumption and
energy transfer to the thermostat (in the form of heat) are necessary:
A ≥ IT,    (12.4.1)
where
I = Hx + Hy − Hxy (12.4.2)
is Shannon’s amount of information.
Proof. Let us denote by H+ the entropy of the combined system S + S0 , while HT
will denote the entropy of the thermostat. Applying the Second law of thermody-
namics to the process of information recording we have
ΔH+ + ΔHT ≥ 0.    (12.4.3)
Δ H+ = Hξ η − Hξ − Hη = −Iξ η . (12.4.4)
Thus, the thermostat has received entropy ΔHT ≥ Iξη , and, consequently, the transferred
thermal energy is A ≥ T Iξη . Where has it come from? According to the
conditions of the theorem, there is no interaction between systems S, S0 both in the
beginning and in the end, thereby the mean total energy U+ is the sum of the mean
partial energies: U+ = E[E1 (ξ )] + E[E2 (η )]. They too remain invariant, because the
marginal distributions p1 (ξ ) and p2 (η ) do not change. Hence, Δ U+ = 0 and, thus,
the energy A must come from some external non-thermal energy sources during the
process of information recording. In order to obtain (12.4.1), it only remains to take
into account the inequality Ixy ≤ Iξη . The proof is complete.
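The entropy bookkeeping in the proof, ΔH+ = −Iξη of formula (12.4.4), can be replayed numerically: for any final joint distribution sharing the marginals of the initial product distribution, the entropy change of S + S0 equals minus the Shannon information. A sketch with an illustrative pair of marginals and an illustrative correlated final state:

```python
import numpy as np

def entropy(p):
    """Entropy (in nats) of a discrete distribution given as an array."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

# Initial state: independent joint p1(ξ) p2(η)
p1 = np.array([0.5, 0.3, 0.2])
p2 = np.array([0.6, 0.4])
initial = np.outer(p1, p2)

# Final state: a correlated joint with the SAME marginals
# (a "normal physical recording" in the sense of the definition above)
final = np.array([[0.45, 0.05],
                  [0.10, 0.20],
                  [0.05, 0.15]])
assert np.allclose(final.sum(axis=1), p1)
assert np.allclose(final.sum(axis=0), p2)

I = entropy(p1) + entropy(p2) - entropy(final.ravel())     # Shannon information (12.4.2)
dH_plus = entropy(final.ravel()) - entropy(initial.ravel())
assert abs(dH_plus + I) < 1e-12     # ΔH+ = −I_{ξη}, formula (12.4.4)
assert I > 0
```

Since the mean partial energies are fixed by the marginals, the entropy the thermostat must absorb, and hence the external energy A ≥ T I, follows exactly as in the proof.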
In conclusion of this section we consider one more example of information recording,
of a kind completely different from the example of Section 12.3. A comparison
of these two examples shows that the sources of external energy can be of a
completely different nature.
Suppose that we need to receive and record information about the fluctuating
coordinate x of an arrow rotating about an axis and supported near equilibrium by a
spring Π (Figure 12.3). A positively charged ball is placed at the end of the arrow.
Fig. 12.3 An example of creating and recording information by means of moving together and
apart the arrows that interact with the springs Π
the coordinates x, y of the arrows do not have time to change during the process. The
state we have after this approach is non-equilibrium. Then it transforms into an
equilibrium state, in which correlation between the coordinates x, y of the arrows
is established. It is convenient to assume that the attraction force between the balls
is much stronger than the returning forces of springs, so that the correlation will
be very strong. The transition to the equilibrium is accompanied by the reduction
of the average interaction energy of the balls (‘descent into a potential well’) and
by giving some thermal energy to the thermostat (to the environment). After the
equilibrium (correlated) distribution p(x, y) is established, we move the arrows apart
quickly (both movements close and apart are performed along the axis of rotation
and do not involve the rotation coordinate). In so doing, we spend the work A2 ,
which is, obviously, greater than the work A1 we had before, since the absolute
value of the difference |x − y| has become smaller on the average. The marginal
distributions p(x) and p(y) are nearly the same if the mean E[(x − y)2 ] is small. As
thermodynamic analysis (similar to the one provided in the proof of Theorem 12.2)
shows, A2 − A1 = A ≥ T I in this example.
After moving the arrows apart, there is a correlation between them, but no force
interaction. The process of ‘recording’ information is complete. Of course, the ob-
tained recording can be converted into a different form, say, it can be recorded on a
magnetic tape. In so doing, the amount of information can only be decreased.
In this example, the work necessary for information recording is done by a human
or a device moving the arrows together or apart. According to the general theory,
such work must be expended wherever there is a creation of new information, cre-
ation of new correlations, and not simply reprocessing the old ones. The general
theory in this case only points to a theoretical lower bound on these expenditures.
In practice the expenditure of energy can obviously exceed, and even considerably
exceed, this thermodynamic level. A comparison of the actual expenditure with the
minimum theoretical value allows us to judge the energy efficiency of real devices.
Energy costs are necessary not only for creation and recording of information, but
also for its transmission, if the latter occurs in the presence of fluctuation distur-
bances, for instance, thermal ones. As is known from statistical physics, in linear
systems there is certain mean equilibrium fluctuational energy for every degree of
freedom. This energy equals Tfl /2, where Tfl is the environment (thermostat) tem-
perature. In a number of works (by Brillouin [4] (the corresponding English book is
[5]) and others), researchers came up with an idea that in order to transmit 1 Nat of
information under these conditions it is necessary to have energy at least Tfl (we use
energy units for temperature, so that the Boltzmann constant is equal to 1). In this
section we shall try to make this statement more exact and to prove it.
Let us call a channel described by transition probabilities p(y | x) and the cost
function c(x) a physical channel, if the variable y has the meaning of a complete
set of dynamical variables of some physical system S. The Hamilton function (en-
ergy) of the system is denoted by E(y) (it is non-negative). Taking this function into
account, we can apply standard formulae to calculate the equilibrium potential
Γ (β ) = −β F = ln ∫ e−β E(y) dy (12.5.1)
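For a concrete (hypothetical) one-dimensional system with E(y) = y²/2, the equilibrium potential (12.5.1) can be evaluated by direct quadrature and compared with the Gaussian-integral closed form Γ(β) = ½ ln(2π/β). This sketch is illustrative only; the integration grid and the value of β are arbitrary.

```python
import numpy as np

# Hypothetical system: one degree of freedom with E(y) = y^2 / 2 (units with k_B = 1).
beta = 2.0
y = np.linspace(-50.0, 50.0, 200_001)           # integration grid (arbitrary)
dy = y[1] - y[0]

Z = np.sum(np.exp(-beta * 0.5 * y**2)) * dy     # \int e^{-beta E(y)} dy
Gamma = np.log(Z)                               # equilibrium potential (12.5.1)
F = -Gamma / beta                               # free energy, since Gamma = -beta F

closed_form = 0.5 * np.log(2.0 * np.pi / beta)  # Gaussian integral: sqrt(2 pi / beta)
print(Gamma, closed_form)  # agree to quadrature accuracy
```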
Theorem 12.3. The capacity (see Section 8.1) of a physical channel [p(y | x), c(x)]
satisfies the inequality
Tfl C(a) ≤ E[E(y)] − afl , (12.5.3)
where the level afl and ‘fluctuation temperature’ Tfl are defined by the equations
Tfl−1 = (dH/dR)(afl ); H(afl ) = Hy|x . (12.5.4)
The mean energy E[E(y)] and the conditional entropy Hy|x are calculated using the
extremum probability density p0 (x) (which is assumed to exist) realizing the channel
capacity. It is also assumed that the second equation in (12.5.4) has a root afl belonging
to the normal branch, where Tfl > 0.
Proof. Formulae (12.5.1), (12.5.2) emerge in the solution to the first variational
problem (for instance, see Section 3.6)—the conditional extremum problem for en-
tropy Hy with the constraint E[E(y)] = A. Therefore, the following inequality holds:
Hy ≤ H(E[E(y)]) (12.5.5)
(the level E[E(y)] is fixed). Further, as follows from the general theory (see the
Corollary to Theorem 4.1), the function H(R) is concave. Consequently, its derivative
β (R) = dH(R)/dR (12.5.6)
is a non-increasing function of R.
The channel capacity coincides with Shannon’s amount of information
for the extremum distribution p0 (x), which takes this amount to a conditional maximum. From (12.5.5) and the usual inequality Hy ≥ Hy|x we find that
Because afl is a root of the equation Hy|x = H(afl ) in the normal branch of the concave function H(·), where the derivative is non-negative, regardless of which branch the value E[E(y)] belongs to, equation (12.5.4) implies the inequality afl ≤ E[E(y)].
Taking this inequality into account together with the non-increasing nature of the
derivative (12.5.6) we obtain that
y = x+ζ. (12.5.10)
Hy|x = Hζ . (12.5.12)
We recall that the first variational problem, a solution to which is either the afore-
mentioned function H(R) or the inverse function R(H), can be interpreted (as is
known from Section 3.2) as the problem of minimizing the mean energy E[E] with
fixed entropy. Thus, R(Hζ ) is the minimum value of energy possible for fixed en-
tropy Hζ , i.e.
E[E(ζ )] ≥ R(Hζ ). (12.5.13)
However, the value R(Hζ ) due to (12.5.4), (12.5.12) is nothing but afl , and there-
fore (12.5.13) can be rewritten as follows
E[E(ζ )] ≥ afl . (12.5.14)
From (12.5.11) and (12.5.14) we obtain E[E(y)] − afl ≤ E[E(x)]. This inequality
allows us to transform the basic derived inequality (12.5.3) to the form Tfl C(a) ≤
E[E(x)] or

Tfl C(a) ≤ a, (12.5.15)
if the cost function c(x) coincides with the energy E(x).
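For the Gaussian physical channel the inequality (12.5.15) can be verified directly. The sketch below assumes the standard Shannon capacity formula for y = x + ζ with E(y) = y²/2: equilibrium thermal noise at temperature Tfl has E[ζ²] = Tfl, the constraint E[E(x)] = a gives E[x²] = 2a, and hence C(a) = ½ ln(1 + 2a/Tfl) nats per use. The temperature and energy grid are arbitrary.

```python
import numpy as np

T_fl = 1.5                              # thermostat temperature (energy units, k_B = 1)
a = np.linspace(1e-3, 10.0, 1000)       # admissible mean signal energy E[E(x)] = a

# Shannon capacity (in nats) of the additive Gaussian channel with SNR = 2a / T_fl.
C = 0.5 * np.log1p(2.0 * a / T_fl)

# Inequality (12.5.15): T_fl * C(a) <= a, i.e. at least T_fl of energy per nat.
ratio = T_fl * C / a
print(ratio.max())  # below 1; it approaches 1 only in the weak-signal limit a -> 0
```

Since ln(1 + u) ≤ u, the bound Tfl C(a) ≤ a holds for every a, with near-equality only for vanishingly weak signals, in agreement with the discussion of Brillouin's estimate above.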
The results given here concerning physical channels are closely connected to
Theorem 8.5. In this theorem, however, the ‘temperature parameter’ 1/β has a for-
mal mathematical meaning. In order for this parameter to have the meaning of phys-
ical temperature, the costs c(x) or b(y) have to be specified as physical energy.
According to inequality (12.5.15), in order to transmit one nat of informa-
tion through a Gaussian physical channel we need energy that is not less than
Tfl . It should be noted that we could not derive any universal inequality of the
type (12.5.15) containing a real (rather than effective) temperature.
Appendix A
Some matrix (operator) identities
Suppose we have two arbitrary, not necessarily square, matrices A and B such that
the matrix products AB and BA are defined. In operator language, this means the
following: if A maps an element of space X into an element of space Y , then B
maps an element of space Y into an element of space X, i.e. it acts in the opposite
direction. Under broad assumptions about the function f (z), the next formula holds:
Let us prove this formula under the assumption that f can be expressed as the
Taylor series
f (z) = ∑∞n=0 (1/n!) f (n) (0) zn . (A.1.2)

Then

tr f (AB) = tr f (0) + f ′ (0) tr(AB) + (1/2) f ′′ (0) tr(ABAB) + · · · , (A.1.3)
tr f (BA) = tr f (0) + f ′ (0) tr(BA) + (1/2) f ′′ (0) tr(BABA) + · · · . (A.1.4)
However, tr(AB) = tr(BA) = ∑i j Ai j B ji and, therefore, tr(A[(BA)k B]) = tr([(BA)k B]A), i.e. tr[(AB)k+1 ] = tr[(BA)k+1 ].
This is why all the terms in (A.1.3), (A.1.4), apart from the first one, are identical.
In general, the first terms tr( f (0)) are not identical, because the operator f (0) in the
expansion of f (AB) and the same operator in the expansion of f (BA) are multiples
of the identity matrices of different dimensions. However, if the following condition
is met
f (0) = 0 , (A.1.5)
then, consequently,
tr f (AB) = tr f (BA). (A.1.6)
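The identity (A.1.6) is easy to test numerically for non-square A, B, using the fact that tr f(M) equals the sum of f over the eigenvalues of M. The matrices below and the choice f(z) = e^z − 1 (which satisfies the condition f(0) = 0) are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3))   # A maps X = R^3 into Y = R^2
B = rng.standard_normal((3, 2))   # B maps Y = R^2 into X = R^3

def tr_f(M):
    """tr f(M) for f(z) = e^z - 1 (so f(0) = 0), via the eigenvalues of M."""
    lam = np.linalg.eigvals(M)
    return np.sum(np.exp(lam) - 1.0).real

lhs = tr_f(A @ B)   # trace over a 2x2 matrix
rhs = tr_f(B @ A)   # trace over a 3x3 matrix; its extra eigenvalue 0 adds f(0) = 0
print(lhs, rhs)     # equal up to rounding
```

Although AB and BA have different dimensions, their nonzero eigenvalues coincide, and the condition f(0) = 0 annihilates the contribution of the extra zero eigenvalues.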
If we were interested in determinants, then for the corresponding equality
Consider the block matrix

K = [[A, B], [C, D]],

where A, D are square matrices, and B, C are arbitrary matrices. Matrix D is assumed
to be non-singular. Let us denote

L = [[1, 0], [0, D−1]] [[A, B], [C, D]] = [[A, B], [D−1C, 1]], so that K = [[1, 0], [0, D]] L.
However, L = [[1, B], [0, 1]] [[A − BD−1C, 0], [D−1C, 1]], and

det [[1, B], [0, 1]] = 1; det [[A − BD−1C, 0], [D−1C, 1]] = det(A − BD−1C),

so that det K = det D · det(A − BD−1C).
According to the above formulas, the problem of calculating the original deter-
minant is reduced to the problem of calculating determinants of smaller dimension.
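The resulting factorization det K = det D · det(A − BD⁻¹C) can be checked numerically; the block sizes and random entries in this sketch are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3))   # square block
B = rng.standard_normal((3, 2))   # arbitrary block
C = rng.standard_normal((2, 3))   # arbitrary block
D = rng.standard_normal((2, 2))   # square block, assumed non-singular

K = np.block([[A, B],
              [C, D]])            # the full 5x5 block matrix

lhs = np.linalg.det(K)
rhs = np.linalg.det(D) * np.linalg.det(A - B @ np.linalg.inv(D) @ C)
print(lhs, rhs)  # agree up to rounding
```

The 5×5 determinant is thus reduced to a 2×2 and a 3×3 determinant, exactly as the text describes.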
References
17. Grishanin, B.A., Stratonovich, R.L.: The value of information and sufficient
statistics when observing a stochastic process. Izv. USSR Acad. Sci. Tech. Cy-
bern. 6, 4–12 (1966, in Russian)
18. Hartley, R.V.L.: Transmission of information. Bell Syst. Tech. J. 7(3) (1928)
19. Hartley, R.V.L.: Transmission of information (Translation to Russian). In:
A. Harkevich (ed.) Theory of Information and Its Applications. Fizmatgiz,
Moscow (1959)
20. Hill, T.L.: Statistical Mechanics. McGraw-Hill Book Company Inc., New York
(1956)
21. Hill, T.L.: Statistical Mechanics (Translation to Russian). Inostrannaya Liter-
atura, Moscow (1960)
22. Hirsch, M.J., Pardalos, P.M., Murphey, R. (eds.): Dynamics of Information Sys-
tems: Theory and Applications. Springer Optimization and Its Applications Se-
ries, vol. 40. Springer, Berlin (2010)
23. Huffman, D.A.: A method for the construction of minimum redundancy codes.
Proc. IRE 40(9), 1098–1101 (1952)
24. Jahnke, E., Emde, F.: Tables of Functions with Formulae and Curves. Dover
Publications, New York (1945)
25. Jahnke, E., Emde, F.: Tables of Functions with Formulae and Curves (Transla-
tion from German to Russian). Gostekhizdat, Moscow (1949)
26. Kolmogorov, A.N.: Theory of transmission of information. In: USSR Academy
of Sciences Session on Scientific Problems Related to Production Automation.
USSR Academy of Sciences, Moscow (1957, in Russian)
27. Kolmogorov, A.N.: Theory of transmission of information (translation from
Russian). Am. Math. Soc. Translat. Ser. 2(33) (1963)
28. Kraft, L.G.: A device for quantizing, grouping, and coding amplitude-
modulated pulses. Master’s Thesis, Massachusetts Institute of Technology,
Dept. of Electrical Engineering (1949)
29. Kullback, S.: Information Theory and Statistics. Wiley, New York (1959)
30. Kullback, S.: Information Theory and Statistics (Translation to Russian).
Nauka, Moscow (1967)
31. Leontovich, M.A.: Statistical Physics. Gostekhizdat, Moscow (1944, in Rus-
sian)
32. Leontovich, M.A.: Introduction to Thermodynamics. GITTL, Moscow-
Leningrad (1952, in Russian)
33. Pinsker, M.S.: The quantity of information about a Gaussian random stationary
process, contained in a second process connected with it in a stationary manner.
Dokl. Akad. Nauk USSR 99, 213–216 (1954, in Russian)
34. Rao, C.R.: Linear Statistical Inference and Its Applications. Wiley, New York
(1965)
35. Rao, C.R.: Linear Statistical Inference and Its Applications (Translation to Rus-
sian). Inostrannaya Literatura, Moscow (1968)
36. Ryzhik, J.M., Gradstein, I.S.: Tables of Series, Products and Integrals.
Gostekhizdat, Moscow (1951, in Russian)
37. Ryzhik, J.M., Gradstein, I.S.: Tables of Series, Products and Integrals (Transla-
tion from Russian). Academic, New York (1965)
38. Shannon, C.E.: A mathematical theory of communication. Bell Syst. Tech. J.
27 (1948)
39. Shannon, C.E.: Communication in the presence of noise. Proc. IRE 37(1), 10–
21 (1949)
40. Shannon, C.E.: Certain results in coding theory for noisy channels. Inform.
Control 1(1) (1957)
41. Shannon, C.E.: Coding theorems for a discrete source with a fidelity criterion.
IRE Nat. Conv. Rec. 4(1), 142–163 (1959)
42. Shannon, C.E.: Certain results in coding theory for noisy channels (translation
to Russian). In: R.L. Dobrushin, O.B. Lupanov (eds.) Works on Information
Theory and Cybernetics. Inostrannaya Literatura, Moscow (1963)
43. Shannon, C.E.: Coding theorems for a discrete source with a fidelity criterion
(translation to Russian). In: R.L. Dobrushin, O.B. Lupanov (eds.) Works on
Information Theory and Cybernetics. Inostrannaya Literatura, Moscow (1963)
44. Shannon, C.E.: Communication in the presence of noise (translation to Rus-
sian). In: R.L. Dobrushin, O.B. Lupanov (eds.) Works on Information Theory
and Cybernetics. Inostrannaya Literatura, Moscow (1963)
45. Shannon, C.E.: A mathematical theory of communication (translation to Rus-
sian). In: R.L. Dobrushin, O.B. Lupanov (eds.) Works on Information Theory
and Cybernetics. Inostrannaya Literatura, Moscow (1963)
46. Stratonovich, R.L.: On statistics of magnetism in the Ising model (in Russian).
Fizika Tvyordogo Tela 3(10) (1961)
47. Stratonovich, R.L.: On the value of information (in Russian). Izv. USSR Acad.
Sci. Tech. Cybern. 5, 3–12 (1965)
48. Stratonovich, R.L.: Conditional Markov Processes and Their Application to the
Theory of Optimal Control. Moscow State University, Moscow (1966, in Rus-
sian)
49. Stratonovich, R.L.: The value of information when observing a stochastic pro-
cess in systems containing finite automata. Izv. USSR Acad. Sci. Tech. Cybern.
5, 3–13 (1966, in Russian)
50. Stratonovich, R.L.: Amount of information and entropy of segments of station-
ary Gaussian processes. Problemy Peredachi Informacii 3(2), 3–21 (1967, in
Russian)
51. Stratonovich, R.L.: Extremal problems of information theory and dynamic pro-
gramming. Izv. USSR Acad. Sci. Tech. Cybern. 5, 63–77 (1967, in Russian)
52. Stratonovich, R.L.: Conditional Markov Processes and Their Application to
the Theory of Optimal Control (Translation from Russian). Modern Analytic
and Computational Methods in Science and Mathematics. Elsevier, New York
(1968)
53. Stratonovich, R.L.: Theory of Information. Sovetskoe Radio, USSR, Moscow
(1975)
54. Stratonovich, R.L.: Topics in the Theory of Random Noise, vol. 1. Martino Fine
Books, Eastford (2014)
55. Stratonovich, R.L., Grishanin, B.A.: The value of information when a direct
observation of an estimated random variable is impossible. Izv. USSR Acad.
Sci. Tech. Cybern. 3, 3–15 (1966, in Russian)
56. Stratonovich, R.L., Grishanin, B.A.: Game problems with constraints of an in-
formational type. Izv. USSR Acad. Sci. Tech. Cybern. 1, 3–12 (1968, in Rus-
sian)
Index
A Chebyshev’s inequality, 14
α -information, 300 Chernoff’s inequality, 94
active domain, 59, 252, 307 code, 36, 219
additivity principle, 3 optimal, 37
asymptotic equivalence of value of information Shannon’s random, 221
functions, 360 uniquely decodable, 40
asymptotic theorem Kraft’s, 40
first, 92 condition of
second, 225, 227 multiplicativity, 31, 32
third, 356, 360, 367 normalization, 58, 304
average cost, 300 conditional Markov process, 163, 205
B D
Bayesian system, 300 decoding error, 49, 221
Gaussian, 338 distribution
stationary, 346 canonical, 79
bit, 4 extremum, 300
Boltzmann formula, 2
branch E
anomalous, 257, 301 elementary message, 38
normal, 257, 301 encoding of information, 35, 36
block, 36
C online, 35
channel optimal, 35
abstract, 250, 251 entropy, 2, 3
additive, 284 Boltzmann’s, 6
binary, 264, 266 conditional, 8, 30
capacity, 53, 57, 250 continuous random variable, 24, 25
discrete noiseless, 56 end of an interval, 103, 107
Gaussian, 267 maximum value, 6, 28
stationary, 277 properties, 6, 7
physical, 405 random, 5
capacity of, 406 rate, 15, 103, 105, 156
symmetric, 262 entropy density, 157
F N
Fokker–Planck equation, 157, 166 nat, 4
stationary, 159 neg-information, 290
free energy, 62
function P
cost, 56, 296 parameter
cumulant generating, 80 canonical, 78, 79
likelihood, 219 thermodynamic
value of information, 296, 322 conjugate, 72, 79
external, 71
G internal, 71, 78
Gibbs partition function, 62, 74
canonical distribution, 62, 78, 82, 392 potential
theorem, 83 characteristic, 20, 80, 94, 127, 188, 194, 201
conditional, 240
H thermodynamic, 65
Hartley’s formula, 2, 4 probability
final a posteriori, 116
I of error, 219
information average, 221
amount of process
Boltzmann’s, 6, 318 discrete, 104
Hartley’s, 4 Markov, 107
Shannon’s, 173, 178 stationary, 104
capacity, 57 Markov
mutual conditional, 113, 163, 205
conditional, 181 conditional, entropy of, 113
pairwise, 178 diffusion, 157
random, 179 secondary a posteriori, 118
rate, 196 stationary-connected, 196
triple, 185 stationary periodic, 138
Ising model, 69 stochastic point, 144
property of hierarchical additivity, 10, 30, 182
J
Jensen’s inequality, 6 R
Radon–Nikodym derivative, 25
K random flow, 144
Khinchin’s theorem, 225 risk, 300
Kotelnikov’s theorem, 282
S
L sequence of informationally stable Bayesian
Lévy formula, 97 systems, 367
law Shannon’s theorem, 225, 250
conservation of information amount, 35 generalized, 353, 385, 387
of thermodynamics, Second, 392, 398 simple noise, 175
generalized, 392, 399 stability
Legendre transform, 72, 81, 230 canonical, 88
length of record, 37, 42 entropic, 16, 19
sufficient condition, 19
M informational, 226
Markov chain, 108 Stirling’s formula, 150
Markov condition, 115
method of Lagrange multipliers, 58 T
micro state, 2 thermodynamic relation, 62, 64, 254, 309
Index 419
V variational problem
value of information, 289, 291, 296 first, 57, 58
Boltzmann’s, 319 second, 249–251
differential, 291, 292 third, 293, 300, 304
Hartley’s, 297
random, 312 W
Shannon’s, 301, 321 W -process secondary a posteriori, 118