Stratonovich

Theory of Information and its Value

Edited by Roman V. Belavkin, Panos M. Pardalos, Jose C. Principe
Editors

Roman V. Belavkin, Faculty of Science and Technology, Middlesex University, London, UK
Panos M. Pardalos, Industrial and Systems Engineering, University of Florida, Gainesville, FL, USA
Jose C. Principe, Electrical & Computer Engineering, University of Florida, Gainesville, FL, USA
Author
Ruslan L. Stratonovich (Deceased)
This Springer imprint is published by the registered company Springer Nature Switzerland AG.
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Foreword
It would be impossible for us to start this book without mentioning the main achieve-
ments of its remarkable author, Professor Ruslan Leontievich Stratonovich (RLS or
Ruslan later). He was a brilliant mathematician, probabilist and theoretical physicist,
best known for the development of the symmetrized version of stochastic calculus
(an alternative to Itô calculus), with stochastic differential equations and integrals
now bearing his name. His unique and beautiful approach to stochastic processes
was invented in the 1950s during the time of his doctoral work on the solution to the
notorious nonlinear filtering problem. The importance of this work was immediately
recognized by the great Andrei Kolmogorov, who invited Ruslan, then a graduate
student, for a discussion of his first papers.
This work was so much ahead of its time that its initial reception in the Soviet
mathematical community was mixed, mainly due to misunderstandings of the differ-
ences between the Itô and Stratonovich approaches. These and perhaps other factors
related to the Cold War obscured some of the achievements of Stratonovich's
early papers on optimal nonlinear filtering, which, apart from the general solution
to the nonlinear filtering problem, contained also the equation for the Kalman–Bucy
filter as a special (linear) case as well as the forward–backward procedures for com-
puting posterior probabilities, which were later rediscovered in the hidden Markov
models theory. Nonetheless, the main papers were quickly translated into English,
and thanks to the remarkable hard work of RLS, by 1966 he had already published
two monographs—Topics in the Theory of Random Noise (see [54] for a recent
reprint) and Conditional Markov Processes [52]. These books were also promptly
translated and had a better reception in the West, with the latter book being edited by
Richard Bellman. In 1968, Professor W. Murray Wonham wrote in a letter to RLS
‘Perhaps you are the prophet, who is honored everywhere except in his own land’.
Despite the difficulties, the 1960s were very productive for Ruslan, who quickly
became recognized as one of the top scientists in his field in the world. He became
a Professor by 1969 (a post at the Department of Physics that he held for the rest
of his life). At that time, he managed to form a group
of young and talented graduate students including Grishanin, B. A., Sosulin, Yu. G.,
Kulman, N. K., Kolosov, G. E., Mamayev, D. D., Platonov, A. A. and Belavkin, V. P.
Fig. 1 Stratonovich R. L. (second left) with his group in front of Physics Department, Moscow
State University, 1971. From left to right: Kulman, N. K., Stratonovich, R. L., Mamayev, D. D.,
Sosulin, Yu. G., Kolosov, G. E., Grishanin, B. A. Picture taken by Slava (V. P.) Belavkin
(Figure 1). These students began working under his supervision in completely new
areas of information theory, stochastic and adaptive optimal control, cybernetics and
quantum information. Ruslan was young and had a somewhat legendary reputation
among students and colleagues at the university, so that many students aspired to
work with him, even though he was known to be a very hard working and demand-
ing supervisor. At the same time, he treated his students as equals and gave them a
lot of freedom. Many of his former students recall that Ruslan had an amazing gift
to predict the solution before it was derived. Sometimes, he surprised his colleagues
by giving an answer to some difficult problem, and when they asked him how he
obtained it, his answer was ‘just verify it yourself, and you will see it is correct’.
Ruslan and his students spent a lot of their spare time together, either playing tennis
during the summer time or skiing in winters, such that they developed long-lasting
friendships.
In the mid-1960s, Ruslan read several specialist courses on various topics, in-
cluding adaptive Bayesian inference, logic and information theory. The latter course
included a lot of original material, which emphasized the connection between infor-
mation theory with statistical thermodynamics and introduced the Value of Informa-
tion, which he pioneered in his 1965 paper [47] and later developed together with
his students (mainly Grishanin, B. A.). This course motivated Ruslan to write his
third book called ‘Theory of Information’. Its first draft was ready in 1967, and it
is remarkable to note that RLS even had some negotiations with Springer to pub-
lish the book in English, which unfortunately did not happen, perhaps due to some
bureaucratic difficulties in the Soviet Union. The monograph was quite large and
included parts on quantum information theory, which he developed together with
his new student Slava (V. P.) Belavkin. Although there was an agreement to publish
the book with a leading Soviet scientific publisher, the publication was delayed for
some unexplained reasons. In the end, an abridged version of the book was pub-
lished by a different publisher (Soviet Radio) in 1975 [53], which did not include
the quantum information parts!
Nonetheless, the book had become a classic even without having been translated
into English. Several anecdotal stories exist about how the book was used
in the West and discussed at seminars with the help of Russian-speaking graduate
students. For example, Professor Richard Bucy used translated parts of the book in
his seminars on information theory, and in the 1990s he even suggested that the book
be published in English. In fact, in 1994 and 1995 Ruslan visited his former student
and collaborator Slava Belavkin at the University of Nottingham, United Kingdom,
who worked there at the Department of Mathematics (Figure 2). They had a plan to
publish a second edition of the book together in English and to include the parts on
quantum information. Because quantum information theory had progressed in the
1970s and 1980s, it was necessary to update the quantum parts of the manuscript,
and this became the Achilles' heel of their plan. During Ruslan's visit, they
spent more time working on joint papers, which seemed a more urgent matter.
I also visited my father (V.P. Belavkin) in Nottingham during the summer of 1994,
and I remember very clearly how happy Ruslan was during that visit (Figure 3),
especially about being able to mow the lawn in the backyard—a completely new
experience for someone who had lived in a small Moscow flat all his life. Two years later, in
January 1997, Ruslan died after catching a flu during the winter examinations at the
Moscow State University. I went to his funeral at the Department of Physics, from
which I too had already graduated. It was a very big and sad event attended by a
crowd of students and colleagues. In the next couple of years, my father collabo-
rated with Valentina, Ruslan's wife, on an English translation of the book, the first
version of which was in fact finished. Valentina too passed away two years later, and
my father never finished this project.
Before telling the reader how the translation of this book eventually came about,
I would like to write a few words about this book from my personal experience, how
it became one of my favourite texts on information theory, and why I believe it is so
relevant today.
Having witnessed first-hand the development of quantum information and filter-
ing theory in the 1980s (my father’s study in our small Moscow flat was also my
bedroom), I decided that my career could do without non-commutative probability
and stochastics. So, although I graduated from the same department as my father
Fig. 2 Stratonovich with his wife during their visit to Nottingham, England, 1994. From left
to right: Robin Hudson, Slava Belavkin, Ruslan Stratonovich, Valentina Stratonovich, Nadezda
Belavkina
and Ruslan, I became interested in Artificial Intelligence (AI), and a couple of years
later I managed to get a scholarship to do a PhD in cognitive modelling of human
learning at the University of Nottingham. I was fortunate enough to be in the same
city with my parents, which allowed me to take a cycle ride through Wollaton Park
and visit them either for lunch or dinner. Although we often had scientific discus-
sions, I felt comfortable that my area of research was far away and independent of
my father’s territory. That, however, turned out to be a false impression.
During that time at the end of 1990s, I came across many heuristics and learning
algorithms using randomization in the form of the so-called soft-max rule, where
decisions were sampled from a Boltzmann distribution with a temperature param-
eter controlling how much randomization was necessary. And although using these
heuristics had clearly improved the performance of the algorithms and cognitive
models, I was puzzled by these links with statistical physics and thermodynamics.
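The soft-max rule mentioned above can be sketched as follows (an illustrative example added here; the action values and temperature figures are arbitrary):

```python
import math
import random

def softmax(values, temperature):
    """Boltzmann distribution over action values.

    High temperature -> near-uniform choice (more randomization);
    low temperature -> concentrated on the best action.
    """
    # Subtract the maximum before exponentiating, for numerical stability.
    m = max(values)
    weights = [math.exp((v - m) / temperature) for v in values]
    total = sum(weights)
    return [w / total for w in weights]

def sample_action(values, temperature, rng=random):
    """Sample an action index from the soft-max distribution."""
    probs = softmax(values, temperature)
    r = rng.random()
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i
    return len(values) - 1

values = [1.0, 2.0, 3.0]
print(softmax(values, temperature=100.0))  # nearly uniform
print(softmax(values, temperature=0.01))   # nearly deterministic
```

The temperature parameter is exactly the knob referred to in the text: it interpolates between fully random and fully greedy decisions.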
Fig. 3 Stratonovich with his wife at Belavkins’ home in Nottingham, England, 1994. From left
to right: Nadezda Belavkina, Slava Belavkin, Roman Belavkin, Ruslan Stratonovich and Valentina
Stratonovich
The fact that it was more than just a coincidence became clear when I saw that
performance of cognitive models could be improved by relating the temperature pa-
rameter dynamically to entropy. Of course, I could not help sharing these naive ideas
with my father, and to my surprise he did not criticize them. Instead, he went to his
study and brought an old copy of Ruslan’s Theory of Information. I spent the next
few days going through various chapters of the book, and I was immediately im-
pressed by the self-contained, and at the same time, very detailed and deep style of
the presentation. Ruslan managed to start each chapter with basic and fundamental
ideas, supported by very understandable examples, and then developed the material
to such depth and detail that no questions seemed to remain unanswered. However,
the main value of the book was in the ideas unifying theories of information, opti-
mization and statistical physics.
My main focus was on Chapters 3 and 9, which covered variational problems
leading to optimal solutions in the form of exponential family distributions (the
‘soft-max’), defined and developed the value of information theory and explored
many interesting examples. The value of information is an amalgamation of the-
ories of optimal statistical decisions and information, and its applications go far
beyond problems of information transmission. For example, the relation to machine
learning and cognitive modelling was immediately clear—learning, from a
mathematical point of view, was simply an optimization problem with information
constraints (otherwise, there is nothing to learn), and a solution to such a problem could
only be a randomized policy, where randomization was the consequence of incom-
plete information. Furthermore, the temperature parameter was simply the Lagrange
multiplier defined by the constraint, which also meant that an optimal temperature
could be derived (at least in theory) giving the solution to the notorious ‘exploration-
exploitation’ dilemma in reinforcement learning theory. A few years later, I applied
these ideas to evolutionary systems and derived optimal control strategies for mu-
tation rates in genetic algorithms (controlling randomization of DNA sequences).
Similar applications can be developed to control learning rates in artificial neural
networks and other data analysis algorithms.
This publication is a new translation of the 1975 book, which incorporates some
parts from the original translation by Ruslan’s wife, Valentina Stratonovich. The
publication has become possible, thanks to the initiatives of Professors Panos Parda-
los and Jose Principe. The collaboration was initiated at the ‘First International
Conference on Dynamics of Information Systems' organized by Panos at the University
of Florida in 2009 [22]. It is fair to say that at that time it was the only
conference dedicated not only to traditional information-theoretic aspects of data
and systems analysis, but also to the importance of analysing and understanding
the value of information. Another very important achievement of this conference
was the first attempt to develop a geometric approach to the value of information,
which is why one of the invited speakers to the conference was Professor Shun-
Ichi Amari. It was at this conference that the editors of this book first met together.
Panos, who by that time was the author and editor of dozens of books on global opti-
mization and data science, expressed his amazement at the unfortunate fact that this
book was still not available in English. Jose Principe, known for his pioneering
work on information-theoretic learning, had already recognized the importance and
relevance of this book to modern applications and was planning the translation of
specific chapters. It was clear that there was a huge interest in the topic of value
of information, and we began discussing the possibility of making the new English
translation of this classic book, and finishing the project, which unfortunately was
never completed by Ruslan Stratonovich and Slava Belavkin.
Panos suggested that Vladimir Stozhkov, one of his Russian-speaking PhD stu-
dents, should do the initial translation. Vladimir took on the bulk of this work. The
equations for each chapter were coordinated by Matt Emigh and entered in LaTeX
by students and visitors in the Department of Computational Neuro-Engineering
Laboratory (CNEL), University of Florida, during the Summer and Fall of 2016 as
follows: Sections 1.1–1.5 by Carlos Loza, 1.6–1.7 by Ryan Burt, 2.1–3.4 by Ying
Ma, 3.5–4.3 by Zheng Cao, 4.4–5.3, 6.1–6.5 and 8.6–8.8 by Isaac Sledge, 5.4–5.7
by Catia Silva, 5.8–5.11 by John Henning, 6.6–6.7 by Eder Santana, 7.3–7.5 by
Paulo Scalassara, 7.6–8.5 by Shulian Yu and Chapters 9–12 by Matt Emigh. This
translation and equations were then edited by Roman Belavkin, who also combined
it with the translation by Valentina Stratonovich in order to achieve a better reflec-
tion of the original text and terminology. In particular, the introductory paragraphs
of each chapter are largely based on Valentina’s translation.
We would like to take the opportunity to thank Springer, and specifically Razia
Amzad and Elizabeth Loew, for making the publication of this book possible. We
also acknowledge the help of the Laboratory of Algorithms and Technologies for
Network Analysis (LATNA) at the Higher School of Economics in Nizhny Nov-
gorod, Russia, for facilitating meetings and collaborations among the editors and
accommodating many fruitful discussions on the topic in this beautiful Russian city.
With the emergence of data-driven economy, progress in machine learning and
AI algorithms and increased computational resources, the need for a better under-
standing of information, its value and limitations is greater than ever. This is why we
believe this book is even more relevant today than when it was first published. The
vast number of examples pertaining to all kinds of stochastic processes and problems
makes it a treasure trove of information for any researcher working in the areas
of data science or machine learning. It is a great pleasure to be able to contribute a
little to this project and see that finally this amazing book will be open to the rest of
the world.

Preface
This book was inspired by the author’s lectures on information theory in the De-
partment of Physics at Moscow State University, 1963–1965. Initially, the book was
written in accordance with the content of those lectures. The plan was to organize
the book in order to reflect all the paramount achievements of Shannon’s informa-
tion theory. However, while working on the book the author ‘lost his way’ and used
a more familiar style, in which the development of his own ideas dominated over a
thorough recollection of existing results. That led to the inclusion of original material
in the book and to an original interpretation of many central constructs of the
theory. The original material crowded out some of the established results that were
to be included; for instance, the chapter devoted to well-known methods
of encoding and decoding in noisy channels was discarded.
The material included in the book is organized in three stages: the first, second,
third variational problems and the first, second and third asymptotic theorems,
respectively. This creates a clear panorama of the most fundamental content of Shannon's
information theory.
Every writing style has its own benefits and disadvantages. The disadvantage of
the chosen style is that the work of many scientists in the field remains
unreflected (or insufficiently reflected). This should not be regarded as an
indication of insufficient respect for them. As a rule, an assessment of the material's
originality and the attribution of the author’s ownership of the results are not given.
The only exception is made for a few explicit facts.
The book adopts ‘the principle of increasing complexity of material’, which
means that simpler and easily understood material is placed in the beginning of
a book (as well as in the beginning of a chapter). The reader is not required to be
familiar with more difficult and specific material situated towards the end of a chap-
ter/the book. This principle allows the inclusion of complicated material into the
book, while not making it difficult for a fairly wide range of readers. The hope is
that many readers will gain useful knowledge for themselves from the book.
While considering general questions, the author tried to lead statements with the
most generality possible. To achieve this, he often used the language of measure
theory. For example, he utilized the notation P(dx) for probability distribution. This
should not scare off those who did not master the specified language. The point is
that, omitting the details, a reader can always use a simple dictionary which con-
verts those ‘intimidating’ terms into those more familiar. For instance, ‘probability
measure P(dx)’ can be treated as probability P(x) in the case of a discrete random
variable or as product p(x) dx in the case of a continuous random variable, where
p(x) is a probability density function and dx = dx1 . . . dxr is a differential corre-
sponding to a space of dimensionality r.
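As a hypothetical illustration of this dictionary (an example added here, with an arbitrary distribution and density), the two readings of P(dx) can be computed side by side:

```python
import math

# Discrete case: the 'probability measure P(dx)' is just the probability P(x).
P = [0.5, 0.25, 0.25]                          # an arbitrary example distribution
H_discrete = -sum(p * math.log(p) for p in P)  # entropy in natural units

# Continuous case: P(dx) = p(x) dx, and the sum becomes -∫ p(x) ln p(x) dx.
# Illustrated for the uniform density p(x) = 1/2 on [0, 2], with the integral
# approximated by a midpoint Riemann sum.
a, b, n = 0.0, 2.0, 10000
dx = (b - a) / n
def p(x):
    return 0.5
H_continuous = -sum(p(a + (i + 0.5) * dx) * math.log(p(a + (i + 0.5) * dx)) * dx
                    for i in range(n))

print(H_discrete)    # 1.5 * ln 2 ≈ 1.04
print(H_continuous)  # ln 2 ≈ 0.693 (differential entropy of the uniform density)
```

The same formula is evaluated in both cases; only the 'dictionary entry' for the measure changes.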
With various readers in mind, the author did not attach much significance to consistency
of terminology. He thought that it did not matter whether we say 'expectation' or
‘mean value’, ‘probabilistic measure’ or ‘distribution’. If we employ the apparatus
of generalized functions, then there always exists a probability density and it can
be used. Often there is no need to distinguish between the minimization signs min and
inf. By this we mean that if the infimum is not attained within a considered region, then
we can always 'turn on' a standard ε-procedure and, as a matter of
fact, nothing essential will change as a result. In the book, we pass from a discrete
probabilistic space to a continuous probabilistic space in a free manner. The author
tried to spare the reader any concern of inessential details and distractions from the
main ideas.
The general constructs of the theory are illustrated in the book by numerous cases
and examples. Because the theory and the examples are of general importance, their
statement does not require special radiophysics terminology. If a reader is interested
in applying the stated material to the problem of message transmission through
radio channels, he or she should fill the abstract concepts with radiophysical content. For
instance, when considering noisy channels (Chapter 7), an input stochastic process
x = {xt } should be treated as a variable informational parameter of signal s(t, xt )
emitted by a radiotransmitter. Also, an output process y = {yt } should be treated
as a signal at a receiver input. A proper concretization of concepts is needed for
application of any mathematical theory.
The author expresses his gratitude to the first reader of this book, B. A. Grishanin,
who rendered significant assistance while the book was being prepared for
print, and to Professor B. I. Tikhonov for discussions involving a variety of subjects
and a number of valuable comments.

Introduction
The term ‘information’ mentioned in the title of the book is understood here not in
the broad sense in which the word is understood by people working in the press,
radio, media, but in the narrow scientific sense of Claude Shannon’s theory. In other
words, the subject of this book is the special mathematical discipline, Shannon’s
information theory, which can solve its own quite specific problems.
This discipline consists of abstractly formulated theorems and results, which can
have different specializations in various branches of knowledge. Information theory
has numerous applications in the theory of message transmission in the presence of
noise, the theory of recording and registering devices, mathematical linguistics and
other sciences including genetics.
Information theory, together with other mathematical disciplines, such as the the-
ory of optimal statistical decisions, the theory of optimal control, the theory of algo-
rithms and automata, game theory and so on, is a part of theoretical cybernetics—a
discipline dealing with problems of control. Each of the above disciplines is an in-
dependent field of science. However, this does not mean that they are completely
separated from each other and cannot be bridged. Undoubtedly, the emergence of
complex theories is possible and probable, where concepts and results from differ-
ent theories are combined and which interconnect distinct disciplines. The picture
resembles trees in a forest: at first the trees grow independently, their trunks standing
apart, but then their twigs and branches intertwine, forming a new common crown.
Of course, generally speaking, the statement about uniting different disciplines
is just an assertion, but, in fact, the merging of some initially disconnected fields of
science is now an actual fact. As is evident from a number of works and from this
book, the following three disciplines are inosculating:
1. statistical thermodynamics as a mathematical theory
2. Shannon’s information theory
3. the theory of optimal statistical decisions (together with its multi-step or sequen-
tial variations, such as optimal filtering and dynamic programming).
This book will demonstrate that the three disciplines mentioned above are ce-
mented by ‘thermodynamic’ methods with typical attributes such as ‘thermody-
namic’ parameters and potentials, Legendre transforms, extremum distributions, and
asymptotic nature of the most important theorems.
Statistical thermodynamics can be referred to as a cybernetic discipline only
conditionally. However, in some problems of statistical thermodynamics, its cybernetic
nature manifests itself quite clearly. It is sufficient to recall the second law of ther-
modynamics and ‘Maxwell’s demon’, which is a typical automaton converting in-
formation into physical entropy. Information is ‘fuel’ for perpetual motion of the
second kind. These points will be discussed in Chapter 12.
If we consider statistical thermodynamics as a cybernetic discipline, then L. Boltzmann
and J. C. Maxwell should be called the first outstanding cyberneticists. It is
important to bear in mind that the formula expressing entropy in terms of probabil-
ities was introduced by L. Boltzmann, who also introduced the probability distri-
bution that was the solution to the first variational problem (of course, it does not
matter what we call the function in this formula—energy or cost function).
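A small numerical sketch of this point (an illustration added here, with arbitrary made-up energies): among all distributions over three states with a fixed mean energy, the Boltzmann distribution p_i ∝ exp(−E_i/T) attains the maximum entropy.

```python
import math

E = [0.0, 1.0, 2.0]   # arbitrary 'energies' (cost-function values) of three states
T = 1.5               # arbitrary temperature parameter

# Boltzmann (extreme) distribution: p_i proportional to exp(-E_i / T).
w = [math.exp(-e / T) for e in E]
Z = sum(w)
boltz = [x / Z for x in w]
U = sum(p * e for p, e in zip(boltz, E))   # its mean energy, held fixed below

def entropy(p):
    """Entropy -sum p ln p in natural units."""
    return -sum(x * math.log(x) for x in p if x > 0)

# With normalization and fixed mean energy, distributions over three states
# form a one-parameter family: pick p2, then p1 = U - 2*p2 and p0 = 1 - p1 - p2.
best_h, best_p = -1.0, None
for k in range(1, 2000):
    p2 = (U / 2) * k / 2000
    p1 = U - 2 * p2
    p0 = 1 - p1 - p2
    if min(p0, p1, p2) <= 0:
        continue
    h = entropy((p0, p1, p2))
    if h > best_h:
        best_h, best_p = h, (p0, p1, p2)

# The grid maximizer coincides (to grid accuracy) with the Boltzmann distribution.
print(boltz)
print(best_p)
```

The same construction works with any cost function in place of energy, which is exactly the remark made in the text.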
During the emergence of Shannon’s information theory, the appearance of a well-
known notion of thermodynamics, namely entropy, was regarded by some scientists
as a curious coincidence, and little attention was given to the fact. It was thought that
this entropy had nothing to do with physical entropy (despite the work of Maxwell’s
demon). In this connection, we can recall a countless number of quotation marks
around the word ‘entropy’ in the first edition of the collection of Shannon’s pa-
pers translated into Russian (the collection under the editorship of A.N. Zheleznov,
Foreign Literature, 1953). I believe that now even terms such as ‘temperature’ in in-
formation theory can be written without quotation marks and understood merely as
a parameter incorporated in the expression for the extreme distribution. Similar laws
are valid both in information theory and statistical physics, and we can conditionally
call them ‘thermodynamics’.
At the beginning (from 1948 until 1959), only one ‘thermodynamic’ notion ap-
peared in Shannon’s information theory—entropy. There seemed to be no room for
energy and other analogous thermodynamic potentials in it. In that regard, the the-
ory looked feeble in comparison with statistical thermodynamics. This, however,
was short-lived. The situation changed when scientists realized that in applied infor-
mation theory, regarded as the theory of signal transmission, the cost function was
the analogue of energy, and risk or average cost was the analogue of average energy.
It became evident that a number of main concepts and relations between them are
similar in two disciplines. In particular, if the first variational problem is considered,
then we can speak about resemblance, ‘isomorphism’ of the two theories. Mathe-
matical relationships between the corresponding notions of both disciplines are the
same, and they are the contents of a mathematical theory that is considered in this
book.
The content of information theory is not limited to the specified relations. Be-
sides entropy, the theory contains other notions, such as Shannon's amount of
information. In addition to the first variational problem, related to an extremum of
entropy under fixed risk (i.e. energy), there are other possible variational problems.
The second variational problem is related to the famous result of Shannon about
asymptotic zero probability of error for the transmission of messages through noisy
channels. The third variational problem is connected with asymptotic equivalence
of the values of information of Shannon’s and Hartley’s type. The latter results are
a splendid example of unity of discrete and continuous worlds. They are also a
good example that when the complexity of a discrete system grows, it is convenient
to describe it by continuous mathematical objects. Finally, they are an example of
how a complex continuous system behaves asymptotically similarly to a complex
discrete system. It would be tempting to observe something similar, say, in a future
asymptotic theory of dynamical systems and automata.
Chapter 1
Definition of information and entropy in the absence of noise
In modern science, engineering and public life, a big role is played by information
and operations associated with it: information reception, information transmission,
information processing, storing information and so on. The significance of informa-
tion has seemingly outgrown the significance of the other important factor, which
used to play a dominant role in the previous century, namely, energy.
In the future, in view of the growing complexity of science, engineering, economics
and other fields, the significance of correct control in these areas will grow and,
therefore, the importance of information will increase as well.
What is information? Is a theory of information possible? Are there any general
laws for information independent of its content that can be quite diverse? Answers
to these questions are far from obvious. Information appears to be a more difficult
concept to formalize than, say, energy, which has a certain, long established place
in physics.
There are two sides of information: quantitative and qualitative. Sometimes it is
the total amount of information that is important, while other times it is its quality,
its specific content. Besides, a transformation of information from one format into
another is technically a more difficult problem than, say, transformation of energy
from one form into another. All this complicates the development of information
theory and its usage. It is quite possible that general information theory will
not bring any benefit to some practical problems, and these will have to be tackled by
independent engineering methods.
Nevertheless, general information theory exists, and so do standard situations
and problems, in which the laws of general information theory play the main role.
Therefore, information theory is important from a practical standpoint, as well as in
fundamental science, philosophy and expanding the horizons of a researcher.
From this introduction one can gauge how difficult it was to discover the laws
of information theory. In this regard, the most important milestone was the work of
Claude Shannon [44, 45] published in 1948–1949 (the respective English originals
are [38, 39]). His formulation of the problem and results were both perceived as a
surprise. However, on closer investigation one can see that the new theory extends
and develops former ideas, specifically, the ideas of statistical thermodynamics due
to Boltzmann. The deep mathematical similarities between these two directions are
not accidental. This is evidenced by the use of the same formulae (for instance, for
entropy of a discrete random variable). Besides that, a logarithmic measure for the
amount of information, which is fundamental in Shannon’s theory, was proposed
for problems of communication as early as 1928 in the work of R. Hartley [19] (the
English original is [18]).
In the present chapter, we introduce the logarithmic measure of the amount of
information and state a number of important properties of information, which follow
from that measure, such as the additivity property.
The notion of the amount of information is closely related to the notion of en-
tropy, which is a measure of uncertainty. Acquisition of information is accompanied
by a decrease in uncertainty, so that the amount of information can be measured by
the amount of uncertainty or entropy that has disappeared.
In the case of a discrete message, i.e. a discrete random variable, entropy is de-
fined by the Boltzmann formula
H_ξ = − ∑_ξ P(ξ) ln P(ξ),
In contrast to the case of discrete random variables, where it is sufficient to define entropy using one measure
or one probability distribution, in the general case it is necessary to introduce two
measures to do so. Therefore, entropy is now related to two measures instead of one,
and thus it characterizes the relationship between these measures. In our presenta-
tion, the general version of the formula for entropy is derived from the Boltzmann
formula by using as an example the condensation of the points representing random
variables.
There are several books on information theory, in particular, Goldman [16], Fein-
stein [12] and Fano [10] (the corresponding English originals are [11, 15] and [9]).
These books conveniently introduce readers to the most basic notions (also see the
book by Kullback [30] or its English original [29]).
one of the ‘simple’ (relative to the total system) subsystems. For two throws of dice,
the number of various pairs (ξ1, ξ2) (where ξ1 and ξ2 both take one out of six values)
equals 36 = 6². Generally, for n throws the number of equivalent outcomes is 6ⁿ.
Applying formula (1.1.1) to this number, we obtain the entropy f(6ⁿ). According to
the additivity principle, we find that
f(6ⁿ) = n f(6).
For other m > 1 the latter formula takes the form
f(mⁿ) = n f(m). (1.1.3)
S = Hph = k ln M. (1.1.7)
It is easy to see from the comparison of (1.1.5) and (1.1.6) that 1 nat is log2 e =
1/ln 2 ≈ 1.44 times greater than 1 bit.
In what follows, we shall use natural units (formula (1.1.5)), dropping subscript
‘nat’, unless otherwise stipulated.
Suppose the random variable ξ takes any of the M equiprobable values, say,
1, . . . , M. Then the probability of each individual value is equal to P(ξ ) = 1/M,
ξ = 1, . . . , M. Consequently, formula (1.1.5) can be rewritten as
H = − ln P(ξ ). (1.1.8)
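A small numeric sketch of formulae (1.1.5), (1.1.6) and the nat–bit conversion (the value M = 8 is an arbitrary illustration, not from the text):

```python
import math

def entropy_equiprobable(M, base=math.e):
    # H = log M for M equiprobable outcomes (formula (1.1.5) in the chosen units)
    return math.log(M, base)

H_nat = entropy_equiprobable(8)          # natural units (nats): ln 8
H_bit = entropy_equiprobable(8, base=2)  # binary units (bits): log2 8 = 3

# 1 nat = log2 e = 1/ln 2 ≈ 1.44 bits
assert abs(H_nat * math.log2(math.e) - H_bit) < 1e-12
```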
Suppose now that the probabilities of different outcomes are unequal. If, as earlier, the
number of outcomes equals M, then we can consider a random variable ξ, which
takes one of M values. Considering an index of the corresponding outcome as ξ , we
obtain that those values are nothing else but 1, . . . , M. Probabilities P(ξ ) of those
values are non-negative and satisfy the normalization constraint: ∑ξ P(ξ ) = 1.
If we formally apply equality (1.1.8) to this case, then each ξ should have its own
entropy
H(ξ ) = − ln P(ξ ). (1.2.1)
Thus, we attribute a certain value of entropy to each realization of the variable ξ .
Since ξ is a random variable, we can also regard this entropy as a random variable.
As in Section 1.1, the a posteriori entropy, which remains after the realization of
ξ becomes known, is equal to zero. That is why the information we obtain once the
realization is known is numerically equal to the initial entropy
Similar to entropy H(ξ ), information I depends on the actual realization (on the
value of ξ ), i.e., it is a random variable. One can see from the latter formula that
information and entropy are both large when a posteriori probability of the given
realization is small and vice versa. This observation is quite consistent with intuitive
ideas.
Example 1.1. Suppose we would like to know whether a certain student has passed
an exam or not. Let the probabilities of these two events be
One can see from these probabilities that the student is quite strong. If we were
informed that the student had passed the exam, then we could say: ‘Your message
has not given me a lot of information. I have already expected that the student passed
the exam’. According to formula (1.2.2) the information of this message is quanti-
tatively equal to
I(pass) = log2 (8/7) = 0.193 bits.
If we were informed that the student had failed, then we would say 'Really?' and
would feel that we have improved our knowledge to a greater extent. The amount of
information of such a message is equal to
I(fail) = log2 8 = 3 bits.
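The two messages of the example can be checked numerically. The probabilities P(pass) = 7/8 and P(fail) = 1/8 are not stated explicitly above; they are inferred here from I(pass) = log2(8/7) and should be read as an assumption of this sketch:

```python
import math

def information_bits(p):
    # random information I = -log2 P of a realization (cf. (1.2.2)), in bits
    return -math.log2(p)

I_pass = information_bits(7/8)  # ≈ 0.193 bits: an expected outcome is barely informative
I_fail = information_bits(1/8)  # 3.0 bits: a surprising outcome carries much more
```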
In the theory, however, the principal role is played not by the random entropy (random
information, respectively) (1.2.1), (1.2.2), but by the average entropy defined by the
formula
Hξ = E[H(ξ)] = − ∑ξ P(ξ) ln P(ξ). (1.2.3)
E[f(ζ)] ≤ f(E[ζ]) (1.2.4)
that is valid for every concave function f(x). (Function f(x) = ln x is concave for
x > 0, because f″(x) = −x⁻² < 0.) Indeed, denoting ζ = 1/P(ξ) we have
E[ζ] = E[1/P(ξ)] = ∑_{ξ=1}^{M} P(ξ) · (1/P(ξ)) = M, (1.2.5)
E[f(ζ)] = E[ln(1/P(ξ))] = E[H(ξ)] = Hξ. (1.2.6)
³ Boltzmann's entropy is commonly referred to as 'Shannon's entropy', or just 'entropy', within
the field of information theory.
Hξ ≤ ln M.
ln (ζ/E[ζ]) ≤ ζ/E[ζ] − 1, (1.2.7)
we obtain (1.2.4).
In the general case, it is convenient to consider the tangent line f(E[ζ]) +
f′(E[ζ])(ζ − E[ζ]) of the function f(ζ) at the point ζ = E[ζ] in order to prove (1.2.4).
Concavity implies
Proof. Suppose that ξ1 , ξ2 are two random variables such that the first one assumes
values 1, . . . , m1 and the second one—values 1, . . . , m2 . There are m1 m2 pairs of
ξ = (ξ1 , ξ2 ) with probabilities P(ξ1 , ξ2 ). Numbering the pairs in an arbitrary order
by indices ξ = 1, . . . , m1 m2 we have
P(ξk, …, ξn | ξ1, …, ξk−1) = P(ξ1, …, ξn) / P(ξ1, …, ξk−1)  (k ≤ n)
Let us introduce a special notation for the result of averaging (1.3.1) over ξk , . . . , ξn :
Fig. 1.1 Partitioning stages and the decision tree in the general case
If, in addition, we vary k and n, then we will form a large number of different en-
tropies, conditional and non-conditional, random and non-random. They are related
by identities that will be considered below.
Before we formulate the main hierarchical equality (1.3.4), we show how to in-
troduce a hierarchical set of random variables ξ1 , . . . , ξn , even if there was just one
random variable ξ initially.
Let ξ take one of M values with probabilities P(ξ ). The choice of one realization
will be made in several stages. At the first stage, we indicate which subset (from a
full ensemble of non-overlapping subsets E1 , . . . , Em1 ) the realization belongs to. Let
ξ1 be the index of such a subset. At the second stage, each subset is partitioned into
smaller subsets Eξ1 ξ2 . The second random variable ξ2 points to which smaller sub-
set the realization of the random variable belongs to. In turn, those smaller subsets
are further partitioned until we obtain subsets consisting of a single element. Appar-
ently, the number of nontrivial partitioning stages n cannot exceed M − 1. We can
juxtapose a fixed partitioning scheme with a ‘decision tree’ depicted on Figure 1.1.
Further considerations will be associated with a particular selected ‘tree’.
The probability of moving from the node ξ1 along the branch ξ2 is equal to the
conditional probability
The entropy, associated with a selection of one branch emanating from this node, is
precisely the conditional entropy of type (1.3.2) for n = 2, k = 2:
As in (1.3.3), averaging over all second stage nodes yields the full selection entropy
at the second stage:
As a matter of fact, the selection entropy at stage k in the node defined by values
ξ1 , . . . , ξk−1 is equal to
Hξk (| ξ1 , . . . , ξk−1 ).
At the same time the total entropy of stage k is equal to
For instance, on Figure 1.2 the first stage entropy is equal to Hξ1 = 1 bit. Node A
has entropy Hξ2(| ξ1 = 1) = 0, and node B has entropy Hξ2(| ξ1 = 2) = (2/3) log2(3/2) +
(1/3) log2 3 = log2 3 − 2/3 bits. The average entropy at the second stage is apparently equal
to Hξ2|ξ1 = (1/2) Hξ2(| 2) = (1/2) log2 3 − 1/3 bits. An important regularity is that the sum
of entropies of all stages is equal to the full entropy Hξ, which can be computed
without partitioning a selection into stages. For the above example
Hξ1ξ2 = (1/2) log2 2 + (1/3) log2 3 + (1/6) log2 6 = 2/3 + (1/2) log2 3 bits,
which does, indeed, coincide with the sum Hξ1 + Hξ2 |ξ1 . This observation is general.
Theorem 1.4. Entropy possesses the property of hierarchical additivity:
Hξ1 ,...,ξn = Hξ1 + Hξ2 |ξ1 + Hξ3 |ξ1 ξ2 + · · · + Hξn |ξ1 ,...,ξn−1 . (1.3.4)
Taking the logarithm of (1.3.5) and taking into account definition (1.3.1) of condi-
tional random entropy, we obtain
Averaging this equality according to (1.3.3) gives (1.3.4). This completes the proof.
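The hierarchical additivity (1.3.4) is easy to verify numerically for the two-stage example above (a sketch with the tree of Figure 1.2, leaf probabilities 1/2, 1/3, 1/6, hard-coded):

```python
import math

def H(probs):
    # entropy in bits of a discrete distribution
    return -sum(p * math.log2(p) for p in probs if p > 0)

H_full = H([1/2, 1/3, 1/6])      # full entropy Hξ1ξ2
H_stage1 = H([1/2, 1/2])         # Hξ1: choice between node A and node B
# stage 2: node A is deterministic (entropy 0), node B splits as 2/3 vs 1/3
H_stage2 = (1/2) * 0 + (1/2) * H([2/3, 1/3])

assert abs(H_full - (H_stage1 + H_stage2)) < 1e-12   # Hξ1ξ2 = Hξ1 + Hξ2|ξ1
```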
Theorem 1.5. Whatever the probability distributions P(ξ) and Q(ξ) are, the following
inequality holds:
∑ξ P(ξ) ln [P(ξ)/Q(ξ)] ≥ 0. (1.3.7)
Proof. The proof is similar to the proof of Theorem 1.2. It is based on inequal-
ity (1.2.4) for function f (x) = ln x. We set ζ = Q(ξ )/P(ξ ) and perform averaging
with weight P(ξ ).
Then
E[ζ] = ∑ξ P(ξ) · Q(ξ)/P(ξ) = ∑ξ Q(ξ) = 1
and
E[f(ζ)] = ∑ξ P(ξ) ln [Q(ξ)/P(ξ)].
Inequality (1.2.4) then gives
∑ξ P(ξ) ln [Q(ξ)/P(ξ)] ≤ ln 1 = 0,
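Inequality (1.3.7) can be spot-checked for arbitrary distribution pairs (the two distributions below are arbitrary illustrations, not from the text):

```python
import math

def relative_entropy(P, Q):
    # ∑ξ P(ξ) ln (P(ξ)/Q(ξ)), the left-hand side of (1.3.7)
    return sum(p * math.log(p / q) for p, q in zip(P, Q) if p > 0)

P = [0.5, 0.3, 0.2]
Q = [0.2, 0.5, 0.3]
assert relative_entropy(P, Q) >= 0           # inequality (1.3.7)
assert abs(relative_entropy(P, P)) < 1e-12   # equality when P = Q
```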
Hξ|η ≤ Hξ. (1.3.8)
Proof. Using Theorem 1.5 we substitute P(ξ ) and Q(ξ ) by P(ξ | η ) and P(ξ ),
respectively, therein. Then we will obtain
Averaging this inequality over η with weight P(η ) results in the inequality
Hξ |ηζ = Hξ |ζ . (1.3.9)
The idea that the general case of non-equiprobable outcomes can be asymptotically
reduced to the case of equiprobable outcomes is fundamental for information theory
in the absence of noise. This idea belongs to Ludwig Boltzmann who derived for-
mula (1.2.3) for entropy. Claude Shannon revived this idea and broadly used it for
derivation of new results.
In considering this question here, we shall not try to reach generality, since these
results form a particular case of more general results of Section 1.5. Consider the
set of independent realizations η = (ξ1 , . . . , ξn ) of a random variable ξ = ξ j , which
assumes one of two values 1 or 0 with probabilities P[ξ = 1] = p < 1/2; P[ξ =
0] = 1 − p = q. Evidently, the number of such different combinations (realizations)
is equal to 2n . Let realization ηn1 contain n1 ones and n − n1 = n0 zeros. Then its
probability is given by
P(ηn1 ) = pn1 qn−n1 . (1.4.1)
Of course, these probabilities are different for different n1. The ratio P(η0)/P(ηn) =
(q/p)ⁿ of the largest probability to the smallest one is big and increases fast with a
growth of n. What equiprobability can we talk about then? The thing is that due to
Let us find the variance Var[n1 ] = Var[ξ1 + · · · + ξn ] of the number of ones. Due to
independence of the summands we have
and
E[ξ²] = E[ξ] = p; E[ξ²] − (E[ξ])² = p − p² = pq.
Therefore,
Var(n1 ) = npq; Var(n1 /n) = pq/n. (1.4.2)
Hence, we have obtained that the mean deviation
Δn1 = n1 − np ∼ √(pqn)
increases with n, but more slowly than the mean value np and the length of the entire
range 0 ≤ n1 ≤ n grow. A typical relative deviation Δn1/n1 decreases according to
the law Δn1/n1 ∼ √(q/(np)). Within the bounds of the range |n1 − pn| ∼ √(pqn) the
difference between probabilities P(ηn1 ) is still quite substantial:
P(ηn1)/P(ηn1−Δn1) = (q/p)^{Δn1} ≈ (q/p)^{√(pqn)}
(as it follows from (1.4.1)) and increases with a growth of n. However, this increase
is relatively slow compared to that of the probability itself. For Δn1 ≈ √(pqn), the
corresponding inequality
ln [P(ηn1)/P(ηn1−Δn1)] ≪ ln [1/P(ηn1)]
becomes stronger and stronger with a growth of n. Now we state the foregoing in
more precise terms.
Theorem 1.7. All of the 2n realizations of η can be partitioned into two sets An and
Bn , so that
1. The total probability of realizations from the first set An vanishes:
P(An ) → 0 as n → ∞; (1.4.3)
2. Realizations from the second set Bn become relatively equiprobable in the fol-
lowing sense:
[ln P(η′) − ln P(η″)] / ln P(η″) → 0, η′ ∈ Bn; η″ ∈ Bn. (1.4.4)
Proof. Using Chebyshev's inequality (for instance, see Gnedenko [13] or its translation to English [14]), we obtain
P[|n1 − pn| ≥ ε] ≤ Var(n1)/ε².
Taking into account (1.4.2) and assigning ε = n^{3/4}, we obtain from here that
We include realizations ηn1, for which |n1 − pn| ≥ n^{3/4}, into set An and the rest of
them into set Bn. Then the left-hand side of (1.4.5) is nothing but P(An), and passing
to the limit n → ∞ in (1.4.5) proves (1.4.3).
For the realizations from the second set Bn, the inequality pn − n^{3/4} < n1 < pn + n^{3/4} holds, which, in view of (1.4.1), gives
−n(p ln p + q ln q) − n^{3/4} ln (q/p) < −ln P(ηn1) < −n(p ln p + q ln q) + n^{3/4} ln (q/p). (1.4.6)
Hence
|ln P(η′) − ln P(η″)| < 2n^{3/4} ln (q/p)  for η′ ∈ Bn, η″ ∈ Bn,
and
|ln P(η)| > −n(p ln p + q ln q) − n^{3/4} ln (q/p).
Consequently,
|ln P(η′) − ln P(η″)| / |ln P(η″)| < [2n^{−1/4} ln (q/p)] / [−p ln p − q ln q − n^{−1/4} ln (q/p)].
In order to obtain (1.4.4), one should pass to the limit as n → ∞. This ends the proof.
Inequality (1.4.6) also allows us to evaluate the number of elements of the set Bn .
Theorem 1.8. Let Bn be a set described in Theorem 1.7. Its cardinality M is such
that
(ln M)/n → −p ln p − q ln q ≡ Hξ1 as n → ∞. (1.4.7)
Theorem 1.8a. If we suppose that realizations of set An have zero probability, that realizations of set Bn are equiprobable, and compute the entropy H̃η by the simple formula
H̃η = ln M (see (1.1.5)), then the entropy rate H̃η/n in the limit will be equivalent
to the entropy determined by formula (1.2.3), i.e.
lim_{n→∞} H̃η/n = −p ln p − q ln q. (1.4.8)
Note that formula (1.2.3) can be obtained as a corollary of the simpler rela-
tion (1.1.5).
Proof. According to (1.4.5) the sum
∑_{η∈Bn} P(η) = P(Bn)
The indicated sum involves the elements of set Bn and has M summands. In consequence of (1.4.6) each summand can be bounded above:
P(η) < exp{n(p ln p + q ln q) + n^{3/4} ln (q/p)},
so that
M ≤ P(Bn)/min P(η) ≤ 1/min P(η) < exp{−n(p ln p + q ln q) + n^{3/4} ln (q/p)}. (1.4.10)
In the case when ξ1 takes one out of m values, there are mⁿ different realizations of process η = (ξ1, …, ξn) with independent components. According to the
aforesaid, only M = e^{nHξ1} of them (which we can consider as equiprobable) deserve
attention. When P(ξ1) = 1/m and Hξ1 = ln m, the numbers mⁿ and M are equal; otherwise (when Hξ1 < ln m), the fraction of realizations deserving attention, e^{nHξ1}/mⁿ,
unboundedly decreases with a growth of n. Therefore, the vast majority of realizations in this case is not essential and can be disregarded. This fact underlies coding
theory (see Chapter 2).
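The partition of Theorem 1.7 can be carried out exactly for moderate n by direct enumeration (a sketch; p = 0.3 and n = 14 are arbitrary choices, and for such small n the ratio ln M/n approaches Hξ1 only slowly):

```python
import math

p, q, n = 0.3, 0.7, 14
H1 = -p * math.log(p) - q * math.log(q)   # entropy per symbol, nats

# set Bn: realizations with |n1 - pn| < n^(3/4); count them and their total probability
threshold = n ** 0.75
M, P_Bn = 0, 0.0
for n1 in range(n + 1):
    if abs(n1 - p * n) < threshold:
        count = math.comb(n, n1)          # realizations with exactly n1 ones
        M += count
        P_Bn += count * p**n1 * q**(n - n1)

print(1 - P_Bn)             # P(An) is already small for this n
print(math.log(M) / n, H1)  # ln M / n slowly approaches Hξ1 (Theorem 1.8)
```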
The asymptotic equiprobability takes place under more general assumptions
in the case of ergodic stationary processes as well. Boltzmann called the above
equiprobable realizations ‘micro states’ contrasting them to ‘macro states’ formed
by an ensemble of ‘micro states’.
P(An ) → 0 as n → ∞; (1.5.2)
Proof. Putting, for instance, ε = 1/m and δ = 1/m, we obtain from (1.5.1) that
for n ≥ N(1/m, 1/m). Let m increase by ranging over consecutive integers.
We define sets An as the sets of realizations satisfying the inequality
when N(1/m, 1/m) ≤ n < N(1/(m + 1), 1/(m + 1)). For such a definition, property (1.5.2) apparently holds true due to (1.5.5). For the complementary set Bn we have
that entails
[ln P(η′ⁿ) − ln P(η″ⁿ)] / [−ln P(η″ⁿ)] ≤ (2Hηⁿ/m) / [(1 − 1/m)Hηⁿ] = (2/m)/(1 − 1/m)
for η′ⁿ ∈ Bn, η″ⁿ ∈ Bn, which proves property (1.5.3) (since n ≥ N(1/m, 1/m), the
passage n → ∞ entails m → ∞).
According to (1.5.7), the probabilities P(η n ) of all realizations from set Bn lie in
the range
e−(1+1/m)Hη n < P(η n ) < e−(1−1/m)Hη n .
At the same time, the total probability P(Bn ) = ∑Bn P(η n ) is enclosed between 1 −
1/m and 1. Hence we get the following range for the number of terms:
(on the left-hand side the least number is divided by the greatest number, and on the
right-hand side, vice versa). Therefore,
1 − 1/m + ln(1 − 1/m)/Hηⁿ < (ln Mn)/Hηⁿ < 1 + 1/m,
which entails (1.5.4). Note that the term ln(1 − 1/m)/Hηⁿ converges to zero because
entropy Hηⁿ does not decrease. The proof is complete.
The property of entropic stability, which plays a big role according to Theo-
rem 1.9, can be conveniently checked for different examples by determining the
variance of random entropy:
If this variance does not increase too fast with a growth of n, then applying Cheby-
shev’s inequality can prove (1.5.1), i.e. entropic stability. We now formulate three
theorems related to this question.
lim_{n→∞} Var(H(ηⁿ))/H²ηⁿ = 0, (1.5.8)
for every random variable with a finite variance and every ε > 0. Supposing here
that ξ = H(ηⁿ)/Hηⁿ and taking into account (1.5.8), we obtain
P[|H(ηⁿ)/Hηⁿ − 1| ≥ ε] → 0 as n → ∞
for every ε. Hence, (1.5.1) follows from here, i.e. entropic stability.
Theorem 1.11. If the entropy Hη n increases without bound and there exists a finite
upper limit
lim sup_{n→∞} Var(H(ηⁿ))/Hηⁿ < C, (1.5.9)
then the family of random variables is entropically stable.
Proof. To prove the theorem it suffices to note that the quantity Var[H(ηⁿ)]/H²ηⁿ,
which on the strength of (1.5.9) can be bounded as follows:
Var(H(ηⁿ))/H²ηⁿ ≤ (C + ε)/Hηⁿ  (for n > N(ε)),
tends to zero, because Hηⁿ increases (and ε is arbitrary), so that Theorem 1.10 can
be applied to this case.
which can be called the entropy rate and the variance rate, respectively. A number of
more or less general methods have been developed to calculate these rate quantities.
According to the theorem stated below, finiteness of these limits guarantees entropic
stability.
Theorem 1.12. If limits (1.5.10) exist and are finite, the first of them being different
from zero, then the corresponding family of random variables is entropically stable.
Proof. To prove the theorem we use the fact that formulae (1.5.10) imply
Hηⁿ = H1 n + o(n), (1.5.11)
Var(H(ηⁿ)) = D1 n + o(n). (1.5.12)
Here, as usual, o(n) means that o(n)/n → 0. Since H1 > 0, entropy Hηⁿ increases
without bound. Dividing expression (1.5.12) by (1.5.11), we obtain a finite limit
lim_{n→∞} Var(H(ηⁿ))/Hηⁿ = lim_{n→∞} (D1 + o(1))/(H1 + o(1)) = D1/H1.
Thus, the conditions of Theorem 1.11 are satisfied, which proves entropic stability.
Hξ j > C1 . (1.5.13)
In this case, the conditions of Theorem 1.11 are satisfied, and, thereby, entropic
stability of family {η n } follows from it.
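For a sequence of i.i.d. components the check of Theorems 1.10–1.12 is immediate, since H(ηⁿ) is a sum of n independent terms (a sketch with a Bernoulli(0.3) component, chosen arbitrarily and computed analytically rather than by simulation):

```python
import math

p, q = 0.3, 0.7
h1, h0 = -math.log(p), -math.log(q)   # random entropy of a single symbol
H1 = p * h1 + q * h0                  # entropy rate H1 (per symbol)
D1 = p * h1**2 + q * h0**2 - H1**2    # variance rate D1 (per symbol)

for n in (10, 100, 1000):
    # Var[H(η^n)] / H²_{η^n} = D1 n / (H1 n)² = (D1/H1²)/n → 0, condition (1.5.8)
    ratio = D1 * n / (H1 * n) ** 2
    print(n, ratio)
```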
This potential is similar to the potentials that will be considered further (Sections 4.1
and 4.4). With the help of this potential it is convenient to investigate the rate of
convergence (1.5.2–1.5.4) in Theorem 1.9. This subject is covered in the following
theorem.
Theorem 1.13. Let potential (1.5.15) be defined and differentiable on the interval
s1 < α < s2 (s1 < 0; s2 > 0), and let the equation
dμ0(s)/ds = (1 + ε)Hη  (ε > 0) (1.5.16)
have a root s ∈ [0, s2]. Then the subset A of realizations of η defined by the constraint
H(η)/Hη − 1 > ε, (1.5.17)
Proof. The proof is analogous to that of Theorem 1.9 in many aspects. Rela-
tion (1.5.19) follows from (1.5.17). Inequality (1.5.17) is equivalent to the inequality
∑ P(η ) = 1 − P(A),
B
we find that the number of terms in the above sum (i.e. the number of realizations
in B) satisfies the inequality
Therefore,
We can obtain a number of simple approximate relations from the formulae provided in the previous theorem if we use the condition that ε is small. For ε = 0,
equation (1.5.16) has the null root s = 0, because dμ0(0)/ds = Hη. For small values
of ε, the value of s is small as well. Thus, the right-hand side of equation (1.5.16)
can be expanded into a Maclaurin series
Furthermore, we write down the Maclaurin expansion for the expression in the exponent of (1.5.18):
sμ0′(s) − μ0(s) = (1/2)μ0″(0)s² + O(s³).
Substituting (1.5.21) into the above expression, we obtain
P(A) ≤ exp{−ε²Hη²/(2μ0″(0))}[1 + O(ε³)]. (1.5.22)
Up to now we have assumed that a random variable ξ , with entropy Hξ , can take
values from some discrete space consisting of either a finite or a countable number
of elements, for instance, messages, symbols, etc. However, continuous variables
are also widespread in engineering, i.e. variables (scalar or vector), which can take
values from a continuous space X, most often from the space of real numbers. Such a
random variable ξ is described by the probability density function p(ξ ) that assigns
the probability
ΔP = ∫_{ξ∈ΔX} p(ξ) dξ ≈ p(A)ΔV  (A ∈ ΔX)
P(Ai) = ωi.
p(ξ) = ∑i ωi δ(ξ − Ai), ξ ∈ X. (1.6.3)
P(Ai) ≈ ΔP/ΔN for Ai ∈ ΔX. (1.6.4a)
Then, for the sum over points lying within ΔX, we have
−∑_{Ai∈ΔX} P(Ai) ln P(Ai) ≈ −ΔP ln (ΔP/ΔN).
Summing with respect to all regions ΔX, we see that the entropy (1.6.4) assumes
the form
Hξ ≈ −∑ ΔP ln (ΔP/ΔN). (1.6.5)
If we introduce a measure ν0(ξ) specifying the density of points Ai, and such that
by integrating ν0(ξ) we calculate the number of points
ΔN = ∫_{ξ∈ΔX} ν0(ξ) dξ,
ν0(ξ) = ∑i δ(ξ − Ai),
p̃(ξ)/ν̃0(ξ) ≈ p(ξ)/ν0(ξ).
This formula can also be derived from (1.6.5), since ΔP ≈ pΔV, ΔN ≈ ν0ΔV when
the 'radius' r0 is significantly smaller than the sizes of regions ΔX. However, if the
'smoothing radius' r0 is significantly larger than the mean distance between points
Ai, then the smoothed functions (1.6.7) will have a simple (smooth) representation,
which was assumed in, say, formula (1.6.2).
Discarding the signs ∼ in (1.6.8) for both densities, instead of (1.6.2), we obtain
the following formula for entropy Hξ :
Hξ = −∫_X p(ξ) ln [p(ξ)/ν0(ξ)] dξ. (1.6.9)
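Formula (1.6.9) is easy to evaluate numerically. In the sketch below the choices p(ξ) = 2 on [0, 1/2] and a constant point density ν0 = 100 are hypothetical; for them the integral equals ln 50, the logarithm of the effective number of points carrying the probability mass:

```python
import math

def H_generalized(p, nu0, a, b, steps=100_000):
    # midpoint-rule evaluation of H = -∫ p(ξ) ln(p(ξ)/ν0(ξ)) dξ, formula (1.6.9)
    dx = (b - a) / steps
    total = 0.0
    for i in range(steps):
        x = a + (i + 0.5) * dx
        px = p(x)
        if px > 0:
            total -= px * math.log(px / nu0(x)) * dx
    return total

H = H_generalized(lambda x: 2.0, lambda x: 100.0, 0.0, 0.5)
assert abs(H - math.log(50)) < 1e-9   # -ln(2/100) = ln 50
```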
ν (A) should be given such that the measure P is absolutely continuous with respect
to ν .
A measure P is called absolutely continuous with respect to measure ν , if for
every set A from F, such that ν (A) = 0, the equality P(A) = 0 holds. According to
the well-known Radon–Nikodym theorem, it follows from the absolute continuity
of measure P with respect to measure ν that there exists an F-measurable function
f (x), denoted dP/d ν and called the Radon–Nikodym derivative, which generalizes
the notion of probability density. It is defined for all points from space X excluding,
perhaps, some subset Λ , for which ν (Λ ) = 0 and therefore P(Λ ) = 0.
Thus, if the condition of absolute continuity is satisfied, then the entropy Hξ is
defined with the help of the Radon–Nikodym derivative by the formula
Hξ = −∫_{X−Λ−Λ0} ln [(dP/dν)(ξ)] P(dξ). (1.6.13)
The subset Λ , for which function dP/d ν is not defined, has no effect on the result
of integration since it has null measure P(Λ ) = ν (Λ ) = 0. Also, there is one more
inessential subset Λ0 , namely, a subset on which function dP/d ν is defined but equal
to zero, because
P(Λ0) = ∫_{Λ0} (dP/dν) ν(dξ) = ∫_{Λ0} 0 · ν(dξ) = 0
H(ξ) = −ln [(dP/dν)(ξ)] (1.6.14)
plays the role of random entropy analogous to random entropy (1.2.2). It is defined
almost everywhere in X, i.e. in the entire space excluding, perhaps, sets Λ + Λ0 of
zero probability P.
By analogy with (1.6.11), if N = ν (X) < ∞, then instead of ν (A) we can intro-
duce a normalized (i.e. probability) measure
where
ds² = 2H^{P/(P±δP)} = 2 ∫ ln [P(dζ)/(P(dζ) ± δP(dζ))] P(dζ). (1.6.19)
By expanding the function −ln(1 ± δP/P) in powers of δP/P, it is not difficult
to verify that this metric can also be given by the equivalent formula
ds² = ∫ [δP(dζ)]²/P(dζ) = ∫ [δ ln P(dζ)]² P(dζ). (1.6.20)
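For a discrete distribution the equivalence of (1.6.19) and (1.6.20) can be checked directly: for a small perturbation δP with ∑δP = 0 the two expressions agree up to higher-order terms (the numbers below are arbitrary illustrations):

```python
import math

P  = [0.5, 0.3, 0.2]
dP = [1e-4, -0.5e-4, -0.5e-4]   # small perturbation, sums to zero

# discrete analogue of (1.6.19): ds² = 2 ∑ P ln(P/(P + δP))
ds2_exact = 2 * sum(p * math.log(p / (p + d)) for p, d in zip(P, dP))
# discrete analogue of (1.6.20): ds² = ∑ (δP)²/P
ds2_quad = sum(d * d / p for p, d in zip(P, dP))

assert abs(ds2_exact - ds2_quad) < 1e-3 * ds2_quad
```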
Here and further we assume that differentiability conditions are satisfied. It follows
from (1.6.22) that
(ds/dλ)² = −∫ [∂² ln Pλ(dζ)/∂λ²] Pλ(dζ). (1.6.23)
∫ [(∂ ln Pλ(dζ)/∂λ)² + ∂² ln Pλ(dζ)/∂λ²] Pλ(dζ) = (∂/∂λ) ∫ [∂ ln Pλ(dζ)/∂λ] Pλ(dζ)
We obtain
−∂² ln Pλ(dζ)/∂λ² = d²Γ(λ)/dλ².
Therefore, due to (1.6.23) we get
(ds/dλ)² = d²Γ/dλ²,
and also
−∂² ln Pλ(dζ)/∂λ² = (ds/dλ)². (1.6.26)
Moreover, it is not difficult to make certain that expression (1.6.17) can be rewritten
in the following integral form:
H^{P/Q} = −∫₀¹ dλ ∫ [∂ ln Pλ(dζ)/∂λ] P0(dζ) = −∫₀¹ dλ ∫₀^λ dλ′ ∫ [∂² ln Pλ′(dζ)/∂λ′²] P0(dζ).
To derive the last formula we have taken into account that ∫ [∂ ln Pλ(dζ)/∂λ] P0(dζ) = 0
for λ = 0. Then we take into consideration (1.6.26) and obtain
H^{P/Q} = ∫₀¹ dλ ∫₀^λ (ds/dλ′)² dλ′ = ∫₀¹ (1 − λ)(ds/dλ)² dλ. (1.6.27)
Entropy (1.6.13), (1.6.16) defined in the previous section possesses a set of prop-
erties, which are analogous to the properties of an entropy of a discrete random
variable considered earlier. Such an analogy is quite natural if we take into account
the interpretation of entropy (1.6.13) (provided in Section 1.6) as an asymptotic case
(for large N) of entropy (1.6.1) of a discrete random variable.
The non-negativity property of entropy, which was discussed in Theorem 1.1, is
not always satisfied for entropy (1.6.13), (1.6.16) but holds true for sufficiently large
N. The constraint
Hξ^{P/Q} ≤ ln N
results in non-negativity of entropy Hξ.
Now we move on to Theorem 1.2, which considered the maximum value of en-
tropy. In the case of entropy (1.6.13), when comparing different distributions P we
need to keep measure ν fixed. As it was mentioned, quantity (1.6.17) is non-negative
and, thus, (1.6.16) entails the inequality
Hξ ≤ ln N.
Hξ = ln N.
Theorem 1.15. Entropy (1.6.13) attains its maximum value equal to ln N, when
measure P is proportional to measure ν .
This result is rather natural in the light of the discrete interpretation of for-
mula (1.6.13) given in Section 1.6. Indeed, a proportionality of measures P and ν
means exactly a uniform probability distribution on discrete points Ai and, thereby,
Theorem 1.15 becomes a paraphrase of Theorem 1.2. The following statement is an
analog of Theorems 1.2 and 1.15 for entropy Hξ^{P/Q}: entropy Hξ^{P/Q} attains a
minimum value equal to zero when distribution P coincides with Q.
According to Theorem 1.15, it is reasonable to interpret entropy Hξ^{P/Q}, defined
by formula (1.6.17), as a deficit of entropy, i.e. as the lack of this quantity needed to
attain its maximum value.
So far we assumed that measure P is absolutely continuous with respect to measure Q or (which is the same for finite N) measure ν. This raises the question of how to
define entropy Hξ or Hξ^{P/Q} when there is no such absolute continuity. The answer
to this question can be obtained if formula (1.6.16) is considered as an asymptotic
case (for very large N) of the discrete version (1.6.1). If in condensing the points
Ai (introduced in Section 1.6) we regard, contrary to formula P(Ai) ≈ ΔP/(NΔQ)
(see (1.6.4a)), the probabilities P(Ai) of some points as finite: P(Ai) > c > 0 (c is
independent of N), then measure P, as N → ∞, will not be absolutely continuous
with respect to measure Q. In this case, the deficiency Hξ^{P/Q} will increase without
bound as N → ∞. This allows us to assume
Hξ^{P/Q} = ∞
if measure P is not absolutely continuous with respect to Q (i.e. singular with re-
spect to Q). However, the foregoing does not define the entropy Hξ in the absence
of absolute continuity, since we have indeterminacy of the type ∞ − ∞ according
to (1.6.16). In order to eventually define it, we require a more detailed analysis of
the passage to the limit N → ∞ related to condensing points Ai .
Other properties of the discrete version of entropy, mentioned in Theorems 1.3,
1.4, 1.6, are related to entropy of many random variables and conditional entropy.
With a proper definition of the latter notions, the given properties will take place for
the generalized version, based on definition (1.6.13), as well.
Consider two random variables ξ , η . According to (1.6.13) their joint entropy is
of the form
Hξ,η = −∫ ln [P(dξ, dη)/ν(dξ, dη)] P(dξ, dη). (1.7.1)
At the same time, applying formula (1.6.13) to a single random variable ξ or η we
obtain
Hξ = −∫ ln [P(dξ)/ν1(dξ)] P(dξ),
Hη = −∫ ln [P(dη)/ν2(dη)] P(dη).
Here, ν1 , ν2 are some measures; their relation to ν will be clarified later. We define
conditional entropy Hξ |η as the difference
Hξ |η = Hξ η − Hη , (1.7.2)
Hξ η = Hη + Hξ |η . (1.7.3)
Taking into account (1.6.13) and (1.7.2), it is easy to see that for Hξ |η we will have
the formula
Hξ|η = −∫ ln [P(dξ | η)/ν(dξ | η)] P(dξ, dη), (1.7.4)
where P(dξ | η), ν(dξ | η) are conditional measures defined as the Radon–Nikodym
derivatives with the help of the standard relations
P(ξ ∈ A, η ∈ B) = ∫_{η∈B} P(ξ ∈ A | η) P(dη),
ν(ξ ∈ A, η ∈ B) = ∫_{η∈B} ν(ξ ∈ A | η) ν2(dη)
(sets A, B are arbitrary). The definition in (1.7.4) uses the following random entropy:
H(ξ | η) = −ln [P(dξ | η)/ν(dξ | η)].
ν(dξ, dη) = ν1(dξ)ν2(dη), (1.7.5)
Its non-negativity is seen from here due to condition (1.7.5). Therefore, the differ-
ence (1.7.6) is non-negative. The proof is complete.
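The definition (1.7.2)–(1.7.3) mirrors the discrete identity Hξη = Hη + Hξ|η, which can be verified on a toy joint distribution (the 2×2 table below is an arbitrary illustration):

```python
import math

def H(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

# hypothetical joint distribution P(ξ, η)
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
P_eta = {e: sum(p for (x, ee), p in joint.items() if ee == e) for e in (0, 1)}

H_joint = H(joint.values())
H_eta = H(P_eta.values())
# conditional entropy computed directly: -∑ P(ξ, η) ln P(ξ | η)
H_cond = -sum(p * math.log(p / P_eta[e]) for (x, e), p in joint.items())

assert abs(H_joint - (H_eta + H_cond)) < 1e-12   # Hξη = Hη + Hξ|η
```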
H(ξk) = −ln [P(dξk)/νk(dξk)], (1.7.10)
Hξk = E[H(ξk)], (1.7.11)
H(ξk | ξ1, …, ξk−1) = −ln [P(dξk | ξ1, …, ξk−1)/νk(dξk)], (1.7.12)
Hξk|ξ1,…,ξk−1 = E[H(ξk | ξ1, …, ξk−1)] = −E[ln [P(dξk | ξ1, …, ξk−1)/νk(dξk)]]. (1.7.13)
where
N = ∫ ν(dξ, dη), N1 = ∫ ν1(dξ), N2 = ∫ ν2(dη),
so that
Hξ|η^{P/Q} = ln N1 − Hξ|η. (1.7.17a)
Hξ ≥ Hξη − Hη.
Using (1.7.14), in view of (1.7.15), (1.7.16), we obtain from the last formula that
Hξη^{P/Q} − Hη^{P/Q} ≥ Hξ^{P/Q}.
Comparing it with (1.7.4b), it is easy to see that the sign ≤ has been replaced
with the opposite one (≥) for entropy H^{P/Q}. This is convincing evidence that entropies Hξ^{P/Q}, Hξ|η^{P/Q} cannot be regarded as measures of uncertainty, in contrast to
entropies (1.6.1) or (1.6.13).
In the case of many random variables, it is expedient to impose constraints of
type (1.7.15), (1.7.16) for many variables and use the hierarchical additivity property
Hξ1,…,ξn^{P/Q} = Hξ1^{P/Q} + Hξ2|ξ1^{P/Q} + Hξ3|ξ1,ξ2^{P/Q} + ⋯ + Hξn|ξ1,…,ξn−1^{P/Q}. (1.7.20)
Typically in online processing, the quantitative equality between the amounts of information to be encoded and information that has been encoded is maintained only
on average. Over time, a random time lag accumulates. For a message sequence of
fixed length, the length of its code sequence will have a random spread increasing with
time; and vice versa: for a fixed record length, the number of elementary messages
transmitted will have an increasing random spread.
Another approach can be called ‘block’ or ‘batch’ encoding. A finite collection (a
block) of elementary messages is encoded. If there are several blocks, then different
blocks are encoded and decoded independently. In this approach, there is no increase
in random time lag, but there is a loss of some realizations of the message. A small
portion of message realizations cannot be encoded and is lost, because there are not
enough code realizations. If the block is entropically stable, then the probability of
such a loss is quite small. Therefore, when we use the block approach, we should
investigate problems related to entropic stability.
Following tradition, we mostly consider online encoding in the present chapter.
In the last section, we study the errors of the aforementioned type of encoding,
which occur in the case of a finite length of an encoding sequence.
Let us confirm the validity and efficiency of the definitions of entropy and the
amount of information given earlier by considering a transformation of information
from one form to another. Such a transformation is called encoding and is employed
in the transmission and storage of information.
Suppose that the message to be transmitted and recorded is represented as a se-
quence ξ1 , ξ2 , ξ3 , . . . of elementary messages or random variables. Let each elemen-
tary message be represented in the form of a discrete random variable ξ , which takes
one out of m values (say, 1, . . . , m) with probability P(ξ ). It is required to transform
the message into a sequence η1 , η2 , . . . of letters of some alphabet A = (A, B, . . . , Z).
The number of letters in this alphabet is denoted by D. We treat a space, punctuation
marks and so on as letters of the alphabet, so that a message is represented in one
word. It is required to record a message in such a way that the message itself can be
recovered from the record without any losses or errors. The respective theory has to
show which conditions must be satisfied and how to do so.
Since there is no fixed constraint between the number of elementary messages
and the number of letters in the alphabet, the trivial ‘encoding’ of one message by
one special letter may not be the best strategy in general. We will associate each
separate message, i.e. each realization ξ = i of a random variable, with the corresponding 'word' V(i) = (η1^(i), …, η_{li}^(i)) in the alphabet A (li is the length of that word).
The full set of such words (their number equals m) forms a code. Having defined
the code, we can recover the message in letters of alphabet A from its realization
ξ1 , ξ2 , . . . , i.e. it will take the form
2.1 Main principles of encoding discrete information 37
nHξ = L ln D. (2.1.1)
This relationship tells us that every letter is used to its ‘full power’; it is an indication
of an optimal code.
Further, we consider a particular realization of the message, ξ = j. According to (1.2.1) it contains random information H(ξ) = −ln P(j). But for an optimal code every letter of the alphabet carries information ln D due to (2.1.2). It follows that the code word V(ξ) of this message (which also carries the information H(ξ) = −ln P(ξ) in the optimal code) must consist of l(ξ) = −ln P(ξ)/ln D letters.
For non-optimal encoding, every letter carries less information than ln D. That is
why the length of the code word l(ξ ) (which carries information − ln P(ξ ) = H(ξ ))
must be greater. Therefore, for each code

l(ξ) ≥ H(ξ)/ln D,

and, after averaging,

lav ≥ Hξ / ln D.    (2.1.3)
Example 2.1. Let ξ be a random variable with the following values and probabili-
ties:
ξ       1     2     3     4     5      6      7      8
P(ξ)   1/4   1/4   1/8   1/8   1/16   1/16   1/16   1/16
The message is recorded in the binary alphabet (A, B), so that D = 2. We choose the code words
In order to tell whether this code is good or not, we compare it with an optimal
code, for which relation (2.1.1) is satisfied. Computing entropy of an elementary
message by formula (1.2.3), we obtain
Hξ = ln 2 + (3/4) ln 2 + ln 2 = 2.75 ln 2 nat = 2.75 bits.
There are three letters per elementary message in code (2.1.4), whereas the same
message may require L/n = Hξ / ln 2 = 2.75 letters in the optimal code according
to (2.1.1). Consequently, it is possible to compress the record by 8.4%. As an opti-
mal code we can choose the code
Hη1 = 1 bit;  Hη2|η1 = 1 bit;  Hη3|η1,η2 = 1 bit.

Thus, every letter carries the same information of 1 bit, independently of its position.
Under this condition random information of an elementary message ξ is equal to
the length of the corresponding code word:
H (ξ ) = l (ξ ) bit. (2.1.6)
Therefore, the average length of the word is equal to information of the elementary
message:
Hξ = lav bit = lav ln 2 nat.
This complies with (2.1.1) so that the code is optimal. Relation P(ξ ) = (1/2)l(ξ ) ,
which is equivalent to equality (2.1.6), is a consequence of independence of letter
distribution in a literal record of a message.
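The relation P(ξ) = (1/2)^{l(ξ)} and identity (2.1.6) are easy to verify numerically for the dyadic distribution of Example 2.1. The following Python sketch is our own illustration (only the distribution is taken from the example):

```python
import math

# distribution of Example 2.1
P = [1/4, 1/4, 1/8, 1/8, 1/16, 1/16, 1/16, 1/16]

# random information H(ξ) = −log2 P(ξ) of each value, in bits
info_bits = [-math.log2(p) for p in P]

# for a dyadic distribution these are integers and can serve as word lengths l(ξ)
lengths = [int(i) for i in info_bits]

# entropy Hξ and average word length lav of the code with l(ξ) = H(ξ) bits
H = sum(p * i for p, i in zip(P, info_bits))
l_av = sum(p * l for p, l in zip(P, lengths))
```

Both H and l_av come out as 2.75, in agreement with (2.1.1) and (2.1.6).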
Relation (2.1.3) with the equality sign is valid under the assumption that n is
large, when we can neglect probability P(An ) of non-essential realizations of the
sequence ξ1 , . . . , ξn (see Theorems 1.7, 1.9). In this sense it is proper asymptotically.
We cannot establish a similar relation for a finite n. Indeed, if we require precise
reproduction of all literal messages of a finite length n (their number equals mn ),
then length L of the record will be determined from the formula DL = mn . It will
be impossible to shorten it (for instance, to the value nHξ / ln D) without loss of
some messages (the probability of which is finite for finite n). On the other hand,
supposing that length L = L(ξ1 , . . . , ξn ) is not constant for finite n, we can encode
messages in such a way that (2.1.3) transforms into the reverse inequality lav < Hξ / ln D.
We demonstrate the validity of the last fact in the case of n = 1. Let there be
given three possibilities ξ = 1, 2, 3 with probabilities 1/2, 1/4, 1/4. By selecting a
code
ξ       1     2     3
V(ξ)   (A)   (B)   (AB)        (2.1.7)
corresponding to D = 2, we obtain lav = 1·(1/2) + 1·(1/4) + 2·(1/4) = 1.25, whereas Hξ / ln 2 = 1.5. Therefore, for the given single message inequality (2.1.3) is violated.
40 2 Encoding of discrete information in the absence of noise and penalties
Let us now consider the case of an unbounded (from one side) sequence ξ1 , ξ2 ,
ξ3 , . . . of identically distributed independent messages. It is required to encode the
sequence. In order to apply the reasoning of the previous paragraph, one has to
decompose this sequence into intervals, called blocks, which contain n elementary
messages each. Then we can employ the aforementioned general asymptotic (‘ther-
modynamic’) relations for large n.
However, there also exists a different approach to studying encoding in the absence of noise: one that avoids decomposing the sequence into blocks and discarding inessential realizations. The corresponding theory is presented in this section.
Code (2.1.7) is suitable for transmitting (or recording) one single message, but not for transmitting a sequence of such messages. For instance, the record
ABAB can be simultaneously interpreted as the record V (1)V (2)V (3) of the mes-
sage (ξ1 , ξ2 , ξ3 ) = (1 2 3) or the record V (3)V (1)V (2) of the message (3 1 2) (when
n = 3), let alone the messages (ξ1 , ξ2 ) = (3 3); (ξ1 , ξ2 , ξ3 , ξ4 ) = (1 2 1 2), which
correspond to different n. The code in question does not make it possible to unam-
biguously decode a message and, thereby, we have to reject it if we want to transmit
a sequence of messages.
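The ambiguity can be checked mechanically by listing every decomposition of a record into words of code (2.1.7). A small Python sketch (the recursive search is our illustration, not part of the original text):

```python
def parses(record, codebook):
    """Return every way of splitting `record` into code words of `codebook`."""
    if not record:
        return [[]]
    result = []
    for message, word in codebook.items():
        if record.startswith(word):
            for rest in parses(record[len(word):], codebook):
                result.append([message] + rest)
    return result

code = {1: "A", 2: "B", 3: "AB"}        # code (2.1.7)
decodings = parses("ABAB", code)        # four distinct messages fit this record
```

The four parses are (1 2 1 2), (1 2 3), (3 1 2) and (3 3) — exactly the ambiguity described above.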
The requirement that every long sequence of code symbols can be uniquely decomposed into words restricts the class of feasible codes. Codes in which every semi-infinite sequence of code words is uniquely decomposed into words are called uniquely decodable. As can be proven (see, for instance, the book by Feinstein [12] or its English original [11]), the inequality

∑_{ξ=1}^{m} D^{−l(ξ)} ≤ 1    (2.2.1)

is satisfied for such codes.
In Kraft (prefix) codes, no code word can be a forepart ('prefix') of another word. In code (2.1.7) this condition is violated for word (A), because it is a prefix of word (AB). Codes can be drawn in the form of a 'tree' (Figure 2.1), similar to the 'trees' shown in Figures 1.1 and 1.2. However, if a code is considered separately from a message ξ, no probabilities are assigned to the 'branches' of a 'code tree'. A branch is chosen in stages by recording the next letter η of the word. The end of a word corresponds to a special end node, which we denote by a triangle in Figure 2.1. For each code word there is a code line running from the start node to an end node. The 'tree' corresponding to code (2.1.5) is depicted in Figure 2.1.
For Kraft's uniquely decodable codes, no end node belongs to another code line.
Theorem 2.1. Inequality (2.2.1) is a necessary and sufficient condition for existence
of a Kraft uniquely decodable code.
Proof. First, we prove that inequality (2.2.1) is satisfied for every Kraft code. From each end node of the code tree we emanate the maximum number (equal to D) of auxiliary lines (represented by wavy lines in Figure 2.2). We suppose that they multiply in a maximal way (each of them produces D offspring lines) at all subsequent stages. Next, we calculate the number of auxiliary lines at some high-order stage k. We assume that k is greater than the length of the longest word. The end node of a word of length l(ξ) will produce D^{k−l(ξ)} auxiliary lines at stage k. The total number of auxiliary lines is equal to
∑_ξ D^{k−l(ξ)}.    (2.2.2)
Now we emanate auxiliary lines not only from end nodes but also from every interim
node if the number of code lines coming out from it is less than D. In turn, let
those lines multiply in a maximal way at the other stages. At the k-th stage the number of auxiliary lines will increase compared to (2.2.2) and will become equal to D^k. Consequently,

D^k ≥ ∑_ξ D^{k−l(ξ)},

and dividing by D^k yields (2.2.1).

Conversely, suppose inequality (2.2.1) holds for a given set of lengths l(1), . . . , l(m); we show how to construct a Kraft code with these lengths. Let m1 denote the number of one-letter words. Keeping only the terms with l(ξ) = 1 in (2.2.1), we get D^{−1} m1 ≤ 1.
Hence, m1 ≤ D and the alphabet contains enough letters to fill out all one-letter
words. Further, we consider two-letter words. Besides the letters already used at the first position, there remain D − m1 letters that can occupy the first position, and any of the D letters can be placed after the first letter. In total, there are (D − m1)D possibilities. Each of the D − m1 non-end nodes of the first stage produces D
lines. The number of those two-letter combinations (lines) must be enough to fill
out all two-letter words. We denote their number by m2 . Indeed, keeping only terms
with l(ξ ) = 1 or l(ξ ) = 2 in the left-hand side of (2.2.1), we have
D^{−1} m1 + D^{−2} m2 ≤ 1,

i.e. m2 ≤ D² − D m1. This exactly means that the number of two-letter combinations
is enough for completing two-letter words. After such a completion there are D2 −
Dm1 − m2 two-letter combinations left available, which we can use for adding new
letters. The number of three-letter combinations, equal to (D² − D m1 − m2)D, is enough for completing three-letter words, and so forth. At every step, a further portion of the terms of the sum ∑_ξ D^{−l(ξ)} is used. This finishes the proof.
As is seen from the proof provided above, it is easy to actually construct a code (filling out words with letters) once an appropriate set of lengths l(1), . . . , l(m) is given.
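The construction indicated in the proof — filling out words in order of increasing length, each time taking the next unused letter combination — can be sketched in Python as follows (the function name and details are our own, assuming the lengths already satisfy (2.2.1)):

```python
def construct_prefix_code(lengths, D=2, alphabet="AB"):
    """Build a prefix (Kraft) code with the given word lengths, shortest words first."""
    if sum(D ** -l for l in lengths) > 1:
        raise ValueError("Kraft inequality (2.2.1) is violated")
    code = {}
    value, prev_len = 0, 0
    for index, l in sorted(enumerate(lengths), key=lambda pair: pair[1]):
        value *= D ** (l - prev_len)     # extend the counter to the new length
        digits, v = [], value
        for _ in range(l):               # write `value` in base D, padded to length l
            digits.append(alphabet[v % D])
            v //= D
        code[index] = "".join(reversed(digits))
        value += 1                       # move to the next unused combination
        prev_len = l
    return code

words = list(construct_prefix_code([2, 2, 3, 3, 4, 4, 4, 4]).values())
```

For the word lengths of Example 2.1 this produces eight words, none of which is a prefix of another.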
Next we move to the main theorems.
Theorem 2.2. The average word length lav cannot be less than Hξ / ln D for any
encoding.
Proof. We consider the difference

lav ln D − Hξ = E[l(ξ) ln D + ln P(ξ)] = −E ln [D^{−l(ξ)}/P(ξ)].
2.2 Main theorems for encoding without noise. I.i.d. messages 43
since, due to the inequality ln x ≤ x − 1,

−E ln [D^{−l(ξ)}/P(ξ)] ≥ E[1 − D^{−l(ξ)}/P(ξ)] = 1 − E[D^{−l(ξ)}/P(ξ)],

and

E[D^{−l(ξ)}/P(ξ)] = ∑_ξ P(ξ) · D^{−l(ξ)}/P(ξ) = ∑_ξ D^{−l(ξ)} ≤ 1

by the decipherability condition (2.2.1). Hence

lav ln D − Hξ ≥ 1 − ∑_ξ D^{−l(ξ)} ≥ 0,

which proves the theorem.
Proof. We choose lengths l(ξ ) of code words in such a way that they satisfy the
inequality
−ln P(ξ)/ln D ≤ l(ξ) < −ln P(ξ)/ln D + 1.    (2.2.5)
Such a choice is possible because the interval defined by the double inequality has unit length and, therefore, contains at least one integer.
It follows from the left-hand side inequality that P(ξ) ≥ D^{−l(ξ)} and, thus,

1 = ∑_ξ P(ξ) ≥ ∑_ξ D^{−l(ξ)},
i.e. the decipherability condition (2.2.1) is satisfied. As is seen from the arguments
just before Theorem 2.2, it is not difficult to practically construct code words with
selected lengths l(ξ ). Averaging out both sides of the right-hand side inequality
of (2.2.5), we obtain (2.2.4). This finishes the proof.
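The choice of lengths (2.2.5) is straightforward to carry out numerically. A Python sketch for D = 2 (the sample distribution is our own choice) picks l(ξ) = ⌈−log2 P(ξ)⌉ and checks both the decipherability condition (2.2.1) and the bound on the average length:

```python
import math

def shannon_lengths(probabilities):
    """Word lengths satisfying (2.2.5) for the binary alphabet (D = 2)."""
    return [math.ceil(-math.log2(p)) for p in probabilities]

probs = [0.4, 0.3, 0.2, 0.1]                       # an arbitrary illustrative distribution
lengths = shannon_lengths(probs)
H_bits = -sum(p * math.log2(p) for p in probs)     # entropy in bits
l_av = sum(p * l for p, l in zip(probs, lengths))  # average word length
kraft = sum(2 ** -l for l in lengths)              # left-hand side of (2.2.1)
```

The Kraft sum does not exceed 1, so a prefix code with these lengths exists, and the average length obeys Hξ/ln 2 ≤ lav < Hξ/ln 2 + 1.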
lav < Hξ / ln D + 1/n.

Increasing n, we can make the value 1/n as small as needed, which proves the theorem.
The provided results have been derived for the case of a stationary sequence of
independent random variables (messages). These results can be extended to the case
of dependent messages, the non-stationary case or others. In this case, it is essential
that a sequence satisfies certain properties of entropic stability (see Section 1.5).
The previous reasoning not only resolves the principal questions about the existence of asymptotically optimal codes, i.e. codes for which the average length of a message converges to the minimum value, but also provides practical methods of finding them. By
selecting n and determining lengths of code words with the help of the inequality
−ln P(ξ1, . . . , ξn)/ln D ≤ l(ξ1, . . . , ξn) < −ln P(ξ1, . . . , ξn)/ln D + 1,
we obtain a code that is asymptotically optimal according to Theorem 2.4. However,
such a code may not be quite optimal for fixed n. In other words, it may appear that
the message ζ = (ξ1 , . . . , ξn ) can be encoded in a better way, i.e. with a smaller
length lav .
Huffman [23] (see also Fano [10] or its original in English [9]) found a simple method of optimal encoding, which attains the minimum average length amongst all possible codes for a given message.
First, we consider the case D = 2 and some optimal code K for the message ζ. We investigate which properties the optimal code, or the respective 'tree', must possess.
2.3 Optimal encoding by Huffman. Examples 45
At every stage (except possibly the last one) the maximum number D of lines must emanate from each non-end node. Otherwise, an end line from the last stage could be moved to the 'incomplete' node, thereby reducing the length lav. There must be at least two lines at the last stage; otherwise, the last stage would be redundant and could be excluded, which would shorten the corresponding code line. Amongst the lines of the last stage there are two that correspond to the two least likely realizations ζ1, ζ2 of the message, chosen from the set of all its realizations ζ1, . . . , ζm. Otherwise, we would be able to exchange messages, assigning the less likely realization to the longer word, and thereby shorten the average word length.
But if two least likely messages have code words ending at the last stage, then we
can suppose that their lines emanate from the same node from the penultimate stage.
If it does not work for K, then by switching two messages we make it work and, thus,
we obtain another definite code having the same average length and, consequently,
being at least as good as the original code K.
We turn the node from the penultimate stage (from which the two aforemen-
tioned lines emanate) into an end node. Correspondingly, we treat two least likely
realizations ζ1 , ζ2 as one. Then the number of distinct realizations is equal to m − 1
and the respective probabilities are P(ζ1) + P(ζ2), P(ζ3), . . . , P(ζm). We can think of this as a new ('simplified') random variable, with a new code and a new ('truncated') 'tree'. The average word length of the original code is expressed in terms of the average length l′av of the truncated code by the formula

lav = l′av + P(ζ1) + P(ζ2).

Evidently, if the original code 'tree' minimizes lav, then the 'truncated' code 'tree' minimizes l′av, and vice versa. Therefore, the problem of finding an optimal code
‘tree’ is reduced to the problem of finding an optimal ‘truncated’ code ‘tree’ for
a message with m − 1 possibilities. All those considerations related to the initial
message above can be applied to this ‘simplified’ message. Thus, we obtain a ‘twice
simplified’ message with m−2 possibilities. Then the aforementioned consideration
is also applied to the latter message and so forth until a trivial message with two
possibilities is obtained. During the process of the specified simplification of a code
‘tree’ and a message, two branches of the code ‘tree’ merge into one at every step
and eventually its structure is understood completely.
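The repeated merging of the two least likely realizations is naturally implemented with a priority queue. The following Python sketch is our own illustration; it returns only the optimal word lengths, since each merge lengthens every word in the merged group by one letter:

```python
import heapq
import itertools

def huffman_lengths(probabilities):
    """Word lengths of an optimal binary (D = 2) Huffman code."""
    tie = itertools.count()            # tie-breaker so the heap never compares lists
    heap = [(p, next(tie), [i]) for i, p in enumerate(probabilities)]
    heapq.heapify(heap)
    lengths = [0] * len(probabilities)
    while len(heap) > 1:
        p1, _, group1 = heapq.heappop(heap)    # two least likely realizations
        p2, _, group2 = heapq.heappop(heap)
        for i in group1 + group2:              # the merge adds one letter to each word
            lengths[i] += 1
        heapq.heappush(heap, (p1 + p2, next(tie), group1 + group2))
    return lengths
```

For the distribution of Example 2.1 this reproduces the lengths 2, 2, 3, 3, 4, 4, 4, 4 (average 2.75 letters), and for the probabilities 1/2, 1/4, 1/4 it gives lengths 1, 2, 2.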
Similar reasoning can also be applied to the case D > 2. Suppose that all the
nodes of the code ‘tree’ are entirely filled, which occurs when m − 1 is divided by
D − 1 without a remainder. In fact, each node (except the terminal ones) adds D − 1
new lines (if we suppose that only one line comes to the first node from below),
and the quotient (m − 1)/(D − 1) is equal to the number of nodes. In this case every
‘simplification’ decreases the number of realizations of a random variable by D − 1.
Then the D least probable outcomes merge, and the D longest branches of the code
tree (among the remaining ones) are shortened by unity and are replaced by one
terminal node.
There is some difficulty in the case when quotient (m − 1)/(D − 1) is not integer.
This means that at least one of internal nodes of the code ‘tree’ must be incomplete.
But, as it was mentioned, incomplete nodes can belong only to the last stage in the
optimal tree, i.e. they are associated with the last choice. Without ‘impairing’ the
tree we can transpose messages related to words of the maximum length in such a
way that: 1) only one single node will be incomplete, 2) messages corresponding to
that node will have the least probability. We denote by r the remainder of the division of m − 1 by D − 1 (0 < r ≤ D − 2). Then the single incomplete node will have r + 1
branches. According to the aforesaid, the first ‘pruning’ of the ‘tree’ and the sim-
plification of the random variable will be the following: we select r + 1 least likely
probabilistic realizations out of ζ1 , . . ., ζm and replace them by a single realization
with the aggregate probability. The ‘pruned’ tree will have already filled internal
nodes. For further simplification we take D least likely realizations out of the ones
formed at the previous simplification and replace them by a single one with the ag-
gregate probability. The same operation is conducted further. We have just described
a method to construct an optimal Huffman’s code. The aforesaid considerations tell
us that the resultant code ‘tree’ may not coincide with the optimal ‘tree’ K but is not
worse than that, i.e. it has the same average length lav .
V (0) = A; V (1) = B.
Hξ / ln D = 0.544 (2.3.1)
and, thereby, lav = 0.544 is asymptotically attainable. The comparison of the latter number with 1 shows that we can achieve a significant improvement in conciseness if we move to more complicated types of encoding, with n > 1.
2. Suppose that n = 2. Then we will have the following possibilities and probabili-
ties:
ζ         1       2       3       4
ξ1ξ2      11      10      01      00
P(ζ)    0.015   0.110   0.110   0.765
Here ζ means an index of pair (ξ1 , ξ2 ).
At the first ‘simplification’ we can merge realizations ζ = 1 and ζ = 2 into
one having probability 0.015 + 0.110 = 0.125. Among realizations left after the
‘simplification’ and having probabilities 0.110, 0.125, 0.765 we merge two least
likely ones again. The scheme of such ‘simplifications’ and the respective code
‘tree’ are represented on Figures 2.3 and 2.4.
It just remains to position letters A, B along the branches in order to obtain the
following optimal code:
ζ            1      2      3     4
ξ1ξ2         11     10     01    00
V(ξ1ξ2)     AAA    AAB    AB     B
Its average word length is equal to
Fig. 2.3 The 'simplification' scheme for n = 2.
Fig. 2.4 The code 'tree' for the considered example with n = 2.
The resulting value is noticeably closer to the limit value lav = 0.544 (2.3.1) than lav = 0.68 (2.3.2). Increasing n and constructing optimal codes by the specified method, we can approach the value 0.544 as closely as desired. Of course, this is achieved at the cost of a more complicated coding system.
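The per-letter bookkeeping of this example can be rechecked directly from the pair probabilities in the table above; the optimal code found there has word lengths 3, 3, 2, 1. A short Python sketch (values as printed in the table, i.e. rounded):

```python
# pair probabilities from the table above (rounded values, p = P(ξ = 1) ≈ 1/8)
P_pair = {"11": 0.015, "10": 0.110, "01": 0.110, "00": 0.765}
word_length = {"11": 3, "10": 3, "01": 2, "00": 1}   # code AAA, AAB, AB, B

letters_per_pair = sum(P_pair[z] * word_length[z] for z in P_pair)
l_av = letters_per_pair / 2        # letters per elementary message
```

This gives l_av = 0.68 letters per message, already well below the one letter per message of the n = 1 code and on the way to the limit 0.544 of (2.3.1).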
The encoding methods described in Section 2.2 are such that the record length of a fixed number of messages is random. The provided theory gives estimates of the average length but says nothing about its deviations. Meanwhile, in practice a record length or a message transmission time can be bounded by technical specifications.
It may occur that a message record does not fit in permissible limits and, thus, the
given realization of the message cannot be recorded (or transmitted). This results
in certain losses of information and distortions (dependent on deviation of a record
length). The investigation of those phenomena is a special problem. As it will be
seen below, entropical stability of random variables
η n = (ξ1 , . . . ., ξn ) .
ln P (η n ) ln P (η n )
− l (η n ) < − +1
ln D ln D
2.4 Errors of encoding without noise in the case of a finite code sequence length 49
according to (2.2.5). Then the record of those messages, for which the constraint
−ln P(ηn)/ln D + 1 ≤ L    (2.4.1)
holds, will automatically be within fixed bounds. When decoding, those realizations
will be recovered with no errors. Records of some other realizations will not fit in; when decoding in such a case, we may settle on an arbitrary realization, which, as a rule, entails errors. There arises the question of how to estimate the probabilities of correct and erroneous decoding.
Evidently, the probability of decoding error will be at most the probability of the
inequality
−ln P(ηn)/ln D + 1 > L,    (2.4.2)
which is reverse to (2.4.1).
Inequality (2.4.2) can be represented in the form

H(ηn)/Hηn > ((L − 1)/Hηn) ln D.
But the ratio H(η n )/Hη n converges to one for entropically stable random variables.
Taking into account the definition of entropically stable variables (Section 1.5), we
obtain the following result.
Theorem 2.5. If random variables η n are entropically stable and the record length
L = Ln increases with a growth of n in such a way that the expression
((Ln − 1)/Hηn) ln D − 1
is kept larger than some positive constant ε , then probability Per of decoding error
converges to zero as n → ∞.
The condition of the theorem is satisfied, in particular, when Ln ln D > (1 + ε) Hηn, provided Hηn → ∞ as n → ∞.
The realizations that have not fitted in, and thereby have been decoded erroneously, pertain to the set An of non-essential realizations; recall that the set An is treated in Theorem 1.9.
Employing the results of Section 1.5, we can derive more detailed estimates of the error probability Per and determine its rate of decrease.
Theorem 2.6. For the fixed record length L > (Hη n / ln D) + 1 the probability of
decoding error satisfies the inequality
Per ≤ Var(H(ηn)) / [(L − 1) ln D − Hηn]².    (2.4.3)
Indeed, it suffices to apply Chebyshev's inequality

P[|ξ − E[ξ]| ≥ ε] ≤ Var(ξ)/ε²

to the random variable H(ηn)/Hηn with

ε = ((L − 1)/Hηn) ln D − 1.
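For i.i.d. messages both Hηn and Var(H(ηn)) grow linearly with n, so bound (2.4.3) is easy to evaluate. A Python sketch (the per-message distribution and the choice of record length are our own illustrative assumptions):

```python
import math

# per-message distribution (that of Example 2.1, as an illustration)
p = [1/4, 1/4, 1/8, 1/8, 1/16, 1/16, 1/16, 1/16]
H1 = -sum(pi * math.log(pi) for pi in p)                # entropy per message, nats
V1 = sum(pi * math.log(pi) ** 2 for pi in p) - H1 ** 2  # variance of the random entropy

def error_bound(n, L, D=2):
    """Chebyshev bound (2.4.3) on P_er for n i.i.d. messages and record length L."""
    H_n, V_n = n * H1, n * V1
    slack = (L - 1) * math.log(D) - H_n
    if slack <= 0:
        raise ValueError("the bound requires L > H/ln D + 1")
    return V_n / slack ** 2

# a record 10% longer than the entropy requires: the bound decays like 1/n
L100 = 1 + math.ceil(1.1 * 100 * H1 / math.log(2))
bound = error_bound(100, L100)
```

With the record length kept a fixed fraction above the entropy, the bound shrinks as n grows, illustrating Theorem 2.5.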
Using Theorem 1.13, it is easy to show a faster exponential law of decay for the
probability Per .
Theorem 2.7. For the fixed length L > Hη / ln D + 1 the probability of decoding error satisfies the inequality

Per ≤ e^{μ(s) − s μ′(s)},    (2.4.6)

where μ(α) is the potential defined by formula (1.5.15), and s is a positive root of the equation

μ′(s) = (L − 1) ln D,
if such a root lies both in the domain and in the differentiability region of the poten-
tial.
In order to prove the theorem we just need to use formula (1.5.18) from Theo-
rem 1.13 by writing
ε = ((L − 1)/Hη) ln D − 1.
For small ε we can replace inequality (2.4.6) with the following inequality:
Per ≤ exp{ −[(L − 1) ln D − Hη]² / (2 Var(H(η))) }.    (2.4.7)
It tells us about an exponential law of decay for the probability Per with a growth
of n.
Formulae (2.4.7) and (2.4.8) correspond to the case, in which the probability
distribution of entropy can be regarded as approximately Gaussian due to the central
limit theorem.
Chapter 3
Encoding in the presence of penalties. First variational problem
Symbols                    V_i        l(i)
Dot                        +−         2
Dash                       +++−       4        (3.1.1)
Spacing between letters    −−−        3
Spacing between words      −−−−−−     6
that allows us to find M(L) as a function of L. After M(L) has been found, it becomes
easy to determine information that can be transmitted by a recording of length L. As
in the previous cases, maximum information is attained when all of M(L) scenarios
are equiprobable. At the same time
H^L / L = ln M(L) / L,
3.1 Direct method of computing information capacity of a message for one example 55
where H^L = H_{v_{i1} ... v_{ik}} is the entropy of the recording. We take the limit of this relationship as L → ∞ and thereby obtain the entropy of the recording per unit length:
As is seen from this formula, there is no need to find an exact solution of equa-
tion (3.1.3) but it is sufficient to consider only asymptotic behaviour of ln M(L) for
large L. Equation (3.1.3) is linear and homogeneous. As for any such equation, we
can look for a solution in the form
Such a solution usually turns out to be possible only for certain (‘eigen’) val-
ues λ = λ1 , λ2 , . . .. With the help of the spectrum of these ‘eigenvalues’ a general
solution can be represented as follows:
∑_i W^{−l(i)} = 1    (3.1.9)
λ = lnW. (3.1.10)
Since the function M(L) can be neither complex nor of alternating sign, the dominant eigenvalue must be real: Im λm = 0. But (3.1.10) gives the only real eigenvalue. Consequently, the value ln W is the desired eigenvalue λm with the maximum real part. Formula (3.1.11) then
takes the form
M(L) ≈ Cm W^L,
and limit (3.1.4) appears to be equal to
H = λm = lnW (3.1.12)
that yields the solution of the problem in question. This solution was first found by Shannon [45] (the English original is [38]).
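Since the left-hand side of (3.1.9) decreases monotonically in W, the root and hence the capacity (3.1.12) are easy to find by bisection. A Python sketch (the bracketing interval is our own assumption):

```python
import math

def capacity_per_unit_length(lengths, hi=10.0, tol=1e-12):
    """Solve sum_i W**(-l_i) = 1 (eq. (3.1.9)) by bisection; return H = ln W."""
    f = lambda W: sum(W ** -l for l in lengths) - 1
    lo = 1.0 + 1e-9              # f(lo) > 0 for two or more symbols, f(hi) < 0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return math.log((lo + hi) / 2)

# symbol lengths 2, 4, 3, 6 of dot, dash, letter spacing, word spacing from (3.1.1)
H = capacity_per_unit_length([2, 4, 3, 6])
```

The returned H is the capacity (3.1.12) in nats per unit recording length; exp(H) is the root W of (3.1.9).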
The recording considered above or a transmission having a fixed length in the spec-
ified alphabet is an example of a noiseless discrete channel. Here we provide a more
general definition of this notion.
Let a variable y take values in a discrete set Y (not necessarily of finite cardinality). Further, some numerical function c(y) is given on Y. For reasons that will become clear later, we call it a cost function.
Let there be given some number a and the fixed condition

c(y) ≤ a.    (3.2.1)
that is, with the requirement that a recording length does not exceed the specified
value of L. The consideration of this example will be continued in Section 3.4 (Ex-
ample 3.4).
Variable Ma (the number of realizations) tells us about capabilities of the given
system (channel) to record or transmit information. The maximum amount of information that can be transmitted is evidently equal to ln Ma. This amount can be called the information capacity or channel capacity. However, a direct calculation of Ma sometimes involves difficulties, as is seen from the example covered in Section 3.1. Thus, it is more convenient to define the notion of channel capacity not as ln Ma but somewhat differently.
We introduce a probability distribution P(y) on Y and replace condition (3.2.1)
by an analogous condition for the mean
Here maximization is conducted over different distributions P(y), which are com-
patible with constraint (3.2.5).
Hence, channel capacity is defined as a solution of one variational problem.
As it will be seen from Section 4.3, there exists a direct asymptotic relationship
between value (3.2.6) and Ma . Speaking of applications of these ideas to statistical
physics, relationships (3.2.1) (taken with an equality sign) and (3.2.5) correspond to
microcanonical and canonical distributions, respectively. Asymptotic equivalence of
these distributions is well known in statistical physics.
The above definitions of a channel and its capacity can be modified a little, for
instance, by substituting inequalities (3.2.1), (3.2.6) with the following two-sided
inequalities

a1 ≤ c(y) ≤ a2;    a1 ≤ ∑_y c(y) P(y) ≤ a2    (3.2.7)
or with inequalities in the opposite direction (if the number of realizations remains
finite). Besides, in more complicated cases several numerical functions may be given, or a function taking values in some other space. These modifications usually do not entail fundamental changes, so we do not specifically address them here.
The cost function c(y) may have a different physical or technical meaning in different problems. It can characterize the 'costs' of certain symbols, reflecting the unequal expenditures incurred in recording or transmitting a symbol, for instance, different amounts of paint or electric power. It can also correspond to penalties placed on various adverse factors; for instance, it can penalize excessive height of letters. In particular, if the length of symbols is penalized, then (3.2.3), i.e. c(y) = l(i) for y = i, holds true.
It is essential that all three specified problem statements lead to the same solu-
tion if parameters a, I, β are coordinated properly. We call any of these mentioned
statements the first variational problem.
It is convenient to study the addressed questions with the help of thermodynamic
potentials introduced below.
P(y) ≥ 0,  y ∈ Y,    (3.3.2)

must be satisfied.
the set of constraints, since the solution of the problem with all other constraints
retained but without the given requirement turns out (as subsequent inspection will show) to satisfy it automatically. We introduce Lagrange multipliers β, γ and try to find an extremum of the expression
Here differentiation is carried out with respect to those and only those P(y) which are different from zero in the extremum distribution P(·). We assume that this distribution exists and is unique. The subset of Y whose elements have non-zero probabilities in the extremum distribution is denoted by Ỹ; we call Ỹ the 'active' domain. Hence, equation (3.3.4) is valid only for elements y belonging to Ỹ. From (3.3.4) we get
Theorem 3.1. When solving the problem of maximum entropy (3.2.6) under the constraint (3.2.8), the probabilities of all elements of Y at which the cost function c(y) takes finite values are different from zero in the extremum distribution. Therefore, if the function c(y) is finite for all elements, then the set Ỹ coincides with the entire set Y.
Proof. Assume the contrary, i.e. assume that some elements y ∈ Y have null probability P0(y) = 0 in the extremum distribution P0(·). Since the distribution P0 is extremal, for any other measure P1 (even non-normalized) with the same nullity set Y − Ỹ the following relationship is valid:
Now let P(y1) ≠ 0 for some y1 ∈ Y − Ỹ, and suppose that P(y) = 0 at the other points of the subset Y − Ỹ. We choose the remaining probabilities P(y), y ∈ Ỹ, in such a way that the constraints

∑_y P(y) = 1;   ∑_y c(y) P(y) = ∑_y c(y) P0(y)    (3.3.8)

are satisfied and the differences P(y) − P0(y) (y ∈ Ỹ) depend linearly on P(y1).
A non-zero probability P(y1) ≠ 0 leads to the emergence of the extra term −P(y1) ln P(y1) in the expression for the entropy Hy. Setting P1(y) = P(y) for y ∈ Ỹ and taking into account (3.3.7), we obtain that
For sufficiently small P(y1 ) (when − ln P(y1 ) − β c(y1 ) − γ + O(P(y1 )) > 0) the ex-
pression in the right-hand side of (3.3.9) is certainly positive. Thus, the entropy of
distribution P satisfying the same constraints (3.3.8) will exceed the entropy of the
extremum distribution P0 , which is impossible. So, element y1 having a zero proba-
bility in the extremum distribution does not exist. The proof is complete.
Since, by Theorem 3.1, Ỹ and Y coincide for finite values of c(y), henceforth we use formula (3.3.5) or (3.3.6) for the entire set Y.
3.3 Solution of the first variational problem. Thermodynamic parameters and potentials 61
∂²Hy / ∂P(y) ∂P(y′) = −(1/P(y)) δ_{yy′},    y, y′ ∈ Y,

where δ_{yy′} is the Kronecker symbol. Thus, taking into account the vanishing of the first-order derivatives and neglecting the third-order terms in the deviations ΔP(y), y ∈ Y, from the extremum distribution P, we obtain

ΔHy = Hy − C = −(1/2) ∑_{y∈Y} [ΔP(y)]² / P(y).    (3.3.10)
Since all the P(y) included here are positive, the difference Hy − C is negative, which proves the maximality of C.
It can also be proven that if β > 0, then the extremum distribution found above
corresponds to the minimum of average cost (3.2.9) with the fixed entropy (3.2.10).
In order to prove it, we need to take into account that analogously to (3.3.10) the
following relationship
ΔK = K − Kextr = −(1/2) ∑_{y∈Y} [ΔP(y)]² / P(y) < 0

is valid.
F (T ) = −T ln Z. (3.3.12)
With the help of the mentioned free energy we can compute entropy C and energy
R. We now prove a number of relations common in thermodynamics as separate
theorems.
Theorem 3.2. Free energy is connected with entropy and average energy by the
simple relationship
F = R − TC. (3.3.13)
Proof. It follows from the extremum distribution (3.3.6) that the random entropy equals

H(y) = (c(y) − F)/T.
Averaging of this equality with respect to y leads to relationship (3.3.13). The proof
is complete.
Theorem 3.3. Entropy can be computed via differentiating free energy with respect
to temperature
C = −dF/dT. (3.3.15)
−dF/dT = ln Z + T Z^{−1} dZ/dT = ln Z + T Z^{−1} ∑_y (c(y)/T²) e^{−c(y)/T},
−dF/dT = T^{−1}(−F + R) = C.

This ends the proof.
As is seen from formulae (3.3.13), (3.3.15), energy (average cost) can be ex-
pressed using free energy as follows:
R = F − T dF/dT.    (3.3.16)
It is simple to verify that this formula can be rewritten in the following more compact
form:
R = d(βF)/dβ = −d ln Z/dβ,    β = 1/T.    (3.3.17)
After calculation of functions C(T ), R(T ) it is straightforward to find the channel
capacity C = C(a) (3.2.6). Equation (3.2.8), i.e. the equation
R (T ) = a (3.3.18)
will determine T (a). So, the channel capacity will be equal to C(a) = C(T (a)).
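The chain Z → F → R, C given by (3.3.12), (3.3.13) and (3.3.15) can be traced numerically for any finite set of costs. A Python sketch (the two costs ±1 and the temperature are our own illustrative choices):

```python
import math

def potentials(costs, T):
    """Free energy, average cost and entropy of P(y) ∝ exp(−c(y)/T)."""
    Z = sum(math.exp(-c / T) for c in costs)        # partition function
    F = -T * math.log(Z)                             # free energy (3.3.12)
    P = [math.exp(-c / T) / Z for c in costs]        # extremum distribution
    R = sum(pr * c for pr, c in zip(P, costs))       # average cost (energy)
    C = -sum(pr * math.log(pr) for pr in P)          # entropy (channel capacity)
    return F, R, C

T = 0.7
F, R, C = potentials([1.0, -1.0], T)
```

One can confirm F = R − TC (Theorem 3.2) directly, and C = −dF/dT (Theorem 3.3) by a numerical derivative.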
Similarly, for problem (3.2.9), (3.2.10) we can find the minimum average cost with a given amount of information I. Equation (3.2.10) then takes the form

C(T) = I,

which determines the temperature T = T(I).
Theorem 3.4. If distribution (3.3.6) exists, i.e. the sum (3.3.11) converges, then the
following formula
dR/dC = T,    (3.3.19)
is valid, so that for T > 0 the average cost is an increasing function of entropy, and
for T < 0 it is decreasing.
Proof. Differentiating relationship (3.3.13), written as R = F + TC, we obtain dR = dF + C dT + T dC. Since dF = −C dT by (3.3.15), this reduces to
64 3 Encoding in the presence of penalties. First variational problem
dR = T dC. (3.3.20)
dH = dQ/T
where dQ is the amount of heat entering the system and augmenting its internal
energy by dR. H = C is the entropy.
Theorem 3.5. If distribution (3.3.6) exists and c(y) is not constant within Y , then
R turns out to be an increasing function of temperature. Also, channel capacity
(entropy) C is an increasing function of T for T > 0.
Proof. Differentiating (3.3.17) we obtain
dR/dT = −(1/T²) d²(βF)/dβ². (3.3.21)

Analogously, in view of (3.3.20),

dC/dT = −(1/T³) d²(βF)/dβ². (3.3.22)

Hence it suffices to verify that

−d²(βF)/dβ² > 0. (3.3.23)
Since βF = −ln Z, inequality (3.3.23) follows from the formulae

(1/Z) dZ/dβ = −∑_y c(y) e^{−βc(y)} / ∑_y e^{−βc(y)} = −E[c(y)], (3.3.25)

(1/Z) d²Z/dβ² = ∑_y c²(y) e^{−βc(y)} / ∑_y e^{−βc(y)} = E[c²(y)],

which give −d²(βF)/dβ² = d² ln Z/dβ² = E[c²(y)] − E²[c(y)], the variance of c(y), which is positive since c(y) is not constant within Y.
From the latter formulae, due to (3.3.23), we conclude that the function F(T) is concave for T > 0 and convex for T < 0.
The above-mentioned facts are a particular manifestation of the properties of thermodynamic potentials; representative examples of such potentials are F(T) and ln Z(β).
Hence, these potentials play a significant role not only in thermodynamics but also
in information theory. In the next chapter, we will touch upon subjects related to
asymptotic equivalence between constraints of types (3.2.1) and (3.2.5).
Example 3.1. For simplicity, let there be only two symbols initially: m = 2; y = 1, 2, which correspond to different costs, say c(1) = −a and c(2) = a (the origin of the cost scale, the analog of energy, can be chosen arbitrarily). In the example in question the optimal probability distribution has the form

P(1, 2) = e^{±a/T} / [2 cosh(a/T)] (3.4.2a)

according to (3.3.14). The functions (3.4.2) thus determined can be used when there exists a sequence y^L = (y₁, …, y_L) of length L consisting of the symbols described above. If the number of distinct elementary symbols equals 2, then the number of distinct sequences is evidently equal to m = 2^L. Next we assume that the
costs imposed on an entire sequence are the sum of the costs imposed on symbols,
which constitute this sequence. Hence,
c(y^L) = ∑_{i=1}^{L} c(y_i). (3.4.3)
Application of the above-derived formula (3.3.6) to the sequence shows that in this
case the optimal probability distribution for the sequence is decomposed into a prod-
uct of probabilities of different symbols. That is, thermodynamic functions F, R, H
related to the given sequence are equal to a sum of the corresponding thermody-
namic functions of constituent symbols. In the stationary case (when an identical
distribution of the costs corresponds to symbols situated at different places) we have
F = LF1 ; R = LR1 ; H = LH1 , where F1 , R1 , H1 are functions for a single symbol,
which have been found earlier [for instance, see (3.4.2)].
Example 3.2. Now we consider a more difficult example, for which the principle
of the cost additivity (3.4.3) does not hold. Suppose the choice of symbol y = 1 or y = 2 is cost-free, but a cost is incurred when switching symbols: if symbol 1 follows 1, or symbol 2 follows 2 in a sequence, there is no cost; but if 1 follows 2, or 2 follows 1, the cost 2d is incurred. It is easy to see that in this case the total cost of the entire sequence y^L can be written as follows:
3.4 Examples of application of general methods for computation of channel capacity 67
c(y^L) = 2d ∑_{j=1}^{L−1} [1 − δ_{y_j y_{j+1}}]. (3.4.4)
Further, it is required to find the conditional capacity of such a channel and its
optimal probability distribution. We begin with a calculation of the partition func-
tion (3.3.11):
Z = e^{−2βd(L−1)} ∑_{y₁,…,y_L} ∏_{j=1}^{L−1} e^{2βd δ_{y_j, y_{j+1}}}. (3.4.5)
Carrying out the summation over y_L, y_{L−1}, …, y₂ successively, each symbol contributes the factor e^{2βd} + 1, so that, with k = e^{−2βd},

Z = 2(1 + k)^{L−1} = 2^L e^{−βd(L−1)} cosh^{L−1}(βd).
Consequently, due to formula (3.3.12) we have
F = −LT ln 2 + (L − 1)d − (L − 1) T ln cosh(d/T).
With the help of these functions it is easy to find channel capacity and average cost
meant for one symbol in the asymptotic limit L → ∞:
C₁ = lim_{L→∞} C/L = ln 2 + ln cosh(d/T) − (d/T) tanh(d/T),

R₁ = d − d tanh(d/T).
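These closed-form expressions can be checked directly: for small L the partition function (3.4.5) can be summed over all 2^L sequences and compared with 2^L e^{−βd(L−1)} cosh^{L−1}(βd). A small Python check (the values of d and T are illustrative, the names ours):

```python
import math
from itertools import product

def Z_brute(L, d, T):
    """Sum e^{-c/T} over all 2^L binary sequences, with cost (3.4.4):
    2d for every switch between adjacent symbols."""
    beta = 1.0 / T
    total = 0.0
    for seq in product((0, 1), repeat=L):
        switches = sum(seq[j] != seq[j + 1] for j in range(L - 1))
        total += math.exp(-beta * 2 * d * switches)
    return total

def Z_closed(L, d, T):
    """Closed form Z = 2^L e^{-beta d (L-1)} cosh^{L-1}(beta d)."""
    beta = 1.0 / T
    return 2.0 ** L * math.exp(-beta * d * (L - 1)) * math.cosh(beta * d) ** (L - 1)
```

The per-symbol capacity C₁ = ln 2 + ln cosh(d/T) − (d/T) tanh(d/T) then lies, as expected, between 0 (rigid chain, d ≫ T) and ln 2 (free choice, d ≪ T).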
Example 3.3. Let us now consider the combined case when there are costs of
both types: additive costs (3.4.3) of the same type as in Example 3.1 and paired
costs (3.4.4). The total cost function has the form
c(y^L) = bL + a ∑_{j=1}^{L} (−1)^{y_j} + 2d ∑_{j=1}^{L−1} [1 − δ_{y_j y_{j+1}}],
which is a bit more complicated than (3.4.6). However, now the matrix has a more complicated form:

V = e^{−βb−βd} ( e^{βa+βd}   e^{−βa−βd}
                 e^{βa−βd}   e^{−βa+βd} ). (3.4.7)
Having taken the largest root of this equation and taking into account (3.4.8), we
find the limit free energy computed for one symbol
F₁ = −T ln λ_m = b − T ln[ cosh(a/T) + √( sinh²(a/T) + e^{−4d/T} ) ].
From here we can derive free energy for one symbol and respective channel capacity
in the usual way.
As in Example 3.2, an optimal probability distribution corresponds to a Markov
process. A respective transition probability P(y j+1 | y j ) is connected with ma-
trix (3.4.7) and differs from this matrix by normalization factors. Next we present
the resulting expressions
The statistical systems considered in the last two examples have been studied
in statistical physics under the name of ‘Ising model’ (for instance, see Hill [21]
(originally published in English [20]) and Stratonovich [46]).
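The expression for λ_m can be checked against a direct numerical computation of the largest eigenvalue of the positive 2×2 matrix (3.4.7), e.g. by power iteration (a sketch with illustrative parameter values; the function names are ours):

```python
import math

def lam_formula(a, b, d, T):
    """lambda_m recovered from F1 = -T ln(lambda_m):
    lambda_m = e^{-b/T} [cosh(a/T) + sqrt(sinh^2(a/T) + e^{-4d/T})]."""
    return math.exp(-b / T) * (math.cosh(a / T)
                               + math.sqrt(math.sinh(a / T) ** 2 + math.exp(-4 * d / T)))

def lam_power(a, b, d, T, iters=300):
    """Largest eigenvalue of the transfer matrix (3.4.7) by power iteration."""
    bt = 1.0 / T
    pre = math.exp(-bt * (b + d))
    V = [[pre * math.exp(bt * (a + d)), pre * math.exp(-bt * (a + d))],
         [pre * math.exp(bt * (a - d)), pre * math.exp(bt * (d - a))]]
    v = [1.0, 1.0]
    lam = 1.0
    for _ in range(iters):
        w = [V[0][0] * v[0] + V[0][1] * v[1],
             V[1][0] * v[0] + V[1][1] * v[1]]
        lam = max(w)          # all entries are positive, so w stays positive
        v = [w[0] / lam, w[1] / lam]
    return lam
```

Since the matrix is strictly positive, the iteration converges to the Perron eigenvalue λ_m, and the two computations agree.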
Example 3.4. Now we apply the methods of the general theory described above to the case of different symbol lengths investigated in Section 3.1. We suppose that y is an ensemble of variables k, V_{i₁}, …, V_{i_k}, where k is the number of consecutive symbols in a recording and V_{i_j} is the length of the symbol at the j-th place. If l(i) is the length of symbol i, then we should take function (3.2.4) as the cost function.
According to the general method we calculate the partition function for the case
in question
Z = ∑_k ∑_{i₁,…,i_k} exp[−βl(i₁) − ⋯ − βl(i_k)] = ∑_{k=0}^{∞} Z₁^k(β) = 1/[1 − Z₁(β)],

where Z₁(β) = ∑_i e^{−βl(i)}. (3.4.9)
Hence

R = (d/dβ) ln(1 − Z₁) = −(dZ₁/dβ)(1 − Z₁)^{−1},

C = −ln(1 − Z₁) − β(dZ₁/dβ)(1 − Z₁)^{−1}    (F = T ln(1 − Z₁)). (3.4.10)
Let L be a fixed recording length. Then condition (3.2.8), (3.3.18) will take the form
of the equation
−(dZ₁(β)/dβ) (1 − Z₁(β))^{−1} = L, (3.4.11)
from which β can be determined.
Formulae (3.4.10), (3.3.11), (3.4.9) provide the solution to the problem of channel capacity C(L) computation. It is also of interest to find the channel capacity rate

C₁ = lim_{L→∞} C(L)/L. (3.4.12)
−(1/Z₁) dZ₁/dβ = ∑_i l(i) P(i) = l_av
analogously to (3.3.25). That is why equation (3.4.11) can be reduced to the form
Z1 (β ) / (1 − Z1 (β )) = L/lav .
Z1 (β ) → 1 as L/lav → ∞. (3.4.13)
(1 − Z1 ) ln (1 − Z1 ) → 0 as Z1 → 1.
−βF/R = [1/(dZ₁/dβ)] (1 − Z₁) ln(1 − Z₁) → 0  as  L → ∞, (3.4.14)
provided that dZ1 /d β (i.e. lav ) tends to the finite value dZ1 (β0 )/d β . According
to (3.4.12) and (3.4.14) we have
C₁ = lim_{L→∞} C/L = β₀ (3.4.15)
in this case. Due to (3.4.13) the limit value β0 is determined from the equation
Z1 (β0 ) = 1. (3.4.16)
In consequence of (3.4.9) this equation is nothing but equations (3.1.7), (3.1.9) de-
rived earlier. At the same time formula (3.4.15) coincides with relationship (3.1.12).
So, we see that the general standardized method yields the same results as the special method applied earlier.
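For concrete symbol lengths, the equation Z₁(β₀) = 1 is solved numerically in a few lines; e.g. for two symbols of lengths 1 and 2 one recovers the classical capacity ln[(1 + √5)/2] of Section 3.1. A hedged sketch (bisection; names ours):

```python
import math

def beta0(lengths, lo=1e-9, hi=50.0, iters=200):
    """Solve Z1(beta) = sum_i e^{-beta * l(i)} = 1, equation (3.4.16);
    the root beta0 is the capacity per unit length, cf. (3.4.15)."""
    def Z1(b):
        return sum(math.exp(-b * l) for l in lengths)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if Z1(mid) > 1.0:
            lo = mid   # Z1 is decreasing, so the root lies to the right
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For lengths {1, 1} this gives β₀ = ln 2, the capacity of a free binary alphabet, as a consistency check.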
The method of potentials described in Sections 3.3 and 3.4 can be generalized to more complicated cases involving a larger number of external parameters; in this form the method resembles even more closely the methods usually used in statistical thermodynamics.
3.5 Methods of potentials in the case of a large number of parameters 71
Here we outline possible ways to realize the above generalization and postpone a
more elaborated analysis to the next chapter.
Let the cost function c(y) = c(y, a) now depend on a numeric parameter a and be differentiable with respect to this parameter. Then the free energy

F(T, a) = −T ln ∑_y exp[−c(y, a)/T] (3.5.1)

and other functions introduced in Section 3.4 will depend not only on the temperature T (or the parameter β = 1/T), but also on the value of a.
the optimal distribution (3.3.14) now having the form
P(y | a) = exp{[F(T, a) − c(y, a)]/T}. (3.5.2)
Certainly, formula (3.3.15) and other results from Section 3.3 will remain valid
if we account parameter a to be constant when varying parameter T , i.e. if regular
derivatives are replaced with partial ones.
Hence, entropy of distribution (3.5.2) is equal to
H_y = −∂F(T, a)/∂T. (3.5.3)
Now in addition to these results we can derive a formula of partial derivatives taken
with respect to the new parameter a.
We differentiate (3.5.1) with respect to a and find
∂F(T, a)/∂a = [ ∑_y exp(−c(y, a)/T) ]^{−1} ∑_y [∂c(y, a)/∂a] exp(−c(y, a)/T),
or, equivalently,
∂F(T, a)/∂a = ∑_y [∂c(y, a)/∂a] P(y | a) ≡ −E[B(y) | a], (3.5.4)
A = −∂F/∂a. (3.5.5a)
Internal parameter A defined by such a formula is called conjugate to parameter a
with respect to potential F.
In formula (3.5.5) Hy and A = E[B] come as functions of T and a, respectively.
However, they can be interpreted as independent variables. Then instead of F(T, a)
we consider another potential Φ (Hy , A) expressed in terms of F(T, a) via the Leg-
endre transform:
Φ(H_y, A) = F + H_y T + A a = F − T ∂F/∂T − a ∂F/∂a. (3.5.6)
Example 3.5. The goal is to encode a recording by symbols ρ . Each symbol has cost
c0 (ρ ) as an attribute. Besides this cost, it is required to take account of one more
additional expenditure η (ρ ), say, the amount of paint used for a given symbol. If
we introduce the cost of paint a, then the total cost will take the form
c(ρ ) = c0 (ρ ) + aη (ρ ). (3.5.8)
Let it be required to find encoding such that the given amount of information I
(meant per one symbol) is recorded (transmitted), at the same time the fixed amount
of paint K is spent (on average per one symbol) and, besides, average costs E[c0 (ρ )]
are minimized.
In order to solve this problem we introduce T and a as auxiliary parameters,
which are initially indefinite and then found from additional conditions. Paint con-
sumption η (ρ ) per symbol ρ is considered as the random variable η . As the second
variable ζ we choose a variable complementing η to ρ, so that ρ = (η, ζ). Thus, the
cost function (3.5.8) can be rewritten as
c(η , ζ ) = c0 (η , ζ ) + aη .
Now we can apply formulae (3.5.1)–(3.5.7) and other ones, for which B = −η , to
the problem in question. According to (3.5.1) free energy F(T, a) is determined by
the formula
F(T, a) = −T ln ∑_{η,ζ} exp[−βc₀(η, ζ) − βaη],   β = 1/T. (3.5.9)
For the final determination of probabilities, entropy and other variables it is left
to concretize parameters T and a employing conditions formulated above. Namely,
average entropy Hηζ and average paint consumption E[η ] are assumed to be fixed:
Using formulae (3.5.3) and (3.5.4) we obtain the system of two equations
−∂F(T, a)/∂T = I;   ∂F(T, a)/∂a = K (3.5.12)
for finding parameters T = T (I, K) and a = a(I, K).
The optimal distribution (3.5.10) minimizes the total average cost R = E[c(η , ζ )]
as well as the partial average cost
After having determined optimal probabilities P(ρ ) completely we can perform ac-
tual encoding by methods of Section 3.6.
Basically, the given variational problem is solved in the same way as it was
done in Section 3.3. But note that partial derivatives are replaced with variational
derivatives in this modified approach. After variational differentiation with respect
to P(dξ) instead of (3.3.4) we will have the extremum condition:
ln[P(dξ)/ν(dξ)] = βF − βc(ξ), (3.6.3)
where β F = −1 − γ .
From here we obtain the extremum distribution P(dξ) = exp{β[F − c(ξ)]} ν(dξ).
Averaging (3.6.3) and taking into account (3.6.1), (3.6.2) we obtain that
Hξ = β E[c(ξ )] − β F, C = β R − β F. (3.6.3b)
The latter formula coincides with equality (3.3.13) of the discrete version. As in
Section 3.3 we can introduce the following partition function (integral)
Z = ∫ e^{−c(ξ)/T} ν(dξ), (3.6.4)
3.6 Capacity of a noiseless channel with penalties in a generalized version 75
P(dξ | a) = exp{[F(T, a) − c(ξ, a)]/T} ν(dξ). (3.6.6)
Finally, resulting equalities (3.5.3), (3.5.5) and other ones remain intact.
2. As an example we consider the case when X is an r-dimensional real space, ξ = (x₁, …, x_r), and the function c(ξ) is a quadratic form

c(ξ) = c₀ + (1/2) ∑_{i,j=1}^{r} (x_i − b_i) c_{ij} (x_j − b_j),
c0 + rT /2 = a. (3.6.7)
In this case the extremum distribution is Gaussian and its entropy C(a) can be also
found with the help of formulae from Section 5.4.
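Relationship (3.6.7) is the equipartition property familiar from statistical physics: each quadratic degree of freedom contributes T/2 to the average cost. A one-dimensional numerical check by direct quadrature (the constants are illustrative; not from the text):

```python
import math

def avg_cost_quadratic(c0, c11, T, xmax=30.0, n=60001):
    """E[c] under the extremum (Gaussian) distribution P ~ e^{-c(x)/T} for the
    quadratic cost c(x) = c0 + c11*x^2/2; by (3.6.7) with r = 1 it equals c0 + T/2."""
    h = 2.0 * xmax / (n - 1)
    num = den = 0.0
    for i in range(n):
        x = -xmax + i * h
        c = c0 + 0.5 * c11 * x * x
        w = math.exp(-(c - c0) / T)   # the constant c0 cancels in the ratio
        num += c * w
        den += w
    return num / den
```

The grid sum approximates the integrals in E[c] = ∫ c e^{−c/T} dx / ∫ e^{−c/T} dx; the result reproduces c₀ + T/2 to high accuracy.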
Chapter 4
First asymptotic theorem and related results
In the previous chapter, for one particular example (see Sections 3.1 and 3.4) we
showed that in calculating the maximum entropy (i.e. the capacity of a noiseless
channel) the constraint c(y) ≤ a imposed on feasible realizations is equivalent, for a sufficiently long code sequence, to the constraint E[c(y)] ≤ a on the mean value
E[c(y)]. In this chapter we prove (Section 4.3) that under certain assumptions such
equivalence takes place in the general case; this is the assertion of the first asymp-
totic theorem. In what follows, we shall also consider the other two asymptotic
theorems (Chapters 7 and 11), which are the most profound results of information
theory. All of them have the following feature in common: ultimately all these theorems state that, for sufficiently large systems, the difference between the concepts of discreteness and continuity disappears, and that the characteristics of a large collection of discrete objects can be calculated using a continuous functional dependence
involving averaged quantities. For the first variational problem, this feature is ex-
pressed by the fact that the discrete function H = ln M of a, which exists under the
constraint c(y) ≤ a, is asymptotically replaced by a continuous function H(a) cal-
culated by solving the first variational problem. As far as the proof is concerned, the
first asymptotic theorem turns out to be related to the theorem on canonical distri-
bution stability (Section 4.2), which is very important in statistical thermodynam-
ics and which is actually proved there when the canonical distribution is derived
from the microcanonical one. Here we consider it in a more general and abstract
form. The relationship between the first asymptotic theorem and the theorem on the
canonical distribution once more underlines the intrinsic unity of the mathematical
apparatus of information theory and statistical thermodynamics.
Potential Γ (α ) and its properties are used in the process of proving the indicated
theorems. The material about this potential is presented in Section 4.1. It is related
to the content of Section 3.3. However, instead of regular physical free energy F we
consider dimensionless free energy, that is potential Γ = −F/T . Instead of parame-
ters T , a2 , a3 , . . . common in thermodynamics we introduce symmetrically defined
parameters α1 = −1/T , α2 = a2 /T , α3 = a3 /T , . . .. Under such choice the temper-
ature is an ordinary thermodynamic parameter along with the others.
H (ζ , a) = H0 (ζ ) − a2 F 2 (ζ ) − · · · − ar F r (ζ ). (4.1.2)
B1 = H0 ; B2 = F 2 ; ... ; Br = F r (4.1.5)
Γ (α ) = ln Z (4.1.7)
where
Z = ∫ exp[B₁α₁ + ⋯ + B_r α_r] dζ, (4.1.8)
where

Bα = B₁α₁ + ⋯ + B_r α_r;   ∫_{ΔB} e^{−Φ(B)} dB = ∫_{B(ζ)∈ΔB} dζ.
In turn, we call the latter distribution canonical.
In the case of the canonical distribution (4.1.6) it is easy to express the characteristic function

Θ(iu) = ∫ e^{iuB} P(dB | α). (4.1.10)

The logarithm

μ(s) = ln Θ(s) = ln ∫ e^{sB(ζ)} P(dζ | α) (4.1.11a)
μ (s) = Γ (α + s) − Γ (α ), (4.1.12a)
k_{j₁,…,j_m} = ∂^m Γ(α) / ∂α_{j₁} ⋯ ∂α_{j_m}. (4.1.13)
Hence we see that the potential Γ (α ) is the cumulant generating function for the
whole family of distributions P(dB | α ).
For m = 1 we have from (4.1.13) that
k_j ≡ A_j ≡ E[B_j] = ∂Γ(α)/∂α_j. (4.1.14)
A_l = ∂Γ(α)/∂α_l,   l = 2, …, r,
are equivalent to the equalities
A_l = −∂F/∂a_l,   l = 2, …, r,
of type (3.5.5a).
With the given definition of parameters, the relationship defining entropy has a
peculiar form when energy (average cost) and temperature have the appearance of
regular parameters. Substituting (4.1.1) to the formula for physical entropy
H = −E[ ln( P(dζ)/dζ ) ].
Plugging (4.1.2) in here and taking account of notations (4.1.4), (4.1.5), we obtain
H = Γ − α E[B]
k_{ij} = ∂²Γ(α) / ∂α_i ∂α_j. (4.1.15)
We also take into account that the correlation matrix ki j is positive semi-definite.
Therefore, the matrix of the second derivatives ∂ 2Γ /∂ αi ∂ α j is positive semi-
definite as well, which proves convexity of the potential. This ends the proof.
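The cumulant property (4.1.13) and the convexity just proven are easy to observe numerically for a discrete toy distribution: the first two derivatives of Γ(α) reproduce the mean and the (non-negative) variance of B. A sketch for r = 1 (the toy values and names are ours):

```python
import math

def Gamma(alpha, B, phi):
    """Potential Gamma(alpha) = ln sum_xi exp[alpha*B(xi) - phi(xi)], cf. (4.2.13)."""
    return math.log(sum(math.exp(alpha * b - p) for b, p in zip(B, phi)))

def check_cumulants(alpha, B, phi, h=1e-4):
    """Compare dGamma/dalpha with E[B] (4.1.14) and d^2Gamma/dalpha^2
    with Var[B] >= 0 (4.1.15) under the canonical distribution."""
    Z = sum(math.exp(alpha * b - p) for b, p in zip(B, phi))
    pr = [math.exp(alpha * b - p) / Z for b, p in zip(B, phi)]
    mean = sum(q * b for q, b in zip(pr, B))
    var = sum(q * b * b for q, b in zip(pr, B)) - mean ** 2
    d1 = (Gamma(alpha + h, B, phi) - Gamma(alpha - h, B, phi)) / (2 * h)
    d2 = (Gamma(alpha + h, B, phi) - 2 * Gamma(alpha, B, phi)
          + Gamma(alpha - h, B, phi)) / h ** 2
    return abs(d1 - mean), abs(d2 - var), var
```

The second difference is a variance and hence non-negative, which is exactly the convexity of Γ(α) asserted by the theorem.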
Corollary 4.1. In the presence of just one parameter α1 , r = 1, function H(A1 )
defined by the Legendre transform
H(A₁) = Γ − α₁(A₁) A₁,   A₁ = (dΓ/dα₁)(α₁), (4.1.16)
is concave.
Indeed, as follows from formula (4.1.16), the formulae of the inverse Legendre transform

Γ(α₁) = H + α₁ A₁(α₁),   α₁ = −(dH/dA₁)(A₁),

are valid.
Since by virtue of Theorem 4.1 the inequality d²Γ/dα₁² ≥ 0 holds true (when the differentiability condition holds), we deduce that dA₁/dα₁ ≥ 0 and thereby

d²H/dA₁² = −dα₁/dA₁ ≤ 0. (4.1.16a)
This statement can also be proven without using the differentiability condition.
In conclusion of this paragraph we provide formulae pertaining to the characteristic potential of entropy, which was defined earlier by a formula of type (1.5.15); the present formulae generalize the previous ones. The characteristic potential of the random entropy

H(ζ | α) = −ln[ P(dζ | α) / ν(dζ) ]

has the form
To prove this, it is sufficient to substitute (4.1.17) into (4.1.18) and take into consideration the formula Γ(α) = ln ∫ e^{αB(ζ)} ν(dζ), which is of type (4.1.10a) and defines Γ(α).
Differentiating (4.1.19) by s0 and equating s0 to zero (analogously to (4.1.12) for
m = 1), we can find a mean value of entropy that coincides with (4.1.14b). Repeated
differentiation will yield the expression for variance.
where T = −1/α is temperature. The theorem about stability of this distribution (i.e.
about the fact that it is formed by a ‘microcanonical’ distribution for a cumulative
system including a thermostat) is called Gibbs theorem.
Adhering to a general and formal exposition style adopted in this chapter, we
formulate the addressed theorem in abstract form.
Preliminarily, we introduce several additional notions. We call the conditional distribution
Pn (ξ1 , . . . , ξn | α ) (4.2.2)
an n-th degree of the distribution
P1 (ξ1 | α ), (4.2.3)
if
Pn (ξ1 , . . . , ξn | α ) = P1 (ξ1 | α ) · · · P1 (ξn | α ). (4.2.4)
Let the distribution (4.2.3) be canonical:
Then in consequence of (4.2.4) we obtain that the following equality holds true for
the joint distribution (4.2.2):
where
B_n(ξ₁, …, ξ_n) = ∑_{k=1}^{n} B₁(ξ_k),   ϕ_n(ξ₁, …, ξ_n) = ∑_{k=1}^{n} ϕ₁(ξ_k),
i.e. the condition of canonicity (4.2.1) is actually satisfied for the ‘small’ sys-
tem (4.2.3), and
B₁(ξ₁) = (1/n) B_n(ξ₁, …, ξ₁);   ϕ₁(ξ₁) = (1/n) ϕ_n(ξ₁, …, ξ₁).
The derivation of the canonical ‘small’ distribution from the canonical ‘large’
distribution is natural, of course. The following fact proven below is deeper: the
canonical ‘small’ distribution is approximately formed from a non-canonical ‘large’
distribution. Therefore, the canonical form of a distribution appears to be stable in
a sense that it is asymptotically formed from different ‘large’ distributions. In fact,
this explains the important role of the canonical distribution in theory, particularly,
in statistical physics.
Theorem 4.3. Let the functions B1 (ξ ), ϕ1 (ξ ) be given, for which the corresponding
canonical distribution is of the form
where the functions Ψn (A) are determined from the normalization constraint, and A
plays the role of a parameter. Then the distribution of a ‘small’ system
P_n(ξ₁, …, ξ_n | A) = (1/2π) ∫_{−∞}^{∞} exp{ −Ψ_n(A) − iκnA + ∑_{k=1}^{n} [iκB₁(ξ_k) − ϕ₁(ξ_k)] } dκ,
where
Γ(α) = ln ∑_{ξ_k} exp[αB₁(ξ_k) − ϕ₁(ξ_k)]. (4.2.13)
We apply the method of the steepest descent to the integral in (4.2.12) using the fact
that n takes large values.
Further, we determine a saddle point iκ = α0 from the extremum condition for
the expression situated in the exponent of (4.2.12), i.e. from the equation
(n − 1) (dΓ/dα)(α₀) = nA − B₁(ξ₁). (4.2.14)
Since this point turns out to depend on ξ₁, it is convenient to consider also the point α₁, independent of ξ₁ and defined by the equation
(dΓ/dα)(α₁) = A. (4.2.15)
Point α1 is close to α0 for large n.
It follows from (4.2.14), (4.2.15) that

Γ′(α₀) − Γ′(α₁) = Γ″(α₁)ε + (1/2)Γ‴(α₁)ε² + ⋯ = (A − B₁)/(n − 1)   (ε = α₀ − α₁).

From here we have

ε = (A − B₁)/[(n − 1)Γ″] − (Γ‴/2Γ″)ε² − ⋯
  = (A − B₁)/[(n − 1)Γ″] − Γ‴(A − B₁)²/[2(n − 1)²(Γ″)³] + O(n⁻³). (4.2.16)
holds true and, consequently, the direction of the steepest descent of the function (n − 1)Γ(α) − nαA + αB₁ at the point α₀ (and also at the point α₁) is orthogonal to the real axis. Indeed, if the difference α − α₀ = iy is imaginary, then

(n − 1)Γ(α) − nαA + αB₁ = (n − 1)Γ(α₀) − nα₀A + α₀B₁ − (1/2)(n − 1)Γ″(α₀)y² + O(y³).
Drawing the contour of integration through point α0 in the direction of the steep-
est descent, we use the equality that follows from the formula
∫ exp[ −(a/2)x² + (b/6)x³ + (c/24)x⁴ + ⋯ ] dx ≈ (2π/a)^{1/2} exp[c/8a² + ⋯], (4.2.16b)

where a > 0, b/a^{3/2} ≪ 1, c/a² ≪ 1, …. This equality is
I ≡ (1/2π) ∫ exp{ (n − 1)Γ(iκ) + iκ[B₁(ξ₁) − nA] } dκ

  = [2π(n − 1)Γ″(α₀)]^{−1/2} exp{ (n − 1)Γ(α₀) + α₀[B₁(ξ₁) − nA] + Γ⁗(α₀)/[8(n − 1)Γ″(α₀)²] + O(n⁻²) }. (4.2.16c)
Introducing the notations

α(A) = α₁ + A/[nΓ″(α₁)] + Γ‴(α₁)/[2nΓ″(α₁)²];   χ(A) = −1/[2Γ″(α₁)],

we obtain (4.2.10). The proof is complete.
4.2 Asymptotic results of statistical thermodynamics. Stability of the canonical distribution 87
It is not necessary to account for the first equality in (4.2.17) because function
ψ (A) is unambiguously determined by functions α (A), χ (A) due to the normaliza-
tion constraint.
Since a number of terms in (4.2.17) disappears in the limit n → ∞, the limit
expression (4.2.10) has the form
ϑ+ (x) − ϑ+ (x − c) = ϑ− (x − c) − ϑ− (x).
Such a generalization will require quite insignificant changes in the proof of the theorem. Expansion (4.2.11) needs to be replaced by expansion (4.2.20), whereupon the extra term ln θ(iκ) will appear in the exponent in formula (4.2.12) and others. This will lead to a minor complication of the final formulae.
Results of Theorem 4.3 also allow a generalization in a different direction. It
is not necessary to require that the factor e^{−∑_{k=1}^{n} ϕ₁(ξ_k)} (independent of A and
μ_n(β) → ∞,   μ_n(β)/μ_n(β′) → μ₀(β)/μ₀(β′)   (β, β′ ∈ Q),
where θ (·) is a function independent of n with spectrum θ (iκ). Then the summation
of distribution P(ξ1 | An ) over ηn and ζn transforms (4.2.22) into an expression
of type (4.2.10). In this expression, functions ψ (An ), α (An ), χ (An ) are determined
from the corresponding formulae like (4.2.17); as n → ∞, function α (An ) turns into
the function inverse to the function
4.3 Asymptotic equivalence of two types of constraints 89
An = μn (α ), (4.2.23)
μ_n(β) = ln ∑_{ζ_n} e^{βζ_n} P_n(ζ_n),
Proof. The proof is analogous to the proof of Theorem 4.3. The only difference is
that now there is an additional term ln θ (iκ) and the expression (n − 1)Γ1 (iκ) must
be replaced with μn (iκ). Instead of formula (4.2.12) now we have
P(ξ₁ | A_n) = e^{−Ψ_n(A) − ϕ₁(ξ₁)} ∫ exp[ iκB₁(ξ₁) − iκA_n + μ_n(iκ) + ln θ(iκ) ] dκ
after summation over ηn and ζn . The saddle point iκ = α0 is determined from the
equation
μ_n′(α₀) − A_n + B₁(ξ) + θ′(α₀)/θ(α₀) = 0.
The root of the latter equation is asymptotically close to root α of equation (4.2.23).
Namely,
α₀ − α = (μ_n″)^{−1} [B₁ + θ′/θ] + ⋯ = O(1/μ_n″).
Other changes do not require explanation.
The theorems presented in this paragraph characterize the role of canonical dis-
tributions just as the Central Limit Theorem characterizes the role of Gaussian dis-
tributions. As a matter of fact, this also explains the fundamental role of canonical
distributions in statistical thermodynamics.
Consider asymptotic results related to the content of Chapter 3. We will show that
when computing maximum entropy (capacity of a noiseless channel), constraints
imposed on mean values and constraints imposed on exact values are asymptoti-
cally equivalent to each other. These results are closely related to the theorem about
stability of the canonical distribution proven in Section 4.2.
We set out the material in a more general form than in Sections 3.2 and 3.3 by
using an auxiliary measure ν (dx) similarly to Section 3.6.
Let the space X with measure ν (dx) (not normalized to unity) be given. Entropy
will be defined by the formula
H = −∫ ln[P(dx)/ν(dx)] P(dx), (4.3.1)

B(x) ≤ A, (4.3.2)
where the symbol E means averaging with respect to the measure P. Averaging (4.3.2), (4.3.3) with respect to a measure P from G̃, we obtain (4.3.6); thereby the set G̃ of distributions defined by constraints (4.3.5) is embedded in the set G defined by constraint (4.3.6). Hence, H̃ ≤ H, if we introduce the entropy

H = sup_{P∈G} [ −∫ P(dx) ln(P(dx)/ν(dx)) ], (4.3.7)
H = ln ν(X). (4.3.10)
The ‘large system’ appears to be an n-th degree of the ‘small system’. The aforementioned formulae (4.3.1)–(4.3.11) can be applied to both systems. The entropies H̃_n, H_n can be defined both for the ‘small’ and for the ‘large’ system. For the ‘small’ system the values H̃₁, H₁ are essentially different, but for the ‘large’ system the values H̃_n, H_n are, according to the foregoing, relatively close to each other in the asymptotic sense:

H̃_n / H_n → 1,   n → ∞. (4.3.12)
The extremum distribution (the one yielding H̃_n) has the following form:

P(dx₁, …, dx_n | A_n) = N^{−1} ϑ( ∑_k B₁(x_k) − A_n ) ν₁(dx₁) ⋯ ν₁(dx_n), (4.3.14)

where N is a normalization constant quite simply associated with the entropy H̃:

H̃ = ln N = ln ∫ ϑ( ∑_k B₁(x_k) − A_n ) ν₁(dx₁) ⋯ ν₁(dx_n). (4.3.15)
Formula (4.3.14) is an evident analogy of both (4.2.19) and (4.2.22). It means that
the problem of entropy (4.3.15) calculation is related to the problem of calculation
of the partial distribution (4.2.9) considered in the previous paragraph. As there, the
conditions of the exact multiplicativity
are not necessary for the proof of the main result (convergence (4.3.12)). Next we
formulate the result directly in a general form by employing the notion (introduced
in Section 4.2) of a canonically stable sequence of random variables (ξn being un-
derstood as the set x1 , . . . , xn ).
(the measure ν_n(Σ_n) of the entire space of values of ξ_n is assumed to be finite). Then the entropy H̃_n can be computed by the asymptotic formula
where
Computation of the integral can be carried out with the help of the saddle-point
method (the method of the steepest descent), i.e. using formula (4.2.16b) with vary-
ing degrees of accuracy. The saddle point iκ = α0 is determined from the equation
the theorem is valid. Undoubtedly, Theorem 4.5 (and possibly Theorem 4.4 as well) can be extended to a more general case. This is indirectly evidenced by the fact that for the example covered in Sections 3.1 and 3.4 the condition of canonical stability is not satisfied but, as we have seen, the passage to the limit (4.3.12) occurs as L → ∞.
Along with other asymptotic theorems, which will be considered later on (Chap-
ters 7 and 11), Theorems 4.3–4.5 constitute the principal content of information
theory perceived as ‘thermodynamic’ in a broad sense, i.e. asymptotic theory. Many
important notions and relationships of this theory take on their special significance
during the passage to the limit pertaining to an expansion of systems in considera-
tion. Those notions and relationships appear to be asymptotic in the indicated sense.
Theorem 4.6. Suppose that the domain Q of the parameter implicated in (4.4.1)
contains the interval s1 < α < s2 (s1 < 0, s2 > 0), and the potential μ (α ) is differ-
entiable on this interval. Then the cumulative distribution function
where

μ′(s) = x, (4.4.5)

if the latter equation has a root s ∈ (s₁, 0].
and besides

∑_{B(ξ)<x} P(ξ | α) ≤ 1. (4.4.8)
holds true for any α ∈ (s1 , 0], including the case α = s, where s is a solution of
equation (4.4.5) if it exists. Apparently, (4.4.9) turns into (4.4.4) for this solution.
This ends the proof.
Since the value μ′(0) is nothing else but the mean value

A = ∑_ξ B(ξ) P(ξ),
Theorem 4.7. If the conditions of Theorem 4.6 are satisfied, with the only difference that equation (4.4.7) has a positive root s ∈ [0, s₂], then instead of (4.4.4) the
following inequality is valid:
1 − F(x) ≤ e^{−sμ′(s)+μ(s)}. (4.4.10)
The proof of this theorem is analogous to that of the previous one, and we shall
skip it here.
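For a concrete distribution, the content of Theorems 4.6 and 4.7 is the familiar exponential (Chernoff-type) bound on distribution tails. A sketch for B equal to a sum of n independent Bernoulli variables, so that μ(s) = n ln(pe^s + 1 − p) (the numbers are illustrative; names ours):

```python
import math
from math import comb

def tail_exact(n, p, k):
    """P(B >= k) for B ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(k, n + 1))

def tail_bound(n, p, x):
    """Bound (4.4.10): 1 - F(x) <= e^{-[s mu'(s) - mu(s)]}, where s > 0 solves
    mu'(s) = x for mu(s) = n ln(p e^s + 1 - p)."""
    x1 = x / n
    s = math.log(x1 * (1 - p) / (p * (1 - x1)))   # closed-form root of mu'(s) = x
    mu = n * math.log(p * math.exp(s) + 1 - p)
    return math.exp(-(s * x - mu))
```

The exact tail never exceeds the bound, and both decay exponentially in n at the rate γ(x)/n.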
When comparing (4.4.5) with (4.1.14) [see also (3.5.5a)] it is easy to see that x plays the role of a parameter conjugate to α with respect to the potential μ(α). The expression

γ(x) = s(x)x − μ(s(x)) = sμ′(s) − μ(s), (4.4.11)

considered as a function of x, is actually the Legendre transform of the potential μ(α).
With the help of the potential (4.4.11), formulae (4.4.4) and (4.4.10) are reduced to
Then there is the same expression e−γ permanently located in the indicated right-
hand sides of the formulae, where
γ(x₁, …, x_r) = ∑_i x_i s_i − μ(s),   x_i = ∂μ/∂s_i,   i = 1, …, r, (4.4.14)
[in particular, (4.4.1)] and not with p(ξ ). Then instead of (4.4.13), (4.4.14) we will
have
where
γ(x) = ∑_i α_i x_i − Γ(α),   ∂Γ(α)/∂α_j = x_j, (4.4.16)
is the Legendre transform of potential Γ (α ). In order to make sure that these formu-
lae are valid, we need to take into account (4.1.12a) and formulae (4.4.13), (4.4.14)
from the previous case.
3. Now we derive formulae that hold with an equality sign as opposed to (4.4.3),
(4.4.10), (4.4.13), (4.4.15) but are asymptotic, i.e. valid for large n and Γ .
4.4 Some theorems about the characteristic potential 97
Theorem 4.8. Suppose that the random variable B is a sum of identically distributed
independent random variables B1 (ξ ), . . . , Bn (ξ ) and the corresponding to it char-
acteristic potential μ (α ) = nμ1 (α ) is defined and differentiable (a sufficient number
of times) on the closed interval s₁ ≤ α ≤ s₂ (s₁ < 0, s₂ > 0). Then for values of x₁,
for which the equation
μ′(s) = nx₁ ≡ x (4.4.17)
has a root s ∈ (s1 , s2 ) and the inequality
holds true, the cumulative distribution function (4.4.3) of random variable B can be
found from the asymptotic formula:
for x < E[B], F(x)
= [2π μ (s)s2 ]−1/2 e−γ (x) [1 + O(n−1 )],
for x > E[B], 1 − F(x)
Proof. We choose values x < E[B] and x′ > E[B] such that the corresponding roots s and s′ of equation (4.4.17) lie in the interval (s₁, s₂). Then we apply to them the famous inversion formula (the Lévy formula):
F(x′) − F(x) = lim_{c→∞} (1/2π) ∫_{−c}^{c} [(e^{−itx} − e^{−itx′})/(it)] e^{μ(it)} dt. (4.4.20)
Here eμ (it) = enμ1 (it) is a characteristic function. We represent the limit in the right-
hand side of (4.4.20) as a limit of a difference of two integrals
(1/2πi) ∫_L e^{−zx+μ(z)−ln z} dz − (1/2πi) ∫_{L′} e^{−zx′+μ(z)−ln z} dz = I − I′. (4.4.21)
Expanding the exponent about the saddle point z₀, which is determined by the condition

μ′(z₀) − x − 1/z₀ = 0, (4.4.22)

and setting z = z₀ + iy, we obtain

−zx + μ(z) − ln z = −z₀x + μ(z₀) − ln z₀ − (1/2)μ″(z₀)y² − y²/(2z₀²) + (1/6)μ‴(z₀)(iy)³ − (iy)³/(3z₀³) + O(y⁴)
due to (4.2.16b) (note that the largest term of the residue O(n⁻¹) is given by the fourth derivative Γ⁗ and takes the form (1/8n) μ₁⁗ (μ₁″)⁻²). Comparing (4.4.22) with the equation μ₁′(s) − x₁ = 0 or, equivalently, with (4.4.17), we have
μ₁″(s)(z₀ − s) + (1/2)μ₁‴(s)(z₀ − s)² + ⋯ = 1/(nz₀);   z₀ − s = 1/[nz₀ μ₁″(s)] + O(n⁻²). (4.4.25)
In order to derive the desired relationship (4.4.19) it is sufficient to apply the formula
resulted from (4.4.24) and (4.4.25)
I = (1/s) e^{−sx+μ(s)} {2πnμ₁″(s)}^{−1/2} [1 + O(n⁻¹)] (4.4.26)
by virtue of (4.4.20) (it is taken into account here that $s < 0$; the roots $[\ldots]^{-1/2}$ are positive). The last equality determines $F(x)$ and $F(x')$ up to some additive constant $K(n)$.
In order to estimate the constant $K(n)$ we consider a point $s_*$ belonging to the interval $(s_1, s)$, a point $s^*$ from the interval $(s', s_2)$ and the values $x_*$, $x^*$ corresponding to them. We take into account the inequality $F(x_*) \leqslant 1 - F(x^*) + F(x_*)$ and substitute (4.4.28) into its right-hand side. Hence,
\[
F(x_*) = O\bigl(e^{-n\gamma_1(x_{*1})}\bigr) + O\bigl(e^{-n\gamma_1(x_1^*)}\bigr) \quad (4.4.30)
\]
\[
\bigl(\gamma_1(x_1) = \gamma(x)/n = x_1 s - \mu_1(s); \qquad x_{*1} = x_*/n,\; x_1^* = x^*/n\bigr).
\]
If $x_{*1}, s_*$ and $x_1^*, s^*$ do not depend on $n$ and are chosen in such a way that inequalities (4.4.31) hold, then the terms $O(e^{-n\gamma_1(x_{*1})})$, $O(e^{-n\gamma_1(x_1^*)})$ in (4.4.30) converge to zero as $n \to \infty$ faster than $e^{-n\gamma_1(x_1)}O(n^{-1})$, and we obtain the first equality of (4.4.19) from (4.4.30). The second equality follows from (4.4.29) when the inequalities (4.4.32) are satisfied.

Since $\gamma(x)$ is monotonic within the segments $\mu'(s_1) \leqslant x \leqslant \mu'(0)$ and $\mu'(0) \leqslant x \leqslant \mu'(s_2)$, points $x_{*1}$ and $x_1^*$ (for which inequalities (4.4.31), (4.4.32) would be valid) can always be found.
As can be seen from the proof provided, the requirement (one of the conditions of Theorem 4.8) that the random variable $B$ be equal to a sum of identically distributed independent random variables turns out not to be necessary. A mutual (proportional) growth of the resulting potential $\mu(s)$ is sufficient for terms similar to the term $\mu^{(4)}/(\mu'')^2$ in the right-hand side of (4.4.24) to be small. That is why formula (4.4.19) remains valid in a more general case if we replace $O(n^{-1})$ with $O(\mu^{-1})$ and understand this estimate only in the specified sense.
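The asymptotic formula (4.4.19) is easy to probe numerically. The sketch below is not from the text: it takes $B$ to be a sum of $n$ independent $N(m, \sigma^2)$ variables (an assumed example, chosen because the exact tail is then available through the error function), solves (4.4.17) for the root $s$, and evaluates the right-hand side of (4.4.19) for $x > \mathbb{E}[B]$:

```python
import math

def tail_asymptotic(n, m, sigma, x1):
    """Right-hand side of (4.4.19) for a sum of n iid N(m, sigma^2) variables:
    1 - F(x) ~ [2 pi mu''(s) s^2]^(-1/2) exp(-gamma(x)) for x = n*x1 > E[B]."""
    s = (x1 - m) / sigma ** 2                     # root of mu'(s) = n*x1, eq. (4.4.17)
    mu2 = n * sigma ** 2                          # mu''(s) for the Gaussian potential
    gamma = n * (x1 - m) ** 2 / (2 * sigma ** 2)  # Legendre transform s*x - mu(s)
    return math.exp(-gamma) / math.sqrt(2 * math.pi * mu2 * s ** 2)

def tail_exact(n, m, sigma, x1):
    """Exact P(B > n*x1) for Gaussian B."""
    z = math.sqrt(n) * (x1 - m) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2))

approx = tail_asymptotic(400, 0.0, 1.0, 0.2)
exact = tail_exact(400, 0.0, 1.0, 0.2)
```

For fixed $x_1$ the relative error decays roughly like $1/n$, in line with the factor $1 + O(n^{-1})$.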
If we apply formula (4.4.19) to a distribution $p(B \mid \alpha)$ dependent on a parameter instead of the distribution $p(B)$, then in consequence of (4.1.12a) we will have $\mu(s) = \Gamma(\alpha+s) - \Gamma(\alpha)$; $\mu''(s) = \Gamma''(\alpha+s)$; $s\mu'(s) - \mu(s) = s\Gamma'(\alpha+s) - \Gamma(\alpha+s) + \Gamma(\alpha) = \gamma(x) - \alpha x + \Gamma(\alpha)$, and thereby formula (4.4.19) becomes
\[
\left.\begin{array}{ll}
F(x), & x < \Gamma'(\alpha)\\
1 - F(x), & x > \Gamma'(\alpha)
\end{array}\right\}
= [2\pi\,\Gamma''(\alpha+s)s^2]^{-1/2}\, e^{-\Gamma(\alpha)+\alpha x-\gamma(x)}\, [1 + O(\Gamma^{-1})]. \quad (4.4.33)
\]
Finally, the last formula can be generalized to the case of a parametric distribution [correspondingly, (4.4.33) can be generalized to the multivariate case].

The aforementioned results attest to the important role of potentials and their images under the Legendre transform. The method of proof employed unites Theorem 4.8 with Theorems 4.3–4.5.
Chapter 5
Computation of entropy for special cases.
Entropy of stochastic processes
In the present chapter, we set out methods for computing the entropy of many random variables or of a stochastic process in discrete and continuous time.

From both fundamental and practical points of view, of particular interest are stationary stochastic processes and their information-theoretic characteristics, specifically their entropy. Such processes are relatively simple objects, particularly a discrete process, i.e. a stationary process with discrete states running in discrete time. This process is therefore a very good example for demonstrating the basic points of the theory, and so we shall start with its presentation.
Our main attention will be devoted to the definition of such an important characteristic of a stationary process as the entropy rate, that is, the entropy per unit of time or per step. In addition, we introduce the entropy $\Gamma$ of the ends of an interval. Together with the entropy rate $H_1$, this entropy defines the entropy of a long interval of length $T$ by the approximate formula
\[
H_T \approx H_1 T + 2\Gamma,
\]
which becomes more precise as $T$ grows. Both constants $H_1$ and $\Gamma$ are calculated for a discrete Markov process.
The generalized definition of entropy, given in Section 1.6, allows for the application of this notion to continuous random variables, as well as to the case when the set of these random variables is a continuum, i.e. to a stochastic process with a continuous parameter (time).

In what follows, we show that many results related to a discrete process can be extended both to the case of a continuous sample space and to continuous time. For instance, we can introduce the entropy rate (not per step but per unit of time) and the entropy of the ends of an interval for continuous-time stationary processes. The entropy of a stochastic process on an interval is represented approximately in the form of two terms, by analogy with the aforementioned formula.
For non-stationary continuous-time processes, instead of constant entropy rate,
one should consider entropy density, which, generally speaking, is not constant in
time.
Entropy and its density are calculated for various important cases of continuous-
time processes: Gaussian processes and diffusion Markov processes.
The entropy computations for stochastic processes carried out here will allow us to calculate the Shannon amount of information (this will be covered in Chapter 6) for stochastic processes.
5.1 Entropy of a segment of a stationary discrete process and entropy rate

If we introduce the conditional entropy $H_{\xi_k|\xi_{k-1}\xi_{k-2}}$, then applying Theorem 1.6a for $\xi = \xi_k$, $\eta = \xi_{k-1}$, $\zeta = \xi_{k-2}$ will yield the inequality
\[
H_{\xi_k} \geqslant H_{\xi_k|\xi_{k-1}} \geqslant H_{\xi_k|\xi_{k-1},\xi_{k-2}} \geqslant \cdots \geqslant H_{\xi_k|\xi_{k-1},\ldots,\xi_{k-l}} \geqslant \cdots \geqslant 0. \quad (5.1.2)
\]
Besides, all conditional entropies are non-negative, i.e. bounded below. Hence, there exists the non-negative limit
Theorem 5.1. If $\{\xi_k\}$ is a stationary discrete process such that $H_{\xi_k} < \infty$, then the limit
\[
\lim_{l\to\infty} H_{\xi_1,\ldots,\xi_l}/l
\]
exists and coincides with limit (5.1.3).

Proof. We use the decomposition
\[
H_{\xi_{m+n}\ldots\xi_1} = H_{\xi_m\ldots\xi_1} + H_{\xi_{m+1}|\xi_m\ldots\xi_1} + H_{\xi_{m+2}|\xi_{m+1}\ldots\xi_1} + \cdots + H_{\xi_{m+n}|\xi_{m+n-1}\ldots\xi_1}. \quad (5.1.4)
\]
Since
\[
H_{\xi_{m+l}|\xi_{m+l-1}\ldots\xi_1} = H_1 + o_{m+l}(1) = H_1 + o_m(1) \quad (l \geqslant 1)
\]
due to (5.1.3) (here $o_j(1) \to 0$ as $j \to \infty$), it follows from (5.1.4) (after dividing by $m+n$) that
\[
\frac{H_{\xi_{m+n}\ldots\xi_1}}{m+n} = \frac{H_{\xi_m\ldots\xi_1}}{m+n} + \frac{n}{m+n}\,H_1 + \frac{n}{m+n}\,o_m(1). \quad (5.1.5)
\]
Let $m$ and $n$ converge to infinity in such a way that $n/m \to \infty$. Then $n/(m+n)$ converges to 1, while the ratio $H_{\xi_m\ldots\xi_1}/(m+n)$, which can be estimated as
\[
\frac{1}{m+n}\,H_{\xi_m\ldots\xi_1} \leqslant \frac{m}{m+n}\,H_{\xi_1},
\]
clearly converges to 0.

Therefore, we obtain the statement of the theorem from equality (5.1.5). The proof is complete.
It is also easy to prove that, as $l$ grows, the ratio $H_{\xi_1\ldots\xi_l}/l$ changes monotonically, i.e. does not increase. To show this, we construct the difference
\[
\delta = \frac{1}{l}H_{\xi_1\ldots\xi_l} - \frac{1}{l+1}H_{\xi_1\ldots\xi_{l+1}} = \frac{1}{l}H_{\xi_1\ldots\xi_l} - \frac{1}{l+1}\bigl[H_{\xi_1\ldots\xi_l} + H_{\xi_{l+1}|\xi_1\ldots\xi_l}\bigr] = \frac{1}{l(l+1)}H_{\xi_1\ldots\xi_l} - \frac{1}{l+1}H_{\xi_{l+1}|\xi_1\ldots\xi_l}
\]
and reduce it to
\[
\delta = \frac{1}{l(l+1)}\bigl[H_{\xi_1\ldots\xi_l} - l\,H_{\xi_{l+1}|\xi_1\ldots\xi_l}\bigr] = \frac{1}{l(l+1)}\sum_{i=1}^{l}\bigl[H_{\xi_i|\xi_1\ldots\xi_{i-1}} - H_{\xi_i|\xi_{i-l}\ldots\xi_{i-1}}\bigr]. \quad (5.1.6)
\]
In consequence of inequalities (5.1.2), the summands on the right-hand side of (5.1.6) are non-negative. Thus, the non-negativity of the difference $\delta$ follows from here.
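Both facts, the convergence of $H_{\xi_1\ldots\xi_l}/l$ and its monotone decrease, are easy to verify numerically. The sketch below uses an arbitrarily chosen two-state stationary Markov chain (an assumption for illustration, not an example from the text) and computes the joint entropies by brute-force enumeration of all trajectories:

```python
import math
from itertools import product

mu_, nu_ = 0.2, 0.4                              # assumed transition parameters
pi = [[1 - mu_, mu_], [nu_, 1 - nu_]]
p_st = [nu_ / (mu_ + nu_), mu_ / (mu_ + nu_)]    # stationary distribution

def joint_entropy(l):
    """H_{xi_1...xi_l} = -sum P ln P over all 2^l trajectories of the chain."""
    h = 0.0
    for seq in product(range(2), repeat=l):
        p = p_st[seq[0]]
        for a, b in zip(seq, seq[1:]):
            p *= pi[a][b]
        h -= p * math.log(p)
    return h

# entropy rate of the chain (conditional entropy of one step)
H1 = -sum(p_st[a] * pi[a][b] * math.log(pi[a][b])
          for a in range(2) for b in range(2))
rates = [joint_entropy(l) / l for l in range(1, 11)]
# rates decreases monotonically and approaches H1 from above
```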
By virtue of Theorem 5.1 the following equality holds:
\[
H_{\xi_1\ldots\xi_m} + H_{\xi_1\ldots\xi_n} - H_{\xi_1\ldots\xi_{m+n}} = H_{\xi_1\ldots\xi_m} + H_{\xi_{m+1}\ldots\xi_{m+n}} - H_{\xi_1\ldots\xi_{m+n}}. \quad (5.1.8)
\]
We can switch the order of limits here due to the mentioned symmetry with respect to a transposition of $m$ and $n$. By virtue of the indicated monotonicity this limit (either finite or infinite) always exists. Passing from form (5.1.8) to form (5.1.9) and using the hierarchical relationship of type (1.3.4)
\[
H_{\xi_{m+1}\ldots\xi_{m+n}|\xi_1,\ldots,\xi_m} = \sum_{i=1}^{n} H_{\xi_{m+i}|\xi_1,\ldots,\xi_{m+i-1}},
\]
we perform the passage to the limit $m \to \infty$ and rewrite equality (5.1.10) as follows:
\[
H_{\xi_1\ldots\xi_m} + H_{\xi_1\ldots\xi_n} - H_{\xi_1\ldots\xi_{m+n}} = 2\Gamma + o_m(1) + o_n(1) + o_{m+n}(1).
\]
This complies with (5.1.10) and confirms the above statement about the increase of entropy by $2\Gamma$.
1. Let the discrete (not necessarily stationary) process $\{\xi_k\}$ be Markov. This means that the joint distribution laws of consecutive random variables can be decomposed into the product
\[
P(\xi_k, \xi_{k+1}, \ldots, \xi_{k+l}) = P(\xi_k)\,\pi_k(\xi_k, \xi_{k+1}) \cdots \pi_{k+l-1}(\xi_{k+l-1}, \xi_{k+l}), \quad (5.2.1)
\]
where the transition probabilities $\pi_j$ satisfy the normalization constraint
\[
\sum_{\xi'} \pi_j(\xi, \xi') = 1. \quad (5.2.2)
\]
Similarly, the decomposition
\[
H_{\xi_1\ldots\xi_n} = H_{\xi_1} + H_{\xi_2|\xi_1} + H_{\xi_3|\xi_2} + \cdots + H_{\xi_n|\xi_{n-1}} \quad (5.2.6)
\]
is satisfied.
The latter can easily be derived if we rewrite the joint distribution $P(\xi_k, \xi_{k+1}) = P_{st}(\xi_k)\,\pi(\xi_k, \xi_{k+1})$ according to (5.2.1) and sum it over $\xi_k = \xi$, which yields $P(\xi_{k+1})$.

Taking into account (5.2.3), it is easy to see that the entropy rate (5.1.3) for a stationary Markov process coincides with the entropy corresponding to the transition probabilities $\pi(\xi, \xi')$ with the stationary probability distribution:
\[
H_1 = -\sum_{\xi} P_{st}(\xi) \sum_{\xi'} \pi(\xi, \xi') \ln \pi(\xi, \xi'). \quad (5.2.8)
\]
holds true exactly. We can also derive the result (5.2.10) with the help of (5.1.12). In-
deed, because Hξ j+1 |ξ1 ...ξ j = H1 , as was noted earlier, there is only one non-zero term
left in the right-hand side of (5.1.13): 2Γ = Hξ1 − H1 , which coincides with (5.2.10).
2. So, given the transition probability matrix
\[
\pi = \|\pi(\xi, \xi')\|, \quad (5.2.12)
\]
in order to calculate the entropy one should find the stationary distribution and then apply formulae (5.2.8), (5.2.9). Equation (5.2.7) defines the stationary distribution $P_{st}(\xi)$ quite unambiguously if the Markov process is ergodic, i.e. if $\lambda = 1$ is a non-degenerate eigenvalue of matrix (5.2.12). According to the theorem about the decomposition of determinants, the equation $\det(\pi - 1) = 0$ entails (5.2.7) if we assume
\[
P_{st}(\xi) = A_{\xi\xi} \Big/ \sum_{\xi'} A_{\xi'\xi'}, \quad (5.2.13)
\]
where $A_{\xi\xi'}$ is the algebraic cofactor of the element
\[
a_{\xi\xi'} = \pi(\xi, \xi') - \delta_{\xi\xi'} \equiv \pi - 1.
\]
In the non-ergodic case the matrix can be written in block form, where
\[
\Pi^{00} = \begin{pmatrix} \pi^{00} & 0 & \cdots \\ 0 & 0 & \cdots \\ \vdots & \vdots & \ddots \end{pmatrix}, \qquad
\Pi^{01} = \begin{pmatrix} 0 & \pi^{01} & 0 & \cdots \\ 0 & 0 & 0 & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{pmatrix},
\]
and so on.
Here $\pi^{ij}$ denotes a matrix of dimension $r_i \times r_j$ that describes the transitions from subset $E_i$ containing $r_i$ states to subset $E_j$ containing $r_j$ states. The remaining cells of the matrix contain zeros. The sets $E_1, E_2, \ldots$ constitute ergodic classes. Transitions occur from the set $E_0$ into the ergodic classes $E_1, E_2, \ldots$, but there is no exit from any of these classes. Hence, its own stationary distribution is established within each class. This distribution can be found by a formula of type (5.2.13),
\[
P_{st}^{i}(\xi) = A^{ii}_{\xi\xi} \Big/ \sum_{\xi' \in E_i} A^{ii}_{\xi'\xi'}, \quad \xi \in E_i, \quad (5.2.15)
\]
with the only difference that now we consider the algebraic cofactors of the submatrix $\|\pi^{ll}_{\xi\xi'} - \delta_{\xi\xi'}\|$, $\xi, \xi' \in E_l$, which are not zeros. The probabilities $P_{st}^{l}(\xi)$ refer to $E_l$. The full stationary distribution appears to be a linear combination of these distributions [formula (5.2.16)]; they are mutually orthogonal:
\[
\sum_{\xi} P_{st}^{i}(\xi)\, P_{st}^{j}(\xi) = 0, \quad i \neq j.
\]
The weights $q_i$ of the classes are
\[
q_i = \sum_{\xi \in E_i} P_1(\xi) + \sum_{\xi, \xi'} P_1(\xi)\bigl[\Pi^{0i}_{\xi\xi'} + (\Pi^{00}\Pi^{0i})_{\xi\xi'} + (\Pi^{00}\Pi^{00}\Pi^{0i})_{\xi\xi'} + \cdots\bigr], \quad i \neq 0,
\]
i.e.
\[
q_i = \sum_{\xi \in E_i} P_1(\xi) + \sum_{\xi, \xi'} P_1(\xi)\bigl([1 - \Pi^{00}]^{-1}\Pi^{0i}\bigr)_{\xi\xi'}, \qquad q_0 = 0.
\]
The entropy rate
\[
H_1 = \sum_i q_i H_1^i \quad (5.2.17)
\]
and the analogous quantities are computed quite similarly to the ergodic case. The reason is that a non-ergodic process is a statistical mixture (with probabilities $q_i$) of ergodic processes having a smaller number $r_i$ of states.
Summation in (5.2.17), (5.2.18) is carried out only over the ergodic classes $E_1, E_2, \ldots$, i.e. the subspace $E_0$ has zero stationary probability. The union of all ergodic classes $E_1 + E_2 + \cdots = E_a$ (on which the stationary probability is concentrated) can be called the 'active' subspace. Distributions and transitions in the 'passive' space $E_0$ influence the entropy rate $H_1$ only to the extent that they influence the probabilities $q_i$. If the process is ergodic, i.e. there is just one ergodic class $E_1$ besides $E_0$, then $q_1 = 1$ and the passive space has no impact on the entropy rate. In this case the entropy rate of the Markov process on the space $E_0 + E_a$ coincides with the entropy rate of the Markov process taking place in the subspace $E_a = E_1$ and having the transition probability matrix $\pi^{11}$.
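The mixture formula $H_1 = \sum_i q_i H_1^i$ can be checked on a small invented example (the chain below is an assumption for illustration, not from the text): state 0 is transient, states 1 and 2 form an ergodic class $E_1$, and state 3 is an absorbing class $E_2$ with zero entropy rate.

```python
import math
from itertools import product

pi = [[0.2, 0.5, 0.0, 0.3],
      [0.0, 0.7, 0.3, 0.0],
      [0.0, 0.4, 0.6, 0.0],
      [0.0, 0.0, 0.0, 1.0]]
P1 = [1.0, 0.0, 0.0, 0.0]          # initial distribution concentrated on E0

# absorption probabilities q_i via the geometric series [1 - Pi^00]^{-1} Pi^{0i}
q1 = 0.5 / (1 - 0.2)
q2 = 0.3 / (1 - 0.2)

# entropy rate of class E1 by (5.2.8): stationary distribution of its 2x2 block
p_st = [0.4 / 0.7, 0.3 / 0.7]
H1_class1 = -sum(p_st[a] * pi[a + 1][b + 1] * math.log(pi[a + 1][b + 1])
                 for a in range(2) for b in range(2))
H1 = q1 * H1_class1 + q2 * 0.0     # mixture formula (5.2.17)

def joint_entropy(l):
    h = 0.0
    for seq in product(range(4), repeat=l):
        p = P1[seq[0]]
        for a, b in zip(seq, seq[1:]):
            p *= pi[a][b]
        if p > 0.0:
            h -= p * math.log(p)
    return h

# H_{xi_9|xi_1..xi_8} should already be close to the limiting rate H1
increment = joint_entropy(9) - joint_entropy(8)
```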
3. In order to illustrate the application of the formulae derived above, we consider in this paragraph several simple examples.

Example 5.1. At first we consider the simplest discrete Markov process: the process with two states, i.e. matrix (5.2.12) is a $2 \times 2$ matrix. In consequence of the normalization constraint (5.2.2) its elements are not independent. There are just two independent parameters $\mu$ and $\nu$ that define the matrix $\pi$:
\[
\pi = \begin{pmatrix} 1-\mu & \mu \\ \nu & 1-\nu \end{pmatrix}.
\]
The stationary distribution is $P_{st} = \bigl(\nu/(\mu+\nu),\; \mu/(\mu+\nu)\bigr)$ (5.2.19), so that formula (5.2.8) yields the entropy rate
\[
H_1 = \frac{\nu h_2(\mu) + \mu h_2(\nu)}{\mu + \nu},
\]
where
\[
h_2(x) = -x \ln x - (1-x) \ln (1-x). \quad (5.2.20a)
\]
Next, one easily obtains the boundary entropy by formula (5.2.10) as follows:
\[
2\Gamma = h_2\Bigl(\frac{\mu}{\mu+\nu}\Bigr) - \frac{\nu h_2(\mu) + \mu h_2(\nu)}{\mu + \nu}.
\]
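For a stationary Markov chain the approximate relation $H_T \approx H_1 T + 2\Gamma$ from the chapter introduction in fact holds exactly. A quick check of the two-state case with arbitrarily chosen $\mu$, $\nu$ (a sketch, not values from the text):

```python
import math
from itertools import product

def h2(x):
    """Binary entropy (5.2.20a), in nats."""
    return -x * math.log(x) - (1 - x) * math.log(1 - x)

mu_, nu_ = 0.15, 0.35
pi = [[1 - mu_, mu_], [nu_, 1 - nu_]]
p_st = [nu_ / (mu_ + nu_), mu_ / (mu_ + nu_)]    # stationary distribution (5.2.19)

H1 = (nu_ * h2(mu_) + mu_ * h2(nu_)) / (mu_ + nu_)   # entropy rate
two_gamma = h2(mu_ / (mu_ + nu_)) - H1               # boundary entropy 2*Gamma

def joint_entropy(l):
    """Brute-force H_{xi_1...xi_l} of the stationary chain."""
    h = 0.0
    for seq in product(range(2), repeat=l):
        p = p_st[seq[0]]
        for a, b in zip(seq, seq[1:]):
            p *= pi[a][b]
        h -= p * math.log(p)
    return h
```

Indeed, `joint_entropy(l)` equals `l * H1 + two_gamma` for every `l`, since each conditional entropy after the first step already equals $H_1$.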
Example 5.2. Now suppose there is given a process with three states that has the transition probability matrix
\[
\pi = \begin{pmatrix} 1-\mu & \mu' & \mu'' \\ \nu' & 1-\nu & \nu'' \\ \lambda' & \lambda'' & 1-\lambda \end{pmatrix} \quad (\mu = \mu' + \mu''\ \text{etc.}),
\]
where the algebraic cofactors
\[
A_{11} = \begin{vmatrix} -\nu & \nu'' \\ \lambda'' & -\lambda \end{vmatrix} = \lambda\nu - \lambda''\nu'' = \lambda'\nu' + \lambda'\nu'' + \lambda''\nu'; \quad
A_{21} = \mu'\lambda' + \mu'\lambda'' + \mu''\lambda''; \quad
A_{31} = \mu'\nu'' + \mu''\nu' + \mu''\nu'' \quad (5.2.21)
\]
determine the stationary distribution by formula (5.2.13), so that (5.2.8) yields
\[
H_1 = \frac{A_{11}\, h_3(\mu', \mu'') + A_{21}\, h_3(\nu', \nu'') + A_{31}\, h_3(\lambda', \lambda'')}{A_{11} + A_{21} + A_{31}},
\]
where
\[
h_3(\mu', \mu'') = -\mu' \ln \mu' - \mu'' \ln \mu'' - (1 - \mu' - \mu'') \ln (1 - \mu' - \mu''). \quad (5.2.22)
\]
The given process with three states appears to be non-ergodic if, for instance, $\lambda' = \lambda'' = 0$, $\mu'' = \nu'' = 0$, so that the transition probability matrix has a 'block' type
\[
\pi = \begin{pmatrix} 1-\mu & \mu & 0 \\ \nu & 1-\nu & 0 \\ 0 & 0 & 1 \end{pmatrix}.
\]
For such a matrix, the third state remains constant, and transitions occur only between the first and second states. The algebraic cofactors (5.2.21) vanish. As is easy to see, the following distributions are stationary:
\[
P_{st}^{1}(\xi) = \begin{cases} \nu/(\mu+\nu), & \xi = 1;\\ \mu/(\mu+\nu), & \xi = 2;\\ 0, & \xi = 3; \end{cases} \qquad P_{st}^{2}(\xi) = \delta_{\xi 3}.
\]
The first of them coincides with (5.2.19). The functions $P_{st}^{1}(\xi)$ and $P_{st}^{2}(\xi)$ are orthogonal. Using the given initial distribution $P_1(\xi)$, we find the resultant stationary distribution by formula (5.2.16); then, due to (5.2.17), it is easy to write down the corresponding entropy rate.
5.3 Entropy rate of components of a discrete and conditional Markov process

of the Markov process by the conditional entropy $H_{x_1\ldots x_n|y_1\ldots y_n}$, called the entropy of the conditional Markov process $\{x_k\}$ (for fixed $\{y_k\}$).
Along with the entropy rate $h_{xy} = h_\xi$ of a stationary Markov process, we introduce the entropy rate of the initial y-process,
\[
h_y = \lim_{n\to\infty} \frac{1}{n} H_{y_1\ldots y_n}. \quad (5.3.2)
\]
One may also consider (in the general case) the x-process as a non-Markov a priori process, and the y-process with fixed x as a conditional process. Their entropy rates will be, respectively,
\[
h_x = \lim_{n\to\infty} \frac{1}{n} H_{x_1\ldots x_n}; \qquad h_{y|x} = \lim_{n\to\infty} \frac{1}{n} H_{y_1\ldots y_n|x_1\ldots x_n}. \quad (5.3.5)
\]
Since we already know how to find the entropy of a Markov process, it suffices to learn how to calculate just one of the quantities $h_y$ or $h_{x|y}$; the second one is then found with the help of (5.3.4). Due to symmetry, the corresponding quantity out of $h_x$, $h_{y|x}$ can be computed in the same way, and the second one from (5.3.7).

The method described below (clause 2) can be employed for calculating the entropy of a conditional Markov process in both the stationary and non-stationary cases. As a rule, however, stationary probability distributions and the limits (5.3.2), (5.3.3), (5.3.5) exist only in the stationary case.
The conditional entropy $H_{x_1\ldots x_n|y_1\ldots y_n}$ can be represented in the form of the sum
\[
H_{x_1\ldots x_n|y_1\ldots y_n} = H_{x_1|y_1\ldots y_n} + H_{x_2|x_1 y_1\ldots y_n} + H_{x_3|x_1 x_2 y_1\ldots y_n} + \cdots + H_{x_n|x_1\ldots x_{n-1} y_1\ldots y_n},
\]
which leads to the limit
\[
h_{x|y} = \lim_{k\to\infty,\; n-k\to\infty} H_{x_k|x_1\ldots x_{k-1} y_1\ldots y_n} = H_{x_k|\ldots\xi_{k-2}\xi_{k-1} y_k y_{k+1}\ldots}. \quad (5.3.8)
\]

Theorem 5.2. For a stationary process $\{\xi_k\} = \{x_k, y_k\}$ the limits (5.3.3) and (5.3.8) are equal.

Proof. By virtue of
\[
\lim_{n-k\to\infty} H_{x_k|x_1\ldots x_{k-1} y_1\ldots y_n} = H_{x_k|\xi_1\ldots\xi_{k-1} y_k y_{k+1}\ldots},
\]
\[
\lim_{k\to\infty} H_{x_k|\xi_1\ldots\xi_{k-1} y_k y_{k+1}\ldots} = H_{x_k|\ldots\xi_{k-2}\xi_{k-1} y_k y_{k+1}\ldots},
\]
we have
\[
H_{x_k|x_1\ldots x_{k-1} y_1\ldots y_n} = H_{x_k|\ldots\xi_{k-2}\xi_{k-1} y_k y_{k+1}\ldots} + o_k(1) + o_{n-k}(1). \quad (5.3.9)
\]
Using the decomposition
\[
H_{x_1\ldots x_n|y_1\ldots y_n} = H_{x_1\ldots x_m|y_1\ldots y_n} + H_{x_{m+1}|\xi_1\ldots\xi_m y_{m+1}\ldots y_n} + \cdots + H_{x_{m+r}|\xi_1\ldots\xi_{m+r-1} y_{m+r}\ldots y_n} + H_{x_{m+r+1}\ldots x_n|\xi_1\ldots\xi_{m+r} y_{m+r+1}\ldots y_n},
\]
we obtain that
\[
\frac{1}{n} H_{x_1\ldots x_n|y_1\ldots y_n} = \frac{1}{m+r+s} H_{x_1\ldots x_m|y_1\ldots y_n} + \frac{r}{m+r+s} H_{x_k|\ldots\xi_{k-1} y_k\ldots} + \frac{r}{m+r+s}\bigl[o_m(1) + o_s(1)\bigr] + \frac{1}{m+r+s} H_{x_{m+r+1}\ldots x_n|\xi_1\ldots\xi_{m+r} y_{m+r+1}\ldots y_n}. \quad (5.3.10)
\]
Here we take into account that
\[
H_{x_1\ldots x_m|y_1\ldots y_n} = \sum_{k=1}^{m} H_{x_k|x_1\ldots x_{k-1} y_1\ldots y_n} \leqslant m H_{x_k}
\]
and
\[
H_{x_{m+r+1}\ldots x_n|\xi_1\ldots\xi_{m+r} y_{m+r+1}\ldots y_n} \leqslant s H_{x_k},
\]
because a conditional entropy is less than or equal to an unconditional one. Thus, if we make the passage to the limit $m \to \infty$, $r \to \infty$, $s \to \infty$ in (5.3.10) in such a way that $r/m \to \infty$ and $r/s \to \infty$, then only the term $H_{x_k|\ldots,\xi_{k-1},y_k\ldots}$ will be left in that expression. This proves the theorem.
As can be seen from the above proof, Theorem 5.2 is valid not only in the case of a Markov joint process $\{x_k, y_k\}$. Furthermore, in consequence of the Markov condition we have
\[
H_{x_k|x_1\ldots x_{k-1} y_1\ldots y_n} = H_{x_k|\xi_{k-1} y_k\ldots y_n}
\]
in the case of a Markov process. Consequently, formula (5.3.8) takes a simpler form.
2. Let us now calculate the entropies of the y-process and of the conditional process induced by a Markov joint process. In order to do this, we consider the conditional entropy
\[
H_{y_k|y_1\ldots y_{k-1}} = -\mathbb{E}\bigl[\ln P(y_k \mid y_1 \ldots y_{k-1})\bigr]. \quad (5.3.11)
\]
We use the Markov condition (5.2.1) and write down the multivariate probability distribution, denoting $\sum_{x'} \pi(x, y; x', y') = \pi(x, y; y')$. Then formula (5.3.11) takes the form
\[
H_{y_k|y_1\ldots y_{k-1}} = -\mathbb{E}\Bigl[\ln \sum_x W_{k-1}(x)\, \pi(x, y_{k-1}; y_k)\Bigr].
\]
These are the main results. We see that in order to calculate the conditional entropy of some components of a Markov process we need to investigate the posterior probabilities $\{W_{k-1}(\cdot)\}$ as a stochastic process in its own right. This process is studied in the theory of conditional Markov processes. It is well known that, as $k$ increases, the corresponding probabilities are transformed by certain recurrence relations.
In order to introduce them, let us write down the equality analogous to (5.3.14), replacing $k-1$ with $k$. Substituting (5.3.14) into the last formula, we obtain the recurrence relations
\[
W_k(x_k) = \frac{\sum_{x_{k-1}} W_{k-1}(x_{k-1})\, \pi(x_{k-1}, y_{k-1}; x_k, y_k)}{\sum_{x_{k-1}, x_k} W_{k-1}(x_{k-1})\, \pi(x_{k-1}, y_{k-1}; x_k, y_k)}. \quad (5.3.20)
\]
The process $\{W_{k-1}(\cdot)\}$ (considered as a stochastic process by itself) is called a secondary a posteriori W-process. As is known from the theory of conditional Markov processes, this process is Markov. Let us consider its transition probabilities.

Transformation (5.3.20) can be represented as
\[
W_k(\xi) = \frac{\delta_{y y_k} \sum_{\xi'} W_{k-1}(\xi')\, \pi(\xi', \xi)}{\sum_{\xi'} W_{k-1}(\xi')\, \pi(\xi', y_k)}, \quad (5.3.21)
\]
so that the transition probability of the W-process equals
\[
\pi^W(W_{k-1}, W_k) = \sum_{y_k} \delta\biggl(W_k(\xi) - \frac{\delta_{y y_k} \sum_{\xi'} W_{k-1}(\xi')\, \pi(\xi', \xi)}{\sum_{\xi'} W_{k-1}(\xi')\, \pi(\xi', y_k)}\biggr) \sum_{\xi'} W_{k-1}(\xi')\, \pi(\xi', y_k). \quad (5.3.22)
\]
Theorem 5.3. The entropy $H_{y_2\ldots y_l|y_1}$ of the non-Markov y-process coincides with the analogous entropy of the corresponding secondary a posteriori (Markov) process.

Since the corresponding relations are valid together with formula (5.3.12) and the analogous formula for $\{W_k\}$, it is sufficient to prove the equality of the conditional entropies. Here $H_{W_k|W_1\ldots W_{k-1}} = H_{W_k|W_{k-1}}$ due to the Markov property.

Let $S_0$ be some initial point of the sample space of $W(\cdot)$. According to (5.3.21), transitions from this point to other points $S(y_k)$ may occur depending on the value of $y_k$ ($S(y_k)$ being the point with coordinates (5.3.21)). These points are different for different values of $y_k$. Indeed, suppose that two of the points $S(y_k)$, say $S' = S(y')$ and $S'' = S(y'')$, coincide.
Example 5.3. Suppose that a non-Markov process {yk } is a process with two states,
i.e. yk can take one out of two values, say 1 or 2. Further, suppose that this process
can be turned into a Markov process by adding variable x and thus decomposing
state y = 2 into two states: ξ = 2 and ξ = 3. Namely, ξ = 2 corresponds to x = 1,
y = 2; ξ = 3 corresponds to x = 2, y = 2. State y = 1 can be related to ξ = 1, x = 1,
for instance.
The joint process $\{\xi_k\} = \{x_k, y_k\}$ is stationary Markov and is described by the transition probability matrix
\[
\|\pi_{\xi\xi'}\| = \begin{pmatrix} \pi_{11} & \pi_{12} & \pi_{13} \\ \pi_{21} & \pi_{22} & \pi_{23} \\ \pi_{31} & \pi_{32} & \pi_{33} \end{pmatrix}.
\]
For $y_k = 2$ the recurrence (5.3.21) gives
\[
W_k(1) = 0; \qquad
W_k(2) = \frac{W_{k-1}(1)\pi_{12} + W_{k-1}(2)\pi_{22} + W_{k-1}(3)\pi_{32}}{W_{k-1}(1)(\pi_{12}+\pi_{13}) + W_{k-1}(2)(\pi_{22}+\pi_{23}) + W_{k-1}(3)(\pi_{32}+\pi_{33})}; \qquad
W_k(3) = 1 - W_k(2). \quad (5.3.28)
\]
We denote by $S_0$ the point $(1, 0, 0)$ of the sample space $(W(1), W(2), W(3))$ corresponding to distribution (5.3.27). Further, we investigate possible transitions from that point. Substituting the value $W_{k-1} = (1, 0, 0)$ into (5.3.28), we obtain the transition to the point
\[
W(1) = 0; \qquad W(2) = \frac{\pi_{12}}{\pi_{12} + \pi_{13}}; \qquad W(3) = \frac{\pi_{13}}{\pi_{12} + \pi_{13}}, \quad (5.3.29)
\]
which we denote by $S_1$. In consequence of (5.3.18) such a transition occurs with probability
\[
p_1 = \pi_{12} + \pi_{13}. \quad (5.3.30)
\]
The process stays at point $S_0$ with probability $1 - p_1 = \pi_{11}$.
Now we consider transitions from point $S_1$. Substituting $W_{k-1}$ by the values (5.3.29) in formula (5.3.28) with $y_k = 2$, we obtain the coordinates
\[
W(1) = 0; \qquad W(2) = \frac{\pi_{12}\pi_{22} + \pi_{13}\pi_{32}}{\pi_{12}(\pi_{22}+\pi_{23}) + \pi_{13}(\pi_{32}+\pi_{33})}; \qquad W(3) = 1 - W(2) \quad (5.3.31)
\]
of the next point $S_2$; the probability of this transition is obtained from (5.3.17a). The return to point $S_0$ occurs with probability $1 - p_2$. Similarly, substituting the values (5.3.31) into (5.3.33), we obtain the probability
\[
p_3 = \frac{(\pi_{12}\pi_{22} + \pi_{13}\pi_{32})(\pi_{22}+\pi_{23}) + (\pi_{12}\pi_{23} + \pi_{13}\pi_{33})(\pi_{32}+\pi_{33})}{\pi_{12}(\pi_{22}+\pi_{23}) + \pi_{13}(\pi_{32}+\pi_{33})} \quad (5.3.34)
\]
of the transition from $S_2$ to the next point $S_3$, and so forth. The transition probabilities $p_k$ to the successive points are calculated consecutively as described. Each time, a return to point $S_0$ occurs with probability $1 - p_k$. The probability that there has been no return to point $S_0$ up to time $k$ is evidently equal to $p_1 p_2 \cdots p_k$. If the consecutive values $p_k$ do not converge to 1, then the indicated probability converges to zero as $k \to \infty$. Therefore, a return to point $S_0$ eventually occurs with probability 1. If we had chosen a different point as the initial one, then a walk over some other sequence of points would have been observed, but eventually the process would have returned to point $S_0$. After such a transition, the aforementioned walk over the points $S_0, S_1, S_2, \ldots$ (with the transition probabilities already calculated) will be observed.
The indicated scheme of transitions of the secondary Markov process allows us to find the stationary probability distribution easily. It is concentrated on the points $S_0, S_1, S_2, \ldots$ For these points the transition probability matrix has the form
\[
\pi^w = \begin{pmatrix}
1-p_1 & p_1 & 0 & 0 & \cdots \\
1-p_2 & 0 & p_2 & 0 & \cdots \\
1-p_3 & 0 & 0 & p_3 & \cdots \\
\vdots & \vdots & \vdots & \vdots & \ddots
\end{pmatrix}, \quad (5.3.35)
\]
and the stationary probabilities are
\[
P_{st}(S_0) = \frac{1}{1 + p_1 + p_1 p_2 + \cdots}; \qquad P_{st}(S_k) = \frac{p_1 \cdots p_k}{1 + p_1 + p_1 p_2 + \cdots}, \quad k = 1, 2, \ldots \quad (5.3.36)
\]
The entropy rate then results from (5.3.23), or from Theorem 5.3 and formula (5.2.8) applied to the transition probabilities (5.3.35).
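The walk over $S_0, S_1, S_2, \ldots$ is easy to carry out exactly. The sketch below (the $3\times 3$ matrix is an invented example in the spirit of Example 5.3, with $y = 1$ for $\xi = 1$ and $y = 2$ for $\xi \in \{2, 3\}$) computes the probabilities $p_k$, the stationary weights (5.3.36) and the resulting entropy rate of the y-process, and then verifies it against a brute-force computation over y-sequences:

```python
import math

pi = [[0.5, 0.3, 0.2],
      [0.2, 0.5, 0.3],
      [0.3, 0.1, 0.6]]     # assumed transition matrix of the joint xi-process

def h2(x):
    return -x * math.log(x) - (1 - x) * math.log(1 - x)

# Walk of the secondary a posteriori W-process: from S_k the observation y = 2
# leads to S_{k+1} with probability p_{k+1}, while y = 1 returns it to S0.
W, p = [1.0, 0.0, 0.0], []
for _ in range(60):
    stay = sum(W[i] * (pi[i][1] + pi[i][2]) for i in range(3))
    p.append(stay)
    W = [0.0,
         sum(W[i] * pi[i][1] for i in range(3)) / stay,
         sum(W[i] * pi[i][2] for i in range(3)) / stay]

weights = [1.0]                      # stationary weights (5.3.36): 1, p1, p1*p2, ...
for pk in p[:-1]:
    weights.append(weights[-1] * pk)
hy = sum(w * h2(pk) for w, pk in zip(weights, p)) / sum(weights)

# Independent check: conditional entropy of the stationary y-process by the
# forward algorithm, summed over all 2^l binary y-sequences.
p_st = [1 / 3] * 3
for _ in range(300):                 # power iteration for the stationary law of pi
    p_st = [sum(p_st[i] * pi[i][j] for i in range(3)) for j in range(3)]

def joint_y_entropy(l):
    h = 0.0
    for code in range(2 ** l):
        ys = [(code >> t) & 1 for t in range(l)]          # 0 means y=1, 1 means y=2
        alpha = [p_st[x] if (x > 0) == (ys[0] == 1) else 0.0 for x in range(3)]
        for yk in ys[1:]:
            alpha = [sum(alpha[i] * pi[i][j] for i in range(3))
                     if (j > 0) == (yk == 1) else 0.0 for j in range(3)]
        prob = sum(alpha)
        if prob > 0.0:
            h -= prob * math.log(prob)
    return h

increment = joint_y_entropy(13) - joint_y_entropy(12)     # ~ H(y_13 | y_1..y_12)
```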
In the particular case when the value of x does not affect transitions from one
value of y to another one we have
That is why formula (5.3.40) coincides with (5.2.20), and this is natural, because,
when condition (5.3.38) is satisfied, process {yk } becomes Markov itself.
Ri j = E [(ξi − mi ) (ξ j − m j )] . (5.4.1)
As is known, such random variables have the joint probability density function
\[
p(\xi_1, \ldots, \xi_l) = (2\pi)^{-l/2}\, \bigl(\det \|R_{ij}\|\bigr)^{-1/2} \exp\Bigl\{-\frac{1}{2}\sum_{i,j=1}^{l} (\xi_i - m_i)\, a_{ij}\, (\xi_j - m_j)\Bigr\}, \quad (5.4.2)
\]
where $\|a_{ij}\| = \|R_{ij}\|^{-1}$. If the measure $\nu$ is multiplicative with the constant density $v(d\xi_1, \ldots, d\xi_l)/d\xi_1 \cdots d\xi_l = v_1^l$, then the random entropy equals
\[
H(\xi_1, \ldots, \xi_l) = \frac{l}{2}\ln 2\pi + l \ln v_1 + \frac{1}{2}\ln\det\|R_{ij}\| + \frac{1}{2}\sum_{i,j=1}^{l} (\xi_i - m_i)\, a_{ij}\, (\xi_j - m_j). \quad (5.4.5)
\]
To calculate the entropy (1.6.13), one only needs to average the latter expression. Taking into account (5.4.1), we obtain
\[
H_{\xi_1,\ldots,\xi_l} = \frac{l}{2}\ln 2\pi + l\ln v_1 + \frac{1}{2}\ln\det\|R_{ij}\| + \frac{1}{2}\sum_{i,j=1}^{l} a_{ij} R_{ij}
= \frac{l}{2}\ln 2\pi + l\ln v_1 + \frac{1}{2}\ln\det\|R_{ij}\| + \frac{l}{2}. \quad (5.4.6)
\]
For the choice (5.4.4), the result has the simple form
\[
H_{\xi_1,\ldots,\xi_l} = \frac{1}{2}\ln\det\|R_{ij}\|. \quad (5.4.6a)
\]
The matrix $R = \|R_{ij}\|$ is symmetric. Therefore, as is known, there exists a unitary transformation $U$ that diagonalizes this matrix:
\[
\sum_{i,j} U^{*}_{ir}\, R_{ij}\, U_{js} = \lambda_r\, \delta_{rs}.
\]
Here $\lambda_r$ are the eigenvalues of the correlation matrix; they satisfy the equation $\det\|R_{ij} - \lambda\delta_{ij}\| = 0$. With the help of these eigenvalues we can rewrite the entropy $H_{\xi_1\ldots\xi_l}$ as follows:
\[
H_{\xi_1,\ldots,\xi_l} = \frac{1}{2}\sum_{r=1}^{l} \ln\lambda_r. \quad (5.4.8)
\]
Each of the two entropies on the right-hand side of this equation can be determined by formula (5.4.8). This leads to the relationship
\[
H_{\xi_k|\xi_1,\ldots,\xi_{k-1}} = \frac{1}{2}\operatorname{tr}\ln r^{(k)} - \frac{1}{2}\operatorname{tr}\ln r^{(k-1)}, \quad (5.4.9)
\]
where
\[
r^{(k)} = \begin{pmatrix} R_{11} & \ldots & R_{1k} \\ \vdots & \ddots & \vdots \\ R_{k1} & \ldots & R_{kk} \end{pmatrix}, \qquad
r^{(k-1)} = \begin{pmatrix} R_{11} & \ldots & R_{1,k-1} \\ \vdots & \ddots & \vdots \\ R_{k-1,1} & \ldots & R_{k-1,k-1} \end{pmatrix}.
\]
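Since $\operatorname{tr}\ln r^{(k)} = \ln\det r^{(k)}$, relation (5.4.9) needs nothing beyond determinants. In the sketch below the correlation function $R_{ij} = \rho^{|i-j|}$ is an assumed example (not from the text); for it $\det r^{(k)} = (1-\rho^2)^{k-1}$, so the conditional entropy is the constant $\frac12\ln(1-\rho^2)$ for every $k \geqslant 2$:

```python
import math

def det(mat):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in mat]
    n, d = len(a), 1.0
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(a[r][i]))
        if p != i:
            a[i], a[p] = a[p], a[i]
            d = -d
        d *= a[i][i]
        for r in range(i + 1, n):
            f = a[r][i] / a[i][i]
            for c in range(i, n):
                a[r][c] -= f * a[i][c]
    return d

rho = 0.6

def r_matrix(k):
    # correlation matrix R_ij = rho^{|i-j|} of an assumed stationary sequence
    return [[rho ** abs(i - j) for j in range(k)] for i in range(k)]

def cond_entropy(k):
    """H_{xi_k | xi_1..xi_{k-1}} = 1/2 ln(det r^(k) / det r^(k-1)), eq. (5.4.9),
    with the entropies taken in the form (5.4.6a)."""
    return 0.5 * math.log(det(r_matrix(k)) / det(r_matrix(k - 1)))
```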
Here we assume the multiplicativity condition (1.7.9). In this case, the random entropy (1.6.14) turns out to be equal to
\[
H(\xi_1, \ldots, \xi_l) = \ln N + \frac{1}{2}\ln\det\|R_{ij}\| - \frac{1}{2}\sum_k \ln\tilde\lambda_k - \frac{1}{2}\sum_k \frac{(\xi_k - \tilde m_k)^2}{\tilde\lambda_k} + \frac{1}{2}\sum_{i,j} (\xi_i - m_i)\, a_{ij}\, (\xi_j - m_j), \quad (5.4.10)
\]
where $N = \prod_k N_k$.
Now averaging this entropy entails the relationship
\[
H_{\xi_1,\ldots,\xi_l} = \ln N + \frac{1}{2}\ln\det\|R_{ij}\| - \frac{1}{2}\sum_k \ln\tilde\lambda_k - \frac{1}{2}\sum_k \frac{R_{kk} + (m_k - \tilde m_k)^2}{\tilde\lambda_k} + \frac{l}{2}. \quad (5.4.11)
\]
Introducing the matrix $\tilde R = \|\tilde\lambda_k \delta_{kr}\|$ and employing a matrix form, we can reduce equality (5.4.11) to
\[
H_{\xi_1,\ldots,\xi_l} = \ln N + \frac{1}{2}\operatorname{tr}\ln(\tilde R^{-1}R) - \frac{1}{2}\operatorname{tr}\bigl[\tilde R^{-1}(R - \tilde R)\bigr] - \frac{1}{2}(m - \tilde m)^{T} \tilde R^{-1} (m - \tilde m). \quad (5.4.12)
\]
Here we have taken into account that $\ln\det R - \ln\det\tilde R = \ln\det(\tilde R^{-1}R)$ and $\operatorname{tr}\tilde R^{-1}\tilde R = \operatorname{tr}\mathbf{1} = l$; $m - \tilde m$ is a column matrix, and $T$ denotes transposition. Comparing (5.4.12) with (1.6.16), it is easy to see that we have thereby found the entropy $H^{P/Q}_{\xi_1\ldots\xi_l}$ of distribution $P$ with respect to distribution $Q$. That entropy turns out to be equal to
\[
H^{P/Q}_{\xi_1,\ldots,\xi_l} = \operatorname{tr} G(\tilde R^{-1}R) + \frac{1}{2}(m - \tilde m)^{T} \tilde R^{-1} (m - \tilde m), \quad (5.4.13)
\]
where $G(x) = (x - 1 - \ln x)/2$.
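Formula (5.4.13) is the relative entropy (Kullback–Leibler divergence) between two Gaussian distributions written through the $G$-function. The sketch below (an invented two-dimensional example, not from the text) evaluates (5.4.13) via the eigenvalues of $\tilde R^{-1}R$ and checks it against a Monte Carlo average of $\ln(p/q)$ over samples from $P$:

```python
import math, random

# Invented two-dimensional example: P = N(m, R), Q = N(mt, Rt)
m, R = [0.5, -0.3], [[1.0, 0.4], [0.4, 0.8]]
mt, Rt = [0.0, 0.0], [[1.5, 0.0], [0.0, 0.5]]

def inv2(a):
    """Inverse and determinant of a 2x2 matrix."""
    d = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / d, -a[0][1] / d], [-a[1][0] / d, a[0][0] / d]], d

def G(x):
    return (x - 1 - math.log(x)) / 2

Rt_inv, det_Rt = inv2(Rt)
R_inv, det_R = inv2(R)
M = [[sum(Rt_inv[i][k] * R[k][j] for k in range(2)) for j in range(2)]
     for i in range(2)]
t, d = M[0][0] + M[1][1], M[0][0] * M[1][1] - M[0][1] * M[1][0]
lam = [(t + math.sqrt(t * t - 4 * d)) / 2, (t - math.sqrt(t * t - 4 * d)) / 2]
dm = [m[0] - mt[0], m[1] - mt[1]]
quad = sum(dm[i] * Rt_inv[i][j] * dm[j] for i in range(2) for j in range(2))
H_pq = sum(G(x) for x in lam) + quad / 2            # formula (5.4.13)

def logpdf(x, mean, cov_inv, det_cov):
    dx = [x[0] - mean[0], x[1] - mean[1]]
    q = sum(dx[i] * cov_inv[i][j] * dx[j] for i in range(2) for j in range(2))
    return -math.log(2 * math.pi) - 0.5 * math.log(det_cov) - 0.5 * q

# Monte Carlo average of ln(p/q) over samples from P (via Cholesky of R)
c11 = math.sqrt(R[0][0]); c21 = R[1][0] / c11
c22 = math.sqrt(R[1][1] - c21 * c21)
random.seed(1)
acc, n = 0.0, 100000
for _ in range(n):
    g1, g2 = random.gauss(0, 1), random.gauss(0, 1)
    x = [m[0] + c11 * g1, m[1] + c21 * g1 + c22 * g2]
    acc += logpdf(x, m, R_inv, det_R) - logpdf(x, mt, Rt_inv, det_Rt)
mc = acc / n
```

Here $\operatorname{tr} G$ is evaluated on the two eigenvalues of $\tilde R^{-1}R$; the Monte Carlo estimate agrees with (5.4.13) to within sampling error.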
The latter formula has been derived under the assumption of multiplicativity (1.7.9) of the measure $\nu(d\xi_1, \ldots, d\xi_l)$ and, consequently, multiplicativity of $Q(d\xi_1, \ldots, d\xi_l)$. However, it can easily be extended to a more general case. Let the measure $Q$ be Gaussian and be defined by the vector $\tilde m_k = \mathbb{E}_Q[\xi_k]$ and the correlation matrix $\tilde R_{kr}$, which is not necessarily diagonal. The entropy $H^{P/Q}$ is invariant with respect to orthogonal transformations (and, more generally, with respect to non-singular linear transformations) of the $l$-dimensional real space $\mathbb{R}^l$. By performing a rotation, we can achieve a diagonalization of the matrix $\tilde R$, and afterwards we can apply formula (5.4.13). However, formula (5.4.13) is itself invariant with respect to linear non-singular transformations, and thereby it is valid not only in the case of a diagonal matrix $\tilde R$, but also in the case of a non-diagonal matrix $\tilde R$. Indeed, for the linear transformation $\xi'_k = \sum_r C_{kr}\xi_r$ (i.e. $\xi' = C\xi$) the following transformations take place: $m' - \tilde m' = C(m - \tilde m)$, $R' = CRC^{T}$, $\tilde R' = C\tilde R C^{T}$. Consequently, $(\tilde R')^{-1} = (C^{T})^{-1}\tilde R^{-1}C^{-1}$ and $\tilde R'^{-1}R' = (C^{T})^{-1}\tilde R^{-1}R\, C^{T}$ hold true as well, so that both terms of (5.4.13) remain unchanged.
The random deviation of the entropy from its mean equals
\[
H(\xi_1, \ldots, \xi_l) - H_{\xi_1,\ldots,\xi_l} = \frac{1}{2}\sum_{j,k} a_{jk}(\eta_j \eta_k - R_{jk}) = \frac{1}{2}\eta^{T} a \eta - \frac{1}{2}\operatorname{tr}\mathbf{1} = \frac{1}{2}\eta^{T} a \eta - \frac{l}{2}, \quad (5.4.14)
\]
where we have denoted $\eta_j = \xi_j - m_j$. The mean square of this random deviation coincides with the desired variance. When averaging the square of the given expression, one needs to take into account the Gaussian moment relations (see (1.5.15), (4.1.18)). Thus, substituting (5.4.5) into the last formula and taking into consideration the form (5.4.2) of the probability density function, we obtain
\[
\mu_0(s) = \ln \int (2\pi)^{-l/2} (\det R)^{-1/2} \exp\Bigl\{ s H_{\xi_1,\ldots,\xi_l} - \frac{sl}{2} \Bigr\} \exp\Bigl\{ -\frac{1-s}{2}\sum_{i,j} \eta_i a_{ij} \eta_j \Bigr\}\, d\eta_1 \cdots d\eta_l
= s H_{\xi_1,\ldots,\xi_l} - \frac{sl}{2} + \ln\det{}^{-1/2}[(1-s)a] + \ln\det{}^{-1/2} R \quad (5.4.18)
\]
$(\eta_i = \xi_i - m_i)$, i.e.
\[
\mu_0(s) = -\frac{l}{2}\ln(1-s) - \frac{sl}{2} + s H_{\xi_1,\ldots,\xi_l},
\]
which holds true for $s < 1$. In particular, this result can be used to derive formula (5.4.16).
Theorem 5.1 is also valid in the generalized version, and it can be proven in the same way. Now it means the equality
\[
H_1 = -\lim_{l\to\infty} \frac{1}{l} \int \cdots \int \ln \frac{P(d\xi_1, \ldots, d\xi_l)}{v_1(d\xi_1) \ldots v_l(d\xi_l)}\, P(d\xi_1, \ldots, d\xi_l). \quad (5.5.3)
\]
Also, the equality
\[
2\Gamma = \lim_{n\to\infty} \bigl[H_{\xi_1,\ldots,\xi_n} - nH_1\bigr] = \sum_{j=0}^{\infty} \bigl[H_{\xi_{j+1}|\xi_j\ldots\xi_1} - H_1\bigr] \quad (5.5.4)
\]
holds true. This quantity can be interpreted as the entropy of the ends of the segment of the sequence in consideration.
Besides the aforementioned quantities and relationships based on the definition of entropy (1.6.13), we can also consider analogous quantities and relationships based on definition (1.6.17). Namely, similarly to (5.5.2), (5.5.4) we can introduce
\[
H_1^{P/Q} = H^{P/Q}_{\xi_k|\xi_{k-1}\xi_{k-2}\ldots} = \int \ln \frac{P(d\xi_k \mid \xi_{k-1}, \xi_{k-2}, \ldots)}{Q(d\xi_k)}\, P(d\xi_k \mid \xi_{k-1}, \xi_{k-2}, \ldots) \quad (5.5.6)
\]
and
\[
2\Gamma^{P/Q} = \sum_{j=0}^{\infty} \Bigl[H^{P/Q}_{\xi_{j+1}|\xi_j\ldots\xi_1} - H_1^{P/Q}\Bigr]. \quad (5.5.7)
\]
An inequality opposite to (5.5.5) takes place for the entropy $H^{P/Q}$. Therefore, the 'entropy of the ends' $\Gamma^{P/Q}$ has to be non-positive.
If we use expressions of types (1.7.13), (5.5.2) for the conditional entropies, then it is easy to see that the difference $H_{\xi_{j+1}|\xi_1,\ldots,\xi_j} - H_1$ turns out to be independent of $\nu_k$. Consequently, the boundary entropy $\Gamma$ does not depend on $\nu_k$. Analogously, $\Gamma^{P/Q}$ appears to be independent of $Q$ (if the multiplicativity condition is satisfied), and the equation $\Gamma^{P/Q} = -\Gamma$ is valid. This is useful to keep in mind when writing down formula (5.1.13) for both entropies. That formula takes the form
\[
H_{\xi_1,\ldots,\xi_l} = lH_1 + 2\Gamma + o_l(1), \qquad H^{P/Q}_{\xi_1\ldots\xi_l} = lH_1^{P/Q} - 2\Gamma + o_l(1). \quad (5.5.7a)
\]
For a stationary sequence whose correlation matrix $R_{j-k}$ is periodic, the diagonalizing transformation can be written out explicitly:
\[
U_{jr} = \frac{1}{\sqrt{l}}\, e^{2\pi i jr/l}, \quad (5.5.8)
\]
\[
\lambda_r = \sum_{s=0}^{l-1} R_s\, e^{-2\pi i s r/l}, \quad r = 0, 1, \ldots, l-1. \quad (5.5.9)
\]
Indeed, the eigenvalue relation
\[
\sum_k R_{j-k}\, U_{kr} = \lambda_r U_{jr}
\]
is satisfied due to (5.5.9). So, (5.5.8) defines the transformation that diagonalizes the correlation matrix $R_{j-k}$. It is easy to check its unitarity. Indeed, the Hermitian conjugate operator is
\[
U^{+} = \|U^{*}_{jr}\| = \Bigl\|\frac{1}{\sqrt{l}}\, e^{-2\pi i jr/l}\Bigr\|,
\]
and
\[
\sum_r U_{jr} U^{*}_{kr} = \frac{1}{l}\sum_r e^{2\pi i (j-k)r/l} = \frac{1}{l}\sum_{r=0}^{l-1}\varepsilon^r = \frac{1}{l}\,\frac{1-\varepsilon^l}{1-\varepsilon} = \delta_{jk} \qquad (\varepsilon = e^{2\pi i (j-k)/l},\ j \neq k;\ \text{for } j = k \text{ the sum is evidently } 1).
\]
After the computation of the eigenvalues (5.5.9), one can apply formula (5.4.8) to obtain the entropy $H_{\xi_1,\ldots,\xi_l}$.
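For a periodic (circulant) correlation matrix, the eigenvalues (5.5.9) are just a discrete Fourier transform of the correlation sequence, so (5.4.8) reduces $\frac12\ln\det R$ to a sum of logarithms. A self-contained check on an invented correlation sequence (assumed, not from the text):

```python
import cmath, math

l = 8
Rs = [1.0, 0.3, 0.1, 0.0, 0.0, 0.0, 0.1, 0.3]    # assumed periodic correlation R_s

# eigenvalues via the DFT formula (5.5.9); they are real by the symmetry of Rs
lam = [sum(Rs[s] * cmath.exp(-2j * math.pi * s * r / l) for s in range(l)).real
       for r in range(l)]

# independent check: (5.4.8) says (1/2) sum ln lambda_r = (1/2) ln det ||R_{j-k}||
R = [[Rs[(j - k) % l] for k in range(l)] for j in range(l)]

def det(mat):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in mat]
    d = 1.0
    for i in range(l):
        piv = max(range(i, l), key=lambda q: abs(a[q][i]))
        if piv != i:
            a[i], a[piv] = a[piv], a[i]
            d = -d
        d *= a[i][i]
        for q in range(i + 1, l):
            f = a[q][i] / a[i][i]
            for c in range(i, l):
                a[q][c] -= f * a[i][c]
    return d

H_eigen = 0.5 * sum(math.log(x) for x in lam)
H_det = 0.5 * math.log(det(R))
```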
In the considered case of invariance with respect to rotations it is easy to also calculate entropy (5.4.13). Certainly, it is assumed that not only measure $P$ but also measure $Q$ possesses the described property of symmetry ('circular stationarity'). Consequently, the correlation matrix $\tilde R_{jk}$ of the latter measure has the same properties that $R_{jk}$ has. Besides, the mean values $m_k$, $\tilde m_k$ are constant for both measures (they are equal to $m$ and $\tilde m$, respectively).

The unitary transformation $U = \|U_{jr}\|$ diagonalizes not only the matrix $R$, but also the matrix $\tilde R$ (even if the multiplicativity condition does not hold). Furthermore, similarly to (5.5.9), the eigenvalues of $\tilde R$ have the form
\[
\tilde\lambda_r = \sum_{s=0}^{l-1} \tilde R_s\, e^{-2\pi i s r/l}, \quad r = 0, 1, \ldots, l-1. \quad (5.5.10)
\]
In consequence, the entropy rate and the relative entropy rate equal
\[
H_1 = \frac{1}{2l}\sum_{r=0}^{l-1} \ln\lambda_r, \quad (5.5.12)
\]
\[
H_1^{P/Q} = \frac{1}{l}\sum_{r=0}^{l-1} G\Bigl(\frac{\lambda_r}{\tilde\lambda_r}\Bigr) + \frac{(m - \tilde m)^2}{2\tilde\lambda_0}. \quad (5.5.13)
\]
which already possesses the periodicity property. If $R_s$ is appreciably different from zero only for $s \ll l$, then the supplementary terms in (5.5.14) will visibly affect only a small number of elements of the correlation matrix $R_{jk}$, situated in the corners where $j \ll l$, $l - k \ll l$ or $l - j \ll l$, $k \ll l$.

After the transition to the correlation matrix (5.5.14) (and, if needed, after the analogous transition for the second matrix $\tilde R$), we can use the formulae (5.5.12), (5.5.13), (5.5.9), (5.5.10) derived before.
Taking into account (5.5.9) we will have
\[
\bar\lambda_r = \sum_{n=-\infty}^{\infty}\sum_{s=0}^{l-1} R_{s+nl}\, e^{-2\pi i s r/l} = \sum_{\sigma=-\infty}^{\infty} R_\sigma\, e^{-2\pi i\sigma r/l} \equiv \varphi\Bigl(\frac{r}{l}\Bigr).
\]
Analogously,
\[
\bar{\tilde\lambda}_r = \tilde\varphi\Bigl(\frac{r}{l}\Bigr) = \sum_{\sigma=-\infty}^{\infty} e^{-2\pi i\sigma r/l}\, \tilde R_\sigma.
\]
Hence,
\[
\bar H_1 = \frac{1}{2l}\sum_{r=0}^{l-1} \ln\varphi\Bigl(\frac{r}{l}\Bigr), \quad (5.5.16)
\]
\[
\bar H_1^{P/Q} = \frac{1}{l}\sum_{r=0}^{l-1} G\biggl(\frac{\varphi(r/l)}{\tilde\varphi(r/l)}\biggr) + \frac{(m - \tilde m)^2}{2\tilde\varphi(0)}.
\]
As $l \to \infty$ the sum turns into an integral:
\[
H_1^{P/Q} = \int_0^1 G\biggl(\frac{\varphi(\mu)}{\tilde\varphi(\mu)}\biggr)\, d\mu + \frac{(m - \tilde m)^2}{2\tilde\varphi(0)}
= \int_{-1/2}^{1/2} G\biggl(\frac{\varphi(\mu)}{\tilde\varphi(\mu)}\biggr)\, d\mu + \frac{(m - \tilde m)^2}{2\tilde\varphi(0)}. \quad (5.5.18)
\]
Here, when changing the limits of integration, we account for the property $\varphi(\mu + 1) = \varphi(\mu)$ that follows from (5.5.15).
For large $l$ the ends of the segment make a relatively small contribution in comparison with the large full entropy, which has order $lH_1$. The passage (5.5.14) to the correlation function $\bar R_s$ changes the entropy $H_{\xi_1,\ldots,\xi_l}$ by some amount that does not increase as $l$ grows. Thus, the following limits are equal:
\[
\lim_{l\to\infty} \frac{1}{l} \bar H_{\xi_1,\ldots,\xi_l} = \lim_{l\to\infty} \frac{1}{l} H_{\xi_1,\ldots,\xi_l}.
\]
Formula (5.5.18) is valid not only when the multiplicativity condition (5.5.1) is satisfied. For the stationary Gaussian case this condition means that the matrix $\tilde R_{jk}$ is a multiple of the unit matrix: $\tilde R_{jk} = \tilde\varphi\,\delta_{jk}$, $\tilde\varphi(\mu) = \tilde\varphi = \text{const}$. Then formula (5.5.18) yields
\[
H_1^{P/Q} = \int_0^1 G\biggl(\frac{\varphi(\mu)}{\tilde\varphi}\biggr)\, d\mu + \frac{(m - \tilde m)^2}{2\tilde\varphi}.
\]
The provided results can also be generalized to the case when there is not one random sequence {..., ξ_1, ξ_2, ...} but several (r) stationary and stationarily associated sequences {..., ξ_1^α, ξ_2^α, ...}, α = 1, ..., r, described by the correlation matrix

‖R_{j-k}^{α,β}‖,   α, β = 1, ..., r,

or by the matrix of spectral functions ϕ(μ) = ‖ϕ^{αβ}(μ)‖, ϕ^{αβ}(μ) = ∑_{σ=-∞}^{∞} R_σ^{αβ} e^{-2πiμσ}, and the column vector (by index α) of mean values m = ‖m^α‖.
Now formula (5.5.17) is replaced by the matrix generalization

H_1 = (1/2) ∫_{-1/2}^{1/2} tr [ln ϕ(μ)] dμ = (1/2) ∫_{-1/2}^{1/2} ln [det ϕ(μ)] dμ.   (5.5.19)
Certainly, the represented results follow from formulae (5.4.6a), (5.4.13). In form they are a synthesis of (5.4.6a), (5.4.13) and (5.5.17), (5.5.18).
3. The obtained results allow us to draw a conclusion about the entropic stability (see Section 1.5) of a family of random variables {ξ^l}, where ξ^l = {ξ_1, ..., ξ_l} is a segment of a stationary Gaussian sequence. Entropy H_{ξ_1,...,ξ_l} increases approximately linearly with the growth of l. According to (5.4.16) the variance of entropy also grows linearly. That is why the ratio Var[H_{ξ_1,...,ξ_l}]/H²_{ξ_1,...,ξ_l} converges to zero, so that the condition of entropic stability (1.5.8) for entropy (5.4.5) turns out to be satisfied.
Further we move to entropy (5.4.10). The conditions of entropic stability will be satisfied for it if variance Var[H_{ξ_1,...,ξ_l}] = Var[H^{P/Q}(ξ_1, ..., ξ_l)] (defined by formula (5.4.17)) increases approximately linearly with the growth of l, i.e. if there exists the finite limit

D_1 = lim_{l→∞} (1/l) Var H^{P/Q}(ξ_1, ..., ξ_l).   (5.5.21)
Furthermore, according to the contents of Section 1.7 (see (1.7.17)) we can introduce the conditional entropy

H^{P/Q}_{ξ_α^β | ξ_γ^δ} = ∫ ln [ P(dξ_α^β | ξ_γ^δ) / Q(dξ_α^β) ] P(dξ_α^β dξ_γ^δ).   (5.6.2)
The introduced entropies obey the regular relationships met in the discrete version. For instance, they obey the additivity condition

H^{P/Q}_{ξ_α^δ} = H^{P/Q}_{ξ_α^β} + H^{P/Q}_{ξ_β^δ | ξ_α^β},   α < β < δ.   (5.6.3)
When writing formulae (5.6.2), (5.6.3) it is assumed that measures Q, ν satisfy the multiplicativity condition

Q(dξ_α^β dξ_γ^δ) = Q(dξ_α^β) Q(dξ_γ^δ),   ν(dξ_α^β dξ_γ^δ) = ν_1(dξ_α^β) ν_2(dξ_γ^δ)   (5.6.4)

([α, β] does not overlap with [γ, δ]), which is analogous to (1.7.8). The indicated multiplicativity condition for measure Q means that the auxiliary process {η_t} is such that its values ξ_α^β and ξ_γ^δ for non-overlapping intervals [α, β], [γ, δ] must be independent. The multiplicativity condition for measure ν means in addition that the constants

N_α^β = ∫ ν(dξ_α^β)
Taking into account (5.6.5), regular entropy of type (1.6.16) and conditional entropy of type (1.7.4) can be found by the formulae

H_{ξ_α^β} = F(β) − F(α) − H^{P/Q}_{ξ_α^β},   (5.6.6)

H_{ξ_α^β | ξ_γ^δ} = F(β) − F(α) − H^{P/Q}_{ξ_α^β | ξ_γ^δ},   (5.6.7)
where the right-hand side variables are defined via relations (5.6.1), (5.6.2).
2. Further we consider a stationary process {ξt } defined for all t. In this case it is
natural to choose the auxiliary process {ηt } to be stationary as well.
In view of the fact that the entropy in a generalized version possesses the same
properties as the entropy in a discrete version when the multiplicativity condition is
satisfied, the considerations and the results related to a stationary process in discrete
time and stated in Sections 5.1 and 5.5 can be extended to the continuous-time case.
Due to ordinary general properties of entropy, conditional entropy H_{ξ_0^τ | ξ_{-σ}^0} (σ > 0) does not increase with the growth of σ. This fact entails the existence of the limit

lim_{σ→∞} H_{ξ_0^τ | ξ_{-σ}^0} = H_{ξ_0^τ | ξ_{-∞}^0}   (5.6.8)

which we denote by H_{ξ_0^τ | ξ_{-∞}^0}.
H_{ξ_{τ_1}^{τ_1+τ_2} | ξ_{-σ}^{τ_1}} = H_{ξ_0^{τ_2} | ξ_{-τ_1-σ}^0},

H_{ξ_0^{τ_1+τ_2} | ξ_{-∞}^0} = H_{ξ_0^{τ_1} | ξ_{-∞}^0} + H_{ξ_0^{τ_2} | ξ_{-∞}^0}.
Theorem 5.4. If entropy H_{ξ_0^t} is finite, then entropy rate (5.6.10) can be determined by the limit

h = lim_{t→∞} (1/t) H_{ξ_0^t}.   (5.6.11)
Proof. The proof exploits the same method as the proof of Theorem 5.1 does. We use the additivity property (1.7.4a) and thereby represent entropy H_{ξ_0^t} in the form
we observe that the term (1/(σ+n)) H_{ξ_0^σ} converges to 0; the term (1/(σ+n)) o_σ(1) also goes to 0. At the same time (n/(σ+n)) h converges to h because n/(σ + n) → 1. Therefore, the
The statements from Section 5.1 related to boundary entropy Γ can be generalized to the continuous-time case. Similarly to (5.1.10), that entropy can be defined by the formula

2Γ = lim_{σ→∞, τ→∞} [ H_{ξ_0^σ} + H_{ξ_0^τ} − H_{ξ_0^{σ+τ}} ]   (5.6.15)

analogous to (5.1.12).
With the help of the variables h, Γ the entropy H_{ξ_0^t} of a finite segment of a stationary process can be expressed as follows:

H_{ξ_0^t} = th + 2Γ + o_t(1).   (5.6.17)
If we take into account (5.6.1), then we can easily see from definition (5.6.15) of boundary entropy Γ that this entropy is independent of the choice of measure Q (or ν), similarly to Section 5.4.
If the multiplicativity conditions are satisfied, then the formula

H^{P/Q}_{ξ_0^t} = t h^{P/Q} − 2Γ + o_t(1)   (5.6.18)

will be valid for entropy H^{P/Q} in analogy with (5.6.17). In the last formula Γ is the same variable as the one in (5.6.17).
of vector x(t). As is well known, all the main results of the theory of finite-dimensional vectors can be used in this case. In so doing, it is required to make trivial changes in the formulae, such as replacing a sum by an integral, etc. The methods of calculating entropy given in Section 5.4 can be extended to the case of continuous time if we implement the above-mentioned changes. The resulting matrix formulae (5.4.8a), (5.4.13) retain their meaning with a new understanding of matrices and vectors.
Certainly, the indicated expressions need not be finite now. The condition of their finiteness is connected with the condition of absolute continuity of measure P with respect to measure ν or Q.
If we understand vectors and matrices in a generalized sense, then formu-
lae (5.4.8a), (5.4.13) are valid for both finite and infinite domain intervals [a, b] of a
process in stationary and non-stationary cases. Henceforth, we shall only consider
stationary processes and determine their entropy rates h, hP/Q .
For this purpose we can apply the approach employed in clause 2 of Section 5.5. This approach uses the passage to a periodic stationary process. While considering a process on interval [0, T] with correlation function R(t, t′) = R(t − t′), one can construct the new correlation function

R̄(τ) = ∑_{n=-∞}^{∞} R(τ + nT)   (5.7.1)

which, apart from stationarity, also possesses the periodicity property. Formula (5.7.1) is analogous to formula (5.5.14). The process

ξ̄(t) = (1/√N) ∑_{j=0}^{N-1} ξ(t + jT)
from the latter formula, where S(ω) denotes the spectral density

S(ω) = ∫_{-∞}^{∞} e^{-iωτ} R(τ) dτ = S(−ω)   (5.7.3)

of process ξ(t).
Substituting (5.7.2) into (5.4.8a) we obtain

5.7 Entropy of a Gaussian process in continuous time 139

H̄_{ξ_0^T} = (1/2) tr ln R̄ = (1/2) ∑_r ln λ̄_r = (1/2) ∑_r ln S(2πr/T).   (5.7.4)
Now we move to entropy H^{P/Q}_{ξ_0^T} that is determined by formula (5.4.13). It is convenient to apply this formula after diagonalization of the matrices R(t − t′) and R̃(t − t′). Then

tr G(R̃^{-1}R) = ∑_r G(λ_r/λ̃_r)   (5.7.5)

holds true, where
λ̃_r = S̃(2πr/T),   S̃(ω) = ∫_{-∞}^{∞} e^{-iωτ} R̃(τ) dτ   (5.7.6)
in analogy with (5.7.2), (5.7.3). After diagonalization the second term on the right-hand side of (5.4.13) takes the form

(1/2)(m − m̃)^T R̃^{-1}(m − m̃) = (1/2) c^+ U^{-1} R̃^{-1} U c = (1/2) ∑_r λ̃_r^{-1} |c_r|²   (5.7.7)

(c = U^+(m − m̃)).
Because

c_r = ∫_0^T U_{tr}^+ [m(t) − m̃(t)] dt = { √T (m − m̃)  for r = 0;   0  for r ≠ 0 }
are satisfied, entropy (5.7.8) turns out to be finite, which indicates absolute continuity of measure P with respect to measure Q.
2. Now we move to finding the entropy rate for a stationary process defined on a continuous-time scale. Since substitution (5.7.1) of the correlation function significantly affects only boundary effects, entropies (5.7.4), (5.7.8) differ from the entropies corresponding to correlation function R(τ) by terms of order 1, and not of order T. That is why the derived expressions (5.7.4), (5.7.8) can be used to determine the entropy rates

h = lim_{T→∞} (1/T) H_{ξ_0^T} = lim_{T→∞} (1/T) H̄_{ξ_0^T},

h^{P/Q} = lim_{T→∞} (1/T) H^{P/Q}_{ξ_0^T} = lim_{T→∞} (1/T) H̄^{P/Q}_{ξ_0^T}.
Thus, in the obtained formulae, we should pass to the limit T → ∞ with the summations over r becoming integrals. From (5.7.4) we have

h = (1/4π) lim_{T→∞} ∑_{r=-∞}^{∞} (2π/T) ln S(2πr/T) = (1/4π) ∫_{-∞}^{∞} ln S(ω) dω.   (5.7.10)
These results can also be obtained from formulae (5.5.17), (5.5.18), which define the entropy rate of stationary sequences, as a limiting result of an unlimited concentration of points on the time scale. If we pick points t_k = kΔ from a stationary Gaussian process in continuous time t, then we obtain a stationary Gaussian sequence {ξ(t_k)} having the correlation matrix

R_jk = R((j − k)Δ)

and the same values m. Comparing formula (5.5.15) with (5.7.3) it is easy to see that

ϕ(ωΔ/2π) Δ = ∑_{σ=-∞}^{∞} R(σΔ) e^{-iωσΔ} Δ = ∫_{-∞}^{∞} R(τ) e^{-iωτ} dτ + o_Δ(1).   (5.7.12)
Considering entropy with respect to a unit of time, rather than a single element of the sequence, we have

h = lim_{Δ→0} H_1/Δ,   h^{P/Q} = lim_{Δ→0} H_1^{P/Q}/Δ.
Substituting (5.5.17), (5.5.18) here and accounting for equality (5.7.12) and the analogous equality for ϕ̃, S̃, we obtain that

h = (1/4π) lim_{Δ→0} ∫_{-π/Δ}^{π/Δ} [−ln Δ + ln S(ω)] dω,   (5.7.13)

h^{P/Q} = (1/2π) ∫_{-∞}^{∞} G(S(ω)/S̃(ω)) dω + (m − m̃)²/(2S̃(0)).   (5.7.14)
Formula (5.7.14) coincides with (5.7.11). In turn, (5.7.13) differs from (5.7.10) by the term −ln Δ, which can be attributed to the appropriately selected measure ν. Using the freedom of choice of measure ν we can represent formulae (5.7.10), (5.7.13) in a different, more convenient form. We assume that the spectral density S(ω) converges to a finite non-zero limit as ω → ∞:
We take into account that ϕ(μ)Δ = S(2πμ/Δ) + o_Δ(1) and thereby obtain that the value of ϕ(μ)Δ converges to S(∞) when μ is fixed and Δ → 0. It is expedient to single out this limit value and rewrite formula (5.5.16). We choose

ν(dξ(t_j))/dξ(t_j) = (2πe)^{-1/2} (S(∞)/Δ)^{-1/2}   (5.7.16)

instead of formula (5.4.4). Then the first term in (5.7.15) will be related to measure ν, and instead of (5.7.15) we will get

H̄_1 = (1/2l) ∑_{r=0}^{l-1} ln [ϕ(r/l)Δ/S(∞)].   (5.7.17)
where c is some number. Hence, we obtain a finite value for entropy rate h if constraints (5.7.19) are satisfied and all the singularities of the spectral density (points at which it turns into zero or infinity) are logarithmically integrable, as, for instance, zeros and poles of the type
If these conditions are satisfied, then measure P is absolutely continuous with respect to measure ν constructed in this special way and defined by (5.7.16) and the multiplicativity condition.
The convergence condition of the other integral (5.7.11) in the upper limit has the form

∫_c^∞ [S(ω)/S̃(ω) − 1]² dω < ∞,   (5.7.19a)
analogous to (5.7.9). This condition is necessary in order for measure P to be abso-
lutely continuous with respect to Q.
If the multiplicativity condition is satisfied for measure Q, then (in a stationary case) its correlation matrix has to be a multiple of an identity matrix and its spectral density has to be constant: S̃(ω) = S̃. The equality S̃(∞) = S(∞) is necessary for absolute continuity. Consequently, S̃(ω) = S(∞). Substituting the latter value into (5.7.11) we obtain the formula

h^{P/Q} = (1/2π) ∫_{-∞}^{∞} G(S(ω)/S(∞)) dω   (G(x) = (x − 1 − ln x)/2)   (5.7.20)

for equivalent means. The similarity of the last formula to (5.7.18) is evident; the only difference between them is the choice of the integrand.
Example 5.5. Consider the spectral density

S(ω) = S_0 (ω² + γ²)/(ω² + β²).

We would like to find its entropy rates. Since S(∞) = S_0, according to (5.7.18) we have

h = (1/2π) ∫_0^∞ ln [(ω² + γ²)/(ω² + β²)] dω = (1/2)(γ − β).
Further we apply formula (5.7.20) and conclude that

h^{P/Q} = (1/2π) ∫_0^∞ [(ω² + γ²)/(ω² + β²) − 1] dω − (1/2π) ∫_0^∞ ln [(ω² + γ²)/(ω² + β²)] dω
       = (γ² − β²)/(4β) − h = (1/4β)(γ − β)².   (5.7.21)
Let now

S̃(ω) = S_0/ω².   (5.7.22)

Then the integral in (5.7.14) will be reduced to the integrals involved in (5.7.21) (for γ = 0) and we will obtain

h^{P/Q} = β/4.   (5.7.23)
When choosing the spectral density (5.7.22), the measure Q (it describes the
stochastic process η (t)) does not satisfy the multiplicativity condition. However,
this condition will hold for derivative η̇ (t) = d η (t)/dt because it has a uniform
spectral density S0 . When passing to the derivative η̇ (t), the process ξ (t) also
needs to be replaced with its derivative ξ̇ (t). That is why we can consider that re-
sult (5.7.23) complies with the multiplicativity condition.
D_0 = (1/2π) ∫_0^∞ [S(ω)/S̃(ω) − 1]² dω + (m − m̃)² S(0)/(4S̃²(0)).   (5.7.24)
of the event that the first point τ_1 from the set of n consecutive random points τ_1 < ··· < τ_n lies within interval [t_1, t_1 + Δ_1], the second one within interval [t_2, t_2 + Δ_2], and so forth, the last one within interval [t_n, t_n + Δ_n], and also there are no other random points lying in other places of [a, b]. The simplest system of random points is a Poisson system, for which

P_n = (1/n!) [∫_a^b β(t) dt]^n exp{−∫_a^b β(t) dt},

p_n(τ_1, ..., τ_n) = n! β(τ_1) ··· β(τ_n) / [∫_a^b β(t) dt]^n.   (5.8.5)
Taking β out of the sign of the logarithm, this expression can apparently be written as follows:

H^{P/Q}_{ξ_a^b} = (β − 1)T − (ln β) E[n] + H^{P/Q_1}_{ξ_a^b}   (T = b − a).   (5.8.7)
Then

H^{P/Q_1}_{ξ_a^b} = ∑_{n=0}^{∞} ∫···∫_{a<t_1<···<t_n<b} P_n p_n(t_1, ..., t_n) ln[e^T P_n p_n(t_1, ..., t_n)] dt_1 ··· dt_n   (5.8.8)
where

H^{P/Q_1}_{ξ_a^b|n} = ∑_{n=0}^{∞} P_n ∫···∫_{a<t_1<···<t_n<b} p_n(t_1, ..., t_n) ln[e^T p_n(t_1, ..., t_n)] dt_1 ··· dt_n.   (5.8.10)
H^{P/Q_1}_{ξ_a^b} = E[ln(e^T e^{-γT} γ^n)] = γT ln γ + (1 − γ)T   (5.8.11)

since E[n] = γT.
Next we obtain from (5.8.7) that

H^{P/Q}_{ξ_a^b} = (β − γ)(b − a) + γ(b − a) ln(γ/β).   (5.8.12)

We divide the last result by b − a to get the entropy rate of one Poisson measure with respect to another:

h^{P/Q} = β − γ + γ ln(γ/β) = γ [β/γ − 1 − ln(β/γ)] ≥ 0.
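Since, given the number n of points, a homogeneous Poisson process distributes them as independent uniform points under both measures, the entropy (5.8.12) reduces to the divergence between the two Poisson counting distributions with means γT and βT. A quick numerical check (Python sketch; the rates and T are illustrative):

```python
import math

gamma_, beta_, T = 0.7, 1.3, 5.0     # rates of measures P and Q, interval length
lam1, lam2 = gamma_ * T, beta_ * T   # counting means under P and Q

# H^{P/Q} = sum_n P_n ln(P_n/Q_n); for Poisson counts
# ln(P_n/Q_n) = n ln(lam1/lam2) - lam1 + lam2.
H = 0.0
for n in range(200):
    p = math.exp(n * math.log(lam1) - lam1 - math.lgamma(n + 1))
    H += p * (n * math.log(lam1 / lam2) - lam1 + lam2)

h = H / T
print(h, beta_ - gamma_ + gamma_ * math.log(gamma_ / beta_))   # compare with (5.8.12)/T
```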
Entropies (5.8.11), (5.8.12) are proportional to b − a due to the fact that Poisson processes are processes with independent increments. Entropies (5.8.9), (5.8.10) depend on the interval length in a more complicated way. If we introduce the function

5.8 Entropy of a stochastic point process 147

S(x) = e^{-x} ∑_{n=0}^{∞} (x^n/n!) ln n!
The probability that an elementary interval is hit by more than one random point is assumed to be a value of order Δ², i.e. of a higher order of smallness. Then, for sufficiently large N, the number n of random points lying within [a, b] will, with probability close to one, not differ from the number of points n′ = ∑_{k=0}^{N-1} ξ(t_k) hitting Z.
A point process on Z is a discrete process having 2^N distinct realizations in total. Event n′ = 0 is realized by only one scenario. Event n′ = 1 can be realized by N different scenarios, i.e. it constitutes N different realizations. Event n′ = 2 consists of N(N − 1)/2 different realizations, and so forth. Similarly to Section 1.6 we introduce a measure conveying information about the number of different realizations to obtain
Note that

∑_{n′=0}^{∞} ν(n′) = 2^N.
It is also easy to compute the number of realizations of the event that τ_1 hits interval [t_1, t_1 + Δ_1), at the same time τ_2 hits interval [t_2, t_2 + Δ_2), and so on, τ_n hits interval [t_n, t_n + Δ_n). Assuming Δ_k ≫ Δ and n ≪ N, the number of such realizations equals

Δν ≈ (Δ_1/Δ) ··· (Δ_n/Δ).
Next we recall that the probability of such a set is given by formula (5.8.4). Since

ΔP/Δν ≈ P_n p_n(t_1, ..., t_n) Δ^n,

we employ formulae (1.6.5), (1.6.6) to obtain
Factoring out Δ from the sign of the logarithm, we can express the derived result in terms of entropy (5.8.8):

H_ξ ≈ (ln 1/Δ) E[n] + (b − a) − H^{P/Q_1}_{ξ_a^b}.   (5.8.13)
This substantiates the introduction of entropy H^{P/Q_1}_ξ of measure P with respect to Poisson measure Q_1. If we partition interval [a, b] in a non-uniform fashion, then, due to the relationship t_{m+1} − t_m = β(t_m)Δ, entropy H_ξ will be expressed (in analogy with (5.8.13)) in terms of entropy H^{P/Q}_ξ of measure P with respect to the Poisson measure Q having non-uniform density β(t).
3. In the case of a stationary point process we can consider the average entropy rate per unit of time. Here we provide two alternative approaches.
1) According to Theorem 5.4, we can calculate this rate by formula (5.6.11). Here we need to investigate the behaviour of entropies (5.8.8), (5.8.10) for large T.
In the case of a stationary process the mean number of points E[n] = n̄ is proportional to the interval length b − a = T. Besides, for an ergodic process the dependence of its variance Var[n] on T approaches, for large T, a linear law:

Var(n) = D_0 T + O(1).
Also, random variable n obeys the Central Limit Theorem. For large T the probability of inequality n_0 ≤ n ≤ n_0 + ΔN can be found with the help of the Gaussian distribution:

ΔP ≈ (2π Var(n))^{-1/2} e^{-(n_0 - n̄)²/(2 Var(n))} ΔN   (ΔN² ≪ Var(n)),   (5.8.14)

where ΔN has the same meaning as in (1.6.4a). This allows us to compute H_n approximately as the entropy of a Gaussian variable. Thus, we obtain from (5.8.14) that

H(n) = −ln P_n ≈ −ln(ΔP/ΔN) ≈ (1/2) ln(2π Var(n)) + (n − n̄)²/(2 Var(n)).   (5.8.15)
This approximation is a particular case of formula (5.4.5). Averaging (5.8.15) we find that

H_n ≈ (1/2) ln(2πe Var(n)) ≈ (1/2) ln(2πe D_0 T).
It follows from the obtained dependence that, in particular, the limit

lim_{T→∞} (1/T) H_n = 0

vanishes. This means that the contribution of entropy H_n to entropy rate (5.6.11) reduces to zero. Consequently, only entropy (5.8.10) influences the entropy rate:

h^{P/Q_1} = lim_{T→∞} (1/T) H^{P/Q_1}_{ξ_0^T|n}.   (5.8.16)
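For a Poisson count both claims are easy to verify numerically (Python sketch; γ and the horizons T are illustrative assumptions): the exact Shannon entropy of the counting distribution tracks the Gaussian value (1/2) ln(2πe D_0 T), and H_n/T shrinks toward zero.

```python
import math

def poisson_entropy(lam, nmax):
    # Exact Shannon entropy -sum_n p_n ln p_n of a Poisson distribution with mean lam
    H = 0.0
    for n in range(nmax):
        logp = n * math.log(lam) - lam - math.lgamma(n + 1)
        H -= math.exp(logp) * logp
    return H

gamma_ = 1.0
for T in (100.0, 1000.0):
    H_n = poisson_entropy(gamma_ * T, int(10 * gamma_ * T))
    gauss = 0.5 * math.log(2 * math.pi * math.e * gamma_ * T)   # (1/2) ln(2 pi e D0 T)
    print(T, H_n, gauss, H_n / T)
```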
Using the notation

Φ(n) = H^{P/Q_1}_{ξ_0^T}(| n) = ∫···∫_{0<t_1<···<t_n<T} p_n(t_1, ..., t_n) ×   (5.8.17)

Φ′(n) = Φ(n + 1) − Φ(n),
Φ″(n) = Φ(n + 1) − 2Φ(n) + Φ(n − 1)
if

lim_{T→∞} (Var(n)/T) Φ″(n̄) = 0.   (5.8.20)
It has been already indicated that ratio Var[n]/T usually converges to a finite limit
D0 as T → ∞. Therefore, condition (5.8.20) is satisfied if
Φ″(n̄) → 0 as T → ∞.   (5.8.21)

As the investigation shows, condition (5.8.21), which states that the dependence Φ(n) approaches a direct proportionality as n ≈ cT grows, is satisfied for a large number of practical examples.
Thus, in compliance with formula (5.8.19), the entropy rate can be computed if we suppose that the number of random points sampled from a large interval [0, T] is not random but known in advance and equal to n = E[n].
Example 5.6. In this example we calculate the entropy of a Poisson process by the specified method. The number of points on the entire interval is assumed to be non-random and equal to n = γT. At the same time p_n(t_1, ..., t_n) = n!T^{-n}. We substitute the latter function into (5.8.17) and find that

Φ(n) = T + ln n! − n ln T.
Using Stirling's formula for ln n!, we obtain that

Φ(n) = T + n ln n − n + (1/2) ln(2πn) − n ln T.
Condition (5.8.21) actually takes place because

Φ″(n̄) ≈ 1/n̄ = 1/(γT).
We substitute the derived expression for Φ(n) into (5.8.19) with n = γT and pass to the limit T → ∞. Thus, we obtain

h^{P/Q_1} = 1 + γ ln γ − γ,   (5.8.22)

which, of course, coincides with the entropy rate determined from (5.8.11).
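The convergence Φ(n̄)/T → 1 + γ ln γ − γ is easy to watch numerically (Python sketch; γ and the values of T are illustrative, and ln n! is evaluated via lgamma):

```python
import math

gamma_ = 2.0
for T in (10.0, 100.0, 1000.0, 10000.0):
    n = gamma_ * T
    # Phi(n) = T + ln n! - n ln T, with ln n! = lgamma(n + 1)
    Phi = T + math.lgamma(n + 1) - n * math.log(T)
    print(T, Phi / T)

print(1 + gamma_ * math.log(gamma_) - gamma_)   # the limit (5.8.22)
```

The residual (1/2) ln(2πn)/T of Stirling's formula decays like ln T / T, which is visible in the printed sequence.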
2) The other method to calculate the entropy rate is based on its definition (5.6.10) as a conditional entropy. Fixation of process ξ_{-∞}^0 means fixation of all random points ..., τ_{-2}, τ_{-1}, τ_0 < 0 that occurred before moment t = 0. Therefore, (5.6.10) can be rewritten as

h^{P/Q_1} = (1/τ) H^{P/Q_1}_{ξ_0^τ | ..., τ_{-1}, τ_0} = (1/τ) E[H^{P/Q_1}_{ξ_0^τ}(| ..., τ_{-1}, τ_0)].   (5.8.23)

Here entropy H^{P/Q_1}_{ξ_0^τ}(| ..., τ_{-1}, τ_0) can be defined by formula (5.8.8) with probabilities P_n and densities p_n(τ_1, ..., τ_n) replaced with the respective conditional probabilities and densities
It is desirable to consider a small length τ of interval [0, τ], because in this case the probability of two or more points hitting the interval is negligibly small, and those probabilities can be disregarded. Then we can retain only two terms in expression (5.8.24), supposing p_1(t_1 | ..., τ_{-1}, τ_0) = 1/τ. We obtain

H^{P/Q_1}_{ξ_0^τ}(| ..., τ_{-1}, τ_0) = [1 − p(0 | ..., τ_{-1}, τ_0)τ] ln{e^τ [1 − p(0 | ..., τ_{-1}, τ_0)τ]}
    + p(0 | ..., τ_{-1}, τ_0)τ ln[e^τ p(0 | ..., τ_{-1}, τ_0)] + O(τ²)

or, equivalently,

H^{P/Q_1}_{ξ_0^τ}(| ..., τ_{-1}, τ_0) = τ − p(0 | ..., τ_{-1}, τ_0)τ + p(0 | ..., τ_{-1}, τ_0)τ ln p(0 | ..., τ_{-1}, τ_0) + O(τ²).
We substitute the last expression into (5.8.23), make a passage to the limit τ → 0
and thereby obtain the desired entropy rate
Example 5.7. We apply the derived formula to a stationary point process with a limited aftereffect. We suppose that the intervals σ = t_{m+1} − t_m between adjacent random points are independent random variables having an identical probability density function w(σ). Then, evidently,

P(n = 1 | ..., τ_{-1}, τ_0) = w(−τ_0)τ / ∫_{-τ_0}^{∞} w(σ) dσ + O(τ²),

so that

p(0 | ..., τ_{-1}, τ_0) = w(−τ_0) / ∫_{-τ_0}^{∞} w(σ) dσ.
Example 5.8. Further, we consider a more complicated example. Let the increments
τm+1 − τm of a given stationary point process be mutually independent as before
but their probability density functions alternate between w1 (σ ) and w2 (σ ), which
are different from each other. In other words, if τm+1 − τm is distributed as w1 , then
τm+2 − τm+1 is distributed as w2 ; furthermore, increment τm+3 − τm+2 is distributed
as w_1 and τ_{m+4} − τ_{m+3} is distributed as w_2, and so forth. Such a point process is equivalent to a stationary process with two states A_1 and A_2, where the times of staying in each state are mutually independent and have distributions w_1 (for the time in state A_1) and w_2 (for the time in state A_2), respectively. Moreover, the random points can be classified by the additional parameter ϑ, supposing that ϑ_m = 1 if the transition from A_1 to A_2 occurs at point τ_m and ϑ_m = 2 if the reverse transition from A_2 to A_1 is observed.
In the described case the probability density p(0 | ..., τ_{-1}, τ_0) depends not only on time τ_0 of the occurrence of the last random point but also on its type ϑ_0. Namely,

p(0 | ..., τ_{-1}, τ_0) = p(0 | τ_0, ϑ_0) = w_{ϑ_0}(−τ_0) / ∫_{-τ_0}^{∞} w_{ϑ_0}(ρ) dρ.   (5.8.29a)
We still need to calculate P(ϑ, dσ). A priori, the probability that the elementary interval [−σ, −σ + dσ] is hit by a random point of either of the two specified types is the same and equals

5.9 Entropy of a discrete Markov process in continuous time 153

dP = dσ/(σ̄_1 + σ̄_2),   σ̄_ϑ = ∫_0^∞ ρ w_ϑ(ρ) dρ,

because the mean density of points of each type is equal to (σ̄_1 + σ̄_2)^{-1}. If a point with ϑ_0 = 1 comes out, then this point will be the last one with probability ∫_σ^∞ w_1(ρ) dρ, i.e. other points will not hit interval [−σ + dσ, 0]. If ϑ_0 = 2, then point τ_0 will be the last one with probability ∫_σ^∞ w_2(ρ) dρ. Therefore, formula (5.8.30) can be rewritten as
h^{P/Q_1} = 1 + (1/(σ̄_1 + σ̄_2)) ∑_{ϑ=1}^{2} ∫_0^∞ [ ln w_ϑ(σ) − ln ∫_σ^∞ w_ϑ(ρ) dρ − 1 ] w_ϑ(σ) dσ.   (5.8.31)
Consider the exponential densities

w_ϑ(σ) = μ_ϑ e^{-μ_ϑ σ}.

Then

σ̄_ϑ = 1/μ_ϑ,   ∫_σ^∞ w_ϑ(ρ) dρ = e^{-μ_ϑ σ},

h^{P/Q_1} = 1 + (μ_1μ_2/(μ_1 + μ_2)) ∑_{ϑ=1}^{2} [ln μ_ϑ − 1]
         = 1 − 2μ_1μ_2/(μ_1 + μ_2) + (μ_1μ_2/(μ_1 + μ_2))(ln μ_1 + ln μ_2).   (5.8.32)
If, in particular, μ1 = μ2 = γ , then formula (5.8.32) coincides with (5.8.22) since
under this condition the point process in consideration turns into a Poisson process.
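For the exponential case the inner integral of (5.8.31) can also be evaluated by brute-force quadrature and compared with the closed form (5.8.32) (Python sketch; the rates μ_1, μ_2 and the midpoint grid are illustrative numerical choices):

```python
import math

mu1, mu2 = 0.5, 2.0

def inner(mu, N=100_000):
    # int_0^inf [ln w(s) - ln int_s^inf w(r) dr - 1] w(s) ds
    # for w(s) = mu e^{-mu s}, via the midpoint rule on [0, 50/mu]
    L = 50.0 / mu
    ds = L / N
    total = 0.0
    for k in range(N):
        s = (k + 0.5) * ds
        w = mu * math.exp(-mu * s)
        tail = math.exp(-mu * s)          # int_s^inf w(rho) drho
        total += (math.log(w) - math.log(tail) - 1.0) * w * ds
    return total

sbar = 1 / mu1 + 1 / mu2                  # sigma-bar_1 + sigma-bar_2
h = 1 + (inner(mu1) + inner(mu2)) / sbar  # formula (5.8.31)
closed = (1 - 2 * mu1 * mu2 / (mu1 + mu2)
          + mu1 * mu2 / (mu1 + mu2) * (math.log(mu1) + math.log(mu2)))
print(h, closed)                          # formula (5.8.32)
```

Setting mu1 = mu2 = γ in the same computation reproduces the Poisson value 1 + γ ln γ − γ of (5.8.22).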
Consider a Markov process ξ(t) with a discrete state space, i.e. a process having either a finite or a countable number of possible states. In contrast to the process considered in Section 5.2, it now evolves in continuous time. Let its transition probabilities be defined by the differential transition probability π_t(x, x′) according to the formula
However, if we want to calculate the entropy of the continuous set of values ξ0T ,
then we have to apply the formulae of the generalized version, i.e. the theory of
Sections 1.6 and 5.6.
Denote by {τ j } points on the time axis (i.e. time moments), at which we observe
jumps from one state of the process to another. We define the auxiliary measure
Q1 (ξ0T ) as a Poisson measure (with a unit density) for the system of random points
{τ j }. We pick out the initial entropy (5.9.1) by the formula
H_{ξ_0^T} = H_{ξ(0)} + H_{ξ_{0+0}^T | ξ(0)}

and define conditional entropy H_{ξ_{0+0}^T | ξ(0)} (or, more succinctly, H_{ξ_0^T | ξ(0)}) according to formulae (1.7.17a), (1.7.17), (5.6.1) from the generalized version. Then we will have

H^{P/Q_1}_{ξ_0^T} = −H_{ξ(0)} + H^{P/Q_1}_{ξ_0^T | ξ(0)}.   (5.9.1a)
Due to the Markovian property of the process, the latter formula can be rewritten as follows:

H^{P/Q_1}_{ξ_α^γ | ξ(α)} = H^{P/Q_1}_{ξ_α^β | ξ(α)} + H^{P/Q_1}_{ξ_β^γ | ξ(β)},

so that

H^{P/Q_1}_{ξ_0^T} = −H_{ξ(0)} + H^{P/Q_1}_{ξ_0^{t_1} | ξ(0)} + H^{P/Q_1}_{ξ_{t_1}^{t_2} | ξ(t_1)} + ··· + H^{P/Q_1}_{ξ_{t_N}^T | ξ(t_N)}.   (5.9.1b)

In the stationary case

H^{P/Q_1}_{ξ_0^τ | ξ(0)} = τ h^{P/Q_1}.   (5.9.2)
Taking this into account, we derive from (5.9.1a) and (5.9.2) that

H^{P/Q_1}_{ξ_0^T} = −H_{ξ(0)} + h^{P/Q_1} T.

We compare the last formula with (5.6.18) and conclude that in this case o_t(1) = 0 and

2Γ = H_{ξ(0)} = −∑_{ξ(0)} P(ξ(0)) ln P(ξ(0)).   (5.9.3)
It is convenient to use (5.9.2) for small τ in order to determine entropy rate h^{P/Q_1}, similarly to the approach used in Section 5.8 (see (5.8.23) and (5.8.24a)).
For small τ we can neglect the probability of more than one transition point hitting (0, τ]. Then the possible scenarios are the following: either there is no transition, with probability

1 + π(ξ(0), ξ(0))τ + O(τ²),

or the transition to a state x′ ≠ ξ(0) occurs, with probability π(ξ(0), x′)τ + O(τ²). Analogously to (5.8.25) we write down the entropy of these events as

H^{P/Q_1}_{ξ_0^τ}(| ξ(0)) = [1 + π(ξ(0), ξ(0))τ] ln{e^τ [1 + π(ξ(0), ξ(0))τ]}
Averaging with respect to ξ(0) and making the passage to the limit, we get

h^{P/Q_1} = lim_{τ→0} (1/τ) H^{P/Q_1}_{ξ_0^τ | ξ(0)}.

In the stationary case the probabilities P(ξ) satisfy

dP(ξ)/dt = ∑_{ξ′} P(ξ′) π(ξ′, ξ) = 0.   (5.9.5)
Example 5.9. Consider a process with two states characterized by the differential transition probabilities

‖ π(1,1)  π(1,2) ‖   ‖ −μ   μ ‖
‖ π(2,1)  π(2,2) ‖ = ‖  ν  −ν ‖.   (5.9.6)

Then equation (5.9.5) is reduced to

−μP(1) + νP(2) = 0,
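In matrix form, the stationarity condition (5.9.5) for this example is the statement that the row vector (P(1), P(2)) annihilates the matrix (5.9.6); a minimal check (Python with numpy; the values of μ, ν are illustrative):

```python
import numpy as np

mu, nu = 0.4, 0.6
Pi = np.array([[-mu,  mu],
               [ nu, -nu]])          # differential transition probabilities (5.9.6)

# The stationary distribution solves sum_{xi'} P(xi') pi(xi', xi) = 0 together
# with sum P = 1, giving P(1) = nu/(mu+nu), P(2) = mu/(mu+nu).
P = np.array([nu, mu]) / (mu + nu)
print(P @ Pi)                        # both components vanish
```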
which generalizes (5.9.5) (A is arbitrary). In the given case formula (5.9.1) may become invalid, and thereby we may need to employ measure Q in order to define H_{ξ(0)}.
The provided formulae (5.9.4), (5.9.8) can be applied to calculate the average entropy rate not only in a stationary case. Those formulae are also appropriate in a non-stationary case due to the Markovian properties of the process; namely, they

5.10 Entropy of diffusion Markov processes 157

define the entropy density h(t) per unit of time, which may depend on time t. Therefore, the respective averaging should be conducted with respect to the non-stationary distribution P(dξ).
ṗ_t(x) = −(∂/∂x)[a p_t(x)] + (1/2)(∂²/∂x²)[b p_t(x)],   (5.10.1)

where

p_t(x) dx = P{x_t ∈ [x, x + dx]}.
In order to determine the entropy H_{x_0^T} for the given process, we need to select a measure Q such that measure P is absolutely continuous with respect to it. It is also desirable for such a measure Q to be as simple as possible.
It is known (for instance, see [48]) that one of the desired measures corresponds to a diffusion process with the same local variance b(x, t) but with a null drift, i.e. a measure for which (5.10.1) is replaced with the equation

q̇_t(x) = (1/2)(∂²/∂x²)[b q_t(x)],   q_t(x) dx = Q{x_t ∈ [x, x + dx]}.
That is, the Radon–Nikodym derivative takes the form

P(dx_0^T)/Q(dx_0^T) = [p(x(0))/q(x(0))] exp{ ∫_0^T [a(x(t), t)/b(x(t), t)] [d*x(t) − (1/2) a(x(t), t) dt] },   (5.10.2)
where the stochastic integral ∫ ... d*x(t) is understood in the Itô sense. Further, we try to satisfy the multiplicativity property (5.6.4) in order for the theory of Section 5.6 to be applicable. For this purpose we represent process {x(t), t ∈ [0, T]} as process {ξ(0), ξ(t), t ∈ [0, T]}, where ξ(0) = x(0), ξ(t) = ẋ(t), t > 0, and require that b(x, t) = b(t) does not depend on x. Then measure Q will correspond to the Gaussian delta-correlated process:

∫ ξ(t) dQ = 0,   ∫ ξ(t)ξ(t′) dQ = b(t) δ(t − t′).
Therefore, it is convenient to compute entropy H^{P/Q}_{x_t^{t+τ} | x(t)} approximately, taking into consideration the smallness of τ. Then we pass to the limit

h^{P/Q} = lim_{τ→0} (1/τ) H^{P/Q}_{x_t^{t+τ} | x(t)}.   (5.10.3)

We have

H^{P/Q}_{x_t^{t+τ} | x(t)} = E{ (a(x(t))/b) ( E[x(t + τ) − x(t) | x(t)] − (1/2) a(x(t))τ ) } + o(τ).
Taking into account that, according to the definition of the drift a, the relationship

E[x(t + τ) − x(t) | x(t)] = a(x(t))τ + o(τ)

is valid, we have

(1/τ) H^{P/Q}_{x_t^{t+τ} | x(t)} = (1/2) E[a(x(t))²/b] + o(1).

Substituting the last expression into (5.10.3) we find the entropy density

h^{P/Q} = (1/2b) E[a(x(t))²],   (5.10.4)
which is independent of t in a stationary case.
It was assumed above that a and b do not depend on time t according to the stationarity condition. If that condition were not satisfied, we would obtain the entropy density

h^{P/Q}(t) = (1/2) E[a(x(t), t)²/b(x(t), t)]   (5.10.5)
dependent on time via the described method (here the condition of independence of b from x may also fail to be satisfied). In terms of the indicated density, the full entropy obtained from (5.10.2) is

H^{P/Q}_{x_0^T} = E ln [P(dx_0^T)/Q(dx_0^T)]
Then, in a stationary case with b = const, the entropy density can be found by formula (5.10.4) employing the stationary probability density function p_st(x), which satisfies the stationary Fokker–Planck equation

−(d/dx)[a(x) p_st(x)] + (b/2)(d²/dx²) p_st(x) = 0.
If we introduce the potential function

U(x) = −∫_c^x a(x′) dx′,   (5.10.7)
Applying integration by parts (also accounting for the fact that density p_st(x) vanishes on the boundaries, for instance, as x → ±∞), the latter formula can be reduced to the form

h^{P/Q} = (1/4N) ∫ e^{-2U(x)/b} (d²U(x)/dx²) dx = −(1/4) E[da(x)/dx].   (5.10.9)
Example 5.10. Let function a(x) be linear: a(x) = −βx + γ. In order for the process to be stationary, positiveness β > 0 is necessary. Function (5.10.7) is given by

U(x) = (1/2)βx² − γx,

and distribution (5.10.8) appears to be Gaussian:

p_st(x) = √(β/πb) exp{ −(β/b)(x − γ/β)² },   N = √(πb/β).
As is known, the given process is a Gaussian process having the spectral density
S(ω ) = 2b/(ω 2 + β 2 ). That is why it coincides with the process considered in Sec-
tion 5.7 (Example 5.5). Certainly, result (5.10.10) is equivalent to the corresponding
result (5.7.23), the derivation of which is based on the theory of Gaussian processes.
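The value β/4 can be re-derived from (5.10.4) by Monte Carlo sampling of the stationary Gaussian density of Example 5.10, whose variance is b/(2β) (Python sketch with numpy; the parameter values, seed and sample size are illustrative assumptions):

```python
import numpy as np

beta, gamma, b = 1.5, 0.7, 2.0
rng = np.random.default_rng(0)

# Stationary density of Example 5.10: Gaussian with mean gamma/beta and
# variance b/(2 beta). Sample it and average a(x)^2 / (2b)  (formula (5.10.4)).
x = rng.normal(gamma / beta, np.sqrt(b / (2 * beta)), 1_000_000)
a = -beta * x + gamma
h_mc = np.mean(a ** 2) / (2 * b)

print(h_mc, beta / 4)   # also equals -(1/4) E[da/dx] by (5.10.9), since da/dx = -beta
```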
The results provided in this paragraph may be generalized to the multivariate case, when there is a multicomponent Markov process {x(t)} = {x_1(t), ..., x_l(t)}. Suppose it is characterized by drifts a_ρ(x, t), ρ = 1, ..., l, and the matrix of local variances (diffusion parameters) b_ρσ(x, t), ρ, σ = 1, ..., l. Then, selecting measure Q described in the formulation of Theorem 4.1 of Stratonovich's monograph [48], we obtain the Radon–Nikodym derivative

P(dx_0^T | x(0))/Q(dx_0^T | x(0)) = exp{ ∫_0^T ∑_{ρ,σ} a_ρ b^{-1}_{ρσ} [d*x_σ − (1/2) a_σ dt] }.   (5.10.11)
Applying Green's formula and taking into account the fact that the integral over the boundary vanishes, we obtain that

h^{P/Q}(t) = −(1/4) ∑_{ρ,σ,π} ∫ b_σπ p_st (∂/∂x_π)[a_ρ b^{-1}_{ρσ}] dx_1 ··· dx_l
          = −(1/4) E[ ∑_{ρ,σ,π} b_σπ (∂/∂x_π)(a_ρ b^{-1}_{ρσ}) ].
If matrix b_ρσ is non-singular and independent of x, then the following simple formula takes place:

h^{P/Q} = −(1/4) ∑_{ρ,σ,π} E[ b_σπ b^{-1}_{σρ} ∂a_ρ/∂x_π ]
       = −(1/4) E[ ∑_ρ ∂a_ρ/∂x_ρ ] ≡ −(1/4) E[div a].
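For a linear drift a(x) = −Bx with unit diffusion matrix b = I (an assumed multicomponent Ornstein-Uhlenbeck example, not taken from the text), the stationary covariance Σ solves the Lyapunov equation BΣ + ΣBᵀ = I, and the multivariate analogue of (5.10.4), (1/2)E[aᵀ b⁻¹ a], can be compared with −(1/4)E[div a] = (1/4) tr B:

```python
import numpy as np

B = np.array([[2.0, 0.5],
              [0.5, 1.0]])   # symmetric positive definite drift matrix (assumed)
l = B.shape[0]

# Stationary covariance: B Sigma + Sigma B^T = I, solved by vectorization
# (with row-major vec, kron(I, B) acts as X -> X B^T and kron(B, I) as X -> B X).
K = np.kron(B, np.eye(l)) + np.kron(np.eye(l), B)
Sigma = np.linalg.solve(K, np.eye(l).ravel()).reshape(l, l)

h_quadratic = 0.5 * np.trace(B.T @ B @ Sigma)   # (1/2) E[a^T b^{-1} a]
h_div = 0.25 * np.trace(B)                      # -(1/4) E[div a]
print(h_quadratic, h_div)
```

For symmetric B the solution is Σ = B⁻¹/2, which makes the agreement of the two expressions exact; the scalar case B = β reproduces β/4.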
1. The results and the methods stated in Section 5.3 can be generalized to the continuous-time case. It is assumed that the joint process {ξ(t)} = {x(t), y(t)} is Markov. The theory of Markov processes allows us to calculate entropy H^{P/Q}_{ξ_a^T} = H^{P/Q}_{x_a^T y_a^T}, where Q is a probability measure such that the measure P is absolutely continuous with respect to it. It is convenient to select measure Q in such a way that processes {x(t)} and {y(t)} are independent for the selected measure:
will be valid. Note that they are analogous to the relationships (5.3.1), (5.3.6), (5.2.5)
from the respective discrete version.
In the case of continuous time it is convenient to introduce the entropy densities

$$h^{P/Q}_{xy}(t) = \lim_{\tau\to 0} \frac{1}{\tau}\, H^{P/Q}_{x_t^{t+\tau} y_t^{t+\tau} \mid x_a^t y_a^t}, \qquad (5.11.4)$$

$$h^{P/Q_2}_{y}(t) = \lim_{\tau\to 0} \frac{1}{\tau}\, H^{P/Q_2}_{y_t^{t+\tau} \mid y_a^t}, \qquad (5.11.5)$$

$$h^{P/Q_1}_{x}(t) = \lim_{\tau\to 0} \frac{1}{\tau}\, H^{P/Q_1}_{x_t^{t+\tau} \mid x_a^t}, \qquad (5.11.6)$$

$$h^{P/Q_2}_{y|x}(t) = \lim_{\tau\to 0} \frac{1}{\tau}\left[H^{P/Q_2}_{y_a^{t+\tau} \mid x_a^{t+\tau}} - H^{P/Q_2}_{y_a^{t} \mid x_a^{t}}\right]. \qquad (5.11.7)$$
As is seen from (5.11.4), in a stationary case density h^{P/Q}_{xy} does not depend on t, and it is reasonable to determine the other entropy densities with the help of an additional passage to the limit a → −∞. In such a case the entropies become strictly proportional to τ, and the passage to the limit τ → 0 becomes redundant. Moreover, formulae (5.11.5), (5.11.6) are replaced with the following:
5.11 Entropy of a composite, conditional Markov process and its components 163
$$h^{P/Q_2}_{y}(t) = \frac{1}{\tau}\, H^{P/Q_2}_{y_t^{t+\tau} \mid y_{-\infty}^t}, \qquad h^{P/Q_1}_{x|y}(t) = \lim_{a\to-\infty} \frac{1}{\tau}\left[H^{P/Q_1}_{x_a^{t+\tau} \mid y_a^{t+\tau}} - H^{P/Q_1}_{x_a^{t} \mid y_a^{t}}\right]$$
(similarly for the other pair h^{P/Q_1}_x, h^{P/Q_2}_{y|x}). At the same time, relationships (5.11.8), (5.11.9) retain their significance. All these entropy densities are constant in a stationary case. Furthermore, we can prove that they are equivalent to the entropy rates; an analogous statement is valid for density h^{P/Q_1}_{x|y} (as well as for h^{P/Q_2}_{y|x}) in a stationary case. All these statements extend the respective results proven in Section 5.3 to the continuous-time case.
Generalizing the methods of Section 5.3, we can compute the entropy H^{P/Q_2}_{y_0^T} or its density h^{P/Q_2}_y for part of the components {y} of the Markov process ξ, and the entropy H^{P/Q_1}_{x_0^T | y_0^T} or density h^{P/Q_1}_{x|y} of the conditional Markov process {x(t)}, as well as the analogous entropies H^{P/Q_1}_{x_0^T}, h^{P/Q_1}_x and H^{P/Q_2}_{y_0^T | x_0^T}, h^{P/Q_2}_{y|x}.
Now consider entropy H^{P/Q_2}_{y_t^{t+\tau} \mid y_0^t}. It can be represented in the form

$$H^{P/Q_2}_{y_t^{t+\tau} \mid y_0^t} = E\left[\ln \frac{P[dy_t^{t+\tau} \mid y_0^t]}{Q_2[dy_t^{t+\tau} \mid y(t)]}\right]. \qquad (5.11.10)$$
Then we have

$$\frac{P[dy_t^{t+\tau} \mid y_0^t]}{Q_2[dy_t^{t+\tau} \mid y(t)]} = E\left[\frac{P[dy_t^{t+\tau} \mid x(t), y_0^t]}{Q_2[dy_t^{t+\tau} \mid y(t)]} \,\middle|\, y_0^t\right] \qquad (5.11.11)$$
or, equivalently,

$$H^{P/Q_2}_{y_t^{t+\tau} \mid y_0^t} = E\int\left[\int P(dy_t^{t+\tau} \mid \xi)\, W(d\xi)\right] \ln \frac{\int P(dy_t^{t+\tau} \mid \xi)\, W(d\xi)}{Q_2(dy_t^{t+\tau} \mid y)}, \qquad (5.11.12a)$$
where averaging E is taken only over random variables W (·). Note that these vari-
ables W (·) constitute the secondary a posteriori W -process, which is a Markov pro-
cess with known transition probabilities.
Formulae (5.11.12), (5.11.12a) appear to be a generalization of formulae (5.3.15), (5.3.16) of the discrete version. They are valid for all τ, but it is more convenient to use them for small τ if we want to determine h^{P/Q_2}_y. The particular examples considered below confirm this statement.
2. Now let {x(t)} be a Markov process with a discrete set of states, similar to the process considered in Section 5.9. It is characterized by the differential transition matrix π_t(x, x′) (introduced in Section 5.9) defining the transition probabilities P(ξ(t + Δ) | ξ(t)). Selecting a Poisson measure for transition points as Q₁, we obtain density h^{P/Q_1}_x(t) of entropy H^{P/Q_1}_{x_0^T}:

$$h^{P/Q_1}_x(t) = \sum_{x_t} P(x(t))\left\{1 + \sum_{x' \neq x_t}\left[\ln \pi_t(x(t), x') - 1\right]\pi_t(x(t), x')\right\}. \qquad (5.11.13)$$
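Formula (5.11.13) can be illustrated numerically. The sketch below (not part of the original text; plain Python) evaluates it for a hypothetical two-state process with transition rates μ (for 1 → 2) and ν (for 2 → 1) in its stationary regime, and compares the result with the closed form 1 + [μν/(μ+ν)] ln(μν/e²) that appears below as (5.11.19):

```python
import math

def entropy_density_discrete(P, pi):
    """Formula (5.11.13): h = sum_x P(x) [1 + sum_{x'!=x} (ln pi(x,x') - 1) pi(x,x')]."""
    h = 0.0
    for x, Px in enumerate(P):
        inner = 1.0
        for xp, rate in enumerate(pi[x]):
            if xp != x:
                inner += (math.log(rate) - 1.0) * rate
        h += Px * inner
    return h

mu, nu = 0.7, 1.3                      # illustrative transition rates 1->2 and 2->1
P = [nu / (mu + nu), mu / (mu + nu)]   # stationary probabilities, cf. (5.9.7)
pi = [[-mu, mu], [nu, -nu]]            # differential transition matrix
h = entropy_density_discrete(P, pi)
closed = 1.0 + mu * nu / (mu + nu) * math.log(mu * nu / math.e**2)
```

The two values agree to machine precision, which is a consistency check between (5.11.13) and the two-state closed form.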
Process {y(t)} = {y₁(t), ..., y_l(t)} is constructed in the following way. Given are the vector of drifts a_ρ(x(t), y(t), t) (dependent not only on y(t) but also on x(t)) and the matrix of local variances b_{ρσ}(y(t), t); this matrix is assumed to be non-singular and independent of x(t).
Then process y(t) is the diffusion process considered in Section 5.10 for a fixed realization of {x(t)}. Applying the results derived therein, we can find the density h^{P/Q_2}_{y}(t | x_0^T) of the entropy H^{P/Q_2}_{y_0^T | x_0^T}. As Q₂ we choose the measure of a diffusion process with zero drift and the same matrix of local variances b_{ρσ}(y, t). According to (5.10.12) we get

$$h^{P/Q_2}_{y}(t \mid x_0^T) = \frac{1}{2}\sum_{\rho,\sigma} E\left[a_\rho(x(t), y(t), t)\, b^{-1}_{\rho\sigma}(y(t), t)\, a_\sigma(x(t), y(t), t) \,\middle|\, x_0^T\right]. \qquad (5.11.14)$$
In order to find density h^{P/Q_2}_{y|x} of entropy H^{P/Q_2}_{y_0^T | x_0^T} it remains to carry out extra averaging with respect to x_0^T in (5.11.14):

$$h^{P/Q_2}_{y|x}(t) = \frac{1}{2}\sum_{\rho,\sigma} E\left[a_\rho(x(t), y(t), t)\, b^{-1}_{\rho\sigma}(y(t), t)\, a_\sigma(x(t), y(t), t)\right]. \qquad (5.11.15)$$
Thus, due to additivity (5.11.2), (5.11.9), we have found entropy density h^{P/Q}_{xy} of the joint (combined) Markov process {ξ(t)} = {x(t), y(t)}: we should substitute expressions (5.11.13), (5.11.15) into (5.11.9). According to (5.11.1), the other entropy densities h^{P/Q_2}_y and h^{P/Q_1}_{x|y} are connected by the analogous relationship (5.11.8). Hence, in order to complete the computation of all densities, it only remains to compute one of them.
The above-described process {y(t)} (without a fixed realization of {x(t)}) is a
non-Markov process. We determine the corresponding entropy by simply employing
formula (5.11.12).
For small τ it follows from (5.10.11) that

$$\frac{P[dy_t^{t+\tau} \mid x(t), y(t)]}{Q_2[dy_t^{t+\tau} \mid y(t)]} = \exp\left\{\sum_{\rho,\sigma} a_\rho\, b^{-1}_{\rho\sigma}\left[y_\sigma(t+\tau) - y_\sigma(t) - \frac{\tau}{2}\, a_\sigma\right]\right\} + o(\tau)$$
$$= 1 + \sum_{\rho,\sigma} a_\rho\, b^{-1}_{\rho\sigma}\left[y_\sigma(t+\tau) - y_\sigma(t) - \frac{\tau}{2}\, a_\sigma\right] + \frac{1}{2}\sum_{\rho,\sigma,\pi,\nu} a_\rho\, b^{-1}_{\rho\sigma}\left[y_\sigma(t+\tau) - y_\sigma(t)\right]\left[y_\pi(t+\tau) - y_\pi(t)\right] b^{-1}_{\pi\nu}\, a_\nu + o(\tau).$$
Substituting the latter expression into (5.11.11) and denoting $\sum_{x(t)} \ldots W_t(x_t) = E_{\mathrm{ps}}[\ldots]$ for brevity, we obtain

$$\frac{P[dy_t^{t+\tau} \mid y_0^t]}{Q_2[dy_t^{t+\tau} \mid y(t)]} = 1 + \sum_{\rho,\sigma} E_{\mathrm{ps}}[a_\rho]\, b^{-1}_{\rho\sigma}\left[y_\sigma(t+\tau) - y_\sigma(t)\right]$$
$$+ \frac{1}{2}\sum_{\rho,\sigma,\pi,\nu}\left(b^{-1}_{\rho\sigma}\left[y_\sigma(t+\tau) - y_\sigma(t)\right]\left[y_\pi(t+\tau) - y_\pi(t)\right] b^{-1}_{\pi\nu} - \tau\, b^{-1}_{\rho\nu}\right) E_{\mathrm{ps}}[a_\nu a_\rho] + o(\tau).$$
166 5 Computation of entropy for special cases. Entropy of stochastic processes
Due to (5.11.10), (5.11.12) we should take the logarithm of the last expression. When taking the indicated logarithm we need to keep only the following terms:

$$\ln \frac{P[dy_t^{t+\tau} \mid y_0^t]}{Q_2[dy_t^{t+\tau} \mid y(t)]} = \sum_{\rho,\sigma} E_{\mathrm{ps}}[a_\rho]\, b^{-1}_{\rho\sigma}\left[y_\sigma(t+\tau) - y_\sigma(t)\right]$$
$$+ \frac{1}{2}\sum_{\rho,\sigma,\pi,\nu}\left(b^{-1}_{\rho\sigma}\left[y_\sigma(t+\tau) - y_\sigma(t)\right]\left[y_\pi(t+\tau) - y_\pi(t)\right] b^{-1}_{\pi\nu} - \tau\, b^{-1}_{\rho\nu}\right) E_{\mathrm{ps}}[a_\nu a_\rho]$$
$$- \frac{1}{2}\left(\sum_{\rho,\sigma} E_{\mathrm{ps}}[a_\rho]\, b^{-1}_{\rho\sigma}\left[y_\sigma(t+\tau) - y_\sigma(t)\right]\right)^2 + o(\tau). \qquad (5.11.16)$$
The corresponding averaging will be carried out in several steps: first, over y_t^{t+τ} with fixed x(t), y_0^t; second, over x(t) (with weight W_t(x(t))) with fixed y_0^t; and, finally, over y_0^t or, equivalently, over W. At the first stage of averaging we need to account for the formulae

$$E\left[y_\sigma(t+\tau) - y_\sigma(t) \mid x(t), y(t)\right] = a_\sigma \tau + o(\tau), \qquad E\left[\left(y_\sigma(t+\tau) - y_\sigma(t)\right)\left(y_\pi(t+\tau) - y_\pi(t)\right) \mid x(t), y(t)\right] = b_{\sigma\pi}\tau + o(\tau),$$

so that
$$\frac{1}{\tau}\, E\left[\ln \frac{P[dy_t^{t+\tau} \mid y_0^t]}{Q_2[dy_t^{t+\tau} \mid y(t)]} \,\middle|\, x(t), y_0^t\right] = \sum_{\rho,\sigma} E_{\mathrm{ps}}[a_\rho]\, b^{-1}_{\rho\sigma}\, a_\sigma - \frac{1}{2}\sum_{\rho,\sigma} E_{\mathrm{ps}}[a_\rho]\, b^{-1}_{\rho\sigma}\, E_{\mathrm{ps}}[a_\sigma] + o(1).$$
Further averaging results in

$$\frac{1}{\tau}\, H^{P/Q_2}_{y_t^{t+\tau} \mid y_0^t} = \frac{1}{2}\, E\left[\sum_{\rho,\sigma} E_{\mathrm{ps}}[a_\rho]\, b^{-1}_{\rho\sigma}\, E_{\mathrm{ps}}[a_\sigma]\right] + o(1).$$
Next, passing to the limit τ → 0, we find that

$$h^{P/Q_2}_y(t) = \frac{1}{2}\, E\left[\sum_{\rho,\sigma} E_{\mathrm{ps}}[a_\rho]\, b^{-1}_{\rho\sigma}\, E_{\mathrm{ps}}[a_\sigma]\right]$$
$$= \frac{1}{2}\int P(dW_t, dy(t)) \sum_{x,x'} \sum_{\rho,\sigma} W_t(x)\, a_\rho(x, y(t), t)\, b^{-1}_{\rho\sigma}(y(t), t)\, a_\sigma(x', y(t), t)\, W_t(x'). \qquad (5.11.17)$$
The following fact helps to find the distribution P(dW_t, dy(t)) involved in the latter formula: the variables {W_t, y_t} constitute a Markov process, called a secondary a posteriori W-process in Stratonovich's monograph [48]. The probability density function p_t(W, y) of process {W_t, y(t)} satisfies a particular Fokker–Planck equation that was also derived in that monograph. It takes the form
$$\frac{\partial p_t}{\partial t} = -\sum_{x,x'} \frac{\partial}{\partial W(x')}\left[\pi_{xx'}\, W(x)\, p_t\right] - \sum_\rho \frac{\partial}{\partial y_\rho}\left[E_{\mathrm{ps}}[a_\rho]\, p_t\right]$$
$$+ \frac{1}{2}\sum_{x,x'}\sum_{\rho,\sigma} \frac{\partial^2}{\partial W(x)\,\partial W(x')}\left[W(x)\left(a_\rho(x) - E_{\mathrm{ps}}[a_\rho]\right) b^{-1}_{\rho\sigma}\left(a_\sigma(x') - E_{\mathrm{ps}}[a_\sigma]\right) W(x')\, p_t\right]$$
$$+ \sum_{x,\rho} \frac{\partial^2}{\partial W(x)\,\partial y_\rho}\left[W(x)\left(a_\rho(x) - E_{\mathrm{ps}}[a_\rho]\right) p_t\right] + \frac{1}{2}\sum_{\rho,\sigma} \frac{\partial^2}{\partial y_\rho\, \partial y_\sigma}\left[b_{\rho\sigma}\, p_t\right]. \qquad (5.11.18)$$
Example 5.11. Let {x(t)} be a process with two states and transition matrix (5.9.6). In Section 5.9 we found the corresponding entropy density

$$h^{P/Q_1}_x = 1 + \frac{\mu\nu}{\mu+\nu} \ln \frac{\mu\nu}{e^2}. \qquad (5.11.19)$$

In the given case formula (5.11.15) yields

$$h^{P/Q_2}_{y|x} = \frac{1}{2b}\left[P(x=1)\, a^2(1) + P(x=2)\, a^2(2)\right]$$

or, equivalently,

$$h^{P/Q_2}_{y|x} = \frac{\nu\, a^2(1) + \mu\, a^2(2)}{2b(\mu+\nu)} \qquad (5.11.20)$$

if we take into account (5.9.7).
The sum of expressions (5.11.19), (5.11.20) yields entropy rate h^{P/Q}_{xy} of the combined process.
Now we proceed to the computation of entropy rate h^{P/Q_2}_y for the example in question. If we introduce the variable z_t = W_t(1) − W_t(2), then the mean drift E_ps[a] can be represented as

$$E_{\mathrm{ps}}[a] = \frac{1}{2}\left[a(1) + a(2)\right] + \frac{1}{2}\left[a(1) - a(2)\right] z_t,$$

so that (5.11.17) gives

$$h^{P/Q_2}_y = \frac{1}{8b}\left[a(1) + a(2)\right]^2 + \frac{1}{4b}\, E\left[\left[a^2(1) - a^2(2)\right] z + \frac{1}{2}\left[a(1) - a(2)\right]^2 z^2\right]. \qquad (5.11.21)$$
It is easy to see that averaging of the a posteriori probabilities W_t(x) = P[x_t = x | y_0^t] yields the a priori probabilities P[x_t = x], which have form (5.9.7) in a stationary case. Therefore,

$$E[z] = E[W_t(1) - W_t(2)] = P[x=1] - P[x=2] = \frac{\nu - \mu}{\mu + \nu},$$
and formula (5.11.21) can be reduced to

$$h^{P/Q_2}_y = \frac{\nu\, a^2(1) + \mu\, a^2(2)}{2b(\mu+\nu)} - \frac{\left[a(1) - a(2)\right]^2}{8b}\int_{-1}^{1} (1 - z^2)\, p_{\mathrm{st}}(z)\, dz. \qquad (5.11.22)$$
In the given case process {z_t} appears to be Markov itself. The corresponding Fokker–Planck equation

$$\dot p(z) = -\frac{\partial}{\partial z}\left\{\left[\nu - \mu - (\mu+\nu)\, z\right] p(z)\right\} + \frac{1}{2b}\, \frac{\partial^2}{\partial z^2}\left[(1 - z^2)^2\, p(z)\right]$$

follows from (5.11.18).
Equating derivative ṗ_st(z) to zero and integrating the resulting equation, we obtain the stationary probability density function

$$p_{\mathrm{st}}(z) = \frac{\mathrm{const}}{(1-z^2)^2}\, \exp\left\{2b \int_0^z \frac{\nu - \mu - (\mu+\nu)\, x}{(1-x^2)^2}\, dx\right\}. \qquad (5.11.23)$$
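An internal consistency check of (5.11.23), not present in the original text, can be carried out by quadrature. The sketch below (Python with NumPy, illustrative parameter values) builds the density numerically; it must reproduce the a priori mean E[z] = (ν − μ)/(μ + ν) noted above, and it yields the integral entering (5.11.22):

```python
import numpy as np

def trapz(y, x):
    """Plain trapezoidal rule (avoids dependence on a particular NumPy version)."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x) / 2))

b, mu, nu = 1.0, 0.8, 1.5            # illustrative values
z = np.linspace(-0.9999, 0.9999, 200001)
f = 2 * b * (nu - mu - (mu + nu) * z) / (1 - z**2) ** 2   # integrand of the exponent

# cumulative trapezoid of f, shifted so the integral is taken from z = 0
cum = np.concatenate(([0.0], np.cumsum((f[1:] + f[:-1]) * np.diff(z) / 2)))
cum -= cum[len(z) // 2]

p = np.exp(cum - cum.max()) / (1 - z**2) ** 2   # unnormalized (5.11.23)
p /= trapz(p, z)                                # normalize the density

mean_z = trapz(z * p, z)                 # should equal (nu - mu)/(mu + nu)
integral = trapz((1 - z**2) * p, z)      # the integral entering (5.11.22)
```

The density vanishes rapidly at z = ±1, so truncating the grid slightly inside the endpoints introduces no appreciable error.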
The integral with this probability density function involved in expression (5.11.22) has been computed in [48] (translated to English [52]) and turned out to be equal to

$$\int_{-1}^{1} (1 - z^2)\, p_{\mathrm{st}}(z)\, dz = 4 K_q\!\left(b\sqrt{\mu\nu}\right)\left[2 K_q\!\left(b\sqrt{\mu\nu}\right) + \sqrt{\frac{\nu}{\mu}}\, K_{q+1}\!\left(b\sqrt{\mu\nu}\right) + \sqrt{\frac{\mu}{\nu}}\, K_{q-1}\!\left(b\sqrt{\mu\nu}\right)\right]^{-1} \qquad (5.11.24)$$

where q = 2b(ν − μ) and K_q(z) is McDonald's function (see Ryzhik and Gradstein [36]; the corresponding English translation is [37]).
The substitution of (5.11.24) into (5.11.22) solves the problem of determining entropy rate h^{P/Q_2}_y of the stationary non-Markov process y. Combining (5.11.19), (5.11.20), (5.11.22), it is easy to find entropy h^{P/Q_1}_{x|y} = h^{P/Q_1}_x + h^{P/Q_2}_{y|x} − h^{P/Q_2}_y of the conditional Markov process x(t).
3. The aforementioned derivation of formula (5.11.17) is also applicable in other cases, for instance, when process {x(t)} is a diffusion Markov process or constitutes a portion of the components of a combined diffusion Markov process {ξ(t)} = {x(t), y(t)}. We consider the last case in more detail. Suppose the combined Markov process has components x = (x₁, ..., x_m) and y = (y_{m+1}, ..., y_{m+l}). The a posteriori probability density ω_t(x) then satisfies the following equation derived in the indicated monograph [48] (translated to English [52]):

$$\frac{\partial \omega_t(x)}{\partial t} = \frac{1}{2}\sum_{\alpha,\beta=1}^{m} \frac{\partial^2}{\partial x_\alpha\, \partial x_\beta}\left[b_{\alpha\beta}\, \omega_t\right] - \sum_{\alpha=1}^{m} \frac{\partial}{\partial x_\alpha}\left[a_\alpha\, \omega_t + \sum_{\sigma,\rho=m+1}^{m+l} b_{\alpha\rho}\, b^{-1}_{\rho\sigma}\, \omega_t\left(\frac{d^{*} y_\sigma}{dt} - E_{\mathrm{ps}}[a_\sigma]\right)\right]$$
$$+ \sum_{\rho,\sigma=m+1}^{m+l}\left(a_\rho - E_{\mathrm{ps}}[a_\rho]\right) b^{-1}_{\rho\sigma}\, \omega_t\left(\frac{d^{*} y_\sigma}{dt} - E_{\mathrm{ps}}[a_\sigma]\right),$$

where the asterisk denotes the symmetrized differential.
In the given case the entropies of process {x(t)} and of the combined process {x(t), y(t)} can be found by applying formula (5.9.4). Calculation of entropy H^{P/Q_2}_{y_0^T} and its density h^{P/Q_2}_y(t) requires a special approach, because process {y(t)} itself is not Markov.
Fundamentally, the respective calculation can be carried out in the same way as in clause 2. We use formula (5.11.12), selecting as measure Q₂ a Poisson measure (for the transition time moments of process y(t)) with a unit density. For small τ we have

$$\frac{P[dy_t^{t+\tau} \mid x(t), y(t)]}{Q_2[dy_t^{t+\tau} \mid y(t)]} =
\begin{cases}
e^{\tau}\left[1 + \pi_t(x(t), y(t), y(t))\,\tau\right] + O(\tau^2) & \text{for } y(t+\tau) = y(t),\\
e^{\tau}\,\pi_t(x(t), y(t), y(t+\tau))\,\tau + O(\tau^2) & \text{for } y(t+\tau) \neq y(t).
\end{cases} \qquad (5.11.27)$$
τ
First of all, here we implement averaging E [. . . | x(t), yt0 (t)] over yt+
t ; then, av-
eraging E [. . . | y0 ] over x(t) with weight Wt ; and then, finally, averaging over other
t
variables.
Next, (5.11.27) entails the following formula for the entropy density:

$$h^{P/Q_2}_y(t) = E\left\{1 + \sum_{y' \neq y(t)} E_{\mathrm{ps}}\left[\pi(x(t), y(t), y')\right]\left(\ln E_{\mathrm{ps}}\left[\pi(x(t), y(t), y')\right] - 1\right)\right\}. \qquad (5.11.28)$$
The only difference between the latter formula and (5.9.4) (i.e. formula (5.9.4) taken for ξ = y) is that the π(y, y′) are replaced with the a posteriori means

$$E_{\mathrm{ps}}\left[\pi(x(t), y(t), y')\right] = \sum_x W_t(x)\, \pi(x, y(t), y') \qquad (5.11.29)$$

and that averaging with respect to {W_t, y(t)} with weight P[dW_t dy(t)] is considered instead of averaging with respect to y(t) with weight P(dy). Process {W_t, y(t)} appears to be a secondary a posteriori Markov process, and thereby it is not difficult to find its transition probabilities, which define, in particular, the stationary distribution P_st[dW_t dy(t)].
It is interesting to compare formula (5.11.28) with the conditional entropy density

$$h^{P/Q_2}_{y|x}(t) = E\left\{1 + \sum_{y' \neq y(t)} \pi(x(t), y(t), y')\left[\ln \pi(x(t), y(t), y') - 1\right]\right\}. \qquad (5.11.30)$$

We have written down the last expression similarly to (5.11.29), first supposing that x(t) is fixed and then averaging with respect to it.
The provided method of computing entropy density h_y can be extended to the case when process {x(t)} is not Markov itself but the combined process {x(t), y(t)} is a Markov process (with discrete states). Suppose it is described by the differential transition matrix π_t(x, y, x′, y′). In this case the form of the resultant formula (5.11.28) remains unchanged. Density h^{P/Q}_x can be determined analogously.
It is seen from the aforesaid that the described method of computing entropy for a portion of the components of a Markov process has a wide range of applications. The most difficult stage is finding the distribution P(dW_t, dy(t)) of the secondary a posteriori W-process.
Chapter 6
Information in the presence of noise. Shannon’s
amount of information
$$H_x = H_\eta. $$

$$H_x = H_\eta + H_{\zeta|\eta}, \qquad H_{\zeta|\eta} = \sum_\eta P(\eta)\, H_\zeta(\mid \eta), \qquad H_\zeta(\mid \eta) = -\sum_\zeta P(\zeta \mid \eta) \ln P(\zeta \mid \eta).$$
$$P(y \in G_l \mid x \in E_k) = 0 \quad \text{if } l \neq k. \qquad (6.1.4)$$

(b) Transitions from all points of region E_k occur with equal probabilities and lead to y ∈ G_k:

$$P(y \mid x \in E_k) = P(y \mid k). \qquad (6.1.5)$$
It is easy to see that it follows from (a) that

$$H_{l|k} = 0 \qquad (6.1.6)$$

or, equivalently,

$$H_k = H_{kl} \qquad (6.1.7)$$

(k is the index of region E_k, l is the index of region G_l).
Theorem 6.2. From an information theoretic point of view, simple noise is equiva-
lent to a non-random degenerate transformation k = k(x) (where k(x) is the index
of subset Ek containing x), i.e. Hx|y = Hx|k .
$$P(x \mid y) = \frac{P(x)\, P(y \mid x)}{\sum_{x'} P(x')\, P(y \mid x')} = \frac{P(x)\, P(y \mid l)}{\sum_{x' \in E_l} P(x')\, P(y \mid l)}, \qquad \text{where } y \in G_l$$

(l = k), and reducing the resulting fraction by the common factor P(y | l), we obtain

$$P(x \mid y) = \frac{P(x)}{P(E_l)} \ \text{ if } x \in E_l, \qquad P(x \mid y) = 0 \ \text{ if } x \notin E_l,$$

i.e.

$$P(x \mid y) = p(x \mid l) \qquad (y \in G_l). \qquad (6.1.9)$$
When such an equality holds true, we say that variable l is a sufficient variable, or a sufficient statistic, replacing y. Hence, we have obtained that index l of a set serves as a sufficient statistic in the given case. It follows from equality (6.1.9) that H_x(| y) = H_x(| l) and (after averaging this equality) also H_{x|y} = H_{x|l}.
One can see from the definition of simple noise and Theorem 6.2 that the notion of simple noise is reversible: the noise corresponding to the inverse transition with probabilities P(x | y) is simple if the noise of the original transition with probabilities P(y | x) is simple.
Indeed, substituting (6.1.4) into (6.1.3), it is easy to verify that the only probabilities P(x, y) different from zero are those for which x and y hit regions E_k, G_k with the same index k. Probabilities P(x | y) are null if the indices k and l of the regions E_k ∋ x, G_l ∋ y are not equal. Therefore, property (6.1.4) is reversible. In addition to (6.1.6), (6.1.7) the relations

$$H_{k|l} = 0, \qquad H_l = H_{kl} = H_k \qquad (6.1.11)$$

are valid. Furthermore, equality (6.1.9) is evidently an inverse of equality (6.1.5). Thus, the indicated reversibility has been proven: besides the degenerate transformation k = k(x) we can also consider a non-randomized degenerate transformation l = l(y), where l is the index of the region G_l that point y belongs to.
6.1 Information losses under degenerate transformations and simple noise 177
Theorem 6.2 entails that noise destroys information, because such destruction takes place for the degenerate transformation k = k(x) according to Theorem 6.1.
3. In order to accurately convey information in the case of either a degenerate transformation or simple noise, we need to connect information not with variable x (which is distorted under the transformation) but with variable η = k (which remains constant, because l = k). Consequently, the amount of transmitted information will be equal to

$$I = H_k. \qquad (6.1.12)$$
Next we reduce this relation to another form by employing

$$H_{kx} = H_x + H_{k|x} = H_k + H_{x|k}, \qquad (6.1.13)$$

which follows from the definition of conditional entropies (Section 1.3). For fixed x the number k(x) of the region E_k ∋ x is completely determined. Hence,

$$H_{k|x} = 0 \qquad (6.1.14)$$

and, consequently, I = H_k = H_x − H_{x|k}. Taking into account the sufficiency of statistic l = k, i.e. H_{x|k} = H_{x|y}, we obtain

$$I = H_x - H_{x|y}. \qquad (6.1.15)$$
Furthermore, in analogy with (6.1.14) we have H_{l|y} = 0 or H_{k|y} = 0 (which is the same, because l = k with probability 1). Therefore,

$$I = H_k = H_k - H_{k|y}. \qquad (6.1.16)$$
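The chain of identities (6.1.12)–(6.1.16) can be checked on a small simple-noise channel. The sketch below is not part of the original text (plain Python; the regions and probabilities are invented for illustration): E₁ = {0, 1} and E₂ = {2, 3} map only into the disjoint groups G₁ = {0, 1}, G₂ = {2, 3}, with transition probabilities depending only on the region index, and the information I = H_x − H_{x|y} coincides with H_k:

```python
import math

def H(p):
    """Entropy (nats) of a distribution given as a list of probabilities."""
    return -sum(q * math.log(q) for q in p if q > 0)

# x in {0,1,2,3}; regions E1 = {0,1}, E2 = {2,3} with indices k = 0, 1
Px = [0.1, 0.3, 0.2, 0.4]
region = [0, 0, 1, 1]
# simple noise: P(y | x) depends only on the region of x and stays inside G_k
Py_given_k = [[0.5, 0.5, 0.0, 0.0],    # from E1 transitions land in G1 = {0,1}
              [0.0, 0.0, 0.25, 0.75]]  # from E2 transitions land in G2 = {2,3}

Pxy = [[Px[x] * Py_given_k[region[x]][y] for y in range(4)] for x in range(4)]
Py = [sum(Pxy[x][y] for x in range(4)) for y in range(4)]

Hx = H(Px)
Hx_given_y = H([Pxy[x][y] for x in range(4) for y in range(4)]) - H(Py)
I = Hx - Hx_given_y                     # formula (6.1.15)

Pk = [Px[0] + Px[1], Px[2] + Px[3]]
Hk = H(Pk)                              # formula (6.1.12): I = H_k
```

The two quantities agree to machine precision, as the sufficiency of the region index requires.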
The stated results refer to the case of simple noise; however, they can be implicitly (in the asymptotic sense) extended to the case of arbitrary noise, as can be seen from the following (Section 7.6), where approximate relations (7.6.19) are derived instead of the exact relations (6.1.17). In order to do so we need to consider not single random variables x and y, but sequences x₁, ..., x_n and y₁, ..., y_n of such variables as n → ∞. Just as the case of arbitrary probabilities is asymptotically reduced to the case of equally probable possibilities (refer to Sections 1.4 and 1.5), the case of arbitrary noise is asymptotically (as n → ∞) reduced to the case of simple noise. That is why, if we connect information with a corresponding (approximate) sufficient statistic k(x) = 1, ..., M (whose number of values is bounded by ln M ⩽ H_k − H_{k|l}), then in the limit n → ∞ we can avoid the errors invoked by distortions. This fact is related to Shannon's famous theorem and is one of its possible interpretations (see Chapter 7).
The mutual information I_xy = H_x − H_{x|y} (6.2.1) reflects the average amount of uncertainty that has disappeared due to the reception of information. Such an interpretation of the amount of information was introduced earlier in Section 1.1 (equation (1.1.2)).
Applying the regular relationship H_{x|y} = H_{xy} − H_y (refer to (1.3.4)), we can represent formula (6.2.1) in the form

$$I_{xy} = H_x + H_y - H_{xy}. \qquad (6.2.2)$$

The symmetry of this value is seen from the last formula: it remains unchanged if x and y switch roles. Hence, the same amount of uncertainty will disappear on average if x is treated as the observed quantity and y is treated as an independent variable:

$$I_{xy} = H_y - H_{y|x}. \qquad (6.2.3)$$
The relations among the values H_x, H_y, H_{xy}, H_{x|y}, H_{y|x}, I_{xy} are depicted graphically in Figure 6.1. Since conditional entropy does not exceed regular entropy (Theorem 1.6), information I_xy is non-negative.
6.2 Mutual information for discrete random variables 179
Fig. 6.1 Relationships between the information characteristics of two random variables
$$I(x, y) = H(x) + H(y) - H(x, y) = \ln \frac{P(x, y)}{P(x)\, P(y)} = \ln \frac{P(x \mid y)}{P(x)}. \qquad (6.2.5)$$

Further, (6.2.5) entails the following formula for the joint distribution: P(x, y) = P(x) P(y) e^{I(x,y)}. Equivalently, the random information can be written as

$$I(x, y) = -\ln \frac{P(x)}{P(x \mid y)}. \qquad (6.2.9)$$
Employing the inequality

$$\ln \frac{P(x)}{P(x \mid y)} \leqslant \frac{P(x)}{P(x \mid y)} - 1$$

and averaging over x with weight P(x | y), we obtain Σ_x P(x | y)[−I(x, y)] ⩽ Σ_x P(x) − 1 = 0, i.e.

$$E[I(x, y) \mid y] \geqslant 0.$$

Switching x and y in the previous argument, we obtain the second desired inequality: E[I(x, y) | x] ⩾ 0.
It is interesting to note that the reverse inequalities

$$\sum_x I(x, y)\, P(x) \leqslant 0, \qquad \sum_y I(x, y)\, P(y) \leqslant 0$$

take place if we implement averaging with weight P(x) (or P(y)) instead of P(x | y).
Indeed, rewriting the right-hand side of (6.2.9) or (6.2.5) in the form ln[P(y | x)/P(y)] and employing the inequality

$$\ln \frac{P(y \mid x)}{P(y)} \leqslant \frac{P(y \mid x)}{P(y)} - 1,$$

we obtain

$$I(x, y) \leqslant \frac{P(y \mid x)}{P(y)} - 1$$

instead of (6.2.10). Further, we average the latter inequality with weight P(x) and obtain

$$\sum_x I(x, y)\, P(x) \leqslant \frac{1}{P(y)} \sum_x P(y \mid x)\, P(x) - 1 = 0,$$

because

$$\sum_x P(y \mid x)\, P(x) = P(y).$$
Hence, random mutual information I(x, y) assumes plenty of negative values. This is one of the reasons why we treat the corresponding mean value I_xy, and not I(x, y), as the amount of information.
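As a numerical illustration of the preceding argument (not in the original; plain Python, with an invented two-by-two joint distribution), the random information I(x, y) takes negative values, while its mean I_xy and the conditional means E[I | y] remain non-negative:

```python
import math

# a small joint distribution with correlated binary variables (illustrative)
Pxy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
Px = {x: sum(p for (a, _), p in Pxy.items() if a == x) for x in (0, 1)}
Py = {y: sum(p for (_, b), p in Pxy.items() if b == y) for y in (0, 1)}

# random mutual information I(x, y) = ln [P(x, y) / (P(x) P(y))]
I = {(x, y): math.log(Pxy[(x, y)] / (Px[x] * Py[y])) for (x, y) in Pxy}

neg = [v for v in I.values() if v < 0]          # negative values do occur
mean_I = sum(Pxy[k] * I[k] for k in Pxy)        # but the mean I_xy is >= 0
cond = {y: sum(Pxy[(x, y)] / Py[y] * I[(x, y)] for x in (0, 1)) for y in (0, 1)}
# each conditional mean E[I | y] is >= 0, in agreement with the inequality above
```

Here the off-diagonal outcomes give I(x, y) = ln 0.4 &lt; 0, yet every average with weight P(x | y) comes out non-negative.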
$$I(x, y \mid z) = \ln \frac{P(x \mid yz)}{P(x \mid z)} \qquad (6.3.1)$$

or, equivalently,

$$I(x, y \mid z) = H(x \mid z) - H(x \mid y, z).$$

Now we see that conditional or regular mutual information can be expressed in terms of the corresponding conditional or regular entropies, respectively.
Since entropy possesses the property of hierarchical additivity (Section 1.3), mutual information possesses an analogous property as well. Let x from formula (6.2.1) consist of several components, x = (x₁, ..., x_n). Then the formula of hierarchical additivity (1.3.4) can be applied to the entropies H_x, H_{x|y}. According to (6.2.1) we take the difference H_x − H_{x|y}, group terms in pairs and thereby obtain

$$I_{(x_1 \ldots x_n) y} = \left(H_{x_1} - H_{x_1|y}\right) + \left(H_{x_2|x_1} - H_{x_2|x_1 y}\right) + \cdots + \left(H_{x_n|x_1 \ldots x_{n-1}} - H_{x_n|x_1 \ldots x_{n-1} y}\right).$$

But due to (6.3.4) every difference H_{x_k|x_1...x_{k−1}} − H_{x_k|x_1...x_{k−1}y} is nothing else but the conditional mutual information I_{x_k y|x_1...x_{k−1}}. Therefore,

$$I_{(x_1 \ldots x_n) y} = I_{x_1 y} + I_{x_2 y|x_1} + I_{x_3 y|x_1 x_2} + \cdots + I_{x_n y|x_1 \ldots x_{n-1}}. \qquad (6.3.5)$$
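Hierarchical additivity (6.3.5) for n = 2 can be verified numerically. The sketch below is not from the original text (plain Python, with a randomly generated joint distribution of three binary variables); both sides are computed through the entropy identity for conditional mutual information:

```python
import itertools
import math
import random

random.seed(1)
# a random joint distribution of three binary variables (x1, x2, y)
P = {k: random.random() for k in itertools.product((0, 1), repeat=3)}
Z = sum(P.values())
P = {k: v / Z for k, v in P.items()}

def H(vars_):
    """Joint entropy (nats) of the coordinates listed in vars_ (indices 0, 1, 2)."""
    m = {}
    for k, p in P.items():
        key = tuple(k[i] for i in vars_)
        m[key] = m.get(key, 0.0) + p
    return -sum(p * math.log(p) for p in m.values() if p > 0)

def I(a, b, cond=()):
    """Conditional mutual information I_{ab|cond} = H(a,c) + H(b,c) - H(a,b,c) - H(c)."""
    return H(a + cond) + H(b + cond) - H(a + b + cond) - H(cond)

# indices: x1 -> 0, x2 -> 1, y -> 2
lhs = I((0, 1), (2,))                      # I_{(x1 x2) y}
rhs = I((0,), (2,)) + I((1,), (2,), (0,))  # I_{x1 y} + I_{x2 y | x1}
```

The decomposition holds to machine precision for any joint distribution, which is the point of the identity.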
It is not difficult to understand that the same formula is valid for conditional mutual information as well:

$$I_{(x_1 \ldots x_n) y|z} = \sum_{k=1}^{n} I_{x_k y|x_1 \ldots x_{k-1} z}. \qquad (6.3.6)$$

Now we assume that the second random variable is also compound: y = (y₁, ..., y_r). Then, applying formula (6.3.6) to every mutual information I_{x_k y|x_1...x_{k−1}}, we will have

$$I_{x_k (y_1 \ldots y_r)|x_1 \ldots x_{k-1}} = \sum_{l=1}^{r} I_{x_k y_l \mid x_1 \ldots x_{k-1} y_1 \ldots y_{l-1}}.$$

Here, I_{x₂y₂|x₁y₁} = I_{x₂y₂} if P(x₂, y₂ | x₁, y₁) = P(x₂, y₂), i.e. if the pair x₂, y₂ is independent of x₁, y₁.
Just as the property of hierarchical additivity (1.3.4) is valid not only for average entropies but also for random entropies (1.3.6), relations analogous to (6.3.5)–(6.3.8) can be written for random mutual information. For instance,

$$I(x_1, \ldots, x_n;\; y_1, \ldots, y_r) = \sum_{k=1}^{n} \sum_{l=1}^{r} I(x_k, y_l \mid x_1, \ldots, x_{k-1}, y_1, \ldots, y_{l-1}).$$

The reasoning is quite parallel to the previous one if we use (1.3.6) instead of (1.3.4).
2. Conditional mutual information (6.3.3) is non-negative, which can be derived, for example, from formulae (6.3.4) by taking into account Theorems 1.6, 1.6a. This fact allows us to obtain various inequalities from the formulae of hierarchical additivity (6.3.5), (6.3.7): mutual information I_{(x₁...x_n)y} or I_{(x₁...x_n)(y₁...y_r)} on the left-hand side is at least the sum of any portion of the terms on the right-hand side. We present a simple example to illustrate this. Consider the mutual information I_{(x₁x₂)y} between the pair of random variables x₁, x₂ and variable y. Formula (6.3.5) yields I_{(x₁x₂)y} = I_{x₁y} + I_{x₂y|x₁}. Since I_{x₂y|x₁} ⩾ 0, the inequality

$$I_{(x_1 x_2) y} \geqslant I_{x_1 y} \qquad (6.3.9)$$

follows. The equality sign,

$$I_{(x_1 x_2) y} = I_{x_1 y}, \qquad (6.3.10)$$

takes place if and only if I_{x₂y|x₁} = 0, i.e. P(y | x₁, x₂) = P(y | x₁). The last equality is exactly the condition of a Markov connection of the triplet x₂, x₁, y.
We can also derive an analogous relationship from (6.3.7). Thus, we see that the mutual information of the given random variables is greater than or equal to the mutual information of a portion of the indicated variables. This is analogous to the inequality H_{x₁} ⩽ H_{x₁x₂} for entropy (valid because H_{x₂|x₁} ⩾ 0). Meanwhile, the relation H_{x|z} ⩽ H_x does not have a counterpart for mutual information: the inequality I_{xy|z} ⩽ I_{xy} is not valid in the general case.
Example 6.1. Let x, y, z be random variables with two feasible values each, specified by certain probabilities. Then

$$H_x = h(3/8), \qquad H_{x|y} = h(1/4),$$

so that

$$I_{xy} = h(3/8) - h(1/4). \qquad (6.3.14)$$

Finally, the following difference can be found from (6.3.13), (6.3.14):

$$I_{xy} - I_{xy|z} = h(3/8) - \tfrac{1}{2}\, h(1/4) - \tfrac{1}{2}\, h(1/2) = 0.183 \ \text{bits} > 0. \qquad (6.3.15)$$
Example 6.2. Now we assume that random variables x, y, z with two feasible values are described by the probabilities

$$P(z_1) = P(z_2) = \frac{1}{2},$$
$$\left\|P(y_i, x_j \mid z_1)\right\| = \begin{pmatrix} 3/8 & 1/8 \\ 1/8 & 3/8 \end{pmatrix}, \qquad \left\|P(y_i, x_j \mid z_2)\right\| = \begin{pmatrix} 1/3 & 1/6 \\ 1/6 & 1/3 \end{pmatrix}.$$

Then

$$H_x(\mid z_1) = H_x(\mid z_2) = H_x = h(1/2),$$
$$I_{xy}(\mid z_1) = h(1/2) - h(1/4), \qquad I_{xy}(\mid z_2) = h(1/2) - h(1/3),$$
$$I_{xy|z} = h(1/2) - \tfrac{1}{2}\, h(1/4) - \tfrac{1}{2}\, h(1/3).$$

Furthermore, since

$$\left\|P(y_i, x_j)\right\| = \begin{pmatrix} 17/48 & 7/48 \\ 7/48 & 17/48 \end{pmatrix}, \qquad \left\|P(x_i \mid y_j)\right\| = \begin{pmatrix} 17/24 & 7/24 \\ 7/24 & 17/24 \end{pmatrix},$$

we obtain that

$$I_{xy} = h(1/2) - h(7/24).$$

Therefore,

$$I_{xy} - I_{xy|z} = \tfrac{1}{2}\, h(1/4) + \tfrac{1}{2}\, h(1/3) - h(7/24) = -0.04 \ \text{bits} < 0. \qquad (6.3.16)$$
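The sketch below (not part of the original; plain Python) reproduces Example 6.2 from the closed forms just stated and confirms the sign of the difference I_xy − I_{xy|z}:

```python
import math

def h(p):
    """Binary entropy in bits."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# conditional informations given z, from the tables of Example 6.2
Ixy_given_z1 = h(1/2) - h(1/4)
Ixy_given_z2 = h(1/2) - h(1/3)
Ixy_z = 0.5 * Ixy_given_z1 + 0.5 * Ixy_given_z2   # I_{xy|z}

# unconditional case: P(y_i, x_j) has 17/48 on and 7/48 off the diagonal
Ixy = h(1/2) - h(7/24)

diff = Ixy - Ixy_z     # comes out negative: here I_{xy|z} exceeds I_{xy}
```

The computed difference is indeed negative, illustrating that conditioning may increase mutual information.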
The sign of the difference I_xy − I_{xy|z} in these examples has been influenced by the concavity of the function h(p) = −p ln p − (1 − p) ln(1 − p). Indeed, for a concave function we have

$$h(E[\xi]) - E[h(\xi)] \geqslant 0. \qquad (6.3.17)$$

Furthermore, the values 3/8 and 7/24 are nothing else but the means

$$\tfrac{3}{8} = \tfrac{1}{2}\cdot\tfrac{1}{4} + \tfrac{1}{2}\cdot\tfrac{1}{2}, \qquad \tfrac{7}{24} = \tfrac{1}{2}\cdot\tfrac{1}{4} + \tfrac{1}{2}\cdot\tfrac{1}{3}.$$
Taking into account that H(x, y, z) = −ln P(x, y, z), we obtain the following formula for the three-fold distribution law from (6.3.21):

$$P(x, y, z) = \exp\left\{-H(x) - H(y) - H(z) + I(x, y) + I(y, z) + I(z, x) - I(x, y, z)\right\}.$$

In Examples 6.1 and 6.2 the mean I_{xyz} equals 0.183 bits and −0.04 bits, respectively, according to (6.3.15), (6.3.16). Thus, non-negativity of the triple information is not necessary.
Mutual information of a larger number of random variables is constructed in analogy with formula (6.3.19). In the general case the n-fold mutual information is defined by the formula

$$I_{x_1 \ldots x_n} = \sum_{i=1}^{n} H_{x_i} - \sum H_{x_i x_j} + \sum H_{x_i x_j x_k} - \cdots - (-1)^n H_{x_1 \ldots x_n}, \qquad (6.3.22)$$

where the summations are taken over all possible distinct pairs (n(n − 1)/2! terms), triplets (n(n − 1)(n − 2)/3! terms) and other combinations of the indices 1, ..., n.
We can prove that mutual information (6.3.22) is represented, analogously to (6.3.18), as the difference between regular and conditional mutual information of a smaller multiplicity:

$$I_{x_1 \ldots x_{n+1}} = I_{x_1 \ldots x_n} - I_{x_1 \ldots x_n \mid x_{n+1}}. \qquad (6.3.23)$$

It is natural to use induction in order to prove the equivalence of (6.3.22) and (6.3.23). We subtract from the right-hand side of (6.3.22) the expression for the conditional mutual information

$$I_{x_1 \ldots x_n \mid x_{n+1}} = \sum H_{x_i \mid x_{n+1}} - \sum H_{x_i x_k \mid x_{n+1}} + \cdots - (-1)^n H_{x_1 \ldots x_n \mid x_{n+1}}$$
$$= \sum H_{x_i x_{n+1}} - n\, H_{x_{n+1}} - \sum H_{x_i x_k x_{n+1}} + \frac{n(n-1)}{2}\, H_{x_{n+1}} + \cdots - (-1)^n H_{x_1 \ldots x_n x_{n+1}} + (-1)^n H_{x_{n+1}}.$$
By analyzing the terms of the resulting expression and taking into account the equation

$$-n + \frac{n(n-1)}{2} - \cdots + (-1)^n = (1-1)^n - 1 = -1,$$

we ascertain that the latter expression is equivalent to the sum on the right-hand side of formula (6.3.22) with n replaced by n + 1 therein. Equality (6.3.22) is valid for n = 2 and n = 3; thereby it is valid for all larger values of n.
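The equivalence of (6.3.22) and (6.3.23) can be spot-checked for n = 3. The sketch below is not from the original text (plain Python, random distribution of three binary variables); it verifies that the n = 3 instance of (6.3.22) equals I_{x₁x₂} − I_{x₁x₂|x₃}:

```python
import itertools
import math
import random

random.seed(2)
P = {k: random.random() for k in itertools.product((0, 1), repeat=3)}
Z = sum(P.values())
P = {k: v / Z for k, v in P.items()}

def H(vars_):
    """Joint entropy (nats) of the coordinates listed in vars_ (empty tuple -> 0)."""
    m = {}
    for k, p in P.items():
        key = tuple(k[i] for i in vars_)
        m[key] = m.get(key, 0.0) + p
    return -sum(p * math.log(p) for p in m.values() if p > 0)

# n = 3 instance of (6.3.22): sum H_i - sum H_ij + H_123
I3 = (H((0,)) + H((1,)) + H((2,))
      - H((0, 1)) - H((0, 2)) - H((1, 2)) + H((0, 1, 2)))

# right-hand side of (6.3.23) with n = 2: I_{x1x2} - I_{x1x2|x3}
I12 = H((0,)) + H((1,)) - H((0, 1))
I12_3 = H((0, 2)) + H((1, 2)) - H((0, 1, 2)) - H((2,))
```

Both sides agree to machine precision, mirroring the inductive step of the proof.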
In the case of higher multiplicities (starting with the triple information considered above) we cannot assert the non-negativity of mutual information, which held in the case of pairwise mutual information.
6.4 Mutual information in the general case 187
1. In the preceding sections it was assumed that the variables under consideration are discrete, i.e. they take values from a finite or countable sample space; however, only general properties of entropy were essentially used. It stands to reason that the aforementioned formulae and statements related to mutual information can be extended to the case of arbitrary continuous or combined random variables. Indeed, it was shown in Sections 1.6 and 1.7 how to define the notion of entropy (possessing all the properties of the entropy of the discrete version) for arbitrary random variables; we should just use the relations provided therein. The validity of the regular properties of entropy in the generalized version ensures the validity of the same relations (as in the discrete version) for mutual information in the generalized version.
Further we consider two arbitrary random variables x, y. In the abstract language of measure theory this means that there are given a probability space (Ω, F, P) and two Borel subfields F₁ ⊂ F, F₂ ⊂ F. The first, F₁, is generated by constraints imposed on x(ω), ω ∈ Ω (i.e. by constraints of the type x(ω) < c); the second, F₂, is generated by constraints imposed on y(ω), ω ∈ Ω. It is also assumed that, besides probability measure P, there are given:

1. a measure ν on the united Borel field F₁₂ = σ(F₁ ∪ F₂);
2. a measure ν₁ on field F₁;
3. a measure ν₂ on field F₂.
These measures are such that the multiplicativity constraint (6.4.1) is satisfied. The last equality can be treated as an original definition of mutual information, independent of the notion of entropy, which is quite convenient in the general case. Such a definition allows one not to introduce the auxiliary measures ν₁, ν₂, ν.
The following expression for the random information corresponds to formula (6.4.4):

$$I(x, y) = \ln \frac{P(dx \mid y)}{P(dx)}, \qquad (6.4.5)$$

or

$$I(x, y) = \ln \frac{P(dx\, dy)}{P(dx)\, P(dy)} = \ln \frac{P(dy \mid x)}{P(dy)}, \qquad (6.4.6)$$

with

$$I_{xy} = E[I(x, y)]. \qquad (6.4.7)$$
These relations are analogous to formulae (6.2.4), (6.2.5) from the discrete version.
In the generalized case formulae (6.2.8), (6.2.7) take the form:
The other formulae from Sections 6.2 and 6.3 can be duplicated with no trouble.
Henceforth, we write down the respective formulae in the required forms.
Satisfaction of the multiplicativity condition (6.4.1) for the normalized measure Q(dx dy) = ν(dx dy)/N = ν₁(dx) ν₂(dy)/N entails the properties

$$\mu(0) = 0, \qquad (6.4.11)$$
$$\mu(1) = 0, \qquad (6.4.12)$$

the latter because ∫∫ P(dx) P(dy) = ∫ P(dx) ∫ P(dy) = 1. The important results of the theory of optimal coding in the presence of noise are expressed in terms of the characteristic potential (6.4.10) (Theorems 7.2, 7.3).
$$\cdots + \frac{1}{2}\sum_{\rho,\sigma=r+1}^{r+s}\left(x_\rho - m_\rho\right) S^{-1}_{\rho\sigma}\left(x_\sigma - m_\sigma\right). \qquad (6.5.2)$$

In order to find the average mutual information I_xy it is reasonable to use the simple formula (5.4.6a), which yields

$$I_{xy} = -\frac{1}{2}\ln\det K + \frac{1}{2}\ln\det R + \frac{1}{2}\ln\det S. \qquad (6.5.3)$$
This result can be represented in a different way. Since

$$\ln\det A = \operatorname{tr}\ln A, \qquad (6.5.4)$$

then, accounting for (6.5.1) and multiplying the matrices, we have

$$I_{xy} = -\frac{1}{2}\operatorname{tr}\ln \begin{pmatrix} 1 & U S^{-1} \\ U^T R^{-1} & 1 \end{pmatrix} = -\frac{1}{2}\ln\det \begin{pmatrix} 1 & U S^{-1} \\ U^T R^{-1} & 1 \end{pmatrix},$$

or, equivalently,

$$I_{xy} = -\frac{1}{2}\operatorname{tr}\ln\left(1 - U^T R^{-1} U S^{-1}\right). \qquad (6.5.7)$$
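Formula (6.5.7) can be checked against (6.5.3) numerically. The following sketch (not in the original; Python with NumPy, a randomly generated positive-definite joint covariance) computes the Gaussian mutual information both ways:

```python
import numpy as np

rng = np.random.default_rng(0)
r, s = 3, 2
# random positive-definite joint covariance K = [[R, U], [U^T, S]]
M = rng.normal(size=(r + s, r + s))
K = M @ M.T + (r + s) * np.eye(r + s)
R, U, S = K[:r, :r], K[:r, r:], K[r:, r:]

# formula (6.5.3): I = -1/2 ln det K + 1/2 ln det R + 1/2 ln det S
I_dets = (-0.5 * np.linalg.slogdet(K)[1]
          + 0.5 * np.linalg.slogdet(R)[1]
          + 0.5 * np.linalg.slogdet(S)[1])

# formula (6.5.7): I = -1/2 ln det (1 - U^T R^{-1} U S^{-1})
B = U.T @ np.linalg.inv(R) @ U @ np.linalg.inv(S)
I_block = -0.5 * np.linalg.slogdet(np.eye(s) - B)[1]
```

Both expressions agree, and the common value is non-negative (Fischer's inequality det K ⩽ det R · det S).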
If we now employ the series expansion of the logarithmic function

$$\ln(1 - z) = -\sum_{k=1}^{\infty} \frac{z^k}{k}, \qquad (6.5.8)$$

we obtain the corresponding series for I_xy. In the simplest particular case x and y are scalar; R₁ is their correlation coefficient, and all the matrices R, S, U consist of one element each, namely R = (σ₁²), S = (σ₂²), U = (σ₁σ₂R₁). Then

$$U S^{-1} U^T R^{-1} = (\sigma_1 \sigma_2 R_1)\left(\sigma_2^2\right)^{-1}(\sigma_1 \sigma_2 R_1)\left(\sigma_1^2\right)^{-1} = R_1^2. \qquad (6.5.10)$$
Because

$$S^{-1} = \frac{1}{\sigma_2^2 \sigma_3^2\left(1 - R_1^2\right)} \begin{pmatrix} \sigma_3^2 & -\sigma_2 \sigma_3 R_1 \\ -\sigma_2 \sigma_3 R_1 & \sigma_2^2 \end{pmatrix}, \qquad U S^{-1} U^T = \sigma_1^2\, \frac{R_2^2 + R_3^2 - 2 R_1 R_2 R_3}{1 - R_1^2},$$

we have

$$U S^{-1} U^T R^{-1} = \frac{R_2^2 + R_3^2 - 2 R_1 R_2 R_3}{1 - R_1^2}.$$

The latter matrix contains just one element. We use formula (6.5.6) to obtain

$$I_{x_1, (x_2, x_3)} = -\frac{1}{2}\ln\left[1 - \frac{R_2^2 + R_3^2 - 2 R_1 R_2 R_3}{1 - R_1^2}\right]. \qquad (6.5.13)$$
This result can also be derived from formula (6.5.3) by calculating the determinants

$$\det K = \sigma_1^2 \sigma_2^2 \sigma_3^2\left(1 - R_1^2 - R_2^2 - R_3^2 + 2 R_1 R_2 R_3\right), \qquad \det S = \sigma_2^2 \sigma_3^2\left(1 - R_1^2\right), \qquad \det R = \sigma_1^2. \qquad (6.5.14)$$

Therefore, (6.5.3) yields

$$I_{xy} = \frac{1}{2}\ln\left(1 - R_1^2\right) - \frac{1}{2}\ln\left(1 - R_1^2 - R_2^2 - R_3^2 + 2 R_1 R_2 R_3\right),$$

which is equivalent to (6.5.13).
In addition to the preceding, we determine the triple mutual information (6.3.19) among three Gaussian random variables. Without loss of generality we can take their correlation matrix in the form (6.5.12). Applying (5.4.6a), (6.5.14) we obtain

$$H_{x_1 x_2 x_3} - H_{x_1} - H_{x_2} - H_{x_3} = \frac{1}{2}\ln\det K - \frac{1}{2}\ln \sigma_1^2 - \frac{1}{2}\ln \sigma_2^2 - \frac{1}{2}\ln \sigma_3^2 = \frac{1}{2}\ln\left(1 - R_1^2 - R_2^2 - R_3^2 + 2 R_1 R_2 R_3\right).$$

According to (6.3.20), adding hereto the pairwise mutual informations I_{x₁x₂}, I_{x₂x₃}, I_{x₃x₁} of the form (6.5.11), we derive

$$I_{x_1 x_2 x_3} = \frac{1}{2}\ln\left(1 - R_1^2 - R_2^2 - R_3^2 + 2 R_1 R_2 R_3\right) - \frac{1}{2}\ln\left(1 - R_1^2\right) - \frac{1}{2}\ln\left(1 - R_2^2\right) - \frac{1}{2}\ln\left(1 - R_3^2\right).$$

If we expand the obtained expression in powers of R₁, R₂, R₃, then, as is easy to see, we get the following formula for the triple mutual information:

$$I_{x_1 x_2 x_3} = R_1 R_2 R_3 + O(R^4).$$
6.5 Mutual information for Gaussian variables 193
It is useful to compare the last formula with the analogous formula for the pairwise mutual information
$$ I_{x_1x_2} = \frac{1}{2}R_1^2 + O\bigl(R^4\bigr) $$
resulting from (6.5.11).
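The determinant identities above are easy to check numerically. The following sketch (with illustrative correlation values, not taken from the text) verifies that (6.5.13) agrees with the determinant formula (6.5.3) and that the triple information is approximated by $R_1R_2R_3$:

```python
import numpy as np

# Illustrative correlation coefficients (hypothetical values):
# R1 = corr(x2, x3), R2 = corr(x1, x3), R3 = corr(x1, x2); unit variances.
R1, R2, R3 = 0.2, 0.3, 0.4
K = np.array([[1.0, R3, R2],
              [R3, 1.0, R1],
              [R2, R1, 1.0]])

# I_{x1,(x2,x3)} by (6.5.3): (1/2)[ln det R + ln det S - ln det K]; det R = 1 here.
detS = 1 - R1**2
I_det = 0.5 * (np.log(detS) - np.log(np.linalg.det(K)))

# The same quantity by the explicit formula (6.5.13)
I_explicit = -0.5 * np.log(1 - (R3**2 + R2**2 - 2*R1*R2*R3) / (1 - R1**2))

# Triple mutual information and its leading term R1*R2*R3
I_triple = (0.5*np.log(1 - R1**2 - R2**2 - R3**2 + 2*R1*R2*R3)
            - 0.5*np.log(1 - R1**2) - 0.5*np.log(1 - R2**2) - 0.5*np.log(1 - R3**2))
print(I_det, I_explicit, I_triple, R1*R2*R3)
```

For these values the two expressions for $I_{x_1,(x_2,x_3)}$ agree to machine precision, while the triple information differs from $R_1R_2R_3$ only by the $O(R^4)$ remainder.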
3. Particular case 3. Here we consider the case of additive independent noises
(disturbances) when the variables of the second group y1 , y2 , . . . , yr (s = r) are the
sums of the variables of the first group x1 , . . . , xr with independent Gaussian random
variables ξ1 , . . . , ξr :
yα = xα + ξα , α = 1, . . . , r. (6.5.15)
In this case the numbers of the first group variables and the second group variables
are equal (s = r). Let R and N be the correlation matrices of variables x1 , . . . , xr
and additive noises ξ1 , . . . , ξr , respectively. It follows from the condition of inde-
pendence between noises ξ and x that the correlation matrices for sum (6.5.15) are
expressed as
$$ S = R + N, \qquad U = U^{T} = R. \qquad (6.5.16) $$
In order to apply formula (6.5.7) we calculate $U^{T}R^{-1}US^{-1}$. Here we have
$$ U^{T}R^{-1}US^{-1} = R\,(R+N)^{-1} $$
and thereby
$$ 1 - U^{T}R^{-1}US^{-1} = \bigl(RN^{-1} + 1\bigr)^{-1}. $$
Consequently, formula (6.5.7) yields
$$ I_{xy} = \frac{1}{2}\operatorname{tr}\ln\bigl(1 + RN^{-1}\bigr) = \frac{1}{2}\ln\det\bigl(1 + RN^{-1}\bigr). \qquad (6.5.17) $$
Suppose that $\lambda_k$ are the eigenvalues of matrix $RN^{-1}$. Then it is apparent that
$$ I_{xy} = \frac{1}{2}\sum_{k}\ln\bigl(1 + \lambda_k\bigr). $$
Hence, we can use formula (6.5.19) after the indicated transformation in the given
case too.
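A small numerical sketch of the additive-noise case: for hypothetical positive definite matrices $R$ and $N$ (generated at random here), formula (6.5.17) computed through the eigenvalues of $RN^{-1}$ agrees with the determinant form and with the general formula (6.5.3) under condition (6.5.16):

```python
import numpy as np

rng = np.random.default_rng(0)
r = 4
# Hypothetical positive definite signal and noise covariance matrices
A = rng.normal(size=(r, r)); R = A @ A.T + r*np.eye(r)
B = rng.normal(size=(r, r)); N = B @ B.T + r*np.eye(r)

M = R @ np.linalg.inv(N)
I_det = 0.5 * np.log(np.linalg.det(np.eye(r) + M))   # (6.5.17)
lam = np.linalg.eigvals(M).real                      # eigenvalues of R N^{-1}
I_eig = 0.5 * np.sum(np.log(1 + lam))

# Cross-check against (6.5.3) with S = R + N and U = R, per (6.5.16)
S = R + N
Kfull = np.block([[R, R], [R, S]])
I_gen = 0.5 * (np.log(np.linalg.det(R)) + np.log(np.linalg.det(S))
               - np.log(np.linalg.det(Kfull)))
print(I_det, I_eig, I_gen)
```

All three evaluations coincide because $\det K = \det R\,\det N$ for the block matrix with $U = R$.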
4. Let us find the characteristic potential (6.4.10) of the random mutual infor-
mation for Gaussian variables. For this purpose we represent the random mutual
information (6.4.6), (6.5.2) as follows:
$$ I(x_1,\dots,x_r;\,y_1,\dots,y_s) = I_{xy} - \frac{1}{2}\sum_{i,j=1}^{r+s}(x_i - m_i)\,A_{ij}\,(x_j - m_j), \qquad (6.5.20a) $$
where we denote
$$ \|A_{ij}\| = A = K^{-1} - \begin{pmatrix} R^{-1} & 0 \\ 0 & S^{-1} \end{pmatrix} = \begin{pmatrix} R & U \\ U^{T} & S \end{pmatrix}^{-1} - \begin{pmatrix} R^{-1} & 0 \\ 0 & S^{-1} \end{pmatrix}. $$
We take into account (6.5.3) and represent the characteristic potential (6.5.21) as
$$ \mu(s) = \frac{s}{2}\ln\det K + \frac{1-s}{2}\ln\det R + \frac{1-s}{2}\ln\det S - \frac{1}{2}\ln\det\begin{pmatrix} R & sU \\ sU^{T} & S \end{pmatrix}, \qquad (6.5.23) $$
or, equivalently,
$$ \mu(s) = -\frac{1}{2}\ln\det\bigl(1 - s^2 U^{T}R^{-1}US^{-1}\bigr) - s I_{xy} = -\frac{1}{2}\ln\det\bigl(1 - s^2 B\bigr) + \frac{s}{2}\ln\det\bigl(1 - B\bigr), \qquad B = U^{T}R^{-1}US^{-1}, \qquad (6.5.25) $$
which can be expanded as
$$ \mu(s) = \frac{s}{2}\sum_{k=1}^{\infty}\frac{1}{k}\bigl(s^{2k-1} - 1\bigr)\operatorname{tr}B^{k} \qquad (6.5.26) $$
in analogy with (6.5.9). In particular, it is easy to find the variance of the random mutual information of Gaussian variables. In order to do so, we need to extract the coefficient of $\frac{1}{2}s^2$ in the expansion (6.5.26), which yields
$$ \operatorname{Var} I(x_1,\dots,x_r;\,y_1,\dots,y_s) = \operatorname{tr}B = \operatorname{tr}\tilde{B}. $$
Thus, we see that all statistical properties of the random mutual information of Gaussian variables are defined just by the single matrix $B = U^{T}R^{-1}US^{-1}$ or $\tilde{B} = US^{-1}U^{T}R^{-1}$.
For the aforementioned particular case 1 we have
$$ \mu(s) = -\frac{1}{2}\ln\bigl(1 - s^2 R_1^2\bigr) + \frac{s}{2}\ln\bigl(1 - R_1^2\bigr) $$
according to formulae (6.5.25), (6.5.9). It is not difficult to apply the derived formu-
lae to other particular cases.
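The statement $\operatorname{Var} I = \operatorname{tr} B$ can be checked by Monte Carlo in particular case 1, where $B = R_1^2$. The sketch below (with an illustrative correlation value) samples a correlated Gaussian pair and evaluates the random mutual information of (6.5.20a) directly:

```python
import numpy as np

rho = 0.6                      # illustrative correlation coefficient R1
rng = np.random.default_rng(1)
n = 1_000_000
x = rng.standard_normal(n)
y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)

# random mutual information i(x, y) = ln[p(x, y) / (p(x) p(y))]
i = (-0.5*np.log(1 - rho**2)
     - (x**2 - 2*rho*x*y + y**2) / (2*(1 - rho**2))
     + (x**2 + y**2) / 2)

print(i.mean(), -0.5*np.log(1 - rho**2))   # sample mean approaches I_xy
print(i.var(), rho**2)                     # sample variance approaches tr B = R1^2
```

The sample mean converges to $I_{xy} = -\tfrac12\ln(1-R_1^2)$ and the sample variance to $R_1^2$, as the expansion of $\mu(s)$ predicts.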
1. Now we suppose that both the first and the second groups of random variables are
stationary processes
$$ x_{-\infty}^{\infty} = \{x_t,\ -\infty < t < \infty\}, \qquad y_{-\infty}^{\infty} = \{y_t,\ -\infty < t < \infty\}, \qquad (6.6.1) $$
in discrete or continuous time $t$. These processes are assumed to be not only stationary but also stationary-connected, so that the combined process $z_{-\infty}^{\infty} = \{x_t, y_t,\ -\infty < t < \infty\}$ is stationary. For the interval $[0, T]$ we can consider the entropies $H_{x_0^T}$, $H_{y_0^T}$, $H_{z_0^T}$ defined according to the formulae and results of Chapter 5.
According to the general formula (6.2.2) these entropies allow the determination
of mutual information
$$ I_{x_0^T, y_0^T} = H_{x_0^T} + H_{y_0^T} - H_{x_0^T, y_0^T}. \qquad (6.6.2) $$
Let us define the mutual information rate of the processes $\{x_t\}$ and $\{y_t\}$ as the limit
$$ i_{xy} = \lim_{T\to\infty}\frac{1}{T}\, I_{x_0^T, y_0^T}. \qquad (6.6.3) $$
If we substitute (6.2.2) hereto, then, evidently,
6.6 Information rate of stationary and stationary-connected processes. Gaussian processes 197
$$ i_{xy} = \lim_{T\to\infty}\frac{1}{T}H_{x_0^T} + \lim_{T\to\infty}\frac{1}{T}H_{y_0^T} - \lim_{T\to\infty}\frac{1}{T}H_{x_0^T, y_0^T}. \qquad (6.6.4) $$
The limits in the right-hand side of this equality exist according to Theorems 5.1 and 5.4, and are equal to the entropy rates $h_x$, $h_y$, $h_{xy}$, respectively ($H_{\xi_0^1|\xi_{-\infty}^0}$ coincides with $H_{\xi_0|\dots\xi_{-2}\xi_{-1}} = H^1$ in the discrete time case when $t$ is integer; we use formula (5.6.10) with $\tau = 1$ in the continuous time case). Therefore, (6.6.4) takes the form
$$ i_{xy} = h_x + h_y - h_{xy} = H_x^1 + H_y^1 - H_{xy}^1. \qquad (6.6.6) $$
Certainly, mutual information (6.6.2) and information rate (6.6.6) can be finite not only in the case when $H_{x_0^T}$, $H_{y_0^T}$, $H_{x_0^Ty_0^T}$ or $h_x$, $h_y$, $h_{xy}$ are finite individually. That
is why, theoretically, one can compute information leaving out a computation of en-
tropies. However, in practical cases, when information is finite, we can always select
auxiliary measures ν , Q (involved in the definition of entropy from Section 1.6) in
such a way that all terms of (6.6.2), (6.6.6) are finite. Then the problem of comput-
ing the information will be reduced to the simpler problem of computing entropy
rates (considered in Chapter 5) at least for one (the most convenient) selection of
measure ν or Q. According to the aforesaid in Section 6.4 we need to be sure that
either one of the multiplicativity constraints (6.4.1) and (6.4.8) is satisfied. Note that
these multiplicativity constraints are expressed as
$$ \nu(dx_{-\infty}^{\infty}\,dy_{-\infty}^{\infty}) = \nu_1(dx_{-\infty}^{\infty})\,\nu_2(dy_{-\infty}^{\infty}) $$
and
$$ Q(dx_{-\infty}^{\infty}\,dy_{-\infty}^{\infty}) = Q_1(dx_{-\infty}^{\infty})\,Q_2(dy_{-\infty}^{\infty}) \qquad (6.6.7) $$
for processes (6.6.1), respectively. In this case, according to (6.4.9), formulae (6.6.2), (6.6.6) can be replaced with
$$ I_{x_0^Ty_0^T} = H^{P/Q}_{x_0^Ty_0^T} - H^{P/Q_1}_{x_0^T} - H^{P/Q_2}_{y_0^T}, \qquad i_{xy} = h^{P/Q}_{xy} - h^{P/Q_1}_{x} - h^{P/Q_2}_{y}, \qquad (6.6.8) $$
where
$$ \lim_{T\to\infty}\frac{1}{T}H^{P/Q}_{\xi_0^T} = H^{P/Q}_{\xi_0^1|\xi_{-\infty}^0}. $$
Besides the information rate we consider the information of the end of the interval, which is analogous to the entropy constant $\Gamma$ of the end of the interval involved in the formula
$$ H_{\xi_0^T} = hT + 2\Gamma + o_T(1) \qquad (6.6.9) $$
[see (5.6.17)].
198 6 Information in the presence of noise. Shannon’s amount of information
By comparing relation (5.6.15) with formula (6.6.2) it is easy to see that the constant $2\Gamma$ can be interpreted as the mutual information between the stochastic processes on the two half-lines $(-\infty, 0)$ and $(0, \infty)$:
$$ 2\Gamma = I_{\xi_{-\infty}^0,\,\xi_0^{\infty}}. \qquad (6.6.10) $$
Formulae (6.6.9), (6.6.10) are valid for each of processes {xt }, {yt }, {xt , yt }.
Substituting similar expressions for every entropy into (6.6.2) and taking into account (6.6.6), we obtain
$$ I_{x_0^Ty_0^T} = i_{xy}T + 2\Gamma_x + 2\Gamma_y - 2\Gamma_{xy} + o_T(1). $$
This equality allows us to calculate the mutual information on the finite segment $[0, T]$ more precisely than by the formula $I_{x_0^Ty_0^T} \approx i_{xy}T$ following from (6.6.3).
2. We apply the formulae provided above to calculate the information rate be-
tween two Gaussian stationary random sequences {xt , t = . . . , 1, 2, . . .}, {yt , t =
. . . , 1, 2, . . .}. Since the mean values of Gaussian variables do not affect the value of the mutual information [for instance, see (6.5.3)], without loss of generality we can suppose that the respective mean values are equal to zero:
$$ \mathrm{E}[x_t] = 0, \qquad \mathrm{E}[y_t] = 0. $$
Applying the results of Chapter 5 we have
$$ H^1_{xy} = \frac{1}{2}\int_{-1/2}^{1/2}\ln\det\|\varphi^{\alpha\beta}(\mu)\|\,d\mu = \frac{1}{2}\int_{-1/2}^{1/2}\ln\bigl[\varphi^{11}(\mu)\varphi^{22}(\mu) - \varphi^{12}(\mu)\varphi^{21}(\mu)\bigr]\,d\mu, $$
$$ H^1_{x} = \frac{1}{2}\int_{-1/2}^{1/2}\ln\varphi^{11}(\mu)\,d\mu, \qquad H^1_{y} = \frac{1}{2}\int_{-1/2}^{1/2}\ln\varphi^{22}(\mu)\,d\mu, $$
so that
$$ i_{xy} = -\frac{1}{2}\int_{-1/2}^{1/2}\ln\Bigl[1 - \frac{\varphi^{12}(\mu)\varphi^{21}(\mu)}{\varphi^{11}(\mu)\varphi^{22}(\mu)}\Bigr]\,d\mu. \qquad (6.6.12) $$
Because $|\varphi^{12}(\mu)|^2/[\varphi^{11}(\mu)\varphi^{22}(\mu)] = R_\mu^2$ is the square of the correlation coefficient for the spectral components corresponding to the value of $\mu$, the expression in the right-hand
side of equality (6.6.12) can be interpreted as the sum of mutual information of
distinct spectral components. In turn, each summand is defined by the simple for-
mula (6.5.11).
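As a sketch of formula (6.6.12), consider a hypothetical pair (not an example from the book): $x_t$ an AR(1) sequence and $y_t = x_t + n_t$ with white noise $n_t$. For this model the information rate also has a closed form obtained by spectral factorization, which gives an independent check of the numerical integral:

```python
import numpy as np

# Hypothetical pair: x_t is AR(1) with parameter a and unit innovation variance,
# y_t = x_t + n_t with white noise of spectral density N0. Then
# phi11 = 1/|1 - a e^{-2 pi i mu}|^2, phi12 = phi11, phi22 = phi11 + N0.
a, N0 = 0.7, 0.5
Ngrid = 200_000
mu = (np.arange(Ngrid) + 0.5) / Ngrid - 0.5          # midpoint grid on [-1/2, 1/2]
phi11 = 1.0 / np.abs(1 - a*np.exp(-2j*np.pi*mu))**2
R2 = phi11 / (phi11 + N0)                            # |phi12|^2 / (phi11 phi22)
i_num = -0.5 * np.mean(np.log(1 - R2))               # (6.6.12) by the midpoint rule

# Independent closed form: i = (1/2) ln A, where A solves the spectral
# factorization of 1 + phi11/N0 (our own computation, with c = 1/N0)
c = 1.0 / N0
A = ((1 + a**2 + c) + np.sqrt((1 + a**2 + c)**2 - 4*a**2)) / 2
i_closed = 0.5 * np.log(A)
print(i_num, i_closed)
```

Because the integrand is smooth and periodic, the midpoint rule converges very rapidly, and the two evaluations agree to high precision.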
Let us move on to a multivariate case. We find the information rate between a
group of stationary Gaussian sequences $\{x_t^{\alpha}\}$, $\alpha = 1,\dots,r$ and another group of sequences $\{y_t^{\rho-r}\} = \{x_t^{\rho}\}$, $\rho = r+1,\dots,r+s$. In aggregate these sequences are described by the correlation matrix $R_{t-q}^{ij}$, $i, j = 1,\dots,r+s$, or by the matrix of spectral densities
$$ \varphi(\mu) = \|\varphi^{ij}(\mu)\| = \begin{pmatrix} \varphi_x(\mu) & \varphi_{xy}(\mu) \\ \varphi_{yx}(\mu) & \varphi_y(\mu) \end{pmatrix}, \qquad (6.6.13) $$
where $\varphi_{yx}(\mu) = \varphi_{xy}^{+}(\mu)$ and $+$ denotes the Hermitian transpose. Here $\varphi_x(\mu) = \|\varphi^{\alpha\beta}(\mu)\|$ is the density matrix for the group of processes $\{x_t^{\alpha}\}$, $\varphi_y(\mu) = \|\varphi^{\rho\sigma}(\mu)\|$ is the matrix for the processes $\{y_t^{\rho-r}\}$ and, finally, $\varphi_{xy}(\mu) = \|\varphi^{\alpha\sigma}(\mu)\|$ is the matrix of mutual spectral functions.
Incorporating (5.5.19) into formula (6.6.6) we find the information rate
$$ i_{xy} = \frac{1}{2}\int_{-1/2}^{1/2}\bigl[\ln\det\varphi_x(\mu) + \ln\det\varphi_y(\mu) - \ln\det\varphi(\mu)\bigr]\,d\mu. \qquad (6.6.14) $$
We can apply to the integrand in question all those transformations which led from formula (6.5.3) to (6.5.6). After that, equality (6.6.14) takes the form
$$ i_{xy} = -\frac{1}{2}\int_{-1/2}^{1/2}\ln\det\bigl[1 - \varphi_{xy}(\mu)\varphi_y^{-1}(\mu)\varphi_{yx}(\mu)\varphi_x^{-1}(\mu)\bigr]\,d\mu. \qquad (6.6.15) $$
But it is nothing else but the sum of the respective expressions of $G(S_x^{-1}\tilde S_x)$ and $G(S_y^{-1}\tilde S_y)$. That is why only logarithmic terms will remain in the integrand:
$$ i_{xy}^{P/Q} = h_{xy}^{P/Q} - h_x^{P/Q_1} - h_y^{P/Q_2} = \frac{1}{4\pi}\int_{-\infty}^{\infty}\bigl[\operatorname{tr}\ln S_x + \operatorname{tr}\ln S_y - \operatorname{tr}\ln S - \operatorname{tr}\ln\tilde S_x - \operatorname{tr}\ln\tilde S_y + \operatorname{tr}\ln\tilde S\bigr]\,d\omega. \qquad (6.6.18) $$
Terms with the auxiliary spectral densities $\tilde S_x$, $\tilde S_y$ are completely discarded because of the absence of mutual correlations $\tilde S^{\alpha\sigma}(\omega)$ again, since
$$ \operatorname{tr}\ln\tilde S_x + \operatorname{tr}\ln\tilde S_y = \operatorname{tr}\ln\begin{pmatrix} \tilde S_x & 0 \\ 0 & 1 \end{pmatrix} + \operatorname{tr}\ln\begin{pmatrix} 1 & 0 \\ 0 & \tilde S_y \end{pmatrix} = \operatorname{tr}\ln\tilde S. $$
Here we have also used formula (6.5.4). The integrand is analogous to the expression
situated in the right-hand side of (6.5.3). Just like in Section 6.5 it can be reduced to
the form (6.5.6) or (6.5.7). At the same time the derived formula (6.6.18) takes the
form
$$ i_{xy} = -\frac{1}{4\pi}\int_{-\infty}^{\infty}\ln\det\bigl[1 - S_{xy}(\omega)S_y^{-1}(\omega)S_{yx}(\omega)S_x^{-1}(\omega)\bigr]\,d\omega. \qquad (6.6.19) $$
There is an evident analogy between this result and the corresponding formula (6.6.15) for stationary sequences. In the particular case when we consider the mutual information between a single process $\{x_t\}$ and another single process $\{y_t\}$, from (6.6.19) we have
$$ i_{xy} = -\frac{1}{4\pi}\int_{-\infty}^{\infty}\ln\Bigl[1 - \frac{|S_{xy}(\omega)|^2}{S_x(\omega)S_y(\omega)}\Bigr]\,d\omega. $$
so that μ (s) ≈ μ 1 (s)T . This quantity can be easily computed with the help of
formula (6.5.25) in the same way as information rate (6.6.19) can be determined
from (6.5.6). Taking into account the aforesaid, it is not difficult to find out what
form an expression of the rate potential will take in different cases. Thus, in the case
when formula (6.6.19) is valid the rate potential is represented as
$$ \mu_1(s) = -\frac{1}{4\pi}\int_{-\infty}^{\infty}\Bigl\{\ln\det\bigl[1 - s^2 S_{xy}(\omega)S_y^{-1}(\omega)S_{yx}(\omega)S_x^{-1}(\omega)\bigr] + s\ln\det\bigl[1 - S_{xy}(\omega)S_y^{-1}(\omega)S_{yx}(\omega)S_x^{-1}(\omega)\bigr]\Bigr\}\,d\omega. $$
Besides information rate (6.6.19) we can obtain several quantities from this result,
namely, the variance rate
$$ (\operatorname{Var} I_{xy})_1 = \frac{1}{2\pi}\int_{-\infty}^{\infty}\operatorname{tr}\bigl[S_{xy}(\omega)S_y^{-1}(\omega)S_{yx}(\omega)S_x^{-1}(\omega)\bigr]\,d\omega $$
1. Let there be given arbitrary (not necessarily stationary) stochastic processes $\{x_t\}$, $\{y_t\}$ in continuous or discrete time $t$. The mutual information of these processes on a fixed segment $a \leqslant t \leqslant b$ equals
$$ I_{x_a^b, y_a^b} = H^{P/Q}_{x_a^b y_a^b} - H^{P/Q_1}_{x_a^b} - H^{P/Q_2}_{y_a^b} \qquad (6.7.1) $$
for every $a$, $b$. We use the notations $Q_1(dx_a^b) = Q(dx_a^b)$, $Q_2(dy_a^b) = Q(dy_a^b)$, meaning that $Q_1$ and $Q_2$ are induced by the measure $Q(dx_a^b\,dy_a^b)$.
Taking the difference between the values of (6.7.1) with $b = t + \tau$ and $b = t$, we find the increment of mutual information
$$ I_{x,y}^{\tau} = H^{P/Q}_{x_a^{t+\tau}y_a^{t+\tau}} - H^{P/Q}_{x_a^t y_a^t} + H^{P/Q_1}_{x_a^t} - H^{P/Q_1}_{x_a^{t+\tau}} + H^{P/Q_2}_{y_a^t} - H^{P/Q_2}_{y_a^{t+\tau}}. $$
This expression can be represented with the help of conditional entropies as follows:
$$ I_{x,y}^{\tau} = H^{P/Q}_{y_t^{t+\tau}|x_a^t y_a^t} + H^{P/Q}_{x_t^{t+\tau}|x_a^t y_a^{t+\tau}} - H^{P/Q_1}_{x_t^{t+\tau}|x_a^t} - H^{P/Q_2}_{y_t^{t+\tau}|y_a^t}. \qquad (6.7.3) $$
6.7 Mutual information of components of a Markov process 203
Indeed,
$$ H^{P/Q}_{x_a^{t+\tau}y_a^{t+\tau}} - H^{P/Q}_{x_a^t y_a^t} = H^{P/Q}_{x_t^{t+\tau}y_t^{t+\tau}|x_a^t y_a^t} = H^{P/Q}_{y_t^{t+\tau}|x_a^t y_a^t} + H^{P/Q}_{x_t^{t+\tau}|x_a^t y_a^{t+\tau}}. $$
Here
$$ h^{P/Q}_{y|x_a^t y_a^t} = \lim_{\tau\to 0}\frac{1}{\tau}H^{P/Q}_{y_t^{t+\tau}|x_a^t y_a^t}, \qquad H^{P/Q}_{y_t^{t+\tau}|x_a^t y_a^t} = \mathrm{E}\int\ln\frac{P(dy_t^{t+\tau}\mid x_a^t, y_a^t)}{Q(dy_t^{t+\tau}\mid y_a^t)}\;P(dy_t^{t+\tau}\mid x_a^t, y_a^t), \qquad (6.7.5) $$
and
$$ H^{P/Q_2}_{y_t^{t+\tau}|y_a^t} = \mathrm{E}\int\ln\frac{P(dy_t^{t+\tau}\mid y_a^t)}{Q(dy_t^{t+\tau}\mid y_a^t)}\;P(dy_t^{t+\tau}\mid y_a^t). \qquad (6.7.6) $$
However, the probability $P(dy_t^{t+\tau}\mid y_a^t)$ can be represented as
$$ P(dy_t^{t+\tau}\mid y_a^t) = \mathrm{E}\bigl[P(dy_t^{t+\tau}\mid x_a^t, y_a^t)\mid y_a^t\bigr]. \qquad (6.7.7) $$
Next, we denote the conditional averaging over $x_a^t$ with weight $P(dx_a^t\mid y_a^t)$ by $E_1$, and the (non-conditional) averaging over $y_a^t$ with weight $P(dy_a^t)$ by $E_2$. Then (6.7.7), (6.7.6) can be rewritten as
$$ P(dy_t^{t+\tau}\mid y_a^t) = E_1\bigl[P(dy_t^{t+\tau}\mid x_a^t, y_a^t)\bigr] $$
and
$$ H^{P/Q_2}_{y_t^{t+\tau}|y_a^t} = E_2\int\ln\frac{P(dy_t^{t+\tau}\mid y_a^t)}{Q(dy_t^{t+\tau}\mid y_a^t)}\;P(dy_t^{t+\tau}\mid y_a^t). $$
The averagings in (6.7.5) can be represented as the consecutive averagings $E_2E_1$. Then the difference between entropies (6.7.5), (6.7.6) takes the form
$$ H^{P/Q}_{y_t^{t+\tau}|x_a^t y_a^t} - H^{P/Q_2}_{y_t^{t+\tau}|y_a^t} = E_2\Bigl\{ E_1\int\ln\frac{P(dy_t^{t+\tau}\mid x_a^t, y_a^t)}{Q(dy_t^{t+\tau}\mid y_a^t)}\;P(dy_t^{t+\tau}\mid x_a^t, y_a^t) $$
$$ \qquad - \int\ln\Bigl[E_1\frac{P(dy_t^{t+\tau}\mid x_a^t, y_a^t)}{Q(dy_t^{t+\tau}\mid y_a^t)}\Bigr]\,E_1\bigl[P(dy_t^{t+\tau}\mid x_a^t, y_a^t)\bigr]\Bigr\}. \qquad (6.7.8) $$
Analogously, we can work out the second difference $H^{P/Q}_{x_t^{t+\tau}|x_a^t y_a^{t+\tau}} - H^{P/Q_1}_{x_t^{t+\tau}|x_a^t}$ included in (6.7.3). Denoting the conditional averaging over $y_a^{t+\tau}$ with weight $P(dy_a^{t+\tau}\mid x_a^t)$ by $E_3$, and the averaging over $x_a^t$ by $E_4$, we will have
$$ H^{P/Q}_{x_t^{t+\tau}|x_a^t y_a^{t+\tau}} - H^{P/Q_1}_{x_t^{t+\tau}|x_a^t} = E_4\Bigl\{E_3\int\ln\frac{P(dx_t^{t+\tau}\mid x_a^t, y_a^{t+\tau})}{Q(dx_t^{t+\tau}\mid x_a^t)}\;P(dx_t^{t+\tau}\mid x_a^t, y_a^{t+\tau}) $$
$$ \qquad - \int\ln\Bigl[E_3\frac{P(dx_t^{t+\tau}\mid x_a^t, y_a^{t+\tau})}{Q(dx_t^{t+\tau}\mid x_a^t)}\Bigr]\,E_3\bigl[P(dx_t^{t+\tau}\mid x_a^t, y_a^{t+\tau})\bigr]\Bigr\}. \qquad (6.7.9) $$
The change of mutual information (6.7.3) is equal to the sum of the indicated expressions (6.7.8), (6.7.9). The subtrahend terms in the brackets differ from one another by the order (which is not the same for them) of the operations of averaging and non-linear transformation. It is not difficult to verify that measure $Q$ does not exert any essential influence on them.
It is convenient to use these expressions (derived without using Markovian
properties) for calculation of the mutual information between one part of compo-
nents of a Markov process and the other part of its components. The joint process
{xt , yt } = {ξt } is assumed to be Markov with respect to measure P. At the same
time process {yt } is assumed to be Markov with respect to measure Q. Then
$$ P(dy_t^{t+\tau}\mid x_a^t, y_a^t) = P(dy_t^{t+\tau}\mid x_t, y_t) \qquad (\tau > 0), $$
$$ Q(dy_t^{t+\tau}\mid x_a^t, y_a^t) = Q(dy_t^{t+\tau}\mid y_t). $$
This explains why the averaging $E_1$ in formula (6.7.8) reduces to an averaging over $dx_t\,dy_t$ performed with a weight $W_t(d\xi_t)$. We can see from here that measure $Q$ cancels out, and we obtain
$$ H^{P/Q}_{y_t^{t+\tau}|x_a^t y_a^t} - H^{P/Q_2}_{y_t^{t+\tau}|y_a^t} = E_2\int_{\xi_t}W_t(d\xi_t)\int_{y_t^{t+\tau}}P(dy_t^{t+\tau}\mid\xi_t)\,\ln\frac{P(dy_t^{t+\tau}\mid\xi_t)}{\int_{\xi_t}P(dy_t^{t+\tau}\mid\xi_t)\,W_t(d\xi_t)}. \qquad (6.7.11) $$
Let us consider the second expression of (6.7.8). The following relations are valid
for the joint Markov process {xt , yt } with respect to P and the Markov process {xt }
with respect to Q:
The sum of expressions (6.7.11), (6.7.13) gives the desired mutual information (6.7.3):
$$ I_{xy}^{\tau} = E_2\int_{\xi_t}W_t(d\xi_t)\int_{y_t^{t+\tau}}P(dy_t^{t+\tau}\mid\xi_t)\,\ln\frac{P(dy_t^{t+\tau}\mid\xi_t)}{\int_{\xi_t}P(dy_t^{t+\tau}\mid\xi_t)\,W_t(d\xi_t)} $$
$$ \qquad + E_4\int_{\xi_t}\tilde W_t(d\xi_t)\int_{x_t^{t+\tau}y_t^{t+\tau}}P(dx_t^{t+\tau}dy_t^{t+\tau}\mid\xi_t)\,\ln\frac{P(dx_t^{t+\tau}\mid x_t, y_t^{t+\tau})}{\int_{\xi_t}P(dx_t^{t+\tau}\mid\xi_t)\,\tilde W_t(d\xi_t)}. $$
For a Markov joint process the sum of the first two terms in (6.7.10), (6.7.12) can be represented in a simpler form. Taking this into account when summing up expressions (6.7.10), (6.7.12), we also obtain the result in the different form
$$ I_{xy}^{\tau} = \mathrm{E}\int_{\xi_t^{t+\tau}}\ln\frac{P(d\xi_t^{t+\tau}\mid\xi_t)}{Q(d\xi_t^{t+\tau}\mid\xi_t)}\,P(d\xi_t^{t+\tau}\mid\xi_t) $$
$$ \qquad - E_2\int_{y_t^{t+\tau}}\ln\frac{\int_{\xi_t}P(dy_t^{t+\tau}\mid\xi_t)\,W_t(d\xi_t)}{Q(dy_t^{t+\tau}\mid y_t)}\,\int_{\xi_t}P(dy_t^{t+\tau}\mid\xi_t)\,W_t(d\xi_t) $$
$$ \qquad - E_4\int_{x_t^{t+\tau}}\ln\frac{\int_{\xi_t}P(dx_t^{t+\tau}\mid\xi_t)\,\tilde W_t(d\xi_t)}{Q(dx_t^{t+\tau}\mid x_t)}\,\int_{\xi_t}P(dx_t^{t+\tau}\mid\xi_t)\,\tilde W_t(d\xi_t), \qquad (6.7.15) $$
2. Let us consider various particular cases and start with the case when $\{x_t, y_t\}$ is a discrete stationary Markov process in discrete time, i.e. a Markov chain. Then there is no need to introduce measure $Q$ in the formulae of the previous clause. This measure can be disregarded by substituting $P(dx_\alpha^\beta)/Q(dx_\alpha^\beta)$ and $H^{P/Q}$ with $P(x_\alpha^\beta)$ and $-H$, respectively. Also we can directly use the results of Sections 5.2 and 5.3.
Assume that the Markov chain is described by transition probabilities $\pi(x, y; x', y')$ as in Sections 5.2 and 5.3. Applying formula (5.2.8) we find the entropy rate of the combined process. The entropy rates of the components $x$ and $y$ taken separately are expressed via formula (5.3.23):
$$ h_x = -\int P_{st}(dW)\sum_{x,y,x',y'}W(x,y)\,\pi(x,y;x',y')\,\ln\sum_{\tilde x,\tilde y,y'}W(\tilde x,\tilde y)\,\pi(\tilde x,\tilde y;x',y'), $$
$$ h_y = -\int P_{st}(dW)\sum_{x,y,x',y'}W(x,y)\,\pi(x,y;x',y')\,\ln\sum_{\tilde x,\tilde y,x'}W(\tilde x,\tilde y)\,\pi(\tilde x,\tilde y;x',y'). $$
In the given formulae the stationary distribution $P_{st}(x, y)$ is a solution of the equation [see (5.2.7)], and the distribution $P_{st}(dW) = p_{st}(W)\prod_{\xi}dW(\xi)$ is a solution of the analogous equation corresponding to the secondary a posteriori Markov process having transition probabilities (5.3.22). The latter equation has the following form:
$$ \sum_{y_k}\int p_{st}(W(x,y))\prod_{x',y'}\delta\Bigl[W'(x',y') - \frac{\delta_{y',y_k}\sum_{x,y}W(x,y)\,\pi(x,y;x',y')}{\sum_{x',x,y}W(x,y)\,\pi(x,y;x',y_k)}\Bigr]\prod_{x,y}dW(x,y) = p_{st}(W'(x',y')). $$
Example 6.3. As an example, consider the Markov process $\{\xi_t\}$ with three states $\xi = a, b, c$ covered in Section 5.3. It can be represented as a combination $\{x_t, y_t\}$ of processes $\{x_t\}$ and $\{y_t\}$, each of which has two states $x, y = 1, 2$; the states $\xi = a$, $\xi = b$, $\xi = c$ are interpreted as combined states. This representation has a certain symmetry: it does not change if we swap the processes $\{x\}$, $\{y\}$ (i.e. under the substitution $x \leftrightarrow y$, $1 \leftrightarrow 2$). Therefore, we have
$$ h_x = h_y, \qquad (6.7.20a) $$
$$ -\sum_{x',y'}\pi(1,2;x',y')\,\ln\pi(1,2;x',y') = h_3(\upsilon,\upsilon) = -2\upsilon\ln\upsilon - (1-2\upsilon)\ln(1-2\upsilon). \qquad (6.7.21) $$
The stationary distribution $P_{st}(x, y)$ possesses the symmetry property $P_{st}(1,1) = P_{st}(2,2)$. Using it, equation (6.7.18) yields
$$ h_{xy} = \frac{2\upsilon}{\lambda + 2\upsilon}\,h_3(\lambda,\mu) + \frac{\lambda}{\lambda + 2\upsilon}\,h_3(\upsilon,\upsilon). \qquad (6.7.22) $$
In consequence of (6.7.20a) in order to determine mutual information (6.7.19) it
only remains to find entropy hx or hy . For the example in question it was found in
clause 3 of Section 5.3. According to (5.3.36), (5.3.37) it can be computed by the
formula
$$ h_y = \frac{h_2(p_1) + \sum_{k=1}^{\infty}p_1\cdots p_k\,h_2(p_{k+1})}{1 + \sum_{k=1}^{\infty}p_1\cdots p_k}, \qquad (6.7.23) $$
where $p_1, p_2, \dots$ are the values determined by relations (5.3.30), (5.3.32), (5.3.34) and so on. For the example (6.7.20) in question, these relations take the form
$$ p_1 = \lambda + \mu, \qquad p_2 = \frac{\lambda(1-\upsilon) + \mu(1-\mu)}{\lambda + \mu}, $$
$$ p_3 = \frac{[\lambda(1-2\upsilon) + \mu\upsilon](1-\upsilon) + [\lambda\upsilon + \mu(1-\lambda-\mu)](1-\mu)}{\lambda(1-\upsilon) + \mu(1-\mu)}, \qquad \dots $$
With the help of the results of Sections 5.9 and 5.11 or the formulae from clause 1
of the present paragraph we can calculate the density of the mutual information ixy ,
which coincides with the information rate in a stationary case.
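The scheme of this section — an exact entropy rate for the joint chain, plus filtering through the a posteriori probabilities $W$ for a component that is not Markov on its own — can be sketched numerically. The chain below is a hypothetical illustration (not the book's Example 6.3): $x_t$ is Markov, and $y_t$ is a noisy reading of $x_{t-1}$:

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical two-component chain: x_t is Markov with flip probability pf,
# y_t is a noisy reading of x_{t-1} with error probability pe.
# The pair (x_t, y_t) is Markov, while y_t alone is not.
pf, pe = 0.2, 0.1
T = 200_000
x = np.zeros(T, dtype=int); y = np.zeros(T, dtype=int)
for t in range(1, T):
    x[t] = x[t-1] ^ int(rng.random() < pf)
    y[t] = x[t-1] ^ int(rng.random() < pe)

# Joint transition matrix over states s = 2*x + y
P = np.zeros((4, 4))
for s in range(4):
    xs = s >> 1
    for s2 in range(4):
        x2, y2 = s2 >> 1, s2 & 1
        P[s, s2] = (pf if x2 != xs else 1 - pf) * (pe if y2 != xs else 1 - pe)

pi = np.full(4, 0.25)
for _ in range(500):                       # power iteration: stationary law
    pi = pi @ P
h_xy = -np.sum(pi[:, None] * P * np.log(P))   # entropy rate of the joint chain
h_x = -(pf*np.log(pf) + (1 - pf)*np.log(1 - pf))  # x alone is Markov: exact

# h_y by forward filtering: h_y = -lim (1/T) sum_t ln P(y_t | y_1 ... y_{t-1})
masks = [np.array([(s & 1) == v for s in range(4)]) for v in (0, 1)]
w = np.full(4, 0.25)                       # P(joint state | y-past)
ll = 0.0
for t in range(1, T):
    pred = w @ P                           # one-step prediction of joint state
    py = pred[masks[y[t]]].sum()           # probability of the observed y_t
    ll += np.log(py)
    w = pred * masks[y[t]] / py            # Bayes update
h_y = -ll / (T - 1)

i_xy = h_x + h_y - h_xy                    # information rate between components
print(h_x, h_y, h_xy, i_xy)
```

The filtering distribution plays the role of the a posteriori probabilities $W$ above; the Monte Carlo estimate of $h_y$ converges at rate $O(T^{-1/2})$, and $i_{xy}$ comes out strictly positive, as it must for dependent components.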
In order for the mutual information density $i_{xy}$ to be finite (see below) we suppose that the matrix $\pi(x, y; x', y')$ takes a special form [compare with (5.11.26)], where $m$ and $l$ are the numbers of states of the processes $x(t)$ and $y(t)$, respectively. This means that the combined (composite) measure $Q(x_0^T y_0^T) = Q_1(x_0^T)\,Q_2(y_0^T)$ is described by the matrix
$$ \pi^Q(x, y; x', y') = -2\delta_{xx'}\delta_{yy'} + \frac{1}{m-1}(1 - \delta_{xx'})\delta_{yy'} + \frac{1}{l-1}\delta_{xx'}(1 - \delta_{yy'}), \qquad (6.7.26) $$
which representation resembles (6.7.25).
Further we perform derivations using the formula
$$ i_{xy} = h_{xy}^{P/Q} - h_x^{P/Q_1} - h_y^{P/Q_2}, \qquad (6.7.27) $$
where
$$ h_y^{P/Q_2} = 1 + \mathrm{E}\sum_{y'\neq y}\pi_W(y,y')\,\ln\frac{(l-1)\,\pi_W(y,y')}{e}, \qquad (6.7.28) $$
$$ \pi_W(y,y') = \sum_x W(x)\sum_{x'}\pi(x,y;x',y'). \qquad (6.7.29) $$
Analogously,
$$ h_x^{P/Q_1} = 1 + \mathrm{E}\sum_{x'\neq x}\pi_{\tilde W}(x,x')\,\ln\frac{(m-1)\,\pi_{\tilde W}(x,x')}{e}, \qquad (6.7.30) $$
where
$$ \pi_{\tilde W}(x,x') = \sum_y \tilde W(y)\sum_{y'}\pi(x,y;x',y'). \qquad (6.7.31) $$
The density $h_{xy}^{P/Q}$ is calculated via the methods provided in Section 5.9. In order for this value to be finite we need the special form (6.7.25) of the combined transition probability matrix. With the help of formula (5.9.8) for matrices (6.7.25), (6.7.26) we obtain
$$ h_{xy}^{P/Q} = 2 + \mathrm{E}\sum_{x'\neq x}\pi_1(x,y;x')\bigl\{\ln[(m-1)\,\pi_1(x,y;x')] - 1\bigr\} + \mathrm{E}\sum_{y'\neq y}\pi_2(x,y;y')\bigl\{\ln[(l-1)\,\pi_2(x,y;y')] - 1\bigr\}. \qquad (6.7.32) $$
When substituting (6.7.28), (6.7.30), (6.7.32) into (6.7.27) we take into account that
$$ \mathrm{E}\,\pi_2(x,y;y') = \mathrm{E}\,\mathrm{E}\bigl[\pi_2(x,y;y')\mid y_0^t\bigr] = \mathrm{E}\sum_x \pi_2(x,y;y')\,P(x\mid y_0^t) = \mathrm{E}\,\pi_W(y,y') $$
and analogously
$$ \mathrm{E}\,\pi_1(x,y;x') = \mathrm{E}\,\pi_{\tilde W}(x,x'). $$
The latter equalities allow us to cancel out the terms linear with respect to $\pi_1$ or $\pi_2$ and rewrite the result as follows:
$$ i_{xy} = \sum_{x,y}P(x,y)\Bigl[\sum_{x'\neq x}\pi_1(x,y;x')\,\ln\pi_1(x,y;x') + \sum_{y'\neq y}\pi_2(x,y;y')\,\ln\pi_2(x,y;y')\Bigr] $$
$$ \qquad - \int P(d\tilde W\,dx)\sum_{x'\neq x}\pi_{\tilde W}(x,x')\,\ln\pi_{\tilde W}(x,x') - \int P(dW\,dy)\sum_{y'\neq y}\pi_W(y,y')\,\ln\pi_W(y,y'). \qquad (6.7.34) $$
It is easy to see from the latter formula that the obtained expression is non-negative, due to the positive definiteness of the matrix $\|b^{-1}_{\sigma\rho}\|$.
5. In conclusion of this paragraph we suppose that the process $\{x_t\}$ is Markov by itself, while the process $\{y_t\}$ under a fixed realization of $\{x_t\}$ is a multivariate diffusion process with parameters $a_\rho(x, y, t)$, $b_{\rho\sigma}(y, t)$, $\rho, \sigma = 1, \dots, s$. The matrix of local variances $b_{\rho\sigma}$ is assumed to be independent of $x$ and non-singular. In order to determine the mutual information $I_{xy}^{\tau}$ or $i_{xy}$ in this case we can apply formula (6.7.16), defining each of the two terms in its right-hand side by the corresponding formulae.
Therefore,
$$ a(1) - \sum_z a(z)W(z) = [a(1) - a(2)]\,W(2) = [a(1) - a(2)]\,\frac{1-z}{2}, $$
$$ a(2) - \sum_z a(z)W(z) = [a(2) - a(1)]\,W(1) = [a(2) - a(1)]\,\frac{1+z}{2}. $$
The probability density function $p_{st}(z)$ is specified in clause 2 of Section 5.11. The integral in (6.7.37) can be computed, and according to (5.11.24) we obtain
$$ i_{xy} = \frac{1}{2b}\,[a(1) - a(2)]^2\, K_q\bigl(b\sqrt{\mu\upsilon}\bigr)\Bigl\{2K_q\bigl(b\sqrt{\mu\upsilon}\bigr) + \sqrt{\frac{\upsilon}{\mu}}\,K_{q+1}\bigl(b\sqrt{\mu\upsilon}\bigr) + \sqrt{\frac{\mu}{\upsilon}}\,K_{q-1}\bigl(b\sqrt{\mu\upsilon}\bigr)\Bigr\}^{-1}, $$
$$ q = \frac{1}{2}\,b\,(\upsilon - \mu). $$
The determined mutual information is nothing else but the difference of entropies
(5.11.20), (5.11.22) found earlier.
In conclusion we note that result (6.7.36) is valid not only in the case when the process $\{x_t\}$ is Markov. For its validity it is only required that the conditional process $y_t$ described by the measure $P(dy_a^b\mid x_a^b)$ be diffusional and depend on $x_a^t$ (in a causally randomized sense), so that $H^{P/Q}_{x_t^{t+\tau}|x_a^t y_a^t} = H^{P/Q}_{x_t^{t+\tau}|x_a^t}$. This generalization can be easily derived from formula (6.7.12) (if we replace $x$ with $y$ therein), which is not constrained by the Markov condition for $x_t$. The theory developed in clause 1 (which does not employ Markovian properties) is also useful for the derivation of other results.
Chapter 7
Message transmission in the presence of noise.
Second asymptotic theorem and its various
formulations
In this chapter, we provide the most significant asymptotic results concerning the existence of optimal codes for noisy channels. It is proven that Shannon's amount of information is a bound on the Hartley amount of information that can be transmitted with asymptotically zero probability of error. This is the meaning of the second asymptotic theorem. Further, we provide formulae showing how quickly the probability of decoding error decreases as the block length increases. Contrary to the conventional approach, we represent the above results not in terms of channel capacity (i.e., we do not perform the maximization of the limit amount of information with respect to the probability density of the input variable), but in terms of Shannon's amount of information.
Theorems 7.1, 7.3, 7.5 successively strengthen one another. Such an ordering is chosen to facilitate studying this material. Certainly, each of these theorems makes the previous ones redundant as far as results are concerned. However, the strengthening of the results is achieved by complicating the proofs. There are certain grounds to believe that these complications are not justified by the importance of the strengthening (passing from Theorem 7.3 to the much more complicated Theorem 7.5 has very little influence on the behaviour of the coefficient $\alpha$ in the region where $R$ is close to $I_{xy}$); we present all these theorems for the reader to study any of them if they wish.
All the mentioned results can be extended from the case of blocks of independent
identically distributed random variables to the case of an arbitrary family of infor-
mationally stable random variables. This generalization is performed in a standard
way, and we shall use it only once (in Theorem 7.2).
In spite of the fact that the presentation of this chapter is given in the simplest
discrete version, the results obtained here are of general importance. Their extension
to the general case is concerned essentially only with changing the way of writing
the formulae.
Consider some communication channel. We denote its input variable (at a selected
time moment), which we call a transmitted character or letter, as x. It can assume
discrete values from some set X. It is convenient to suppose that probabilities P(x)
are also given. Then x will serve as a given random variable.
We consider a noisy channel. This means that, for a fixed value of $x$, the variable on the channel output (at a fixed time moment) is random, i.e. it is described by conditional probabilities $P(y \mid x)$. The random variable $y$ can be called a received character or letter. It is assumed that the process of transmitting letter $x$ and receiving letter $y$ can occur many times with the same probabilities $P(x)$, $P(y \mid x)$ (although a generalization to the case of varying probabilities is possible, see Theorem 7.2). Let $n$ letters
constitute a block or word, for instance,
sufficient to select code words at random. These questions and also methods of
decoding will be considered in the present chapter.
Assuming independence of processes of transmitting consecutive letters, it is
easy to write down the probabilities of words (7.1.1) in terms of the initial prob-
abilities
$$ P(\xi) = \prod_{i=1}^{n}P(x_i), \qquad P(\eta\mid\xi) = \prod_{i=1}^{n}P(y_i\mid x_i), \qquad P(\xi,\eta) = P(\xi)\,P(\eta\mid\xi) \qquad (7.1.2) $$
(in a similar case we say that channel [P(ξ ), P(η | ξ )] is an n-th degree of channel
[P(x), P(y | x)]).
We set a goal to transmit M messages consisting of n-character words, i.e. to
transmit the amount of information ln M. We may try to do it the following way. We
select M distinct words
out of all possible words of type ξ = (x1 , . . . , xn ). Their ensemble builds up the code,
which must be known at both ends (transmitting and receiving ones) of the channel.
Each of the $M$ messages is matched with one of the code words (7.1.3); say, the $k$-th message is transmitted by word $\xi_k$. At the same time, the word at the receiving end may be random and scattered because of the noise effect. The probabilities corresponding to it are $P(\eta\mid\xi)$. Having received word $\eta$, a recipient of information cannot yet say precisely which of the two words $\xi_k$ and $\xi_l$ ($l \neq k$) was transmitted, if the probabilities $P(\eta\mid\xi_k)$ and $P(\eta\mid\xi_l)$ are both non-zero. He/she can only speculate about the a posteriori probabilities of this or that code word. If we suppose that the a priori probabilities $P(k)$ of all $M$ messages (and thereby of all code words $\xi_k$) are equal (to $1/M$), then according to Bayes' formula we obtain the a posteriori probability of code word $\xi_k$:
$$ P[\xi_k\mid\eta] = \frac{\frac{1}{M}P(\eta\mid\xi_k)}{\sum_l \frac{1}{M}P(\eta\mid\xi_l)} = \frac{P(\eta\mid\xi_k)}{\sum_l P(\eta\mid\xi_l)}. \qquad (7.1.4) $$
If a recipient of information chooses code word ξk for fixed η , then he/she will re-
ceive a correct message with probability P(ξk | η ) or make an error with probability
1 − P(ξk | η ). In order to minimize the probability of error, the recipient apparently
has to select word ξk , which corresponds to the maximum (out of M possible ones)
probability (7.1.4) or, equivalently, the maximum likelihood function P(η | ξk ). In
this case, the average probability Per of error will be determined by averaging the
probability of error 1 − P(ξk | η ) with respect to η :
$$ P_{\mathrm{er}} = 1 - \mathrm{E}\max_k P(\xi_k\mid\eta) = 1 - \sum_{\eta}\max_k\frac{P(\eta\mid\xi_k)}{\sum_l P(\eta\mid\xi_l)}\,P(\eta) $$
220 7 Message transmission in the presence of noise. Second asymptotic theorem
due to (7.1.4), or
$$ P_{\mathrm{er}} = 1 - \frac{1}{M}\sum_{\eta}\max_k P(\eta\mid\xi_k), \qquad (7.1.5) $$
since
$$ P(\eta) = \sum_l P(\eta\mid\xi_l)\,\frac{1}{M}. $$
The choice of the maximum likelihood function among the functions $P(\eta\mid\xi_k)$ is equivalent to the choice of the minimum of the function
$$ D_0(\xi, \eta) = -I(\xi, \eta) = \ln\frac{P(\eta)}{P(\eta\mid\xi)}. \qquad (7.1.8) $$
Since $P(\eta\mid\xi_k) = \max_l P(\eta\mid\xi_l)$ inside region $G_k$ (by the definition of this region), we have
$$ P_{\mathrm{er}}(k) = 1 - \sum_{\eta\in G_k}\max_l P(\eta\mid\xi_l). \qquad (7.1.10) $$
7.2 Random code and the mean probability of error 221
Fig. 7.1 Schematic diagram of encoding and decoding for a noisy channel
If we average the probability of error
$$ P_{\mathrm{er}} = \frac{1}{M}\sum_{k=1}^{M}P_{\mathrm{er}}(k) \qquad (7.1.11) $$
over all the messages, each of which occurs with probability $1/M$, then the sum in (7.1.10) will extend to the entire sample space of $\eta$ and, consequently, we obtain formula (7.1.5).
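Formula (7.1.5) is easy to evaluate exactly for a toy channel. The sketch below (hypothetical code words and crossover probability, chosen only for illustration) computes the average error probability of maximum-likelihood decoding over a binary symmetric channel:

```python
import itertools

# Hypothetical setup: BSC with crossover p, block length n = 5, M = 4 code words.
p, n = 0.1, 5
codewords = [(0,0,0,0,0), (1,1,1,1,1), (0,0,1,1,1), (1,1,0,0,0)]
M = len(codewords)

def cond_prob(eta, xi):
    d = sum(a != b for a, b in zip(eta, xi))    # Hamming distance
    return p**d * (1 - p)**(n - d)              # P(eta | xi) for the BSC

# Average error probability of maximum-likelihood decoding, formula (7.1.5)
per = 1.0 - sum(max(cond_prob(eta, xi) for xi in codewords)
                for eta in itertools.product((0, 1), repeat=n)) / M
print(per)
```

For the BSC, maximizing $P(\eta\mid\xi_k)$ is the same as choosing the code word at the smallest Hamming distance from $\eta$, so the sum runs over nearest-codeword likelihoods.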
For a fixed decoding rule, such as the optimal rule described in the previous section,
the probability of error (i.e. the probability that a recipient selects a wrong word ξk
different from a word actually transmitted) depends on a chosen code. In order to
decrease the frequency of decoding errors caused by noise, it is desirable to select
code words that are 'dissimilar', lying, in some sense, as 'far' from one another as possible. Because we cannot increase the 'distance' between the code points $\xi_1, \dots, \xi_M$ without decreasing their number $M$, it is desirable to arrange
code points in the space X n of values ξ ‘as uniformly as possible’. The desired
‘uniformity’ is achieved due to the Laws of Large Numbers for large M (and n) if
we select the code points randomly and independently of each other.
The Shannon random code is constructed as follows. Code point $\xi_1$ is obtained by sampling a random variable $\xi$ with probabilities $P(\xi)$. The second point (and likewise the third and the others) is sampled independently of the other points by the same method. Consequently, the second point is an independent random variable with the same probabilities $P(\xi)$. In aggregate, all code points $\xi_1, \dots, \xi_M$ are described by the probability distribution $P(\xi_1)\cdots P(\xi_M)$.
For every fixed code $(\xi_1, \dots, \xi_M)$ obtained by the specified method and a fixed message $k$ there is some probability of decoding error. We denote that probability by $P_{\mathrm{er}}(\,\cdot\mid k, \xi_1, \dots, \xi_M)$. According to (7.1.9) it is equal to
$$ P_{\mathrm{er}}(\,\cdot\mid k, \xi_1,\dots,\xi_M) = 1 - \sum_{\eta:\ P(\eta|\xi_l) < P(\eta|\xi_k)\ \forall l\neq k} P(\eta\mid\xi_k). \qquad (7.2.2) $$
Averaging over the message number $k$ gives
$$ P_{\mathrm{er}}(\,\cdot\mid \xi_1,\dots,\xi_M) = \frac{1}{M}\sum_{k=1}^{M}P_{\mathrm{er}}(\,\cdot\mid k, \xi_1,\dots,\xi_M). $$
In other words,
$$ P_{\mathrm{er}} = \sum_{\xi_k,\eta}P(\xi_k,\eta)\, f\Bigl(\sum_{\xi:\ P(\eta|\xi) < P(\eta|\xi_k)}P(\xi)\Bigr), \qquad (7.2.6) $$
In formulae (7.2.2), (7.2.5), (7.2.6) the sign $<$ does not exclude the sign $=$. The matter is that there can exist 'questionable' points $\eta$ equidistant from several competing code points, say, $P(\eta\mid\xi_l) = P(\eta\mid\xi_k)$ ($l \neq k$). Such points can be attributed equally to region $E_l$ or region $E_k$. If we take into account the ambiguity related to such points, then, as is easy to see, we will have the following inequality instead of (7.2.6):
$$ \sum_{\xi_k,\eta}P(\xi_k,\eta)\,f\Bigl(\sum_{P(\eta|\xi) < P(\eta|\xi_k)}P(\xi)\Bigr) \leqslant P_{\mathrm{er}} \leqslant \sum_{\xi_k,\eta}P(\xi_k,\eta)\,f\Bigl(\sum_{P(\eta|\xi) \leqslant P(\eta|\xi_k)}P(\xi)\Bigr), \qquad (7.2.8) $$
where there is a strict inequality within the summation sign on the left. The contributions of all 'questionable' points are excluded from the left-hand side expression, whereas those contributions are counted multiple times in the right-hand side expression.
Further, we consider the expression $\sum_{P(\eta|\xi)\leqslant P(\eta|\xi_k)}P(\xi)$, which is the argument of the function $f$. The inequality $P(\eta\mid\xi) \leqslant P(\eta\mid\xi_k)$, or $P(\eta\mid\xi)/P(\eta) \leqslant P(\eta\mid\xi_k)/P(\eta)$, is equivalent to the inequalities
$$ e^{I(\xi,\eta)} \leqslant e^{I(\xi_k,\eta)}; \qquad \frac{P(\xi\mid\eta)}{P(\xi)} \leqslant e^{I(\xi_k,\eta)}, \qquad (7.2.9) $$
because
$$ \frac{P(\eta\mid\xi)}{P(\eta)} = \frac{P(\xi\mid\eta)}{P(\xi)} = e^{I(\xi,\eta)}. $$
From the inequality $P(\xi) \leqslant P(\xi\mid\eta)\,e^{-I(\xi_k,\eta)}$ (i.e. from the second inequality of (7.2.9)) we obtain via summation that
$$ \sum_{P(\eta|\xi)\leqslant P(\eta|\xi_k)}P(\xi) \;\leqslant\; e^{-I(\xi_k,\eta)}\sum_{P(\eta|\xi)\leqslant P(\eta|\xi_k)}P(\xi\mid\eta). $$
The given inequality can only be reinforced if the sum in the right-hand side is replaced with one:
$$ \sum_{P(\eta|\xi)\leqslant P(\eta|\xi_k)}P(\xi) \;\leqslant\; e^{-I(\xi_k,\eta)}. $$
Using the latter inequality in the right-hand side of (7.2.8) and taking into account the increasing nature of the function $f$, we find that
$$ P_{\mathrm{er}} \leqslant \sum_{\xi_k,\eta}P(\xi_k,\eta)\,f\bigl(e^{-I(\xi_k,\eta)}\bigr) = \mathrm{E}\,f\bigl(e^{-I(\xi_k,\eta)}\bigr) = \int f\bigl(e^{-I}\bigr)\,dF(I). \qquad (7.2.10) $$
Fig. 7.2 Function f (x) = 1 − (1 − x)M−1 and the majorizing polygonal line
Here we use F(I) = P{I(ξ , η ) < I} to denote the cumulative distribution function
of the information of communication I(ξ , η ) of variables ξ and η having joint dis-
tribution P(ξ , η ). Hence, the derived estimator is expressed only in terms of this
distribution function.
The behaviour of the function $f$ is shown in Figure 7.2. This function increases from 0 to 1 on the interval $0 \leqslant x \leqslant 1$, and its derivatives show that it is concave. Evidently, this function can be majorized by the polygonal curve
$$ f(x) \leqslant \min\,[(M-1)x,\ 1] \leqslant \min\,[Mx,\ 1]. \qquad (7.2.11) $$
After that, formula (7.2.10) takes the form
$$P_{er} \le \int \min\big[M e^{-I},\, 1\big]\, dF(I) \le M \int_{I \ge \ln M} e^{-I}\, dF(I) + \int_{I < \ln M} dF(I), \tag{7.2.12}$$
i.e.
$$P_{er} \le M \int_{\ln M}^{\infty} e^{-I}\, dF(I) + F(\ln M). \tag{7.2.13}$$
The breakpoint I = ln M in (7.2.12) may be substituted by any other point I = λ ;
doing that we can only make the inequality stronger. The derived inequalities will
be used later on.
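The chain of majorizations used above is easy to test numerically. The following sketch (the value of M and the grid are illustrative choices of mine, not values from the text) checks f(x) = 1 − (1 − x)^{M−1} against the polygonal line of (7.2.11):

```python
# Numerical check of the majorization (7.2.11): for the decoding-error
# function f(x) = 1 - (1 - x)^(M-1) one has f(x) <= min[(M-1)x, 1] <= min[Mx, 1]
# on 0 <= x <= 1.  M = 16 and the 101-point grid are illustrative choices.

def f(x: float, M: int) -> float:
    """Probability that at least one of M-1 rival code points ties or beats
    the transmitted one, when each does so independently with probability x."""
    return 1.0 - (1.0 - x) ** (M - 1)

M = 16
for k in range(101):
    x = k / 100.0
    assert f(x, M) <= min((M - 1) * x, 1.0) + 1e-12
    assert min((M - 1) * x, 1.0) <= min(M * x, 1.0) + 1e-12
print("majorization (7.2.11) holds on the grid")
```

The first inequality is just Bernoulli's inequality; the second replaces M − 1 by M.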
7.3 Asymptotic zero probability of decoding error. Shannon's theorem
One important and far from trivial fact is that the average decoding error can be
made as small as desired by an appropriate code selection when increasing the num-
ber of characters n, without diminishing the noise level in the channel and without
decreasing the amount of transmitted information per character. This result was
obtained by Shannon in 1948 [45] (the English original is [38]) and (being formu-
lated in terms of channel capacity) usually goes under the name of Shannon's theorem.
Theorem 7.1. Consider channel [P(y | x), P(x)] and channel [P(η | ξ), P(ξ)], which
is the n-th power of the former (as in Section 7.1, see (7.1.1), (7.1.2)). Further, sup-
pose that the amount ln M of transmitted information increases as n → ∞ according
to the law
$$\ln M = \ln\big[e^{nR}\big] \le nR \tag{7.3.1}$$
(the square brackets mean an integer part), where R is a value independent of n and
satisfying the inequality
$$R < I_{xy} < \infty. \tag{7.3.2}$$
Then there exists a sequence of codes K(n) such that Per → 0 as n → ∞.
Applying the Law of Large Numbers (Khinchin's theorem; see, for instance,
Gnedenko [13], also translated to English [14]) to (7.3.4), we obtain
$$P\Big\{\Big|\frac{1}{n}\, I(\xi, \eta) - I_{xy}\Big| < \varepsilon\Big\} \to 1$$
and thereby
$$P\big\{|I(\xi, \eta) - I_{\xi\eta}| \ge n\varepsilon\big\} \to 0 \quad\text{as } n \to \infty, \tag{7.3.5}$$
for whatever ε > 0.
Next we consider the average error probability with respect to the ensemble of
random codes described in Section 7.2. We suppose that ε = (I_xy − R)/2 = (I_{ξη} −
nR)/2n, so that n(R + ε) = I_{ξη} − nε. It is apparent that ε > 0 due to (7.3.2). Since
$$\min\big[M e^{-I},\, 1\big] \le \begin{cases} M e^{-I} & \text{for } I > I_{\xi\eta} - n\varepsilon,\\ 1 & \text{for } I \le I_{\xi\eta} - n\varepsilon,\end{cases}$$
we have
$$P_{er} \le M \int_{I > n(R+\varepsilon)} e^{-I}\, dF(I) + F(I_{\xi\eta} - n\varepsilon). \tag{7.3.6}$$
Further, we consider the first term in the right-hand side. The inequality
I > nR + nε entails
$$e^{-I} < e^{-nR - n\varepsilon}, \qquad \int_{I > nR + n\varepsilon} e^{-I}\, dF(I) < e^{-nR - n\varepsilon} \int_{I > nR + n\varepsilon} dF(I) \le e^{-nR - n\varepsilon}.$$
Therefore,
$$M \int_{I > n(R+\varepsilon)} e^{-I}\, dF(I) \le M e^{-nR - n\varepsilon} \le e^{-n\varepsilon},$$
since M ≤ e^{nR} by (7.3.1).
This expression converges to zero as n → ∞. The second term in the right-hand side
of (7.3.6) goes to zero because of (7.3.5). Consequently,
Per → 0 as n → ∞. (7.3.7)
Among the ensemble of random codes there must be a code (ξ1n , . . . , ξMn ) whose
error probability does not exceed the average probability.
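To see the mechanism of the bound at work, (7.2.13) can be evaluated for a concrete channel. The sketch below is an illustration of mine, not a construction from the text: it takes the n-th power of a binary symmetric channel with crossover probability p and uniform input, for which the information density I(ξ, η) is a sum of n i.i.d. terms equal to ln 2(1 − p) or ln 2p, so that F(I) is a rescaled binomial distribution.

```python
# Evaluate the random-coding bound (7.2.13) for the n-th power of a binary
# symmetric channel with crossover p and uniform input (illustrative choice).
from math import comb, log, exp

def per_bound(n: int, R: float, p: float) -> float:
    """M * int_{I >= ln M} e^{-I} dF(I) + F(ln M), with M = floor(e^{nR})."""
    M = int(exp(n * R))
    lnM = log(M)
    i_good, i_bad = log(2 * (1 - p)), log(2 * p)   # per-letter densities
    total = 0.0
    for k in range(n + 1):                         # k flipped letters
        I = (n - k) * i_good + k * i_bad
        pk = comb(n, k) * (1 - p) ** (n - k) * p ** k
        total += M * exp(-I) * pk if I >= lnM else pk
    return total

p, R = 0.05, 0.25       # R is below C = ln 2 - h(p), roughly 0.495 nats
bounds = [per_bound(n, R, p) for n in (25, 50, 100)]
print(bounds)           # the bound shrinks as n grows
```

The decay of the printed values with n is exactly the content of the theorem above.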
A. ∞ > I_{ξ^n η^n} → ∞ as n → ∞;
B. the ratio I(ξ^n, η^n)/I_{ξ^n η^n} converges in probability to 1 as n → ∞.
Theorem 7.2 (The general form of the second asymptotic theorem). Let ξ^n, η^n,
n = 1, 2, …, be a sequence of informationally stable random variables. Further, sup-
pose that the amount of transmitted information increases as n → ∞ according to
the law
$$\ln M = \ln\big[e^{(1-\mu) I_{\xi^n\eta^n}}\big],$$
where μ > 0 is independent of n. Then there exists a sequence of codes such that
Per → 0 as n → ∞.

In the proof we repeat the previous argument, now setting
$$\varepsilon = \mu/2. \tag{7.3.11}$$
The second term in the right-hand side converges to zero as n → ∞ due to (7.3.10).
At the same time we can apply (7.3.9) to estimate the corresponding first term as
follows:
$$M \int_{I > (1-\varepsilon) I_{\xi^n\eta^n}} e^{-I}\, dF(I) \;\le\; e^{(1-\mu) I_{\xi^n\eta^n}}\, e^{-(1-\varepsilon) I_{\xi^n\eta^n}} \int_{I > (1-\varepsilon) I_{\xi^n\eta^n}} dF(I) \;\le\; e^{(\varepsilon - \mu) I_{\xi^n\eta^n}}.$$
Therefore, it goes to zero due to (7.3.11) and property A from the definition of
informational stability. Considerations analogous to those in the previous theorem
finish the proof.
7.4 Asymptotic formula for the probability of error
In addition to the results of the previous section, we can obtain stronger results
related to the rate at which the error probability vanishes. It turns out that
the probability of error for satisfactory codes decreases mainly exponentially with the
growth of n:
$$P_{er} \le e^{a - \alpha n}, \tag{7.4.1}$$
where a is a value weakly dependent on n and α is a constant of main interest.
Rather general formulae can be derived for the latter quantity.
1.
Theorem 7.3. Under the conditions of Theorem 7.1 the following inequality is
valid:
$$P_{er} \le 2e^{-[s\mu'(s) - \mu(s)]n}, \tag{7.4.2}$$
where
$$\mu(t) = \ln \sum_{x,y} P^{1-t}(x, y)\, P^t(x)\, P^t(y) \tag{7.4.3}$$
[see (6.4.10), with argument s replaced by t] and s is a positive root of the equation
$$\mu'(s) = -R. \tag{7.4.4}$$
It is also assumed that R is relatively close to I_xy in order for the latter equation to
have a solution. Besides, the value of s is assumed to lie within a differentiability
interval of the potential μ(t).
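As a check on the two equivalent forms of the potential, the sketch below evaluates (7.4.3) and the moment-generating form ln E exp(−t I(x, y)) of (7.4.9) for a binary symmetric channel with uniform input (an illustrative choice of mine) and verifies that μ(0) = μ(1) = 0:

```python
# Check that the definition (7.4.3) of the potential mu(t) agrees with
# mu(t) = ln E exp(-t I(x,y)) of (7.4.9), on a BSC with uniform input.
from math import log, exp

p = 0.1
P = {(x, y): 0.5 * ((1 - p) if x == y else p) for x in (0, 1) for y in (0, 1)}
Px = {x: 0.5 for x in (0, 1)}
Py = {y: 0.5 for y in (0, 1)}

def mu_def(t):   # (7.4.3)
    return log(sum(P[x, y] ** (1 - t) * Px[x] ** t * Py[y] ** t
                   for (x, y) in P))

def mu_mgf(t):   # (7.4.9), with I(x,y) = ln P(x,y) - ln P(x) - ln P(y)
    return log(sum(P[x, y] * exp(-t * (log(P[x, y]) - log(Px[x]) - log(Py[y])))
                   for (x, y) in P))

for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    assert abs(mu_def(t) - mu_mgf(t)) < 1e-12
assert abs(mu_def(0.0)) < 1e-12 and abs(mu_def(1.0)) < 1e-12
print("potential checks passed")
```

The identity holds because P^{1−t}(x,y)P^t(x)P^t(y) = P(x,y) e^{−tI(x,y)}; μ(0) = μ(1) = 0 follows from normalization.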
or
$$P_{er} \le e^{nR} \int_{nR}^{\infty} e^{-I}\, dF(I) + F(nR) \tag{7.4.6}$$
in consequence of (7.3.1).
where
$$e^{n\mu(t)} = \int e^{-tI}\, F(dI), \tag{7.4.8}$$
i.e.
$$e^{n\mu(t)} = \mathrm{E}\, e^{-tI(\xi,\eta)} = \Big[\sum_{x,y} e^{-tI(x,y)}\, P(x, y)\Big]^n, \tag{7.4.9}$$
and
$$\tilde\mu'(\tilde s) = -R, \tag{7.4.13}$$
where $\tilde s < 0$ since $-R < \tilde\mu'(0)$ ($\tilde\mu'(0) = \mu'(1) > 0$).
Therefore,
$$\tilde\mu(\tilde s) = \mu(\tilde s + 1) - \mu(1). \tag{7.4.16}$$
That is why (7.4.13) takes the form
$$\mu'(\tilde s + 1) = -R. \tag{7.4.17}$$
so that $\tilde s + 1 = s$.
2. According to the provided theorem, the potential μ(s) defines the coefficient
$$\alpha = s\mu'(s) - \mu(s) \tag{7.4.19}$$
sitting in the exponent (7.4.1) as the Legendre transform of the characteristic potential.

Fig. 7.3 Typical behaviour of the coefficient $\alpha(R) = \lim_{n\to\infty} \dfrac{-\ln P_{er}}{n}$ appearing in the exponent of formula (7.4.1)
$$\frac{d^2\alpha}{dR^2} = \frac{1}{\mu''(s)}.$$
In particular, we have
$$\frac{d^2\alpha}{dR^2}(I_{xy}) = \frac{1}{\mu''(0)} \tag{7.4.22}$$
at the point R = I_xy, s = 0. As is easy to see from definition (7.4.9) of the function μ(s), here
μ''(0) coincides with the variance of the random information I(x, y).
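A numerical sketch of these relations, for an illustrative binary symmetric channel: solve μ′(s) = −R by bisection, form α = sμ′(s) − μ(s) as in (7.4.19), and check that μ″(0) coincides with the variance of I(x, y):

```python
# Error exponent alpha(R) = s mu'(s) - mu(s) with mu'(s) = -R, and the
# check mu''(0) = Var I(x,y); channel and rate are illustrative (BSC).
from math import log, exp

p = 0.1
pairs = [((x, y), 0.5 * ((1 - p) if x == y else p)) for x in (0, 1) for y in (0, 1)]
I = {xy: log(q) - log(0.5) - log(0.5) for xy, q in pairs}   # information density

def mu(t):
    return log(sum(q * exp(-t * I[xy]) for xy, q in pairs))

def dmu(t, h=1e-6):
    return (mu(t + h) - mu(t - h)) / (2 * h)

Ixy = sum(q * I[xy] for xy, q in pairs)            # mutual information, -mu'(0)
varI = sum(q * I[xy] ** 2 for xy, q in pairs) - Ixy ** 2
d2mu0 = (mu(1e-4) - 2 * mu(0.0) + mu(-1e-4)) / 1e-8
assert abs(d2mu0 - varI) < 1e-4                    # mu''(0) = Var I(x,y)

R = 0.6 * Ixy                                      # an illustrative rate below Ixy
lo, hi = 0.0, 1.0                                  # bisection for mu'(s) = -R
for _ in range(60):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if dmu(mid) < -R else (lo, mid)
s = 0.5 * (lo + hi)
alpha = s * dmu(s) - mu(s)
assert s > 0 and alpha > 0
print(round(alpha, 4))
```

The root exists because μ is convex with μ′(0) = −I_xy < −R; α > 0 then follows from strict convexity.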
These results can be extended to the more general [in comparison with (7.1.1)]
case of arbitrary informationally stable random variables, i.e. Theorem 7.2 can be
strengthened by accounting for the rate of convergence of the error probability.
We will not pursue this point but limit ourselves to indicating that the improve-
ment in question can be realized in a completely standard way. The coefficient α should
not be considered separately from n; instead, we need to operate with the combination
α_n = nα. Analogously, we should consider only the corresponding combinations of
the potentials with n. Only the way of writing the formulae will change in the text above.
Formulae which assess the behaviour of the error probability were derived in the
works of Shannon [42] (the English original is [40]) and Fano [10] (the English
original is [9]).
Theorem 7.4. Suppose that we have a channel [P(η | ξ), P(ξ)] (just as in Theo-
rem 7.1), which is the n-th power of a channel [P(y | x), P(x)]. Let the decoding be
performed on the basis of the minimum distance
$$D(\xi, \eta) = \sum_{j=1}^{n} d(x_j, y_j) \tag{7.5.1}$$
(R < I_xy is independent of n). Then there exists a sequence of codes having the
probability of decoding error
$$P_{er} \le 2e^{-n[s_0 \gamma'(s_0) - \gamma(s_0)]}. \tag{7.5.3}$$
Proof. As earlier, we will consider random codes and average the decoding error
over them.
First, we write down the inequalities for the average error, which are analogous
to (7.2.1)–(7.2.4) but with an arbitrarily assigned distance D(ξ , η ). Now we perform
averaging with respect to η in the last turn:
where
$$\upsilon(k, \xi_1, \dots, \xi_M, \eta) = \begin{cases} 0, & \text{if } D(\xi_l, \eta) > D(\xi_k, \eta) \text{ for all } l \ne k,\\ 1, & \text{if } D(\xi_l, \eta) \le D(\xi_k, \eta) \text{ for at least one } l \ne k, \end{cases} \tag{7.5.9}$$
and
$$F_\eta[\lambda] = \sum_{D(\xi_l, \eta) \le \lambda} P(\xi_l). \tag{7.5.11}$$
We have omitted the index k here because the expressions in (7.5.10), (7.5.12) turn out
to be independent of it. Selecting some boundary value nd (independent of η) and
using the inequality
$$\min\big\{M F_\eta[D(\xi, \eta)],\, 1\big\} \le \begin{cases} M F_\eta[D(\xi, \eta)] & \text{for } D(\xi, \eta) \le nd,\\ 1 & \text{for } D(\xi, \eta) > nd, \end{cases} \tag{7.5.13}$$
Further, we average the latter inequality over η and thereby obtain the estimator
for the average probability of error, where
$$P_2 = \sum_{D(\xi, \eta) > nd} P(\xi, \eta) \tag{7.5.16}$$
and where
$$\gamma'(s_0) = d, \quad s_0 > 0; \qquad n\gamma(s) = \ln \mathrm{E}\, e^{sD(\xi,\eta)}, \quad \gamma(s) = \ln \mathrm{E}\, e^{s\, d(x,y)}. \tag{7.5.19}$$
In order to derive an analogous estimator for it, we need to apply the multivariate
generalization of Theorem 4.6 or 4.7, i.e. formula (4.4.13). This yields
$$P_1 \le e^{-n[r_0 \varphi'_r + t_0 \varphi'_t - \varphi(r_0, t_0)]},$$
where
$$\varphi'_r = \frac{\partial \varphi(r_0, t_0)}{\partial r}, \qquad \varphi'_t = \frac{\partial \varphi(r_0, t_0)}{\partial t}, \tag{7.5.21}$$
$$r_0 < 0, \qquad t_0 < 0,$$
which constitutes the system of equations (7.5.4) together with other equations fol-
lowing from (7.5.19), (7.5.22). Simultaneously, inequality (7.5.24) turns into (7.5.3).
The proof is complete.
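The setting of Theorem 7.4 is easy to simulate. The following Monte Carlo sketch (all parameters are illustrative choices of mine) draws a random code, passes a codeword through a binary symmetric channel, and decodes by minimum Hamming distance D(ξ, η) = Σ_j d(x_j, y_j); for a rate well below capacity the empirical error rate is small:

```python
# Monte Carlo sketch of random coding with minimum-distance decoding
# over a BSC; ties are counted as errors.  Parameters are illustrative.
import random

def simulate(n=60, M=8, p=0.05, trials=400, seed=0):
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        code = [[rng.randint(0, 1) for _ in range(n)] for _ in range(M)]
        k = rng.randrange(M)                        # transmitted message
        y = [b ^ (rng.random() < p) for b in code[k]]
        dists = [sum(a != b for a, b in zip(w, y)) for w in code]
        if dists[k] != min(dists) or dists.count(min(dists)) > 1:
            errors += 1                             # k is not the unique minimizer
    return errors / trials

per = simulate()
print(per)
assert 0.0 <= per < 0.2
```

Here ln M / n is far below capacity, so the empirical error rate is essentially zero; raising M or p moves it up, in the exponential fashion described by (7.5.3).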
2. Now we turn our attention to the particular case when R (< I_xy) is so far from
I_xy that the root r_0 [see equations (7.5.4)] becomes positive. Then it is reasonable to
choose nd = ∞ in inequality (7.5.13), so that the inequality takes the form
$$\min\big\{M F_\eta[D(\xi, \eta)],\, 1\big\} \le M F_\eta[D(\xi, \eta)].$$
which is equal to nφ(0, t) according to (7.5.23). Due to Theorem 4.6 (while substi-
tuting M with e^{nR}, which reinforces the inequality) we obtain
$$P_{er} \le \exp\big\{nR - n\big[t^* \varphi'_t(0, t^*) - \varphi(0, t^*)\big]\big\} = e^{n[\varphi(0, t^*) + R]}. \tag{7.5.26}$$
Next, we address equations (7.5.4). We use R∗ to denote a certain value R, for which
the root r0 equals 0.
The other roots s0 , t0 corresponding to this value are denoted as s∗0 and t0∗ , respec-
tively. The second equation from (7.5.4) takes the form
Comparing (7.5.28) with (7.5.27), we see that t ∗ = t0∗ . So, the third equation
from (7.5.4) can be rewritten as follows:
This result is valid when r_0 > 0 and formula (7.5.3) cannot be used. Taking
into account the character of change of the potentials φ(r, t), γ(s), we can make cer-
tain that the constraint r_0 > 0 is equivalent to the constraint R < R* or s_0 > s_0*.
After introducing the Legendre convex conjugate α(R) of the function γ(s) by equali-
ties (7.4.19), (7.4.20), formulae (7.5.3), (7.5.30) can be represented as follows:
7.5 Enhanced estimators for optimal decoding
$$P_{er} \le \begin{cases} 2e^{-n\alpha(R)} & \text{for } R^* < R < I_{xy},\\ e^{-n[\alpha(R^*) + R^* - R]} & \text{for } R < R^*. \end{cases} \tag{7.5.31}$$
where f(y) is a function which will be specified below. The corresponding distance
D(ξ, η) = Σ_i d(x_i, y_i) is more general than (7.1.8). By introducing the notation
$$\gamma_y(\beta) = \ln \sum_x P^\beta(y \mid x)\, P(x), \tag{7.5.33}$$
$$1 + t - r = -t, \qquad r = 1 + 2t, \tag{7.5.38}$$
in particular r_0 = 1 + 2t_0, since in this case every term of the latter sum turns into
zero. According to (7.5.35), (7.5.36) the first equation of (7.5.4) takes the form
$$e^{-\gamma(s_0)} \sum_y e^{\gamma_y(1-s_0) + s_0 f(y)}\big[f(y) - \gamma'_y(1-s_0)\big] = e^{-\varphi(1+2t_0,\, t_0)} \sum_y e^{2\gamma_y(-t_0) + (1+2t_0) f(y)}\big[f(y) - \gamma'_y(-t_0)\big] \tag{7.5.39}$$
in the given case. In order to satisfy the latter equation we suppose that
$$1 - s_0 = -t_0, \tag{7.5.40}$$
$$\gamma_y(1-s_0) + s_0 f(y) = 2\gamma_y(-t_0) + (1+2t_0)\, f(y). \tag{7.5.41}$$
But the last equation is satisfied by virtue of the same relations (7.5.38), (7.5.40),
(7.5.42), as can be easily verified by substituting them into (7.5.34).
Hence, both equations of system (7.5.4) are satisfied. In consequence of (7.5.38),
(7.5.40), (7.5.43) the remaining equation can be reduced to
$$(1 - s_0)\,\gamma'(s_0) + R = 0. \tag{7.5.44}$$
Differentiating this expression and taking into account (7.5.35), we obtain that equa-
tion (7.5.44) can be rewritten as
$$R = e^{-\gamma(s_0)} \sum_y e^{\frac{\gamma_y(1-s_0)}{1-s_0}}\; \frac{1}{1-s_0}\,\Big[(1-s_0)\,\gamma'_y(1-s_0) - \gamma_y(1-s_0)\Big]. \tag{7.5.46}$$
Since
$$\gamma'_y(1) = \sum_x \frac{P(x, y)}{P(y)} \ln P(y \mid x), \qquad \gamma_y(1) = \ln P(y),$$
so that
$$\gamma'_y(1) - \gamma_y(1) = \frac{1}{P(y)} \sum_x P(x, y) \ln P(y \mid x) - \ln P(y),$$
it follows that
$$\sum_y e^{\gamma_y(1)}\big[\gamma'_y(1) - \gamma_y(1)\big] = H_y - H_{y \mid x} = I_{xy}. \tag{7.5.47a}$$
Analyzing expression (7.5.46), we can see that s_0 > 0 if R < I_xy. Due to
(7.5.38), (7.5.40) the other roots are equal to
$$r_0 = 2s_0 - 1, \qquad t_0 = s_0 - 1. \tag{7.5.48}$$
Apparently, they are negative if 0 < s_0 < 1/2, i.e. if R is sufficiently close to I_xy. If
s_0 exceeds 1/2, so that r_0 becomes positive because of the first equality of (7.5.48),
then we should use formula (7.5.30) instead of (7.5.3), as was said in the previous
clause. Due to (7.5.48), the values s_0*, t_0* obtained from the condition r_0 = 0 turn out to
be the following:
$$s_0^* = 1/2, \qquad t_0^* = -1/2.$$
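A numerical sketch of the critical rate, assuming γ_y(β) as in (7.5.33), the potential γ(s) = ln Σ_y exp[γ_y(1 − s)/(1 − s)] of (7.5.51), and the parametric relation R = −(1 − s_0)γ′(s_0) implied by (7.5.44); for a binary symmetric channel with uniform input one then expects 0 < R* < I_xy, with I_xy recovered at s_0 = 0:

```python
# Critical rate R* = -(1/2) gamma'(1/2) for a BSC with uniform input,
# under the stated parametric relation; derivatives are numerical.
from math import log, exp

p = 0.1
Pyx = {(y, x): (1 - p) if x == y else p for x in (0, 1) for y in (0, 1)}

def gamma_y(y, beta):
    return log(sum(Pyx[y, x] ** beta * 0.5 for x in (0, 1)))

def gamma(s):
    return log(sum(exp(gamma_y(y, 1 - s) / (1 - s)) for y in (0, 1)))

def dgamma(s, h=1e-6):
    return (gamma(s + h) - gamma(s - h)) / (2 * h)

Ixy = log(2) + (1 - p) * log(1 - p) + p * log(p)   # ln 2 - h(p), nats
R_star = -0.5 * dgamma(0.5)                        # R at s0 = 1/2
assert abs(-dgamma(0.0) - Ixy) < 1e-5              # R(s0 = 0) recovers Ixy
assert 0 < R_star < Ixy
print(round(R_star, 4), round(Ixy, 4))
```

The self-check at s_0 = 0 mirrors the remark after (7.5.47a) that the right-hand side of (7.5.46) tends to I_xy.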
The ‘critical’ value R* is obtained from equation (7.5.46) by substituting s_0 = 1/2,
i.e. it turns out to be equal to
$$R^* = -\tfrac{1}{2}\,\gamma'\big(\tfrac{1}{2}\big) = e^{-\gamma(1/2)} \sum_y e^{2\gamma_y(1/2)}\Big[\gamma'_y\big(\tfrac{1}{2}\big) - 2\gamma_y\big(\tfrac{1}{2}\big)\Big]$$
or, equivalently,
$$R^* = \frac{\sum_y e^{2\gamma_y(1/2)}\big[\gamma'_y(\tfrac{1}{2}) - 2\gamma_y(\tfrac{1}{2})\big]}{\sum_y e^{2\gamma_y(1/2)}}, \tag{7.5.49}$$
where we take into account (7.5.35), (7.5.42).
The results derived above can be formulated in the form of a theorem.

Theorem 7.5. Under the conditions of Theorems 7.1, 7.3 there exists a sequence of
codes such that the probability of decoding error satisfies the inequality
$$P_{er} \le \begin{cases} 2e^{n[s_0 R/(1-s_0) + \gamma(s_0)]} & \text{for } R^* \le R < I_{xy},\\ e^{n[\gamma(1/2) + R]} & \text{for } R < R^*, \end{cases} \tag{7.5.50}$$
where
$$\gamma(s_0) = \ln \sum_y e^{\gamma_y(1-s_0)/(1-s_0)} \equiv \ln \sum_y \Big[\sum_x P(x)\, P^{1-s_0}(y \mid x)\Big]^{1/(1-s_0)}. \tag{7.5.51}$$
$$\gamma(s) = \gamma_{10}\, s + \gamma_{20}\, s^2 + \gamma_{11}\, s s_0 + \gamma_{30}\, s^3 + \gamma_{21}\, s^2 s_0 + \gamma_{12}\, s s_0^2 + \cdots. \tag{7.5.52}$$
In this case,
$$\gamma_{10} = -I_{xy} \tag{7.5.53}$$
due to (7.5.47), (7.5.47a).
Substituting (7.5.52) into (7.5.54), we obtain an expansion in powers of s_0.
Taking into account (7.5.53), it follows from the latter formula that s_0 is of the order
of I_xy − R. The substitution of this expression into (7.5.54) allows us to find the value α_0
to within terms of the order of magnitude (I_xy − R)^3.
Coefficients γ_ik can be computed with the help of (7.5.45). For convenience of
computation we transform the latter expression to a somewhat different form by
introducing the conditional characteristic potential of the random information,
μ(t | y) = ln E[e^{−tI(x,y)} | y], so that
$$\gamma_y(1-t) = \mu(t \mid y) + (1-t) \ln P(y). \tag{7.5.58}$$
Here
$$\mu(0 \mid y) = 0, \qquad \mu'(0 \mid y) = -\mathrm{E}[I(x, y) \mid y] = -m,$$
$$\mu''(0 \mid y) = \mathrm{E}[I^2(x, y) \mid y] - \{\mathrm{E}[I(x, y) \mid y]\}^2 = \mathrm{Var}[I(x, y) \mid y] = D,$$
$$\mu'''(0 \mid y) = -k, \qquad \cdots \tag{7.5.59}$$
Taking into account (7.5.59), we can represent the expression situated in the ex-
ponent as a power series in s and s_0. Consequently,
$$\exp\Big[\mu(s \mid y) + \frac{s}{1-s_0}\,\mu(s_0 \mid y)\Big]$$
$$= 1 - ms + \frac{1}{2}m^2 s^2 - \frac{1}{6}m^3 s^3 + \frac{1}{2}D s^2 + \cdots - \frac{1}{6}k s^3 + \cdots - \frac{1}{2}mD s^3 + m^2 s^2 s_0 - m s s_0 + \cdots + \Big(\frac{D}{2} - m\Big) s s_0^2 + \cdots$$
$$= 1 - ms + \frac{1}{2}(D + m^2)\,s^2 - m s s_0 - \frac{1}{6}(k + 3Dm + m^3)\,s^3 + m^2 s^2 s_0 + \Big(\frac{D}{2} - m\Big) s s_0^2 + \cdots.$$
After averaging the latter expression over y according to (7.5.60) and denoting a
mean value by an overline, we will have
$$\gamma(s) = \ln\Big[1 - \overline{m}\, s + \frac{1}{2}\overline{(D + m^2)}\, s^2 - \overline{m}\, s s_0 - \frac{1}{6}\overline{(k + 3Dm + m^3)}\, s^3 + \overline{m^2}\, s^2 s_0 + \overline{\Big(\frac{D}{2} - m\Big)}\, s s_0^2 + \cdots\Big].$$
$$\gamma(s) = -\overline{m}\, s + \frac{1}{2}\big[\overline{D} + \overline{m^2} - (\overline{m})^2\big] s^2 - \overline{m}\, s s_0 - \frac{1}{6}\big[\overline{k} + 3\overline{Dm} + \overline{m^3} + 2(\overline{m})^3 - 3\overline{(D + m^2)}\,\overline{m}\big] s^3 + \big[\overline{m^2} - (\overline{m})^2\big] s^2 s_0 + \overline{\Big(\frac{D}{2} - m\Big)}\, s s_0^2 + \cdots. \tag{7.5.61}$$
equivalent to (7.4.3). That is why the terms with s, s², s³, … in (7.5.61) are automat-
ically proportional to the full cumulants of the information I(x, y).
Further, we substitute this equality into formula (7.5.54), which then takes the form
$$\alpha_0 = \frac{1}{2}\mu''(0)\, s_0^2 + \Big[\frac{1}{2}\mu''(0)\,\overline{m} - \frac{2}{3}\, I_{xy}\Big] s_0^3 + \cdots.$$
Here we have taken into account that, according to the third equality of (7.5.59), it
holds true that
$$\mu''(0) - \overline{D} = \mathrm{E}\,I^2(x, y) - I_{xy}^2 - \overline{D} = \overline{m^2} - I_{xy}^2.$$
Moreover, we compare the last result with formula (7.4.24), which is valid in the
same approximation sense. At the same time we take into account that
$$\overline{D} \le \mu''(0),$$
since the conditional variance D averaged over y does not exceed the regular (non-conditional)
variance μ''(0).
In comparison with (7.4.24), equation (7.5.63) contains additional positive terms,
because of which α0 > α . Therefore, inequality (7.5.50) is stronger than inequal-
ity (7.4.2) (at least for values of R, which are sufficiently close to Ixy ). Thus, Theo-
rem 7.5 is stronger than Theorem 7.3.
A number of other results giving an improved estimation of behaviour of the
probability of decoding error are provided in the book by Fano [10] (the English
original is [9]).
2. Let us now compare the amount of information I_{kη}(| ξ_1 … ξ_M) with I_{ξη}. The former
amount of information is defined by the formula
$$I_{k\eta}(\mid \xi_1, \dots, \xi_M) = H_\eta(\mid \xi_1, \dots, \xi_M) - H_{\eta \mid k}(\mid \xi_1, \dots, \xi_M) = \sum_\eta f\Big(\sum_k P(k)\, P(\eta \mid \xi_k)\Big) - \sum_k P(k)\, H_\eta(\mid \xi_k); \tag{7.6.4}$$
$$f(z) = -z \ln z. \tag{7.6.5}$$
In turn, the information I_{ξη} can be rewritten as
$$I_{\xi\eta} = H_\eta - H_{\eta \mid \xi} = \sum_\eta f\Big(\sum_\xi P(\xi)\, P(\eta \mid \xi)\Big) - \sum_\xi P(\xi)\, H_\eta(\mid \xi). \tag{7.6.6}$$
It is easy to make sure that after averaging (7.6.4) over ξ_1, …, ξ_M the second (sub-
trahend) term coincides with the second term of formula (7.6.6). Indeed, the corresponding expec-
tation equals
$$\sum_{\xi_k} P(\xi_k)\, H_\eta(\mid \xi_k).$$
For the first term we use the concavity of the function f, i.e. the inequality
$$\mathrm{E}[f(\zeta)] \le f(\mathrm{E}[\zeta]).$$
But E[P(η | ξ_k)] = Σ_{ξ_k} P(ξ_k) P(η | ξ_k) does not depend on k and coincides with the
argument of the function f from the first term of relation (7.6.7). Hence, for every η
$$f\Big(\sum_\xi P(\xi)\, P(\eta \mid \xi)\Big) - \mathrm{E}\, f\Big(\sum_k P(k)\, P(\eta \mid \xi_k)\Big) \ge 0.$$
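The Jensen step E f(ζ) ≤ f(E ζ) relies only on the concavity of f(z) = −z ln z from (7.6.5); a minimal numerical illustration (the distribution of ζ is an arbitrary choice of mine):

```python
# Jensen's inequality E f(zeta) <= f(E zeta) for the concave f(z) = -z ln z.
from math import log

def f(z):
    return -z * log(z) if z > 0 else 0.0

zeta = [0.05, 0.2, 0.4, 0.7]          # possible values of zeta
w = [0.1, 0.3, 0.4, 0.2]              # their probabilities
lhs = sum(wi * f(zi) for wi, zi in zip(w, zeta))    # E f(zeta)
rhs = f(sum(wi * zi for wi, zi in zip(w, zeta)))    # f(E zeta)
assert lhs <= rhs + 1e-12
print(round(lhs, 4), round(rhs, 4))
```

Concavity follows from f″(z) = −1/z < 0 on (0, 1].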
where Per(k, ξ_1, …, ξ_M) is the probability of decoding error under the condition that
message k has been transmitted (a code is fixed). At the second stage, if l ≠ k,
then we should point out which of the remaining messages has been received. The
corresponding uncertainty cannot be larger than Per ln(M − 1). Therefore,
Further, we can perform averaging over an ensemble of random codes and anal-
ogously, using (7.6.10) one more time, obtain
Because Ikl|ξ1 ,...,ξM = Hl|ξ1 ,...,ξM − Hl|k,ξ1 ,...,ξM , it follows from (7.6.11) that
The same reasoning is applicable if we switch k and l. Then, in analogy with (7.6.12)
and since
$$e^{nR} - 1 < M \le e^{nR},$$
we will have
$$1 + \frac{\ln(1 - e^{-nR})}{nR} - P_{er} - \frac{\ln 2}{nR} \le \frac{I_{kl \mid \xi_1, \dots, \xi_M}}{nR}.$$
Taking into account (7.6.9) and the relationship I_{ξη} = n I_{xy}, we obtain the resultant
inequality
$$P_{er} \ge 1 - \frac{I_{xy}}{R} - \frac{\ln 2}{nR} + \frac{1}{nR}\,\ln\big(1 - e^{-nR}\big), \tag{7.6.15}$$
which defines a lower bound for the probability of decoding error. As nR → ∞,
inequality (7.6.15) turns into the asymptotic formula
$$P_{er} \gtrsim 1 - \frac{I_{xy}}{R}.$$
It makes sense to use the latter formula when R > I_xy (if R < I_xy, then the inequality
becomes trivial). According to the last formula, errorless decoding certainly
does not take place when R > I_xy, so that the boundary I_xy for R is essential.
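A quick numerical look at the converse bound (7.6.15), with illustrative values of I_xy and a rate R > I_xy; as n grows, the bound approaches 1 − I_xy/R:

```python
# Evaluate the converse (Fano-type) lower bound (7.6.15) on the decoding
# error for rates above Ixy.  Ixy and R are illustrative values.
from math import log, exp

def per_lower_bound(Ixy, R, n):
    """1 - Ixy/R - ln2/(nR) + ln(1 - e^{-nR})/(nR), the bound (7.6.15)."""
    return 1.0 - Ixy / R - log(2) / (n * R) + log(1.0 - exp(-n * R)) / (n * R)

Ixy = 0.37                        # nats per letter, e.g. a BSC with p near 0.1
for n in (50, 200, 1000):
    print(n, round(per_lower_bound(Ixy, R=0.5, n=n), 4))
```

The n-dependent correction terms are negative and vanish like 1/n, so the bound increases towards its limit 1 − I_xy/R.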
4. Uniting formulae (7.6.9), (7.6.14) and replacing the factor at Per with ln M, we
will obtain the result.
It can be concluded from the previous theorems in this chapter that one can
increase I and perform encoding and decoding in such a way that I_{ξη}/I → 1
as n → ∞ and, at the same time, Per → 0. Then, apparently, the length of the interval
$$\Big[1 - P_{er} - \frac{\ln 2}{I},\ \frac{I_{\xi\eta}}{I}\Big]$$
tends to zero. This means that with increasing n the following approximations are valid with a
greater degree of accuracy:
$$1 \approx I_{kl}/I \approx I_{k\eta}/I \approx I_{\xi\eta}/I$$
or
$$I/n \approx I_{kl}/n \approx I_{k\eta}/n \approx I_{\xi\eta}/n. \tag{7.6.19}$$
These asymptotic relations generalize equalities (6.1.17) concerned with simple
noise (disturbances). Thus, arbitrary noise can be regarded as asymptotically equiv-
alent to simple noise. Index l of code region Gl is an asymptotically sufficient coor-
dinate (see Section 6.1).
Just as in the case of simple noise (Section 6.1), where the use of Shannon's
amount of information
$$I_{xy} = H_x - H_{x \mid y} \tag{7.6.20}$$
was justified (according to (6.1.17)) by its ability to be reduced to the simpler 'Boltz-
mann' amount of information
$$H_k = -\sum_k P(k) \ln P(k),$$
in the case of arbitrary noise the use of the information amount (7.6.20) is most con-
vincingly justified by the asymptotic equality I_{ξη}/ln M ≈ 1.
Chapter 8
Channel capacity. Important particular cases of channels
This chapter is devoted to the second variational problem, in which we try to find
an extremum of the Shannon’s amount of information with respect to different input
distributions. We assume that the channel, i.e. a conditional distribution on its output
with a fixed input signal, is known. The maximum amount of information between
the input and output signals is called channel capacity. Contrary to the conventional
presentation, from the very beginning we introduce an additional constraint con-
cerning the mean value of some function of input variables, i.e. we consider a con-
ditional variational problem. Results for the case without the constraint are obtained
as a particular case of the provided general results.
Following the presentation style adopted in this book, we introduce potentials, in
terms of which a conditional channel capacity is expressed. We consider more thoroughly
a number of important particular cases of channels, for which it is possible
to derive explicit results. For instance, in the case of Gaussian channels, general
formulae are obtained in matrix form using matrix techniques.
In this chapter, the presentation concerns mainly the case of discrete random
variables x, y. However, many considerations and results can be generalized directly
by changing notation (for example, via substituting P(y | x), P(x) by P(dy | x), P(dx)
and so on).
8.1 Definition of channel capacity

In the previous chapter, it was assumed that not only the noise inside a channel (described by conditional probabilities P(y | x)) is statistically defined, but also the signals
on a channel’s input, which are described by a priori probabilities P(x). That is
why the system characterized by the ensemble of distributions [P(y | x), P(x)] (or,
equivalently, by the joint distribution P(x, y)) was considered as a communication
channel.
Usually, the distribution P(x) is not an inherent part of a real communication
channel as distinct from the conditional distribution P(y | x). Sometimes it makes
© Springer Nature Switzerland AG 2020
R. V. Belavkin et al. (eds.), Theory of Information and its Value,
https://doi.org/10.1007/978-3-030-22833-0_8
sense not to fix the distribution P(x) a priori but just to fix some technically impor-
tant requirements, say of the form
$$a_1 \le \sum_x c(x)\, P(x) \le a_2. \tag{8.1.1}$$
In this case the channel capacity is defined as
$$C = \sup_{P(x)} I_{xy}, \tag{8.1.3}$$
where the maximization is considered over all P(x) compliant with condition (8.1.1)
or (8.1.2).
As a result of the specified maximization, we can find the optimal distribution
P0 (x) for which
C = Ixy (8.1.4)
or at least an ε-optimal P_ε(x) for which
$$0 \le C - I_{xy} < \varepsilon,$$
where ε is arbitrarily small. After that we can consider the system [P(y | x), P0 (x)] or
[P(y | x), Pε (x)] and apply the results of the previous chapter to it. Thus, Theorem 7.1
induces the following statement.
Theorem 8.1. Suppose there is a stationary channel, which is the n-th power of the
channel [P(y | x), c(x)]. Suppose that the amount ln M of transmitted information
grows with n → ∞ according to the law
$$\ln M = \ln\big[e^{nR}\big], \qquad R < C, \tag{8.1.5}$$
where C < ∞ is the capacity of the channel [P(y | x), c(x)]. Then there exists a sequence
of codes such that
Per → 0 as n → ∞.
In order to derive this theorem from Theorem 7.1, evidently, it suffices to select
a distribution P(x) consistent with condition (8.1.1) or (8.1.2) in such a way
that the inequality R < I_xy ≤ C holds. This can be done in view of (8.1.3), (8.1.5).
In a similar way, the other results of the previous chapter obtained for the chan-
nels [P(y|x), P(x)] can be extended to the channels [P(y|x), c(x)]. We shall not dwell
on this any longer.
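In the simplest unconstrained case the maximization in (8.1.3) can be done by brute force. The sketch below (a binary symmetric channel with an illustrative crossover eps) searches over input distributions on a grid; the maximum sits at the uniform input and equals ln 2 − h(eps) nats:

```python
# Brute-force maximization of Ixy over input distributions for a BSC.
from math import log

eps = 0.1

def h(q):
    return 0.0 if q in (0.0, 1.0) else -q * log(q) - (1 - q) * log(1 - q)

def Ixy(px0):
    """Mutual information for the input distribution (px0, 1 - px0)."""
    py0 = px0 * (1 - eps) + (1 - px0) * eps
    return h(py0) - h(eps)            # H(y) - H(y|x), and H(y|x) = h(eps)

best = max((Ixy(k / 1000.0), k / 1000.0) for k in range(1001))
C_exact = log(2) - h(eps)
assert abs(best[0] - C_exact) < 1e-4
assert abs(best[1] - 0.5) < 1e-3      # maximizer is the uniform input
print(round(best[0], 4))
```

For symmetric channels the optimal input is uniform; the general case is the second variational problem treated next.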
According to definition (8.1.3) of the capacity of a noisy channel, its calculation
reduces to solving a certain extremum problem. An analogous situation was en-
countered in Sections 3.2, 3.3 and 3.6, where we considered the capacity of a noiseless
channel. The difference between the two cases is that entropy is maximized in
the first one, whereas the Shannon's amount of information is maximized in the
other. In spite of this difference, the two extremum problems have a lot in common.
In order to distinguish them, the latter problem will be called
the second extremum problem of information theory.
The extremum (8.1.3) is usually achieved at the boundary of the feasible range (8.1.1)
of average costs. Thereby, condition (8.1.1) can be replaced with a one-sided in-
equality of type (8.1.2) or even with the equality
E[c(x)] = a (8.1.6)
in the generalized version (see Section 6.4). Therefore, we compare distinct distri-
butions P(·) on F1 , which satisfy a condition of type (8.1.1), (8.1.2) or (8.1.6).
8.2 Solution of the second variational problem. Relations for channel capacity and potential

1. We use X to denote the space of values which an input variable x can take. For
the extremum distribution P_0(dx) corresponding to capacity (8.1.3), the probability
can be concentrated only in a part of the indicated space. Furthermore, let us denote
by X̃ the minimal subset X̃ ⊂ X for which P_0(X̃) = 1 (i.e. P_0(X − X̃) = 0). We shall
call it an 'active domain'.
When solving the extremum problem we suppose, for convenience, that x is a discrete
variable. Then we can consider the probabilities P(x) of individual points x and
take partial derivatives with respect to them. Otherwise, we would have to introduce
variational derivatives, which is associated with some complications, though not
ones of a fundamental nature.
We try to find a conditional extremum with respect to P(x) of the expression
$$I_{xy} = \sum_{x,y} P(x)\, P(y \mid x) \ln \frac{P(y \mid x)}{\sum_x P(x)\, P(y \mid x)}$$
under the conditions
$$\sum_x c(x)\, P(x) = a, \tag{8.2.1}$$
$$\sum_x P(x) = 1. \tag{8.2.2}$$
We need not impose the non-negativity constraint on the probabilities P(x) for now;
its satisfaction can be checked after the problem has been solved.
Introducing the indefinite Lagrange multipliers −β, 1 + βφ, which will be deter-
mined later from constraints (8.2.1), (8.2.2), we construct the expression
$$K = \sum_{x,y} P(x)\, P(y \mid x) \ln \frac{P(y \mid x)}{\sum_x P(x)\, P(y \mid x)} - \beta \sum_x c(x)\, P(x) + (1 + \beta\varphi) \sum_x P(x). \tag{8.2.3}$$
We will seek its extremum by varying the values P(x), x ∈ X̃, corresponding to the active
domain X̃. Equating the partial derivative of (8.2.3) with respect to P(x) to zero, we obtain the
equation
$$\sum_y P(y \mid x) \ln \frac{P(y \mid x)}{P(y)} - \beta c(x) + \beta\varphi = 0 \quad\text{for } x \in \widetilde X, \tag{8.2.4}$$
where
$$P(y) = \sum_x P(x)\, P(y \mid x). \tag{8.2.5}$$
Multiplying (8.2.4) by P(x) and summing over x with account of (8.2.1), (8.2.2),
we have Ixy = β(a − φ), i.e.
$$C = \beta(a - \varphi). \tag{8.2.6}$$
This relation allows us to exclude φ from equation (8.2.4) and thereby rewrite it in
the form
$$\sum_y P(y \mid x) \ln \frac{P(y \mid x)}{P(y)} = C - \beta a + \beta c(x) \quad\text{for } x \in \widetilde X. \tag{8.2.7}$$
Formulae (8.2.7), (8.2.1), (8.2.2) constitute the system of equations that serves
for a joint determination of the variables C, β, P(x) (x ∈ X̃), if the region X̃ is already
selected. For a proper selection of this region, solving the specified equations will
yield positive probabilities P(x), x ∈ X̃.
We can multiply the main equation (8.2.4) or (8.2.7) by P(x) and thus rewrite it
in the form
$$\sum_y P(x)\, P(y \mid x) \ln \frac{P(y \mid x)}{P(y)} = \big[C - \beta a + \beta c(x)\big]\, P(x), \tag{8.2.8}$$
which is convenient because it is valid for all values x ∈ X and not only for x ∈ X̃.
Equation (8.2.7) itself does not necessarily hold true beyond the region X̃.
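The extremum condition can be verified directly in the unconstrained case β = 0, where (8.2.7) reduces to requiring Σ_y P(y|x) ln[P(y|x)/P(y)] = C for every active x. A minimal check for a binary symmetric channel with uniform input (an illustrative symmetric example of mine):

```python
# Check the extremum condition: at the capacity-achieving input of a BSC,
# the divergence sum_y P(y|x) ln[P(y|x)/P(y)] is the same for both x
# and equals the capacity C = ln 2 - h(p) in nats.
from math import log

p = 0.2
Pyx = {(y, x): (1 - p) if y == x else p for x in (0, 1) for y in (0, 1)}
Px = {0: 0.5, 1: 0.5}
Py = {y: sum(Pyx[y, x] * Px[x] for x in (0, 1)) for y in (0, 1)}

div = {x: sum(Pyx[y, x] * log(Pyx[y, x] / Py[y]) for y in (0, 1))
       for x in (0, 1)}
C = log(2) + (1 - p) * log(1 - p) + p * log(p)   # ln 2 - h(p)

assert abs(div[0] - div[1]) < 1e-12
assert abs(div[0] - C) < 1e-12
print(round(C, 4))
```

Equality of the divergences across active inputs is precisely the content of (8.2.7) with β = 0.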
It is not difficult to write down a generalization of the provided equations for the
case when the random variables are not discrete. Instead of (8.2.7), (8.2.5) we will have
$$\int P(dy \mid x) \ln \frac{P(dy \mid x)}{P(dy)} = C - \beta a + \beta c(x), \quad x \in \widetilde X, \tag{8.2.9}$$
$$P(dy) = \int P(dy \mid x)\, P(dx).$$
In this case, the desired distribution turns out to be P(dx) = p(x) dx, x ∈ X̃.
2. Coming back to the discrete version, we prove the following statement.
Theorem 8.2. A solution to equations (8.2.7), (8.2.1), (8.2.2) corresponds to the
maximum of the information I_xy with respect to variations of the distribution P(x) that leave
the active domain X̃ invariant.

where β, φ do not vary. Indeed,
$$-\sum_{x,x'} \sum_y \frac{P(y \mid x)\, P(y \mid x')}{P(y)}\, f(x)\, f(x') = -\sum_y \frac{1}{P(y)} \Big[\sum_x P(y \mid x)\, f(x)\Big]^2 \le 0$$
(since P(y) ≥ 0), so this matrix is negative semi-definite and K is a concave function
of P(x), x ∈ X̃.
C = −d ϕ (T )/dT. (8.2.11)
Proof. We will vary the parameter β = 1/T or, equivalently, the parameter a inside con-
straint (8.2.1). This variation is accompanied by variations of the parameter φ and of the dis-
tribution P(x). We take an arbitrary point x of the active domain X̃. If the variation
da is not too large, then the condition P(x) + dP(x) > 0 will remain valid af-
ter the variation, i.e. x will belong to the varied active domain. An equality of
type (8.2.4) holds true at the point x before and after the variation. We differentiate it and
obtain the equation for variations
$$-\sum_{y \in Y} P(y \mid x)\, \frac{dP(y)}{P(y)} - \big[c(x) - \varphi\big]\, d\beta + \beta\, d\varphi = 0. \tag{8.2.12}$$
Here the summation is carried out over the region Y where P(y) > 0. Further, we
consecutively multiply the latter equality by P(x) and sum over x. Taking into ac-
count (8.2.5), (8.2.1) and keeping the normalization constraint
$$\sum_y dP(y) = d \sum_y P(y) = d1 = 0,$$
we obtain
$$C = \beta^2 \frac{d\varphi}{d\beta}, \tag{8.2.13}$$
$$-T \frac{d\varphi}{dT}(T) + \varphi(T) = a, \quad\text{i.e.}\quad R = a, \tag{8.2.15}$$
serving to determine the quantity T.
It is convenient to consider formula (8.2.14) as the Legendre transform of the func-
tion φ(T):
$$R(S) = TS + \varphi(T(S)), \qquad S = -\frac{\partial\varphi}{\partial T}.$$
Then, according to (8.2.15), the capacity C will be a root of the equation
$$R(C) = a. \tag{8.2.16}$$
$$\frac{d\Gamma}{d\beta}(\beta) = -a. \tag{8.2.18}$$
$$\frac{d^2 C(a)}{da^2} \le 0. \tag{8.2.21}$$
Proof. Let the variation da correspond to variations dP(x) and dP(y) = Σ_x P(y | x) dP(x).
We multiply (8.2.12) by dP(x) and sum it over x ∈ X̃, taking into account that
$$\sum_x dP(x) = 0, \qquad \sum_x c(x)\, dP(x) = da.$$
This gives
$$\sum_{y \in Y} \frac{[dP(y)]^2}{P(y)} + da\, d\beta = 0. \tag{8.2.22}$$
Apparently, the first term herein cannot be negative. That is why da dβ ≤ 0. Divid-
ing this inequality by the positive quantity (da)², we have
$$d\beta/da \le 0. \tag{8.2.23}$$
8.2 Solution of the second variational problem. Relations for channel capacity and potential 257
The desired inequality (8.2.21) follows from here if we take into account that
β = dC/da according to (8.2.20). The proof is complete.
As is seen from (8.2.22), the equality sign in (8.2.21), (8.2.23) relates to the case
when all dP(y)/da = 0 within region Y .
A typical behaviour of the curve C(a) is represented in Figure 8.1. In consequence
of relation (8.2.20), which can be written as dC/da = β, the maximum point of
the function C(a) corresponds to the value β = 0. For this particular value of β, equa-
tion (8.2.7) takes the form
$$\sum_{y \in Y} P(y \mid x) \ln \frac{P(y \mid x)}{P(y)} = C. \tag{8.2.24}$$
Here c(x) and a are completely absent. Equation (8.2.24) corresponds to the channel
P(y | x) defined without accounting for conditions (8.1.1), (8.1.2), (8.1.6). Indeed,
solving the variational problem with no account of condition (8.1.6) leads exactly
to equation (8.2.24), which needs to be complemented with constraint (8.2.2). We
denote this maximum value of C by C_max.
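For an arbitrary discrete channel the unconstrained maximum C_max can be computed iteratively. Below is a minimal sketch of the standard Blahut–Arimoto scheme, which is not described in the text; the channel matrix is an illustrative asymmetric example of mine, and the result is in nats:

```python
# Minimal Blahut-Arimoto iteration for the capacity of a discrete channel
# given by rows W[x][y] = P(y|x); the channel matrix is illustrative.
from math import log, exp

W = [[0.8, 0.1, 0.1],       # P(y|x) for x = 0
     [0.05, 0.15, 0.8]]     # P(y|x) for x = 1

def blahut_arimoto(W, iters=500):
    nx, ny = len(W), len(W[0])
    p = [1.0 / nx] * nx                    # start from the uniform input
    for _ in range(iters):
        q = [sum(p[x] * W[x][y] for x in range(nx)) for y in range(ny)]
        # D(x) = sum_y P(y|x) ln [P(y|x)/P(y)], then reweight p by exp(D)
        D = [sum(w * log(w / qy) for w, qy in zip(W[x], q) if w > 0)
             for x in range(nx)]
        p = [p[x] * exp(D[x]) for x in range(nx)]
        s = sum(p)
        p = [v / s for v in p]
    q = [sum(p[x] * W[x][y] for x in range(nx)) for y in range(ny)]
    C = sum(p[x] * sum(w * log(w / qy) for w, qy in zip(W[x], q) if w > 0)
            for x in range(nx))
    return C, p

C, p_opt = blahut_arimoto(W)
assert 0.0 < C < log(2)
print(round(C, 4))
```

Each iteration reweights P(x) by the exponential of the divergence D(x); the fixed point satisfies exactly the condition (8.2.24) on the active domain.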
Now we discuss corollaries of Theorem 8.4. The function a = R(C) is the inverse
function of C(a). Therefore, it is defined only for C ≤ C_max and is two-
valued (at least on some interval adjacent to C_max) if the sign < takes place in (8.2.21)
for β = 0. It holds for one branch that
$$\frac{dR(C)}{dC} > 0, \quad\text{i.e.}\quad T > 0 \text{ or } \beta > 0.$$
We call this branch normal. For the other, anomalous branch we have
$$\frac{dR(C)}{dC} < 0, \qquad \beta < 0.$$
It is easy to comprehend that the normal branch is convex:
$$\frac{d^2 R(C)}{dC^2} \equiv \frac{dT}{dC} > 0 \qquad (C < C_{\max}).$$
In its turn, the anomalous branch is concave:
$$\frac{d^2 R(C)}{dC^2} \equiv \frac{dT}{dC} < 0 \qquad (C < C_{\max}),$$
as follows from the concavity of the function C(a).
If we consider the function φ(T), which is the Legendre transform of R(C):
$$\varphi(T) = -CT + R(C(T)), \qquad T = \frac{dR}{dC}$$
[see (8.2.15)], then its normal and anomalous branches will be convex and concave,
respectively, since
$$\frac{d\varphi}{dT} = -C, \qquad \frac{d^2\varphi}{dT^2} = -\frac{dC}{dT}.$$
1. If
$$a_1 \le R(C_{\max}) \le a_2,$$
then, evidently, the maximum value of channel capacity
$$C = C_{\max}$$
is feasible. In this case fixing the constraint (8.1.1) does not result in a decrease of
the channel capacity.
2. If
$$a_2 \le R(C_{\max}),$$
then the interval [a_1, a_2] is related to the normal branch of the dependence be-
tween C and a. Here the function C(a) is non-decreasing and, consequently, the ca-
pacity equals
$$C = C(a_2).$$
3. If
$$R(C_{\max}) < a_1,$$
then we need to consider the anomalous branch. Since the function C(a) is non-
increasing for a > R(C_max), the maximum value of C(a) is attained at a = a_1, i.e.
$$C = C(a_1).$$
8.3 The type of optimal distribution and the partition function

The presentation in this section will be less general than the results of the previous
section. It is restricted by the existence condition of the inverse linear transformation
L^{-1} described below.
As is seen from the form of equations (8.2.8), (8.2.9), their solution can be easily
written if we can find the transformation L^{-1} inverse to the transformation
$$L f = \sum_y P(y \mid x)\, f(y) \quad\text{or}\quad L f = \int p(y \mid x)\, f(y)\, dy. \tag{8.3.2}$$
The inverse transformation (8.3.1) can be expressed with the help of the kernel
Q(y, dx) = q(y, x) dx as follows:
$$L^{-1} g = \int Q(y, dx)\, g(x)$$
or
$$L^{-1} g = \sum_x Q(y, x)\, g(x), \qquad L^{-1} g = \int q(y, x)\, g(x)\, dx. \tag{8.3.3}$$
For simplicity, let us stop at the discrete version. Then ‖Q(y, x)‖ = ‖P(y | x)‖⁻¹ will be a matrix, which is the inverse of the matrix ‖P(y | x)‖, y ∈ Y, x ∈ X.
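This matrix inversion, and the equality ∑_x Q(y, x) = 1 used later in this section, can be checked numerically. A minimal sketch (the 2×2 transition probabilities are illustrative, not from the text): since each row of ‖P(y | x)‖ sums to one, the rows of its inverse must also sum to one.

```python
# Invert a 2x2 row-stochastic channel matrix P(y|x) by hand and verify
# that the rows of Q = P^{-1} also sum to one (P 1 = 1 implies P^{-1} 1 = 1).

P = [[0.9, 0.1],   # P(y | x = 1); illustrative values
     [0.2, 0.8]]   # P(y | x = 2)

det = P[0][0] * P[1][1] - P[0][1] * P[1][0]
Q = [[ P[1][1] / det, -P[0][1] / det],
     [-P[1][0] / det,  P[0][0] / det]]

row_sums = [sum(row) for row in Q]   # both should equal 1
```

The same property holds in any dimension, which is what justifies treating g(x) = 1 and f(y) = 1 as corresponding functions under L and L⁻¹.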
With the help of the same (transposed) matrix we can also write down the trans-
formation
G(x) = F(y)L⁻¹ = ∑_y F(y)Q(y, x), (8.3.4)
which is an inverse of
or
∑_y P(y | x)[ln P(y) + C − βa] = −βc(x) − H_y(| x),   x ∈ X, (8.3.5)
i.e.

P(y) = exp{βa − C − ∑_{x∈X} Q(y, x)[βc(x) + H_y(| x)]}. (8.3.6)

The left-hand side can be represented in the form ∑_{x∈X} P(x)P(y | x). Consequently,

P(x)L = exp{βa − C − ∑_{x′} Q(y, x′)[βc(x′) + H_y(| x′)]},

i.e.

P(x) = exp{βa − C − ∑_{x′} Q(y, x′)[βc(x′) + H_y(| x′)]} L⁻¹. (8.3.7)
The unknown parameters C, β are determined with the help of (8.2.1), (8.2.2). Summing (8.3.7) over x and taking into account that the left-hand side turns into one, we obtain the equation

∑_y exp{−∑_{x′} Q(y, x′)[βc(x′) + H_y(| x′)]} = e^{C−βa}, (8.3.8)

which is equivalent to (8.2.2). Note that we have used the constraint ∑_x Q(y, x) = 1, because due to (8.3.2), (8.3.3) g(x) = 1 corresponds to the function f(y) = 1 and vice versa. Moreover, substituting (8.3.7) into (8.2.1), we have

∑_y ∑_{x∈X} exp{βa − C − ∑_{x′} Q(y, x′)[βc(x′) + H_y(| x′)]} Q(y, x)c(x) = a. (8.3.9)
Then expression (8.3.8) takes the form of the usual partition function defined by formulae (3.3.11), (3.6.4) earlier. Here b(y) and ν(y) are analogs of energy and 'degree of degeneracy', respectively. Equation (8.3.11) can be rewritten as follows:

a = ∑_y Z⁻¹ e^{−βb(y)} b(y)ν(y) = −d ln Z/dβ. (8.3.12)
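Relation (8.3.12) can be checked numerically. A small sketch (the 'energies' b(y) and 'degeneracies' ν(y) are arbitrary illustrative numbers): the ensemble average of b must agree with the finite-difference derivative −Δ ln Z/Δβ.

```python
import math

b  = [0.5, 1.0, 2.0]   # illustrative 'energies' b(y)
nu = [1.0, 2.0, 1.0]   # illustrative 'degrees of degeneracy' nu(y)

def log_z(beta):
    # ln Z(beta) = ln sum_y nu(y) exp(-beta b(y))
    return math.log(sum(n * math.exp(-beta * e) for e, n in zip(b, nu)))

beta = 1.3
Z = math.exp(log_z(beta))
# Ensemble average of b, as in (8.3.12)
a = sum(e * n * math.exp(-beta * e) for e, n in zip(b, nu)) / Z

# Central finite difference of -d ln Z / d beta
h = 1e-6
a_fd = -(log_z(beta + h) - log_z(beta - h)) / (2 * h)
```

The agreement is the usual statistical-mechanics identity: differentiating the log-partition function produces the mean energy.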
Γ(β₀) − β₀ (dΓ/dβ)(β₀) = 0,   a₀ = −(dΓ/dβ)(β₀),   a = −(dΓ/dβ)(β). (8.3.14)

That is,

β₀ = (dC/da)(a₀),   β = (dC/da)(a).
Proof. For H_y(| x) = H_{y|x} it follows from (8.3.10), (8.3.11), in consequence of the equality ∑_x Q(y, x) = 1 (already mentioned before), that

Γ(β) = −H_{y|x} + Γ₀(β), (8.3.15)

where

Γ₀(β) = ln ∑_y e^{−βb(y)}.

Its Legendre transform

S₀(a) = Γ₀(β) + βa,   a = −dΓ₀(β)/dβ, (8.3.16)

is a concave function. If we take into account (8.3.15), we find from (8.2.18), (8.2.19) that

C + H_{y|x} = Γ₀ − β dΓ₀/dβ,   dΓ₀/dβ = −a, (8.3.17)
and according to (8.3.16)
C + Hy|x = S0 (a). (8.3.18)
In consequence of the indicated concavity, the ratio [S₀(a) − S₀(a₀)]/(a − a₀) belongs to the interval between (dS₀/da)(a) = β and (dS₀/da)(a₀) = β₀. Selecting a₀ from the condition S₀(a₀) = H_{y|x}, we obtain the statement of the theorem due to (8.3.18), (8.2.19). The proof is complete.
We point out one corollary of the given theorem. If a ≥ a₀ (and thereby β ≤ β₀), then we obtain

β(a − a₀) ≤ C ≤ β₀(a − a₀). (8.3.19)

In particular,

C ≤ β₀ a (8.3.20)

if also β₀a₀ > 0.
H ≡ −∑_j p_j ln p_j.
8.4 Symmetric channels
It is not difficult to understand that (8.4.2) coincides with H_{y|x}. Therefore, formula (8.3.5) can be represented as

∑_y P(y | x) ln[∑_x P(x)P(y | x)] + C − βa + H_{y|x} = −βc(x)   (x = x₁, …, x_r),

or, equivalently,

∑_y P(y | x) ln[∑_x P(x)P(y | x)] + C + H_{x|y} = 0   (x = x₁, …, x_r).
Ixy = Hy − H.
∑_x P(y | x) = ∑_j p_j
C = ln r − H. (8.4.7)
Since

r = ∑_{j=1}^{r} p_j · (1/p_j) = E[1/p_j], (8.4.8)

where

r = ∫ [Q(dy)/P(dy)] P(dy) = ∫ Q(dy).

Instead of (8.4.9) we have

C = ln E[Q(dy)/P(dy)] − E[ln(Q(dy)/P(dy))].
8.5 Binary channels

According to (8.4.5), (8.4.6), the capacity of such a channel is attained at the uniform distributions

P(x) = 1/2,   P(y) = 1/2. (8.5.2)

Due to (8.4.7) it is equal to

C = ln 2 − h₂(p), (8.5.3)

where h₂(p) = −p ln p − (1 − p) ln(1 − p).
α = −sR − ln[p^{1−s} + (1 − p)^{1−s}] + s ln 2
  = −(1 − s) [p^{1−s} ln p + (1 − p)^{1−s} ln(1 − p)] / [p^{1−s} + (1 − p)^{1−s}] − ln[p^{1−s} + (1 − p)^{1−s}] + ln 2 − R. (8.5.7)
α = ρ ln ρ + (1 − ρ ) ln(1 − ρ ) + ln 2 − R = ln 2 − h2 (ρ ) − R. (8.5.8)
ρ ln p + (1 − ρ ) ln(1 − p) = R − ln 2. (8.5.9)
Accounting for (8.5.9), (8.5.3) it is easy to see that the condition R < C means the
inequality
ρ ln p + (1 − ρ ) ln(1 − p) < p ln p + (1 − p) ln(1 − p)
or

(ρ − p) ln[p/(1 − p)] < 0,

i.e. ρ > p, since p/(1 − p) < 1 (because p < 1/2). As is seen from (8.5.9), the value

ρ = ln[4p(1 − p)] / ln[(1 − p)/p]
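Equation (8.5.9) determines ρ from the rate R. A numerical sketch (p and R are illustrative values): solve ρ ln p + (1 − ρ) ln(1 − p) = R − ln 2 by bisection and check that ρ > p whenever R < C = ln 2 − h₂(p), as argued above.

```python
import math

def h2(q):
    return -q * math.log(q) - (1 - q) * math.log(1 - q)

p = 0.1                       # crossover probability, p < 1/2; illustrative
C = math.log(2) - h2(p)       # capacity (8.5.3)
R = 0.5 * C                   # some rate below capacity; illustrative

def f(rho):
    # f(rho) = 0 is equation (8.5.9); f is decreasing in rho since ln p < ln(1-p)
    return rho * math.log(p) + (1 - rho) * math.log(1 - p) - (R - math.log(2))

# f(p) = C - R > 0 and f(rho) -> ln(2p) - R < 0 as rho -> 1, so bisect on [p, 1)
lo, hi = p, 1.0 - 1e-9
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if f(mid) > 0:
        lo = mid
    else:
        hi = mid
rho = 0.5 * (lo + hi)
```

For p = 0.1 and R = C/2 this gives ρ slightly above p, consistent with the inequality derived in the text.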
H_y(| 1) = h₂(α),   H_y(| 2) = h₂(α′).

That is why

L⁻¹[βc(x) + H_y(| x)] = (1/D) ( (1 − α′)[βc₁ + h₂(α)] − α[βc₂ + h₂(α′)],   −α′[βc₁ + h₂(α)] + (1 − α)[βc₂ + h₂(α′)] ),

where D = 1 − α − α′ and

ξ₁ = exp{−(1/D)[βc₁ + h₂(α)]},   ξ₂ = exp{−(1/D)[βc₂ + h₂(α′)]}.

Then

a = −(α′c₁ + αc₂)/D + (1/D)(c₁ξ₁ + c₂ξ₂)/(ξ₁ + ξ₂), (8.5.12)

C = [α′h₂(α) + αh₂(α′)]/D + ln(ξ₁ + ξ₂) + (β/D)(c₁ξ₁ + c₂ξ₂)/(ξ₁ + ξ₂). (8.5.13)

In particular, for β = 0

C = [α′h₂(α) + αh₂(α′)]/(1 − α − α′) + ln[e^{−h₂(α)/(1−α−α′)} + e^{−h₂(α′)/(1−α−α′)}].

In the other particular case, when α = α′, c₁ = c₂ (symmetric channel), it follows from (8.5.13) that

C = 2αh₂(α)/(1 − 2α) + ln 2 − [βc₁ + h₂(α)]/(1 − 2α) + βc₁/(1 − 2α) = ln 2 − h₂(α),

which naturally coincides with (8.5.3).
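The symmetric-channel result can be checked independently of the closed formulas. A sketch: for the binary symmetric channel with crossover probability α, compare ln 2 − h₂(α) against a direct grid maximization of the mutual information over the input distribution (values illustrative).

```python
import math

def h2(q):
    return -q * math.log(q) - (1 - q) * math.log(1 - q)

alpha = 0.11                      # crossover probability; illustrative

def mutual_info(px1, a):
    # I(x; y) for the binary symmetric channel with input P(x=1) = px1
    P = [[1 - a, a], [a, 1 - a]]
    Px = [px1, 1 - px1]
    Py = [sum(Px[x] * P[x][y] for x in range(2)) for y in range(2)]
    return sum(Px[x] * P[x][y] * math.log(P[x][y] / Py[y])
               for x in range(2) for y in range(2))

# Grid search over the input distribution; the maximum is at px1 = 1/2
best = max(mutual_info(q, alpha) for q in [i / 1000 for i in range(1, 1000)])
C = math.log(2) - h2(alpha)       # closed form (8.5.3)
```

The grid contains the optimal point q = 1/2 exactly, so the two numbers agree to machine precision.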
c(x) = c^{(0)} + ∑_k c_k^{(1)} x_k + (1/2) ∑_{k,l} c_{kl} x_k x_l. (8.6.3)

Obviously, we can choose the origin in the spaces X and Y (making the substitution x_k + ∑_l (c⁻¹)_{kl} c_l^{(1)} → x_k, y_i − m_i → y_i) so that the terms in (8.6.3) and in the exponent (8.6.1), which are linear in x, y, vanish. Furthermore, the constant summand c^{(0)} in (8.6.3) is negligible and can be omitted. Therefore, without loss of generality, we can keep only the bilinear terms in (8.6.1), (8.6.3). Using matrix notation, let us write (8.6.1), (8.6.3) as follows:
p(y | x) = det^{1/2}(A/2π) exp{−(1/2)(y^T − x^T d^T)A(y − dx)}, (8.6.4)

c(x) = (1/2) x^T c x. (8.6.5)
Here we imply a matrix product of two adjacent matrices. The character T denotes transposition; x, y are column matrices and x^T, y^T are row matrices, correspondingly:

x^T = (x₁, …, x_r),   y^T = (y₁, …, y_s).

Certainly, matrix A is a non-singular positive definite matrix, the inverse of the correlation matrix K = A⁻¹. Matrix c is likewise assumed to be non-singular and positive definite.
As is seen from (8.6.1), actions of disturbances in a channel are reduced to the
addition
y = dx + z (8.6.6)
of noises zT = (z1 , . . . , zs ) having a Gaussian distribution with zero mean vector and
correlation matrix K:
E[z] = 0, E[zzT ] = K. (8.6.7)
(a matrix representation is applied in the latter formula as well).
2. Let us turn to the computation of capacity C and the probability densities p(x), p(y) for the channel in consideration. To this end, consider equation (8.2.9), which (as was mentioned in Section 8.2) is valid in the subspace X̃ where the probabilities p(x)Δx are non-zero. In our case, X̃ will be a Euclidean subspace of the original r-dimensional Euclidean space X. In that subspace, of course, matrix c (a scalar product can be defined with its help) will also be non-singular and positive definite.
We shall seek the probability density function p(x) in the Gaussian form:
p(x) = (2π)^{−r/2} det^{−1/2} K_x exp{−(1/2) x^T K_x⁻¹ x},   x ∈ X (8.6.8)

(so that E[x] = 0, E[xx^T] = K_x). (8.6.9)
Taking into account (8.6.6), it follows from Gaussian nature of random variables
x and z that y are Gaussian random variables as well. Therefore, averaging out (8.6.6)
and accounting for (8.6.7), (8.6.9), it is easy to find their mean value and correlation
matrix
E[y] = 0, E[yyT ] = Ky = K + dKx d T . (8.6.10)
Therefore,

p(y) = det^{−1/2}(2πK_y) exp{−(1/2) y^T K_y⁻¹ y}
     = det^{−1/2}[2π(K + dK_x d^T)] exp{−(1/2) y^T (K + dK_x d^T)⁻¹ y}. (8.6.11)
−(1/2) ln[det(A) det(K_y)] + (1/2) E[(y^T − x^T d^T)A(y − dx) − y^T K_y⁻¹ y | x] = βa − (β/2) x^T c x − C,   x ∈ X. (8.6.12)

When taking the conditional expectation we account for

x^T d^T K_y⁻¹ dx = β x^T c x,   x ∈ X. (8.6.15)
From this moment on we treat the operators c, d^T K_y⁻¹ d and others as operators acting on vectors x from the subspace X and transforming them into vectors from the same subspace (that is, these operators are understood as the corresponding projections of the initial operators). Then, due to the freedom of selection of x from X, equality (8.6.15) yields

d^T K_y⁻¹ d = βc. (8.6.16)
Substituting the equality

is non-singular as well. It is not difficult to conclude from here that each of the operators (1_x + ÃK_x)⁻¹, Ã is non-singular. Indeed, the determinant of the matrix product (1_x + ÃK_x)⁻¹ Ã, which is equal to the product of the determinants of the respective matrices, could not be different from zero if at least one factor-determinant were equal to zero. Non-singularity of A follows from the inequality det A ≠ 0. Thus, there exists an inverse operator Ã⁻¹ = (d^T A d)⁻¹.
a = (1/2) E[x^T c x] = (1/2) tr(cK_x) (8.6.24)

or, if we substitute (8.6.22) hereto,

a = (1/2) tr[(1/β) 1_x − cÃ⁻¹] = rT/2 − (1/2) tr(cÃ⁻¹). (8.6.25)
Here we have taken into account that the trace of the identity x-operator is equal to the dimension r of space X̃, i.e. 'the number of active degrees of freedom' of random variable x. The corresponding 'thermal capacity' equals

da/dT = r/2 (8.6.26)

(if r does not change under the variation dT). Thus, according to the laws of classical statistical thermodynamics, there is average energy T/2 per every degree of freedom.
In order to determine the capacity, we can use formulae (8.6.14), (8.6.25) [when applying formulae (8.6.16), (8.6.21)] or the regular formulae for the information of communication between Gaussian variables, which yield

C = (1/2) ln det(K_y A) = (1/2) tr ln(K_y A).
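The equality of the two expressions rests on the identity ln det M = tr ln M. A numerical sketch for a 2×2 symmetric positive definite matrix (entries illustrative), computing tr ln M from the spectrum:

```python
import math

# A symmetric positive definite example standing in for Ky A
M = [[2.0, 0.3],
     [0.3, 1.0]]

det_M = M[0][0] * M[1][1] - M[0][1] * M[1][0]

# Eigenvalues of a symmetric 2x2 matrix
tr = M[0][0] + M[1][1]
disc = math.sqrt((M[0][0] - M[1][1]) ** 2 + 4 * M[0][1] ** 2)
lam1, lam2 = (tr + disc) / 2, (tr - disc) / 2

log_det = math.log(det_M)                  # ln det M
tr_log = math.log(lam1) + math.log(lam2)   # tr ln M via the spectrum
```

Since the determinant is the product of eigenvalues, the two quantities coincide, which is what allows the capacity to be written either way.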
Further, we substitute (8.6.23) hereto. It is easy to prove (moving d from left to right with the help of formula (8.6.41)) that

C = (1/2) tr ln[1 − d(d^T A d)⁻¹ d^T A + T dc⁻¹ d^T A] = (1/2) tr ln(T Ãc⁻¹)   (Ã = d^T A d).
2
Otherwise,
C=
1 −1 ) + ln(T ) tr 1x
tr ln(Ac
2 2
1 −1
r
= tr ln(Ac ) + ln T. (8.6.27)
2 2
The provided logarithmic dependence of C from temperature T can be already
derived from formula (8.6.25) if we take into account the general thermodynamic
relationship (8.2.20). In the given case it takes the form
dC = da/T =
rdT /2T
These functions, as was pointed out in Section 8.2, simplify the use of conditions (8.1.1), (8.1.2). Quantity C increases with a growth of a; according to the terminology of Section 8.2, the dependence between C and a is normal in the given case. If the condition is of type (8.1.1), then the channel capacity is determined from (8.6.29) by the substitution a = a₂.
4. Examples. The simplest particular case is when matrix Ãc⁻¹ is a multiple of the identity matrix:

Ãc⁻¹ = (1/2N) 1_x. (8.6.30)

This takes place in the case when, say,

c = 2 · 1_x,   K = N · 1_y,   d_il = 1 for l = i = 1, …, r and d_il = 0 for l > r, (8.6.31)

since

tr(cÃ⁻¹) = 2N tr 1_x = 2Nr,   tr ln(Ãc⁻¹) = −r ln(2N).

According to formulae (8.6.22), (8.6.25) the correlation matrix at the input is represented as

K_x = (T/2 − N) 1_x = (a/r) 1_x.
Next, we consider a somewhat more difficult example. We suppose that spaces X, Y coincide with each other, matrices (1/2)c, d are identity matrices and matrix K is diagonal but not a multiple of the identity matrix:

K = ‖N_i δ_ij‖.

Then

(Ãc⁻¹)_ij = δ_ij/(2N_i),   i ∈ L, j ∈ L.

Let us give a number of other relations with the help of the introduced set X̃. For the example in consideration, formula (8.6.22) takes the form

(K_x)_ij = (T/2 − N_i) δ_ij. (8.6.34)
These formulae solve the problem completely if we know the set L of indices i corresponding to non-zero components of vectors x from X̃. Let us demonstrate what considerations define that set.
In the case of Gaussian distributions, probability densities p(x), p(y), of course, cannot be negative, and therefore subspace X̃ is not determined from the positiveness condition of probabilities (see Section 8.2). However, the condition of positive definiteness of matrix K_x [which is given by formula (8.6.22)] may be violated, and it must be verified. Besides, the condition of non-degeneracy of operator (8.6.20) must also be satisfied. In the given example, we do not need to care much about the latter condition because d = 1. But the condition of positive definiteness of matrix K_x, i.e. the constraint

N_i < T/2, (8.6.36)

due to (8.6.34), is quite essential. For each fixed T the set of indices L is determined from constraint (8.6.36). Hence, we can replace i ∈ L with N_i < T/2 under the summation signs in formulae (8.6.35).
The derived relations can be conveniently tracked on Figure 8.2. We draw indices
i and variances of disturbances Ni (stepwise line) on abscissa and ordinate axes,
respectively. A horizontal line corresponds to a fixed temperature T /2. Below it
(between the horizontal line and the stepwise line) there are intervals (Kx )ii located
according to (8.6.34). Intersection points of the two specified lines define boundaries
of set L. The area sitting between the horizontal and the stepwise lines is equal to
the total useful energy a. In turn, the shaded area located between the stepwise line
and the abscissa axis is equal to the total energy of disturbances ∑i∈L Ni .
Space X̃ (as a function of T) is determined by analogous methods for more complicated cases.
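The geometric construction just described amounts to the familiar 'water-filling' recipe. A sketch for the example with (1/2)c = d = 1, so that by (8.6.34)–(8.6.36) the set L = {i : N_i < T/2} and (K_x)_ii = T/2 − N_i: given a total useful energy a, find the level T/2 by bisection and evaluate C = (1/2) ∑_{i∈L} ln(T/2N_i). The noise levels are illustrative.

```python
import math

N = [0.2, 0.5, 1.0, 3.0]        # noise variances N_i; illustrative
a_target = 1.5                  # total useful energy a

def energy(h):
    # water above the noise profile at level h = T/2
    return sum(h - n for n in N if n < h)

# energy(h) is increasing, so bisect for the level h with energy(h) = a_target
lo, hi = min(N), min(N) + a_target + max(N)
for _ in range(100):
    h = 0.5 * (lo + hi)
    if energy(h) < a_target:
        lo = h
    else:
        hi = h

L = [n for n in N if n < h]                 # active components
C = 0.5 * sum(math.log(h / n) for n in L)   # capacity in nats
a_check = energy(h)
```

With these numbers the level settles at h = 16/15, the component with N_i = 3.0 stays unused, and the filled area between the level line and the noise profile equals a, exactly as in Figure 8.2.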
Fig. 8.2 Determination of channel capacity in the case of independent (spectral) components of a useful signal and additive noise with identically distributed components

5. Now we compute the thermodynamic potential (7.4.3), defining estimation (7.4.2) of the probability of decoding error, for Gaussian channels. The information

I(x, y) = ln[p(y | x)/p(y)] = (1/2) ln det(K_y A) − (1/2)[(y^T − x^T d^T)A(y − dx) − y^T K_y⁻¹ y]

has already appeared in formula (8.6.12). Substituting it into the equality

e^{μ(s)} = ∫ e^{−sI(x,y)} p(x) p(y | x) dx dy,

we see that the latter integral can be easily calculated with the help of the well-known matrix formula (5.4.19). It turns out to be equal to
8.6 Gaussian channels
where

B = [ K_x⁻¹ + (1 − s)d^T A d    −(1 − s)d^T A ;   −(1 − s)Ad    (1 − s)A + sK_y⁻¹ ]. (8.6.38)

In order to compute det B we apply the formula

det[ b  c ; c^T  d ] = det d · det(b − c d⁻¹ c^T), (8.6.39)

which follows from formula (A.2.4) of the Appendix. This application yields
Taking the logarithm of the last expression and accounting for formula (6.5.4), we find

μ(s) = −(s/2) tr ln(K_y A) − (1/2) tr ln(1 − s + sKK_y⁻¹) − (1/2) tr ln[1 + (1 − s)sK_x d^T K_y⁻¹ (1 − s + sKK_y⁻¹)⁻¹ d]. (8.6.40)
2
We have simplified the latter term here by taking into account that
s(1 − β cA−1 )
μ (s) = −C + tr .
−1
1 − s2 + s2 β cA
1 − β cA−1 1
a = s2 tr −1 )
+ tr ln(1 − s2 + s2 β cA
1 − s + s β cA
2 2 −1 2
−1 )
s(1 − β cA
tr = C − R.
1 − s2 + s2 β cA
Using the equality
1 − β cA−1 1
s2 = −1
1 − s + s β cA
2 2 −1 −1
1 − s + s2 β cA
2
1 (C − R)2
α= +··· . (8.6.48)
−1
r − β tr cA
2
1. Stationary channels are invariant under a translation with respect to the index (time):

It is convenient to reduce matrices (8.7.1) to a diagonal form via the unitary transformation

x̄ = U⁺x, (8.7.3)

where

U = ‖U_jl‖,   U_jl = (1/√m) e^{2πijl/m},   j, l = 1, …, m (8.7.4)
[refer to (5.5.8)]. Its unitarity can be easily verified. As was shown in Section 5.4, the Hermitian conjugate operator is

U⁺ = ‖U⁺_lk‖ = ‖(1/√m) e^{−2πikl/m}‖. (8.7.5)

(U⁺cU)_jk = (1/m) ∑_{l,l′} e^{−2πi(jl′−kl)/m} c_{l′−l}
         = (1/m) ∑_{l′} e^{−2πi(j−k)l′/m} ∑_l e^{−2πikl/m} c_l
         = δ_jk c̄_k.
In this case,

c̄_j = ∑_{l=1}^{m} c_l e^{−2πijl/m},   Ā_j = ∑_{l=1}^{m} e^{−2πijl/m} a_l,   d̄_j = ∑_{l=1}^{m} e^{−2πijl/m} d_l. (8.7.7)
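That the unitary matrix (8.7.4) diagonalizes a translation-invariant (circulant) matrix, with eigenvalues given by the Fourier sums (8.7.7), can be checked directly. A sketch with an illustrative 4-periodic symmetric sequence c_l:

```python
import cmath

m = 4
c = [2.0, 0.5, 0.1, 0.5]   # c_l with c_{m-l} = c_l; illustrative

# Circulant matrix c_{jk} = c_{(j-k) mod m}
Cmat = [[c[(j - k) % m] for k in range(m)] for j in range(m)]

# U_{jl} = exp(2 pi i j l / m) / sqrt(m), as in (8.7.4)
U = [[cmath.exp(2j * cmath.pi * j * l / m) / m ** 0.5 for l in range(m)]
     for j in range(m)]
Uh = [[U[l][j].conjugate() for l in range(m)] for j in range(m)]  # U^+

def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

D = mat_mul(mat_mul(Uh, Cmat), U)          # should be diagonal

# Expected eigenvalues: the Fourier sums of (8.7.7)
eig = [sum(c[l] * cmath.exp(-2j * cmath.pi * j * l / m) for l in range(m))
       for j in range(m)]
```

The off-diagonal entries of D vanish to rounding error, and the diagonal reproduces the c̄_j, which is the content of the computation above.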
considered.
Matrix (8.6.22) can be represented as

(K_x)_j δ_jl = (T/c̄_j − N_j/|d̄_j|²) δ_jl.

C = (1/2) ∑_{c̄_j N_j < |d̄_j|² T} ln[ |d̄_j|² T / (c̄_j N_j) ], (8.7.8)

a = (1/2) ∑_{c̄_j N_j < |d̄_j|² T} [ T − c̄_j N_j / |d̄_j|² ]. (8.7.9)
and analogously for Ā_j, d̄_j. As Δ → 0, the expression in the right-hand side converges to the limit

c̄(2πj/T₀) = ∫₀^{T₀} e^{−2πijt/T₀} c(t) dt = ∫_{−T₀/2}^{T₀/2} e^{−2πijt/T₀} c(t) dt. (8.7.12)
Similarly,

Ā_j Δ → Ā(ω_j),   d̄_j Δ → d̄(ω_j),

where

Ā(ω) = ∫₀^{T₀} e^{−iωt} A(t) dt ≡ 1/N(ω),   d̄(ω) = ∫₀^{T₀} e^{−iωt} d(t) dt. (8.7.14)

Moreover,

|d̄_j|² / (c̄_j N_j) → |d̄(ω_j)|² / (c̄(ω_j)N(ω_j)). (8.7.15)
Taking into account (8.7.15), we obtain

C = (1/2) ∑_{j=0}^{∞} ln[ T|d̄(ω_j)|² / (c̄(ω_j)N(ω_j)) ], (8.7.16)

where the summation is carried out over the region c̄(ω_j)N(ω_j) < T|d̄(ω_j)|². These formulae could have been derived if we had disregarded the passage to the limit and considered the following integral unitary transformation:

x̄(ω_j) = (1/√T₀) ∫₀^{T₀} e^{−iω_j t} x(t) dt,   j = 0, 1, 2, … . (8.7.18)
These values constitute Fourier components of the initial function and turn out to be
independent in the stationary case.
3. The other example of a finite-dimensional space is the case when x = (…, x₁, x₂, …) represents a discrete-time process on an infinite interval. Then periodicity constraints (8.7.2) vanish. Corresponding results can be obtained from the formulae of clause 1 via the passage to the limit m → ∞.
8.7 Stationary Gaussian channels

Now it is convenient to represent transformation (8.7.4) in the form

x̄̄(λ_j) = (1/√(2π)) ∑_l e^{−2πijl/m} x_l = (1/√(2π)) ∑_l e^{−iλ_j l} x_l,   λ_j = 2πj/m, (8.7.19)

and write

c̄̄(λ_r) δ(λ_r − λ_k), …

instead of (8.7.6), supposing c̄_j = c̄̄(λ_j), …. In this case, formula (8.7.7) is reduced to the form

c̄̄(λ_j) = ∑_{l=1}^{m} e^{−iλ_j l} c_l, … . (8.7.20)
C = (m/4π) ∑_j ln[ T|d̄̄(λ_j)|² / (N̄̄(λ_j)c̄̄(λ_j)) ] (λ_{j+1} − λ_j).

Dividing both sides of the equality by m and passing to the limit, we will have

C₁ = lim_{m→∞} C/m = (1/4π) ∫_{N̄̄c̄̄ < |d̄̄|²T} ln[ T|d̄̄(λ)|² / (N̄̄(λ)c̄̄(λ)) ] dλ (8.7.21)

and, analogously,

a₁ = lim_{m→∞} a/m = (1/4π) ∫_{N̄̄c̄̄ < |d̄̄|²T} [ T − N̄̄(λ)c̄̄(λ)/|d̄̄(λ)|² ] dλ. (8.7.22)
C₀ = lim_{T₀→∞} C/T₀ = lim_{T₀→∞} (1/4π) ∑_j ln[ T|d̄(ω_j)|² / (N(ω_j)c̄(ω_j)) ] (ω_{j+1} − ω_j)
   = (1/4π) ∫_{c̄N < T|d̄|²} ln[ T|d̄(ω)|² / (N(ω)c̄(ω)) ] dω. (8.7.23)

Analogously,

a₀ = lim_{T₀→∞} a/T₀ = (1/4π) ∫_{c̄N < T|d̄|²} [ T − N(ω)c̄(ω)/|d̄(ω)|² ] dω.
Here, according to (8.7.12), (8.7.14), c̄(ω) = ∫_{−∞}^{∞} e^{−iωt} c(t) dt and d̄(ω) are the spectra of the functions c(t − t′), d(t − t′), and N(ω) is the spectral density of the disturbances (noises):

N(ω) = ∫_{−∞}^{∞} e^{−iωτ} K(τ) dτ. (8.7.23a)
Similarly, we find the rate (per unit of time) function μ₀(s) from (8.7.10):

μ₀(s) = lim_{T₀→∞} μ(s)/T₀ = −(1/4π) ∫_{N̄c̄ < T|d̄|²} { s ln[ T|d̄(ω)|² / (N(ω)c̄(ω)) ] + ln[ 1 − s² + s² N(ω)c̄(ω) / (T|d̄(ω)|²) ] } dω. (8.7.24)
t_{k+1} − t_k = τ₀ (8.7.25)

ω = mλ/T₀ = λ/τ₀ (8.7.26)
8.8 Additive channels

1. Let X and Y be identical linear spaces. The channel [p(y | x), c(x)] is called additive if X̃ = X, and if the probability density p(y | x) depends only on the difference of the arguments: p(y | x) = p₀(y − x). This means that y = x + z is obtained by adding to x a random variable z having probability density p₀(z). Assuming X̃ = X, let us consider those simplifications that the additivity assumption introduces into the theory presented in Sections 8.2 and 8.3.

Now equation (8.3.5) can be written in the form

∫ p₀(y − x) ln p(y) dy = βa − βc(x) − C − H_z, (8.8.1)

Comparing (8.8.4) with (8.3.2), we observe that the operator L in this case is of the following form:

L = e^{μ(d/dx)}. (8.8.5)

With the help of this operator, the problem of finding the channel capacity and the extreme distribution is solved according to the formulae of Section 8.3. The operator transposed with respect to (8.8.5) is

L^T = e^{μ(−d/dx)}. (8.8.6)

Therefore,

F(y)L⁻¹ = (L⁻¹)^T F(y) = e^{−μ(−d/dx)} F(y).
With the help of operators (8.8.5), (8.8.6), formula (8.3.7) can be rewritten as

We also construct some other formulae of Section 8.3. Relations (8.3.10) take the form b(y) = [e^{−μ(d/dx)} c(x)]_{y=x}, ν(y) = e^{−H_z}. That is why we obtain the following potentials from (8.3.11):

ϕ(T) = T H_z − T ln ∫ exp{−(1/T) e^{−μ(d/dx)} c(x)} dx,
Γ(β) = −H_z + ln ∫ exp{−β e^{−μ(d/dx)} c(x)} dx. (8.8.8)
Example 8.1. Let the noise be Gaussian:

p₀(z) = (1/(√(2π) σ)) e^{−z²/2σ²}.

Then

e^{μ(s)} = e^{σ²s²/2}
and, thereby,

e^{−μ(d/dy)} c(y) = e^{−(σ²/2) d²/dy²} y⁴ = ∑_{k=0}^{∞} [(−1)^k σ^{2k} / (k! 2^k)] (d^{2k}/dy^{2k}) y⁴ = y⁴ − 6σ²y² + 3σ⁴, (8.8.10)

because the only derivatives that are different from zero are

(d²/dy²) y⁴ = 12y²,   (d⁴/dy⁴) y⁴ = 24.
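Identity (8.8.10) can be verified through the probabilistic meaning of the operator: e^{+(σ²/2) d²/dy²} f(y) = E[f(y + z)] with z ~ N(0, σ²), so applying that smoothing to y⁴ − 6σ²y² + 3σ⁴ must return y⁴. A sketch using the exact Gaussian moments E z² = σ², E z⁴ = 3σ⁴:

```python
# Check E[(y+z)^4 - 6 s2 (y+z)^2 + 3 s2^2] = y^4 for z ~ N(0, s2),
# using E z = E z^3 = 0, E z^2 = s2, E z^4 = 3 s2^2.

def smoothed(y, s2):
    e4 = y ** 4 + 6 * y ** 2 * s2 + 3 * s2 ** 2   # E (y+z)^4
    e2 = y ** 2 + s2                              # E (y+z)^2
    return e4 - 6 * s2 * e2 + 3 * s2 ** 2

vals = [(y, s2) for y in (-2.0, 0.0, 1.5) for s2 in (0.3, 2.0)]
checks = [abs(smoothed(y, s2) - y ** 4) for y, s2 in vals]
```

The polynomial y⁴ − 6σ²y² + 3σ⁴ is, up to normalization, the fourth Hermite polynomial, which is exactly what makes it invariant in this sense under Gaussian smoothing.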
Accounting for (8.8.10), formulae (8.3.11), (8.8.8) yield

Z = e^{−βϕ} = e^{−H_z − 3βσ⁴} ∫_{−∞}^{∞} e^{−βy⁴ + 6βσ²y²} dy
  = 2β^{−1/4} e^{−H_z − 3βσ⁴} ∫₀^{∞} e^{−t⁴ + 6√β σ²t²} dt
  = σ√(3/2) e^{−H_z} e^{(3/2)βσ⁴} [ K_{1/4}((9/2)βσ⁴) + π√2 I_{1/4}((9/2)βσ⁴) ], (8.8.11)

Γ = −H_z + (1/2) ln(3σ²/2) + (3/2)βσ⁴ + ln[ K_{1/4}((9/2)βσ⁴) + π√2 I_{1/4}((9/2)βσ⁴) ].

Hence, denoting F(λ) = K_{1/4}(λ) + π√2 I_{1/4}(λ), λ = (9/2)βσ⁴, we have

a = −(3/2)σ⁴ − (9/2)σ⁴ F′(λ)/F(λ),

C = (1/2) ln[3/(4πe)] − λ F′(λ)/F(λ) + ln F(λ).
Example 8.2. Now suppose that

θ(s) = e^{μ(s)}.

Since

θ(d/dx) e^{±αx} = θ(±α) e^{±αx},

we now have

θ⁻¹(d/dx) (1/2)(e^{αx} + e^{−αx}) = (1/2)θ⁻¹(α) e^{αx} + (1/2)θ⁻¹(−α) e^{−αx} = θ⁻¹(α) cosh αx

in consequence of the mentioned symmetry. For this case, we compute the partition function:

Z = e^Γ = ∫_{−∞}^{∞} exp{−H_z − (β/θ(α)) cosh αy} dy = (2/α) K₀(β/θ(α)) e^{−H_z}.

Hence, denoting β̃ = β/θ(α),

a = K₁(β̃)/[θ(α) K₀(β̃)],

C = −H_z + ln(2/α) + ln K₀(β̃) + β̃ K₁(β̃)/K₀(β̃). (8.8.12)
c(x) = (1/2) x^T c x,   μ(s) = (1/2) s^T K s

(a matrix form), and the transformation e^{−μ(d/dx)} c(x) is reduced to the following:

e^{−μ(d/dx)} c(x) = ∑_{n=0}^{∞} [(−1)^n / (n! 2^n)] [(∂/∂x)^T K (∂/∂x)]^n (1/2) x^T c x = (1/2) x^T c x − (1/2) tr(Kc).

Indeed, K (d/dx)(1/2) x^T c x = Kcx, so that

(d/dx)^T K (d/dx) (1/2) x^T c x = tr(Kc),

and all higher-order derivatives vanish. Potentials (8.8.8) are easily obtainable in this case using formula (5.3.19), which yields

Γ = −H_z + (β/2) tr(Kc) − (1/2) tr ln(βc/2π). (8.8.13)
Because

H_z = (1/2) tr ln(2πK) + (1/2) tr 1,

formulae (8.2.18), (8.2.19) lead to the following result:

a = −(1/2) tr(Kc) + (T/2) tr 1,
C = −(1/2) tr ln(βc/2π) − (1/2) tr ln(2πK) = −(1/2) tr ln(cK) − (1/2) ln β · tr 1. (8.8.14)
The concept of the value of information, introduced in this chapter, connects Shannon's information theory with statistical decision theory. In the latter theory, the most basic notion is that of average cost or risk, which characterizes the quality of decisions being made. The value of information can be described as the maximum benefit that can be gained in the process of minimizing the average cost with the help of a given amount of information. Such a definition of the value of information turns out to be related to the formulation and the solution of certain conditional variational problems.
The notion of the value of information can be defined in three related ways, based either on the amount of Hartley's information, Boltzmann's information or Shannon's mutual information. Choosing Shannon's mutual information necessitates solving the third variational problem. There exists a certain relationship between these definitions, and one concept can be conveniently substituted for the other. All of these concepts characterize a particular object—the Bayesian system—which, along with the communication channel, is a major object of study in information theory.
The theory of the value of information is an independent branch of informa-
tion theory, but is rooted in communication theory. Some of its elements and re-
sults emerged from the traditional theory studying communication channels. Claude
Shannon [45] (originally published in English [38]) considered the third variational problem in 1948, posing it as an entropy minimization problem under a constraint on the level of costs or, using Shannon's terminology, under a given rate of distortion. This terminology is quite far from that of statistical decision theory, but it certainly does not change the mathematical essence. Later Kolmogorov [26]
(translated to English [27]) introduced the notion of ε-entropy based on this variational problem and obtained some related results. Instead of the term ε-entropy, we shall use the term α-information, because we shall use Shannon's mutual information rather than entropy.
Recently, this theory (in the original Shannon interpretation) has been developed significantly in the research papers of American scientists (especially in Berger's monograph [3]) following Shannon's papers [42, 45] (original papers in English [38, 40]). However, we adhere to a different interpretation and terminology.
We emphasize that the class of problems associated with the third variational
problem is equivalent in significance to that associated with the second and the first
variational problems. (Of course, this does not preclude a unified approach; see, for
example, the assertion of the generalized Shannon’s theorem in Section 11.5.)
The utility of information is that it allows one to reduce the losses associated with the average cost. It is assumed that a cost function is defined, which imposes different costs for different actions and decisions. More successful actions yield smaller costs and bigger rewards in comparison with less successful actions. Our goal is to minimize the average cost. The available information allows us to achieve a lower level of the average cost.
Before proceeding to the mathematical formulation of the above, let us first consider in this introductory section a simpler problem (of the same type as the first variational problem). This problem illustrates the fact that a high level of uncertainty in the system (neg-information) does indeed increase the level of losses.
Assume there is a system with discrete states. The system takes one of those possible states at a time. The random variable ξ describing a certain state assumes a fixed value. Also, assume a cost function c(ξ) is given, which was chosen according to the purpose of the system. For instance, if one desires that the system be positioned near the null state ξ = 0 (a stabilization problem), then one may choose the cost function c(ξ) = |ξ|.
Suppose that, for whatever reason, the system cannot reach the ideal equality
ξ = 0. For example, inevitable fluctuations in the component parts of the system
entail statistical scattering, i.e. there is uncertainty or neg-information. In this case,
the value of the variable ξ will be random and described by some probabilities P(ξ ).
As is well known, entropy is the measure of uncertainty:

H_ξ = −∑_ξ P(ξ) ln P(ξ). (9.1.1)

Assume that the amount of uncertainty H_ξ is fixed and consider the expected value of possible costs E[c(ξ)].
of possible costs E[c(ξ )]. There exists some lower limit for these costs that can be
found via the methods mentioned in Sections 3.2, 3.3 and 3.6. In fact, the problem
of finding the extremum of average costs given constant entropy has already been
solved. We revisit the solution of that problem now. The optimal distribution of
probabilities has the form
P(ξ ) = eβ F0 −β c(ξ ) , (9.1.2)
where

F₀ = −T ln ∑_ξ e^{−c(ξ)/T} (9.1.3)

9.1 Reduction of average cost under uncertainty reduction

[see equation (3.3.5)]. The parameter β = 1/T is determined from the constraint of fixed entropy (9.1.1). Further, it follows from (3.3.15) that

−dF₀(T)/dT = H_ξ. (9.1.4)

After determining the parameter β or T, we compute the minimum average cost by the formula

R₀(H_ξ) = d(βF₀)/dβ = F₀ − T ∂F₀/∂T. (9.1.5)
These formulas show how the minimum average cost depends on the uncertainty H_ξ in the system. According to Theorem 3.4, the average cost R₀(H_ξ) for T > 0 gets bigger if entropy H_ξ increases. Now assume that there is an inflow of information I that reduces entropy according to (1.1.2). If the system contained uncertainty (neg-information) H_ξ initially and that uncertainty decreased to the value H_ξ − I = H_ps because of the inflow of information, then, obviously, this has led to the cost reduction
This difference indicates the benefit brought about by the information I. It is a quan-
titative measure of the value of information.
Assume I = ΔH_ξ is very small. Then it follows from (9.1.6) that

ΔR₀ ≈ (dR₀/dH_ξ) I = T ΔH_ξ.
Hence, the derivative dR0 /dHξ = T may be regarded as the differential value of
entropy reduction (differential value of information).
Example 9.1. Let ξ be integer-valued and the cost function be given by c(ξ ) = |ξ |.
Denoting e−β = z, we obtain the partition function for this problem
e^{−βF₀} = ∑_{ξ=−∞}^{∞} e^{−β|ξ|} = 1 + 2 ∑_{ξ=1}^{∞} z^ξ = 1 + 2z/(1 − z) = (1 + z)/(1 − z).

Therefore,

H_ξ = −2z ln z/(1 − z²) + ln[(1 + z)/(1 − z)]. (9.1.9)
9 Definition of the value of information
The behaviour of function R0 (Hξ ) and the differential value of information T (Hξ )
is represented on Figure 9.1.
Fig. 9.1 Average cost and differential value of information as functions of entropy (Example 9.1)
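Formula (9.1.9), together with the minimum cost R₀ = E|ξ| = 2z/(1 − z²) (obtained by differentiating the partition function; this closed form is not displayed in the text), can be checked against direct summation over P(ξ) ∝ z^{|ξ|}. The truncation at |ξ| ≤ 200 is a numerical convenience:

```python
import math

z = 0.5
Z = (1 + z) / (1 - z)                 # partition function e^{-beta F0}
xs = range(-200, 201)                 # truncation; the tail is negligible
P = [z ** abs(x) / Z for x in xs]

H_direct = -sum(p * math.log(p) for p in P)
R_direct = sum(abs(x) * p for x, p in zip(xs, P))

H_formula = -2 * z * math.log(z) / (1 - z ** 2) + math.log((1 + z) / (1 - z))
R_formula = 2 * z / (1 - z ** 2)
```

The pair (H_formula, R_formula), traced as z varies from 0 to 1, is exactly the curve R₀(H_ξ) of Figure 9.1.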
Example 9.2. Assume that ξ may only assume the eight values −3, −2, −1, 0, 1, 2,
3, 4. The cost function remains the same as in the previous example, i.e. c(ξ ) = |ξ |.
Then

H_ξ = −2z ln z (1 + z² − 2z³)/(1 − z⁴) + ln[(1 + z)(1 − z⁴)/(1 − z)]. (9.1.12)
With the growth of temperature T, the entropy H_ξ and the average cost R₀ monotonically increase. In the limiting case of T → ∞, β → 0, z → 1 we have the maximum possible entropy

H_ξ = lim_{z→1} ln[(1 + z)(1 − z⁴)/(1 − z)] = ln 8 = 3 ln 2 = 3 bits

and average cost

R₀ = lim_{z→1} 2z(1 + z² − 2z³)/(1 − z⁴) = 2,

which correspond to the uniform distribution with P(ξ) = 1/8.
As noted above, a decrease in uncertainty in the system may be achieved by gaining information. The amount of information was conceived simply as the difference between two entropies H_ξ for a single random variable. Meanwhile, according to the discussion in Chapter 6, the amount of information I = I_xy is a more complex concept that presupposes the existence of two random variables x and y (rather than a single ξ). There has to be an unknown random variable x, about which the information is communicated, and a random variable y, which carries that information. This leads us to a complication in the reasoning and the need to turn from the simpler (the first) variational problem to a more sophisticated one, which we shall designate as the third variational problem of information theory.
1. Consider the following example similar to the examples from the previous section. Let x be an internal coordinate of a system assuming values −3, −2, −1, 0, 1, 2, 3, 4. Let u be an estimating variable selected from the same values, similar to the variable ξ from Example 9.2.

Assume that the eight points lie on the circumference (see Figure 9.3). It is desirable that the variable u be located as close as possible to the internal variable x. For example, this desire may be described by introducing the following cost function:
Fig. 9.3 The graph of the cost function for the considered example
Also, we can optimally partition the eight points into two sets, i.e. minimize the
average cost
E[min_u E[c(x, u) | y]].
It can be easily checked that the best way to partition the circumference containing
the eight points is to split it into two equal semicircles, each of which consists of
four points. It is reasonable to choose any point within each of those semicircles.
The proposed choice leads to the following average cost:
E[c(x, u)] = 0·(1/4) + 2·1·(1/4) + 1·2·(1/4) = 1.
Thus, in the optimal case, 1 bit of information yields a benefit that reduces the
average cost from 2 to 1. Consequently, the value of 1 bit of information is equal to
one: V (1) = 1, where we denote the described value as V .
A similar analysis can be performed for the case of 2 bits of information. In this
particular case, the set E is partitioned into four parts—ideally each part is a pair of
two adjacent points. Having determined which pair x belongs to, we choose u to be
any point from it. Then u either equals x with probability 1/2 or differs from x by 1
with probability 1/2. The average cost turns out to be equal to 1/2. Thus, 2 bits of
information reduces the losses from 2 to 1/2. The value of 2 bits of information is
equal to 3/2: V (2) = 3/2.
Having received 3 bits of information, we can determine the exact value of x and
assign u = x, thus nullifying the cost. Therefore, the value of 3 bits of information
in the problem under consideration equals to 2: V (3) = 2. The values indicated
above (V (1), V (2) and V (3)) correspond to the points A, B and C on Figure 9.4,
respectively.
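The values V(1), V(2), V(3) quoted above can be recovered by brute force, assuming the cost c(x, u) is the circular distance between the points (consistent with the numbers computed in the text). A sketch: for each number of cells M, enumerate all assignments of the eight points to cells and take the best partition.

```python
from functools import lru_cache
from itertools import product

n = 8
def cost(x, u):
    # circular distance between points x and u on the 8-point ring
    d = abs(x - u) % n
    return min(d, n - d)

@lru_cache(maxsize=None)
def cell_cost(cell):
    # total cost of a cell under its best representative u
    return min(sum(cost(x, u) for x in cell) for u in range(n))

# No information: best single estimate, R0 = 2
R0 = min(sum(cost(x, u) for x in range(n)) for u in range(n)) / n

def best_benefit(M):
    best = float('inf')
    for assign in product(range(M), repeat=n):
        cells = [[] for _ in range(M)]
        for x, k in enumerate(assign):
            cells[k].append(x)
        total = sum(cell_cost(tuple(c)) for c in cells if c)
        best = min(best, total / n)
    return R0 - best

V = {M: best_benefit(M) for M in (1, 2, 3, 4)}
```

The enumeration reproduces the benefits 0, 1, 1.375 and 3/2 for M = 1, 2, 3, 4, i.e. V(1 bit) = 1, V(ln 3) = 1.375 and V(2 bits) = 3/2.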
If M = e^I = 2^{I_bits} equals 3, then we should partition the eight points into three
domains: E1 , E2 and E3 . We need to compute the benefit
2 − E[min_u E[c(x, u) | E_k]]
for every partition (k = 1, 2, 3) and select the optimal one. Calculation shows that the
optimal partition is the following: E1 consists of two points, E2 and E3 both consist
of three points. The corresponding benefit is 1.375. The resulting point (ln 3, 1.375) is displayed on the plane (I_bits, V) (the point D on Figure 9.4). Points corresponding
to M = 5, 6, 7 can be constructed in the same way. We enumerate optimal partitions
and their corresponding coordinates in the table below:
Each value y = yk can be associated with one optimal value u = uk . Thus, the func-
tion u = d(y(x)) takes at most M values, similar to y(x).
9.2 Value of Hartley’s information amount. An example 297
Averaging the conditional average cost (9.2.2), we obtain the total average cost

R = E[ min_u E[c(x, u) | y] ].   (9.2.3)
If we do not have any information about the value of the unknown random vari-
able x, then we have only one way to choose the optimal estimator u. We minimize
the average cost E[c(x, u)] = ∑x c(x, u)P(x) by u, which gives the following level of
losses:
R_0 = min_u ∑_x c(x, u) P(x) = min_u E[c(x, u)].   (9.2.4)
Note that E[c(x, u)] = E[c(x, u) | u], i.e. we do not average over u. Naturally, the ben-
efit yielded by the received information is related to the difference in losses (9.2.3)
and (9.2.4).
Define the value of Hartley's amount of information I = ln M as the maximum benefit that can be obtained from it:

V(ln M) = min_u E[c(x, u)] − inf_{y(x)} E[ min_u E[c(x, u) | y] ].   (9.2.5)
Here we do not just minimize by u—we also minimize over all possible functions
y(x) taking M values.
Theorem 9.1. The minimization (9.2.5) over all M-valued functions y(x) can be
restricted only to the set of deterministic (non-randomized) dependencies of y on
x. In other words, taking into account randomized dependencies y(x) (when y is
random for a fixed x) does not alter the extremum.
Proof. Assume that y = yr (x) depends on x in a randomized way and ranges over
values y1 , . . . , yM . This means that the variable y is random for a fixed x and is
described by some probability distribution Pr (y | x). Let dr (y) be an optimal solution
for the dependency in question determined from (9.2.2). Express its losses (9.2.3)
as follows:
Thus,
Denote the right-hand side of the last formula by Rn . It follows from the inequali-
ties (9.2.8) and (9.2.9) that
i.e. the non-randomized dependency yn (x) is not worse than the randomized depen-
dency yr (x) with respect to average cost.
Fig. 9.5 System of transmission of the most valuable information. Channel without noise. MD—
measuring device
3. One simple application of the concept of the value of Hartley's information immediately follows from the aforementioned definition: the construction of a measuring-transmitting system subject to an informational restriction.
Let there be given a measuring device (MD) (see Figure 9.5) with an output
signal x equal to a measurable quantity, say continuous. However, suppose the exact
value of x cannot be conveyed to the device’s user due to either a (noiseless) channel
with limited capacity or a recording device with limited information capacity (the
variable y can assume only M = [eI ] different values). The goal is to receive values
of u ‘the closest’ to x with respect to the cost function c(x, u). To achieve this goal,
we must construct blocks 1 and 2 shown in Figure 9.5 such that the average cost is
minimized. Since the total amount of information is limited, we need to transmit the
most valuable information through the channel.
Taking into account the definition of the value of the Hartley’s amount of in-
formation, it becomes clear how to solve this problem. Block 1 must partition the
feasible space of values x into optimal domains E1 , . . . , EM mentioned in (9.2.6)
and deliver the index of a certain domain, i.e. y = k. After receiving one of the
possible signals k, block 2 must output the value uk that minimizes the conditional
expectation E[c(x, u) | Ek ].
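For finite alphabets, the two blocks can be sketched directly (an illustrative construction, not the book's notation; the partition E_1, ..., E_M, the prior P and the cost c are assumed given):

```python
def make_blocks(points, P, cost, partition):
    # Block 2's outputs: for each domain E_k, the value u_k minimizing E[c(x, u) | E_k]
    reps = []
    for cell in partition:
        w = sum(P[x] for x in cell)
        reps.append(min(points,
                        key=lambda u: sum(P[x] * cost(x, u) for x in cell) / w))

    def block1(x):
        # partition the space of values x and deliver the index of x's domain
        return next(k for k, cell in enumerate(partition) if x in cell)

    def block2(k):
        # after receiving the signal k, output the optimal value u_k
        return reps[k]

    return block1, block2

# Usage on the eight-point example: two semicircles, circular-distance cost
points = range(8)
P = {x: 1 / 8 for x in points}
cost = lambda x, u: min(abs(x - u) % 8, 8 - abs(x - u) % 8)
b1, b2 = make_blocks(points, P, cost, [[0, 1, 2, 3], [4, 5, 6, 7]])
avg = sum(P[x] * cost(x, b2(b1(x))) for x in points)
print(avg)  # -> 1.0, matching the one-bit figure of Section 9.2
```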
The above is generally valid in the case of a noisy channel, that is, when the channel's output signal does not necessarily coincide with its input signal y. This case
can be asymptotically reduced to the previous one if the system is forced to work
repeatedly, where we substitute x by ξ = (x1 , . . . , xn ), u by ζ = (u1 , . . . , un ) (n is
relatively large), and apply Shannon’s noisy-channel coding theorem. Thus, as is
shown on Figure 9.6, we need to install a channel encoder (CE) at the entrance of
the channel and a channel decoder (CD) at the exit. We assume that both the encoder
and the decoder function according to optimal encoding and decoding theory (see
Chapter 7). The structure of the blocks 1 and 2 is the same as on Figure 9.5.
Fig. 9.6 System of transmission of the most valuable information. Channel with noise. CE—
channel encoder, CD—channel decoder, ν = (y1 , . . . , yn )
The above method of defining the value of information V(I) has several drawbacks. First of all, it defines the value of information only for integer numbers M = [e^I]. The question remains: what is the benefit of the fractional part of e^I? Secondly, if in the above example we change the number of points by taking, say, 9 points, 10 points, and so forth, then the value of information will fluctuate irregularly. Thirdly, Hartley's amount of information I = ln M, that is, the argument of the function V(I), is not characterized by the difference of two entropies as in Section 9.1. Therefore, the definition given here is not consistent with the ideas of Section 9.1. It can be made more consistent if we take Shannon's mutual information I_xy as an argument, because I_xy is equal to the difference of entropies H_x − H_{x|y}, where H_x is the entropy before signal receipt, and H_{x|y} is the entropy after it.
The definition of the value of information proposed below has features of both
aforementioned definitions (Sections 9.1 and 9.2).
I_xu = I,   (9.3.3)

where I is an arbitrarily chosen number. Further analysis shows that this same extremum distribution results in an extremum (to be more precise, a minimum) of the information I_xu under the fixed average cost:

I_xu = min,   ∫∫ c(x, u) P(du | x) P(dx) = α = fixed.   (9.3.4)

Along with R(I) we can also consider the inverse dependence I_xu(α). The value I(α) is called the information corresponding to the level of losses R = α or, succinctly, α-information. As can be seen from Theorem 9.6 below, the function I_xu(α)
9.3 Definition of the value of Shannon’s information amount and α -information 301
is convex (see Figure 9.7). Therefore, the function R(I) is, in general, two-valued. In the general case, the function I(R) takes a minimum value equal to zero on some interval

R_0 ≤ R ≤ R^0.   (9.3.6)

Call the function R(I) = R_+(I), inverse to I(R) for R ≤ R_0, the normal branch. Also, call the function R(I) = R_−(I), inverse to I(R) for R ≥ R^0, the anomalous branch. We can define the value of Shannon's information for the normal branch as

V(I) = R_0 − R_+(I).   (9.3.7)

Further, define the value of Shannon's information for the anomalous branch by the formula

V(I) = R_−(I) − R^0.   (9.3.8)
With this definition, the value of information is always non-negative. In certain
cases, one of the branches may be thrown to infinity, i.e. may be absent.
In order to clarify the meaning of definitions (9.3.7) and (9.3.8), we first consider the notions of R_0 and R^0. The range (9.3.6) corresponds to null information I_xy = 0. This means that the distribution P(du | x) does not depend on x in (9.3.5). Thus,

R(0) = ∫ P(du) E[c(x, u) | u],

where E[c(x, u) | u] = ∫ c(x, u) P(dx). Hence, we can obtain the range of variation of R(0) over all possible P(du).

Further, it will be seen from Theorem 9.2 (Section 9.4, paragraph 2) that the normal branch corresponds to the minimum cost under the condition (9.3.3). Thus, the formula (9.3.7) can be expressed as

V(I) = min_u E[c(x, u)] − inf_{P(du|x)} E[c(x, u)],   (9.3.12)

where the infimum is taken over transition probabilities P(du | x) subject to

I_xu ≤ I.   (9.3.13)
Comparing (9.3.13) with (9.2.5) we notice an analogy between these two defini-
tions of the value of information. In both cases V (I) has the meaning of the largest
possible reduction of average cost (under the condition of a fixed amount of I).
Furthermore, the formula (9.3.8) on account of (9.3.9) and (9.3.11) takes the form
In this case, the function c(x, u) should be interpreted not as a cost, but as a reward.
The value of information then has the meaning of the largest possible average reward
yielded by a given amount of information I. Of course, it is not hard to come up with a version of Hartley's value of information corresponding to this case. Instead of the formula (9.2.5) we get
Accordingly, the value function is located in the interval 0 < V(I) < R_0 − R_L (normal branch) and in the interval 0 < V(I) < R_U − R^0 (anomalous branch).
Very important corollaries that connect the theory of the value of information
with the theory of optimal statistical solutions follow directly from the defini-
tion (9.3.12), (9.3.13) of the Shannon’s value of information. We provide the fol-
lowing easily proven results.
Theorem 9.2. Assume that a Bayesian system [c(x, u), P(x)] and an observed vari-
able y(x) with conditional probability distribution P(y | x) are given. Irrespective
of the specific decision rule u = d(y) (randomized or non-randomized), the level of average cost satisfies the inequality

E[c(x, u)] ≥ R_+(I_xy).   (9.3.17)
Proof. It is not hard to see that whatever algorithm u = d(y) we choose, the amount of information about the unknown value x cannot increase, i.e. the following inequality holds:

I_xu ≤ I_xy.   (9.3.18)
It follows from (9.3.12) and (9.3.13) that
belongs to the set of distributions enumerated while minimizing minP(u|x) E[c(x, u)].
As discussed above, the dependence V (I) is non-decreasing. Therefore, (9.3.17)
follows from (9.3.18) and (9.3.19). The proof is complete.
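Inequality (9.3.18) is the data-processing inequality for the chain x → y → u, which is easy to confirm numerically on random discrete distributions (the alphabet sizes below are chosen arbitrarily):

```python
import random, math

random.seed(0)

def rand_dist(n):
    # a random probability vector of length n
    w = [random.random() for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

def mutual_info(pxy):
    # I(x;y) = sum p(x,y) ln [ p(x,y) / (p(x) p(y)) ]
    px = [sum(row) for row in pxy]
    py = [sum(col) for col in zip(*pxy)]
    return sum(p * math.log(p / (px[i] * py[j]))
               for i, row in enumerate(pxy)
               for j, p in enumerate(row) if p > 0)

Px = rand_dist(4)                          # prior P(x)
Pyx = [rand_dist(3) for _ in range(4)]     # observation channel P(y|x)
Puy = [rand_dist(5) for _ in range(3)]     # randomized decision rule P(u|y)

Pxy = [[Px[i] * Pyx[i][j] for j in range(3)] for i in range(4)]
Pxu = [[sum(Pxy[i][j] * Puy[j][k] for j in range(3)) for k in range(5)]
       for i in range(4)]

assert mutual_info(Pxu) <= mutual_info(Pxy) + 1e-12   # I_xu <= I_xy, (9.3.18)
print("data-processing inequality holds")
```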
This theorem is evidence of the fruitfulness of the concept of the value of infor-
mation. The question of how to actually attain extremely small average cost, which
is studied by the theory of the value of information, will be covered in Chapter 11.
The idea of information corresponding to a predetermined level of losses was first introduced by Shannon [45] (the English original is [38]) and by Kolmogorov [26] (translated to English [27]). Shannon named it the rate of generating messages, whereas Kolmogorov called it W-entropy or ε-entropy. The notion of the value of information was introduced by Stratonovich [47].
The quantities of Hartley's and Shannon's values of information do not coincide; denoting Hartley's value by V_H(I), the inequality V_H(I) ≤ V(I) follows from their definitions. An important result of information theory is that V_H(I) ≈ V(I) asymptotically (Chapter 11). This result is as profound and significant as the results considered in Chapter 7 about asymptotically errorless communication through a noisy channel.
1. We call the problem, introduced in the previous section, of finding the extremum distribution for the dependence R(I) or V(I) the third variational problem of information theory. As with the second variational problem, we assume for simplicity that x and u are discrete random variables. We then extend the obtained results to the general case.
We vary the probabilities P(x, u) of the joint distribution of x and u in order to make the average cost attain an extremum:

E[c(x, u)] = ∑_{x,u} c(x, u) P(x, u) = extr   (9.4.1)

under a fixed amount of information

I_xu = I   (9.4.2)

[see (9.3.3)]. Since the a priori distribution P(x) is not supposed to change, it is necessary to add the constraint

∑_u P(x, u) = P(x).   (9.4.3)

The normalization constraint

∑_{x,u} P(x, u) = 1   (9.4.4)

can be ignored, since it is a consequence of (9.4.3). We can also omit the non-negativity constraint P(x, u) ≥ 0, because the solution obtained for the problem without it satisfies this constraint (see below).

9.4 Solution of the third variational problem. The corresponding potentials. 305

We solve the constrained extremum problem (9.4.1)–(9.4.3) by the method of Lagrange multipliers 1/β, γ(x), finding the
extremum of the expression
Denote by Z the set of pairs (x, u) (i.e. Z is a subset of the space X × U) having positive probabilities P(x, u) > 0 in an extreme value distribution. The partial derivative of (9.4.5) with respect to P(x, u) must be equal to zero. We do not differentiate ln P(x) because of (9.4.3). After differentiating K with respect to P(x, u) we obtain (which is a necessary condition for an extremum) the equality

ln [P(x, u) / (P(x) P(u))] + β c(x, u) + γ(x) = 0,   (9.4.6)

or, equivalently,

P(x, u) = P(x) P(u) e^{−γ(x) − β c(x, u)}.   (9.4.7)

Here we denote

∑_x P(x, u) = P(u).   (9.4.8)
Values of β, γ(x) can be determined from (9.4.2) and (9.4.3). Multiplying (9.4.6) by P(x, u) and summing over x, u, we obtain

β R + Γ = −I_xu,   (9.4.9)

where

Γ = ∑_x γ(x) P(x)   (9.4.10)

and

∑_{u∈Z_x} P(u) e^{−β c(x, u)} = e^{γ(x)}.   (9.4.13)
Proof. Differentiating the function (9.4.5) twice with respect to the variables P(x, u), (x, u) ∈ Z, we find the matrix of all second-order partial derivatives

∂²K / ∂P(x, u) ∂P(x′, u′) = ∂²I / ∂P(x, u) ∂P(x′, u′) = δ_xx′ δ_uu′ / P(x, u) − δ_uu′ / P(u)   (9.4.14)
for (x, u) ∈ Z, (x′, u′) ∈ Z. Further, let us prove that this matrix is positive semi-definite. It suffices to show that the x-matrix

L_xx′ = δ_xx′ / P(x, u) − 1 / P(u)

(u is fixed) is positive semi-definite, i.e. that the matrix δ_xx′/a_x − 1/∑_x a_x is positive semi-definite for arbitrary a_x > 0. We construct the quadratic form

∑_{x,x′} L_xx′ v_x v_x′ = ∑_x v_x² / P(x, u) − (∑_x v_x)² / P(u).

Substituting v_x = w_x P(x, u), we rewrite this form as

P(u) { ∑_x w_x² P(x | u) − [∑_x w_x P(x | u)]² }.
But the expression in the braces above is actually the variance of the variable w_x with respect to the distribution P(x | u). Its non-negativity entails the positive semi-definiteness of the matrix L_xx′ and, consequently, of the matrix (9.4.14).
As a result, K and I_xu both turn out to be convex functions of the arguments P(x, u), (x, u) ∈ Z. We expand these functions into Taylor series at the point corresponding to the extreme value distribution (9.4.7). Then we take into account the fact that the linear terms of this expansion vanish for the distribution in question. Finally, we use the above-mentioned positive semi-definiteness to obtain

dK ≥ 0;   dI_xu ≥ 0.   (9.4.15)
Since these relationships are valid for arbitrary variations of the variables P(x, u),
(x, u) ∈ Z, they are also valid for variations compatible with additional constraints (9.4.3)
and others. If the condition E[c(x, u)] = a holds, we have dI_xu ≥ 0 according
to (9.4.15), which proves the first assertion of the theorem. In order to prove the
Together with the conditions I_xu = I and (9.4.3), it yields β d(E[c(x, u)]) ≥ 0, i.e. d(E[c]) ≥ 0 when β > 0, and d(E[c]) ≤ 0 when β < 0. The proof is complete.
Theorem 9.4. The 'active' domain Z, where the extremum probabilities are non-zero (P(x, u) > 0), is cylindrical:

Z = X × U.   (9.4.16)

Here X and U are the sets where P(x) > 0 and P(u) > 0, respectively.
Proof. Suppose that Z = Z_1 does not coincide with X × U. Then clearly Z_1 must be a subset of the region X × U. According to (9.4.7), an extreme value distribution for the region Z_1 can be expressed as follows:

P_1(x, u) = P(x) P_1(u) e^{−γ_1(x) − β c(x, u)},   (x, u) ∈ Z_1,   (9.4.17)

where γ_1(x), P_1(u) satisfy all the necessary conditions. Employing (9.4.17), we construct an auxiliary distribution P_2(x, u) with probabilities that are non-zero in the broader region Z_2 = X × U.
Further, put

P_2(x, u) = P(x) P_2(u | x),   (x, u) ∈ Z_2,   (9.4.18)

where

P_2(u | x) = P(u) e^{−γ_1(x) − β c(x, u)} / ∑_{u′∈U} P(u′) e^{−γ_1(x) − β c(x, u′)} ≡ P(u) e^{−γ_2(x) − β c(x, u)}.   (9.4.19)
Taking account of the equality (9.4.13), we obtain

β R_2 + I_2 ≤ β R_1 + I.   (9.4.20)
Earlier the parameter β was assumed to be fixed. Now we consider the whole
family Z(β ) of active domains dependent on β and conduct the previously described
extension Z_1(β) → Z_2(β) for every β. Since the inequality (9.4.20) holds for each β, it follows that I_2(R) ≤ I_1(R), or R_2(I) ≤ R_1(I) for β > 0. Consequently, considering only the cylindrical 'active' domain (9.4.16), we do not lose optimality. This completes the proof.
Both the right-hand and left-hand sides of (9.4.7) are equal to zero outside of the domain X × U. That is why the equality

P(x, u) = P(x) P(u) e^{−γ(x) − β c(x, u)}   (9.4.21)

holds everywhere for an extreme value distribution. Due to Theorem 9.4, the equalities (9.4.12) and (9.4.13) can be expressed as follows:

∑_x P(x) e^{−γ(x) − β c(x, u)} = 1,   u ∈ U,   (9.4.22)

∑_{u∈U} P(u) e^{−β c(x, u)} = e^{γ(x)}.   (9.4.23)

We can also use the entire space of x and u for summation in the previous two formulae.
The equations (9.4.22) and (9.4.23) under the corresponding constraint (9.4.2)
and a fixed domain U allow us to find the optimal distribution (9.4.21). Hence,
we can represent an extremum dependence R(I) for each domain U that satisfies a
number of conditions. For a complete solution, the problem of how to choose an
active domain U from the set of feasible domains D remains. It is natural to use the
condition of extremum here:

R(I) = extr_{U∈D} R_U(I),   (9.4.24)

which results from (9.4.1); here R_U(I) denotes the dependence obtained for a fixed domain U. If the set D allows for continuous changes δU of the domain U then, as a rule, condition (9.4.24) can be replaced by the stationarity condition

δR = 0,   (9.4.25)

where the variation δR corresponds to the variation δU of the domain U while the information I remains constant:

δI = 0.   (9.4.26)
Despite the vanishing of the variations (9.4.25) and (9.4.26), the variation δU can be accompanied by non-zero variations δβ and δΓ. Using (9.4.25) and (9.4.26), we derive the following condition from (9.4.9):

R δβ = −δΓ,   (9.4.27)

which will be used in the future. As will be shown later, this condition can be expressed similarly to (10.1.8), which means extremality of the potential Γ for a fixed β.
Let us introduce the function

F = R + T I,   (9.4.28)

which is an analog of the free energy, where β = 1/T. Indeed, (9.4.28) resembles the famous relation from thermodynamics F = U − T H (U is internal energy, H is entropy); these relations differ only by the sign of the term T I.
We proceed to the derivation of other formulae that resemble the usual relations
from thermodynamics. We begin with the one involving the potential Γ (β ) instead
of the free energy F(T ) = −T Γ (1/T ).
Theorem 9.5. The following equations hold for the third variational problem:

dΓ/dβ = −R,   (9.4.29)

β dΓ/dβ − Γ = I,   (9.4.30)

dR/dI = −1/β,   (9.4.31)

by analogy with the first two variational problems.
Proof. We vary the parameter β in equation (9.4.22). This variation is accompanied, in general, by variations of the function γ(x) and of the active domain U. Express the variations dγ(x) and dΓ as sums of two variations:

dγ(x) = d_1 γ(x) + δγ(x),   dΓ = d_1 Γ + δΓ.   (9.4.32)

Variations d_1 γ(x) and d_1 Γ correspond to the variation d_1 β of the parameter β for a constant domain U, while variations δγ(x), δΓ, δβ = dβ − d_1 β correspond to the variation δU of the domain U.

We differentiate (9.4.22) for a constant domain U and obtain

∑_x [d_1 γ(x)/d_1 β + c(x, u)] P(x) e^{−γ(x) − β c(x, u)} = 0,   u ∈ U,
i.e. we obtain

d_1 Γ / d_1 β = −R   (9.4.33)
by (9.3.5) and (9.4.10). Formulae (9.4.33) and (9.4.11) clearly imply the next relation:

β d_1 Γ / d_1 β − Γ = I.   (9.4.34)
We can simply obtain the required relations (9.4.29), (9.4.30) if we take into account the fact that

dΓ/dβ = (d_1 Γ + δΓ) / (d_1 β + δβ) = −R

in consequence of (9.4.27), (9.4.33). For the derivation of (9.4.31) it suffices to take the differential of both sides of the equality (9.4.30), which gives

(dΓ/dβ) dβ + β d(dΓ/dβ) − dΓ = dI.

Since dΓ = (dΓ/dβ) dβ, the first and third terms cancel, so that dI = β d(dΓ/dβ) = −β dR by (9.4.29), which yields (9.4.31).
Analogous relations hold for the function F(T):

F − T dF/dT = R,   (9.4.35)

dF/dT = I,   (9.4.36)

dR/dI = −T.   (9.4.37)
From these relations, it is clear that the dependence R(I) turns out to be a Legendre transform

−R(I) = −F(T(I)) + T(I) I,   I = dF(T)/dT,   (9.4.38)

of the function F(T). Further, the function I(R) is essentially the following Legendre transform

I(R) = −β(R) R − Γ(β(R)),   R = −(dΓ/dβ)(β(R)),   (9.4.39)
of the function Γ(β). The derivative

dV(I)/dI ≡ v(I),   (9.4.40)

or, equivalently, ∓dR_±(I)/dI,
Theorem 9.6. Functions I(R), Γ (β ) are convex. Further, the normal branch of the
function R(I) is also convex, but the anomalous branch of R(I) is concave.
Proof. First, we prove that I(R) is convex. For this we consider two values R′ and R″ from the feasible region (9.3.16). Assume that they correspond to extreme value distributions P′(x, u), P″(x, u) and values of information I(R′), I(R″), respectively. Consider an intermediate point

R_λ = λ R′ + (1 − λ) R″   (9.4.41)

and the corresponding distribution

P_λ(x, u) = λ P′(x, u) + (1 − λ) P″(x, u),   0 ≤ λ ≤ 1.   (9.4.42)
Let I_xu[P_λ] denote the information I_xu for the distribution (9.4.42). While proving Theorem 9.3, we showed that the expression I_xu = I_xu[P] is a convex function with respect to the probabilities P(x, u). Thus,

I_xu[P_λ] ≤ λ I_xu[P′] + (1 − λ) I_xu[P″] = λ I(R′) + (1 − λ) I(R″).   (9.4.43)

Let us now compare I_xu[P_λ] with I(R_λ), which is the solution of the extremum problem (9.3.4). The distribution (9.4.42) is simply one of the distributions searched while minimizing. Thus, I(R_λ) ≤ I_xu[P_λ]. Comparing this inequality with (9.4.41), we obtain

I(λ R′ + (1 − λ) R″) ≤ λ I(R′) + (1 − λ) I(R″),
which proves convexity of the function I(R). Further, recall that Γ(β) and I(R) are connected by the Legendre transform (9.4.39). Therefore, convexity of the function Γ(β) follows from convexity of I(R), if we take into account the fact that the Legendre transform preserves convexity and concavity. The latter fact is most easily shown for the case of twice differentiable functions. Differentiating (9.4.29) we obtain

d²Γ/dβ² = −dR/dβ.   (9.4.44)
Differentiating the relation

dI/dR = −β,   (9.4.45)

which follows from (9.4.31), we also find

d²I/dR² = −dβ/dR.   (9.4.46)

Comparing (9.4.44) and (9.4.46), we obtain

d²Γ/dβ² = 1 / (d²I/dR²).
This shows that the convexity condition d²I/dR² ≥ 0 entails the convexity condition d²Γ/dβ² ≥ 0.
The normal branches of R(I), F(T ) are characterized by the positivity of param-
eters β , T . Due to (9.4.31), the derivative dI/dR is negative for the normal branch,
and convexity of the inverse function R(I) follows from convexity of the function
I(R). If the derivative dI/dR is positive, then the inverse function R(I) becomes
concave. This completes the proof.
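On the eight-point example, the convexity of Γ(β) asserted by the theorem can be spot-checked via second differences (with Γ(β) = ln Z(β) − ln 8 under the uniform prior, as before; the circular-distance cost is assumed):

```python
import math

N = 8
dists = [min(d, N - d) for d in range(N)]

def Gamma(beta):
    # potential for the eight-point example with the uniform prior
    return math.log(sum(math.exp(-beta * d) for d in dists)) - math.log(N)

# second differences of a convex function are non-negative
h = 0.05
for b in [0.2, 0.5, 1.0, 2.0, 4.0]:
    assert Gamma(b + h) - 2 * Gamma(b) + Gamma(b - h) >= 0
print("Gamma(beta) is convex on the sampled grid")
```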
In addition to the function (9.4.38) and the value function V(I) = R(0) − R_+(I), we can introduce the corresponding random functions, i.e. functions dependent not only on I but also on the random variable x. Taking into account the formula (9.4.10), written in the form

F(T) = −T ∑_x γ(1/T, x) P(x),

it is not hard to see that the Legendre transform (9.4.38) may be performed before averaging over x. That is, we perform the Legendre transform for the function F(T, x) = −T γ(1/T, x).
We call the result of the corresponding transform

V(x, I) = −F(T, x) + T I,   I = ∂F(T, x)/∂T,

the random value function. The derivative

I = ∂F(T, x)/∂T = −γ(β, x) + β ∂γ(β, x)/∂β,

on account of (9.4.23), (9.4.21), is nothing but the random information

I(x) = ∑_u P(u | x) ln [P(x, u) / (P(x) P(u))]   (for dP(u)/dβ = 0).
9.5 Solution of a variational problem under several additional assumptions. 313
1. Let us show the solution to the third variational problem in a more explicit form for two special cases. Denoting

Q(x) = P(x) e^{−γ(x)},   (9.5.1)

we write equation (9.4.22) in the form

∑_x Q(x) e^{−β c(x, u)} = 1,   u ∈ U.   (9.5.2)
First, suppose that this equation can be solved for the unknown function Q(x) just as a system of linear equations or a linear integral equation. In other words, there exists the inverse matrix (kernel)

(b_xu) = (e^{−β c(x, u)})^{−1}.   (9.5.3)

Then, from (9.5.2), we have

Q(x) = ∑_{u∈U} b_xu.
Averaging this equation with weight P(x) gives the expression for the potential

Γ = −H_x − ∑_x P(x) ln ∑_{u∈U} b_xu.   (9.5.4)
The subtracted part in the right-hand side of the last formula is evidently the condi-
tional entropy Hx|u .
If we introduce the function

−Γ_0(β) = β F_0 = ∑_x P(x) ln ∑_{u∈U} b_xu,

then the results (9.5.5), (9.5.6) can be expressed in the following compact form:

I = H_x − H_0(R).   (9.5.7)
2. Let us now make another assumption, namely that the function Q(x) = Q given by (9.5.1) is constant within the region X. From (9.5.2), we move Q outside of the summation and observe that

∑_{x∈X} e^{−β c(x, u)} = 1/Q
in this case. The sum on the left-hand side of the equality resembles (3.3.11), (3.6.4)
introduced to solve the first variational problem. Under the given assumptions, the
sum should not depend on u. By analogy with (9.1.3), let us denote
e^{−β F_0(β)} = ∑_{x∈X} e^{−β c(x, u)} = 1/Q(β).   (9.5.8)
Then the extremal conditional distribution takes the form

P(x | u) = e^{β F_0 − β c(x, u)}.
Example 9.3. We are given the Bayesian system described in the beginning of Sec-
tion 9.2. We have already calculated the value of Hartley’s information amount.
Now, for comparison, let us find the value of Shannon’s information amount.
We write the equation (9.5.2) for the current example taking into account that
the function c(x, u) (9.2.1) depends only on the difference x − u. Thus, this equation
takes the form

∑_x Q(x) e^{−β c(x−u)} = 1.
This equation has a solution when the entire region U consisting of 8 points is cho-
sen to be the active domain. The solution corresponds to a trivial constant function
Q(x) = Q. The partition function (9.5.8)
is independent of u and coincides with the sum (9.1.10). The dependence R0 (H)
is represented parametrically by expressions (9.1.11) and (9.1.12). It is depicted on
Figure 9.2.
The uniform distribution P(x) = 1/8 corresponds to the entropy H_x = 3 ln 2. Therefore, it follows from formula (9.5.14) that R(I) = R_0(3 ln 2 − I), and the value of Shannon's information amount has the form V(I) = R_0(3 ln 2) − R_0(3 ln 2 − I). This function is represented graphically on Figure 9.4.
This curve is simply the inversion of the curve from Figure 9.2.
Those magnitudes of the value of Hartley's information amount that were found in Section 9.2 are depicted by the stepped line on Figure 9.4. It lies below the main curve V(I). It can be seen from the graph that the value V(I) of one bit of information equals 1.35, which exceeds the value of 1 obtained in Section 9.2. Similarly, the value V(I) of two bits of information equals 1.78, which is greater than the previously determined value of 1.5.
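These figures can be reproduced with a few lines: on the extremal family P(x | u) proportional to e^{−βd}, one bisects on β until the mutual information I = ln 8 − H(x | u) reaches the prescribed amount (the circular-distance cost is assumed, as above):

```python
import math

N = 8
dists = [min(d, N - d) for d in range(N)]   # circular distances from any fixed u

def R_and_I(beta):
    w = [math.exp(-beta * d) for d in dists]
    Z = sum(w)
    R = sum(wi * d for wi, d in zip(w, dists)) / Z    # average cost
    return R, math.log(N) - (beta * R + math.log(Z))  # mutual information

def shannon_value(I_target):
    lo, hi = 1e-9, 50.0
    for _ in range(200):       # bisect on beta: I grows monotonically with beta
        mid = (lo + hi) / 2
        if R_and_I(mid)[1] < I_target:
            lo = mid
        else:
            hi = mid
    return 2.0 - R_and_I((lo + hi) / 2)[0]            # V(I) = R(0) - R(I)

print(shannon_value(math.log(2)))      # one bit:  about 1.356, cf. 1.35 above
print(shannon_value(2 * math.log(2)))  # two bits: about 1.786, cf. 1.78 above
```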
If any other number of points located on a circle is considered, then the value of
Shannon’s information amount can be found in precisely the same way, in which
case we shall observe a monotonic dependency without any irregular jumps.
This example illustrates clearly the difference between the values of Hartley’s
and Shannon’s amounts of information. As will be shown in Chapter 11 this differ-
ence vanishes in some more complex cases. This is the case, for instance, if instead
of the random variables x and u described above, one considers sequences x1 , . . . ,
xN and u1 , . . . , uN of random variables for reasonably large N.
Example 9.4. Let us consider a space containing an infinite number of points.
Assume that x and u can take any integer value . . ., −1, 0, 1, 2, . . . similar to the
variable ξ in Example 9.1 of Section 9.1, with P(x) = [(1 − ν)/(1 + ν)] ν^{|x|}, 0 < ν < 1. Let us take the simple cost function

c(x, u) = |x − u|.

In this case

e^{γ(x)} = P(x)/Q = e^{−β F_0} [(1 − ν)/(1 + ν)] ν^{|x|},

which can be rewritten as follows:

∑_u e^{−β |x−u|} P(u) = [(1 + z)/(1 − z)] [(1 − ν)/(1 + ν)] ν^{|x|},   z = e^{−β}.
Luckily, the latter equation has an exact solution. After multiplying both sides by τ^x (τ = e^{iλ}) and summing over x, we obtain

∑_σ e^{−β |σ|} τ^σ ∑_u τ^u P(u) = [(1 + z)/(1 − z)] [(1 − ν)/(1 + ν)] ∑_x ν^{|x|} τ^x.

However,

∑_{σ=−∞}^{∞} e^{−β |σ|} τ^σ = ∑_σ z^{|σ|} τ^σ = 1/(1 − zτ) + 1/(1 − z τ^{−1}) − 1 = (1 − z²)/(1 − 2z cos λ + z²),

∑_x ν^{|x|} τ^x = (1 − ν²)/(1 − 2ν cos λ + ν²),   (9.5.17)

and, consequently,

∑_u τ^u P(u) = [(1 − ν)/(1 − z)]² (1 + z² − 2z cos λ)/(1 + ν² − 2ν cos λ).
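The two-sided geometric sums used above are easily confirmed numerically (arbitrary test values of z, ν and λ; the series is truncated):

```python
import cmath, math

z, nu, lam = 0.4, 0.3, 0.9
tau = cmath.exp(1j * lam)
K = 200  # truncation order; z**K and nu**K are negligible

# sum_{sigma} z^{|sigma|} tau^sigma  vs  (1 - z^2)/(1 - 2 z cos(lam) + z^2)
series = sum(z ** abs(s) * tau ** s for s in range(-K, K + 1))
closed = (1 - z * z) / (1 - 2 * z * math.cos(lam) + z * z)
assert abs(series - closed) < 1e-9

# the same identity (9.5.17) with nu in place of z
series2 = sum(nu ** abs(s) * tau ** s for s in range(-K, K + 1))
closed2 = (1 - nu * nu) / (1 - 2 * nu * math.cos(lam) + nu * nu)
assert abs(series2 - closed2) < 1e-9
print("generating-function identities verified")
```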
We apply the inverse transformation to recover P(u).
Now, according to (9.4.21), we can readily compute the joint probability distribution
[see (1.2.3)].
9.6 Value of Boltzmann's information amount 319

The latter amount of information (9.6.1) differs from Hartley's amount, namely,

H_y < ln M,

if the non-zero probabilities P(y) are not all equal to each other. Whatever the case, the amount of information (9.6.1) assumes an intermediate place:

ln M ≥ H_y ≥ I_yz.   (9.6.2)
It is based on the entropy of the partition ∑ E_k; the number of regions can be arbitrary. As for the rest, the defining formula for V_B(I) will coincide with (9.2.6):

V_B(I) = min_u E[c(x, u)] − inf_{∑ E_k = X} E[ min_u E[c(x, u) | E_k] ].   (9.6.5)

The three value functions satisfy

V_H(I) ≤ V_B(I) ≤ V(I),   (9.6.6)

where V_H(I), V_B(I) and V(I) denote the values of Hartley's, Boltzmann's and Shannon's amounts of information, respectively.
Proof. Let u_k be the value achieved when minimizing the second term of (9.6.3). It follows from (9.6.4) that H_{u_k} ≤ I and, consequently, I_{x u_k} ≤ I. Therefore, the transition probabilities P(u_k | x) are included in the set G of transition probabilities searched during the minimization in definition (9.3.12). Therefore,

V_B(I) ≤ V(I),

where V_B(I) denotes the value of Boltzmann's information and V(I) that of Shannon's. On the other hand, during the minimization (9.2.6) one searches through partitions E_1 + ··· + E_M = X. The condition (9.6.4) is satisfied for each of these partitions. Thus, the class of partitions searched in (9.6.5) is broader than that in (9.2.6). It follows that

V_H(I) ≤ V_B(I),

where V_H(I) is the value of Hartley's information. The proof is complete.
Example 9.5. In order to illustrate the above, we revisit the example considered in paragraph 1 of Section 9.2 and Example 9.3 in Section 9.5. We search through all possible partitions of the set of eight points into connected regions E_1, ..., E_M (partitions whose parts are not formed of adjacent points are not extremal). We want to find the value of Boltzmann's information (9.6.1) and the difference (9.6.3) for each partition. Take into account that P(y_k) = P(E_k), and that the first term in (9.6.3) equals two in this particular case. We plot the points (I, Δ) found in the previous step in the plane
(I, V). Further, we draw a stepwise function V_B(I) in such a way that it occupies the lowest and the right-most position subject to the condition that no points lie above or to the left of it. This corresponds to the minimization with respect to the partitions appearing in (9.6.5). Points lying on the stepwise line correspond to the extreme partitions; the other points are disregarded as non-extreme. The graph of V_B(I), the value of Boltzmann's information, is depicted in Figure 9.4 (it is dashed where it does not coincide with the Shannon value V(I)). We now provide the extreme partitions for our case and the corresponding coordinates:
Content of partitions:  (1,7)   (2,6)   (3,5)   (1,2,5)   (1,3,4)   (2,3,3)
I_bits:                 0.537   0.819   0.954   1.299     1.406     1.562
V(I):                   0.5     0.75    1       1.125     1.25      1.375

Content of partitions:  (1,1,3,3)   (1,1,1,2,3)   (1,1,1,1,1,3)   (1,1,1,1,1,1,2)
I_bits:                 1.81        2.16          2.41            2.75
V(I):                   1.5         1.625         1.75            1.875

Here the upper rows present, in brackets, the numbers of adjacent points in each optimal partition E_1 + ··· + E_M (the points belong to E_1, ..., E_M, respectively).
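The table can be reproduced by scoring contiguous (arc) partitions directly (a sketch assuming the circular-distance cost; the recomputed entropies may differ from the printed figures in the last digits):

```python
import math

N = 8
def cost(x, u):
    d = abs(x - u) % N
    return min(d, N - d)

def score(sizes):
    # build one contiguous (arc) partition with the given arc sizes
    cells, start = [], 0
    for s in sizes:
        cells.append([(start + j) % N for j in range(s)])
        start += s
    H = -sum(s / N * math.log2(s / N) for s in sizes)   # I in bits, cf. (9.6.1)
    R = sum(len(c) / N *
            min(sum(cost(x, u) for x in c) for u in range(N)) / len(c)
            for c in cells)
    return H, 2.0 - R                                   # the point (I_bits, V)

for sizes in [(1, 7), (3, 5), (2, 3, 3), (1, 1, 3, 3)]:
    print(sizes, score(sizes))
```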
It can be seen on Figure 9.4 that the line of Boltzmann's value V_B(I) is situated in between the lines of Hartley's value V_H(I) and Shannon's value V(I), coinciding with V(I) in some regions. This is fully consistent with the inequalities (9.6.6).
To conclude this section, we present a simply proven but important fact about all three value-of-information functions.
9.7 Another approach to defining the value of Shannon’s information 321
Theorem 9.8. The value of information functions V_H(I), V_B(I), V(I) (of Hartley's, Boltzmann's and Shannon's amounts of information) for a Bayesian system [P(dx), c(x, u)] are invariant with respect to the following transformation of the cost function:

c′(x, u) = c_0(x) + c(x, u),   (9.6.7)

where c_0(x) is an arbitrary measurable function with a finite mean value.
Proof. In order to prove the invariance of the functions V_H(I), V_B(I), we need to take into account that

E[ min_u E[c′(x, u) | E_k] ] = E[ min_u E[c_0(x) | E_k] ] + E[ min_u E[c(x, u) | E_k] ]
= E[c_0(x)] + E[ min_u E[c(x, u) | E_k] ]   (9.6.8)

and

min_u E[c′(x, u)] = E[c_0(x)] + min_u E[c(x, u)],   (9.6.9)

whence the terms E[c_0(x)] cancel out in (9.6.3). In order to prove the invariance of V(I), it suffices to rewrite the second term of (9.3.12) using the equality E[c′(x, u)] = E[c_0(x)] + E[c(x, u)] together with (9.6.9).
Also, the optimal partition ∑_k E_k or the optimal transition probabilities P(du | x) remain unchanged when performing the substitution (9.6.7). All of the above can be extended to the case in which the function c(x, u) takes on the meaning of a reward instead of a cost, and the minimizations in (9.6.3) and other formulae are substituted with maximizations.
1. We shall now try formally to bring the definition of the value of Shannon's information amount closer to the definitions (9.2.5) and (9.2.6), which, in contrast to (9.3.13), involve minimization over u in the second term. Let us introduce a modified value function V̄(I) of Shannon's information amount which, as will be seen later, sometimes coincides with V(I). That is,
Let [P(dx), c(x, u)] be a given Bayesian system. We introduce an auxiliary ran-
dom variable z taking values from some sample space Z and associated with x via
transition probabilities P(dz | x). We treat z as an observed variable carrying infor-
mation about x. After observing z, an optimal non-randomized estimator u = d(z) is
determined by the minimization
Proof. Assume that for a given Bayesian system [P(dx), c(x, u)] the extremum of
the third variational problem under the constraint
Ixu ≤ I (9.7.6)
is achieved for a certain distribution P(du | x), which we will denote by Pe (du | x).
This extremum distribution specifies a two-dimensional distribution
x −→ u −→ π . (9.7.7)
Here, the last term corresponds to the distribution Pe (dz | x) specified above. The
conditional expectation E[c(x, u) | z = Pe (dx | u)] is taken with probabilities
P(du | z) (where z = Pe (dx | u)), since
c(x, u) = −γ (x)/β − (1/β ) ln [Pe (dx | u)/P(dx)] (9.7.10)
and so
inf_{u′} E[c(x, u′) | z] = −(1/β ) E[γ (x) | z] + inf_{u′} E[ −(1/β ) ln [Pe (dx | u′)/P(dx)] | z ] (9.7.11)
for any u′. Consider u′ that takes the value of the inverse image of the point z
according to the transformation u → z specified above. In other words, u′ is one of
the points that satisfy the equality Pe (dx | u′) = z = Pe (dx | u).
Then the inequality (9.7.12) takes the following form:
inf_{u′} E[ −(1/β ) ln [Pe (dx | u′)/P(dx)] | z = Pe (dx | u) ]
  ≤ E[ −(1/β ) ln [Pe (dx | u)/P(dx)] | z = Pe (dx | u) ]. (9.7.13)
Here Pe (dx | u′) on the right-hand side is substituted with Pe (dx | u). Further, we
plug (9.7.13) into (9.7.11) and average over z to obtain
E inf_{u′} E[c(x, u′) | z] ≤ E[ −γ (x)/β ] + E[ −(1/β ) ln [Pe (dx | u)/P(dx)] ] = ∫ c(x, u)Pe (dx, du)
(the formula (9.7.10) has been used again). This relation allows us to transform
(9.7.8) to the form
V̄ (I) ≥ inf_u E c(x, u) − ∫ c(x, u)Pe (dx, du).
The right-hand side expression is nothing more than the value V (I) of the Shan-
non's information. Comparison of the obtained inequality V̄ (I) ≥ V (I) with (9.7.5)
yields (9.7.1). The proof is complete.
where P(dπ ) = ∫_X P(dx)P(dπ | x); P(dx | π ) = P(dx)P(dπ | x)/P(dπ ); π ≡ π (dx)
is a point that belongs to the space of distributions Π . At the same time the set G of
conditional distributions is constrained by the inequality Ixπ ≤ I (9.7.3).
We now show that the value functions V (I), V (I) can be defined by the for-
mula (9.7.14), but only if we replace the feasible set G with some narrower sets G,
G, respectively, such that G ⊃ G ⊃ G.
For a fixed partition ∑k Ek = X included in definitions (9.2.6), (9.6.5) it is appro-
priate to put
P(dπ | x) = ∑_k P(Ek | x) δ (dπ , P(dx | Ek )), (9.7.15)
As seen from the right-hand side of (9.7.15), the event x ∈ Ek implies the event
π (dx) = P(dx | Ek ) with probability 1 and vice versa. Therefore
For the set of distributions of the type (9.7.15) the integral taken over π in (9.7.14)
becomes a sum according to (9.7.17). Hence, we obtain
V = inf_u ∫ c(x, u)P(dx) − inf_{P(dπ |x)} ∑_k P(Ek ) inf_u ∫ c(x, u)P(dx | π = P(dx | Ek )).
At the same time minimization over P(d π | x) can be reduced to minimization over
partitions ∑k Ek = X for the set of distributions of the type (9.7.15). The resulting
expression coincides with the expression situated on the right-hand side of the for-
mulae (9.2.6a), (9.6.5). Consequently, the values of information V (I), V (I) can be
also determined by the formula (9.7.14), if minimization is taken over distributions
of the type (9.7.15). The set G corresponding to the definition of the value func-
tion V (I) is indeed the set of those distributions of the type (9.7.15), for which the
inequality (9.6.4) holds.
As a result of the equivalence of the events x ∈ Ek and π (dx) = P(dx | Ek ), the
transformation x → π can be reduced to the transformation x → k. In other words,
the formula
Ixπ = Hk = − ∑_k P(Ek ) ln P(Ek )
holds in this case. Therefore, (9.6.4) implies the condition Ixπ ≤ I, i.e. (9.7.3). Thus,
G belongs to the set G. Finally, the inclusion G ⊂ G results from the fact that the
constraint (9.6.4) is weaker than the upper bound constraint on the number of
regions E1 , . . . , EM , where M = e^I .
Chapter 10
Value of Shannon’s information for the most
important Bayesian systems
In this chapter, the general theory concerning the value of Shannon’s information,
covered in the previous chapter, will be applied to a number of important practical
cases of Bayesian systems. For these systems, we derive explicit expressions for the
potential Γ (β ), which allows us to find a dependency in a parametric form between
losses (risk) R and the amount of information I and then, eventually, to find the value
function V (I).
First, we consider those Bayesian systems, for which the space X turns out to be
especially easy, namely those for which X consists of two points. In this case, the
third variational problem can be solved by reduction to a simple algebraic equation
with just one variable. In the case of systems with a homogeneous cost function
(Section 10.2), the Fourier transform method, or an equivalent operator method, can
be applied to obtain a solution.
Other specific (matrix) methods are employed in the important case of Gaussian
Bayesian systems characterized by the Gaussian prior distribution and the bilinear
cost function. They allow us to obtain a solution to the problem and study the depen-
dency of the active subspace U on the parameter β or I. Special attention is paid to
various (finite dimensional and infinite dimensional) stationary Gaussian systems,
for which the value of information function is expressed in parametric form.
1. Consider the simple case of the sample space X of a random variable such that
X contains exactly two points x1 and x2 . Probabilities P(x1 ) = p, P(x2 ) = 1 − p are
assumed to be known. The space U of all possible estimators u is assumed to be
more complicated. For instance, we consider the real axis as U. In this case, the cost
function c(x, u) can be reduced to two functions of u:
The equation (9.5.2) for the given Bayesian system takes the following form:
∑_{i=1}^{2} Qi e^{−β ci (u)} = 1,   u ∈ U. (10.1.1)
that is
where
δΓ = (dΓ /dβ ) δ β + δ0Γ = −R δ β + δ0Γ , (10.1.7)
dβ
where the variation δ0Γ corresponds to the variation of the region U taken without
respect to the variation of the parameter β . Plugging (10.1.7) into (9.4.27) we obtain
the condition
δ0Γ = 0. (10.1.8)
Since the variations δ u1 , δ u2 are independent, the condition (10.1.8) can be broken
down into two equations
∂Γ /∂ u1 = 0;   ∂Γ /∂ u2 = 0. (10.1.9)
Assuming that the Jacobian
∂ (d1 , d2 )/∂ (u1 , u2 ) = det ‖∂ di /∂ u j ‖   (i, j = 1, 2)
of the transformation (10.1.4) does not equal zero, the equations (10.1.9) can be
replaced by the equations
∂Γ /∂ d1 = 0;   ∂Γ /∂ d2 = 0. (10.1.10)
Differentiating (10.1.4) we can express them in an explicit way
(1 − p) coth β d1 − coth β (d1 + d2 ) + ∂ M/∂ d1 = 0,
p coth β d2 − coth β (d1 + d2 ) + ∂ M/∂ d2 = 0. (10.1.11)
2. Consider quadratic cost functions
c1 (ũ) = (ũ − (a1 − a2 )/2)^2 + b1 ;   c2 (ũ) = (ũ + (a1 − a2 )/2)^2 + b2 ,
which, up to terms independent of ũ, reduce to
c1 (ũ) = ũ^2 − (a1 − a2 )ũ;   c2 (ũ) = ũ^2 + (a1 − a2 )ũ (10.1.13)
according to (10.1.4).
The values ũ1 , ũ2 must be determined from the equations (10.1.9), (10.1.10).
As a result of the symmetry of the functions (10.1.13) we can look for symmetric
roots ũ2 = −ũ1 , which drastically simplifies the equations. Hence, as a consequence
of (10.1.14) the function (10.1.13) takes the form
R = −d1 tanh β d1 + M = −2d1^2 /(a1 − a2 )^2 + M = −ũ1^2 ,
I = h2 (p) − ln(2 cosh β d1 ) + 2β ũ1^2 .
I = h2 (p) − ln 2 + (1/2)(1 + ϑ ) ln(1 + ϑ ) + (1/2)(1 − ϑ ) ln(1 − ϑ )
  = h2 (p) − h2 ((1 + ϑ )/2). (10.1.19)
V (I) = R(0) − R(I) = ((a1 − a2 )^2 /4) [ϑ ^2 − (2p − 1)^2 ]. (10.1.20)
We eliminate parameter ϑ from (10.1.19), (10.1.20) and so obtain the depen-
dency between the value V and the information amount I that takes the form of the
equation
I = h2 (p) − h2 ( 1/2 + √( V /(a1 − a2 )^2 + (p − 1/2)^2 ) ). (10.1.21)
While the amount of information I changes gradually from 0 to h2 (p), the value
V ranges from 0 to (a1 − a2 )^2 [1/4 − (p − 1/2)^2 ] = (a1 − a2 )^2 p(1 − p), and ϑ takes
values from 2p − 1 to 1, respectively.
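The parametric dependence above can be traced numerically. The sketch below (with assumed illustrative values of p, a1, a2) sweeps ϑ over its range, evaluates (10.1.19) and (10.1.20), and checks that the closed form (10.1.21) reproduces the same curve and the stated endpoints.

```python
import numpy as np

p, a1, a2 = 0.7, 1.0, -1.0      # illustrative values (assumed)

def h2(q):
    """Binary entropy (natural logarithm), as in the text."""
    return -q*np.log(q) - (1 - q)*np.log(1 - q)

theta = np.linspace(2*p - 1 + 1e-9, 1 - 1e-9, 400)   # parameter range
I = h2(p) - h2((1 + theta)/2)                        # (10.1.19)
V = (a1 - a2)**2/4*(theta**2 - (2*p - 1)**2)         # (10.1.20)

# Closed form (10.1.21) recovered from V:
I_check = h2(p) - h2(0.5 + np.sqrt(V/(a1 - a2)**2 + (p - 0.5)**2))
print(np.max(np.abs(I - I_check)))     # the two forms agree
print(I[0], V[0])                      # at theta = 2p-1: I = 0, V = 0
print(V[-1], (a1 - a2)**2*p*(1 - p))   # at theta -> 1: V -> (a1-a2)^2 p(1-p)
```

The check works because V/(a1 − a2)^2 + (p − 1/2)^2 collapses to ϑ^2/4, so the argument of h2 in (10.1.21) is exactly (1 + ϑ)/2.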
1. In this section we assume that the cost function c(x, u) of the Bayesian system
[P(x), c(x, u)] is translation invariant, i.e. it depends on x and u only via their differ-
ence: c(x, u) = c(x − u) = c(z). In other words, c(x, u) is invariant under translation
x → x + a, u → u + a. At the same time it is implied that x, u, z = x − u, a all take
values from the same space X, so that a translation keeps this space invariant. Con-
sequently, the space X must not have boundaries, but it may have a ‘finite’ volume
(for instance, it can be periodically closed).
We apply the above translation to the equality (9.5.2)
which yields
∑_x Q(x + a) e^{−β c(x−u)} = 1. (10.2.2)
Assuming that the translation leaves the active subspace Ũ invariant, we conclude
from a comparison of (10.2.1) and (10.2.2) that Q(x) = Q(x + a), that is, Q(x) =
Q is constant. Therefore, in this case we can apply the theory covered in paragraph 2
of Section 9.5 and use the formulae (9.5.8), (9.5.10), (9.5.15). Knowing the potential
Pe (x | u) = eβ F0 −β c(x−u) (10.2.5)
does not depend on the a priori distribution P(x). The variable H = −dF0 /dT =
Hx − I in the formulae (10.2.3) is nothing but the entropy Hx|u of this conditional
distribution: H = −E[ln P(x | u)].
The a priori distribution P(x) influences only the distribution Pe (u) in (10.2.4).
The equation (9.4.23) that defines it can be rewritten as
and obtain
θe (s) = e^{−ψ (β ,s) − β F0} θ (s)
from the equation (10.2.6). The last equation can be further transformed to the form
where
e^{ψ (β ,s)} = ∑_z e^{−β c(z) + sz} , (10.2.8)
and also
ψ (β , 0) = −β F0 (1/β ). (10.2.9)
By virtue of (10.2.8), the relation (10.2.6) can be rewritten in operator form
d
exp ψ β , − ψ (β , 0) Pe (u) = P(x) (10.2.10)
dx
10.2 Systems with translation invariant cost function 333
and, using
ψ (β , d/dx) − ψ (β , 0) = ∑_{l=1}^{∞} (1/l!) (∂ ^l ψ /∂ s^l )(β , 0) d^l /dx^l ,
it is possible to rewrite (10.2.11) as follows:
Pe (u) = P(x) + ∑_{n=1}^{∞} ∑_{l1 ,...,ln =1}^{∞} ((−1)^n /(n! l1 ! · · · ln !)) k_{l1} · · · k_{ln} (d^{l1 +···+ln} /dx^{l1 +···+ln}) P(x)
  = P(x) − k1 dP(x)/dx + (1/2)(k1^2 − k2 ) d^2 P(x)/dx^2 + · · · . (10.2.12)
The derivatives kl = (∂ ^l ψ /∂ s^l )(β , 0) here coincide, according to (10.2.5), (10.2.8),
with the cumulants of the distribution P(z) = exp [β F0 − β c(z)]. The obtained for-
mula (10.2.12) is convenient when the a priori distribution P(x) is much
broader than the conditional distribution (10.2.5): under these conditions the
expansion terms decrease rapidly.
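The rapid decrease of the terms in (10.2.12) can be seen numerically. The sketch below is an assumed test case, not from the text: for a zero-mean Gaussian kernel (k1 = 0, k2 = ε²) and a broad Gaussian prior N(0, Δ²), the exact Pe is the deconvolved N(0, Δ² − ε²) density, and the two-term truncation p − (k2/2)p″ should approach it with an O(ε⁴/Δ⁴) error.

```python
import numpy as np

D = 1.0   # width of the broad prior p(x) ~ N(0, D^2) (assumed)

def error_at_zero(eps):
    """Gap at x = 0 between exact deconvolution and the truncation of (10.2.12)."""
    p0 = 1/np.sqrt(2*np.pi*D**2)          # p(0)
    ppp0 = -p0/D**2                       # p''(0) for the Gaussian prior
    truncated = p0 - (eps**2/2)*ppp0      # two terms of (10.2.12), k1 = 0
    exact = 1/np.sqrt(2*np.pi*(D**2 - eps**2))
    return abs(exact - truncated)

e1, e2 = error_at_zero(0.1), error_at_zero(0.2)
print(e1, e2, e2/e1)   # the error grows roughly as eps^4: ratio near 16
```

Doubling ε multiplies the residual by about 2⁴ = 16, consistent with the error estimate in (10.2.21) below.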
The simplest case takes place when the distribution P(x) is uniform:
P(x) = e−Hx .
Then it follows from (10.2.6) that Pe (u) has the same form:
The relations given in this section hold true both for discrete and continuous
spaces X. Of course sums should be substituted by integrals in the continuous case.
It may happen that in the case of a continuous or unbounded space some terms
in the above expressions do not have any independent meaning (for example, Hx ).
However, they occur in combination with other terms and functions in such a way
that those combinations have meaning (for instance, the sum Hx + dF0 /dT ).
2. Examples. The examples covered in Sections 9.2 and 9.5, where the variable x
was discrete, belong precisely to this class of systems with a translation invariant
cost function.
In this section, we consider examples, in which variable x assumes continuous
values and is described by the probability density function p(x).
Let the cost function c(z) be expressed in the following way:
c(z) = 0, if z ∈ Z0 ;   c(z) = ∞, if z ∉ Z0 . (10.2.14)
we obtain
β F0 = − ln Ω ,
where Ω = ∫_{Z0} dz is the ‘volume’ of the region Z0 , where c(z) = 0.
Fig. 10.1 Dependency of the average cost on the amount of information for the cost func-
tion (10.2.14)
R = (d/dβ )(β F) = 0;   I = Hx − ln Ω . (10.2.15)
It can be seen from here that in order to keep losses at zero, the amount of informa-
tion must be equal to
Icr = Hx − ln Ω = − ∫ ln[Ω p(x)] p(x) dx. (10.2.16)
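The identity in (10.2.16) is easy to confirm numerically. The sketch below uses an assumed Gaussian prior p(x) ~ N(0, 1) and an assumed window ‘volume’ Ω = 0.1, and checks that Hx − ln Ω equals the integral on the right-hand side.

```python
import numpy as np

Omega = 0.1                              # 'volume' of Z0 (assumed)
x = np.linspace(-10, 10, 200001)         # integration grid
dx = x[1] - x[0]
p = np.exp(-x**2/2)/np.sqrt(2*np.pi)     # assumed prior N(0, 1)

Hx = -np.sum(p*np.log(p))*dx             # differential entropy of p(x)
Icr = -np.sum(p*np.log(Omega*p))*dx      # right-hand side of (10.2.16)
print(Hx - np.log(Omega), Icr)           # the two expressions agree
```

For N(0, 1) the entropy is Hx = (1/2) ln(2πe) ≈ 1.4189, so here Icr ≈ 1.4189 + ln 10 ≈ 3.72 nats.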
Fig. 10.2 The value of information for quadratic cost function (10.2.20)
If ε ^2 = E[z_i^2 ] is the fidelity for each coordinate, then the average cost R = E[c(z)]
can be set to rε ^2 due to (10.2.18). Hence, formulae (10.2.19) will yield
I = r ln ( 1/(ε √(2π e)) ) + Hx ,
which resembles (10.2.17). Eliminating parameter T from (10.2.19) and solving for
V (I) = R(0) − R(I), we obtain the value of information function
V (I) = (r/2π ) e^{2Hx /r − 1} ( 1 − e^{−2I/r} ), (10.2.20)
which is shown on Figure 10.2.
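The shape of (10.2.20) can be checked directly. The sketch below assumes a Gaussian prior N(0, Δ²) in each of the r coordinates, for which Hx = (r/2) ln(2πeΔ²); the formula then gives V(0) = 0 and a saturation level equal to the no-information risk R(0) = rΔ².

```python
import numpy as np

r, Delta = 3, 2.0                            # illustrative values (assumed)
Hx = r/2*np.log(2*np.pi*np.e*Delta**2)       # entropy of the assumed prior

def V(I):
    """Value of information (10.2.20) for the quadratic cost."""
    return r/(2*np.pi)*np.exp(2*Hx/r - 1)*(1 - np.exp(-2*I/r))

print(V(0.0))                  # no information gives no value
print(V(1e9), r*Delta**2)      # saturation: V -> R(0) = r*Delta^2
```

The factor e^{2Hx/r − 1} collapses to 2πΔ² for this prior, which is why the curve saturates exactly at rΔ².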
Let us also write the probability density for distribution Pe (u), defined by for-
mula (10.2.11). Distribution (10.2.5) in this case is Gaussian
Pe (x | u) = (β /π )^{r/2} e^{−β ∑_i (xi − ui )^2 },
If this ratio is small, then one can leave only the first few expansion terms, for
instance,
pe (u) = p(x) − (T /4) ∑_i (∂ ^2 /∂ x_i^2 ) p(x) + O(ε ^4 /Δ ^4 ). (10.2.21)
An analogous formula takes place also for the previous example (10.2.14), when
pe (x | u) = 1/Ω , if x − u ∈ Z0 ;   pe (x | u) = 0, if x − u ∉ Z0 .
Namely, if
(k1 )_i = ∫_{Z0} z_i dz = 0,
then
pe (u) = p(x) − (1/2) ∑_{i, j} B_{i j} (∂ ^2 /∂ x_i ∂ x_j ) p(x) + · · · ,
where
B_{i j} = (1/Ω ) ∫_{Z0} z_i z_j dz   (B_{i j} ∼ ε ^2 ).
The above results are valid under the assumptions that probabilities (10.2.11)
are non-negative, and entropy Hx of the a priori distribution is not smaller than the
conditional entropy Hx|u = Hz of distribution (10.2.5) obtained as a solution of the
extremum problem.
4. To conclude this section, let us consider the function
which, as will be seen in Chapter 11, plays a significant role in addressing the issue
of how the value functions for Shannon's and Hartley's information amounts differ
from each other.
In the case of a translation invariant cost function the expression
In the simple case of a uniform distribution Pe (u) (see (10.2.13)), formula (10.2.23)
yields
or
μx (t) = −Hx + tF0 (−1/t) = Γ (−t), (10.2.25)
if we consider (10.2.9), (9.5.10). Thus, function μx (t) does not depend on x.
1. Let x and u be points from Euclidean spaces Rr and Rs having dimensions r and s,
respectively. A Bayesian system [p(x), c(x, u)] is called Gaussian, if the probability
density p(x) is Gaussian, and c(x, u) is the sum of linear and quadratic forms of x
and u. Choosing an appropriate origin in Rs we therefore have
1
c(x, u) = c0 (x) + xT gu + uT hu, (10.3.1)
2
(where c0 (x) = c0 +cT1 x+ 12 xT c2 x, which, however, is not mandatory). Here a matrix
form is used:
x = (x1 , . . . , xr )^T ;   x^T = (x1 . . . xr );   u = (u1 , . . . , us )^T ;   u^T = (u1 . . . us );
The matrix a is non-singular and positive definite. If the distribution p(x) turns out to
be singular, i.e. it is concentrated in some subspace X ⊂ Rr , then we can restrict our
consideration to the space X = Rr̃ substituting r by r̃(r̃ < r). Thus, without loss of
generality, the correlation matrix kx = a−1 of the distribution (10.3.2) can be treated
as non-singular.
For Gaussian systems we can represent the function q(x) (see (9.5.1)) in the
following Gaussian form:
p(x) e^{−γ (x) − β c^0 (x)} = q(x) e^{−β c^0 (x)} = exp ( σ + x^T ρ − (1/2) x^T κ x ). (10.3.3)
and be concentrated in some subspace U = Rs̃ (dim U = s̃) of the space Rs . Specif-
ically, there may be a coincidence Rs̃ = Rr . Henceforth, we restrict ourselves by
considering the space Rs̃ and consider matrices h and g to be of sizes s̃ × s̃ and
r̃ × s̃, respectively. Thus, we consider spaces Rr̃ , Rs̃ of x, u so that matrices kx = a−1 ,
ku = au^{−1} are non-singular. The variables σ , ρ = (ρ1 , . . . , ρr̃ )^T , κ = ‖κi j ‖, m and au
appear in what follows.
Assuming that the matrix κ is positive definite and, consequently, that det κ ≠ 0, we
take that integral with the help of
∫ exp ( −(1/2) x^T Ax + x^T y ) dx = [det(A/2π )]^{−1/2} exp ( (1/2) y^T A^{−1} y ), (10.3.5)
where A is a positive definite matrix (the latter formula can be simply derived
from (5.4.19)). This leads to the result
[det(κ /2π )]^{−1/2} exp [ (1/2)(ρ ^T − β u^T g^T ) κ ^{−1} (ρ − β gu) ] = exp ( −σ + (β /2) u^T hu ).
We take the logarithm of this equality and equate separately quadratic, linear and
constant terms (with respect to u). This operation yields the following equations:
β g^T κ ^{−1} g = h, (10.3.6)
−g^T κ ^{−1} ρ = 0, (10.3.7)
σ = (1/2) ln det(κ /2π ) − (1/2) ρ ^T κ ^{−1} ρ . (10.3.8)
In order to obtain the other necessary relations, we turn to the second equa-
tion (9.4.23), which, by multiplying by p(x), can be rewritten as follows:
∫_U p(u) p(x) e^{−γ (x) − β c(x,u)} du = p(x),   x ∈ X.
which, as can be seen from the following, completely determines the matrix au . In-
troducing the unknown matrix
and taking into account that a = kx^{−1} we rewrite (10.3.12) in the following form:
we transform (10.3.14) into
β k̃x [1u + β ^2 B k̃x ]^{−1} = h, (10.3.16)
where k̃x = g^T kx g and 1u is the unit operator in Ũ = Rs̃ .
It is not hard to write the solution of equation (10.3.16):
1u + β ^2 B k̃x = β h^{−1} k̃x ,
and, therefore,
β B = h^{−1} − (β k̃x )^{−1} . (10.3.17)
We have assumed here that the matrices h, k̃x are non-singular (this condition charac-
terizes the subspace Ũ). As a result of (10.3.13), (10.3.10) it follows from (10.3.17)
that
au = β [h^{−1} − (β k̃x )^{−1} ]^{−1} − β h, (10.3.18)
κ = a + β g[h^{−1} − (β k̃x )^{−1} ]g^T = kx^{−1} − g k̃x^{−1} g^T + β g h^{−1} g^T . (10.3.19)
Plugging the latter formula into the equality (10.3.7) we find that
g^T κ ^{−1} g [h^{−1} − (β k̃x )^{−1} ] au m = 0,
10.3 Gaussian Bayesian systems 341
ρ = 0. (10.3.22)
Γ (β ) = −σ + (1/2) E[x^T κ x − x^T ax] + (1/2) ln det(a/2π ) − β E[c^0 (x)]
  = −(1/2) ln det(κ kx ) + (1/2) E[x^T (κ − a)x] − β E[c^0 (x)]. (10.3.24)
Here we take into account (10.3.22) and (10.3.23). Because
E[(κ − a)xx^T ] = (κ − a)kx = κ kx − 1x
which is valid if f (0) = 0 (see (A.1.5) and (A.1.6)). Substituting A = g(β h^{−1} − k̃x^{−1} ),
B = g^T kx into equation (10.3.27) we get
tr f ( g(β h^{−1} − k̃x^{−1} ) g^T kx ) = tr f ( g^T kx g (β h^{−1} − k̃x^{−1} ) ) = tr f ( β k̃x h^{−1} − 1 ).
Applying the last equality and taking into account (10.3.26), we obtain from (10.3.25),
for the functions f (z) = ln (1 + z), f (z) = z,
Γ (β ) = −(1/2) tr ln(β k̃x h^{−1} ) + (1/2) tr(β k̃x h^{−1} − 1u ) − β E[c^0 (x)]. (10.3.28)
Hence, the dependency of Γ on β can be represented by the formula
Γ (β ) = −(s̃/2) ln β − (1/2) tr ln(k̃x h^{−1} ) − s̃/2 − β M, (10.3.29)
where s̃ = tr(1u ) is the dimension of the space Ũ(β ) and also
M = E[c^0 (x)] − (1/2) tr(k̃x h^{−1} ).
To calculate R and I according to (9.4.29), (9.4.30) it remains to differentiate
the potential (10.3.28), (10.3.29). It follows from the general theory concerning the
third variational problem (see the proof of Theorem 9.5) that we can equally well
vary the active domain Ũ(β ) or assume it is constant. Following the latter, simpler
approach, we obtain from (10.3.28)
R = (1/2) tr ( 1u /β − k̃x h^{−1} ) + E[c^0 (x)];
I = (1/2) tr ln(β k̃x h^{−1} ) = (s̃/2) ln β + (1/2) tr ln(k̃x h^{−1} ). (10.3.30)
The value of information function (9.3.7) can be obtained by forming the difference
V (I) = (1/2) tr ( 1u /β − k̃x h^{−1} ) |_{I=0} − (1/2) tr ( 1u /β − k̃x h^{−1} ). (10.3.31)
The first term, taken at I = 0, vanishes, which gives
V = (1/2) tr ( k̃x h^{−1} − 1u /β ). (10.3.32)
The last relation together with the second formula (10.3.30) gives a parametric rep-
resentation of the dependence V (I).
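The parametric pair (10.3.30), (10.3.32) is easy to evaluate through the eigenvalues λ_j of the matrix k̃x h⁻¹. The sketch below (with an assumed random positive spectrum, and assuming β large enough that the whole space is active, i.e. βλ_j > 1 for every j) traces the curve: I grows without bound while V saturates.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 3))
lam = np.linalg.eigvalsh(A @ A.T) + 0.5   # positive eigenvalues of k~x h^-1 (assumed)

def IV(beta):
    I = 0.5*np.sum(np.log(beta*lam))      # second formula in (10.3.30)
    V = 0.5*np.sum(lam - 1/beta)          # (10.3.32)
    return I, V

beta_min = 1/lam.min()                    # all modes active for beta > beta_min
for beta in [2*beta_min, 5*beta_min, 50*beta_min]:
    print(IV(beta))  # I increases with beta; V saturates at (1/2) tr(k~x h^-1)
```

As β → ∞ the term 1/β disappears and V approaches (1/2) tr(k̃x h⁻¹), the full value of complete information about the active modes.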
3. Let us characterize the space Ũ(β ). We remind that for the theory covered
in this section to be valid (as has been mentioned before), it is necessary
that the matrices κ and B^{−1} = β h + au be positive definite and, consequently, non-
singular; and also that the matrices h and k̃x be non-singular as well. It is easy to
see that the matrix k̃x = g^T kx g is positive semi-definite, so that its non-singularity
is equivalent to positive definiteness.
Further, taking into account (10.3.17), the requirement of positive definiteness of
the matrices β B = h^{−1} − (β k̃x )^{−1} and k̃x entails positive definiteness and, consequently,
non-singularity of the matrix h^{−1} = β B + (β k̃x )^{−1} for β > 0. Finally, taking into ac-
count (10.3.19) we conclude that the matrix κ = a + β ^2 gBg^T is positive definite,
since the matrix gBg^T is positive semi-definite. Thus, in order to fulfil all the neces-
sary requirements it is sufficient to comply only with two requirements:
1. The matrix k̃x = g^T kx g must be positive definite;
2. The difference h^{−1} − (β k̃x )^{−1} must be positive definite:
V (I) = (g^2 kx /2h)(1 − e^{−2I} ) (10.3.34)
(the matrices coincide with scalars).
Let us turn to a two-dimensional case, when there are two independent Gaus-
sian random variables with variances k1 and k2 , respectively, and zero means. Let
matrices g, h have the diagonal form:
g = diag(g1 , g2 );   h = diag(h1 , h2 ).
For clarity, let us assume that k1 g1^2 /h1 > k2 g2^2 /h2 . According to conditions 1
and 2, the space Ũ(β ) will either consist of points (u1 , u2 ) = (u1 , 0) belonging to
the line u2 = 0 for h1 /k1 g1^2 < β < h2 /k2 g2^2 , or will coincide with the entire two-
dimensional space U = R2 for β > h2 /k2 g2^2 .
In the first case, when Ũ(β ) has one dimension, we have
g = (g1 , 0)^T ;   g^T = (g1 0);   k̃x = g^T kx g = (k1 g1^2 ).
Taking into account (10.3.10) we write formulae (10.3.30) for the given example as
follows:
I = (1/2) ln β + (1/2) ln(k1 g1^2 /h1 ),   for h1 /k1 g1^2 < β < h2 /k2 g2^2 ;
I = ln β + (1/2) ln(k1 g1^2 /h1 ) + (1/2) ln(k2 g2^2 /h2 ),   for β > h2 /k2 g2^2 ;
R = 1/2β − k1 g1^2 /2h1 + E[c^0 (x)],   for h1 /k1 g1^2 < β < h2 /k2 g2^2 ;
R = 1/β − k1 g1^2 /2h1 − k2 g2^2 /2h2 + E[c^0 (x)],   for β > h2 /k2 g2^2 . (10.3.35)
d^2V /dI^2 = d(1/β )/dI = dT /dI (10.3.36)
undergoes a jump. It equals d^2V /dI^2 (I2 − 0) = −2T2 = −2k2 g2^2 /h2 to the left of the
point I2 and d^2V /dI^2 (I2 + 0) = −T2 = −k2 g2^2 /h2 to the right of it.
The obtained dependency is depicted in Figure 10.3. The dimension s̃ of
the space Ũ(β ) may be interpreted as the number of active degrees of freedom,
which may vary with changes in temperature. This leads to a discontinuous jump
of the second derivative (10.3.36), which is analogous to an abrupt change in heat
capacity in thermodynamics (a second-order phase transition).
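The jump can be exhibited numerically from the piecewise formulas (10.3.35). The sketch below (all parameters illustrative, assumed; E[c0] dropped) evaluates d²V/dI² = dT/dI by finite differences just below and just above the transition point β2 = h2/(k2 g2²).

```python
import numpy as np

k1, g1, h1 = 2.0, 1.0, 1.0            # k1*g1^2/h1 = 2.0 (assumed)
k2, g2, h2 = 1.0, 1.0, 1.0            # k2*g2^2/h2 = 1.0 (smaller, as in the text)
b2 = h2/(k2*g2**2)                    # threshold beta_2 where u2 becomes active

def I_of(beta):
    """First formula in (10.3.35), continuous across beta_2."""
    if beta < b2:
        return 0.5*np.log(beta) + 0.5*np.log(k1*g1**2/h1)
    return np.log(beta) + 0.5*np.log(k1*g1**2/h1) + 0.5*np.log(k2*g2**2/h2)

def d2V_dI2(beta, db=1e-7):
    """dT/dI with T = 1/beta, cf. (10.3.36)."""
    return (1/(beta + db) - 1/beta)/(I_of(beta + db) - I_of(beta))

T2 = 1/b2
print(d2V_dI2(b2*(1 - 1e-4)), -2*T2)  # left limit:  -2*k2*g2^2/h2
print(d2V_dI2(b2*(1 + 1e-4)), -T2)    # right limit: -k2*g2^2/h2
```

Below β2 we have I = (1/2) ln β + const, so dT/dI = −2T; above it I = ln β + const, so dT/dI = −T: the second derivative halves in magnitude at the transition.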
5. In conclusion, let us compute function (10.2.22) for Gaussian systems. Taking
into account (10.3.1), (10.3.4), (10.3.21) we get
μx (t) = (1/2) ln det(au /2π ) + t c^0 (x) + ln ∫_U exp ( t x^T gu + (t/2) u^T hu − (1/2) u^T au u ) du.
Averaging over x, we obtain
E[μx (t)] = t E[c^0 (x)] − (1/2) tr ln(1 − t h au^{−1} ) + (1/2) t^2 tr [ au^{−1} (1 − t h au^{−1} )^{−1} k̃x ] (10.3.38)
and the derivatives
Fig. 10.3 The value of information for a Gaussian system with a ‘phase transition’ corresponding
to formula (10.3.35)
μx′ (t) = c^0 (x) + (1/2) tr [ h au^{−1} (1 − t h au^{−1} )^{−1} ] + (1/2) t x^T g au^{−1} (2 − t h au^{−1} )(1 − t h au^{−1} )^{−2} g^T x, (10.3.39)
E[μx″ (t)] = (1/2) tr [ (h au^{−1} )^2 (1 − t h au^{−1} )^{−2} ] + tr [ au^{−1} (1 − t h au^{−1} )^{−3} k̃x ]. (10.3.40)
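The consistency of (10.3.38) and (10.3.40) can be sanity-checked in the scalar case, where the traces reduce to ordinary fractions. The sketch below (parameter values assumed, purely illustrative) differentiates the scalar form of (10.3.38) by finite differences and compares with the scalar form of (10.3.40).

```python
import numpy as np

c0bar, au, h, kx = 0.3, 2.0, 0.5, 1.7   # illustrative scalar parameters (assumed)

def Emu(t):
    """Scalar form of (10.3.38)."""
    q = t*h/au
    return t*c0bar - 0.5*np.log(1 - q) + 0.5*t**2*(1/au)*kx/(1 - q)

def Emu2(t):
    """Scalar form of (10.3.40): E[mu_x''(t)]."""
    q = t*h/au
    return 0.5*(h/au)**2/(1 - q)**2 + (1/au)*kx/(1 - q)**3

t, dt = 0.7, 1e-4                       # need t*h/au < 1 for convergence
second_fd = (Emu(t + dt) - 2*Emu(t) + Emu(t - dt))/dt**2
print(second_fd, Emu2(t))               # the two values agree to O(dt^2)
```

The agreement reflects the algebraic identity (1 − tq)⁻¹ + 2tq(1 − tq)⁻² + t²q²(1 − tq)⁻³ = (1 − tq)⁻³, which is exactly what the second differentiation of (10.3.38) produces.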
To ensure this, the stochastic process x must be stationary, i.e. the density p(x) has
to satisfy the condition
and the cost function must also satisfy the stationarity condition
When applied to Gaussian systems, the conditions of stationarity lead to the re-
quirement that matrices
hi j , gi j , ai j , (10.4.1)
in the expressions (10.3.1), (10.3.2) must be stationary, i.e. their elements have to
depend only on the difference of indices:
At first, let us consider the case when spaces have a finite dimensionality r. Ma-
trices (10.4.1) can be transformed to the diagonal form by a unitary transformation
x̃ = U^+ x (10.4.2)
where
h̃l = ∑_{k=1}^{r} e^{−2π ilk/r} hk (10.4.5)
10.4 Stationary Gaussian systems 347
The result of the unitary transformation (10.4.2) is that the new variables
(x̃1 , . . . , x̃r )^T = U^+ (x1 , . . . , xr )^T
are such that
U^+ hU = ‖h̃(ωl ) δlm ‖,
where
h̃(ωl ) = ∫_0^{T0} e^{−iωl τ } h(τ ) dτ = ∫_{−T0 /2}^{T0 /2} e^{iωl τ } h(τ ) dτ , (10.4.9)
and so on for the other matrices. Formulae (10.3.30) in this case will have a form
similar to (10.4.6):
I0 = I/T0 = (1/2T0 ) ∑_{l∈L} ln [ β k̃(ωl ) |g̃(ωl )|^2 / h̃(ωl ) ],
R0 = R/T0 = −(1/2T0 ) ∑_{l∈L} [ k̃(ωl ) |g̃(ωl )|^2 / h̃(ωl ) − 1/β ] + (1/T0 ) E[c^0 (x)], (10.4.10)
with the only difference that index l now can range over all possible integer values
. . ., −1, 0, 1, 2, . . ., which satisfy the conditions
Then for β < 1/Φmax there are no indices l for which the condition (10.4.11) is
satisfied, the summations in (10.4.10) are absent, so that I0 equals zero
and also
(R0 )_{I0 =0} = (1/T0 ) E[c^0 (x)]. (10.4.12)
I0 will take a non-zero value as soon as β attains the value 1/Φmax and surpasses it.
Taking into account (10.4.12) by the usual formula
The dependency V0 (I0 ) has been obtained via (10.4.10), (10.4.13) in a parametric
form.
3. Let us consider the case when x = {. . . , x−1 , x0 , x1 , x2 , . . .} is an infinite station-
ary sequence, and the elements of the matrices
h = ‖hi− j ‖,   g = ‖gi− j ‖
The integration is carried over the subinterval L(β ) of the interval (−π , π ), for which
β k̃(λ )|g̃(λ )|^2 > h̃(λ ). The obtained formulae determine the rates I1 , R1 correspond-
ing on average to one element from the sequences {. . . , x1 , x2 , . . .}, {. . . , u1 , u2 , . . .}. De-
noting the length of the subinterval L(β ) by |L|, we obtain from (10.4.16) that
4π R1 /|L| = −Φ1 + Φ2 e^{−4π I1 /|L|} + const, (10.4.17)
where
Φ1 = (1/|L|) ∫_L Φ (λ ) dλ ;   Φ2 = exp [ (1/|L|) ∫_L ln Φ (λ ) dλ ]
are the mean values (over L(β )) of the functions Φ (λ ) = k̃(λ )|g̃(λ )|^2 /h̃(λ ) and ln Φ (λ ).
4. Finally, suppose there is a stationary process on the infinite continuous-time
axis. The functions h, g depend only on the differences in time, similarly to (10.4.8).
This case can be considered as a limiting case of the systems considered in paragraphs 2
and 3. Thus, in the formulae of paragraph 2 it is required to take the limit T0 → ∞, in
which the points ωl = 2π l/T0 fill the ω axis ever more densely (Δ ω = 2π /T0 ),
The functions h̄(ω ), ḡ(ω ) included in the previous formulae are defined by the
equality (10.4.9):
h̄(ω ) = ∫_{−∞}^{∞} e^{−iωτ } h(τ ) dτ , . . . . (10.4.19)
The integration in (10.4.18) is carried over the region L(β ) belonging to the axis ω ,
where
β k(ω )|g(ω )|2 > h(ω ). (10.4.20)
Denote the total length of that region by |L|. Then, similarly to (10.4.17), formu-
lae (10.4.18) entail the following:
4π V0 /|L| = Φ1 − Φ2 e^{−4π I0 /|L|} ,
where
Φ1 = (1/|L|) ∫_L Φ (ω ) dω ;   Φ2 = exp [ (1/|L|) ∫_L ln Φ (ω ) dω ],
Φ (ω ) = k̄(ω )|ḡ(ω )|^2 /h̄(ω ). (10.4.21)
k(τ ) = σ 2 e−γ |τ | .
Fig. 10.4 The rate of the value of information for the example with the cost function (10.4.23)
When the matrices h, g (as can be seen from comparison with (10.3.1)) have the
form h = ‖δ (t − t′ )‖, g = −‖δ (t − t′ )‖, then
Φ (ω ) = 2γσ ^2 /(γ ^2 + ω ^2 ).
The condition (10.4.20) now reads 2β γσ ^2 > γ ^2 + ω ^2 and, consequently, for a fixed
value β > γ /2σ ^2 the region L(β ) is the interval
−√(2β γσ ^2 − γ ^2 ) < ω < √(2β γσ ^2 − γ ^2 ). Instead of the parameter β , we consider now
Because of (10.4.23) and the stationarity of the process, the doubled rate of loss
2R0 , corresponding to a unit of time, coincides with the mean square error 2R0 =
E[x(t) − u(t)]^2 . Furthermore, the doubled value 2V0 (I0 ) = σ ^2 − 2R0 (I0 ) represents
the maximum decrease of the mean square error that is possible for a given
rate of information amount I0 .
The graph of the value function, corresponding to formulae (10.4.25), is shown
on Figure 10.4. These formulae can also be used to obtain approximate formulae for
the dependency V0 (I0 ) under small and large values of parameter y (or, equivalently,
of the ratio I0 /y).
For small y ≪ 1, let us use the expansions arctan y = y − y^3 /3 + y^5 /5 − · · ·
and y/(1 + y^2 ) = y − y^3 + y^5 − · · · . Substituting them into (10.4.25) we obtain
I0 = (γ /π )( y^3 /3 − y^5 /5 + · · · ),   V0 = (σ ^2 /π )( (2/3)y^3 − (4/5)y^5 + · · · ).
For large y, we similarly obtain
π I0 /γ = y − π /2 + y^{−1} + · · · ,
π V0 /σ ^2 = π /2 − 2y^{−1} + (4/3)y^{−3} − · · · .
The fact of the asymptotic equivalence of the values of various types of information
(Hartley's, Boltzmann's or Shannon's information amounts) should be regarded as
the main asymptotic result concerning the value of information; it holds true un-
der very broad assumptions, such as the requirement of information stability. This
fact cannot be reduced to the fact of asymptotically errorless information transmis-
sion through a noisy channel stated by Shannon's theorem (Chapter 7); it is
an independent and no less significant fact.
The combination of these two facts leads to a generalized result, namely the gen-
eralized Shannon's theorem (Section 11.5). The latter concerns a general performance
criterion determined by an arbitrary cost function and the risk corresponding to it.
Historically, the fact of asymptotic equivalence of different values of information
was first proven (1959) precisely in this composite and implicit form, combined
with the second fact (asymptotically errorless transmission). Initially, this fact was
not regarded as an independent one, and was, in essence, incorporated into the gen-
eralized Shannon’s theorem.
In this chapter we follow another way of presentation and treat the fact of asymp-
totic equivalence of the values of different types of information as a special, com-
pletely independent fact that is more elementary than the generalized Shannon's the-
orem. We consider this way of presentation preferable both from the fundamental
and from the pedagogical points of view. At the same time we can clearly observe the
symmetry of information theory and the equal importance of the second and the third
variational problems.
Apart from the fact of asymptotic equivalence of the values of information, it
is also interesting and important to study the magnitude of their difference. Sec-
tions 11.3 and 11.4 give the first terms of the asymptotic expansion for this dif-
ference, which were found by the author. These terms are exact for a chosen
random encoding, and they give an idea of the rate of decrease of the differ-
ence (as in any asymptotic semiconvergent expansion), even though the sum of
all the remaining terms of the expansion is not evaluated. We pay a special at-
tention to the question about invariance of the results with respect to a transfor-
mation c(ξ , ζ ) → c(ξ , ζ ) + f (ξ ) of the cost function, which does not influence
354 11 Asymptotic results about the value of information. Third asymptotic theorem
Let [P(dx), c(x, u)] be a Bayesian system. That is, there is a random variable x
from probability space (X, F, P), and F × G-measurable cost function c(x, u), where
u (from a measurable space (U, G)) is an estimator. In Chapter 9, we defined
for such a system the value functions of different types of information (Hartley,
Boltzmann, Shannon). These functions correspond to the minimum average cost
E[c(x, u)] attained for the specified amounts of information received. Hartley's
information consists of indexes pointing to which region Ek of the optimal partition
E1 + · · · + EM = X (Ek ∈ F) the value of x belongs. The minimum losses
R(I) = inf_{∑k Ek} E [ inf_u E [c(x, u) | Ek ] ] (11.1.1)
are determined by minimization over both the estimators, u, and different partitions.
The upper bound I on Hartley’s information amount corresponds to the upper bound
M eI on the number of the indicated regions.
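For a small finite X, the minimization (11.1.1) can be carried out by brute force over all partitions into at most M regions. The sketch below (prior and cost matrix are assumed, illustrative values) shows how the Hartley-type risk decreases as M grows.

```python
from itertools import product
import numpy as np

P = np.array([0.1, 0.2, 0.3, 0.4])       # prior on X = {0,1,2,3} (assumed)
c = np.array([[0., 1.], [1., 0.],        # cost matrix c(x, u), u in {0,1}
              [0.3, 0.8], [0.9, 0.2]])

def hartley_risk(M):
    """Brute-force version of (11.1.1): best risk over partitions into <= M regions."""
    best = np.inf
    for labels in product(range(M), repeat=len(P)):   # assignment x -> region index
        risk = 0.0
        for k in range(M):
            E = [x for x in range(len(P)) if labels[x] == k]
            if E:
                # sum_{x in E_k} P(x) c(x,u), minimized over u,
                # equals P(E_k) * min_u E[c(x,u) | E_k]
                risk += (P[E, None]*c[E]).sum(axis=0).min()
        if risk < best:
            best = risk
    return best

for M in [1, 2, 4]:
    print(M, hartley_risk(M))   # risk is non-increasing in M
```

In this example two regions already reach the full-information risk, since U contains only two estimators; with larger U the decrease continues up to M = |X|.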
When bounding the Shannon’s amount of information, the following minimum
costs are considered:
R(I) = inf_{P(du|x)} ∫ c(x, u) P(dx) P(du | x) (11.1.2)
R(I) ≤ R(I) ≤ R(I). (11.1.3)
Note that the definitions of the functions R(I), R(I), R(I) imply only the inequal-
ity (11.1.3). A question that emerges here is how much the functions R(I) and
R(I) differ from each other. If they do not differ much, then, instead of the function
R(I), which is difficult to compute, one may consider the computationally much easier
function R(I), which in addition has other convenient properties, such as differen-
tiability. The study of the differences between the functions R(I), R(I) and, thus,
between the value functions V (I), V (I), V (I), is the subject of the present chapter.
It turns out that for Bayesian systems of a particular kind, namely Bayesian sys-
tems possessing the property of ‘information stability’, there is asymptotic equiva-
lence of the above-mentioned functions. This profound asymptotic result (the third
asymptotic theorem) is comparable in its depth and significance with the respective
asymptotic results for the first and the second theorems.
Before giving the definition of informationally stable Bayesian systems, let us
consider composite systems, of which the informationally stable systems are a
generalization. We call a Bayesian system $[P_n(d\xi), c(\xi,\zeta)]$ an $n$-th power (or degree) of
a system [P1 (dx), c1 (x, u)], if the random variable ξ is a tuple (x1 , . . . , xn ) consisting
of n independent identically distributed (i.i.d.) random variables that are copies of
x, i.e.
Pn (d ξ ) = P1 (dx1 ) · · · P1 (dxn ). (11.1.4)
The estimator, $\zeta$, is a tuple of identical $u_1, \ldots, u_n$, while the cost function is the
following sum:
\[ c(\xi,\zeta) = \sum_{i=1}^{n} c_1(x_i, u_i). \tag{11.1.5} \]
Formulae (11.1.1), (11.1.2) can be applied to the system [P1 , c1 ] as well as to its
n-th power. Naturally, the amount of information In = nI1 for a composite system
should be $n$ times the amount of the original system. Then the optimal distribution
$P(d\zeta \mid \xi)$ (corresponding to formula (11.1.2) for a composite system) is factorized
according to (9.4.21) into the product
\[ P_{I_n}(d\zeta \mid \xi) = P_{I_1}(du_1 \mid x_1) \cdots P_{I_1}(du_n \mid x_n), \tag{11.1.6} \]
where $P_{I_1}(du \mid x)$ is the analogous optimal distribution for the original system. Fol-
lowing (11.1.2) we have
\[ R_1(I_1) = \int c_1(x,u)\,P_1(dx)\,P_{I_1}(du \mid x). \tag{11.1.7} \]
Thus, for the $n$-th power system, as a consequence of (11.1.5), (11.1.7), we obtain
\[ R_n(nI_1) = nR_1(I_1). \tag{11.1.8} \]
The case of function (11.1.1) is more complicated. The functions $\tilde R_n(nI_1)$ and $\tilde R_1(I_1)$
of the composite and the original systems are no longer related so simply. The num-
ber of partition regions for the composite system can be assumed to be $M = [e^{nI_1}]$,
whereas for the original system it is $m = [e^{I_1}]$ (the brackets $[\,\cdot\,]$ here denote the integer part).
Hence,
\[ \tilde R_n(nI_1) \le n\tilde R_1(I_1). \tag{11.1.11} \]
However, besides partitions (11.1.9) there is a large number of feasible partitions
of other kinds. Thus, it is reasonable to expect that $n\tilde R_1(I_1)$ is substantially larger
than $\tilde R_n(nI_1)$. One can expect that for some systems the rates $\tilde R_n(nI_1)/n$ of aver-
age costs [clearly exceeding $R_1(I_1)$ on account of (11.1.3), (11.1.8)] decrease as $n$
increases and approach their plausible minimum, which turns out to coincide pre-
cisely with $R_1(I_1)$, i.e. $\tilde R_n(nI_1)/n \to R_1(I_1)$. It is this fact that is the essence of the main result
(the third asymptotic theorem). Its derivation also yields another important result,
namely, there emerges a method of finding a partition $\sum_k G_k$ close (in the asymptotic
sense) to the optimal one. As it turns out, a procedure analogous to decoding via a ran-
dom Shannon code is suitable here (see Section 7.2). One takes $M$ code points $\zeta_1$,
$\ldots$, $\zeta_M$ (we recall that each of them represents a block $(u_1, \ldots, u_n)$). These points
are the result of $M$-fold random sampling with probabilities
\[ P_{I_n}(d\zeta) = \int_{X^n} P(d\xi)\,P_{I_n}(d\zeta \mid \xi) = P_{I_1}(du_1) \cdots P_{I_1}(du_n). \tag{11.1.12} \]
The region $G_k$ contains those points $\xi$ which are 'closer' to the point $\zeta_k$ than to the other
points (equidistant points may be assigned to any of the competing regions by default).
If in the specified partition (11.1.13) the estimator is chosen to be the point $\zeta_k$ used to
construct the region $G_k$, instead of the point $\zeta$ minimizing the expression $\mathbf E[c(\xi,\zeta) \mid G_k]$,
then some optimality is lost, that is,
\[ \mathbf E\Bigl[\inf_{\zeta} \mathbf E[c(\xi,\zeta) \mid G_k]\Bigr] \le \mathbf E\bigl[\mathbf E[c(\xi,\zeta_k) \mid G_k]\bigr]. \tag{11.1.14} \]
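The random-code construction just described is easy to simulate. In the sketch below (an illustration with invented parameters, not the book's own example) the elementary system is a symmetric binary source with Hamming cost; $M = [e^{nI_1}]$ code blocks are drawn i.i.d. from the product marginal, and each source block $\xi$ is assigned to the nearest code point, as the measuring decoder would do. The resulting per-letter cost is an empirical upper bound of the type discussed above, and it stays above the Shannon limit $R_1(I_1)$, in agreement with (11.1.3), (11.1.8).

```python
import math
import random

random.seed(1)
n = 12                      # block length of the composite system (invented)
I1 = 0.3                    # information rate per letter, in nats (invented)
M = int(math.exp(n * I1))   # number of code points, M = [e^{n I_1}]

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

trials, total = 2000, 0
for _ in range(trials):
    xi = [random.randint(0, 1) for _ in range(n)]         # source block
    code = [[random.randint(0, 1) for _ in range(n)]      # random codebook
            for _ in range(M)]
    total += min(hamming(xi, z) for z in code)            # nearest code point

rate = total / (trials * n)   # empirical cost rate, an estimate of R~_n(nI_1)/n

# Shannon limit for the binary symmetric source with Hamming cost:
# R_1(I_1) = D solving ln 2 - H(D) = I_1, H being binary entropy in nats.
def H(d):
    return -d * math.log(d) - (1 - d) * math.log(1 - d)
D = min((d / 10000 for d in range(1, 5000)),
        key=lambda d: abs(math.log(2) - H(d) - I1))
print(rate, D)
```

At this small block length the empirical rate (about 0.2 per letter here) is still visibly above the limiting value $R_1(I_1) \approx 0.13$; the third asymptotic theorem asserts that the gap closes as $n \to \infty$.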
11.2 Theorem about asymptotic equivalence of the value functions of different. . . 357
or via
\[ \gamma_1(x) = \ln \int e^{-\beta c_1(x,u)}\,P_{I_1}(du), \tag{11.1.17} \]
with the functions $\gamma_1(x)$, $\gamma_n(\xi) = \gamma_1(x_1) + \cdots + \gamma_1(x_n)$. Averaging them gives the poten-
tials
\[ \Gamma_1(\beta) = \mathbf E[\gamma_1(x)] = \int P(dx) \ln \int e^{-\beta c_1(x,u)}\,P_{I_1}(du), \tag{11.1.18} \]
\[ \Gamma_n(\beta) = \mathbf E[\gamma_n(\xi)] = n\Gamma_1(\beta), \tag{11.1.19} \]
\[ R_n = -n\,\frac{d\Gamma_1}{d\beta}(\beta), \qquad R_1 = -\frac{d\Gamma_1}{d\beta}(\beta), \tag{11.1.20} \]
which allow us to find upper bounds for the minimum losses $\tilde R_n(nI_1)$ corresponding to
Hartley's information amount. We use inequality (11.1.15), which is valid for every
set of code points $\zeta_1, \ldots, \zeta_M$. It remains true if we average with respect to a
statistical ensemble of code points described by the probabilities $P_{I_n}(d\zeta_1) \cdots P_{I_n}(d\zeta_M)$,
$M = [e^{nI_1}]$. At the same time we have
\[ \tilde R_n(nI_1) \le \int \mathbf E\bigl[\mathbf E[c(\xi,\zeta_k) \mid G_k] \,\big|\, \zeta_1, \ldots, \zeta_M\bigr]\, P_{I_n}(d\zeta_1) \cdots P_{I_n}(d\zeta_M) \equiv L. \tag{11.2.1} \]
Let us write down the expression $L$ in the right-hand side in more detail. Since
$G_k$ is the region of points $\xi$ for which the 'distance' $c(\xi,\zeta_k)$ is at most the 'distance'
$c(\xi,\zeta_i)$ to any point $\zeta_i$ from $\zeta_1, \ldots, \zeta_M$, we conclude that
\[ \mathbf E[c(\xi,\zeta_k) \mid G_k] = \frac{1}{P(G_k)} \int_{\substack{c(\xi,\zeta_k)\le c(\xi,\zeta_1) \\ \cdots \\ c(\xi,\zeta_k)\le c(\xi,\zeta_M)}} c(\xi,\zeta_k)\,P(d\xi). \tag{11.2.2} \]
For each k-th term, it is convenient first of all to make an integration with respect
to points ζi that do not coincide with ζk . If we introduce the cumulative distribution
function
\[ 1 - F_\xi(\lambda) = \begin{cases} \displaystyle\int_{c(\xi,\zeta)\ge\lambda} P_{I_n}(d\zeta) & \text{for } \lambda \ge 0, \\[6pt] \displaystyle\int_{c(\xi,\zeta)>\lambda} P_{I_n}(d\zeta) & \text{for } \lambda < 0, \end{cases} \tag{11.2.3} \]
then, after an (M − 1)-fold integration by points ζi not coinciding with ζk , equa-
tion (11.2.2) results in
\[ L \le \sum_{k=1}^{M} \int_\xi \int_{\zeta_k} c(\xi,\zeta_k)\,[1-F_\xi(c(\xi,\zeta_k))]^{M-1}\,P(d\xi)\,P_{I_n}(d\zeta_k) \]
\[ = M \int_\xi \int_\zeta c(\xi,\zeta)\,[1-F_\xi(c(\xi,\zeta))]^{M-1}\,P(d\xi)\,P_{I_n}(d\zeta). \tag{11.2.4} \]
The inequality sign has emerged because for $c(\xi,\zeta_k) \ge 0$ we slightly expanded the
regions $G_i$ ($i \ne k$) by adding to them all 'questionable' points $\xi$ for which $c(\xi,\zeta_k) =
c(\xi,\zeta_i)$, while for $c(\xi,\zeta_k) < 0$ we shrank those regions by dropping out all such
points.
It is not hard to see that (11.2.4) can be rewritten as
\[ L \le \int P(d\xi) \int_{-\infty}^{\infty} \lambda\, d\bigl\{1 - [1-F_\xi(\lambda)]^M\bigr\}. \tag{11.2.5} \]
It is evident that
\[ 1 - F_1 = [1-F_\xi]^M \le 1 - F_\xi, \quad \text{i.e.} \quad F_1(\lambda) \ge F_\xi(\lambda). \tag{11.2.7} \]
Since
\[ 1 - F_\xi \le e^{-F_\xi}, \]
we get
\[ (1-F_\xi)^M \le e^{-MF_\xi}, \tag{11.2.10} \]
which is equivalent to (11.2.9).
Employing (11.2.7), (11.2.9) we obtain that function (11.2.8) does not surpass
function (11.2.6):
\[ F_2(\lambda) \le F_1(\lambda). \tag{11.2.11} \]
It follows from the last inequality that
\[ \int \lambda\, dF_1(\lambda) \le \int \lambda\, dF_2(\lambda). \tag{11.2.12} \]
In order to ascertain this, we can take into account that (11.2.11) entails the
inequality $\lambda_1(F) \le \lambda_2(F)$ for the inverse functions, due to which the difference
$\int\lambda\,dF_2 - \int\lambda\,dF_1$ can be written as $\int_0^1 [\lambda_2(F) - \lambda_1(F)]\,dF$ and, consequently, turns
out to be non-negative.
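The chain of bounds (11.2.7)-(11.2.12) is easy to check numerically. In the sketch below (an independent illustration; the exponential distance distribution is invented), $F_\xi$ is the distribution of the 'distance' to a single random code point, $F_1 = 1-(1-F_\xi)^M$ is the exact distribution of the minimum over $M$ points, and $F_2$ uses the bound $e^{-MF_\xi}$; the mean computed from $F_2$ indeed dominates the exact mean $\int \lambda\, dF_1$.

```python
import math

M = 8
F  = lambda lam: 1.0 - math.exp(-lam)            # invented F_xi: Exp(1) distance
F1 = lambda lam: 1.0 - (1.0 - F(lam)) ** M       # exact CDF of the minimum
F2 = lambda lam: max(F(lam), 1.0 - math.exp(-M * F(lam)))  # bounding CDF

# Pointwise inequality (11.2.11): F2 <= F1 everywhere.
grid = [k * 0.001 for k in range(20000)]
assert all(F2(lam) <= F1(lam) + 1e-12 for lam in grid)

# Means via E = int (1 - F) dlambda for nonnegative distances:
h = 0.001
E1 = sum((1.0 - F1(lam)) * h for lam in grid)
E2 = sum((1.0 - F2(lam)) * h for lam in grid)
print(E1, E2)
```

Here $E_1$ comes out close to the exact value $1/M$ for the minimum of $M$ exponential variables, and $E_2 \ge E_1$, exactly as (11.2.12) requires.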
So, inequality (11.2.5) will only become stronger if we substitute $\int\lambda\,dF_1$ in the
right-hand side by $\int\lambda\,dF_2$. Therefore, we shall have
\[ \tilde R_n(nI_1) \le \int P(d\xi) \int \lambda\, dF_2(\lambda). \tag{11.2.13} \]
Then the cost rates corresponding to different types of information coincide in the
limit:
\[ \frac{1}{n}\tilde R_n(nI_1) \to R_1(I_1) \tag{11.2.15} \]
as n → ∞. In other words, there is asymptotic equivalence
representation:
\[ P(\Gamma) = \int P(\Gamma \mid \xi)\,P(d\xi), \]
split this integral into two parts and keep just one subintegral:
\[ P(\Gamma) = \int_{\Xi} P(\Gamma \mid \xi)\,P(d\xi) + \int_{\bar\Xi} P(\Gamma \mid \xi)\,P(d\xi) \ge \int_{\Xi} P(\Gamma \mid \xi)\,P(d\xi). \tag{11.2.22b} \]
Within the complementary set $\bar\Xi$ the inequality opposite to (11.2.21) holds true:
$P(Z_\xi \mid \xi) \equiv P(\Gamma \mid \xi) \le \delta$ when $\xi \in \bar\Xi$, i.e. $P(\Gamma \mid \xi) > \delta$ when $\xi \in \Xi$. Substituting this esti-
mate in (11.2.22b) and taking into account (11.2.22a) we find that
\[ P(\Gamma) > \delta \int_{\Xi} P(d\xi) = \delta\,P(\Xi) \ge \delta^2. \]
\[ \ln\frac{P(\zeta \mid \xi)}{P(\zeta)} < n(I_1 + \varepsilon), \]
that is,
\[ P(\zeta) > e^{-n(I_1+\varepsilon)}\,P(\zeta \mid \xi). \]
Summing over ζ ∈ Zξ we obtain from here that
2. Now we apply formula (11.2.5). Let us take the following cumulative distribution
function:
\[ F_3(\lambda) = \begin{cases} 0 & \text{for } \lambda < nR_1(I_1) + n\varepsilon, \\ F_1(nR_1(I_1)+n\varepsilon) & \text{for } nR_1(I_1)+n\varepsilon < \lambda < nK_1, \\ 1 & \text{for } \lambda \ge nK_1. \end{cases} \tag{11.2.24} \]
Because of the boundedness condition (11.2.14) of the cost function, the prob-
ability of inequality c(ξ , ζ ) > nK1 being valid is equal to zero, and it follows
from (11.2.3) that Fξ (nK1 ) = 1. As a result of (11.2.6) we have F1 (nK1 ) = 1.
Hence, the functions $F_1(\lambda)$ and $F_3(\lambda)$ coincide within the interval $\lambda \ge nK_1$. Within
the interval $nR_1(I_1)+n\varepsilon < \lambda < nK_1$ we have $F_3(\lambda) \le F_1(\lambda)$, whence
\[ \int \lambda\, dF_1(\lambda) \le \int \lambda\, dF_3(\lambda) \]
in the same way as (11.2.12) has followed from (11.2.11). That is why for-
mula (11.2.5) yields
\[ \tilde R_n(nI_1) \le \int P(d\xi) \int \lambda\, dF_3(\lambda), \tag{11.2.25} \]
but
\[ \int \lambda\, dF_3(\lambda) = [nR_1(I_1)+n\varepsilon]\,F_1(nR_1(I_1)+n\varepsilon) + nK_1\bigl[1 - F_1(nR_1(I_1)+n\varepsilon)\bigr] \]
featuring in (11.2.23). For the values of ζ within the set Zξ inequalities (11.2.18)
and (11.2.19) hold true due to the definition given earlier. Therefore, the domain
of integration Zξ in (11.2.26b) constitutes a subset of the domain of integration
in (11.2.26a) and, consequently,
Let us split the integral in (11.2.26) into two parts: an integral over the set $\Xi$ and an
integral over the complementary set $\bar\Xi$. We apply (11.2.27) to the first integral and
replace the exponent by one in the second. This yields
\[ \int P(d\xi)\,\exp[-MF_\xi(nR_1(I_1)+n\varepsilon)] \le \int_{\Xi} P(d\xi)\,\exp[-Me^{-nI_1-n\varepsilon}(1-\delta)] + 1 - P(\Xi) \]
\[ \le \exp[-Me^{-nI_1-n\varepsilon}(1-\delta)] + \delta. \tag{11.2.28} \]
Since $M = [e^{nI_1}] \ge e^{nI_1} - 1$, we obtain
\[ \frac{1}{n}\tilde R_n(nI_1) \le R_1(I_1-2\varepsilon) + \varepsilon + 2K_1\delta + 2K_1\exp\{-e^{n\varepsilon}(1-\delta)(1-e^{-nI_1})\}. \tag{11.2.29} \]
\[ \limsup_{n\to\infty} \frac{1}{n}\tilde R_n(nI_1) \le R_1(I_1-2\varepsilon) + \varepsilon + 2K_1\delta + \lim_{n\to\infty} 2K_1\exp\{-e^{n\varepsilon}(1-\delta)(1-e^{-nI_1})\}. \]
However,
\[ \lim_{n\to\infty} \exp\{-e^{n\varepsilon}(1-\delta)(1-e^{-nI_1})\} = 0, \]
so that
\[ \limsup_{n\to\infty} \frac{1}{n}\tilde R_n(nI_1) - R_1(I_1) \le R_1(I_1-2\varepsilon) - R_1(I_1) + \varepsilon + 2K_1\delta. \tag{11.2.30} \]
Because the function R1 (I1 ) is continuous, the expression in the right-hand side
of (11.2.30) can be made arbitrarily small by considering sufficiently small ε and
δ . Therefore,
\[ \limsup_{n\to\infty} \frac{1}{n}\tilde R_n(nI_1) \le R_1(I_1). \]
The above formula, together with the inequality
\[ \liminf_{n\to\infty} \frac{1}{n}\tilde R_n(nI_1) \ge R_1(I_1), \tag{11.2.30a} \]
which follows from (11.1.3) and (11.1.8), proves the relation
\[ \limsup_{n\to\infty} \frac{1}{n}\tilde R_n(nI_1) = \liminf_{n\to\infty} \frac{1}{n}\tilde R_n(nI_1) = R_1(I_1), \]
that is, equation (11.2.15). The proof is complete.
3. Theorem 11.1 proven above allows for a natural generalization to those cases
when the system under consideration is not an $n$-th degree of some elementary system,
but instead some other, more general conditions are satisfied. The corresponding
generalization is analogous to the one made during the transition from
Theorem 7.1 to Theorem 7.2, where the requirement for a channel to be an $n$-th
degree of an elementary channel was replaced by a requirement of informational
stability. Therefore, let us impose such conditions on the Bayesian systems in question
that no significant changes to the aforementioned proof are required. Following
a common trick, we exclude from the proof the number $n$ and the rates $I_1$,
$R_1, \ldots$, and instead consider only the combinations $I = nI_1$, $R = nR_1$ and so on. Let us
require that the sequence of random variables $\xi$, $\zeta$ (dependent on $n$ or some other pa-
rameter) be informationally stable in terms of the definition given in Section 7.3.
Besides let us require the following convergence in probability:
It is easy to see that under such conditions the inequalities of the type (11.2.17) will
hold. Those inequalities now take the form
for $n > n(\varepsilon_1, \varepsilon_2, \delta)$, where $\varepsilon_1 V = \varepsilon_2 I = n\varepsilon$. Instead of the previous relation $\tilde I_1 =
I_1 + 2\varepsilon$ we now have
\[ \tilde I = I + 2\varepsilon_2 I. \]
Let us take the boundedness condition (11.2.14) in the following form:
\[ \tilde R(\tilde I) \le R(I - 2\varepsilon_2 I) + \varepsilon_1 V(I) + 2KV(I)\delta + 2KV(I)\exp\{-e^{\varepsilon_2 I}(1-\delta)(1-e^{-\tilde I})\} \]
or
for all values of $\varepsilon_1$, $\varepsilon_2$, $\delta$ that are independent of $n$, and for $n > n(\varepsilon_1, \varepsilon_2, \delta)$. In conse-
quence of condition A from the definition of informational stability (see Section 7.3),
for every $\varepsilon_2 > 0$ the expression $\exp\{-e^{\varepsilon_2 I}(1-\delta)(1-e^{-\tilde I})\}$, where $\tilde I = (1+2\varepsilon_2)I$,
converges to zero as $n \to \infty$ if $\delta < 1$. Therefore, a passage to the limit $n \to \infty$
from (11.2.33) results in
\[ \liminf_{n\to\infty} \frac{\tilde V(\tilde I)}{V(\tilde I)} \ge \liminf_{n\to\infty} \frac{V(\tilde I - 2\varepsilon_2 I)}{V(\tilde I)} - \varepsilon_1 - 2K\delta. \tag{11.2.33a} \]
Because $\varepsilon_1$, $\varepsilon_2$, $\delta$ are arbitrary and the function $\varphi(y)$ is continuous, we obtain that
\[ \liminf_{n\to\infty} \frac{\tilde V(I)}{V(I)} \ge 1. \tag{11.2.34a} \]
Together with the opposite inequality
\[ \limsup_{n\to\infty} \frac{\tilde V(I)}{V(I)} \le 1, \]
this proves the asymptotic equivalence
\[ \frac{\tilde V(I)}{V(I)} \to 1. \tag{11.2.35} \]
4. We noted in Section 9.6 that the value of information functions $V(I)$, $\tilde V(I)$ are
invariant under the following transformation of the cost function:
\[ c'(\xi,\zeta) = c(\xi,\zeta) + f(\xi) \tag{11.2.36} \]
(see Theorem 9.8). At the same time the difference of risks $\tilde R - R$ and also the
regions $G_k$ defined in Section 11.1 remain invariant if the code points $\zeta_k$ and
the distribution $P(d\zeta)$ from which they are sampled do not change. Meanwhile,
conditions (11.2.31), (11.2.32) are not invariant under the transformation defined
by (11.2.36). Thus, (11.2.31) turns into
\[ \bigl\{c'(\xi,\zeta) - \mathbf E[c'(\xi,\zeta)] - f(\xi) + \mathbf E[f(\xi)]\bigr\}/V(I) \to 0. \]
Taking this into account, one may take advantage of arbitrary function f (ξ ) in trans-
formation (11.2.36) and select it in such a way that conditions (11.2.31), (11.2.32)
are satisfied, if they were not satisfied initially. This broadens the set of cases, for
which the convergence (11.2.34) can be proven, and relaxes the conditions required
for asymptotic equivalence of the value functions in the aforementioned theory.
In fact, using a particular choice of function f (ξ ) in (11.2.36) one can eliminate
the need for condition (11.2.31) altogether. Indeed, as one can see from (9.4.21), the
following equation holds for the case of extremum distribution P(ζ | ξ ):
unless, of course, the ratio $I/\beta V$ grows to infinity. We say that a sequence
of Bayesian systems $[P(d\xi), c(\xi,\zeta)]$ is informationally stable if the sequence of
random variables $\xi$, $\zeta$ is informationally stable for the extremum distributions. Hence,
the next result follows from the above.
to the level
\[ R(I) = \min_u \mathbf E[c(x,u)] - V(I), \]
Fig. 11.1 Block diagram of a system subject to an information constraint. The channel capacity is
n times greater than that in Figure 9.5. MD—measuring device
been proven. In other words, the system (see Figure 11.1), in which block 1 classifies
the input signal ξ according to its membership in regions Gk and outputs the index
of the region containing ξ , is asymptotically optimal. It is readily seen that the
specified work of block 1 is absolutely analogous to the work of the decoder at the
output of a noisy channel mentioned in Chapter 7. The only difference is that instead
of the ‘distance’ defined by (7.1.8) we now consider ‘distance’ c(ξ , ζ ) and also ξ , η
are substituted by ζ , ξ . This analogy allows us to call block 1 a measuring decoder,
while block 2 acts as a block of optimal estimation.
The described information system and more general systems were studied in
the works of Stratonovich [47], Stratonovich and Grishanin [55], Grishanin and
Stratonovich [17].
The information constraints of the type considered before (Figures 9.6 and 11.1)
can be taken into account in various systems of optimal filtering, automatic con-
trol, dynamic programming and even game theory. Sometimes these constraints are
related to boundedness of information inflow, sometimes to memory limitations,
sometimes to a bounded complexity of an automaton or a controller. The consid-
11.3 Rate of convergence between the values of Shannon’s and Hartley’s information 369
The aforementioned Theorem 11.1 establishes the asymptotic equiva-
lence of the values of Shannon's and Hartley's information amounts. It is of interest
how quickly the difference between these values vanishes. We remind the reader
that Chapter 7, right after Theorems 7.1 and 7.2, which established the fact of
asymptotic vanishing of the probability of decoding error, contains theorems in which
the rate of vanishing of that probability was studied.
Undoubtedly, we can obtain a large number of results of various complexity and
importance concerned with the rate of vanishing of the difference $V(I) -
\tilde V(I)$ (as in the problem of asymptotically vanishing probability of error). Various methods
can be used in solving this problem. Here we give some comparatively simple
results concerning it. We shall calculate the first terms of the asymptotic
expansion of the difference $V(I) - \tilde V(I)$ in powers of the small parameter $n^{-1}$. In
so doing, we shall determine that the boundedness condition (11.2.14) of the cost
function, stipulated in the proof of Theorem 11.1, is inessential for the asymptotic
equivalence of the values of different kinds of information and is dictated only by
the adopted proof method.
Consider formula (11.2.13), which can be rewritten using the notation $S =
\int \lambda\, dF_2(\lambda)$ as follows:
\[ \tilde R_n(nI_1) \le \int S\,P(d\xi). \tag{11.3.1} \]
Hereinafter, we identify $\tilde I$ with $I$, $\tilde I_1$ with $I_1$ and $\tilde M$ with $M$, because it is now unnec-
essary to distinguish between them.
Let us perform an asymptotic calculation of the expression situated in the right-
hand side of the above inequality.
1. In view of (11.2.8) we have
S = S1 + S2 + S3 , (11.3.2)
where
\[ S_1 = -\int_{\lambda=-\infty}^{\bar c} \lambda\, d e^{-MF_\xi(\lambda)}, \qquad S_2 = -\int_{\lambda=\bar c}^{\infty} \lambda\, d e^{-MF_\xi(\lambda)}, \tag{11.3.3} \]
\[ S_3 = \int_{F_\xi(\lambda) > 1 - e^{-MF_\xi(\lambda)}} \lambda\, d\bigl[F_\xi(\lambda) - 1 + e^{-MF_\xi(\lambda)}\bigr]. \tag{11.3.4} \]
\[ F_\xi(\lambda) = \mathbf P[c(\xi,\zeta) \le \lambda] \]
for $\lambda < \int c(\xi,\zeta)\,P_{I_n}(d\zeta) = \bar c$, and also
\[ F_\xi(\lambda) = 1 - \bigl[2\pi\mu_\xi''(s)s^2\bigr]^{-1/2}\, e^{-s\mu_\xi'(s)+\mu_\xi(s)}\,\bigl[1 + O(\mu_\xi''^{-1})\bigr], \tag{11.3.6} \]
where
\[ \mu_\xi(t) = \ln \int e^{t c(\xi,\zeta)}\,P_{I_n}(d\zeta), \tag{11.3.8} \]
where integration is conducted with respect to $s$, and the lower bound of integration
$s_*$ is determined by the formula $\lim_{s\to s_*} \mu_\xi(s) = -\infty$.
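The potential $\mu_\xi(t)$ in (11.3.8) is a cumulant generating function of the random cost, so its first two derivatives at $t=0$ are the mean and the variance of $c(\xi,\zeta)$. A minimal finite-difference check on an invented three-point distribution (illustration only, not tied to any specific system in the book):

```python
import math

costs = [0.0, 1.0, 3.0]          # invented cost values c(xi, zeta)
probs = [0.5, 0.3, 0.2]          # invented P_{I_n}(zeta)

def mu(t):
    """Cumulant generating function mu(t) = ln E[exp(t c)], as in (11.3.8)."""
    return math.log(sum(p * math.exp(t * c) for p, c in zip(probs, costs)))

mean = sum(p * c for p, c in zip(probs, costs))                  # E[c] = 0.9
var = sum(p * c * c for p, c in zip(probs, costs)) - mean ** 2   # Var[c] = 1.29

h = 1e-3
d1 = (mu(h) - mu(-h)) / (2 * h)                 # ~ mu'(0) = mean
d2 = (mu(h) - 2 * mu(0.0) + mu(-h)) / (h * h)   # ~ mu''(0) = variance
print(d1, d2)
```

The same derivatives taken at a nonzero saddle point $s$ are what enter the tail approximation (11.3.6).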
\[ S_1 = \cdots - \frac{1}{2}\,\mu_1''(r) \int_{x=s_*-r}^{-r} x^2\, d\exp\bigl\{-e^{n\alpha x + n\beta x^2 + \cdots}\bigr\}\bigl[1 + O(n^{-1}) + \cdots\bigr], \qquad (x = s - r), \tag{11.3.11} \]
where the coefficients $\alpha$, $\beta$ are determined
with the required precision. Further, we choose a negative root of equation (11.3.10),
\[ r < 0 \tag{11.3.13} \]
(if one exists), so that $\alpha > 0$. Making the change of variable $e^{n\alpha x} = z$ we trans-
form (11.3.11) into the form
\[ S_1 = \mu_1(r)\bigl[1 - e^{-MF_\xi(\bar c)}\bigr] - \frac{\mu_1'(r)}{n\alpha} \int_{z=z_*}^{e^{-n\alpha r}} \ln(z)\, d\exp\bigl\{-e^{\ln z + (\beta/n\alpha^2)\ln^2 z + \cdots}\bigr\}\bigl[1 + O(n^{-1})\bigr] \]
\[ - \frac{1}{2}\,\frac{\mu_1''(r)}{n^2\alpha^2} \int_{z=z_*}^{e^{-n\alpha r}} \ln^2(z)\, d\exp\bigl\{-e^{\ln z + (\beta/n\alpha^2)\ln^2 z + \cdots}\bigr\}\bigl[1 + O(n^{-1})\bigr] - \cdots, \qquad (z = e^{n\alpha(s-r)}). \tag{11.3.14} \]
The above equation allows one to appreciate the magnitude of the various terms for
large $n$. Expanding the exponent
\[ \exp\bigl\{-e^{\ln z + (\beta/n\alpha^2)\ln^2 z + \cdots}\bigr\} \equiv \exp\bigl\{-e^{\ln z + \varepsilon}\bigr\}, \]
plugging (11.3.15) into (11.3.14), and retaining only the terms of orders $1$, $n^{-1}$, $n^{-2}$,
we obtain
\[ S_1 = \mu_1(r)\bigl[1 - e^{-MF_\xi(\bar c)}\bigr] - \frac{\mu_1'(r)}{n\alpha} \int_{z=z_*}^{e^{-n\alpha r}} \ln(z)\, d\Bigl[e^{-z} - \frac{\beta}{n\alpha^2}\,z\ln^2 z\; e^{-z}\Bigr] - \frac{1}{2}\,\frac{\mu_1''(r)}{n^2\alpha^2} \int_{z=z_*}^{e^{-n\alpha r}} \ln^2(z)\, d e^{-z} + n^{-3}\cdots \tag{11.3.16} \]
Here we have neglected the term $-\mu_1(r)e^{-MF_\xi(\bar c)}$, which vanishes with the growth
of $n$ and $M = [e^{nI_1}]$ much faster than $n^{-2}$. Also we have neglected the integrals
\[ \frac{\mu_1'(r)}{n\alpha} \int_{e^{-n\alpha r}}^{\infty} \ln(z)\,e^{-z}\,dz, \qquad \frac{\mu_1'(r)}{n\alpha} \int_{0}^{z_*} \ln(z)\,e^{-z}\,dz, \]
which decrease very quickly with the growth of $n$, because $e^{-n\alpha r}$ and $z_* = e^{n\alpha(s_*-r)}$
converge exponentially to $\infty$ and $0$, respectively (recall that $r < 0$, $s_* < r$). One can
easily assess the values of these integrals, for instance, by omitting the factor $\ln z$ in
the first integral and $e^{-z}$ in the second, which have relatively small influence.
However, we will not dwell on this any further.
The integral in (11.3.16) is equal to the Euler constant $C = 0.577\ldots$. Indeed,
expressing this integral as the limit
\[ \lim_{\alpha\to\infty} \int_0^\alpha \ln(z)\, d\bigl(e^{-z} - 1\bigr) \]
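This claim can be verified directly: $\int_0^\infty \ln z\, d(e^{-z}-1) = -\int_0^\infty e^{-z}\ln z\,dz = C$. A quick numerical check (a sketch independent of the book's derivation):

```python
import math

# Midpoint rule for -int_0^inf e^{-z} ln z dz; the ln z singularity at 0 is
# integrable, and the integrand is negligible beyond z = 50.
h = 1e-4
val = -sum(math.exp(-(k + 0.5) * h) * math.log((k + 0.5) * h) * h
           for k in range(int(50 / h)))
print(val)   # ~0.5772, the Euler constant C
```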
and the condition for the existence of the negative root (11.3.13) is equivalent (for
large $n$) to the condition of the existence of the root
\[ q_\xi < 0. \tag{11.3.20} \]
By differentiation, one can verify that the function $s\mu_\xi'(s) - \mu_\xi(s)$ attains its
minimum value, equal to zero, at the point $s = 0$. Therefore, equation (11.3.18), at
least for sufficiently small $I_1$, has two roots, one positive and one negative. Thus, it is possible
to choose the negative root $q_\xi$.
According to (11.3.19), formula (11.3.17) can be simplified to the form
\[ S_1 = \mu_1'(q_\xi) - \frac{1}{2nq_\xi}\ln\bigl[2\pi n\mu_1''(q_\xi)q_\xi^2\bigr] + \frac{C}{nq_\xi} + n^{-2}\ln^2 n \cdots \tag{11.3.21} \]
2. Let us now turn to the estimation of integral (11.3.4). Denote by $\lambda_\Gamma$ the bound-
ary point determined by the equation
which confirms the small value of $\varepsilon$. Within the domain of integration $\lambda > \lambda_\Gamma$ in
equation (11.3.4) the derivative
\[ \frac{d}{d\lambda}\bigl[F_\xi(\lambda) - 1 + e^{-MF_\xi(\lambda)}\bigr] = \bigl[1 - Me^{-MF_\xi(\lambda)}\bigr]\frac{dF_\xi(\lambda)}{d\lambda} \]
is positive for $M \gg 1$, which can be checked by taking into account that
where
\[ \rho = s_\Gamma \mu_1''(s_\Gamma) > 0, \qquad n\mu_1'(s_\Gamma) = \lambda_\Gamma, \tag{11.3.25} \]
and the dots correspond to other terms, the form of which is not important to us.
Substituting (11.3.24) into (11.3.23) we obtain
\[ S_3 \le \varepsilon \int_{x=0}^{\infty} |\lambda_\Gamma + x|\, d\bigl(1 - e^{-n\rho x + \cdots}\bigr) \le \varepsilon \int_{x=0}^{\infty} \bigl[|\lambda_\Gamma| + x\bigr]\, d\bigl(1 - e^{-n\rho x + \cdots}\bigr) = \varepsilon\Bigl[|\lambda_\Gamma| + \frac{1}{n\rho}\Bigr] + \cdots. \]
The right-hand side of the above equation can be written using (11.3.6) as follows:
Evidently,
\[ S_2 \le b\bigl[e^{-MF_\xi(\bar c)} - e^{-MF_\xi(b)}\bigr] \le b\,e^{-MF_\xi(\bar c)}. \tag{11.3.28} \]
To estimate the integral $S_2$ we use (11.3.6) and a formula of the type (11.3.24):
\[ S_2 = -\int_{b}^{\infty} \lambda\, d\exp\bigl\{-M\bigl[1 - (1 - F_\xi(b))\,e^{-n\rho_b(\lambda-b)+\cdots}\bigr]\bigr\}. \]
Let us compute the second term in (11.3.29). By analytic continuation, from the for-
mula
\[ \int_0^1 e^{-\rho z}\ln z\,dz = -\frac{1}{\rho}(C + \ln\rho) + \frac{1}{\rho}\,\mathrm{Ei}(-\rho) \]
(see formulae (3.711.2) and (3.711.3) in Ryzhik and Gradstein [36]; the correspond-
ing English translation is [37]) we obtain
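This tabulated integral can be sanity-checked numerically, using the standard series $\mathrm{Ei}(-\rho) = C + \ln\rho + \sum_{k\ge1}(-\rho)^k/(k\cdot k!)$. The check below with $\rho = 2$ is an illustration only, not part of the book's derivation:

```python
import math

rho = 2.0
C = 0.5772156649015329

# Left side: numerical integral of e^{-rho z} ln z over (0, 1), midpoint rule
# (the ln z singularity at 0 is integrable).
N = 200000
lhs = sum(math.exp(-rho * (k + 0.5) / N) * math.log((k + 0.5) / N) / N
          for k in range(N))

# Right side: -(1/rho)(C + ln rho) + (1/rho) Ei(-rho), with Ei from its series.
ei = C + math.log(rho) + sum((-rho) ** k / (k * math.factorial(k))
                             for k in range(1, 60))
rhs = -(C + math.log(rho)) / rho + ei / rho
print(lhs, rhs)   # both ~ -0.6596
```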
\[ \int_0^1 e^{Nz}\ln z\,dz \cong -\frac{1}{N}(C + \ln N) - \frac{1}{N}E_1(N), \]
whereas
\[ \mathrm{Ei}(N) = \frac{e^N}{N}\Bigl(1 + \frac{1}{N} + \frac{2}{N^2} + \cdots\Bigr) \]
(see p. 48 in Jahnke and Emde [25]; the corresponding book in English is [24]).
Consequently, the main dependency of the second term $T_2$ in (11.3.29) on $M$ is
determined by the exponential factor $e^{-M+N} = e^{-MF_\xi(b)}$.
Thus, all three terms (11.3.28), (11.3.30) and $T_2$ constituting $S_2$ decrease quite rapidly
with the growth of $n$. Just as $S_3$, they do not influence an asymptotic expan-
sion of the type (11.3.21) in powers of the small parameter $n^{-1}$ (in combi-
nation with logarithms $\ln n$). Therefore, we need to account only for one term in
formula (11.3.2), so that due to (11.3.21) we have
\[ S = \mu_\xi'(q_\xi) - \frac{1}{2q_\xi}\ln\bigl[2\pi\mu_\xi''(q_\xi)q_\xi^2\bigr] + \frac{C}{q_\xi} + O\bigl(n^{-1}\ln^2 n\bigr). \tag{11.3.31} \]
4. Let us average over $\xi$ in compliance with the latter formula. This averaging
becomes easier because of the fact that function (11.3.8) is a sum of
identically distributed random summands $\ln\int e^{tc_1(x_i,u)}\,P_{I_1}(du)$, while $\mu_1 = \mu_\xi/n$ is
their arithmetic mean:
\[ \mu_1(t) = \frac{1}{n}\sum_{i=1}^{n} \ln\int e^{t c_1(x_i,u)}\,P_{I_1}(du). \tag{11.3.33} \]
By virtue of the Law of Large Numbers this mean converges to the expectation
\[ \nu_1(-t,\beta) \equiv \mathbf E[\mu_1(t)] = \int P(dx)\,\ln\int e^{t c_1(x,u)}\,P_{I_1}(du). \tag{11.3.34} \]
For each fixed $t$, according to the above law, we have the following convergence in
probability:
\[ \mu_1(t) \to \nu_1(-t), \tag{11.3.36} \]
and also for the derivatives
\[ \mu_1^{(k)}(t) \to (-1)^k \nu_1^{(k)}(-t), \tag{11.3.37} \]
if they exist.
Because of the convergence in (11.3.36), (11.3.37), the root $q_\xi$ of equation
(11.3.18) converges in probability to the root $q = -\beta$ of the limiting equation, i.e.
\[ q_\xi \to -\beta \quad \text{as } n \to \infty. \tag{11.3.39} \]
β = −dI/dR,
and the parameter β is positive for the normal branch R+ (9.3.10) (see Section 9.3).
Hence we obtain the inequality qξ < 0 that complies with condition (11.3.20).
Because of the convergence in (11.3.37), (11.3.39) and equalities (11.3.35),
(11.1.20), the first term $\mathbf E[\mu_1'(q_\xi)]$ in (11.3.32) tends to $R_1(I_1)$. That proves (if we
also take into account the vanishing of the other terms) the convergence
\[ \frac{1}{n}\tilde R(nI_1) \to R_1(I_1) \quad (n \to \infty), \]
which was discussed in Theorem 11.1.
In order to study the rate of convergence, let us consider the deviation of the
random variables occurring in the averaged expression (11.3.32) from their non-
random limits. We introduce the random deviation $\delta\mu_1 = \mu_1(-\beta) - \nu_1(\beta)$. Just as
$\delta\mu_1$, the random deviation $\delta\mu_1' = \mu_1'(-\beta) + \nu_1'(\beta)$, due to (11.3.33), has a
null expected value and a variance of the order $n^{-1}$:
and neglecting the terms of order higher than quadratic we obtain, consequently, that
\[ 0 \le \frac{1}{n}\tilde R_n(nI_1) - R_1(I_1) \le \frac{1}{2n\beta}\ln\bigl[2\pi n\,\nu_1''(\beta)\beta^2\bigr] - \frac{C}{n\beta} + \frac{\mathbf E\bigl[(\beta\,\delta\mu_1 + \delta\mu_1')^2\bigr]}{2\beta^3\nu_1''(\beta)} + o(n^{-1}). \tag{11.3.40} \]
11.4 Alternative forms of the main result. Generalizations and special cases 379
Applying the same methods and conducting calculations with a greater level of
accuracy, we can also find the terms of higher order in this asymptotic expansion.
This way one can corroborate those points of the above-stated derivation, which
may appear insufficiently grounded.
In the exposition above we assumed that the domain $(s_1, s_2)$ of definition and
differentiability of the potential $\mu_\xi(t)$, which was mentioned in Theorem 4.8, is
sufficiently large. However, only the vicinity of the point $s = -\beta$ is actually essential. The
other parts of the straight line $s$ influence only the exponential terms of the
types (11.3.26), (11.3.28), (11.3.30), and they do not influence the asymptotic ex-
pansion (11.3.40). Of course, anomalous behaviour of the function $\mu_1(s)$ on
these other parts of $s$ would complicate the proof.
For formula (11.3.40) to be valid, condition (11.2.14) of boundedness of the
cost function is not really necessary. However, the derivation of this formula can
be somewhat simplified if we impose this condition. Then elaborate estimations of
integrals S2 , S3 and the integral over domain Ξ will not be required. Instead, we
can confine ourselves to proving that the probability of the appropriate regions of
integration vanishes rapidly (exponentially). In this case, the value of the constant
K will be inessential, since it will not appear in the final result.
As one can see from the above derivation, the terms of the asymptotic expan-
sion (11.3.40) are exact for the given random encoding. Higher-order terms can be
found, but the terms already written cannot be improved unless the given encoding proce-
dure is abandoned. The problem of interest is how close the estimate (11.3.40) is
to the actual value of the difference $(1/n)\tilde R(nI_1) - R_1(I_1)$, and to what extent it can
be refined if more elaborate encoding techniques are used.
1. In the previous section we introduced the function μ1 (t) = (1/n)μξ (t) instead of
function (11.3.8). It was done essentially for convenience and illustration reasons
in order to emphasize the relative magnitude of terms. Almost nothing will change
if the rate quantities μ1 , ν1 , R1 , I1 and others are used only as factors of n, that
is if we do not introduce the rate quantities at all. Thus, instead of the main result
formula (11.3.40), after multiplication by n we obtain
\[ 0 \le \tilde R(I) - R(I) = V(I) - \tilde V(I) \le \frac{1}{2\beta}\ln\Bigl[\frac{2\pi}{\gamma^2}\,\nu''(\beta)\beta^2\Bigr] + \frac{\mathbf E\bigl[(\beta\,\delta\mu + \delta\mu')^2\bigr]}{2\beta^3\nu''(\beta)} + o(1) \tag{11.4.1} \]
(the index $n$ is omitted, the term $C/\beta$ is moved under the logarithm; $\gamma = e^C = 1.781\ldots$).
With these substitutions, just as in paragraph 3 in Section 11.2, it becomes unnec-
essary for the Bayesian system to be the n-th degree of some elementary Bayesian
system.
Differentiating function (11.4.2) twice at the point $t = -\beta$ and taking into ac-
count (11.1.16) we obtain that
\[ \nu''(\beta) = \int P(d\xi)\biggl[\int c^2(\xi,\zeta)\,e^{-\gamma(\xi)-\beta c(\xi,\zeta)}\,P_I(d\zeta) - \Bigl(\int c(\xi,\zeta)\,e^{-\gamma(\xi)-\beta c(\xi,\zeta)}\,P_I(d\zeta)\Bigr)^2\biggr]. \]
Because of (9.4.21), the above integrals over $\zeta$ are conditional
expectations taken with respect to the conditional probabilities $P(d\zeta \mid \xi)$. Therefore,
\[ \nu''(\beta) = \mathbf E\bigl[\mathbf E[c^2(\xi,\zeta) \mid \xi] - (\mathbf E[c(\xi,\zeta) \mid \xi])^2\bigr] \equiv \mathbf E\bigl[\operatorname{Var}[c(\xi,\zeta) \mid \xi]\bigr]. \tag{11.4.4} \]
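Relation (11.4.4) can be confirmed numerically on a small discrete system (an invented example, purely illustrative): the second derivative of the averaged potential $\nu(\beta) = \int P(d\xi)\ln\int e^{-\beta c}\,P_I(d\zeta)$, taken by finite differences, matches the averaged conditional variance of the cost under the tilted conditional distribution $P(d\zeta\mid\xi) \propto e^{-\beta c}\,P_I(d\zeta)$.

```python
import math

p = [0.5, 0.3, 0.2]                   # invented P(xi)
q = [0.25, 0.25, 0.5]                 # invented P_I(zeta)
c = [[0.0, 1.0, 2.0],
     [1.5, 0.2, 0.7],
     [2.0, 1.0, 0.1]]                 # invented cost matrix c(xi, zeta)
beta = 0.7

def nu(b):
    """Averaged potential nu(b) = sum_xi p(xi) ln sum_zeta q(zeta) e^{-b c}."""
    return sum(px * math.log(sum(qz * math.exp(-b * cz)
                                 for qz, cz in zip(q, row)))
               for px, row in zip(p, c))

h = 1e-3
d2 = (nu(beta + h) - 2 * nu(beta) + nu(beta - h)) / (h * h)

# E[Var[c | xi]] under the tilted conditional P(zeta|xi) ~ q(zeta) e^{-beta c}
total = 0.0
for px, row in zip(p, c):
    w = [qz * math.exp(-beta * cz) for qz, cz in zip(q, row)]
    s = sum(w)
    pi = [wi / s for wi in w]
    m = sum(pi_z * cz for pi_z, cz in zip(pi, row))
    total += px * (sum(pi_z * cz * cz for pi_z, cz in zip(pi, row)) - m * m)
print(d2, total)   # the two agree: nu''(beta) = E[Var[c | xi]]
```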
Let us now consider the mean square term $\mathbf E[(\beta\,\delta\mu + \delta\mu')^2]$ in (11.4.1),
which, due to (11.4.3), coincides with the variance
\[ \mathbf E\bigl[(\beta\,\delta\mu + \delta\mu')^2\bigr] = \operatorname{Var}\bigl[\beta\mu_\xi(-\beta) + \mu_\xi'(-\beta)\bigr]. \tag{11.4.6} \]
In view of (9.4.21), this integral is nothing but the integral averaging with respect
to conditional distribution P(d ζ | ξ ), i.e.
Thus, we see that expression (11.4.6) is exactly the variance of the partially aver-
aged random information:
\[ \mathbf E\bigl[(\beta\,\delta\mu + \delta\mu')^2\bigr] = \operatorname{Var}\bigl[\beta\mu_\xi(-\beta) + \mu_\xi'(-\beta)\bigr] = \operatorname{Var}\bigl[I_\zeta(\,\cdot \mid \xi)\bigr]. \tag{11.4.9} \]
Following from (11.4.5), (11.4.9), the main formula (11.4.1) takes the form
\[ 0 \le 2\beta\bigl[V(I) - \tilde V(I)\bigr] \le \ln\Bigl[\frac{2\pi}{\gamma^2}\,\mathbf E\bigl[\operatorname{Var}[I(\xi,\zeta) \mid \xi]\bigr]\Bigr] + \operatorname{Var}\bigl[I_\zeta(\,\cdot \mid \xi)\bigr]\Big/\mathbf E\bigl[\operatorname{Var}[I(\xi,\zeta) \mid \xi]\bigr] + o(1). \tag{11.4.10} \]
By averaging this equality over ξ and taking into account (9.4.10), (11.4.2) we ob-
tain
ν (−t) = Γ (−t) ∀t. (11.4.12)
That is why we can replace ν (β ) by Γ (β ) in formula (11.4.1). Further, it is useful
to recall that due to (9.4.31) the value of β is related to the derivatives of functions
R(I), V (I):
\[ \frac{1}{\beta} = -\frac{dR}{dI} = V'(I), \tag{11.4.13} \]
so that
\[ \beta = 1/V'(I). \tag{11.4.14} \]
The second derivative $\Gamma''(\beta)$ can also be expressed in terms of the function $I(R)$ (or
$V(I)$), because these functions are related by the Legendre transform [see (9.4.30)
and (9.4.29)].
and (9.4.29)]. Differentiating (9.4.30) we have
\[ \beta\,\Gamma''(\beta) = \frac{dI}{d\beta}. \]
This implies
\[ \frac{1}{\beta^2\Gamma''(\beta)} = \frac{1}{\beta}\,\frac{d\beta}{dI}, \]
or, equivalently, if we differentiate (11.4.14) and take into account (11.4.13),
On the strength of (11.4.14), (11.4.15), (11.4.9) the main formula (11.4.1) can be
written as follows:
\[ 0 \le V(I) - \tilde V(I) \le \frac{1}{2}V'(I)\,\ln\Bigl[-\frac{2\pi}{\gamma^2}\,\frac{V'(I)}{V''(I)}\Bigr] - \frac{1}{2}V''(I)\,\operatorname{Var}\bigl[I_\zeta(\,\cdot \mid \xi)\bigr] + o(1). \tag{11.4.16} \]
Besides, the function (11.4.11) sometimes turns out to be independent of $\xi$. We
encountered such a phenomenon in Section 10.2, where we derived formula (10.2.25)
for one particular case. According to the latter we have $\mu_\xi(t) = \Gamma(-t)$, so that
$\mu_\xi(t) = \nu(-t) = \Gamma(-t)$, $\delta\mu(t) = 0$, and averaging over $\xi$ becomes redundant.
In this case, the variance (11.4.6), (11.4.9) vanishes and formula (11.4.16) can be
somewhat simplified, taking the form
\[ V(I) \ge \tilde V(I) \ge V(I) - \frac{1}{2}V'(I)\,\ln\Bigl[-\frac{2\pi}{\gamma^2}\,\frac{V'(I)}{V''(I)}\Bigr] + o(1). \tag{11.4.17} \]
At the same time the analysis carried out in item 4 of the previous section becomes
redundant.
3. In some important cases the sequence of values of $I$ and the sequence of
Bayesian systems $[P(d\xi), c(\xi,\zeta)]$ (dependent on $n$ or another parameter) are such
that for an extremum distribution:

A. \[ I = I_{\xi\zeta} \to \infty. \tag{11.4.18} \]

B. There exist finite non-zero limits
\[ \lim \frac{\mathbf E\bigl[\operatorname{Var}[I(\xi,\zeta) \mid \xi]\bigr]}{I}, \qquad \lim \frac{\operatorname{Var}\bigl[I_\zeta(\,\cdot \mid \xi)\bigr]}{I}, \qquad \lim \frac{V(yI)}{I}, \qquad \lim \frac{dV(I)}{dI} \tag{11.4.19} \]
($y$ is arbitrary and independent of $I$).
It is easy to see that the sum
\[ \mathbf E\bigl[\operatorname{Var}[I(\xi,\zeta) \mid \xi]\bigr] + \operatorname{Var}\bigl[I_\zeta(\,\cdot \mid \xi)\bigr] = \mathbf E\bigl[\mathbf E[I^2(\xi,\zeta) \mid \xi] - (\mathbf E[I(\xi,\zeta) \mid \xi])^2\bigr] + \mathbf E\bigl[(\mathbf E[I(\xi,\zeta) \mid \xi])^2\bigr] - \bigl(\mathbf E\bigl[\mathbf E[I(\xi,\zeta) \mid \xi]\bigr]\bigr)^2 \]
coincides with the total variance $\operatorname{Var}[I(\xi,\zeta)]$. Consequently, the existence of the
first two limits in (11.4.19) implies the existence of the finite limit
\[ \lim \frac{1}{I}\operatorname{Var}[I(\xi,\zeta)]. \]
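The identity used here is the usual decomposition of a total variance into within-group and between-group parts. A direct check for the random information $I(\xi,\zeta) = \ln[p(\xi,\zeta)/(p(\xi)p(\zeta))]$ on an invented $3\times3$ joint distribution (illustration only):

```python
import math

P = [[0.10, 0.05, 0.15],
     [0.05, 0.20, 0.05],
     [0.15, 0.05, 0.20]]              # invented joint p(xi, zeta), sums to 1
px = [sum(row) for row in P]
pz = [sum(P[i][j] for i in range(3)) for j in range(3)]

def info(i, j):                        # random information I(xi, zeta)
    return math.log(P[i][j] / (px[i] * pz[j]))

EI = sum(P[i][j] * info(i, j) for i in range(3) for j in range(3))
VarI = sum(P[i][j] * (info(i, j) - EI) ** 2
           for i in range(3) for j in range(3))

e_var = 0.0                            # E[ Var[I | xi] ]
var_e = 0.0                            # Var[ E[I | xi] ]
for i in range(3):
    cond = [P[i][j] / px[i] for j in range(3)]
    m = sum(cond[j] * info(i, j) for j in range(3))
    v = sum(cond[j] * (info(i, j) - m) ** 2 for j in range(3))
    e_var += px[i] * v
    var_e += px[i] * (m - EI) ** 2
print(e_var + var_e, VarI)   # the sum equals the total variance
```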
Therefore, on the one hand, A and B imply the conditions of informational stability of $\xi$, $\zeta$
mentioned in Section 7.3 (this can be derived in a standard way by using Chebyshev's
inequality). Also, B implies (11.2.34). Thus, if in addition the boundedness condi-
tion (11.2.39) and the continuity of function (11.2.34) with respect to $y$ are satisfied,
then, according to Theorem 11.2, convergence (11.2.35) will take place.
other hand, it follows from (11.4.18) together with the finiteness of limits (11.4.19)
and equation (11.4.14) that
\[ \frac{1}{2\beta V(I)}\,\frac{\operatorname{Var}\bigl[I_\zeta(\,\cdot \mid \xi)\bigr]}{\mathbf E\bigl[\operatorname{Var}[I(\xi,\zeta) \mid \xi]\bigr]} = \frac{1}{2I}\,\frac{dV}{dI}\,\frac{I}{V(I)}\,\frac{\operatorname{Var}\bigl[I_\zeta(\,\cdot \mid \xi)\bigr]}{I}\,\frac{I}{\mathbf E\bigl[\operatorname{Var}[I(\xi,\zeta) \mid \xi]\bigr]} \to 0. \tag{11.4.20} \]
Furthermore,
\[ \frac{1}{I}\ln\Bigl[\frac{2\pi}{\gamma^2}\,\mathbf E\bigl[\operatorname{Var}[I(\xi,\zeta) \mid \xi]\bigr]\Bigr] = \frac{\ln I}{I} + \frac{1}{I}\ln\Bigl[\frac{2\pi}{\gamma^2}\,\frac{\mathbf E\bigl[\operatorname{Var}[I(\xi,\zeta) \mid \xi]\bigr]}{I}\Bigr] \to 0. \]
Further, in order to compute the variance in (11.4.9) we should use formula (10.3.43).
The variance of expressions quadratic in Gaussian variables was calculated earlier
in Section 5.4. It is easy to see that by applying the same computational method
(since $g^T k_x g = \tilde k_x$).
After the substitution of (11.4.22), (11.4.23) into (11.4.1), (11.4.6) we will have
\[ 0 \le 2\beta\bigl[V(I) - \tilde V(I)\bigr] \le \ln\Bigl[\frac{\pi}{\gamma^2}\operatorname{tr}\Bigl(1_u - \frac{1}{\beta}h k_x^{-1}\Bigr)^2\Bigr] + \operatorname{tr}\Bigl(1_u - \frac{1}{\beta}h \tilde k_x^{-1}\Bigr)^2 \Big/ \operatorname{tr}\Bigl(1_u - \frac{1}{\beta}h k_x^{-1}\Bigr)^2 + o(1). \tag{11.4.24} \]
The requirement of existence of the limit lim (dV /dI) appearing in condition B
(11.4.19) is equivalent, in view of (11.4.14), to the requirement of the existence of
the limit
lim β = β0 . (11.4.26)
The first two limits in (11.4.19) can be rewritten as
Finally, the condition of the existence of the limit V(yI)/I, in view of (10.3.32),
takes the form

lim tr(kx h⁻¹ − βy⁻¹ 1u) / tr ln(β kx h⁻¹),    (11.4.28)

where βy is determined from the condition

tr ln(βy kx h⁻¹) = 2yI = y tr ln(β kx h⁻¹).    (11.4.29)
(ξ , ζ are discrete random variables) turn into the standard results, can be naturally
named the generalized Shannon’s theorem.
The direction of research so defined is closely related to the third variational problem
and the material covered in Chapters 9–11, because it involves the introduction
of the cost function c(ξ, ζ) and the assumption that the distribution P(dξ) is given a
priori. Results of different strength and detail can be obtained in this direction. We
present here a single theorem, which follows almost immediately from the standard
Shannon theorem and the results of Section 11.2, and consequently does not require a
new proof.
Let ξ , ζ be random variables defined by the joint distribution P(d ξ d ζ ). Let also
[P(d η̃ | η )] be a channel such that its output variable (or variables) η̃ is connected
with its input variable η by the conditional probabilities P(d η̃ | η ). The goal is to
conduct encoding ξ → η and decoding η̃ → ζ̃ by selecting probabilities P(d η | ξ )
and P(d ζ̃ | η̃ ), respectively, in such a way that the distribution P(d ξ , d ζ̃ ) induced by
the distributions P(d ξ ), P(d η | ξ ), P(d η̃ | η ), P(d ζ̃ | η̃ ) coincides with the initial
distribution P(d ξ d ζ ). One can see that the variables ξ , η , η̃ , ζ̃ connected by the
scheme
ξ → η → η̃ → ζ̃,    (11.5.2)
form a Markov chain with transition probabilities mentioned above. We apply for-
mula (6.3.7) and obtain that
At the same time using the same formula we can derive that
According to the Markov property, the future does not depend on the past, if the
present is fixed. Consequently, mutual information between the past (ξ ) and the
future (η̃ ) with the fixed present (η ) equals zero: Iξ η̃ |η = 0. By the same reasoning
Equating (11.5.4) with (11.5.3) and taking into account (11.5.5) we get
Iξ ζ̃ ≤ Iη η̃ , or Iξ ζ̃ ≤ C.    (11.5.7)
It can be seen from here that the distribution P(dξ dζ̃) can copy the initial distribution
P(dξ dζ) only under the following necessary condition:

Iξ ζ ≤ C.    (11.5.8)
11.5 Generalized Shannon’s theorem 387
Otherwise, no methods of encoding and decoding exist, i.e. no P(dη | ξ), P(dζ̃ | η̃),
such that P(dξ dζ̃) coincides with P(dξ dζ).
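The chain (11.5.2) and the resulting inequality (11.5.7) can be checked numerically. The sketch below (all distributions and channel matrices are arbitrary illustrative choices, not taken from the text) builds a discrete Markov chain ξ → η → η̃ → ζ̃ from random stochastic matrices and verifies the data-processing inequality I_{ξζ̃} ≤ I_{ηη̃}:

```python
import numpy as np

rng = np.random.default_rng(0)

def mutual_info(joint):
    """Mutual information (in nats) of a discrete joint distribution."""
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float((joint[mask] * np.log(joint[mask] / (px @ py)[mask])).sum())

def random_channel(n_in, n_out):
    """Random stochastic matrix: rows are conditional distributions."""
    m = rng.random((n_in, n_out))
    return m / m.sum(axis=1, keepdims=True)

p_xi = rng.random(4); p_xi /= p_xi.sum()   # P(ξ)
enc = random_channel(4, 3)                 # P(η | ξ)   (encoding)
chan = random_channel(3, 3)                # P(η̃ | η)   (the channel)
dec = random_channel(3, 4)                 # P(ζ̃ | η̃)   (decoding)

# Joint distributions along the Markov chain ξ → η → η̃ → ζ̃
p_xi_eta = p_xi[:, None] * enc             # P(ξ, η)
p_eta = p_xi_eta.sum(axis=0)
p_eta_etat = p_eta[:, None] * chan         # P(η, η̃)
p_xi_etat = p_xi_eta @ chan                # P(ξ, η̃)
p_xi_zeta = p_xi_etat @ dec                # P(ξ, ζ̃)

I_xz = mutual_info(p_xi_zeta)
I_hh = mutual_info(p_eta_etat)
assert 0 <= I_xz <= I_hh + 1e-12           # (11.5.7): I_{ξζ̃} ≤ I_{ηη̃}
```

Whatever stochastic matrices are chosen, the assertion holds, since ξ and ζ̃ communicate only through the pair η, η̃.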
As we can see, this conclusion follows from the most general properties of the
considered concepts. The next fact is less trivial: condition (11.5.8) or, more pre-
cisely, the condition
lim sup Iξ ζ /C < 1, (11.5.9)
is sufficient in an asymptotic sense, if not for the coincidence of P(dξ dζ̃) with P(dξ dζ),
then at least for a relatively good (in some sense) quality of this distribution. In order
to formulate the specified fact, we need to introduce a quality criterion—the cost
function c(ξ , ζ ). The condition of equivalence between P(d ξ d ζ) and P(d ξ d ζ ) can
be replaced by a weaker condition
|E[c(ξ, ζ)]|⁻¹ ϑ+( E[c(ξ, ζ̃)] − E[c(ξ, ζ)] ) → 0    (11.5.10)

(2ϑ+(z) = z + |z|),
which points to the fact that the quality of distribution P(d ξ d ζ̃ ) is asymptotically
not worse than the quality of P(d ξ d ζ ). The primary statement is formulated for a
sequence of schemes (11.5.2) dependent on n.
Theorem 11.3. Suppose that
1. the sequence of pairs of random variables ξ , ζ is informationally stable (see
Section 7.3);
2. the convergence

(c(ξ, ζ) − E[c(ξ, ζ)]) / E[c(ξ, ζ)] → 0    (11.5.11)
in probability P(d ξ d ζ ) takes place [compare with (11.2.31)];
3. the sequence of cost functions satisfies the boundedness condition
in exactly the same way that we used to prove (11.2.33a) on the basis of conditions
of informational stability and conditions (11.2.31), (11.2.32) in Section 11.2 [see
also the derivation of (11.2.30)]. In (11.5.13) ε, δ are arbitrarily small positive
values independent of n, and R(Iξ ζ + 2ε₂ Iξ ζ) is the average cost:

R(Iξ ζ + 2ε₂ Iξ ζ) = E[ inf_ζ E[c(ξ, ζ) | Ek] ],    (11.5.14)
Further, we convey the message about the index of the region Ek containing ξ through
the channel P(dη̃ | η).
In consequence of (11.5.9) we can always choose ε₂ such that the inequality
[Iξζ + ε₂(2Iξζ + C)]/C < 1 holds true for all n > N, for some fixed N. Since M =
[exp(Iξζ + 2ε₂Iξζ)], the last statement means that the inequality ln M/C < 1 − ε₂
is valid, i.e. (8.1.5) is satisfied. The latter together with requirement (4) of Theo-
rem 11.3 assures that Theorem 8.1 (generalized similarly to Theorem 7.2) can be
applied here. According to Theorem 8.1 the probability of error for a message re-
ception through a channel can be made infinitely small with a growth of n:
P(dξ dζ̃) = ∑_{k,l} P(dξ) ϑ_{Ek}(ξ) P(l | k) δ(ζ̃ − ζl) dζ̃,    (11.5.17)
= ∑_k P(Ek) E[c(ξ, ζk) | Ek] +
+ ∑_{k≠l} P(Ek) P(l | k) {E[c(ξ, ζl) | Ek] − E[c(ξ, ζk) | Ek]}.
Averaging over the ensemble of random codes and taking into account (11.5.16) we
obtain that

E[ ∑_k P(Ek) Pow(· | k) ] = E[Pow(· | k)] = Pow < ε₄.
From here we can conclude that there exists some random code which is not worse
than the first in the sense of the inequality
We assumed above that the initial distribution P(dξ dζ) is given. The provided
consideration can easily be extended to the cases in which there is a set of such
distributions, or the Bayesian system [P(dξ), c(ξ, ζ)] with a fixed level of cost
E[c(ξ, ζ)] ≤ a.
condition (11.5.12) is replaced by (11.2.39), while conditions (1) and (2) of The-
orem 11.3 need to be replaced by the requirements of information stability of the
Bayesian system (Section 11.2, paragraph 4). After these substitutions one can apply
Theorem 11.2.
In conclusion, let us confirm that the regular formulation of Shannon's theorem
follows from the stated results. To this end, let ξ and ζ be identical discrete
random variables taking M values. Let us choose the distribution P(ξ, ζ) as
follows:

P(ξ, ζ) = (1/M) δξζ .
In this case we evidently have

Iξ ζ = Hξ = Hζ = ln M.    (11.5.20)

However,

∑_{ξ ≠ ζ̃} P(ξ, ζ̃) = (1/M) ∑_ξ P(ζ̃ ≠ ξ | ξ)
is nothing else but the mean probability of error (7.1.11). Hence, convergence
(11.5.21) coincides with (7.3.7). Thus we have obtained that in this particular case,
Theorem 11.3 actually coincides with Theorem 7.2.
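This special case can be illustrated numerically. In the sketch below the joint distribution with a symmetric per-symbol error rate ε is an illustrative choice; at ε = 0 it collapses to P(ξ, ζ) = δξζ/M and the mutual information equals ln M, as in (11.5.20), while the off-diagonal mass reproduces the mean error probability:

```python
import numpy as np

M, eps = 8, 0.05                 # alphabet size and per-symbol error rate (illustrative)

# Joint distribution: P(ξ, ζ̃) = (1/M)[(1-ε) δ_{ξζ̃} + ε/(M-1)(1-δ_{ξζ̃})]
P = np.full((M, M), eps / (M - 1) / M)
np.fill_diagonal(P, (1 - eps) / M)

mean_error = P.sum() - np.trace(P)    # Σ_{ξ≠ζ̃} P(ξ, ζ̃), cf. (7.1.11)
assert abs(mean_error - eps) < 1e-12

# At ε = 0 the joint becomes P(ξ, ζ) = δ_{ξζ}/M; both marginals are uniform 1/M,
# so p/(p_ξ p_ζ) = M on the diagonal and the mutual information is ln M.
P0 = np.eye(M) / M
mask = P0 > 0
I0 = (P0[mask] * np.log(P0[mask] * M * M)).sum()
assert abs(I0 - np.log(M)) < 1e-12    # I_{ξζ} = ln M, cf. (11.5.20)
```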
Chapter 12
Information theory and the second law of
thermodynamics
In this chapter, we discuss a relation between the concept of the amount of informa-
tion and that of physical entropy. As is well known, the latter allows us to express
quantitatively the second law of thermodynamics, which forbids, in an isolated system,
the existence of processes accompanied by a decrease of entropy. If there
exists an influx of information dI about the system, i.e. if the physical system is
isolated only thermally, but not informationally, then the above law should be gen-
eralized by substituting inequality dH ≥ 0 with inequality dH + dI ≥ 0. Therefore,
if there is an influx of information, then the thermal energy of the system can be con-
verted (without the help of a refrigerator) into mechanical energy. In other words,
the existence of perpetual motion of the second kind powered by information be-
comes possible.
In Sections 12.1 and 12.2, the process of transformation of thermal energy into
mechanical energy using information is analysed quantitatively. Furthermore, a par-
ticular mechanism allowing such conversion is described. This mechanism consists
in installing impenetrable walls and in moving them in a special way inside the
physical system. Thereby, the well-known semi-qualitative arguments related to this
question and, for instance, contained in the book of Brillouin [4] (the corresponding
book in English is [5]) acquire an exact quantitative confirmation.
The generalization of the second law of thermodynamics by no means cancels
its initial formulation. Hence, in Section 12.3 the conclusion is made that, in prac-
tice, expenditure of energy is necessary for measuring the coordinates of a physical
system and for recording this information. If the system is at temperature T, then
in order to receive and record the amount of information I about this system, it is
necessary to spend energy of at least TI. Otherwise, the combination of an automatic
measuring device and an information converter of thermal energy into mechanical
one would result in perpetual motion of the second kind. The above general rule
is corroborated for a particular model of measuring device, which is described in
Section 12.3.
The conclusion about the necessity of minimal energy expenditure is also ex-
tended to noisy physical channels corresponding to a given temperature T (Sec-
tion 12.5). Hence, the second law of thermodynamics imposes some constraints on
© Springer Nature Switzerland AG 2020 391
R. V. Belavkin et al. (eds.), Theory of Information and its Value,
https://doi.org/10.1007/978-3-030-22833-0 12
where

F = −T ln ∫ e^{−E(x)/T} dx    (12.1.2)
is the free energy of the system. The temperature T is measured in energy units, for
which the Boltzmann constant is equal to 1.
It is convenient to assume that we have a thermostat at temperature T , and the dis-
tribution mentioned above reaches its steady state as a result of a protracted contact
with the thermostat.
Within the framework of the general theory of the value of information, distribution (12.1.1)
is a special case of the probability distribution appearing in the definition of a Bayesian
system. Certainly, the general results obtained for arbitrary Bayesian systems in
Chapters 9 and 10 can be extended naturally to this case. Besides, some special phe-
nomena related to the Second law of thermodynamics can be investigated, because
the system under consideration is a physical system. Here we are interested in a pos-
sibility of transforming thermal energy into mechanical energy, which is facilitated
by the inflow of information about the coordinate x.
In defining the values of Hartley’s and Boltzmann’s information amounts (Sec-
tions 9.2 and 9.6) we assumed that incoming information about the value of x has a
simple form. It indicates what region Ek from the specified partition ∑k Ek = X of the
sample space X point x belongs to. This information is equivalent to an indication
of the index of region Ek . Let us show that such information does indeed facilitate
the transfer of thermal energy into mechanical energy. When specifying the region
Ek the a priori distribution (12.1.1) is transformed into a posteriori distribution
12.1 The generalized second law of thermodynamics 393
p(x | Ek) = exp{[F(Ek) − E(x)]/T}   for x ∈ Ek,
p(x | Ek) = 0                       for x ∉ Ek,    (12.1.3)
where

F(Ek) = −T ln ∫_{Ek} e^{−E(x)/T} dx    (12.1.4)
is a conditional free energy. Because it is known that x lies within the region Ek ,
this region can be surrounded by impenetrable walls, and the energy function E(x)
is replaced by the function
E(x | k) = E(x)   if x ∈ Ek,
E(x | k) = ∞      if x ∉ Ek.    (12.1.5)
Because the walls are moved apart slowly, the energy transfer occurs without
changing the temperature of the system. This is the result of the influx of thermal
energy from the thermostat, the contact with which must not be interrupted. The
source of the mechanical energy leaving the system will then be the thermal energy
of the thermostat, which is converted into mechanical work. In order to calculate
the total work Ak, it is necessary to sum
the differentials (12.1.6). When the walls are moved to infinity, the region Ek coin-
cides with entire space X, and also the free energy F(Ek ) coincides with (12.1.2).
Therefore, the total mechanical energy is equal to the difference between the free
energies (12.1.2) and (12.1.4)
Ak = F(Ek ) − F. (12.1.7)
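Since (12.1.2) and (12.1.4) give F(Ek) − F = −T ln P(Ek), the mean of the works (12.1.7) equals T times the Boltzmann entropy of the partition. The following numerical sketch checks this (the energy profile E(x), the temperature, and the partition are arbitrary illustrative choices):

```python
import numpy as np

T = 1.3                                     # temperature in energy units (k_B = 1)
E = 2.0 * np.sin(3 * np.linspace(0, 1, 10000)) ** 2   # illustrative energy levels E(x)
w = np.exp(-E / T)                          # Boltzmann weights

Z = w.sum()
F = -T * np.log(Z)                          # free energy, discrete analogue of (12.1.2)

M = 4                                       # number of regions E_k
regions = np.array_split(np.arange(E.size), M)
P = np.array([w[r].sum() / Z for r in regions])            # a priori P(E_k)
Fk = np.array([-T * np.log(w[r].sum()) for r in regions])  # conditional free energy (12.1.4)
A = Fk - F                                  # work extracted for each region, (12.1.7)

H_E = -(P * np.log(P)).sum()                # Boltzmann amount of information H_{E_k}
assert abs((P * A).sum() - T * H_E) < 1e-9  # mean work = T * H_{E_k}
```

The assertion holds for any energy profile and any partition, since A_k = −T ln P(E_k) identically.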
Supplementing the formula derived above with an inequality sign, to allow for a
non-equilibrium process (one occurring not infinitely slowly), we have

A ≤ T HEk .    (12.1.10)
Thus, we have obtained that the maximum amount of thermal energy turning into
work is equal to the product of the absolute temperature and the Boltzmann amount
of incoming information. The influx of information about the physical system fa-
cilitates the conversion of thermal energy into work without transferring part of the
energy to the refrigerator. The assertion of the Second law of thermodynamics about
the impossibility of such a process is valid only in the absence of information in-
flux. If there is an information influx dI, then the standard form of the second law
permitting only those processes, for which the total entropy of an isolated system
does not decrease:
dH ≥ 0,    (12.1.11)
becomes insufficient. Constraint (12.1.11) has to be replaced by the condition that
the sum of entropy and information does not decrease:
dH + dI ≥ 0.    (12.1.12)
In the above process of converting heat into work, there was the information inflow
ΔI = HEk. The entropy of the thermostat decreased by A/T, i.e. ΔH = −A/T, while
the energy of the system did not change. Consequently, condition (12.1.12)
for the specified process is valid with the equality sign. The equality sign is based on
the ideal nature of the process, which was specially constructed. If the walls encom-
passed a larger region instead of Ek or if their motion apart were not infinitely slow,
then there would be an inequality sign in condition (12.1.12). Also, the obtained
amount of work would be less than (12.1.9a). In view of (12.1.12), it is impossible
to produce from heat an amount of work larger than (12.1.9a).
The idea about the generalization of the Second law of thermodynamics to the
case of systems with an influx of information appeared long ago in connection with
12.2 Influx of Shannon’s information and transformation of heat into work 395
dA = dQ − dU, (12.1.14)
where U = E[E(x)] is the internal energy of the system, which is related to the free
energy F via the famous equation U = F + T Hx . Differentiating the latter, we have
dF = dU − T dHx . (12.1.15)
In this case, the second law (12.1.11) has the form dHT + dHx ≥ 0, i.e. T dHx − dQ ≥ 0,
which, on the basis of (12.1.14), (12.1.15), is equivalent to

dA ≤ −dF.    (12.1.16)
Taking this relation with the equality sign (which corresponds to an ideal process),
we obtain the first relation (12.1.6).
All that has been said above about converting thermal energy into mechanical en-
ergy due to an influx of information, can be applied to the case, in which we have
information that is more complex than just specifying the region Ek , to which x
belongs. We shall have such a case if we make errors in specifying the region Ek .
Suppose that k̃ is the index of the region referred to with a possible error, and k is
the true number of the region containing x. In this case, the amount of incoming
information is determined by Shannon's formula I = HEk − HEk|k̃ . This amount
of information is less than the entropy HEk considered in the previous section.
Further, in a more general case, information about the value of x can come not
in the form of an index of a region, but in the form of some other random vari-
able y connected with x statistically. In this case, the amount of information is also
determined by Shannon’s formula [see (6.2.1)]:
I = Hx − Hx|y . (12.2.1)
The posterior distribution p(x | y) will now have a more complicated form
than (12.1.3). Nevertheless, the generalized second law of thermodynamics will
have the same representation (12.1.12), if I is understood as Shannon’s amount of
information (12.2.1). Now formula (12.1.10) can be replaced by the following
A ≤ T I.    (12.2.2)
In order to verify this, one should consider an infinitely slow isothermal transition
from the state corresponding to the a posteriori distribution p(x | y), having
entropy Hx (| y), to the initial (a priori) state with the given distribution p(x). This
transition has to be carried out in compliance with the second law of thermodynam-
ics (12.1.11), i.e. according to formulae (12.1.13), (12.1.16). Summing up elemen-
tary works (12.1.16) we obtain that every found value y corresponds to the work
Ay ≤ −F + F(y).    (12.2.3)
A ≤ T Hx − T Hx|y ,    (12.2.5)
Assume that we have received a message that y = 1, i.e. x is situated in the left half:
x ∈ [0, V/2]. This corresponds to the following posterior probability density:

p(x | y = 1) = 2(1 − p)/V   for 0 ≤ x ≤ V/2,
p(x | y = 1) = 2p/V         for V/2 < x ≤ V.    (12.2.6)
We install a wall at the point x = z0 = V /2, which we then move slowly. In order to
find the forces acting on the wall we calculate the free energy for every location z of
the wall. Since E(x) ≡ 0, the calculation of free energy is reduced to a computation
of entropy. If the wall has been moved from point x = V /2 to point x = z, then
probability density (12.2.6) should be replaced by the probability density

p(x | z, 1) = (1 − p)/z    for 0 ≤ x < z,
p(x | z, 1) = p/(V − z)    for z < x ≤ V,    (12.2.7)

with entropy

Hx (| 1, z) = (1 − p) ln[z/(1 − p)] + p ln[(V − z)/p]

and free energy

F(1, z) = −T (1 − p) ln[z/(1 − p)] − T p ln[(V − z)/p].    (12.2.8)
Differentiating with respect to z, we find the force acting on the wall
−∂F(1, z)/∂z = (1 − p) T/z − p T/(V − z).    (12.2.9)
If the coordinate of x were located on the left, then the acting force would be equal
to T /z (by analogy with the formula for the pressure of an ideal gas, z plays the
role of a volume); if x were on the right, then the force would be −T /(V − z).
Formula (12.2.9) gives the posterior expectation of these forces, because 1 − p is
the posterior probability of the inequality x < V /2.
The work of the force in (12.2.9) on the interval [z0 , z1 ] can be calculated by
taking the difference of potentials (12.2.8):
The initial position of the wall, as was mentioned above, is in the middle (z0 = V /2).
The final position is such that the probability density (12.2.7) becomes equilibrium.
This yields
A similar result takes place for the second message y = 2. In this case we should
move the wall to the other part of the interval. The mean work A is determined by
the same expression (12.2.11). A comparison of this formula with (12.2.5a) shows
that relation (12.2.2) is valid with the equality sign.
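The equality A = T I in this example can be replayed numerically: the work computed as the difference of potentials (12.2.8) between the initial wall position z₀ = V/2 and the final position z₁ = (1 − p)V (the position at which the density (12.2.7) becomes uniform, i.e. equilibrium) coincides with T times the Shannon information of the binary message. A sketch, with illustrative parameter values:

```python
import numpy as np

T, V, p = 2.0, 1.0, 0.1        # temperature, box size, error probability (illustrative)

def h(p):
    """Binary entropy in nats."""
    return -p * np.log(p) - (1 - p) * np.log(1 - p)

def F(z):
    """Free energy (12.2.8) for the message y = 1, wall at position z."""
    return -T * ((1 - p) * np.log(z / (1 - p)) + p * np.log((V - z) / p))

z0 = V / 2                     # initial wall position
z1 = (1 - p) * V               # final position: density (12.2.7) becomes uniform
A = F(z0) - F(z1)              # work of the force (12.2.9) = difference of potentials

I = np.log(2) - h(p)           # Shannon information of the binary message, cf. (12.2.1)
assert abs(A - T * I) < 1e-12  # relation (12.2.2) holds with the equality sign
```

For p → 0 (error-free message) the work approaches T ln 2, the familiar Szilard-engine value; for p = 1/2 the message carries no information and no work is extracted.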
In conclusion, we have to point out that the condition of equilibrium of the initial
probability density p(x) [see (12.1.1)] is optional. If the initial state p(x) of a
physical system is non-equilibrium, then we should instantly (while the coordinates
x have not yet changed) ‘take into action’ some new Hamiltonian Ẽ(x), which differs
from the original E(x) in such a way that the corresponding probability density is
equilibrium. After that, we can apply all of our previous reasoning. When conversion of
thermal energy into work is finished, we must instantly ‘turn off’ Ẽ(x) by proceed-
ing to E(x). Expenditure of energy for ‘turning on’ and ‘turning off’ the Hamiltonian
Ẽ(x) compensate each other. Thus, the informational generalization of the Second
law of thermodynamics does not depend on the condition that the initial state be
equilibrium. This generalization can be stated as follows:
12.3 Energy costs of creating and recording information. An example 399
Here thermal isolation means that the thermal energy that can be converted into
work is taken from the system S itself, i.e. the thermostat is included into S.
As is well known, the second law of thermodynamics is asymptotic and not quite
exact. It is violated for processes related to thermal fluctuations. In view of this, we
can give an adjusted (relaxed) formulation of this law: in a heat-insulated system
we cannot observe processes for which the entropy increment is

ΔH ≪ −1.    (12.2.12)
In the previous sections the random variable y carrying information about the coor-
dinate (or coordinates) x of a physical system was assumed to be known beforehand
and related statistically (correlated) with x. The problem of physical realization of
such an information carrying random variable must be considered specially. It is
natural to treat this variable as one (or several) coordinate (coordinates) out of a set
of coordinates (i.e. dynamic variables) of some other physical system S0 , which we
shall refer to as the ‘measuring device’. The Second law of thermodynamics implies
some assertions about the physical procedure for creating an information signal
y statistically related to x. These assertions concern the energy costs necessary to
create y. They are, in some way, the converse of the statements given in Sections 12.1
and 12.2.
The fact is that the physical system S (with a thermostat) discussed above, which
‘converts information and heat into work’, and the measuring instrument S0 (acting
automatically) that creates the information can be combined into one system, and then
the Second law of thermodynamics can be applied to it. Information will no longer
be entering the combined system, and thereby inequality (12.1.12) will turn into its
usual form (12.1.11) for it. If the measuring instrument or a thermostat (with which
the instrument is in contact) has the same temperature T as the physical system with
coordinate x, then the overall conversion of heat into work will be absent according
to the usual Second law of thermodynamics. This means that mechanical energy of
type (12.1.10) or (12.2.5) must be converted into the heat released in the measuring
instrument. It follows that every measuring instrument at temperature T must convert
into heat an amount of energy no less than T I in order to create the amount of
information I about the coordinate of a physical system.
Let us first check this inherent property of any physical instrument on one simple
model of a real measuring instrument. Let us construct a meter such that measur-
ing the coordinate x of a physical system does not influence the behaviour of this
system. It is convenient to consider a two-dimensional or even a one-dimensional
model. Let x be the coordinate (or coordinates) of the centre of a small metal ball
that moves without friction inside a tube made of insulating material (or between
parallel plates). Pairs of electrodes (metal plates across which a potential difference
is applied) are embedded flush into the ground surface of the insulator. When the ball
takes a certain position, it connects a pair of electrodes and thus closes the circuit
(Figure 12.1). Watching the current, we can determine the position of the ball. Thus,
the space of values of x is divided into regions E1 , . . . , EM corresponding to the size
of the electrodes. If the number of electrode pairs is equal to M = 2ⁿ, i.e. an
integer power of 2, then it is convenient to select n current sources and to connect
the electrodes in such a way that the presence of current (or its absence)
from one source gives one bit of information about the index of the regions E1, . . . , EM.
An example of such a connection for n = 3 is shown in Figures 12.1 and 12.2. When
the circuit corresponding to one bit of information is closed, the resulting current
reverses the magnetization of the magnetic core that is placed inside the induction coil
and plays the role of a memory cell. If the current is absent, the magnetization of
the core is not changed. As a result, the index of the region Ek is recorded on
three magnetic cores in binary code.
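The recording scheme just described can be sketched in a few lines (a minimal illustration; the function name is ours): each of the n current sources writes one binary digit of the region index onto its magnetic core, current present meaning 1 (magnetization reversed) and current absent meaning 0.

```python
n = 3
M = 2 ** n                     # number of regions E_1, ..., E_M

def record(k):
    """n-bit binary record of the region index k (0-based), most significant bit first."""
    return [(k >> j) & 1 for j in reversed(range(n))]

# Every region index has a unique n-bit record, and the record determines the index.
for k in range(M):
    bits = record(k)
    assert len(bits) == n
    assert sum(b << j for j, b in zip(reversed(range(n)), bits)) == k

print(record(5))               # region E_6 (index 5) -> [1, 0, 1]
```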
The sizes of the regions Ek (i.e. of the plates) can be selected for optimality
considerations according to the theory of the value of information. If the goal is to
produce maximum work (12.1.9a), then the regions Ek have to be selected such that
their probabilities are equal
In this case, Boltzmann’s and Hartley’s information amounts will coincide, HEk =
ln M, and formula (12.1.9a) gives the maximum mechanical energy equal to T ln M.
The logarithm ln M of the number of regions can be called the limit information
capacity of the measuring instrument.
In reality, the amount of information produced by the measuring instrument turns
out to be smaller than ln M as a result of errors arising in the instrument. A source of
these errors is thermal fluctuations, in particular fluctuations of the current flowing
through the coils Lr . We suppose that the temperature T of the measuring device is
given (T is measured in energy units). According to the laws of statistical physics,
the mean energy of fluctuation current ifl flowing through the induction coil L is
determined by the formula
(1/2) L E[i²fl] = (1/2) T.    (12.3.2)
Thus, it is proportional to the absolute temperature T . The useful current ius from
the emf source Er (Figures 12.1 and 12.2) is added to the fluctuational current ifl .
It is the average energy of the useful current Li2us /2 that constitutes in this case the
energy costs that have been mentioned above and are inherent in any measuring
instrument. Let us find the connection between the energy costs and the amount of
information, taking into account the fluctuation current ifl.
For a given useful current ius, the total current i = ifl + ius has a Gaussian distribution
with the variance found from (12.3.2), that is,

p(i | ius) = √(L/2πT) e^{−L(i−ius)²/2T}.    (12.3.3)
For M = 2ⁿ, the useful current [if (12.3.1) is satisfied] is equal to 0 with probability
1/2, or to some value i1 with probability 1/2. Hence,

p(i) = (1/2) √(L/2πT) [e^{−Li²/2T} + e^{−L(i−i1)²/2T}].    (12.3.4)
Let us calculate the mutual information Ii,ius . Taking into account (12.3.3), (12.3.4)
we have
ln[p(i | 0)/p(i)] = −ln{ (1/2) [1 + e^{(Lii1/T) − (Li1²/2T)}] },
ln[p(i | i1)/p(i)] = −ln{ (1/2) [e^{−(Lii1/T) + (Li1²/2T)} + 1] }.
The first expression has to be averaged with the weight p(i | 0), and the second with
the weight p(i | i1). Introducing the variable ξ = (i1/2 − i)√(L/T) (and ξ = (i −
i1/2)√(L/T) for the second integral), we obtain the same expression for both integrals:

∫ p(i | 0) ln[p(i | 0)/p(i)] di = ∫ p(i | i1) ln[p(i | i1)/p(i)] di =
= −(1/√(2π)) ∫_{−∞}^{∞} e^{−(ξ−η)²/2} ln[(1/2) + (1/2) e^{−2ξη}] dξ,

where η = (i1/2)√(L/T).
Therefore, the information in question is equal to the same expression. Let us
bring it into the following form:

Ii,ius = η² − (1/√(2π)) ∫_{−∞}^{∞} e^{−(ξ−η)²/2} ln cosh(ξη) dξ.

The second term is obviously negative, because cosh ξη ≥ 1, ln cosh ξη ≥ 0 and,
consequently,

Ii,ius ≤ η² = L i1²/4T.    (12.3.5)
However, the value L i1²/4 is nothing but the mean energy of the useful current:

E[(1/2) L i²us] = (1/2) · (1/2) L i1² + (1/2) · 0 = (1/4) L i1².
Thus, we have just proven that in order to obtain the information Ii,ius , the mean
energy expended on the useful current must be no less than T Ii,ius . All that has been said above
refers only to one coil. Summing over different circuits shows that to obtain total
information I = ∑nr=1 (Ii,ius )r , it is necessary to spend, on the average, the amount of
energy that is not less than T I. Consequently, for the specified measuring instrument
the above statement about incurred energy costs necessary for receiving information
is confirmed. Taking into account thermal fluctuations in the other elements of
the instrument makes the inequality A > T I even stronger.
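The bound (12.3.5) can be checked by evaluating the integral above numerically. The sketch below uses illustrative values of L, T, i1; the mutual information comes out strictly positive and strictly below the energy bound, reflecting the loss caused by the thermal fluctuation current:

```python
import numpy as np

L, T, i1 = 1.0, 1.0, 2.0            # inductance, temperature, useful current (illustrative)
eta = (i1 / 2) * np.sqrt(L / T)

# I_{i,i_us} = eta^2 - E[ln cosh(xi * eta)], where xi ~ N(eta, 1)
xi = np.linspace(eta - 10, eta + 10, 200001)
gauss = np.exp(-0.5 * (xi - eta) ** 2) / np.sqrt(2 * np.pi)
dxi = xi[1] - xi[0]
I = eta**2 - (gauss * np.log(np.cosh(xi * eta))).sum() * dxi

E_us = L * i1**2 / 4                # mean energy of the useful current
assert 0 < I < E_us / T             # bound (12.3.5): I_{i,i_us} <= L i1^2 / (4T)
```

Increasing i1 (spending more useful-current energy) raises the information, but never above E_us/T, in agreement with the general statement A ≥ T I.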
12.4 Energy costs of creating and recording information. General formulation 403
In the example considered above, the information coordinate y was given by the
currents i flowing through the coils L1 , L2 , . . . (Figures 12.1 and 12.2). In order to
deal with more stable informational variables, better preserved in time, it is reasonable
to place magnets inside the coils, which become magnetized when the current flows.
Then the variables y will be represented by the corresponding magnetizations m1,
m2, . . . . Because the magnetization mr is a function of the initial magnetization mr0
(independent of x) and the current i (m = f(m0, i)), it is clear that the amount
of information Imx will not exceed the amount Iix + Im0x = Iix, that is, the amount Iix
considered earlier. Thus, the inequality Iix ≤ A/T (12.3.5) can only become stronger
under the transition to Imx.
The case of the informational signal y represented by magnetization degrees m1 ,
m2 , . . . of recording magnetic cores is quite similar to the case of recording of the
information onto a magnetic tape, where the number of ‘elementary magnets’ is
rather large. One can see from the example above that the process of information
‘creation’ by a physical measuring device is inseparable from the process of physically
recording the information. The inequality IT ≤ A that was checked above for
a particular example can be proven from general thermodynamic considerations, if
a set of general and precise definitions is introduced.
Let x be a subset of variables ξ of a dynamical system S, and y be a subset of
coordinates η of system S0 , referring to the same instant of time. We call a physical
process associated with systems S and S0 interacting with each other and, perhaps,
with other systems, a normal physical recording of information, if its initial state,
characterized by a multiplicative joint distribution p1(ξ)p2(η), is transformed into a
final state with joint distribution p(ξ, η) having the same marginal distributions p1(ξ) and
p2(η). Prior to and after recording the information the systems S, S0 are assumed to
be non-interacting.
We can now state a general formulation for the main assertion.
Theorem 12.2. If a normal physical recording of information is carried out in con-
tact with a thermostat at temperature T , then the following energy consumption and
energy transfer to the thermostat (in the form of heat) are necessary:
A ≥ IT,    (12.4.1)
where
I = Hx + Hy − Hxy (12.4.2)
is Shannon’s amount of information.
Proof. Let us denote by H+ the entropy of the combined system S + S0 , while HT
will denote the entropy of the thermostat. Applying the Second law of thermody-
namics to the process of information recording we have
ΔH+ + ΔHT ≥ 0.    (12.4.3)
Δ H+ = Hξ η − Hξ − Hη = −Iξ η . (12.4.4)
Thus, the thermostat has received entropy ΔHT ≥ Iξη , and, consequently, the transferred
thermal energy is A ≥ T Iξη . Where has it come from? According to the
conditions of the theorem, there is no interaction between systems S, S0 both in the
beginning and in the end, thereby the mean total energy U+ is the sum of the mean
partial energies: U+ = E[E1 (ξ )] + E[E2 (η )]. They too remain invariant, because the
marginal distributions p1 (ξ ) and p2 (η ) do not change. Hence, Δ U+ = 0 and, thus,
the energy A must come from some external non-thermal energy sources during the
process of information recording. In order to obtain (12.4.1), it only remains to take
into account the inequality Ixy ≤ Iξη . The proof is complete.
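The entropy bookkeeping in the proof, ΔH+ = −Iξη of formula (12.4.4), can be replayed numerically: for any final joint distribution sharing the marginals of the initial product distribution, the entropy change of S + S0 equals minus the Shannon information. A sketch with an illustrative pair of marginals and an illustrative correlated final state:

```python
import numpy as np

def entropy(p):
    """Entropy (in nats) of a discrete distribution given as an array."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

# Initial state: independent joint p1(ξ) p2(η)
p1 = np.array([0.5, 0.3, 0.2])
p2 = np.array([0.6, 0.4])
initial = np.outer(p1, p2)

# Final state: a correlated joint with the SAME marginals
# (a "normal physical recording" in the sense of the definition above)
final = np.array([[0.45, 0.05],
                  [0.10, 0.20],
                  [0.05, 0.15]])
assert np.allclose(final.sum(axis=1), p1)
assert np.allclose(final.sum(axis=0), p2)

I = entropy(p1) + entropy(p2) - entropy(final.ravel())     # Shannon information (12.4.2)
dH_plus = entropy(final.ravel()) - entropy(initial.ravel())
assert abs(dH_plus + I) < 1e-12     # ΔH+ = −I_{ξη}, formula (12.4.4)
assert I > 0
```

Since the mean partial energies are fixed by the marginals, the entropy the thermostat must absorb, and hence the external energy A ≥ T I, follows exactly as in the proof.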
In conclusion of this section we consider one more example of information recording,
of a kind completely different from the example of Section 12.3. A comparison
of these two examples shows that the sources of external energy can be of a
completely different nature.
Suppose that we need to receive and record information about the fluctuating
coordinate x of an arrow rotating about an axis and supported near equilibrium by a
spring Π (Figure 12.3). A positively charged ball is placed at the end of the arrow.
Fig. 12.3 An example of creating and recording information by means of moving together and
apart the arrows that interact with the springs Π
the coordinates x, y of the arrows do not have time to change during the process. The
state we have after this approach is non-equilibrium. Then it transforms into an
equilibrium state, in which correlation between the coordinates x, y of the arrows
is established. It is convenient to assume that the attraction force between the balls
is much stronger than the returning forces of springs, so that the correlation will
be very strong. The transition to the equilibrium is accompanied by the reduction
of the average interaction energy of the balls (‘descent into a potential well’) and
by giving some thermal energy to the thermostat (to the environment). After the
equilibrium (correlated) distribution p(x, y) is established, we move the arrows apart
quickly (both movements close and apart are performed along the axis of rotation
and do not involve the rotation coordinate). In so doing, we spend the work A2 ,
which is, obviously, greater than the work A1 we had before, since the absolute
value of the difference |x − y| has become smaller on the average. The marginal
distributions p(x) and p(y) are nearly the same if the mean E[(x − y)2 ] is small. As
thermodynamic analysis (similar to the one provided in the proof of Theorem 12.2)
shows, A2 − A1 = A ≥ T I in this example.
After moving the arrows apart, there is a correlation between them, but no force
interaction. The process of ‘recording’ information is complete. Of course, the ob-
tained recording can be converted into a different form, say, it can be recorded on a
magnetic tape. In so doing, the amount of information can only be decreased.
In this example, the work necessary for information recording is done by a human
or a device moving the arrows together or apart. According to the general theory,
such work must be expended wherever there is a creation of new information, cre-
ation of new correlations, and not simply reprocessing the old ones. The general
theory in this case only points to a theoretical lower bound on these expenditures.
In practice the expenditure of energy can obviously exceed, and even considerably
exceed, this thermodynamic level. A comparison of the actual expenditure with the
minimum theoretical value allows us to judge the energy efficiency of real devices.
Energy costs are necessary not only for creation and recording of information, but
also for its transmission, if the latter occurs in the presence of fluctuation distur-
bances, for instance, thermal ones. As is known from statistical physics, in linear
systems there is certain mean equilibrium fluctuational energy for every degree of
freedom. This energy equals Tfl /2, where Tfl is the environment (thermostat) tem-
perature. In a number of works (by Brillouin [4] (the corresponding English book is
[5]) and others), researchers came up with an idea that in order to transmit 1 Nat of
information under these conditions it is necessary to have energy at least Tfl (we use
energy units for temperature, so that the Boltzmann constant is equal to 1). In this
section we shall try to make this statement more exact and to prove it.
Let us call a channel described by transition probabilities p(y | x) and the cost
function c(x) a physical channel, if the variable y has the meaning of a complete
set of dynamical variables of some physical system S. The Hamilton function (en-
ergy) of the system is denoted by E(y) (it is non-negative). Taking this function into
account, we can apply standard formulae to calculate the equilibrium potential
Γ (β ) = −β F = ln ∫ e−β E(y) dy (12.5.1)
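For a concrete (hypothetical) one-dimensional system with E(y) = y²/2, the equilibrium potential (12.5.1) can be evaluated by direct quadrature and compared with the Gaussian-integral closed form Γ(β) = ½ ln(2π/β). This sketch is illustrative only; the integration grid and the value of β are arbitrary.

```python
import numpy as np

# Hypothetical system: one degree of freedom with E(y) = y^2 / 2 (units with k_B = 1).
beta = 2.0
y = np.linspace(-50.0, 50.0, 200_001)           # integration grid (arbitrary)
dy = y[1] - y[0]

Z = np.sum(np.exp(-beta * 0.5 * y**2)) * dy     # \int e^{-beta E(y)} dy
Gamma = np.log(Z)                               # equilibrium potential (12.5.1)
F = -Gamma / beta                               # free energy, since Gamma = -beta F

closed_form = 0.5 * np.log(2.0 * np.pi / beta)  # Gaussian integral: sqrt(2 pi / beta)
print(Gamma, closed_form)  # agree to quadrature accuracy
```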
Theorem 12.3. The capacity (see Section 8.1) of a physical channel [p(y | x), c(x)]
satisfies the inequality
Tfl C(a) ≤ E[E(y)] − afl , (12.5.3)
where the level afl and ‘fluctuation temperature’ Tfl are defined by the equations
Tfl−1 = (dH/dR)(afl ); H(afl ) = Hy|x . (12.5.4)
The mean energy E[E(y)] and the conditional entropy Hy|x are calculated using the
extremum probability density p0 (x) (which is assumed to exist) realizing the channel
capacity. It is also assumed that the second equation in (12.5.4) has a root afl belonging
to the normal branch, where Tfl > 0.
Proof. Formulae (12.5.1), (12.5.2) emerge in the solution to the first variational
problem (for instance, see Section 3.6)—the conditional extremum problem for en-
tropy Hy with the constraint E[E(y)] = A. Therefore, the following inequality holds:
Hy ≤ H(E[E(y)]) (12.5.5)
(the level E[E(y)] is fixed). Further, as follows from the general theory (see the
Corollary to Theorem 4.1), the function H(R) is concave. Consequently, its derivative
β (R) = dH(R)/dR (12.5.6)
is a non-increasing function of R.
The channel capacity coincides with Shannon’s amount of information
for the extremum distribution p0 (x), which takes this amount to a conditional maximum. From (12.5.5) and the usual inequality Hy ≥ Hy|x we find that
Because afl is a root of the equation Hy|x = H(afl ) in the normal branch of the concave function H(·), where the derivative is non-negative, regardless of which branch the value E[E(y)] belongs to, equation (12.5.4) implies the inequality afl ≤ E[E(y)].
Taking this inequality into account together with the non-increasing nature of the
derivative (12.5.6) we obtain that
y = x+ζ. (12.5.10)
Hy|x = Hζ . (12.5.12)
We recall that the first variational problem, a solution to which is either the afore-
mentioned function H(R) or the inverse function R(H), can be interpreted (as is
known from Section 3.2) as the problem of minimizing the mean energy E[E] with
fixed entropy. Thus, R(Hζ ) is the minimum value of energy possible for fixed en-
tropy Hζ , i.e.
E[E(ζ )] ≥ R(Hζ ). (12.5.13)
However, the value R(Hζ ) due to (12.5.4), (12.5.12) is nothing but afl , and there-
fore (12.5.13) can be rewritten as follows
E[E(ζ )] ≥ afl . (12.5.14)
From (12.5.11) and (12.5.14) we obtain E[E(y)] − afl ≤ E[E(x)]. This inequality
allows us to transform the basic derived inequality (12.5.3) to the form Tfl C(a) ≤
E[E(x)] or

Tfl C(a) ≤ a, (12.5.15)
if the cost function c(x) coincides with the energy E(x).
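For the Gaussian physical channel the inequality (12.5.15) can be verified directly. The sketch below assumes the standard Shannon capacity formula for y = x + ζ with E(y) = y²/2: equilibrium thermal noise at temperature Tfl has E[ζ²] = Tfl, the constraint E[E(x)] = a gives E[x²] = 2a, and hence C(a) = ½ ln(1 + 2a/Tfl) nats per use. The temperature and energy grid are arbitrary.

```python
import numpy as np

T_fl = 1.5                              # thermostat temperature (energy units, k_B = 1)
a = np.linspace(1e-3, 10.0, 1000)       # admissible mean signal energy E[E(x)] = a

# Shannon capacity (in nats) of the additive Gaussian channel with SNR = 2a / T_fl.
C = 0.5 * np.log1p(2.0 * a / T_fl)

# Inequality (12.5.15): T_fl * C(a) <= a, i.e. at least T_fl of energy per nat.
ratio = T_fl * C / a
print(ratio.max())  # below 1; it approaches 1 only in the weak-signal limit a -> 0
```

Since ln(1 + u) ≤ u, the bound Tfl C(a) ≤ a holds for every a, with near-equality only for vanishingly weak signals, in agreement with the discussion of Brillouin's estimate above.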
The results given here concerning physical channels are closely connected to
Theorem 8.5. In this theorem, however, the ‘temperature parameter’ 1/β has a for-
mal mathematical meaning. In order for this parameter to have the meaning of phys-
ical temperature, the costs c(x) or b(y) have to be specified as physical energy.
According to inequality (12.5.15), in order to transmit one nat of informa-
tion through a Gaussian physical channel we need energy that is not less than
Tfl . It should be noted that we could not derive any universal inequality of the
type (12.5.15) containing a real (rather than effective) temperature.
Appendix A
Some matrix (operator) identities
Suppose we have two arbitrary, not necessarily square, matrices A and B such that
the matrix products AB and BA are defined. In operator language, this means the
following: if A maps an element of space X into an element of space Y , then B
maps an element of space Y into an element of space X, i.e. it acts in the opposite
direction. Under broad assumptions about the function f (z), the next formula holds:
Let us prove this formula under the assumption that f can be expressed as the
Taylor series
f (z) = ∑∞n=0 (1/n!) f (n) (0) zn . (A.1.2)

Then

tr f (AB) = tr f (0) + f ′ (0) tr(AB) + (1/2) f ′′ (0) tr(ABAB) + · · · , (A.1.3)
tr f (BA) = tr f (0) + f ′ (0) tr(BA) + (1/2) f ′′ (0) tr(BABA) + · · · . (A.1.4)
However, tr(AB) = tr(BA) = ∑i j Ai j B ji and, therefore, tr(A[(BA)k B]) = tr([(BA)k B]A), i.e. tr[(AB)k+1 ] = tr[(BA)k+1 ].
This is why all the terms in (A.1.3), (A.1.4), apart from the first one, are identical.
In general, the first terms tr( f (0)) are not identical, because the operator f (0) in the
expansion of f (AB) and the same operator in the expansion of f (BA) are multiples
of the identity matrices of different dimensions. However, if the following condition
is met
f (0) = 0 , (A.1.5)
then, consequently,
tr f (AB) = tr f (BA). (A.1.6)
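The identity (A.1.6) is easy to test numerically for non-square A, B, using the fact that tr f(M) equals the sum of f over the eigenvalues of M. The matrices below and the choice f(z) = e^z − 1 (which satisfies the condition f(0) = 0) are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3))   # A maps X = R^3 into Y = R^2
B = rng.standard_normal((3, 2))   # B maps Y = R^2 into X = R^3

def tr_f(M):
    """tr f(M) for f(z) = e^z - 1 (so f(0) = 0), via the eigenvalues of M."""
    lam = np.linalg.eigvals(M)
    return np.sum(np.exp(lam) - 1.0).real

lhs = tr_f(A @ B)   # trace over a 2x2 matrix
rhs = tr_f(B @ A)   # trace over a 3x3 matrix; its extra eigenvalue 0 adds f(0) = 0
print(lhs, rhs)     # equal up to rounding
```

Although AB and BA have different dimensions, their nonzero eigenvalues coincide, and the condition f(0) = 0 annihilates the contribution of the extra zero eigenvalues.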
If we were interested in determinants, then for the corresponding equality
Consider the block matrix

K = [[A, B], [C, D]],

where A, D are square matrices, and B, C are arbitrary matrices. Matrix D is assumed
to be non-singular. Let us denote

L = [[1, 0], [0, D−1]] [[A, B], [C, D]] = [[A, B], [D−1C, 1]], so that K = [[1, 0], [0, D]] L.
However, L = [[1, B], [0, 1]] [[A − BD−1C, 0], [D−1C, 1]], and

det [[1, B], [0, 1]] = 1; det [[A − BD−1C, 0], [D−1C, 1]] = det(A − BD−1C),

so that det K = det D · det(A − BD−1C).
According to the above formulas, the problem of calculating the original deter-
minant is reduced to the problem of calculating determinants of smaller dimension.
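The resulting factorization det K = det D · det(A − BD⁻¹C) can be checked numerically; the block sizes and random entries in this sketch are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3))   # square block
B = rng.standard_normal((3, 2))   # arbitrary block
C = rng.standard_normal((2, 3))   # arbitrary block
D = rng.standard_normal((2, 2))   # square block, assumed non-singular

K = np.block([[A, B],
              [C, D]])            # the full 5x5 block matrix

lhs = np.linalg.det(K)
rhs = np.linalg.det(D) * np.linalg.det(A - B @ np.linalg.inv(D) @ C)
print(lhs, rhs)  # agree up to rounding
```

The 5×5 determinant is thus reduced to a 2×2 and a 3×3 determinant, exactly as the text describes.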
References
17. Grishanin, B.A., Stratonovich, R.L.: The value of information and sufficient
statistics when observing a stochastic process. Izv. USSR Acad. Sci. Tech. Cy-
bern. 6, 4–12 (1966, in Russian)
18. Hartley, R.V.L.: Transmission of information. Bell Syst. Tech. J. 7(3) (1928)
19. Hartley, R.V.L.: Transmission of information (Translation to Russian). In:
A. Harkevich (ed.) Theory of Information and Its Applications. Fizmatgiz,
Moscow (1959)
20. Hill, T.L.: Statistical Mechanics. McGraw-Hill Book Company Inc., New York
(1956)
21. Hill, T.L.: Statistical Mechanics (Translation to Russian). Inostrannaya Liter-
atura, Moscow (1960)
22. Hirsch, M.J., Pardalos, P.M., Murphey, R. (eds.): Dynamics of Information Sys-
tems: Theory and Applications. Springer Optimization and Its Applications Se-
ries, vol. 40. Springer, Berlin (2010)
23. Huffman, D.A.: A method for the construction of minimum redundancy codes.
Proc. IRE 40(9), 1098–1101 (1952)
24. Jahnke, E., Emde, F.: Tables of Functions with Formulae and Curves. Dover
Publications, New York (1945)
25. Jahnke, E., Emde, F.: Tables of Functions with Formulae and Curves (Transla-
tion from German to Russian). Gostekhizdat, Moscow (1949)
26. Kolmogorov, A.N.: Theory of transmission of information. In: USSR Academy
of Sciences Session on Scientific Problems Related to Production Automation.
USSR Academy of Sciences, Moscow (1957, in Russian)
27. Kolmogorov, A.N.: Theory of transmission of information (translation from
Russian). Am. Math. Soc. Translat. Ser. 2(33) (1963)
28. Kraft, L.G.: A device for quantizing, grouping, and coding amplitude-
modulated pulses. Master’s Thesis, Massachusetts Institute of Technology,
Dept. of Electrical Engineering (1949)
29. Kullback, S.: Information Theory and Statistics. Wiley, New York (1959)
30. Kullback, S.: Information Theory and Statistics (Translation to Russian).
Nauka, Moscow (1967)
31. Leontovich, M.A.: Statistical Physics. Gostekhizdat, Moscow (1944, in Rus-
sian)
32. Leontovich, M.A.: Introduction to Thermodynamics. GITTL, Moscow-
Leningrad (1952, in Russian)
33. Pinsker, M.S.: The quantity of information about a Gaussian random stationary
process, contained in a second process connected with it in a stationary manner.
Dokl. Akad. Nauk USSR 99, 213–216 (1954, in Russian)
34. Rao, C.R.: Linear Statistical Inference and Its Applications. Wiley, New York
(1965)
35. Rao, C.R.: Linear Statistical Inference and Its Applications (Translation to Rus-
sian). Inostrannaya Literatura, Moscow (1968)
36. Ryzhik, J.M., Gradstein, I.S.: Tables of Series, Products and Integrals.
Gostekhizdat, Moscow (1951, in Russian)
37. Ryzhik, J.M., Gradstein, I.S.: Tables of Series, Products and Integrals (Transla-
tion from Russian). Academic, New York (1965)
38. Shannon, C.E.: A mathematical theory of communication. Bell Syst. Tech. J.
27 (1948)
39. Shannon, C.E.: Communication in the presence of noise. Proc. IRE 37(1), 10–
21 (1949)
40. Shannon, C.E.: Certain results in coding theory for noisy channels. Inform.
Control 1(1) (1957)
41. Shannon, C.E.: Coding theorems for a discrete source with a fidelity criterion.
IRE Nat. Conv. Rec. 4(1), 142–163 (1959)
42. Shannon, C.E.: Certain results in coding theory for noisy channels (translation
to Russian). In: R.L. Dobrushin, O.B. Lupanov (eds.) Works on Information
Theory and Cybernetics. Inostrannaya Literatura, Moscow (1963)
43. Shannon, C.E.: Coding theorems for a discrete source with a fidelity criterion
(translation to Russian). In: R.L. Dobrushin, O.B. Lupanov (eds.) Works on
Information Theory and Cybernetics. Inostrannaya Literatura, Moscow (1963)
44. Shannon, C.E.: Communication in the presence of noise (translation to Rus-
sian). In: R.L. Dobrushin, O.B. Lupanov (eds.) Works on Information Theory
and Cybernetics. Inostrannaya Literatura, Moscow (1963)
45. Shannon, C.E.: A mathematical theory of communication (translation to Rus-
sian). In: R.L. Dobrushin, O.B. Lupanov (eds.) Works on Information Theory
and Cybernetics. Inostrannaya Literatura, Moscow (1963)
46. Stratonovich, R.L.: On statistics of magnetism in the Ising model (in Russian).
Fizika Tvyordogo Tela 3(10) (1961)
47. Stratonovich, R.L.: On the value of information (in Russian). Izv. USSR Acad.
Sci. Tech. Cybern. 5, 3–12 (1965)
48. Stratonovich, R.L.: Conditional Markov Processes and Their Application to the
Theory of Optimal Control. Moscow State University, Moscow (1966, in Rus-
sian)
49. Stratonovich, R.L.: The value of information when observing a stochastic pro-
cess in systems containing finite automata. Izv. USSR Acad. Sci. Tech. Cybern.
5, 3–13 (1966, in Russian)
50. Stratonovich, R.L.: Amount of information and entropy of segments of station-
ary Gaussian processes. Problemy Peredachi Informacii 3(2), 3–21 (1967, in
Russian)
51. Stratonovich, R.L.: Extremal problems of information theory and dynamic pro-
gramming. Izv. USSR Acad. Sci. Tech. Cybern. 5, 63–77 (1967, in Russian)
52. Stratonovich, R.L.: Conditional Markov Processes and Their Application to
the Theory of Optimal Control (Translation from Russian). Modern Analytic
and Computational Methods in Science and Mathematics. Elsevier, New York
(1968)
53. Stratonovich, R.L.: Theory of Information. Sovetskoe Radio, USSR, Moscow
(1975)
54. Stratonovich, R.L.: Topics in the Theory of Random Noise, vol. 1. Martino Fine
Books, Eastford (2014)
55. Stratonovich, R.L., Grishanin, B.A.: The value of information when a direct
observation of an estimated random variable is impossible. Izv. USSR Acad.
Sci. Tech. Cybern. 3, 3–15 (1966, in Russian)
56. Stratonovich, R.L., Grishanin, B.A.: Game problems with constraints of an in-
formational type. Izv. USSR Acad. Sci. Tech. Cybern. 1, 3–12 (1968, in Rus-
sian)
Index
A Chebyshev’s inequality, 14
α -information, 300 Chernoff’s inequality, 94
active domain, 59, 252, 307 code, 36, 219
additivity principle, 3 optimal, 37
asymptotic equivalence of value of information Shannon’s random, 221
functions, 360 uniquely decodable, 40
asymptotic theorem Kraft’s, 40
first, 92 condition of
second, 225, 227 multiplicativity, 31, 32
third, 356, 360, 367 normalization, 58, 304
average cost, 300 conditional Markov process, 163, 205
B D
Bayesian system, 300 decoding error, 49, 221
Gaussian, 338 distribution
stationary, 346 canonical, 79
bit, 4 extremum, 300
Boltzmann formula, 2
branch E
anomalous, 257, 301 elementary message, 38
normal, 257, 301 encoding of information, 35, 36
block, 36
C online, 35
channel optimal, 35
abstract, 250, 251 entropy, 2, 3
additive, 284 Boltzmann’s, 6
binary, 264, 266 conditional, 8, 30
capacity, 53, 57, 250 continuous random variable, 24, 25
discrete noiseless, 56 end of an interval, 103, 107
Gaussian, 267 maximum value, 6, 28
stationary, 277 properties, 6, 7
physical, 405 random, 5
capacity of, 406 rate, 15, 103, 105, 156
symmetric, 262 entropy density, 157
F N
Fokker–Planck equation, 157, 166 nat, 4
stationary, 159 neg-information, 290
free energy, 62
function P
cost, 56, 296 parameter
cumulant generating, 80 canonical, 78, 79
likelihood, 219 thermodynamic
value of information, 296, 322 conjugate, 72, 79
external, 71
G internal, 71, 78
Gibbs partition function, 62, 74
canonical distribution, 62, 78, 82, 392 potential
theorem, 83 characteristic, 20, 80, 94, 127, 188, 194, 201
conditional, 240
H thermodynamic, 65
Hartley’s formula, 2, 4 probability
final a posteriori, 116
I of error, 219
information average, 221
amount of process
Boltzmann’s, 6, 318 discrete, 104
Hartley’s, 4 Markov, 107
Shannon’s, 173, 178 stationary, 104
capacity, 57 Markov
mutual conditional, 113, 163, 205
conditional, 181 conditional, entropy of, 113
pairwise, 178 diffusion, 157
random, 179 secondary a posteriori, 118
rate, 196 stationary-connected, 196
triple, 185 stationary periodic, 138
Ising model, 69 stochastic point, 144
property of hierarchical additivity, 10, 30, 182
J
Jensen’s inequality, 6 R
Radon–Nikodym derivative, 25
K random flow, 144
Khinchin’s theorem, 225 risk, 300
Kotelnikov’s theorem, 282
S
L sequence of informationally stable Bayesian
Lévy formula, 97 systems, 367
law Shannon’s theorem, 225, 250
conservation of information amount, 35 generalized, 353, 385, 387
of thermodynamics, Second, 392, 398 simple noise, 175
generalized, 392, 399 stability
Legendre transform, 72, 81, 230 canonical, 88
length of record, 37, 42 entropic, 16, 19
sufficient condition, 19
M informational, 226
Markov chain, 108 Stirling’s formula, 150
Markov condition, 115
method of Lagrange multipliers, 58 T
micro state, 2 thermodynamic relation, 62, 64, 254, 309
Index 419
V variational problem
value of information, 289, 291, 296 first, 57, 58
Boltzmann’s, 319 second, 249–251
differential, 291, 292 third, 293, 300, 304
Hartley’s, 297
random, 312 W
Shannon’s, 301, 321 W -process secondary a posteriori, 118